lookishops.blogg.se - Webscraper python sql

WEBSCRAPER PYTHON SQL CODE

Using Python and SQL to Map Wikipedia by Page Category

I hypothesize that the links between the categories of Wikipedia pages will show how the various categories connect, and may reveal origins of fields that are unintuitive. The complete code can be found on GitHub here.

To solve this problem I separate it into two parts. To begin, I create a web scraper that starts either at a page from an SQLite database or at the Wikipedia home page. The home page has a variety of links that change every day, so it is a good place to start to ensure a mix of Wikipedia pages from different categories.

The web scraper iterates through the links on a Wikipedia page and stores them in an SQLite database. Once it has stored all the links on a given page, it moves to the next page in the database. Obviously this can go on ad infinitum, so the pages are limited (to, say, 60) at a time, so we don't anger the internet service provider. The ultimate goal would be a database of all the links in Wikipedia.

The key lines of the scraper are below, with '...' marking lines omitted here:

    conn = sqlite3.connect('wikiLinks.sqlite')
    ...
    # Uncomment the section below if you want to reset the tables
    ...
        id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    ...
    # Helper function to grab all links from the url tree and insert into database
    ...
        cur.execute('''INSERT OR IGNORE INTO Links (link, parent)
    ...
    # Initialise at the main page since it changes every day
    ...
    cur.execute('SELECT id FROM Links WHERE link = ? ', (init_url, ))
    ...
    pages = 20  # get all links from this number of pages
    cur.execute('SELECT COUNT(link) FROM Links ')
    init = int(round(cur.fetchone()[0] / 100.))  # start from database length/100 so we don't repeat
    ...
        cur.execute('SELECT link FROM Links WHERE id = ? ', (init + i, ))
    ...
        tree_body = tree.find_class('mw-content-ltr')  # all the links in the main content are in the class 'mw-content-ltr'
        tree_body = tree_body[0]  # find_class returns a list
        t1 = get_urls(tree_body, parenturl_id)  # grab the urls from the tree

Next, the categories of the links are plotted in a network. A Python script goes through a given number of the Wikipedia links from the SQLite database and returns their category. The script also goes to the link associated with the category and continues, i.e. 'electromagnetism -> subfields of physics -> physics -> physical sciences -> natural sciences -> nature'. Each node of the network represents one webpage, and a connection from one node to another represents a hyperlink on one webpage linking to the other.

    # Helper function to find link of given key in the html tree
    def lookup(d, key):
        found = False  # want to find the tag after the key
        for child in d.iter():  # iterate through html tree
            if not found:  # keep iterating until found is True
                if child.tag == 'a' and (key in child.text):  # 'a' finds the links
                    ...
            ...
                if child.tag == 'a':  # make sure it is a link
                    ...

    for i in range(testpages):  # iterate through links
        cur.execute('SELECT link FROM Links WHERE id = ? ', (startpage + i, ))
        url = '' + cur.fetchone()[0]  # database only contains latter part of link
        ...
        t1 = lookup(tree, 'Categories')  # look for link after 'categories'
        categories = []  # initialise category list
        # while loop to find where category list should end
        while t1.text is not None and 'categor' not in categories and categories not in categories:
            ...
            G.add_edge(categories, categories)  # add to graph if not the first in the list
            ...
            newtree = html.fromstring(page.content)
            ...
        cur.execute('''INSERT OR IGNORE INTO Category (topic)
        ...
        cur.execute('SELECT id FROM Category WHERE topic = ? ', (categories, ))
        ...
        cur.execute('''UPDATE Links SET topic = ?
            WHERE id = ?''', (category_id, startpage + i))

The resulting network is then read from the SQLite database, plotted and saved to a PNG.

    nx.draw_networkx_nodes(G, pos, node_size=100, node_color='b')
    nx.draw_networkx_labels(G, pos, font_size=8, font_family='sans-serif', font_color='r')

Though the web scraper works very well and the database of hyperlinks is populated very quickly, the network graph is very crowded even for 60 links (plus all the webpages within each).
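
To illustrate how the Links table can avoid storing the same hyperlink twice, here is a minimal sketch using Python's built-in sqlite3 module. Only the id column definition, the link and parent column names and the INSERT OR IGNORE statement come from the post. The UNIQUE constraint on link, the column types, the in-memory database and the store_links helper are assumptions made for this example.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind
# (the post itself connects to 'wikiLinks.sqlite').
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Assumed schema: the UNIQUE constraint on link is what lets
# INSERT OR IGNORE silently skip a link that is already stored.
cur.execute('''CREATE TABLE IF NOT EXISTS Links (
    id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    link TEXT UNIQUE,
    parent INTEGER
)''')


def store_links(cur, links, parent_id):
    """Hypothetical helper: insert each link once, with the id of the page it was found on."""
    for link in links:
        cur.execute('INSERT OR IGNORE INTO Links (link, parent) VALUES (?, ?)',
                    (link, parent_id))


# '/wiki/Physics' appears twice but should be stored only once.
store_links(cur, ['/wiki/Physics', '/wiki/Nature', '/wiki/Physics'], 1)
conn.commit()

cur.execute('SELECT COUNT(link) FROM Links')
link_count = cur.fetchone()[0]  # fetchone() returns a tuple, so take its first element
print(link_count)
```

Because fetchone() returns a tuple, the count has to be unpacked with [0] before any arithmetic is done on it.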
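
The link-gathering step of the scraper can be sketched without any third-party packages. The post's script appears to use lxml (find_class and iter). This example swaps in the standard library's html.parser so it runs anywhere. The WikiLinkParser class, the extract_links function, the sample HTML and the rule of keeping only hrefs that start with '/wiki/' are all assumptions for illustration.

```python
from html.parser import HTMLParser


class WikiLinkParser(HTMLParser):
    """Collect the href of every <a> tag that points at a /wiki/ page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':  # only anchors carry hyperlinks
            return
        href = dict(attrs).get('href')
        # Assumed filter: keep relative Wikipedia links, drop external ones
        # and anchors that have no href at all.
        if href and href.startswith('/wiki/'):
            self.links.append(href)


def extract_links(html_text):
    """Return the /wiki/ links found in a chunk of HTML, in document order."""
    parser = WikiLinkParser()
    parser.feed(html_text)
    return parser.links


# Made-up page fragment standing in for a downloaded article body.
sample = '''<div class="mw-content-ltr">
<a href="/wiki/Physics">Physics</a>
<a href="https://example.org/">external</a>
<a name="anchor">no href</a>
<a href="/wiki/Nature">Nature</a>
</div>'''

found_links = extract_links(sample)
print(found_links)
```

Storing only the relative part of each link matches the post's remark that the database contains just the latter part of the URL.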
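
The idea of following a page up through its categories until the chain ends can be sketched with plain Python data structures. This is a sketch under stated assumptions, not the post's script. The follow_categories function, the max_depth cap and the parent_of dictionary (which stands in for fetching each category page and finding its first category link) are hypothetical. The rule of stopping when a category repeats is inferred from the post's loop condition.

```python
def follow_categories(start, parent_of, max_depth=20):
    """Walk from a page up through its categories.

    Returns the chain of names visited and the (child, parent) edges between them.
    """
    chain = [start]
    edges = []
    current = start
    while current in parent_of and len(chain) <= max_depth:
        nxt = parent_of[current]
        if nxt in chain:  # a repeated category would make the walk loop forever
            break
        edges.append((current, nxt))
        chain.append(nxt)
        current = nxt
    return chain, edges


# Hypothetical lookup table: each name maps to the category it belongs to.
parent_of = {
    'electromagnetism': 'subfields of physics',
    'subfields of physics': 'physics',
    'physics': 'physical sciences',
    'physical sciences': 'natural sciences',
    'natural sciences': 'nature',
    'nature': 'natural sciences',  # deliberately cyclic to exercise the stop condition
}

chain, edges = follow_categories('electromagnetism', parent_of)
print(' -> '.join(chain))
```

The resulting edge list could then be handed to a graph library, for example networkx's Graph.add_edges_from, before drawing the network.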