Collective Intelligence
Week 3: Crawling, Searching, Ranking

CS 795/895, Spring 2009
Old Dominion University, Department of Computer Science
Michael L. Nelson
1/28/09
Crawling is Messy…

>>> import simple
>>> pagelist=['
>>> crawler=simple.crawler('')
>>> crawler.crawl(pagelist)
Indexing
Indexing
Indexing
Indexing
Indexing
Indexing
Indexing
Indexing
Indexing
Indexing
Could not open
Indexing
Indexing
…
Indexing
Traceback (most recent call last):
  File " ", line 1, in
  File "simple.py", line 52, in crawl
…

Not every URL will open
Not every URL will parse
Note: this is the code from pp. , not the final, distributed version
Crawling Our Local Web

>>> import searchengine
>>> crawler=searchengine.crawler('mln.db')
>>> crawler.createindextables()
>>> pagelist=['
>>> crawler.crawl(pagelist)
Indexing
Indexing
Indexing
Indexing
Indexing
Indexing
Indexing
Indexing
Indexing
Indexing
Could not open
Indexing
Indexing
…

Three changes to the distributed code are needed:
1. s/Null/None/
2. s/separateWords/separatewords/
3. check spacing on the last line of crawl()
Processing The Page

1. Get the page
2. If HTML, create a "soup"
3. Strip the HTML out of the soup (all terms in one string)
4. Parse separate terms out of the string & store them in the index
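For concreteness, a minimal sketch of those four steps (assuming Python 2 with urllib2 and BeautifulSoup, as the book's crawler uses; the helper name is made up, not the book's code):

import re
import urllib2
from BeautifulSoup import BeautifulSoup

def fetch_and_split(url):                        # hypothetical helper, not the book's code
    html = urllib2.urlopen(url).read()           # 1. get the page
    soup = BeautifulSoup(html)                   # 2. if HTML, create a "soup"
    text = ' '.join(soup.findAll(text=True))     # 3. strip the HTML, leaving one big string
    # 4. parse separate terms out of the string
    return [w.lower() for w in re.split(r'\W+', text) if w != '']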
Schema for the Processed Pages

def createindextables(self):
    self.con.execute('create table urllist(url)')
    self.con.execute('create table wordlist(word)')
    self.con.execute('create table wordlocation(urlid,wordid,location)')
    self.con.execute('create table link(fromid integer,toid integer)')
    self.con.execute('create table linkwords(wordid,linkid)')
    self.con.execute('create index wordidx on wordlist(word)')
    self.con.execute('create index urlidx on urllist(url)')
    self.con.execute('create index wordurlidx on wordlocation(wordid)')
    self.con.execute('create index urltoidx on link(toid)')
    self.con.execute('create index urlfromidx on link(fromid)')
    self.dbcommit()
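To illustrate how one crawled page's terms might land in these tables, a hedged sqlite3 sketch (simpler than the book's addtoindex/getentryid; the function name is made up):

import sqlite3

def index_page(con, url, words):
    # one row in urllist per page; its rowid serves as the urlid
    urlid = con.execute('insert into urllist(url) values (?)', (url,)).lastrowid
    for location, word in enumerate(words):
        # look the word up in wordlist, adding it if unseen
        row = con.execute('select rowid from wordlist where word=?', (word,)).fetchone()
        if row:
            wordid = row[0]
        else:
            wordid = con.execute('insert into wordlist(word) values (?)', (word,)).lastrowid
        # record where in the page the word occurred
        con.execute('insert into wordlocation(urlid,wordid,location) values (?,?,?)',
                    (urlid, wordid, location))
    con.commit()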
Searching Our Index

>>> e=searchengine.searcher('mln.db')
>>> e.getmatchrows('old dominion')
(lots of results)
>>> e.getmatchrows('monarch')
select w0.urlid,w0.location from wordlocation w0 where w0.wordid=1007
([(23, 371), (3, 609)], [1007])
>>> e.query('monarch')
select w0.urlid,w0.location from wordlocation w0 where w0.wordid=
http://system.cs.odu.edu/?page=faq&id=labhours
http://
([1007], [23, 3])
>>> e.query('nelson')
select w0.urlid,w0.location from wordlocation w0 where w0.wordid=
http://
http://
http://
http://
http://
http://
http://
http://
([2297], [49, 48, 47, 46, 34, 20, 14, 13])

N.B.: the weights list is empty: weights=[]
We Can Do SQL

>>> cur=e.con.execute('select * from wordlist')
>>> for i in range(3): print cur.next()
...
(u'doctype',)
(u'html',)
(u'public',)
>>> cur=e.con.execute('select url from urllist')
>>> for i in range(8): print cur.next()
...
(u'
(u'
(u'
(u'
(u'
(u'
(u'
(u'
No Stemming in Our Database

>>> e.query('test')
select w0.urlid,w0.location from wordlocation w0 where w0.wordid=
http://
http://
http://
http://
http://
([363], [63, 49, 40, 16, 1])
>>> e.query('testing')
select w0.urlid,w0.location from wordlocation w0 where w0.wordid=
http://
http://
http://
http://
http://
([1628], [20, 16, 10, 7, 3])
Porter Stemmer

[figure: illustration of the Porter stemming algorithm]
The original paper, sample input & output, and various implementations are available online.
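As a hedged illustration (not part of the book's searchengine.py), NLTK's Porter implementation could be dropped into separatewords so that 'test' and 'testing' collapse to a single index term:

from nltk.stem import PorterStemmer   # one of the many available implementations

stemmer = PorterStemmer()
print stemmer.stem('testing')    # -> 'test'
print stemmer.stem('tested')     # -> 'test'
print stemmer.stem('monarchs')   # -> 'monarch'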
Ranking

Currently documents are returned in order of ingest -- not good.
The book covers 3 (of many possible) ranking mechanisms based on the content of the documents themselves:
– word frequency: the more often a word appears in the document, the more likely it is what the document is "about"
  cf. Term Frequency (the TF in TFIDF) from last lecture
– location in document: if the word appears near the "top" of the document, it is more likely to capture "aboutness"
  ex: word in title, intro, abstract
– word distance: for multi-word queries, give higher rank to documents that feature the terms in closer proximity
  ex: d1="welcome home to New York", d2="new homes being built in York County"
  q1="new york"; rank d1, d2
  q2="new home"; rank d2, d1
Precision and Recall

Precision
– "ratio of the number of relevant documents retrieved over the total number of documents retrieved" (p. 10)
– how much extra stuff did you get?
Recall
– "ratio of relevant documents retrieved for a given query over the number of relevant documents for that query in the database" (p. 10)
– note: assumes a priori knowledge of the denominator!
– how much did you miss?
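A tiny worked example of the two ratios (the numbers are made up for illustration):

retrieved          = 8    # documents returned for the query
relevant_retrieved = 6    # of those, how many were actually relevant
relevant_in_db     = 20   # relevant documents in the whole database (assumed known!)

precision = float(relevant_retrieved) / retrieved       # 6/8  = 0.75 -- how much extra stuff?
recall    = float(relevant_retrieved) / relevant_in_db  # 6/20 = 0.30 -- how much did we miss?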
Precision and Recall

[figure 1.2 in FBY: the precision vs. recall trade-off]
An increase in one dimension is generally accompanied by a decrease in the other.
ex: stemming increases recall, but at the expense of precision
Why Isn't Precision Always 100%?

What were we really searching for? Science? Games? Music?
Why Isn't Recall Always 100%?

Virginia Agricultural and Mechanical College?
Virginia Agricultural and Mechanical College and Polytechnic Institute?
Virginia Polytechnic Institute?
Virginia Polytechnic Institute and State University?
Virginia Tech?
Ranking With Frequency

getscoredlist() with weights=[] (as before):
>>> import searchengine
>>> e=searchengine.searcher('mln.db')
>>> e.query('nelson')
select w0.urlid,w0.location from wordlocation w0 where w0.wordid=
http://
http://
http://
http://
http://
http://
http://
http://
([2297], [49, 48, 47, 46, 34, 20, 14, 13])

getscoredlist() with weights=[(1.0,self.frequencyscore(rows))]:
>>> import searchengine
>>> e=searchengine.searcher('mln.db')
>>> e.query('nelson')
select w0.urlid,w0.location from wordlocation w0 where w0.wordid=
http://
http://
http://
http://
http://
http://
http://
http://
([2297], [48, 13, 47, 46, 34, 14, 49, 20])
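A hedged sketch of what a frequency-based scorer could look like, assuming rows of (urlid, location, …) tuples as returned by getmatchrows (a sketch, not the book's exact frequencyscore):

def frequency_score(rows):
    # more matching word locations in a page => higher score
    counts = {}
    for row in rows:
        counts[row[0]] = counts.get(row[0], 0) + 1
    top = max(counts.values())                       # normalize so the best page scores 1.0
    return dict((u, float(c) / top) for (u, c) in counts.items())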
Ranking With Location, Location + Frequency

>>> reload(searchengine)
>>> import searchengine
>>> e=searchengine.searcher('mln.db')
>>> e.query('nelson')
select w0.urlid,w0.location from wordlocation w0 where w0.wordid=
http://
http://
http://
http://
http://
http://
http://
http://
([2297], [48, 34, 47, 14, 20, 13, 49, 46])
getscoredlist() weights=[(1.0,self.locationscore(rows))]

>>> reload(searchengine)
>>> import searchengine
>>> e=searchengine.searcher('mln.db')
>>> e.query('nelson')
select w0.urlid,w0.location from wordlocation w0 where w0.wordid=
http://
http://
http://
http://
http://
http://
http://
http://
([2297], [48, 47, 34, 13, 14, 20, 49, 46])
getscoredlist() weights=[(1.0,self.locationscore(rows)), (1.0,self.frequencyscore(rows))]
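In the same spirit, a hedged location-based scorer: the smaller the sum of the term locations (i.e., the nearer the terms are to the top of the page), the higher the score (a sketch, not the book's locationscore):

def location_score(rows):
    best = {}
    for row in rows:
        loc_sum = sum(row[1:])                       # sum of this combination's locations
        if row[0] not in best or loc_sum < best[row[0]]:
            best[row[0]] = loc_sum
    # invert so that small location sums turn into scores near 1.0
    return dict((u, 1.0 / (1.0 + s)) for (u, s) in best.items())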
Ranking With Distance

getscoredlist() with weights=[]:
>>> import searchengine
>>> e=searchengine.searcher('mln.db')
>>> e.query('michael nelson')
select w0.urlid,w0.location,w1.location from wordlocation w0,wordlocation w1 where w0.wordid=2296 and w0.urlid=w1.urlid and w1.wordid=
http://
http://
http://
http://
http://
([2296, 2297], [49, 48, 47, 46, 20])

getscoredlist() with weights=[(1.0,self.distancescore(rows))]:
>>> reload(searchengine)
>>> import searchengine
>>> e=searchengine.searcher('mln.db')
>>> e.query('michael nelson')
select w0.urlid,w0.location,w1.location from wordlocation w0,wordlocation w1 where w0.wordid=2296 and w0.urlid=w1.urlid and w1.wordid=
http://
http://
http://
http://
http://
([2296, 2297], [47, 20, 49, 48, 46])

The old top page has 2 "Michael"s + 1 "Nelson", but not in proximity (the "Nelson" is not even visible…)
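And a hedged distance-based scorer, rewarding pages where the query terms sit close together (again a sketch, not the book's distancescore):

def distance_score(rows):
    best = {}
    for row in rows:
        locs = row[1:]
        # total gap between adjacent query terms for this combination of locations
        dist = sum(abs(locs[i] - locs[i - 1]) for i in range(1, len(locs)))
        if row[0] not in best or dist < best[row[0]]:
            best[row[0]] = dist
    return dict((u, 1.0 / (1.0 + d)) for (u, d) in best.items())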
Link-Based Metrics

Content-based metrics have an implicit assumption: everyone is telling the truth!
– Lynch, "When Documents Deceive"
– AIRWeb: Adversarial Information Retrieval on the Web
We can mine the collective intelligence of the web community by seeing how they voted with their links
– assumption: when choosing targets for their web page links, people do a good job of filtering out spam, poor quality, etc.
– result: your document's ranking is influenced by the content of other people's documents
Want to link "to" a review of DJ Shadow's "The Outsider"?

http://…adow+the+outsider+review
– where's the most knowledgeable review ever on ???
– class assignment: everyone go home and create 10 pages that link to:
  http://f-measure.blogspot.com/2009/01/dj-shadow-outsider-lp-review.html
Ranking by In-Link Count

>>> reload(searchengine)
>>> import searchengine
>>> e=searchengine.searcher('mln.db')
>>> e.query('michael nelson')
select w0.urlid,w0.location,w1.location from wordlocation w0,wordlocation w1 where w0.wordid=2296 and w0.urlid=w1.urlid and w1.wordid=
http://
http://
http://
http://
http://
([2296, 2297], [20, 46, 49, 48, 47])

getscoredlist() weights=[(1.0,self.inboundlinkscore(rows))]
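A hedged sketch of an in-link counter over the link table from the schema above (not the book's exact inboundlinkscore):

def inbound_link_score(con, rows):
    # count how many rows in the link table point at each candidate page
    counts = {}
    for urlid in set(row[0] for row in rows):
        cur = con.execute('select count(*) from link where toid=?', (urlid,))
        counts[urlid] = cur.fetchone()[0]
    top = max(counts.values()) or 1                  # avoid dividing by zero
    return dict((u, float(c) / top) for (u, c) in counts.items())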
But Not All Links Are Equal…

You linking to my LP review is nice, but it's not as nice as it would be if it were linked to by Spin Magazine, Rolling Stone, MTV, etc.
– a page's "importance" is defined by having other important pages link to it
Calculating PageRank

PR(A) = (1-d) + d * ( PR(B)/links(B) + PR(C)/links(C) + PR(D)/links(D) )
      = 0.15 + 0.85 * ( 0.5/… + …/… + …/1 )
      = …

(fig 4-3 needs an extra link from C to match the text)

"Random surfer" model = some guy just following links until he gets bored and randomly jumps to a new page (i.e., arrives via a method other than following links)
damping factor (d) = .85 (probability the surfer landed on the page by following a link)
1-d = .15 (probability the surfer landed on the page at "random")
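A hedged, pure-Python sketch of the iteration on a toy graph (the book stores PageRank in a database table instead); this version stops when the scores settle rather than after a fixed number of passes:

def pagerank(links, d=0.85, tol=0.0001):
    # links maps each page to the list of pages it links out to
    pages = set(links) | set(p for targets in links.values() for p in targets)
    pr = dict((p, 1.0) for p in pages)               # every page starts at 1.0
    while True:
        new_pr = {}
        for page in pages:
            # PageRank shared by every page that links to this one
            incoming = sum(pr[src] / len(targets)
                           for src, targets in links.items() if page in targets)
            new_pr[page] = (1 - d) + d * incoming
        if max(abs(new_pr[p] - pr[p]) for p in pages) < tol:
            return new_pr                            # converged
        pr = new_pr

print pagerank({'B': ['A'], 'C': ['A'], 'D': ['A'], 'A': []})   # toy graph: B, C, D all link to A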
Ranking by PageRank

>>> reload(searchengine)
>>> import searchengine
>>> crawler=searchengine.crawler('mln.db')
>>> crawler.calculatepagerank()
Iteration 0
Iteration 1
Iteration 2
…
Iteration 18
Iteration 19
>>> e=searchengine.searcher('mln.db')
>>> e.query('michael nelson')
select w0.urlid,w0.location,w1.location from wordlocation w0,wordlocation w1 where w0.wordid=2296 and w0.urlid=w1.urlid and w1.wordid=
http://
http://
http://
http://
http://
([2296, 2297], [46, 20, 49, 48, 47])

getscoredlist() weights=[(1.0,self.pagerankscore(rows))]
These 2 URLs swapped positions (but just barely).
The code in the book always stops after 20 iterations; it could instead stop when a convergence threshold is reached.
Ranking by Link (Anchor) Text

>>> reload(searchengine)
>>> import searchengine
>>> e=searchengine.searcher('mln.db')
>>> e.query('nelson')
select w0.urlid,w0.location from wordlocation w0 where w0.wordid=2297
Traceback (most recent call last):
  File " ", line 1, in
  File "searchengine.py", line 253, in query
    scores=self.getscoredlist(rows,wordids)
  File "searchengine.py", line 233, in getscoredlist
    weights=[(1.0,self.linktextscore(rows,wordids))]
  File "searchengine.py", line 310, in linktextscore
    normalizedscores=dict([(u,float(l)/maxscore) for (u,l) in linkscores.items()])
ZeroDivisionError: float division
>>> e.query('recent')
select w0.urlid,w0.location from wordlocation w0 where w0.wordid=
http://
http://
http://
http://
http://
http://
http://
http://
http://
http://
([206], [47, 49, 48, 63, 62, 61, 60, 59, 58, 57])

getscoredlist() weights=[(1.0,self.linktextscore(rows,wordids))]
Bad error handling!
The first three results have the word in the page and the word in link text; the others have the word in the page, but not in any link text for that page.
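One hedged way to patch that bad error handling: guard the normalization when no candidate page has the query word in any link text (a sketch over a precomputed linkscores dict, not the book's exact method):

def normalize_linktext(linkscores):
    # linkscores: urlid -> raw anchor-text score; may be all zeros
    maxscore = max(linkscores.values())
    if maxscore == 0:
        return dict((u, 0.0) for u in linkscores)    # nothing matched in any link text
    return dict((u, float(l) / maxscore) for (u, l) in linkscores.items())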
Much, Much More

A million variations, optimizations, analyses, etc. for PageRank
– Problems: preferential attachment (the rich get richer), the "random surfer" model is not accurate (cf. the "search-dominant" model), etc.
– see the paper by Cho et al.: Page Quality: In Search of an Unbiased Web Ranking
– Impact of Web Search Engines on Page Popularity
Alternatives to PageRank
– ex: Kleinberg's Hubs and Authorities
Voting With Your Clicks

Counting links mines what page authors do, but what about mining what readers click on?
– the holy grail for advertisers
– more privacy concerns than I could ever hope to cover…
– one minor, nasty little detail: usage data is notoriously hard to get…
Neural Networks

Mapping queries (world, river, bank) to URLs (World Bank, River, Earth). The hidden layer is unknown to us; we are trying to model the user's cognitive process.
Building Our NN

>>> import nn
>>> mynet=nn.searchnet('nn.db')
>>> mynet.maketables()
>>> wWorld,wRiver,wBank = 101,102,103
>>> uWorldBank,uRiver,uEarth = 201,202,203
>>> mynet.generatehiddennode([wWorld,wBank],[uWorldBank,uRiver,uEarth])
>>> mynet.getresult([wWorld,wBank],[uWorldBank,uRiver,uEarth])
[ , , ]
>>> mynet.trainquery([wWorld,wBank],[uWorldBank,uRiver,uEarth],uWorldBank)
>>> mynet.getresult([wWorld,wBank],[uWorldBank,uRiver,uEarth])
[ , , ]

Without training, all outcomes are equally likely.
With training, the "world bank" query is more likely to map to the "WorldBank" URL.
More Training

>>> allurls=[uWorldBank,uRiver,uEarth]
>>> for i in range(30):
...     mynet.trainquery([wWorld,wBank],allurls,uWorldBank)
...     mynet.trainquery([wRiver,wBank],allurls,uRiver)
...     mynet.trainquery([wWorld],allurls,uEarth)
...
>>> mynet.getresult([wWorld,wBank],allurls)
[ , , ]
>>> mynet.getresult([wRiver,wBank],allurls)
[ , , ]
>>> mynet.getresult([wBank],allurls)
[ , , ]

We've never seen just a "bank" query, but we can predict the results. Neat.
The book provides a mechanism to include nnscore() in the weights list. But where does training data come from?
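For reference, a hedged standalone sketch of how getscoredlist's (weight, scores) pairs blend into one ranking; a trained nnscore output would simply be one more pair in the list:

def combine_scores(weighted_scores, urlids):
    # weighted_scores is a list of (weight, {urlid: score}) pairs, like the book's weights list
    total = dict((u, 0.0) for u in urlids)
    for (w, scores) in weighted_scores:
        for u in urlids:
            total[u] += w * scores.get(u, 0.0)
    return total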
Another Connectionist Example: Hebbian Learning

Not in the book, but Bollen & Nelson did some research on using Hebbian learning & "smart objects" for real-time link adjustment:
– "Distributed, Real-Time Computation of Community Preferences", HT 2005
– "Dynamic Linking of Smart Digital Objects Based on User Navigation Patterns", cs.DL/
– "Adaptive Network of Smart Objects", ICCP 2002
Promising, but notable cold-start problems