Collective Intelligence
Week 3: Crawling, Searching, Ranking
Old Dominion University, Department of Computer Science
CS 795/895, Spring 2009
Michael L. Nelson, 1/28/09

Crawling is Messy…

>>> import simple
>>> pagelist=['http://…']
>>> crawler=simple.crawler('')
>>> crawler.crawl(pagelist)
Indexing http://…
Indexing http://…
…
Could not open http://…
Indexing http://…
…
Indexing http://…
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "simple.py", line 52, in crawl
…

Not every URL will open. Not every URL will parse.

Note: this is the code as developed in the book's text, not the final, distributed version.

Crawling Our Local Web

>>> import searchengine
>>> crawler=searchengine.crawler('mln.db')
>>> crawler.createindextables()
>>> pagelist=['http://…']
>>> crawler.crawl(pagelist)
Indexing http://…
Indexing http://…
…
Could not open http://…
Indexing http://…
…

Three changes to the distributed code are needed:
1. s/Null/None/
2. s/separateWords/separatewords/
3. check spacing on the last line in crawl()

Processing The Page

1. Get the page
2. If HTML, create a "soup"
3. Strip the HTML out of the soup (all terms in one string)
4. Parse separate terms out of the string and store them in the index
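A minimal sketch of those four steps, assuming the urllib2 and BeautifulSoup modules the book's crawler uses (the function name and details here are illustrative, not the book's exact code):

import re
import urllib2
from BeautifulSoup import BeautifulSoup

def processpage(url):
    # 1. get the page
    html = urllib2.urlopen(url).read()
    # 2. if HTML, create a "soup"
    soup = BeautifulSoup(html)
    # 3. strip out the HTML, leaving all terms in one string
    text = ' '.join(soup.findAll(text=True))
    # 4. parse separate lowercase terms out of the string;
    #    these are what get stored in the index
    return [w.lower() for w in re.split('\\W+', text) if w != '']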

Schema for the Processed Pages

def createindextables(self):
    self.con.execute('create table urllist(url)')
    self.con.execute('create table wordlist(word)')
    self.con.execute('create table wordlocation(urlid,wordid,location)')
    self.con.execute('create table link(fromid integer,toid integer)')
    self.con.execute('create table linkwords(wordid,linkid)')
    self.con.execute('create index wordidx on wordlist(word)')
    self.con.execute('create index urlidx on urllist(url)')
    self.con.execute('create index wordurlidx on wordlocation(wordid)')
    self.con.execute('create index urltoidx on link(toid)')
    self.con.execute('create index urlfromidx on link(fromid)')
    self.dbcommit()

Searching Our Index

>>> e=searchengine.searcher('mln.db')
>>> e.getmatchrows('old dominion')
(lots of results)
>>> e.getmatchrows('monarch')
select w0.urlid,w0.location from wordlocation w0 where w0.wordid=1007
([(23, 371), (3, 609)], [1007])
>>> e.query('monarch')
select w0.urlid,w0.location from wordlocation w0 where w0.wordid=1007
http://system.cs.odu.edu/?page=faq&id=labhours
http://…
([1007], [23, 3])
>>> e.query('nelson')
select w0.urlid,w0.location from wordlocation w0 where w0.wordid=2297
http://…
http://…
http://…
http://…
http://…
http://…
http://…
http://…
([2297], [49, 48, 47, 46, 34, 20, 14, 13])

N.B.: the weights list is empty: weights=[]
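The SQL shown above is generated by getmatchrows(), which self-joins wordlocation once per query term; a sketch along the lines of the book's version:

def getmatchrows(self, q):
    # build 'select w0.urlid,w0.location,w1.location,... from
    # wordlocation w0,wordlocation w1,... where ...' dynamically
    fieldlist = 'w0.urlid'
    tablelist, clauselist, wordids = '', '', []
    tablenumber = 0
    for word in q.split(' '):
        wordrow = self.con.execute(
            "select rowid from wordlist where word='%s'" % word).fetchone()
        if wordrow is not None:
            wordid = wordrow[0]
            wordids.append(wordid)
            if tablenumber > 0:
                tablelist += ','
                clauselist += ' and w%d.urlid=w%d.urlid and ' % (
                    tablenumber - 1, tablenumber)
            fieldlist += ',w%d.location' % tablenumber
            tablelist += 'wordlocation w%d' % tablenumber
            clauselist += 'w%d.wordid=%d' % (tablenumber, wordid)
            tablenumber += 1
    fullquery = 'select %s from %s where %s' % (fieldlist, tablelist, clauselist)
    print fullquery
    rows = [row for row in self.con.execute(fullquery)]
    return rows, wordids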

We Can Do SQL

>>> cur=e.con.execute('select * from wordlist')
>>> for i in range(3): print cur.next()
...
(u'doctype',)
(u'html',)
(u'public',)
>>> cur=e.con.execute('select url from urllist')
>>> for i in range(8): print cur.next()
...
(u'http://…',)
(u'http://…',)
(u'http://…',)
(u'http://…',)
(u'http://…',)
(u'http://…',)
(u'http://…',)
(u'http://…',)

No Stemming in Our Database

>>> e.query('test')
select w0.urlid,w0.location from wordlocation w0 where w0.wordid=363
http://…
http://…
http://…
http://…
http://…
([363], [63, 49, 40, 16, 1])
>>> e.query('testing')
select w0.urlid,w0.location from wordlocation w0 where w0.wordid=1628
http://…
http://…
http://…
http://…
http://…
([1628], [20, 16, 10, 7, 3])

Porter Stemmer

The original paper, sample input & output, and various implementations are available online.
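Stemming would fold 'test' and 'testing' (and 'tested', 'tests', …) into a single index term. A quick illustration using NLTK's Porter stemmer (not part of the book's code; assumes the nltk package is installed):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# both of the queries from the previous slide reduce to one stem
print stemmer.stem('test')     # test
print stemmer.stem('testing')  # test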

Ranking

Currently documents are returned in order of ingest -- not good. The book covers 3 (of many possible) ranking mechanisms based on the content of the documents themselves; sketches of each scoring function follow the demo slides below.
– word frequency: the more often a word appears in the document, the more likely it is what the document is "about"
  cf. Term Frequency (the TF in TFIDF) from last lecture
– location in document: if the word appears near the "top" of the document, it is more likely to capture "aboutness"
  ex: word in title, intro, abstract
– word distance: for multi-word queries, give higher rank to documents that feature the terms in closer proximity
  ex: d1="welcome home to New York", d2="new homes being built in York County"
  q1="new york" → rank d1, d2
  q2="new home" → rank d2, d1
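Each mechanism produces a per-URL score, and the searcher combines them as a weighted sum after normalizing every score into the range 0–1. A sketch of the normalization, in the spirit of the book's normalizescores() (shown here as a standalone function):

def normalizescores(scores, smallIsBetter=0):
    # scores: dict mapping urlid -> raw score
    vsmall = 0.00001  # avoid division by zero
    if smallIsBetter:
        # for metrics where smaller is better (location, distance),
        # invert so that 1.0 is still the best score
        minscore = min(scores.values())
        return dict([(u, float(minscore) / max(vsmall, l))
                     for (u, l) in scores.items()])
    else:
        maxscore = max(max(scores.values()), vsmall)
        return dict([(u, float(c) / maxscore)
                     for (u, c) in scores.items()])

The final score for a URL is then the sum of weight * scores[url] over every (weight, scores) pair in the weights list seen in the slides below.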

Precision and Recall

Precision
– "ratio of the number of relevant documents retrieved over the total number of documents retrieved" (p. 10)
– how much extra stuff did you get?
Recall
– "ratio of relevant documents retrieved for a given query over the number of relevant documents for that query in the database" (p. 10)
  note: assumes a priori knowledge of the denominator!
– how much did you miss?
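As a made-up worked example: if a query retrieves 8 documents, 6 of which are relevant, and the collection holds 12 relevant documents in total, then precision = 6/8 = 0.75 and recall = 6/12 = 0.50. A two-line sketch over sets of document ids:

def precision_recall(retrieved, relevant):
    hits = len(retrieved & relevant)  # relevant documents retrieved
    return float(hits) / len(retrieved), float(hits) / len(relevant)

>>> precision_recall(set([1, 2, 3, 4]), set([2, 4, 5, 6]))
(0.5, 0.5)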

Precision and Recall

[figure 1.2 in FBY: precision vs. recall] An increase in one dimension is generally accompanied by a decrease in the other. Ex: stemming increases recall, but at the expense of precision.

Why Isn't Precision Always 100%?

What were we really searching for? Science? Games? Music?

Why Isn't Recall Always 100%?

– Virginia Agricultural and Mechanical College?
– Virginia Agricultural and Mechanical College and Polytechnic Institute?
– Virginia Polytechnic Institute?
– Virginia Polytechnic Institute and State University?
– Virginia Tech?

Ranking With Frequency

With getscoredlist() using weights=[] (ingest order, as before):

>>> import searchengine
>>> e=searchengine.searcher('mln.db')
>>> e.query('nelson')
select w0.urlid,w0.location from wordlocation w0 where w0.wordid=2297
http://…
http://…
http://…
http://…
http://…
http://…
http://…
http://…
([2297], [49, 48, 47, 46, 34, 20, 14, 13])

With getscoredlist() using weights=[(1.0,self.frequencyscore(rows))]:

>>> e.query('nelson')
select w0.urlid,w0.location from wordlocation w0 where w0.wordid=2297
http://…
http://…
http://…
http://…
http://…
http://…
http://…
http://…
([2297], [48, 13, 47, 46, 34, 14, 49, 20])
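Frequency scoring just counts how many matching rows each URL has. A sketch, essentially the book's frequencyscore() written as a standalone function (rows are the (urlid, location, …) tuples from getmatchrows(); normalizescores() as sketched earlier):

def frequencyscore(rows):
    # count how many times the query words appear in each URL
    counts = dict([(row[0], 0) for row in rows])
    for row in rows:
        counts[row[0]] += 1
    return normalizescores(counts)  # bigger count -> score nearer 1.0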

Ranking With Location, Location + Frequency

With getscoredlist() using weights=[(1.0,self.locationscore(rows))]:

>>> reload(searchengine)
>>> import searchengine
>>> e=searchengine.searcher('mln.db')
>>> e.query('nelson')
select w0.urlid,w0.location from wordlocation w0 where w0.wordid=2297
http://…
http://…
http://…
http://…
http://…
http://…
http://…
http://…
([2297], [48, 34, 47, 14, 20, 13, 49, 46])

With getscoredlist() using weights=[(1.0,self.locationscore(rows)), (1.0,self.frequencyscore(rows))]:

>>> e.query('nelson')
select w0.urlid,w0.location from wordlocation w0 where w0.wordid=2297
http://…
http://…
http://…
http://…
http://…
http://…
http://…
http://…
([2297], [48, 47, 34, 13, 14, 20, 49, 46])
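Location scoring prefers documents whose query terms appear nearest the top. A sketch, essentially the book's locationscore() as a standalone function:

def locationscore(rows):
    # remember the smallest (earliest) combined term position per URL
    locations = dict([(row[0], 1000000) for row in rows])
    for row in rows:
        loc = sum(row[1:])  # sum of the query terms' positions
        if loc < locations[row[0]]:
            locations[row[0]] = loc
    # a smaller location is better, so invert during normalization
    return normalizescores(locations, smallIsBetter=1)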

Ranking With Distance

With getscoredlist() using weights=[] (ingest order):

>>> import searchengine
>>> e=searchengine.searcher('mln.db')
>>> e.query('michael nelson')
select w0.urlid,w0.location,w1.location from wordlocation w0,wordlocation w1 where w0.wordid=2296 and w0.urlid=w1.urlid and w1.wordid=2297
http://…
http://…
http://…
http://…
http://…
([2296, 2297], [49, 48, 47, 46, 20])

With getscoredlist() using weights=[(1.0,self.distancescore(rows))]:

>>> reload(searchengine)
>>> import searchengine
>>> e=searchengine.searcher('mln.db')
>>> e.query('michael nelson')
select w0.urlid,w0.location,w1.location from wordlocation w0,wordlocation w1 where w0.wordid=2296 and w0.urlid=w1.urlid and w1.wordid=2297
http://…
http://…
http://…
http://…
http://…
([2296, 2297], [47, 20, 49, 48, 46])

The last-place page has 2 "Michael"s + 1 "Nelson", but not in proximity (the "Nelson" is not even visible…).
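Distance scoring rewards proximity between the query terms. A sketch, essentially the book's distancescore() as a standalone function:

def distancescore(rows):
    # with only one query term, every document ties
    if len(rows[0]) <= 2:
        return dict([(row[0], 1.0) for row in rows])
    # remember the smallest total gap between adjacent terms per URL
    mindistance = dict([(row[0], 1000000) for row in rows])
    for row in rows:
        dist = sum([abs(row[i] - row[i - 1]) for i in range(2, len(row))])
        if dist < mindistance[row[0]]:
            mindistance[row[0]] = dist
    return normalizescores(mindistance, smallIsBetter=1)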

Link-Based Metrics

Content-based metrics have an implicit assumption: everyone is telling the truth!
– Lynch, "When Documents Deceive"
– AIRWeb: Adversarial Information Retrieval on the Web
We can mine the collective intelligence of the web community by seeing how they voted with their links.
– assumption: when choosing a target for their web page links, people do a good job of filtering out spam, poor quality, etc.
– result: your document's rank is influenced by the content of other people's documents

Want to link "to" a review of DJ Shadow's "The Outsider"?

Try a search-engine query for dj+shadow+the+outsider+review
– where's the most knowledgeable review ever on … ?
– class assignment: everyone go home and create 10 pages that link to:
  http://f-measure.blogspot.com/2009/01/dj-shadow-outsider-lp-review.html

Ranking by Counting In-Links

With getscoredlist() using weights=[(1.0,self.inboundlinkscore(rows))]:

>>> reload(searchengine)
>>> import searchengine
>>> e=searchengine.searcher('mln.db')
>>> e.query('michael nelson')
select w0.urlid,w0.location,w1.location from wordlocation w0,wordlocation w1 where w0.wordid=2296 and w0.urlid=w1.urlid and w1.wordid=2297
http://…
http://…
http://…
http://…
http://…
([2296, 2297], [20, 46, 49, 48, 47])
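In-link counting consults the link table built at crawl time. A sketch, essentially the book's inboundlinkscore() (self.con is the searcher's SQLite connection):

def inboundlinkscore(self, rows):
    # score each candidate URL by how many pages link to it
    uniqueurls = set([row[0] for row in rows])
    inboundcount = dict([(u, self.con.execute(
        'select count(*) from link where toid=%d' % u).fetchone()[0])
        for u in uniqueurls])
    return normalizescores(inboundcount)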

But Not All Links Are Equal…

You linking to my LP review is nice, but it's not as nice as being linked to by Spin Magazine, Rolling Stone, MTV, etc.
– a page's "importance" is defined by having other important pages link to it

Calculating Pagerank

"Random surfer" model = some guy just following links until he gets bored and randomly jumps to a new page (i.e., arrives via a method other than following links).
– damping factor d = 0.85 (probability the surfer landed on the page by following a link)
– 1-d = 0.15 (probability the surfer landed on the page "at random")

For page A in the book's figure 4-3 (which needs an extra link from C to match the text):

PR(A) = 0.15 + 0.85 * ( PR(B)/links(B) + PR(C)/links(C) + PR(D)/links(D) )
      = 0.15 + 0.85 * ( 0.5/4 + 0.7/5 + 0.2/1 )
      = 0.15 + 0.85 * ( 0.465 )
      = 0.15 + 0.39525
      = 0.54525
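A minimal, self-contained sketch of the iteration (the book's calculatepagerank() does the same thing against the SQLite link table; the epsilon early-stop is our addition, anticipating the note on the next slide, not something the book does):

def pagerank(links, d=0.85, iterations=20, epsilon=0.0001):
    # links: dict mapping each page to the list of pages it links to
    pages = list(links.keys())
    pr = dict([(p, 1.0) for p in pages])  # every page starts at 1.0
    for i in range(iterations):
        newpr = {}
        for p in pages:
            score = 1 - d  # the "random jump" share
            for q in pages:
                # q gives p an equal share of its rank per outlink
                if p in links[q]:
                    score += d * pr[q] / len(links[q])
            newpr[p] = score
        # stop early once no score moves more than epsilon
        if max([abs(newpr[p] - pr[p]) for p in pages]) < epsilon:
            return newpr
        pr = newpr
    return pr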

Ranking by Pagerank

With getscoredlist() using weights=[(1.0,self.pagerankscore(rows))]:

>>> reload(searchengine)
>>> import searchengine
>>> crawler=searchengine.crawler('mln.db')
>>> crawler.calculatepagerank()
Iteration 0
Iteration 1
Iteration 2
…
Iteration 18
Iteration 19
>>> e=searchengine.searcher('mln.db')
>>> e.query('michael nelson')
select w0.urlid,w0.location,w1.location from wordlocation w0,wordlocation w1 where w0.wordid=2296 and w0.urlid=w1.urlid and w1.wordid=2297
http://…
http://…
http://…
http://…
http://…
([2296, 2297], [46, 20, 49, 48, 47])

The first two URLs swapped positions relative to in-link counting (but just barely). The code in the book always stops after 20 iterations; it could instead stop when a convergence threshold is reached.

Ranking by Link (Anchor) Text

With getscoredlist() using weights=[(1.0,self.linktextscore(rows,wordids))]:

>>> reload(searchengine)
>>> import searchengine
>>> e=searchengine.searcher('mln.db')
>>> e.query('nelson')
select w0.urlid,w0.location from wordlocation w0 where w0.wordid=2297
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "searchengine.py", line 253, in query
    scores=self.getscoredlist(rows,wordids)
  File "searchengine.py", line 233, in getscoredlist
    weights=[(1.0,self.linktextscore(rows,wordids))]
  File "searchengine.py", line 310, in linktextscore
    normalizedscores=dict([(u,float(l)/maxscore) for (u,l) in linkscores.items()])
ZeroDivisionError: float division
>>> e.query('recent')
select w0.urlid,w0.location from wordlocation w0 where w0.wordid=206
http://…
http://…
http://…
http://…
http://…
http://…
http://…
http://…
http://…
http://…
([206], [47, 49, 48, 63, 62, 61, 60, 59, 58, 57])

Bad error handling: the 'nelson' query dies because no link text pointing at any of its results contains the word. For 'recent', the first three results have the word in the page and in a link pointing at the page; the others have the word in the page, but no link text pointing at the page contains it.
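A sketch of the scoring function with the obvious guard added (essentially the book's linktextscore(), which divides by the maximum score even when it is zero; the vsmall floor is our patch, not the book's fix):

def linktextscore(self, rows, wordids):
    # credit each result URL with the PageRank of every page that
    # links to it using one of the query words in the link text
    linkscores = dict([(row[0], 0) for row in rows])
    for wordid in wordids:
        cur = self.con.execute(
            'select link.fromid,link.toid from linkwords,link '
            'where wordid=%d and linkwords.linkid=link.rowid' % wordid)
        for (fromid, toid) in cur:
            if toid in linkscores:
                pr = self.con.execute(
                    'select score from pagerank where urlid=%d'
                    % fromid).fetchone()[0]
                linkscores[toid] += pr
    # guard against maxscore == 0, the cause of the
    # ZeroDivisionError on the 'nelson' query above
    maxscore = max(max(linkscores.values()), 0.00001)
    return dict([(u, float(l) / maxscore)
                 for (u, l) in linkscores.items()])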

Much, Much More

A million variations, optimizations, analyses, etc. for Pagerank.
– Problems: preferential attachment (the rich get richer), the "random surfer" model is not accurate (cf. the "search dominant" model), etc.
– see the papers by Cho et al.: "Page Quality: In Search of an Unbiased Web Ranking" and "Impact of Web Search Engines on Page Popularity"
Alternatives to Pagerank
– ex: Kleinberg's Hubs and Authorities (HITS)

Voting With Your Clicks

Counting links mines what page authors do, but what about mining what readers click on?
– the holy grail for advertisers
– more privacy concerns than I could ever hope to cover…
– one minor, nasty little detail: usage data is notoriously hard to get…

Neural Networks

Mapping queries (world, river, bank) to URLs (World Bank, River, Earth). The hidden layer is unknown to us; we are trying to model a user's cognitive powers.

Building Our NN >> import nn >>> mynet=nn.searchnet('nn.db') >>> mynet.maketables() >>> wWorld,wRiver,wBank =101,102,103 >>> uWorldBank,uRiver,uEarth =201,202,203 >>> mynet.generatehiddennode([wWorld,wBank],[uWorldBank,uRiver,uEarth]) >>> mynet.getresult([wWorld,wBank],[uWorldBank,uRiver,uEarth]) [ , , ] >>> >>> mynet.trainquery([wWorld,wBank],[uWorldBank,uRiver,uEarth],uWorldBank) >>> mynet.getresult([wWorld,wBank],[uWorldBank,uRiver,uEarth]) [ , , ] Without training all outcomes equally likely With training, “world bank” query more likely to map to “WorldBank” URL

More Training >>> allurls=[uWorldBank,uRiver,uEarth] >>> for i in range(30):... mynet.trainquery([wWorld,wBank],allurls,uWorldBank)... mynet.trainquery([wRiver,wBank],allurls,uRiver)... mynet.trainquery([wWorld],allurls,uEarth)... >>> mynet.getresult([wWorld,wBank],allurls) [ , , ] >>> mynet.getresult([wRiver,wBank],allurls) [ , , ] >>> mynet.getresult([wBank],allurls) [ , , ] We’ve never seen just a “Bank” query, but we can predict the results Neat. Book provides mechanism to include nnscore() in weights dict. But where does training data come from?

Another Connectionist Example: Hebbian Learning

Not in the book, but Bollen & Nelson did some research on using Hebbian learning & "smart objects" for real-time link adjustment:
– "Distributed, Real-Time Computation of Community Preferences", HT 2005
– "Dynamic Linking of Smart Digital Objects Based on User Navigation Patterns", cs.DL/…
– "Adaptive Network of Smart Objects", ICCP 2002
Promising, but notable cold-start problems.