1 Web Search Week 6 LBSC 796/INFM 718R October 15, 2007

2 What “Caused” the Web?
Affordable storage
– 300,000 words/$ by 1995
Adequate backbone capacity
– 25,000 simultaneous transfers by 1995
Adequate “last mile” bandwidth
– 1 second/screen (of text) by 1995
Display capability
– 10% of US population could see images by 1995
Effective search capabilities
– Lycos and Yahoo! achieved useful scale in 1994-1995

3 Defining the Web
HTTP, HTML, or URL?
Static, dynamic or streaming?
Public, protected, or internal?

4 Number of Web Sites

5 What’s a Web “Site”?
Any server at port 80?
– Misses many servers at other ports
Some servers host unrelated content
– Geocities
Some content requires specialized servers
– rtsp

6 Estimated Web Size (2005) Gulli and Signorini, 2005

7 Crawling the Web

8 Web Crawling Algorithm
Put a set of known sites on a queue
Repeat until the queue is empty:
– Take the first page off of the queue
– Check to see if this page has been processed
– If this page has not yet been processed:
  Add this page to the index
  Add each link on the current page to the queue
  Record that this page has been processed
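
A minimal sketch of this crawl loop in Python, assuming caller-supplied helpers (fetch_page, extract_links, and index_page are placeholders, not part of the original slides); real crawlers also add politeness delays, robots.txt checks, and a persistent frontier:

    from collections import deque

    def crawl(seed_urls, fetch_page, extract_links, index_page):
        """Breadth-first crawl of the Web graph, as outlined above."""
        queue = deque(seed_urls)      # put a set of known sites on a queue
        processed = set()
        while queue:                  # repeat until the queue is empty
            url = queue.popleft()     # take the first page off of the queue
            if url in processed:      # has this page already been processed?
                continue
            page = fetch_page(url)
            index_page(url, page)     # add this page to the index
            for link in extract_links(page):
                queue.append(link)    # add each link on the page to the queue
            processed.add(url)        # record that this page has been processed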

9 Link Structure of the Web

10 Global Internet Users
Native speakers; Global Reach projection for 2004 (as of Sept, 2003)

11 Global Internet Users
Native speakers; Global Reach projection for 2004 (as of Sept, 2003)

12 Web Crawl Challenges
Discovering “islands” and “peninsulas”
Duplicate and near-duplicate content
– 30-40% of total content
Server and network loads
Dynamic content generation
Link rot
– Changes at 1% per week
Temporary server interruptions

13 Duplicate Detection
Structural
– Identical directory structure (e.g., mirrors, aliases)
Syntactic
– Identical bytes
– Identical markup (HTML, XML, …)
Semantic
– Identical content
– Similar content (e.g., with a different banner ad)
– Related content (e.g., translated)
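
The “similar content” case is usually handled with near-duplicate detection; the slide does not name a method, so the sketch below uses word shingles and Jaccard overlap purely as one illustration (the two documents are made up):

    def shingles(text, k=3):
        """Set of k-word shingles drawn from a document."""
        words = text.lower().split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(a, b):
        """Overlap of two shingle sets; values near 1.0 suggest near-duplicates."""
        return len(a & b) / len(a | b) if (a | b) else 0.0

    # Same article, different banner ad
    d1 = "breaking news the market rose sharply today on strong earnings"
    d2 = "buy widgets now! the market rose sharply today on strong earnings"
    print(jaccard(shingles(d1), shingles(d2)))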

14 Robots Exclusion Protocol
Depends on voluntary compliance by crawlers
Exclusion by site
– Create a robots.txt file at the server’s top level
– Indicate which directories not to crawl
Exclusion by document (in HTML head)
– Not implemented by all crawlers
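
A hedged example of honoring the protocol from a crawler written in Python, using the standard library’s urllib.robotparser; the site URL and user-agent string are placeholders:

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")  # robots.txt lives at the server's top level
    rp.read()                                        # fetch and parse it

    # A compliant crawler checks before fetching each page
    url = "http://www.example.com/private/report.html"
    if rp.can_fetch("MyCrawler/1.0", url):
        pass  # OK to fetch
    else:
        pass  # the site asked crawlers to stay out of this directory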

15 Adversarial IR
Search is user-controlled suppression
– Everything is known to the search system
– Goal: avoid showing things the user doesn’t want
Other stakeholders have different goals
– Authors risk little by wasting your time
– Marketers hope for serendipitous interest
Metadata from trusted sources is more reliable

16 Index Spam
Goal: Manipulate rankings of an IR system
Multiple strategies:
– Create bogus user-assigned metadata
– Add invisible text (font in background color, …)
– Alter your text to include desired query terms
– “Link exchanges” create links to your page

17 Hands on: The Internet Archive
Web crawls since 1997
– http://archive.org
Check out Maryland’s Web site in 1997
Check out the history of your favorite site

18 Estimating Authority from Links
(figure: link graph with Authority and Hub node labels)
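
The figure’s Authority and Hub labels suggest HITS-style link analysis; below is a minimal power-iteration sketch over a toy link graph, offered as an assumption since the slide shows only the picture:

    def hits(links, iterations=20):
        """links maps each page to the list of pages it points to.
        Returns (authority, hub) scores after simple power iteration."""
        pages = set(links) | {t for targets in links.values() for t in targets}
        auth = {p: 1.0 for p in pages}
        hub = {p: 1.0 for p in pages}
        for _ in range(iterations):
            # A good authority is pointed to by good hubs
            auth = {p: sum(hub[q] for q in links if p in links[q]) for p in pages}
            # A good hub points to good authorities
            hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
            # Normalize so the scores stay bounded
            norm_a = sum(auth.values()) or 1.0
            norm_h = sum(hub.values()) or 1.0
            auth = {p: v / norm_a for p, v in auth.items()}
            hub = {p: v / norm_h for p, v in hub.items()}
        return auth, hub

    authority, hubs = hits({"a": ["b", "c"], "b": ["c"], "c": ["a"]})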

19 Rethinking Ranked Retrieval
We ask “is this document relevant?”
– Vector space: we answer “somewhat”
– Probabilistic: we answer “probably”
The key is to know what “probably” means
– First, we’ll formalize that notion
– Then we’ll apply it to ranking

20 Retrieval w/ Language Models
Build a model M_D for every document D
Rank document D based on P(M_D | q)
Expand using Bayes’ Theorem: P(M_D | q) = P(q | M_D) P(M_D) / P(q)
– P(q) is the same for all documents, so it doesn’t change ranks
– P(M_D) [the prior] comes from the PageRank score

21 “Noisy-Channel” Model of IR
The user has an information need, “thinks” of a relevant document in the collection (d_1, d_2, …, d_n), and writes down some query
Information retrieval: given the query, guess the document it came from

22 Where do the probabilities fit?
(figure: query formulation and document processing produce a query representation and a document representation; a comparison function yields the retrieval status value, interpreted here as P(d is Rel | q), which the user judges for utility)

23 Probability Ranking Principle
Assume binary relevance, document independence
– Each document is either relevant or it is not
– Relevance of one doc reveals nothing about another
Assume the searcher works down a ranked list
– Seeking some number of relevant documents
Documents should be ranked in order of decreasing probability of relevance to the query, P(d relevant-to q)

24 Probabilistic Inference
Suppose there’s a horrible, but very rare disease
But there’s a very accurate test for it
Unfortunately, you tested positive…
– The probability that you contracted it is 0.01%
– The test is 99% accurate
Should you panic?

25 Bayes’ Theorem
You want to find the posterior probability P(“have disease” | “test positive”)
But you only know
– How rare the disease is (the prior probability)
– How accurate the test is
Use Bayes’ Theorem (hence Bayesian Inference):
P(“have disease” | “test positive”) = P(“test positive” | “have disease”) × P(“have disease”) / P(“test positive”)

26 Applying Bayes’ Theorem
Two cases:
1. You have the disease, and you tested positive
2. You don’t have the disease, but you tested positive (error)
Case 1: (0.0001)(0.99) = 0.000099
Case 2: (0.9999)(0.01) = 0.009999
Case 1 + Case 2 = 0.010098
P(“have disease” | “test positive”) = (0.99)(0.0001) / 0.010098 = 0.009804 < 1%
Don’t worry!
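
The same arithmetic written out as a quick check (all values taken from the slide):

    p_disease = 0.0001          # prior: 0.01% of people have the disease
    p_pos_if_sick = 0.99        # the test is 99% accurate
    p_pos_if_healthy = 0.01     # 1% false-positive rate

    # Total probability of testing positive (case 1 + case 2)
    p_positive = p_disease * p_pos_if_sick + (1 - p_disease) * p_pos_if_healthy

    # Bayes' Theorem: posterior probability of having the disease
    posterior = p_disease * p_pos_if_sick / p_positive
    print(round(p_positive, 6), round(posterior, 6))  # 0.010098 0.009804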

27 Another View
In a population of one million people:
– 100 are infected; 999,900 are not
– Of the infected, 99 test positive and 1 tests negative
– Of the uninfected, 9,999 test positive and 989,901 test negative
So 10,098 will test positive… and of those, only 99 really have the disease!

28 Language Models
Probability distribution over strings of text
– How likely is a string in a given “language”?
– Probabilities depend on what language we’re modeling
p1 = P(“a quick brown dog”)
p2 = P(“dog quick a brown”)
p3 = P(“быстрая brown dog”)
p4 = P(“быстрая собака”)
In a language model for English: p1 > p2 > p3 > p4
In a language model for Russian: p1 < p2 < p3 < p4

29 Unigram Language Model
Assume each word is generated independently
– Obviously, this is not true…
– But it seems to work well in practice!
The probability of a string, given a model, decomposes into a product of the probabilities of the individual words:
P(w1 w2 … wn | M) = P(w1 | M) × P(w2 | M) × … × P(wn | M)

30 A Physical Metaphor
Colored balls are randomly drawn from an urn (with replacement); words are drawn from the model M the same way
(figure: for one four-ball sequence, P = (4/9) × (2/9) × (4/9) × (3/9))

31 An Example
Model M: P(the) = 0.2, P(a) = 0.1, P(man) = 0.01, P(woman) = 0.01, P(said) = 0.03, P(likes) = 0.02, …
P(“the man likes the woman” | M)
= P(the|M) × P(man|M) × P(likes|M) × P(the|M) × P(woman|M)
= 0.2 × 0.01 × 0.02 × 0.2 × 0.01
= 0.00000008

32 Comparing Language Models
Model M1: P(the) = 0.2, P(yon) = 0.0001, P(class) = 0.01, P(maiden) = 0.0005, P(sayst) = 0.0003, P(pleaseth) = 0.0001, …
Model M2: P(the) = 0.2, P(yon) = 0.1, P(class) = 0.001, P(maiden) = 0.01, P(sayst) = 0.03, P(pleaseth) = 0.02, …
Scoring a string s built from these words (maiden, class, pleaseth, yon, the) under each model gives P(s|M2) > P(s|M1)
What exactly does this mean?
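
A tiny sketch that scores these words under both models just by multiplying unigram probabilities (word order does not matter in a unigram model); the values come from the tables above:

    m1 = {"the": 0.2, "yon": 0.0001, "class": 0.01,
          "maiden": 0.0005, "sayst": 0.0003, "pleaseth": 0.0001}
    m2 = {"the": 0.2, "yon": 0.1, "class": 0.001,
          "maiden": 0.01, "sayst": 0.03, "pleaseth": 0.02}

    def score(words, model):
        """Unigram probability of the word sequence under the model."""
        p = 1.0
        for w in words:
            p *= model[w]
        return p

    s = ["maiden", "class", "pleaseth", "yon", "the"]
    print(score(s, m1), score(s, m2))  # P(s|M2) > P(s|M1)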

33 Retrieval w/ Language Models
Build a model M_D for every document D
Rank document D based on P(M_D | q)
Expand using Bayes’ Theorem: P(M_D | q) = P(q | M_D) P(M_D) / P(q)
– P(q) is the same for all documents, so it doesn’t change ranks
– P(M_D) [the prior] comes from the PageRank score

34 Visually…
(figure: the query is posed to each document model in turn: “what’s the probability that you generated this query?”)
Ranking by P(M_D | q) is the same as ranking by P(q | M_D) (assuming a uniform prior over documents)

35 Ranking Models?
(figure: each model M_i, a model of document i, is asked “what’s the probability that you generated this query?”)
Ranking by P(q | M_D) is the same as ranking the documents themselves

36 Building Document Models
How do we build a language model for a document?
Physical metaphor: what’s in the urn? What colored balls, and how many of each?

37 A First Try
Simply count the frequencies in the document: the maximum likelihood estimate
P(w | M_S) = #(w, S) / |S|
– #(w, S) = number of times w occurs in the sequence S
– |S| = length of S
(figure: in the example sequence, one ball color gets P = 1/2 and another P = 1/4)
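
The maximum likelihood estimate in code, counting relative frequencies in a toy document:

    from collections import Counter

    def mle_model(text):
        """P(w|M_S) = #(w,S) / |S|: the relative frequency of each word in S."""
        words = text.lower().split()
        counts = Counter(words)
        total = len(words)
        return {w: c / total for w, c in counts.items()}

    model = mle_model("the quick brown dog chased the lazy dog")
    print(model["dog"], model["the"])  # 0.25 0.25 (each occurs 2 times out of 8)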

38 Zero-Frequency Problem
Suppose some event (word) is not in our observation S
– The model will assign zero probability to that event
– Then any string containing that word also gets probability zero, e.g., (1/2) × (1/4) × 0 × (1/4) = 0

39 Smoothing
The solution: “smooth” the word probabilities
(figure: the maximum likelihood estimate P(w) over words w versus a smoothed probability distribution)

40 Implementing Smoothing
Assign some small probability to unseen events
– But remember to take away “probability mass” from other events
Some techniques are easily understood
– Add one to all the frequencies (including zero)
More sophisticated methods improve ranking
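
One of the easily understood techniques, add-one (Laplace) smoothing, as a sketch; dividing by |S| + |V| is what takes probability mass away from seen words so the unseen ones get a little:

    from collections import Counter

    def add_one_model(text, vocabulary):
        """P(w|M) = (#(w,S) + 1) / (|S| + |V|): every vocabulary word,
        seen or unseen, gets a nonzero probability."""
        words = text.lower().split()
        counts = Counter(words)
        total, v = len(words), len(vocabulary)
        return {w: (counts[w] + 1) / (total + v) for w in vocabulary}

    vocab = {"the", "quick", "brown", "dog", "cat"}
    model = add_one_model("the quick brown dog", vocab)
    print(model["cat"])  # unseen, but no longer zero: 1/9
    print(model["the"])  # 2/9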

41 Recap: LM for IR
Indexing time:
– Build a language model for every document
Query-time ranking:
– Estimate the probability of generating the query according to each model
– Rank the documents according to these probabilities
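
Putting the recap together, a tiny end-to-end sketch: build a model per document at indexing time, then rank by the probability each model assigns to the query. The three documents and the query are invented for illustration, and Jelinek-Mercer interpolation with a collection model is used as the smoothing method, which the slides do not prescribe:

    from collections import Counter

    docs = {
        "d1": "the quick brown dog barks at the cat",
        "d2": "stock markets rose on strong earnings",
        "d3": "the lazy cat sleeps all day",
    }

    # Indexing time: per-document counts plus a collection-wide model
    doc_counts = {d: Counter(t.lower().split()) for d, t in docs.items()}
    doc_lens = {d: sum(c.values()) for d, c in doc_counts.items()}
    coll_counts = sum(doc_counts.values(), Counter())
    coll_len = sum(coll_counts.values())

    def query_likelihood(query, d, lam=0.5):
        """P(q | M_d), smoothing each term with the collection model."""
        p = 1.0
        for w in query.lower().split():
            p_doc = doc_counts[d][w] / doc_lens[d]
            p_coll = coll_counts[w] / coll_len
            p *= lam * p_doc + (1 - lam) * p_coll
        return p

    # Query time: rank documents by the probability of generating the query
    ranking = sorted(docs, key=lambda d: query_likelihood("lazy cat", d), reverse=True)
    print(ranking)  # ['d3', 'd1', 'd2']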

42 Key Ideas
Probabilistic methods formalize assumptions
– Binary relevance
– Document independence
– Term independence
– Uniform priors
– Top-down scan
Natural framework for combining evidence
– e.g., non-uniform priors

43 A Critique
Most of the assumptions are not satisfied!
– Searchers want utility, not relevance
– Relevance is not binary
– Terms are clearly not independent
– Documents are often not independent
Smoothing techniques are somewhat ad hoc

44 But It Works!
Ranked retrieval paradigm is powerful
– Well suited to human search strategies
Probability theory has explanatory power
– At least we know where the weak spots are
– Probabilities are good for combining evidence
Good implementations exist (e.g., Lemur)
– Effective, efficient, and large-scale

45 Comparison With Vector Space
Similar in some ways
– Term weights based on frequency
– Terms often used as if they were independent
Different in others
– Based on probability rather than similarity
– Intuitions are probabilistic rather than geometric

46 Conjunctive Queries
Perform an initial Boolean query
– Balancing breadth with understandability
Rerank the results
– Using either Okapi or a language model
– Possibly also accounting for proximity, links, …
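
A sketch of the two-stage idea: a conjunctive (Boolean AND) filter selects candidates, then a separate scorer reranks them; the scorer below is only a stand-in for Okapi BM25 or a language model, and the documents are invented:

    docs = {
        "d1": "web search engines crawl the web",
        "d2": "crawling the deep web is hard",
        "d3": "search advertising and spam",
    }

    def conjunctive_filter(query_terms, docs):
        """Keep only documents containing every query term."""
        return {d: text for d, text in docs.items()
                if all(t in text.lower().split() for t in query_terms)}

    def rerank(candidates, score):
        """Order the Boolean candidates with a separate relevance scorer."""
        return sorted(candidates, key=lambda d: score(candidates[d]), reverse=True)

    candidates = conjunctive_filter(["web", "search"], docs)  # only d1 survives
    print(rerank(candidates, score=lambda text: len(text.split())))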

47 Blogs Doubling
18.9 Million Weblogs Tracked
Doubling in size approx. every 5 months
Consistent doubling over the last 36 months

48 (chart legend) Blue = Mainstream Media, Red = Blog

49

50 Daily Posting Volume
1.2 Million legitimate posts/day
Spam posts marked in red
– On average, an additional 5.8% are spam posts
– Some spam spikes as high as 18%
(chart annotations: Kryptonite Lock Controversy, US Election Day, Indian Ocean Tsunami, Superbowl, Schiavo Dies, Newsweek Koran, Deepthroat Revealed, Justice O’Connor, Live 8 Concerts, London Bombings, Katrina)

51 The “Deep Web”
Dynamic pages, generated from databases
Not easily discovered using crawling
Perhaps 400-500 times larger than surface Web
Fastest growing source of new information

52 Deep Web: 60 Deep Sites Exceed Surface Web by 40 Times
Name | Type | URL | Web Size (GBs)
National Climatic Data Center (NOAA) | Public | http://www.ncdc.noaa.gov/ol/satellite/satelliteresources.html | 366,000
NASA EOSDIS | Public | http://harp.gsfc.nasa.gov/~imswww/pub/imswelcome/plain.html | 219,600
National Oceanographic (combined with Geophysical) Data Center (NOAA) | Public/Fee | http://www.nodc.noaa.gov/, http://www.ngdc.noaa.gov/ | 32,940
Alexa | Public (partial) | http://www.alexa.com/ | 15,860
Right-to-Know Network (RTK Net) | Public | http://www.rtk.net/ | 14,640
MP3.com | Public | http://www.mp3.com/

53 Content of the Deep Web

54

55 Semantic Web
RDF provides the schema for interchange
Ontologies support automated inference
– Similar to thesauri supporting human reasoning
Ontology mapping permits distributed creation
– This is where the magic happens

