
1 CIS 430 November 6, 2008 Emily Pitler

2

3

4  Named Entities  1 or 2 words  Ambiguous meaning  Ambiguous intent

5

6 Mei and Church, WSDM 2008

7  Beitzel et al. SIGIR 2004  America Online query log, one week in December 2003  Popular queries: ◦ 1.7 words on average  Overall: ◦ 2.2 words on average

8  Lempel and Moran WWW 2003  AltaVista, summer 2001  7,175,151 queries  2,657,410 distinct queries  1,792,104 queries occurred only once (63.7%)  Most popular query: asked 31,546 times

9 Saraiva et al. SIGIR 2001

10 Lempel and Moran WWW 2003

11

12 American Airlines? Or Alcoholics Anonymous?

13  Clarity score ~ low ambiguity  Cronen-Townsend et al. SIGIR 2002  Compare a language model ◦ over the relevant documents for a query ◦ over all possible documents  The more different these two models are, the clearer the query is  “programming perl” vs. “the”

14  Query Language Model  Collection Language Model (unigram)
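The language-model formulas on this slide are only an image in the transcript. A standard unigram formulation consistent with the clarity-score setup above (the exact estimates in Cronen-Townsend et al. may differ in detail) is:

    P_{coll}(w) = c_C(w) / |C|
    P(w \mid Q) = \sum_{D \in R} P(w \mid D) \, P(D \mid Q)
    P(w \mid D) = \lambda P_{ml}(w \mid D) + (1 - \lambda) P_{coll}(w)

where C is the whole collection, R is the set of documents retrieved for Q, and \lambda is a smoothing weight.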

15  Relative entropy between the two distributions  Expected extra cost, in bits, of coding samples from the true distribution P using a code built for Q
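A minimal Python sketch of the clarity computation: the relative entropy, in bits, between a (smoothed) query language model and the collection model. The function names and the mixing weight lam are illustrative assumptions, not the exact procedure from Cronen-Townsend et al.:

    import math
    from collections import Counter

    def unigram_model(docs):
        # Maximum-likelihood unigram model over a list of tokenized documents.
        counts = Counter(w for doc in docs for w in doc)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def clarity(retrieved_docs, collection_docs, lam=0.6):
        # Query model: retrieved-set model smoothed with the collection model.
        # Clarity = KL(query model || collection model), measured in bits.
        p_coll = unigram_model(collection_docs)
        p_ret = unigram_model(retrieved_docs)
        score = 0.0
        for w, p in p_ret.items():
            p_q = lam * p + (1 - lam) * p_coll.get(w, 1e-12)
            score += p_q * math.log2(p_q / p_coll.get(w, 1e-12))
        return score

    # "programming perl" retrieves documents whose word distribution looks very
    # unlike the collection as a whole, so clarity is high; for "the" the two
    # models are nearly identical and clarity is near zero.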

16

17

18

19  Navigational ◦ greyhound bus ◦ compaq  Informational ◦ San Francisco ◦ normocytic anemia  Transactional ◦ britney spears lyrics ◦ download adobe reader  Broder SIGIR 2002

20

21

22  The more webpages that point to you, the more important you are  The more important the webpages that point to you, the more important you are  These intuitions led to PageRank  PageRank led to… Page et al. 1998

23 cnn.com nytimes.com washingtonpost.com mtv.com vh1.com

24  Assume our surfer is on a page  In the next time step she can: ◦ Follow a link on the current page, chosen uniformly at random, or ◦ Jump to some other page on the web, chosen uniformly at random  After a long time, what is the probability she is on a given page?

25 Pages that point to v spread their probability evenly over their outgoing links; v collects a share from each page that links to it

26

27  Could also “get bored” with probability d and jump somewhere else completely
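A minimal power-iteration sketch of the random surfer described on the last few slides. Following the slide's wording, d is the probability of getting bored and jumping to a uniformly random page (the classic presentation calls 1 - d the damping factor); the tiny link graph is made up for illustration:

    def pagerank(links, d=0.15, iterations=50):
        # links: page -> list of pages it links to.
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: d / n for p in pages}              # "get bored" jump
            for p, outgoing in links.items():
                if outgoing:
                    share = (1 - d) * rank[p] / len(outgoing)
                    for q in outgoing:                        # spread rank over out-links
                        new_rank[q] += share
                else:
                    for q in pages:                           # dangling page: jump anywhere
                        new_rank[q] += (1 - d) * rank[p] / n
            rank = new_rank
        return rank

    # Made-up example: cnn.com is linked by both news sites, so it ends up
    # with the highest rank.
    print(pagerank({
        "cnn.com": [],
        "nytimes.com": ["cnn.com", "washingtonpost.com"],
        "washingtonpost.com": ["cnn.com"],
    }))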

28

29  Google, obviously  Given objects and links between them, measures importance  Summarization (Erkan and Radev, 2004) ◦ Nodes = sentences, edges = thresholded cosine similarity  Research (Mimno and McCallum, 2007) ◦ Nodes = people, edges = citations  Facebook?

30

31  Words on the page  Title  Domain  Anchor text: what other sites say when they link to that page

32 Title: Ani Nenkova - Home  Domain: www.cis.upenn.edu

33  Ontology of webpages  Over 4 million webpages are categorized  Like WordNet for webpages  Search engines use this  Where is www.cis.upenn.edu?  Computers → Computer Science → Academic Departments → North America → United States → Pennsylvania

34  What OTHER webpages say about your webpage  Very good descriptions of what’s on a page  Example: a link to www.cis.upenn.edu/~nenkova with the link text “Ani Nenkova” makes “Ani Nenkova” anchor text for that page

35

36  10,000 documents  10 of them are relevant  What happens if you decide to return absolutely nothing?  9,990 of the 10,000 decisions are correct: 99.9% accuracy

37  Standard metrics in Information Retrieval  Precision: of what you return, how many are relevant?  Recall: of what is relevant, how many do you return?
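A small sketch of the two definitions, on made-up document IDs:

    def precision_recall(returned, relevant):
        # Precision: of what you return, how many are relevant?
        # Recall: of what is relevant, how many do you return?
        returned, relevant = set(returned), set(relevant)
        hits = returned & relevant
        precision = len(hits) / len(returned) if returned else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0
        return precision, recall

    # Return 4 documents, 3 of which are among the 10 relevant ones:
    # precision = 3/4 = 0.75, recall = 3/10 = 0.3.
    print(precision_recall(["d1", "d2", "d3", "d99"],
                           [f"d{i}" for i in range(1, 11)]))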

38  Not always a clear-cut binary classification: relevant vs. not relevant  How do you measure recall over the whole web?  How many of the 2.7 billion results will get looked at? Which ones actually need to be good?

39  Very relevant > Somewhat relevant > Not relevant  Want the most relevant documents to be ranked first  NDCG = DCG / DCG of the ideal ordering  Ranges from 0 to 1

40  Proposed ordering (relevance grades 4, 2, 0, 1):  DCG = 4 + 2/log2(2) + 0/log2(3) + 1/log2(4) ◦ = 6.5  Ideal ordering (4, 2, 1, 0):  IDCG = 4 + 2/log2(2) + 1/log2(3) + 0/log2(4) ◦ = 6.63  NDCG = 6.5/6.63 = 0.98
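A sketch of the computation above; the relevance grades 4, 2, 0, 1 for the proposed ordering are read off from the DCG terms, and the discount (none at rank 1, then log base 2 of the rank) matches the arithmetic on the slide:

    import math

    def dcg(grades):
        # No discount at rank 1, then divide by log2(rank), as in the slide's arithmetic.
        return grades[0] + sum(g / math.log2(i + 1)
                               for i, g in enumerate(grades[1:], start=1))

    proposed = [4, 2, 0, 1]                   # graded relevance of the proposed ordering
    ideal = sorted(proposed, reverse=True)    # [4, 2, 1, 0]
    print(dcg(proposed), dcg(ideal), dcg(proposed) / dcg(ideal))   # 6.5, 6.63..., 0.98...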

41

42  Documents: hundreds of words  Queries: 1 or 2, often ambiguous, words  It would be much easier to compare documents to other documents  How can we turn a query into a document?  Just find ONE relevant document, then use that to find more

43  New Query = Original Query  + Terms from Relevant Docs  - Terms from Irrelevant Docs  Original query = “train”  Relevant ◦ www.dog-obedience-training-review.com  Irrelevant ◦ http://en.wikipedia.org/wiki/Caboose  New query = train + 0.3*dog - 0.2*railroad
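A minimal sketch of this update on term-frequency vectors (it is the Rocchio formulation); the helper name, the weights 0.3 and 0.2, and the tiny token lists are illustrative assumptions:

    from collections import Counter

    def expand_query(query_terms, relevant_docs, irrelevant_docs, beta=0.3, gamma=0.2):
        # New query vector = original + beta * (mean of relevant docs)
        #                             - gamma * (mean of irrelevant docs)
        new_q = Counter({t: 1.0 for t in query_terms})
        for docs, weight in ((relevant_docs, beta), (irrelevant_docs, -gamma)):
            for doc in docs:
                for term, count in Counter(doc).items():
                    new_q[term] += weight * count / len(docs)
        return {t: w for t, w in new_q.items() if w > 0}    # keep positive weights only

    # Toy example echoing the slide: "train" plus a dog-training page,
    # minus a railroad page.
    print(expand_query(["train"],
                       relevant_docs=[["dog", "obedience", "train"]],
                       irrelevant_docs=[["caboose", "railroad", "train"]]))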

44  Explicit feedback ◦ Ask the user to mark results as relevant versus irrelevant ◦ Or grade them on a scale (like we saw for NDCG)  Implicit feedback ◦ Users see the list of top 10 results and click on a few ◦ Assume the clicked pages were relevant and the rest weren’t  Pseudo-relevance feedback ◦ Do the search, assume the top results are relevant, repeat

45  Have query logs for millions of users  “hybrid car” -> “toyota prius” is more likely than “hybrid car” -> “flights to LA”  Find statistically significant pairs of queries (Jones et al. WWW 2006) using:
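The statistic on the slide is not captured in the transcript; one standard choice for testing whether a query pair co-occurs more often than chance is the log-likelihood ratio over a 2x2 contingency table of session counts, sketched here on made-up numbers (Jones et al. may use a different test):

    import math

    def llr(k11, k12, k21, k22):
        # Log-likelihood ratio for a 2x2 contingency table of query-pair counts:
        # k11 = sessions with q1 followed by q2, k12 = q1 followed by anything else,
        # k21 = q2 preceded by anything else, k22 = everything else.
        def h(*counts):
            total = sum(counts)
            return sum(k * math.log(k / total) for k in counts if k > 0)
        return 2 * (h(k11, k12, k21, k22)
                    - h(k11 + k12, k21 + k22)
                    - h(k11 + k21, k12 + k22))

    # Made-up counts: "hybrid car" -> "toyota prius" vs. "hybrid car" -> "flights to LA".
    print(llr(150, 850, 2000, 997000))   # large value: significant pair
    print(llr(2, 998, 5000, 994000))     # small value: consistent with chance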

46  Make a bipartite graph of queries and URLs  Cluster it (Beeferman and Berger, KDD 2000)
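A sketch of the bipartite-graph idea. Beeferman and Berger use agglomerative clustering, so the much simpler criterion here (group queries that share any clicked URL, i.e. connected components on the query side) and the toy click log are assumptions for illustration:

    from collections import defaultdict

    def cluster_queries(clicks):
        # clicks: list of (query, clicked_url) pairs from a log.
        url_to_queries = defaultdict(set)
        for query, url in clicks:
            url_to_queries[url].add(query)
        parent = {}
        def find(q):                         # union-find with path compression
            parent.setdefault(q, q)
            while parent[q] != q:
                parent[q] = parent[parent[q]]
                q = parent[q]
            return q
        for queries in url_to_queries.values():
            queries = list(queries)
            for q in queries[1:]:            # queries sharing a URL go in one cluster
                parent[find(q)] = find(queries[0])
        clusters = defaultdict(set)
        for query, _ in clicks:
            clusters[find(query)].add(query)
        return list(clusters.values())

    # Made-up click log: the two car queries share a URL and end up clustered together.
    print(cluster_queries([("hybrid car", "toyota.com"), ("toyota prius", "toyota.com"),
                           ("flights to LA", "united.com")]))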

47  Suggest queries in the same cluster

48

49  A lot of ambiguity is removed by knowing who the searcher is  There are lots of Fernando Pereiras ◦ I (Emily Pitler) only know one of them  Location matters ◦ “Thai restaurants” from me means “Thai restaurants Philadelphia, PA”

50  Mei and Church, WSDM 2008  H(URL | Q) = H(URL, Q) - H(Q) = 23.88 - 21.14 = 2.74 bits  H(URL | Q, IP) = H(URL, Q, IP) - H(Q, IP) = 27.17 - 26.00 = 1.17 bits
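A small sketch of how such conditional entropies are estimated from a log: H(URL | Q) = H(URL, Q) - H(Q), computed in bits from co-occurrence counts. The toy log below is made up, not Mei and Church's data:

    import math
    from collections import Counter

    def entropy(counts):
        # Shannon entropy in bits of a distribution given by raw counts.
        total = sum(counts.values())
        return -sum(c / total * math.log2(c / total) for c in counts.values() if c > 0)

    def conditional_entropy(pairs):
        # H(X | Y) = H(X, Y) - H(Y), estimated from observed (x, y) pairs.
        joint = Counter(pairs)
        marginal_y = Counter(y for _, y in pairs)
        return entropy(joint) - entropy(marginal_y)

    # Toy log of (clicked URL, query) pairs: knowing the query leaves little
    # uncertainty about the URL, so H(URL | Q) comes out small.
    log = [("cnn.com", "news")] * 9 + [("nytimes.com", "news")] + [("mtv.com", "music")] * 5
    print(conditional_entropy(log))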

51

52  Powerset is trying to apply NLP to Wikipedia

53  Descriptive searches: “pictures of mountains” ◦ I don’t want a document that merely contains the words ◦ {“picture”, “of”, “mountains”}  Link farms: trying to game PageRank  Spelling correction: a huge portion of queries are misspelled  Ambiguity

54  Text normalization, documents as vectors, document similarity, log likelihood ratio, relative entropy, precision and recall, tf-idf, machine learning…  Choosing relevant documents/content  Snippets = short summaries

