
1 CIS 430 November 6, 2008 Emily Pitler

2

3

4  Named Entities  1 or 2 words  Ambiguous meaning  Ambiguous intent

5

6 Mei and Church, WSDM 2008

7  Beitzel et al. SIGIR 2004  America Online query log, one week in December 2003  Popular queries: ◦ 1.7 words on average  Overall: ◦ 2.2 words on average

8  Lempel and Moran WWW 2003  AltaVista, summer 2001  7,175,151 queries  2,657,410 distinct queries  1,792,104 queries occurred only once (63.7%)  Most popular query: asked 31,546 times

9 Saraiva et al. SIGIR 2001

10 Lempel and Moran WWW 2003

11

12 American Airlines? Or Alcoholics Anonymous?

13  Clarity score ~ low ambiguity  Cronen-Townsend et al. SIGIR 2002  Compare a language model ◦ over the relevant documents for a query ◦ over all possible documents  The more different these two models are, the clearer the query is  “programming perl” vs. “the”

14  Query Language Model  Collection Language Model (unigram)
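The language-model formulas on this slide are only an image in the transcript. A standard unigram formulation consistent with the clarity-score setup above (the exact estimates in Cronen-Townsend et al. may differ in detail) is:

    P_{coll}(w) = c_C(w) / |C|
    P(w \mid Q) = \sum_{D \in R} P(w \mid D) \, P(D \mid Q)
    P(w \mid D) = \lambda P_{ml}(w \mid D) + (1 - \lambda) P_{coll}(w)

where C is the whole collection, R is the set of documents retrieved for Q, and \lambda is a smoothing weight.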

15  Relative entropy between the two distributions  Expected extra cost, in bits, of coding samples from the true distribution P using a code built for Q
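A minimal Python sketch of the clarity computation: the relative entropy, in bits, between a (smoothed) query language model and the collection model. The function names and the mixing weight lam are illustrative assumptions, not the exact procedure from Cronen-Townsend et al.:

    import math
    from collections import Counter

    def unigram_model(docs):
        # Maximum-likelihood unigram model over a list of tokenized documents.
        counts = Counter(w for doc in docs for w in doc)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def clarity(retrieved_docs, collection_docs, lam=0.6):
        # Query model: retrieved-set model smoothed with the collection model.
        # Clarity = KL(query model || collection model), measured in bits.
        p_coll = unigram_model(collection_docs)
        p_ret = unigram_model(retrieved_docs)
        score = 0.0
        for w, p in p_ret.items():
            p_q = lam * p + (1 - lam) * p_coll.get(w, 1e-12)
            score += p_q * math.log2(p_q / p_coll.get(w, 1e-12))
        return score

    # "programming perl" retrieves documents whose word distribution looks very
    # unlike the collection as a whole, so clarity is high; for "the" the two
    # models are nearly identical and clarity is near zero.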

16

17

18

19  Navigational ◦ greyhound bus ◦ compaq  Informational ◦ San Francisco ◦ normocytic anemia  Transactional ◦ britney spears lyrics ◦ download adobe reader  Broder SIGIR 2002

20

21

22  The more webpages that point to you, the more important you are  The more important the webpages that point to you, the more important you are  These intuitions led to PageRank  PageRank led to… Page et al. 1998

23 cnn.com nytimes.com washingtonpost.com mtv.com vh1.com

24  Assume our surfer is on a page  In the next time step she can: ◦ Follow a link on the current page, chosen uniformly at random, or ◦ Jump to some other page on the web, chosen uniformly at random  After a long time, what is the probability she is on a given page?

25 Pages that point to v spread their probability evenly over their outgoing links; v collects a share from each page that links to it

26

27  Could also “get bored” with probability d and jump somewhere else completely
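A minimal power-iteration sketch of the random surfer described on the last few slides. Following the slide's wording, d is the probability of getting bored and jumping to a uniformly random page (the classic presentation calls 1 - d the damping factor); the tiny link graph is made up for illustration:

    def pagerank(links, d=0.15, iterations=50):
        # links: page -> list of pages it links to.
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: d / n for p in pages}              # "get bored" jump
            for p, outgoing in links.items():
                if outgoing:
                    share = (1 - d) * rank[p] / len(outgoing)
                    for q in outgoing:                        # spread rank over out-links
                        new_rank[q] += share
                else:
                    for q in pages:                           # dangling page: jump anywhere
                        new_rank[q] += (1 - d) * rank[p] / n
            rank = new_rank
        return rank

    # Made-up example: cnn.com is linked by both news sites, so it ends up
    # with the highest rank.
    print(pagerank({
        "cnn.com": [],
        "nytimes.com": ["cnn.com", "washingtonpost.com"],
        "washingtonpost.com": ["cnn.com"],
    }))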

28

29  Google, obviously  Given objects and links between them, measures importance  Summarization (Erkan and Radev, 2004) ◦ Nodes = sentences, edges = thresholded cosine similarity  Research (Mimno and McCallum, 2007) ◦ Nodes = people, edges = citations  Facebook?

30

31  Words on the page  Title  Domain  Anchor text: what other sites say when they link to that page

32 Title: Ani Nenkova - Home  Domain: www.cis.upenn.edu

33  Ontology of webpages  Over 4 million webpages are categorized  Like WordNet for webpages  Search engines use this  Where is www.cis.upenn.edu?  Computers → Computer Science → Academic Departments → North America → United States → Pennsylvania

34  What OTHER webpages say about your webpage  Very good descriptions of what’s on a page  Example: a link to www.cis.upenn.edu/~nenkova with the link text “Ani Nenkova” makes “Ani Nenkova” anchor text for that page

35

36  10,000 documents  10 of them are relevant  What happens if you decide to return absolutely nothing?  9,990 of the 10,000 decisions are correct: 99.9% accuracy

37  Standard metrics in Information Retrieval  Precision: of what you return, how many are relevant?  Recall: of what is relevant, how many do you return?
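A small sketch of the two definitions, on made-up document IDs:

    def precision_recall(returned, relevant):
        # Precision: of what you return, how many are relevant?
        # Recall: of what is relevant, how many do you return?
        returned, relevant = set(returned), set(relevant)
        hits = returned & relevant
        precision = len(hits) / len(returned) if returned else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0
        return precision, recall

    # Return 4 documents, 3 of which are among the 10 relevant ones:
    # precision = 3/4 = 0.75, recall = 3/10 = 0.3.
    print(precision_recall(["d1", "d2", "d3", "d99"],
                           [f"d{i}" for i in range(1, 11)]))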

38  Not always a clear-cut binary classification: relevant vs. not relevant  How do you measure recall over the whole web?  How many of the 2.7 billion results will get looked at? Which ones actually need to be good?

39  Very relevant > Somewhat relevant > Not relevant  Want the most relevant documents to be ranked first  NDCG = DCG / DCG of the ideal ordering  Ranges from 0 to 1

40  Proposed ordering (relevance grades 4, 2, 0, 1):  DCG = 4 + 2/log2(2) + 0/log2(3) + 1/log2(4) ◦ = 6.5  Ideal ordering (4, 2, 1, 0):  IDCG = 4 + 2/log2(2) + 1/log2(3) + 0/log2(4) ◦ = 6.63  NDCG = 6.5/6.63 = 0.98
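A sketch of the computation above; the relevance grades 4, 2, 0, 1 for the proposed ordering are read off from the DCG terms, and the discount (none at rank 1, then log base 2 of the rank) matches the arithmetic on the slide:

    import math

    def dcg(grades):
        # No discount at rank 1, then divide by log2(rank), as in the slide's arithmetic.
        return grades[0] + sum(g / math.log2(i + 1)
                               for i, g in enumerate(grades[1:], start=1))

    proposed = [4, 2, 0, 1]                   # graded relevance of the proposed ordering
    ideal = sorted(proposed, reverse=True)    # [4, 2, 1, 0]
    print(dcg(proposed), dcg(ideal), dcg(proposed) / dcg(ideal))   # 6.5, 6.63..., 0.98...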

41

42  Documents: hundreds of words  Queries: 1 or 2, often ambiguous, words  It would be much easier to compare documents to other documents  How can we turn a query into a document?  Just find ONE relevant document, then use that to find more

43  New Query = Original Query  + Terms from Relevant Docs  - Terms from Irrelevant Docs  Original query = “train”  Relevant ◦ www.dog-obedience-training-review.com  Irrelevant ◦ http://en.wikipedia.org/wiki/Caboose  New query = train + 0.3*dog - 0.2*railroad
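A minimal sketch of this update on term-frequency vectors (it is the Rocchio formulation); the helper name, the weights 0.3 and 0.2, and the tiny token lists are illustrative assumptions:

    from collections import Counter

    def expand_query(query_terms, relevant_docs, irrelevant_docs, beta=0.3, gamma=0.2):
        # New query vector = original + beta * (mean of relevant docs)
        #                             - gamma * (mean of irrelevant docs)
        new_q = Counter({t: 1.0 for t in query_terms})
        for docs, weight in ((relevant_docs, beta), (irrelevant_docs, -gamma)):
            for doc in docs:
                for term, count in Counter(doc).items():
                    new_q[term] += weight * count / len(docs)
        return {t: w for t, w in new_q.items() if w > 0}    # keep positive weights only

    # Toy example echoing the slide: "train" plus a dog-training page,
    # minus a railroad page.
    print(expand_query(["train"],
                       relevant_docs=[["dog", "obedience", "train"]],
                       irrelevant_docs=[["caboose", "railroad", "train"]]))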

44  Explicit feedback ◦ Ask the user to mark results as relevant versus irrelevant ◦ Or grade them on a scale (like we saw for NDCG)  Implicit feedback ◦ Users see the list of top 10 results and click on a few ◦ Assume the clicked pages were relevant and the rest weren’t  Pseudo-relevance feedback ◦ Do the search, assume the top results are relevant, repeat

45  Have query logs for millions of users  “hybrid car” -> “toyota prius” is more likely than “hybrid car” -> “flights to LA”  Find statistically significant pairs of queries (Jones et al. WWW 2006) using:
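The statistic on the slide is not captured in the transcript; one standard choice for testing whether a query pair co-occurs more often than chance is the log-likelihood ratio over a 2x2 contingency table of session counts, sketched here on made-up numbers (Jones et al. may use a different test):

    import math

    def llr(k11, k12, k21, k22):
        # Log-likelihood ratio for a 2x2 contingency table of query-pair counts:
        # k11 = sessions with q1 followed by q2, k12 = q1 followed by anything else,
        # k21 = q2 preceded by anything else, k22 = everything else.
        def h(*counts):
            total = sum(counts)
            return sum(k * math.log(k / total) for k in counts if k > 0)
        return 2 * (h(k11, k12, k21, k22)
                    - h(k11 + k12, k21 + k22)
                    - h(k11 + k21, k12 + k22))

    # Made-up counts: "hybrid car" -> "toyota prius" vs. "hybrid car" -> "flights to LA".
    print(llr(150, 850, 2000, 997000))   # large value: significant pair
    print(llr(2, 998, 5000, 994000))     # small value: consistent with chance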

46  Make a bipartite graph of queries and URLs  Cluster it (Beeferman and Berger, KDD 2000)
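A sketch of the bipartite-graph idea. Beeferman and Berger use agglomerative clustering, so the much simpler criterion here (group queries that share any clicked URL, i.e. connected components on the query side) and the toy click log are assumptions for illustration:

    from collections import defaultdict

    def cluster_queries(clicks):
        # clicks: list of (query, clicked_url) pairs from a log.
        url_to_queries = defaultdict(set)
        for query, url in clicks:
            url_to_queries[url].add(query)
        parent = {}
        def find(q):                         # union-find with path compression
            parent.setdefault(q, q)
            while parent[q] != q:
                parent[q] = parent[parent[q]]
                q = parent[q]
            return q
        for queries in url_to_queries.values():
            queries = list(queries)
            for q in queries[1:]:            # queries sharing a URL go in one cluster
                parent[find(q)] = find(queries[0])
        clusters = defaultdict(set)
        for query, _ in clicks:
            clusters[find(query)].add(query)
        return list(clusters.values())

    # Made-up click log: the two car queries share a URL and end up clustered together.
    print(cluster_queries([("hybrid car", "toyota.com"), ("toyota prius", "toyota.com"),
                           ("flights to LA", "united.com")]))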

47  Suggest queries in the same cluster

48

49  A lot of ambiguity is removed by knowing who the searcher is  There are lots of Fernando Pereiras ◦ I (Emily Pitler) only know one of them  Location matters ◦ “Thai restaurants” from me means “Thai restaurants Philadelphia, PA”

50  Mei and Church, WSDM 2008  H(URL | Q) = H(URL, Q) - H(Q) = 23.88 - 21.14 = 2.74 bits  H(URL | Q, IP) = H(URL, Q, IP) - H(Q, IP) = 27.17 - 26.00 = 1.17 bits
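A small sketch of how such conditional entropies are estimated from a log: H(URL | Q) = H(URL, Q) - H(Q), computed in bits from co-occurrence counts. The toy log below is made up, not Mei and Church's data:

    import math
    from collections import Counter

    def entropy(counts):
        # Shannon entropy in bits of a distribution given by raw counts.
        total = sum(counts.values())
        return -sum(c / total * math.log2(c / total) for c in counts.values() if c > 0)

    def conditional_entropy(pairs):
        # H(X | Y) = H(X, Y) - H(Y), estimated from observed (x, y) pairs.
        joint = Counter(pairs)
        marginal_y = Counter(y for _, y in pairs)
        return entropy(joint) - entropy(marginal_y)

    # Toy log of (clicked URL, query) pairs: knowing the query leaves little
    # uncertainty about the URL, so H(URL | Q) comes out small.
    log = [("cnn.com", "news")] * 9 + [("nytimes.com", "news")] + [("mtv.com", "music")] * 5
    print(conditional_entropy(log))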

51

52  Powerset is trying to apply NLP to Wikipedia

53  Descriptive searches: “pictures of mountains” ◦ I don’t want a document that merely contains the words ◦ {“picture”, “of”, “mountains”}  Link farms: trying to game PageRank  Spelling correction: a huge portion of queries are misspelled  Ambiguity

54  Text normalization, documents as vectors, document similarity, log likelihood ratio, relative entropy, precision and recall, tf-idf, machine learning…  Choosing relevant documents/content  Snippets = short summaries

