
CIS 430 November 6, 2008 Emily Pitler


• Named Entities • 1 or 2 words • Ambiguous meaning • Ambiguous intent


[Figure: Mei and Church, WSDM 2008]

• Beitzel et al., SIGIR 2004 • America Online, one week in December 2003 • Popular queries: ◦ 1.7 words on average • Overall: ◦ 2.2 words on average

• Lempel and Moran, WWW 2003 • AltaVista, summer 2001 • 7,175,151 queries • 2,657,410 distinct queries • 1,792,104 queries occurred only once (63.7%) • Most popular query: asked 31,546 times

[Figure: Saraiva et al., SIGIR]

[Figure: Lempel and Moran, WWW 2003]

American Airlines? Or Alcoholics Anonymous?

• Clarity score ~ low ambiguity • Cronen-Townsend et al., SIGIR 2002 • Compare two language models: ◦ one over the relevant documents for a query ◦ one over all possible documents • The more different these are, the clearer the query • "programming perl" vs. "the"

• Query Language Model • Collection Language Model (unigram)

• Relative entropy between the two distributions • Cost in bits of coding with Q when the true distribution is P
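As a sketch in LaTeX (the notation is the usual one and is assumed here rather than copied from the slides: P is the query language model, Q the collection language model, and w ranges over vocabulary words), the relative entropy behind the clarity score is

\[
D(P \,\|\, Q) \;=\; \sum_{w} P(w \mid \text{query}) \,\log_2 \frac{P(w \mid \text{query})}{Q(w \mid \text{collection})}
\]

A focused query like "programming perl" concentrates P on a few words and drives this divergence up; a query like "the" looks almost exactly like the collection model, so the divergence, and with it the clarity, stays near zero.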


• Navigational ◦ greyhound bus ◦ compaq • Informational ◦ San Francisco ◦ normocytic anemia • Transactional ◦ britney spears lyrics ◦ download adobe reader (Broder, SIGIR)

• The more webpages point to you, the more important you are • The more important the webpages that point to you, the more important you are • These intuitions led to PageRank • PageRank led to… (Page et al.)

[Link graph example: cnn.com, nytimes.com, washingtonpost.com, mtv.com, vh1.com]

• Assume our surfer is on a page • In the next time step she can: ◦ choose a link on the current page uniformly at random, or ◦ go somewhere else on the web uniformly at random • After a long time, what is the probability she is on a given page?

Pages that point to v spread out their probability over their outgoing links


• Could also "get bored" with probability d and jump somewhere else completely
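Combining the two moves gives the familiar PageRank recurrence; a sketch in LaTeX, with symbols assumed rather than taken from the slides (N is the total number of pages, B_v the set of pages linking to v, N_u the number of outgoing links on page u, and d the "get bored" probability):

\[
PR(v) \;=\; \frac{d}{N} \;+\; (1 - d) \sum_{u \in B_v} \frac{PR(u)}{N_u}
\]

The first term is the random jump; the second is each in-neighbor u spreading its own probability evenly over its outgoing links.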


• Google, obviously • Given objects and links between them, measures importance • Summarization (Erkan and Radev, 2004) ◦ nodes = sentences, edges = thresholded cosine similarity • Research (Mimno and McCallum, 2007) ◦ nodes = people, edges = citations • Facebook?
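To make the "objects and links" view concrete, here is a minimal Python sketch of PageRank by power iteration. The toy graph, damping value, and iteration count are illustrative assumptions, not anything from the lecture:

```python
def pagerank(links, d=0.15, iterations=50):
    """links: dict mapping each node to the list of nodes it points to.
    d is the probability of jumping to a random node ("getting bored")."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}            # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {v: d / n for v in nodes}      # mass from random jumps
        for u in nodes:
            out = links[u]
            if not out:                           # dangling node: spread mass everywhere
                for v in nodes:
                    new_rank[v] += (1 - d) * rank[u] / n
            else:
                for v in out:                     # spread mass over outgoing links
                    new_rank[v] += (1 - d) * rank[u] / len(out)
        rank = new_rank
    return rank

# Hypothetical toy graph, loosely echoing the sites named on the earlier slide
toy = {
    "cnn.com": ["nytimes.com", "washingtonpost.com"],
    "nytimes.com": ["cnn.com"],
    "washingtonpost.com": ["cnn.com", "nytimes.com"],
    "mtv.com": ["vh1.com"],
    "vh1.com": ["mtv.com"],
}
print(pagerank(toy))
```

The same function works whether the nodes are webpages, sentences, or authors; only the links dictionary changes.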

• Words on the page • Title • Domain • Anchor text: what other sites say when they link to that page

Example result: Title: "Ani Nenkova - Home"; Domain: …

• Ontology of webpages • Over 4 million webpages are categorized • Like WordNet for webpages • Search engines use this • Where is …? Computers → Computer Science → Academic Departments → North America → United States → Pennsylvania

• What OTHER webpages say about your webpage • Very good descriptions of what's on a page • Link to …: "Ani Nenkova" is the anchor text for that page

• 10,000 documents • 10 of them are relevant • What happens if you decide to return absolutely nothing? • 99.9% accuracy
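The 99.9% figure comes from counting every document the system correctly leaves out; in LaTeX:

\[
\text{accuracy} \;=\; \frac{\text{correct decisions}}{\text{all decisions}} \;=\; \frac{10{,}000 - 10}{10{,}000} \;=\; 99.9\%
\]

Returning nothing is "right" on the 9,990 irrelevant documents and wrong only on the 10 relevant ones, which is why accuracy alone is a poor measure for retrieval.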

• Standard metrics in Information Retrieval • Precision: of what you return, how many are relevant? • Recall: of what is relevant, how many do you return?
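In LaTeX, the standard set definitions are:

\[
\text{Precision} \;=\; \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|},
\qquad
\text{Recall} \;=\; \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|}
\]

On the 10,000-document example above, returning nothing gives a recall of 0 (and leaves precision undefined), despite the 99.9% accuracy.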

• Not always a clear-cut binary classification: relevant vs. not relevant • How do you measure recall over the whole web? • How many of the 2.7 billion results will get looked at? Which ones actually need to be good?

• Very relevant > Somewhat relevant > Not relevant • Want the most relevant documents ranked first • NDCG = DCG / DCG of the ideal ordering • Ranges from 0 to 1

• Proposed ordering with graded relevances 4, 2, 0, 1: • DCG = 4 + 2/log(2) + 0/log(3) + 1/log(4) = 6.5 (logs base 2) • IDCG = 4 + 2/log(2) + 1/log(3) + 0/log(4) = 6.63 • NDCG = 6.5/6.63 ≈ 0.98
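A minimal Python sketch of this computation, using the same discounting convention as the slide (the first result keeps its full gain, later results are divided by log2 of their rank); the relevance grades are the ones from the example:

```python
import math

def dcg(relevances):
    # First result undiscounted; results at rank i >= 2 discounted by log2(i).
    total = float(relevances[0]) if relevances else 0.0
    for rank, rel in enumerate(relevances[1:], start=2):
        total += rel / math.log2(rank)
    return total

def ndcg(relevances):
    ideal = sorted(relevances, reverse=True)   # best possible ordering of the same grades
    return dcg(relevances) / dcg(ideal)

print(ndcg([4, 2, 0, 1]))   # 6.5 / 6.63, about 0.98
```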

• Documents: hundreds of words • Queries: 1 or 2 (often ambiguous) words • It would be much easier to compare documents with other documents • How can we turn a query into a document? • Just find ONE relevant document, then use that to find more

• New Query = Original Query + Terms from Relevant Docs - Terms from Irrelevant Docs • Original query = "train" • Relevant: ◦ … • Irrelevant: ◦ … • New query = train + .3*dog - .2*railroad
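This "add terms from relevant documents, subtract terms from irrelevant ones" recipe is the shape of the Rocchio update; a sketch in LaTeX, where the weights α, β, γ and the vector notation are the usual presentation rather than anything stated on the slide:

\[
\vec{q}_{\text{new}} \;=\; \alpha \,\vec{q}_{\text{orig}}
\;+\; \frac{\beta}{|D_{\text{rel}}|} \sum_{\vec{d} \in D_{\text{rel}}} \vec{d}
\;-\; \frac{\gamma}{|D_{\text{irrel}}|} \sum_{\vec{d} \in D_{\text{irrel}}} \vec{d}
\]

In the slide's example, the +.3 and -.2 weights play the roles of the relevant and irrelevant contributions for the terms "dog" and "railroad".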

• Explicit feedback ◦ ask the user to mark relevant versus irrelevant ◦ or grade on a scale (as we saw for NDCG) • Implicit feedback ◦ users see the list of top 10 results and click on a few ◦ assume the clicked pages were relevant and the rest were not • Pseudo-relevance feedback ◦ do a search, assume the top results are relevant, repeat
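A minimal Python sketch of the pseudo-relevance feedback loop; search and expand are hypothetical placeholders for whatever retrieval function and term-weighting scheme are actually used:

```python
def pseudo_relevance_feedback(query, search, expand, rounds=2, k=10):
    """search(query) -> ranked list of documents.
    expand(query, docs) -> new query built from salient terms in docs.
    Both are placeholders, not real APIs."""
    for _ in range(rounds):
        top_docs = search(query)[:k]       # assume the top k results are relevant
        query = expand(query, top_docs)    # fold their terms into the query
    return query
```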

• Have query logs for millions of users • "hybrid car" → "toyota prius" is more likely than "hybrid car" → "flights to LA" • Find statistically significant pairs of queries (Jones et al., WWW 2006)

• Make a bipartite graph of queries and URLs • Cluster it (Beeferman and Berger, KDD 2000)

• Suggest queries in the same cluster
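A minimal Python sketch of the click-graph idea; the log entries are made up, and the single-link grouping below only gestures at clustering (Beeferman and Berger's actual method is an iterative agglomerative procedure over this bipartite graph):

```python
from collections import defaultdict

# Hypothetical click log: (query, clicked URL) pairs.
clicks = [
    ("hybrid car", "toyota.com/prius"),
    ("toyota prius", "toyota.com/prius"),
    ("hybrid car", "honda.com/insight"),
    ("cheap flights", "expedia.com"),
    ("flights to LA", "expedia.com"),
]

# Bipartite graph: queries on one side, URLs on the other.
query_to_urls = defaultdict(set)
url_to_queries = defaultdict(set)
for query, url in clicks:
    query_to_urls[query].add(url)
    url_to_queries[url].add(query)

def related_queries(query):
    # Queries that share at least one clicked URL with the input query.
    related = set()
    for url in query_to_urls[query]:
        related |= url_to_queries[url]
    related.discard(query)
    return related

print(related_queries("hybrid car"))   # candidate suggestions from the same neighborhood
```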

• A lot of ambiguity is removed by knowing who the searcher is • There are lots of Fernando Pereiras ◦ I (Emily Pitler) only know one of them • Location matters ◦ "Thai restaurants" from me means "Thai restaurants Philadelphia, PA"

• Mei and Church, WSDM 2008 • H(URL | Q) = H(URL, Q) - H(Q) = 2.74 • H(URL | Q, IP) = H(URL, Q, IP) - H(Q, IP) = 1.77
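The identity being used here is the chain rule for entropy; in LaTeX, with general random variables X and Y:

\[
H(X \mid Y) \;=\; H(X, Y) - H(Y) \;=\; -\sum_{x, y} p(x, y) \,\log_2 p(x \mid y)
\]

Read this way, the numbers say that conditioning on who is searching (the IP address) as well as the query cuts the remaining uncertainty about the intended URL from 2.74 to 1.77 (bits, assuming base-2 logs as in the earlier "cost in bits" framing).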


• Powerset is trying to apply NLP to Wikipedia

• Descriptive searches: "pictures of mountains" ◦ I don't want a document that merely contains the words ◦ {"picture", "of", "mountains"} • Link farms: trying to game PageRank • Spelling correction: a huge portion of queries are misspelled • Ambiguity

• Text normalization, documents as vectors, document similarity, log-likelihood ratio, relative entropy, precision and recall, tf-idf, machine learning… • Choosing relevant documents/content • Snippets = short summaries