

Lecture 12 IR in Google Age

Traditional IR
Traditional IR examples
  – Searching a university library
  – Finding an article in a journal archive
  – Searching your own computer file space
      Spotlight in OS X
      Windows Desktop Search
      Lucene
  – In these cases, often an expert such as a librarian is used. (Hopefully, the expert in your own files is you.)

Traditional IR Models
3 basic search techniques for traditional IR
  – Boolean models
  – Vector models
  – Probabilistic models

Boolean
One of the earliest models
Variations still in use in many libraries
Boolean operators
  – AND, OR, NOT
  – Remember DeMorgan’s Theorem?
Operates by analyzing whether keywords are absent or present in a document
There are no partial matches
  – A document is either relevant or irrelevant
  – Fuzzy set techniques are used to attempt to lessen this black & whiteness
Has problems with synonymy & polysemy
  – Synonymy: cases of many words having the same meaning
  – Polysemy: cases of a single word meaning many things

Boolean (continued)
Synonymy examples
  – Something that is described as ‘academic’ might also be described as theoretical, scholarly, or pedantic
Polysemy examples
  – Hot
      Could mean high temperature
      Could mean spicy
      Could be an adjective for a person’s attractiveness

Boolean (continued)
On the upside
  – Relatively easy to create & program a Boolean engine
  – Fast; easy to process in parallel (e.g., scanning through multiple document keyword files at the same time)
  – Scales readily to large document collections (corpora)
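
To make the absent/present matching concrete, here is a minimal sketch of Boolean retrieval over a toy corpus. The document ids, keyword sets, and helper names (`matching`, `negate`) are illustrative assumptions, not part of the lecture. AND, OR, and NOT map onto set intersection, union, and complement, and the last line checks DeMorgan's theorem on this data.

```python
# Toy corpus (hypothetical): each document is reduced to its set of keywords.
DOCS = {
    "d1": {"information", "retrieval", "boolean"},
    "d2": {"vector", "retrieval", "model"},
    "d3": {"boolean", "logic", "model"},
}
ALL_DOCS = set(DOCS)


def matching(term):
    """Ids of documents in which the keyword is present (no partial matches)."""
    return {doc_id for doc_id, words in DOCS.items() if term in words}


def negate(doc_ids):
    """Boolean NOT: every document that is not in doc_ids."""
    return ALL_DOCS - doc_ids


# Query: boolean AND NOT logic
hits = matching("boolean") & negate(matching("logic"))

# DeMorgan's theorem: NOT (a OR b) equals (NOT a) AND (NOT b)
a, b = matching("boolean"), matching("vector")
demorgan_holds = negate(a | b) == (negate(a) & negate(b))
```

A document is either in the result set or not, so there is nothing to sort by; that is the "black & whiteness" noted earlier.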

Vector Space Model
Have already seen some of its features
Developed in early 60’s to address some of the shortcomings of the Boolean model
Advanced Vector Space Models such as LSI (Latent Semantic Indexing) can identify hidden semantic meaning
  – For example, an LSI search engine will also return documents containing “automobile” when the query term “car” is used
2 particular advantages to Vector Space Model
  – Relevance Scoring
  – Relevance Feedback

Vector Space Model (cont)
Relevance Scoring
  – VSM allows documents to partially match a query
  – This allows each document to be assigned a degree, or score, of relevancy, and the results can then be sorted by that score
Relevance Feedback
  – VSM permits ‘tuning’ of the query
      User can select a subset of the retrieved documents and resubmit them
      Query is then resubmitted with this additional information
      A revised, generally more useful, set of documents is retrieved

Vector Space Model
On the downside …
  – Drawback to Vector Space Model is computational expense
      Distance measures, aka similarity measures, between query & document must be computed for each document
      Big matrix computations
      Remember the length of a vector
      Vector length likely grows with collection growth because of more terms (& also more documents to search)
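
As an illustration of relevance scoring and of why the cost grows with the collection, the sketch below scores a query against every document using cosine similarity over raw term counts. Cosine similarity and raw counts are assumptions chosen for brevity (the slides only say "similarity measures"), and the sample documents and function names are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Represent text as a sparse term-count vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical mini collection.
docs = {
    "d1": "car engine repair",
    "d2": "car insurance quote car",
    "d3": "garden tools",
}
query = tf_vector("car repair")

# One similarity computation per document: this loop is the expense noted above.
scores = {doc_id: cosine(query, tf_vector(text)) for doc_id, text in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Unlike the Boolean case, `d2` still receives a non-zero score even though it matches only one query term, so the results can be ordered by degree of relevance.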

Probabilistic Models
Attempt to estimate the probability of a document’s relevancy to a particular user
Retrieved documents ranked by odds of relevance
  – Ratio of the probability that the document is relevant to the probability that the document is not relevant
After an initial ‘guess’ by the algorithm, the model operates recursively, seeking to improve the accuracy of the probabilities
Google’s Page Rank & Beyond; Langville, Meyer

Probabilistic Models
Upside
  – Can be tuned to researcher/user’s preferences
      Researcher can set or drive probabilities as they desire
  – Potentially offers strong tailorability
Downside
  – Difficult to build & program
  – Does not scale well; complexity grows quickly
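
The ranking rule for probabilistic models (odds of relevance) can be sketched in a few lines. The probability estimates below are invented placeholders standing in for the model's initial 'guess'; how a real system estimates and refines them is not covered here.

```python
def odds(p_relevant):
    """Odds of relevance: P(relevant) / P(not relevant)."""
    if not 0.0 <= p_relevant < 1.0:
        raise ValueError("probability must be in [0, 1)")
    return p_relevant / (1.0 - p_relevant)


# Hypothetical initial estimates of P(relevant) for three documents.
estimates = {"d1": 0.80, "d2": 0.50, "d3": 0.20}

# Rank documents by decreasing odds of relevance.
ranked = sorted(estimates, key=lambda d: odds(estimates[d]), reverse=True)
```

Because the odds grow with the probability, sorting by odds gives the same order as sorting by probability; the odds form is mainly convenient when the estimates are later updated.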

Web IR
Web is world’s largest & linked document collection (corpus)
Per Langville & Meyer, 4 particular characteristics of Web are:
  – Enormous
  – Dynamic
  – Self-organized
  – Hyperlinked

Web IR
Enormous
  – Speaks for itself
Dynamic
  – Virtually anyone can do almost anything on the web at any time
Self-organized
  – No top down governance or rules (or at least not much) on:
      Content
      Structure
      Format
Hyperlinked
  – Documents point to & reference each other in a robust, knowable way

Web IR
Web Search process components
  – Crawler/spider
      Software to collect the documents
  – Page Repository
      Complete web pages are temporarily stored in total
      Stored until indexing component parses needed data
      Frequently accessed pages might be stored indefinitely
  – Indexing component
      Strips out & stores needed data
        – In effect creating a compressed page
      Original page is tossed, unless frequently accessed

Web IR
Web Search process components (cont)
  – Indexes themselves
      Content indexes using Inverted File Structure
        – e.g., this word found in these documents
  – Query module
      Converts user’s natural language into a query
        – A Query object in Lucene’s case
      Runs this query against the indices from the document collection
      Returns relevant documents
        – A Hit object in Lucene’s case
      This set of relevant pages is passed to the Ranking module
  – Ranking module
      Combines content score for relevance and also popularity score
      Popularity score steps us into Link Analysis & Googleness
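
The index, query, and ranking components can be sketched end to end as follows. This is a toy under stated assumptions, not how Lucene or Google implement it: the pages, the popularity numbers, the fraction-of-terms content score, and the equal weighting are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical page contents and popularity scores (the latter would come
# from link analysis in a real engine).
PAGES = {
    "p1": "google search engine ranking",
    "p2": "search engine index",
    "p3": "cooking recipes",
}
POPULARITY = {"p1": 0.2, "p2": 0.7, "p3": 0.1}


def build_inverted_index(pages):
    """Inverted file structure: word -> set of pages containing that word."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def search(query, index, popularity, weight=0.5):
    """Query module plus ranking module in miniature."""
    terms = query.lower().split()
    matched_terms = defaultdict(int)
    for term in terms:
        for page_id in index.get(term, ()):
            matched_terms[page_id] += 1
    # Content score: fraction of query terms found on the page (an assumption).
    combined = {
        page_id: weight * (count / len(terms)) + (1 - weight) * popularity[page_id]
        for page_id, count in matched_terms.items()
    }
    return sorted(combined, key=combined.get, reverse=True)


index = build_inverted_index(PAGES)
results = search("search engine ranking", index, POPULARITY)
```

With these numbers `p2` outranks `p1` even though `p1` matches more query terms, which shows how a popularity score can change the order produced by content alone.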

Link Analysis
In 1998, intense link analysis research was being done by two different groups
  – Jon Kleinberg at IBM in Silicon Valley
  – Sergey Brin & Larry Page, two PhD students at Stanford
Kleinberg model called HITS
  – Hypertext Induced Topic Search
Brin/Page model called PageRank

Link Analysis
Brin & Page began developing a search business out of their dorm rooms
  – Took academic leave to pursue the commercial aspects of their company
Kleinberg remained in academia (Cornell) and did not pursue a company
Brin & Page are still on academic leave

Page Rank
Google’s Page Rank & Beyond; Langville, Meyer
\[ r(P_i) = \sum_{P_j \in B_{P_i}} \frac{r(P_j)}{|P_j|} \]
The PageRank of a particular page is the sum of the PageRanks of all pages pointing to that page, each divided by the number of outlinks from the pointing page.
r(P_i) is the PageRank of page P_i
B_{P_i} is the set of pages pointing into page P_i
|P_j| is the number of all outlinks from P_j
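
Since each PageRank depends on other PageRanks, the summation formula is usually evaluated by repeated substitution from an initial guess. The sketch below does that for a hypothetical four-link graph. It assumes every page has at least one outlink and every link target is itself a key of the graph; it includes no damping factor or dangling-page handling, neither of which appears in the formula above.

```python
def pagerank(outlinks, iterations=100):
    """Iterate r(P_i) = sum over P_j in B_{P_i} of r(P_j) / |P_j|."""
    pages = list(outlinks)
    # Initial guess: every page gets the same rank.
    rank = {p: 1.0 / len(pages) for p in pages}

    # B_{P_i}: the set of pages pointing into each page.
    inlinks = {p: [] for p in pages}
    for source, targets in outlinks.items():
        for target in targets:
            inlinks[target].append(source)

    for _ in range(iterations):
        rank = {
            p: sum(rank[q] / len(outlinks[q]) for q in inlinks[p])
            for p in pages
        }
    return rank


# Hypothetical link graph: A -> B, A -> C, B -> C, C -> A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(LINKS)
```

For this graph the iteration settles near 0.4 for A, 0.2 for B, and 0.4 for C: B receives only half of A's rank, while C collects the other half plus all of B's.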