INFO 320: Information Needs, Searching, and Presentation (aka… Search) Instructor: William Jones TA: Brennen Smith.


INFO 320: Information Needs, Searching, and Presentation (aka… Search)
Instructor: William Jones
TA: Brennen Smith
Lectures: Tuesdays & Thursdays, 1:30 – 3:20 pm, MGH 238
Labs: Wednesdays, 1:30 – 2:20 pm, MGH 030

Module 2: Basics of Search
Learn …
  - how to create a Lucene index.
  - how to inspect indexes using Solr.
  - more on PageRank.
  - about Zipf’s law.
  - considerations of and variations in word-breaking, stemming, support of the vector model.
  - Boolean search vs. the vector space model.

And??
  - B-trees?
  - The matrix algebra behind PageRank

This week (10/6/2013)
2.1 T
  - Review quiz; cool tool presentation & sign-up sheet
  - Basics of indexing
  - Team term project ideas
2.1 W
  - Build a test index
  - Inspect; build again?
2.1 Th
  - A little more on special-purpose search _apps_
  - Cool tool presentations
  - Elevator speeches & discussion
  - More on indexing & query processing

Next week (of 10/13): Module 2, Basics of Search
2.2 T
  - Steps in building an index
  - Steps in processing a query
2.2 W
  - Do your own crawl, build your own index
2.2 Th
  - Guest lecture on SEO
  - Cool tool presentations (4)
Also
  - Quiz (F); Lab report (S); Essay comments (S)

Announcements & discussion
  - See Canvas.
  - Also my visit yesterday to Highspot (Robert Wahbe, Highspot Inc.)
Enterprise search sucks
  - Succeeds 3 / 10 times vs. Web search success rate of 9 / 10 times.


Lucene et al.
Four fundamental concepts…
  1. An index contains a sequence of documents.
  2. A document is a sequence of fields.
  3. A field is a named sequence of terms.
  4. A term is a string (of text).
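The four concepts nest inside one another, and a few plain Python types may make that nesting easier to see. This is only an illustrative sketch: the class names `Field`, `Document` and `Index` and the sample values are hypothetical, and none of this is Lucene's actual API.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical types for illustration only; this is not Lucene's API.
Term = str  # 4. A term is a string (of text).


@dataclass
class Field:
    name: str          # 3. A field is a *named* sequence of terms.
    terms: List[Term]


@dataclass
class Document:
    fields: List[Field]  # 2. A document is a sequence of fields.


@dataclass
class Index:
    # 1. An index contains a sequence of documents.
    documents: List[Document] = field(default_factory=list)


doc = Document(fields=[
    Field(name="author", terms=["smith"]),
    Field(name="body", terms=["the", "quick", "brown", "fox"]),
])
index = Index(documents=[doc])

# Walk the nesting: index -> document -> field -> term.
for f in index.documents[0].fields:
    print(f.name, f.terms)
```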

Indexing, the basics: Some definitions
URI – Uniform Resource Identifier
  - A generalization of a URL
Information item
  - Any packaging of information – especially one to which we might want to return.
  - Files, documents, web pages, messages…
  - Addressable points of return (via URI).
An index term
  - A word, feature, attribute/value combination – anything we might recall and want to specify in a search query.
  - Author:Smith, “the”, “quantitative easing”…

Content is organized into information items
In a file system
  - We call the items “documents” or “files”
On the Web
  - We call the items “pages”

Think of a file system or the Web as a simple table where access is through the rows

URI | Content
~/document-1 | “The old dog barks backwards without getting up. I can remember when he was a pup.”*
~/document-2 | The quick brown fox jumps over the lazy dog
… | …

*Robert Frost

An index for search “inverts”

Term | URIs (for items)
dog | ~/document-1; ~/document-2
barks | ~/document-1
fox | ~/document-2
lazy | ~/document-2
… | …
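The inversion can be sketched in a few lines of Python. The two items are the ones from the file-system table; the function name `build_inverted_index` and the simple lowercase word-splitting rule are assumptions made for this sketch, not part of the lecture.

```python
import re
from collections import defaultdict

# The two example items: URI -> content.
items = {
    "~/document-1": "The old dog barks backwards without getting up. "
                    "I can remember when he was a pup.",
    "~/document-2": "The quick brown fox jumps over the lazy dog",
}


def build_inverted_index(items):
    """Map each lowercased word to the sorted list of URIs containing it."""
    index = defaultdict(set)
    for uri, content in items.items():
        for word in re.findall(r"[a-z0-9]+", content.lower()):
            index[word].add(uri)
    return {term: sorted(uris) for term, uris in index.items()}


inverted = build_inverted_index(items)
for term in ("dog", "barks", "fox", "lazy"):
    print(term, inverted[term])
```

Access now goes through the terms rather than through the rows of the original table.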

“Index”
Etymology
  - From Latin index (“a discoverer, informer, spy; of things, an indicator, the forefinger, a title, superscription”) < indicō (“point out, show”); see indicate.
Variations
  - Index finger.
  - Book index.
  - Index of economic activity.

Why an index? SPEED!

Term | URIs (for items)
dog | ~/document-1; ~/document-2
barks | ~/document-1
fox | ~/document-2
lazy | ~/document-2
… | …

Terms are structured for fast access via hashing, an ordered list, or a tree.
Search time without an index is ~ O(P * W), where P is >25 billion web pages and W is an average of 475 words per page.
Search time with an index of ordered terms is ~ O(log(T)), where T is the number of terms. Even when T is a trillion, this number is under 40 (with log base = 2).
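The arithmetic behind these two estimates can be checked directly, and the ordered-list case can be sketched with a binary search. The figures for P and W are the ones quoted on the slide; the helper name `lookup` and the four-term list are assumptions for illustration.

```python
import math
from bisect import bisect_left

P = 25_000_000_000  # web pages (the slide says "more than 25 billion")
W = 475             # average words per page

sequential_steps = P * W            # words touched by a full scan
indexed_steps = math.log2(10**12)   # comparisons for a trillion ordered terms

print(f"scan:  {sequential_steps:.3e} word comparisons")
print(f"index: {indexed_steps:.1f} comparisons")


def lookup(sorted_terms, term):
    """Binary search over an ordered term list; return its position or -1."""
    i = bisect_left(sorted_terms, term)
    if i < len(sorted_terms) and sorted_terms[i] == term:
        return i
    return -1


terms = ["barks", "dog", "fox", "lazy"]  # must already be sorted
print(lookup(terms, "fox"), lookup(terms, "cat"))
```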

But notice something…
Indexing means “touching” every information item in a corpus.
  - Indexing and data-mining go hand-in-hand.

Variations and tradeoffs
In general, we trade storage for speed.
One variation: use the index to narrow the scope of a follow-on sequential search.
  - Example: Locate all documents containing the words “fox”, “jumped”, “dog”.
  - Then search within these for variations of the phrase “The quick brown fox jumps over the lazy dog”.
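A minimal sketch of this two-step variation follows, assuming a tiny made-up corpus (the third document is invented so that the index step alone is not enough). The sketch has no stemmer, so it looks up the literal word "jumps" rather than "jumped"; the names `candidates_for` and `phrase_search` are hypothetical.

```python
import re

docs = {
    "~/document-1": "The old dog barks backwards without getting up.",
    "~/document-2": "The quick brown fox jumps over the lazy dog",
    "~/document-3": "A dog chased a fox; the fox jumps a fence.",  # invented
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


# Term -> set of URIs (no positions stored, so phrases need a second step).
index = {}
for uri, text in docs.items():
    for term in tokenize(text):
        index.setdefault(term, set()).add(uri)


def candidates_for(required_terms):
    """Step 1: use the index to keep only items containing every term."""
    result = set(docs)
    for term in required_terms:
        result &= index.get(term, set())
    return result


def phrase_search(phrase, required_terms):
    """Step 2: sequentially scan only the candidates for the exact phrase."""
    needle = " " + " ".join(tokenize(phrase)) + " "
    return sorted(
        uri for uri in candidates_for(required_terms)
        if needle in " " + " ".join(tokenize(docs[uri])) + " "
    )


hits = phrase_search("The quick brown fox jumps over the lazy dog",
                     ["fox", "jumps", "dog"])
print(hits)
```

The index stays small because it stores no word positions; the price is a sequential scan over the (hopefully few) candidate items.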

Inverted index construction
*From slides by Efthimis Efthimiadis

Inverted Indexes
*From slides by Efthimis Efthimiadis

How Are Inverted Files Created
*From slides by Efthimis Efthimiadis

Then terms are sorted
*From slides by Efthimis Efthimiadis

Multiple term entries for a single document are merged. Within-document term frequency information is compiled.
*From slides by Efthimis Efthimiadis

Then the file can be split into a Dictionary and Postings file
*From slides by Efthimis Efthimiadis
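The figures from these slides did not survive in the transcript, so the steps they name (emit term entries, sort, merge with term frequencies, split into dictionary and postings) are sketched below on two made-up documents. The data layout, with a `doc_count` and an `offset` per dictionary entry, is an assumption chosen for illustration and may differ from the original slides.

```python
from itertools import groupby

# Made-up documents: doc ID -> text (already lowercased and tokenizable).
docs = {
    1: "the dog barks at the fox",
    2: "the lazy dog sleeps",
}

# Step 1: emit one (term, doc_id) entry per word occurrence.
pairs = [(term, doc_id) for doc_id, text in docs.items()
         for term in text.split()]

# Step 2: sort by term, then by document ID.
pairs.sort()

# Step 3: merge repeated entries for the same document and
# record the within-document term frequency.
merged = [(term, doc_id, len(list(group)))
          for (term, doc_id), group in groupby(pairs)]

# Step 4: split into a dictionary (term -> where its postings live)
# and a postings list of (doc_id, term_frequency) entries.
dictionary = {}
postings = []
for term, entries in groupby(merged, key=lambda entry: entry[0]):
    entries = list(entries)
    dictionary[term] = {"doc_count": len(entries), "offset": len(postings)}
    postings.extend((doc_id, tf) for _, doc_id, tf in entries)


def postings_for(term):
    """Follow the dictionary entry into the postings list."""
    entry = dictionary[term]
    return postings[entry["offset"]: entry["offset"] + entry["doc_count"]]


print(postings_for("the"))
print(postings_for("dog"))
```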

Inverted indexes
Permit fast search for individual terms.
For each term, you get a list consisting of:
  - document ID
  - frequency of term in doc (optional)
  - position of term in doc (optional)
These lists can be used to solve Boolean queries.
Also used for statistical ranking algorithms.
*From slides by Efthimis Efthimiadis
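As a rough illustration of how such lists answer Boolean queries, the sketch below stores only sorted document IDs per term. The postings values and the function names `intersect`, `union` and `complement` are assumptions for this example.

```python
# Sorted document-ID lists per term (values made up for illustration).
postings = {
    "dog":   [1, 2],
    "barks": [1],
    "fox":   [2],
    "lazy":  [2],
}
ALL_DOCS = [1, 2]


def intersect(a, b):
    """AND: walk two sorted lists in step, keeping IDs present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """OR: IDs present in either list, kept sorted."""
    return sorted(set(a) | set(b))


def complement(a):
    """NOT: every known document that is absent from the list."""
    excluded = set(a)
    return [d for d in ALL_DOCS if d not in excluded]


print(intersect(postings["dog"], postings["fox"]))               # dog AND fox
print(union(postings["barks"], postings["fox"]))                 # barks OR fox
print(intersect(postings["dog"], complement(postings["lazy"])))  # dog AND NOT lazy
```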

Stages in the construction of an index
Locate the items to be indexed.
  - Crawl for the Web, enumerate for the desktop.
Analyze and “filter” each item for indexable content.
  - Content words, attribute/value pairs (metadata), features.
For text
  - Word breaking
  - Stop-word removal (“the”)
  - Stemming (“walk”, “walks”, “walked”, “walking” => “walk”)

Similar steps are generally followed for the query
  - Word breaking & parsing
  - Stop-word removal (“the”)
  - Stemming (“walk”, “walks”, “walked”, “walking” => “walk”)
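Word breaking, stop-word removal and stemming can be sketched as one small function that is applied to both item text and query text. The stop-word list and the suffix-stripping rule here are deliberately crude assumptions; a real system would more likely use an established stemmer such as Porter's.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and"}  # tiny illustrative list
SUFFIXES = ("ing", "ed", "s")                 # crude; not a real stemmer


def stem(word):
    """Strip one known suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    words = re.findall(r"[a-z0-9]+", text.lower())     # word breaking
    words = [w for w in words if w not in STOP_WORDS]  # stop-word removal
    return [stem(w) for w in words]                    # stemming


# The same function serves the item being indexed and the query.
print(analyze("walk walks walked walking"))
print(analyze("The walking dogs"))
```

Running the query through the same steps as the indexed text is what lets a query for "walking" match an item that only contains "walked".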

Crawling Challenges
Crawlers need to be “polite”.
Servers are often down or slow.
Hyperlinks can get the crawler into cycles.
Some websites have junk in the web pages.
Many pages have dynamic content.
  - The “hidden” web
  - E.g., myuw.washington.edu
The web is HUGE.
*adapted from Hearst
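Two of these challenges, politeness and cycles, can be sketched with Python's standard `urllib.robotparser` and a set of already-seen URLs. The robots.txt contents, the `example.com` URLs, the agent name `info320-bot` and the helper `plan_crawl` are all made up for this example, and nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def plan_crawl(urls, agent="info320-bot"):
    """Return the URLs a polite crawler may fetch, each at most once."""
    seen = set()      # guards against cycles and repeated links
    allowed = []
    for url in urls:
        if url in seen:
            continue
        seen.add(url)
        if rules.can_fetch(agent, url):  # politeness: obey robots.txt
            allowed.append(url)
    return allowed


queue = [
    "http://example.com/a",
    "http://example.com/private/b",
    "http://example.com/a",          # repeated link, as in a cycle
]
print(plan_crawl(queue))
print(rules.crawl_delay("info320-bot"))  # seconds to wait between requests
```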

What really gets crawled?
  - A small fraction of the Web that search engines know about; no search engine is exhaustive
  - Not the “live” Web, but the search engine’s index
  - Not the “Deep Web”
  - Mostly HTML pages but other file types too: PDF, Word, PPT, etc.
*adapted from Lew & Davis

A need for parallelism
Büttcher, Clarke & Cormack, 2010, Information Retrieval, Chapter 15:
“the crawler must download pages from multiple sites concurrently because it may take an individual site several seconds to respond to a single download request. As it downloads pages, the crawler must track its progress, avoiding pages it has already downloaded and retrying pages when a download attempt fails”
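The behaviour described in the quotation (concurrent downloads, progress tracking, skipping pages already downloaded, retrying failures) can be sketched without touching the network. Everything here is a stand-in: `FAKE_WEB` is an invented link graph, `fetch` simulates a download that fails once for page "c", and `crawl` is a hypothetical helper, not the design from the book.

```python
from concurrent.futures import ThreadPoolExecutor

# Invented link graph: page -> outgoing links ("b" links back to "a").
FAKE_WEB = {"a": ["b", "c"], "b": ["a", "c"], "c": ["d"], "d": []}
_failed_once = set()


def fetch(url):
    """Stand-in for an HTTP download; "c" fails on its first attempt."""
    if url == "c" and url not in _failed_once:
        _failed_once.add(url)
        raise OSError("temporary failure")
    return FAKE_WEB[url]


def crawl(seed, max_attempts=3, workers=4):
    downloaded = set()   # progress tracking: pages already fetched
    attempts = {}        # download attempts per page
    frontier = [seed]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            # De-duplicate the frontier and skip finished pages.
            batch = [u for u in dict.fromkeys(frontier) if u not in downloaded]
            frontier = []
            for url in batch:
                attempts[url] = attempts.get(url, 0) + 1
            # Download the whole batch concurrently.
            futures = {url: pool.submit(fetch, url) for url in batch}
            for url, future in futures.items():
                try:
                    links = future.result()
                except OSError:
                    if attempts[url] < max_attempts:
                        frontier.append(url)   # retry in a later round
                    continue
                downloaded.add(url)
                frontier.extend(l for l in links if l not in downloaded)
    return downloaded, attempts


downloaded, attempts = crawl("a")
print(sorted(downloaded), attempts)
```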

Components of a web crawler
From Büttcher, Clarke & Cormack, 2010, Information Retrieval, Chapter 15

Questions?