Documents and Indexing Readings Overview Topic Discussions Schedule Set Projects and Papers Ideas.


Indexing and Searching Query models work against the index - Find words, word counts, phrases - Sequential search, indexed search Inverted Files & Other Indices Boolean Queries Sequential Searching Pattern Matching Structural Queries Data structures - The infrastructure of search - Vary with the data set and query context

Inverted Files - Vocabulary with counts for occurrences - Positions where each word (or char) is stored Character positions – exact – full inverted indexes Words referenced – local – block addressing – then words are found - Pointers can be made efficient, compressed - Fixed size or content based - Dependent on text location and access
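
As a rough illustration of the slide above, the sketch below builds a full (word-position) inverted index: a vocabulary in which each word points to the documents and word positions where it occurs. The function name `build_inverted_index` and the sample documents are assumptions made for this example, not part of the original slides.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each vocabulary word to {doc_id: [word positions]}.

    `docs` is a dict of doc_id -> text (a hypothetical input format).
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Tokenize on word characters; real systems use richer text operations.
        for pos, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word].setdefault(doc_id, []).append(pos)
    return index


docs = {1: "the web is big", 2: "the web is dynamic and big"}
index = build_inverted_index(docs)

# Occurrence counts per word come for free from the position lists.
counts = {word: sum(len(p) for p in postings.values())
          for word, postings in index.items()}
```

Storing word positions corresponds to the "exact" full inverted index; block addressing would store a block number instead of each position and trade index size for a scan inside the block.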

Searching through Inverted Files 1. Vocabulary search - Query is parsed and words/phrases are determined 2. Retrieval of occurrences - List of words and locations 3. Manipulation of occurrences Occurrences are processed for IR operation Block search if needed Vocabulary is small(er) and is the starting point Sequence queries use more processing Fast in general Easy to update
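
The three steps above can be sketched for a phrase (sequence) query. This is a minimal example, assuming an index shaped as word -> {doc_id: [positions]}; the function name `phrase_search` and the tiny hand-written index are illustrative only.

```python
def phrase_search(index, phrase):
    """Return {doc_id: [start positions]} where the phrase occurs."""
    words = phrase.lower().split()

    # 1. Vocabulary search: every query word must be in the vocabulary.
    if not words or any(w not in index for w in words):
        return {}

    # 2. Retrieval of occurrences: fetch the posting list of each word.
    postings = [index[w] for w in words]

    # 3. Manipulation of occurrences: keep documents containing all the
    #    words, then check that the positions are consecutive.
    hits = {}
    common_docs = set(postings[0]).intersection(*postings[1:])
    for doc_id in sorted(common_docs):
        starts = [p for p in postings[0][doc_id]
                  if all(p + i in postings[i][doc_id]
                         for i in range(1, len(words)))]
        if starts:
            hits[doc_id] = starts
    return hits


# Hand-written index for "the web is big" (doc 1) and
# "the web is dynamic and big" (doc 2).
index = {
    "the": {1: [0], 2: [0]},
    "web": {1: [1], 2: [1]},
    "is": {1: [2], 2: [2]},
    "big": {1: [3], 2: [5]},
    "dynamic": {2: [3]},
    "and": {2: [4]},
}
result = phrase_search(index, "is big")
```

The position check in step 3 is the extra processing that sequence queries need compared with single-word lookups.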

Other Indices for Text - Suffix Trees Helps with phrases, patterns (ACGT) File, text is like a huge string Blocks of text seen as suffixes (from X to EOT) Each virtual block is different Great at finding word beginnings - Suffix Arrays Pointers to text (matches) Comparison to each pointer - Look, stop, look to EOF, EOT - Sorting and Merging can occur
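
A suffix array can be sketched in a few lines: pointers to every suffix of the text are sorted, and a pattern is located by binary search over those pointers. The construction below uses a plain sort for clarity (an assumption for this example; production builders use faster algorithms), and the names `build_suffix_array` and `find_occurrences` are hypothetical.

```python
def build_suffix_array(text):
    """Sort the start positions of all suffixes (each runs from i to end of text)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Binary-search the sorted suffixes for those starting with `pattern`."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[start:lo])


text = "ACGTACGA"
suffix_array = build_suffix_array(text)
matches = find_occurrences(text, suffix_array, "ACG")
```

Because every suffix is indexed, the pattern does not have to start on a word boundary, which is why this family of indices suits strings such as DNA sequences.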

Signature Files A different method (or is it?) Vocabulary indexed hashes (a signature) - Cut blocks of text of b words each to bit masks - Bit masks are just bins of bits - Signature is sequence of bit masks for all the blocks and a pointer to each block Search by converting (hashing) the query to bit mask and then comparing Finds sets of words easier (phrases) Block tweaking can go on forever Easy to update… add new bit masks
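
One way to picture a signature file is the sketch below: the text is cut into blocks of b words, each word is hashed to a few bits, and the bits of a block are OR-ed into one mask stored with a pointer to the block. The mask width, the number of bits per word, and the use of `zlib.crc32` as the hash are all assumptions chosen for this example.

```python
import zlib

MASK_BITS = 64       # width of each bit mask (assumed)
BITS_PER_WORD = 3    # bits set per word (assumed)


def word_signature(word):
    """Hash a word to a bit mask with up to BITS_PER_WORD bits set."""
    sig = 0
    for seed in range(BITS_PER_WORD):
        h = zlib.crc32(f"{seed}:{word}".encode("utf-8"))
        sig |= 1 << (h % MASK_BITS)
    return sig


def build_signature_file(words, b):
    """Return [(block mask, block start)] for blocks of b words each."""
    signatures = []
    for start in range(0, len(words), b):
        mask = 0
        for w in words[start:start + b]:
            mask |= word_signature(w)
        signatures.append((mask, start))
    return signatures


def search(words, signatures, b, query_words):
    """Return start positions of blocks that contain all query words."""
    query_mask = 0
    for w in query_words:
        query_mask |= word_signature(w)
    hits = []
    for mask, start in signatures:
        if (mask & query_mask) == query_mask:        # candidate block
            block = words[start:start + b]
            if all(w in block for w in query_words):  # rule out false drops
                hits.append(start)
    return hits


words = "the web is big really big the web is dynamic".split()
signatures = build_signature_file(words, b=4)
blocks = search(words, signatures, 4, ["web", "big"])
```

The mask comparison can report a block that does not actually contain the words (a false drop), so the candidate block is checked against the text. Appending text only appends new masks, which is why updates are easy.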

Boolean Queries A set of operations to manipulate the indices and produce results Algorithm based - Find document qualifiers - Make relevance judgment (value) of document - Retrieve query matches for review Multi-phase, levels of processing (merging) Narrows scope of the query Parsing leaves from the syntax tree (AND/OR) Optimization is in the details & situations
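
A Boolean query can be evaluated by walking its syntax tree: the leaves fetch posting lists from the index, and the AND/OR nodes merge them. The sketch below assumes a query written as nested tuples such as `("AND", left, right)` and an index of word -> sorted list of document ids; both conventions are invented for this example.

```python
def intersect(a, b):
    """AND: merge two sorted posting lists, keeping common doc ids."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """OR: all doc ids that appear in either list, sorted."""
    return sorted(set(a) | set(b))


def evaluate(node, index):
    """Evaluate a query tree; leaves are words, inner nodes are (op, left, right)."""
    if isinstance(node, str):
        return index.get(node, [])
    op, left, right = node
    left_docs = evaluate(left, index)
    right_docs = evaluate(right, index)
    if op == "AND":
        return intersect(left_docs, right_docs)
    if op == "OR":
        return union(left_docs, right_docs)
    raise ValueError(f"unknown operator: {op}")


index = {"web": [1, 2, 4], "index": [2, 3, 4], "crawler": [3, 5]}
query = ("AND", "web", ("OR", "index", "crawler"))
docs = evaluate(query, index)
```

Each level of the tree is one merging phase, and each AND narrows the set of candidate documents.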

Sequential Searching Search through a free-form data set (no indexed specific structure) Find where the query pattern occurs - Query shorter than text - Number of potential matches for query in text Simple case: - Start at first character - Proceed linearly through the text Huge variety of methods

String Search Methods Brute Force – search over all positions in the text - Match? No: next position. Yes: note the location, then next position - Sequential Windows (framed) text scanning - Knuth-Morris-Pratt: slides a frame the size of the search string over the text; after a mismatch it shifts the frame using the part that already matched, so no text character is re-read - Boyer-Moore: compares from the end of the frame and, on a mismatch, slides the frame ahead by up to the length of the whole search string
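
To make the window idea concrete, the sketch below compares brute-force search with Horspool's simplification of Boyer-Moore. Horspool is a swapped-in variant chosen for brevity, not the full Boyer-Moore algorithm, and the function names are made up for this example.

```python
def brute_force(text, pattern):
    """Try every position in the text and note each match location."""
    m = len(pattern)
    if m == 0:
        return []
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def horspool(text, pattern):
    """Boyer-Moore-Horspool: shift the window by the last character's rule."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Distance from the rightmost occurrence of each character (excluding
    # the final pattern position) to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Characters absent from the pattern allow a jump of the full length m.
        pos += shift.get(text[pos + m - 1], m)
    return hits


text = "abracadabra"
slow = brute_force(text, "abra")
fast = horspool(text, "abra")
```

Both functions return the same match positions; the window method simply visits fewer positions when the text contains characters that the pattern lacks.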

More String Search Shift-Or - Uses binary representation, bit masks for searching with a frame - Strengths and weaknesses of hashes Phrase search - Look for longest word first using Window scan - Look for less frequent term first Which method is best? - It depends on the document and IR model for queries
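
Shift-Or can be sketched with Python integers as bit vectors. Each pattern character gets a bit mask with a 0 where it occurs in the pattern; the search state is shifted and OR-ed with the mask of each text character, and a 0 in the top bit signals a match. The function name `shift_or` is an assumption for this example.

```python
def shift_or(text, pattern):
    """Return start positions of `pattern` in `text` using Shift-Or."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1

    # masks[c] has bit i cleared exactly when pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, all_ones) & ~(1 << i)

    # Bit i of state is 0 when pattern[0..i] matches the text ending here.
    state = all_ones
    hits = []
    for j, c in enumerate(text):
        state = ((state << 1) | masks.get(c, all_ones)) & all_ones
        if (state & (1 << (m - 1))) == 0:
            hits.append(j - m + 1)
    return hits


positions = shift_or("abracadabra", "abra")
```

The method does one shift and one OR per text character, so it is fast while the pattern fits in a machine word; Python integers are unbounded, which hides that limit here.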

Putting it all together… 1. Query 2. Index 3. Search Method - Vocabulary is matched - Results of matches are merged into a list of documents and text locations in documents You can see how the IR model, the query and the index all combine to enable the IR system

Structural Queries Working with a document that has explicit or implicit structure - Explicit – HTML, tags, spaces, numbers… - Implicit – Genre, Language, Formats… Tags count as vocabulary words (mostly) Index represents structure IR models allow acting on structure in index Do we ever not have some kind of structure? - Should structure be ignored or removed? - Do structures scale and have consistency?

Compression Search is possible directly in compressed text This is a huge win for disk access (and CPU) Language regularities help compression: - Heaps' law – vocabulary growth - Zipf's law – word frequencies (document scaling) Indexes compress well: sorted sequences stored as differences (gaps) give small overall sizes
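
The point about sequences and differences can be illustrated with posting lists: sorted document ids are stored as gaps, and the gaps are written with a variable-byte code so that small numbers take one byte. The byte layout below (low 7 bits first, high bit set on the last byte of each number) is one common convention picked for this example; the function names are hypothetical.

```python
def to_gaps(doc_ids):
    """Replace sorted doc ids by the differences between neighbours."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps


def from_gaps(gaps):
    """Rebuild the doc ids by summing the gaps."""
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids


def vbyte_encode(numbers):
    """Encode non-negative ints, 7 bits per byte; high bit marks the last byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers


doc_ids = [824, 829, 215406]
encoded = vbyte_encode(to_gaps(doc_ids))
decoded = from_gaps(vbyte_decode(encoded))
```

Frequent words have long posting lists with small gaps, so most of their entries shrink to a single byte.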

Evaluating Indices & Search Fixed indices are best for static or predictable text One, single search method may be tuned for static text Hybrid solutions are best for varied documents Hybrid solutions are best as you want to be prepared for document and query type diversity

Users and Searching Document sets and therefore indices are growing rapidly Users have increasing expectations for both accuracy and speed Compression methods help please users Structured information will transform search selection and IR model preferences Computation power solves some problems (mostly)

Keeping Up with the Changing Web Building Indices is difficult enough in theory What about a continuously changing huge volume of information? Is old information good? What does up-to-date mean anymore? Is Knowledge a depreciating commodity? - Correctness + Value over time Different information changes at different rates - Really it’s new information How do you update an index with constantly changing information?

Changing Web Properties Known distributions for information change Sites and pages may have easily identifiable patterns of update - 4% change on every observation - Some don’t ever change (links too) If you check and a page hasn’t changed, what is the probability it will ever change? Rate of change is related to rate of attention - Machines vs. Users - Measures can be compared along with information

Dynamic Maint. of Indexes w/Landmarks Web Crawlers do the work in gathering pages Incremental crawling means incremented indices - Rebuild the whole index more frequently - Devise a scheme for updates (and deletions) - Use supplementary indices (i.e. date) New documents Changed documents 404 documents

Landmarks for Indexing Difference-based method Documents that don’t change are landmarks - Relative addressing - Clarke: block-based - Glimpse: chunking Only update pointers to pages Tags and document properties are landmarked Broader pointers mean fewer updates Faster indexing – Faster access?
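
Relative addressing with landmarks might look like the sketch below. Postings store (landmark id, offset) pairs instead of absolute positions, so when text grows or shrinks only the small landmark table is adjusted. The class `LandmarkIndex`, its fixed-size blocks, and the `shift_from` method are assumptions made for illustration and are not the exact scheme from the readings.

```python
class LandmarkIndex:
    """Toy index whose postings are relative to block landmarks."""

    def __init__(self, words, block_size):
        # landmarks[k] is the absolute word position where block k starts.
        self.landmarks = list(range(0, len(words), block_size))
        self.postings = {}
        for pos, word in enumerate(words):
            lm = pos // block_size
            offset = pos - self.landmarks[lm]
            self.postings.setdefault(word, []).append((lm, offset))

    def positions(self, word):
        """Resolve relative postings to absolute positions."""
        return [self.landmarks[lm] + off
                for lm, off in self.postings.get(word, [])]

    def shift_from(self, landmark_id, delta):
        """Record that `delta` words were inserted (or removed, if negative)
        just before block `landmark_id`; postings are left untouched."""
        for lm in range(landmark_id, len(self.landmarks)):
            self.landmarks[lm] += delta


index = LandmarkIndex("a b c d e f".split(), block_size=2)
before = index.positions("e")
index.shift_from(1, 3)   # three words inserted before block 1
after = index.positions("e")
```

The sketch does not index the inserted words themselves; it only shows why existing postings need no rewrite. Larger blocks mean a smaller landmark table to update, at the cost of coarser addressing.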

Yahoo! Cataloging the Web How do information professionals build an “index” of the Web? Cataloging applies to the Web Indexing with synonyms Browsing indexes vs searching them Comprehensive index not the goal - Quality - Information Density Yahoo’s own ontology – points to site for full info Subject Trees with aliases to other locations “More like this” comparisons as checksums

Yahoo uses tools for indexing

Investigation of Documents from the WWW What structure and formats do Web documents use? What properties do Web documents have? - Size – 4K avg. - Tags – ratio and popular tags - MIME types (file extensions) - URL properties and formats - Links – internal and external - Graphics - Readability

WWW Documents Investigation How do you collect data like this? - Web Crawler URL identifier, link follower - Index-like processing Markup parser, keyword identifier Domain name translation (and caching) How do these facts help with indexing? Have general characteristics changed? (This would be a great project to update.)
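
The index-like processing step can be sketched with Python's standard `html.parser`: a markup parser that counts tags and separates internal from external links for one page. The class name `PageStats`, the sample HTML, and the example.org-style URLs are invented for this illustration; fetching pages and resolving domain names are left out.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageStats(HTMLParser):
    """Collect tag counts and internal/external links for one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.tags = Counter()
        self.internal_links = []
        self.external_links = []

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                url = urljoin(self.base_url, href)
                same_host = (urlparse(url).netloc
                             == urlparse(self.base_url).netloc)
                target = self.internal_links if same_host else self.external_links
                target.append(url)


html = ('<html><body><h1>IR</h1>'
        '<a href="/notes.html">Notes</a>'
        '<a href="http://other.example/x">Other</a>'
        '<img src="a.gif"></body></html>')

stats = PageStats("http://site.example/index.html")
stats.feed(html)
size_in_bytes = len(html.encode("utf-8"))
```

Running this over a crawled sample and aggregating `tags`, link counts, and `size_in_bytes` would give the kind of per-document statistics the slide lists.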

Properties of Highly-Rated Web Sites What about whole Web sites? What is a Web site? - Sub-sites? - Specific contextual, subject-based parts of a Web site? - Links from other Web pages: on the site and off - Web site navigation effects Will experts (like Yahoo catalogers) like a site?

Properties Links & formatting Graphics – one, but not too many Text formatting – 9 pt. with normal style Page (layout) formatting – min. colors Page performance (size and access) Site architecture (pages, nav elements) - More links within and external - Interactive (search boxes, menus) Consistency within a site is key How would a user or index builder make use of these?

Extra Discussion Little Words, Big Difference - The difference that makes a difference - Singular and plural noun identification can change indices and retrieval results - Language use differences Decay and Failures - Dead links - Types of errors - Huge number of dead links (PageRank effective) 28% in Computer & CACM 41% in 2002 articles Better than the average Web page?

Topic Discussions Set Leading WIRED Topic Discussions - About 20 minutes reviewing issues from the week’s readings Key ideas from the readings Questions you have about the readings Concepts from readings to expand on - PowerPoint slides - Handouts - Extra readings (at least a few days before class) – send to wired listserv

Web IR Evaluation - 5 page written evaluation of a Web IR System - technology overview (how it works) Not an eval of a standard search engine Only main determinable diff is content - a brief overview of the development of this type of system (why it works better) - intended uses for the system (who, when, why) - (your) examples or case studies of the system in use and its overall effectiveness

How can (Web) IR be better? - Better IR models - Better User Interfaces More to find vs. easier to find Scriptable applications Projects and/or Papers Overview

Project Ideas Searchable Personal Digital Library Browser hacks for searching Mozilla keeps all the pages you surf so you can search through them later - Mozilla hack - Local search engine

Paper Ideas New datasets for IR - Search on the Desktop – issues, previous research and ideas - Collaborative searching – advantages and potential, but what about privacy?