INFO 320: Information Needs, Searching, and Presentation (aka… Search)


INFO 320: Information Needs, Searching, and Presentation (aka… Search). Instructor: William Jones, Email: williamj@uw.edu. TA: Brennen Smith, Email: brennentsmith@gmail.com. Lectures: Tuesdays & Thursdays, 1:30 – 3:20 pm, MGH 238. Labs: Wednesdays, 1:30 – 2:20 pm, MGH 030.

For this Week 3 (10/13/2013) (Basics of Search) Add Word exercise in class Boolean search vs. the vector space model B-trees 2.2 W One-minute madness – each team gets one minute to describe progress on lab exercises & issues encountered. On-going work in lab.

And also for this Week 3 (of 10/13) Cool tool presentations; Essay review Wrap-up Guest speaker on SEO; 2.2 F Quiz on Module 2.

Components of a web crawler *from Büttcher, Clarke & Cormack, 2010, Information Retrieval, Chapter 15

Parsing a document. What format is it in? pdf/word/excel/html? What language is it in? How to handle “and”? What character set is in use? …

What you see.. *from http://en.wikipedia.org/wiki/Lawrence_Massacre

Is not what the crawler gets *from http://en.wikipedia.org/wiki/Lawrence_Massacre

What character set is in use? ISO-8859-1: Latin alphabet part 1, covering North America, Western Europe, Latin America, the Caribbean, Canada, Africa; historically the default for Web pages. UTF-8: a character encoding of Unicode. A character in UTF-8 can be from 1 to 4 bytes long. UTF-8 can represent any character in the Unicode standard. UTF-8 is backwards compatible with ASCII. UTF-8 is the preferred encoding for e-mail and web pages. *from http://www.w3schools.com/TAGS/ref_charactersets.asp
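
The byte-length and ASCII-compatibility claims above can be checked directly. The following is a minimal sketch in Python; the sample characters and the helper names `utf8_length` and `decode_bytes` are illustrative assumptions, not part of the lecture.

```python
# Illustrative characters only: one each that needs 1, 2, 3 and 4 bytes in UTF-8.
SAMPLES = ["A", "é", "€", "😀"]


def utf8_length(ch: str) -> int:
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def decode_bytes(raw: bytes) -> str:
    """Hypothetical, naive fallback: try UTF-8 first, then ISO-8859-1.

    ISO-8859-1 maps every byte to a character, so the fallback never fails,
    but it may silently produce the wrong text for other encodings.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


for ch in SAMPLES:
    print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s) in UTF-8")

# Pure ASCII text has identical bytes in ASCII and UTF-8.
assert "search".encode("ascii") == "search".encode("utf-8")

# The same character is one byte in ISO-8859-1 but two bytes in UTF-8.
print("é".encode("iso-8859-1").hex(), "é".encode("utf-8").hex())
```

A crawler that guesses the wrong character set will index garbled terms, which is why the declared encoding is usually read before tokenizing.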

An HTML sample *from http://en.wikipedia.org/wiki/Lawrence_Massacre
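
To illustrate the gap between what a reader sees and what a crawler receives, here is a rough sketch that pulls the visible text out of raw HTML using Python's standard `html.parser` module. The `TextExtractor` class, the `visible_text` helper and the sample markup are made up for illustration; the markup is not the Wikipedia page shown on the slide.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def visible_text(html: str) -> str:
    """Return the text a reader would see, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical markup standing in for a crawled page.
SAMPLE_HTML = (
    "<html><head><title>Page title</title>"
    "<style>p { color: red; }</style></head>"
    "<body><h1>Heading</h1><p>Some <b>visible</b> text.</p>"
    "<script>var x = 1;</script></body></html>"
)

print(visible_text(SAMPLE_HTML))
```

A real parser would also have to decide what to do with navigation menus, alt text and metadata; this sketch only strips tags, scripts and styles.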

Typical Stop Word List

Ambiguity of Natural Language (NL). Synonymy: Different Words, Same Meaning. “car” ~= “automobile”; “stomach pain after eating” ~= “post-prandial abdominal discomfort”. Polysemy: Same Words, Different Meanings. “jaguar” as animal vs. kind of automobile; “juvenile victims of crime” vs. “victims of juvenile crime”; Venetian blinds vs. blind Venetians.

How to handle synonyms? car = automobile. When the document contains “automobile”, index it under “car” as well (and vice versa). Or expand the query: when the query contains “automobile”, look under “car” too. Or form a concept, <automobile>: when “car” is encountered in a document, index it under “<automobile>” (and “car” too?); likewise for “automobile”. When either “car” or “automobile” is encountered in a query, add the term “<automobile>”.
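
Two of the options above, query expansion and a shared concept term, can be sketched as follows. This is a toy illustration: the tables `SYNONYM_SETS` and `CONCEPTS` and the function names are assumptions introduced here, and a real system would draw on a much larger thesaurus.

```python
# Hypothetical synonym data, limited to the example on the slide.
SYNONYM_SETS = [{"car", "automobile"}]
CONCEPTS = {"car": "<automobile>", "automobile": "<automobile>"}


def expand_query(tokens):
    """Query expansion: after each token, add its known synonyms."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        for group in SYNONYM_SETS:
            if token in group:
                expanded.extend(sorted(group - {token}))
    return expanded


def add_concepts(tokens):
    """Concept approach: keep each token and also emit its concept term.

    The same function can be applied to document tokens at indexing time
    and to query tokens at search time, so both meet at "<automobile>".
    """
    result = []
    for token in tokens:
        result.append(token)
        concept = CONCEPTS.get(token)
        if concept is not None:
            result.append(concept)
    return result


print(expand_query(["used", "automobile"]))
print(add_concepts(["car", "dealer"]))
```

Expanding the query keeps the index small but makes every query longer; indexing under a concept pays the cost once, at indexing time.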

Term Weighting: TF·IDF. TF: binary (presence or absence of term), simple count, or “sublinear” TF scaling. IDF. TF·IDF.
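
The weighting options listed above can be written out as small functions. This sketch assumes the textbook forms: sublinear TF as 1 + log10(tf) and IDF as log10(N / df). The toy documents and function names are invented for illustration.

```python
import math
from collections import Counter


def tf_binary(count):
    """1 if the term is present, 0 otherwise."""
    return 1.0 if count > 0 else 0.0


def tf_count(count):
    """Raw number of occurrences."""
    return float(count)


def tf_sublinear(count):
    """Dampened count: 1 + log10(count) for count > 0, else 0."""
    return 1.0 + math.log10(count) if count > 0 else 0.0


def idf(n_docs, doc_freq):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(n_docs / doc_freq)


def tf_idf(docs, tf=tf_sublinear):
    """Return {doc_id: {term: weight}} for a dict of tokenized documents."""
    n_docs = len(docs)
    counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    return {
        doc_id: {
            term: tf(count) * idf(n_docs, doc_freq[term])
            for term, count in term_counts.items()
        }
        for doc_id, term_counts in counts.items()
    }


# Toy corpus, made up for this example.
DOCS = {
    "d1": ["the", "car", "car", "insurance"],
    "d2": ["the", "car", "repair"],
    "d3": ["the", "best", "insurance"],
}

for doc_id, weights in tf_idf(DOCS).items():
    print(doc_id, {term: round(w, 3) for term, w in weights.items()})
```

A term that occurs in every document, such as "the" here, gets an IDF of zero and therefore contributes nothing to the score, whichever TF variant is used.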

A matrix as a way to understand the index, the vector model and more. [Table: term-by-document matrix with rows Term 1 to Term 6 and columns Doc 1 to Doc 6, …]
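
As a rough sketch of the matrix view, the snippet below builds a small term-by-document count matrix and ranks documents against a query by cosine similarity, which is one common way to use the vector model. The three documents, the query and the helper names are assumptions made up for the example.

```python
import math

# Toy tokenized documents, invented for illustration.
DOCS = {
    "Doc 1": ["jaguar", "car"],
    "Doc 2": ["jaguar", "cat", "cat"],
    "Doc 3": ["car", "automobile"],
}

TERMS = sorted({token for tokens in DOCS.values() for token in tokens})
DOC_IDS = list(DOCS)

# One row per term, one column per document; each cell is a raw count.
MATRIX = [[DOCS[doc_id].count(term) for doc_id in DOC_IDS] for term in TERMS]


def column(j):
    """The vector for document j: its column of the matrix."""
    return [row[j] for row in MATRIX]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query_tokens):
    """Treat the query as a short document and sort documents by similarity."""
    query_vector = [query_tokens.count(term) for term in TERMS]
    scores = [(cosine(query_vector, column(j)), DOC_IDS[j])
              for j in range(len(DOC_IDS))]
    return sorted(scores, reverse=True)


for score, doc_id in rank(["jaguar", "cat"]):
    print(f"{doc_id}: {score:.3f}")
```

Reading the matrix by rows gives the inverted index (each term's documents); reading it by columns gives the document vectors used for ranking.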

Cells can have weights. Terms can be composites. Documents can have sections… [Table: columns Doc 1.1, Doc 1.2, Doc 2.1, Doc 2.2, Doc 6, …; rows William.title (weights 3, 4), William.abstract (weights 2, 1), William.intro (weight 7)]

The index has 3 essential components:
1. A term list, structured for fast access to individual terms.
2. For each term, a list of associations to documents.
3. A list of documents that are indexed.

The index can store information for each component For terms – overall frequency in corpus (IDF), methods to identify or compute the term (e.g., variations in spelling, sound wave transformations, checksums, etc.) For term-to-doc associations – weights, # of occurrences, occurrence offsets, etc. For documents – address (by which to access content), summary, overall popularity (e.g., using PageRank).
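
The components and the information stored with each can be sketched as a small in-memory inverted index. This is a simplified illustration under stated assumptions: the class names `Posting`, `Document` and `InvertedIndex`, the whitespace tokenizer and the `doc://` addresses are all hypothetical, and weights or popularity scores are left out.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Posting:
    """A term-to-document association: which document, and where in it."""
    doc_id: int
    positions: list = field(default_factory=list)  # occurrence offsets

    @property
    def count(self):
        return len(self.positions)  # number of occurrences


@dataclass
class Document:
    """Per-document information: where to fetch it and a short summary."""
    address: str
    summary: str = ""


class InvertedIndex:
    def __init__(self):
        self.documents = []                # the list of indexed documents
        self.postings = defaultdict(dict)  # term -> {doc_id: Posting}

    def add(self, address, text):
        doc_id = len(self.documents)
        self.documents.append(Document(address, text[:40]))
        for offset, token in enumerate(text.lower().split()):
            posting = self.postings[token].setdefault(doc_id, Posting(doc_id))
            posting.positions.append(offset)
        return doc_id

    def doc_freq(self, term):
        """Number of documents containing the term (the input to IDF)."""
        return len(self.postings.get(term, {}))

    def search_and(self, *terms):
        """Boolean AND: ids of documents that contain every term."""
        if not terms:
            return []
        doc_sets = [set(self.postings.get(term, {})) for term in terms]
        return sorted(set.intersection(*doc_sets))


index = InvertedIndex()
index.add("doc://a", "the cat sat")
index.add("doc://b", "the dog sat on the cat")
print(index.search_and("cat", "sat"))
print(index.postings["the"][1].positions)
```

Here a plain dictionary plays the role of the term list; the later slide on fast access to terms discusses the alternatives.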


Methods for fast access to terms: Simple sort (if updates are few, or the term list can reside in RAM). Hashing*. B-trees (more next Thursday). *From http://wapedia.mobi/en/Hash_function
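
The first two access methods can be compared in a few lines. The sketch below assumes a term list small enough to hold in memory; `SortedTermList` is a hypothetical helper built on Python's standard `bisect` module, and a built-in `set` stands in for hashing. B-trees are not shown here.

```python
import bisect


class SortedTermList:
    """Sorted array of terms with binary-search lookup."""

    def __init__(self, terms):
        self._terms = sorted(set(terms))

    def __contains__(self, term):
        i = bisect.bisect_left(self._terms, term)  # O(log n) comparisons
        return i < len(self._terms) and self._terms[i] == term

    def add(self, term):
        i = bisect.bisect_left(self._terms, term)
        if i == len(self._terms) or self._terms[i] != term:
            # Inserting shifts every later element, which is why a sorted
            # array suits term lists that are rarely updated.
            self._terms.insert(i, term)

    def with_prefix(self, prefix):
        """Terms starting with prefix; sorted order makes this a short scan."""
        start = bisect.bisect_left(self._terms, prefix)
        matches = []
        for term in self._terms[start:]:
            if not term.startswith(prefix):
                break
            matches.append(term)
        return matches


TERMS = ["jaguar", "car", "cat", "automobile"]

sorted_terms = SortedTermList(TERMS)
hashed_terms = set(TERMS)  # hashing: fast exact lookup, but no ordering

sorted_terms.add("cab")
print("cat" in sorted_terms, "cat" in hashed_terms)
print(sorted_terms.with_prefix("ca"))
```

Hashing gives fast exact-match lookups but cannot answer prefix or range questions without scanning every term, which is one reason ordered structures remain in use.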

Term Weighting: TF·IDF. TF: binary (presence or absence of term), simple count, or “sublinear” TF scaling. IDF. TF·IDF.

Zipf’s law: If the terms of a corpus are ranked (r) by the frequency (f) of their occurrence, then r · f ≈ k, where k is a constant. Relates to the Pareto principle, aka the “80-20 rule”. Manning, Christopher D.; Prabhakar Raghavan; Hinrich Schütze (2008) Introduction to Information Retrieval
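
The rank-times-frequency relation can be checked by counting terms, ranking them by frequency, and multiplying rank by frequency. The sketch below uses a synthetic token list constructed so that the product is exactly constant; real corpora follow the law only approximately. The function name `rank_frequency` is an assumption.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, term, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    rows = []
    for rank, (term, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, term, freq, rank * freq))
    return rows


# Synthetic data: frequencies 12, 6, 4, 3 were chosen so that r * f = 12.
SYNTHETIC = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3

for rank, term, freq, product in rank_frequency(SYNTHETIC):
    print(f"r={rank} term={term!r} f={freq} r*f={product}")
```

On real text the products drift rather than staying fixed, but they typically remain within the same order of magnitude across many ranks.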

A sample Zipf distribution. The graph “hugs” the y- and x-axes. Much is accounted for by top-ranked items but much is also hidden in a looong tail. *from http://www.celtnet.org.uk/info/long_tail.php

Questions?