

Properties of Text: CS336 Lecture 3

2 Information Retrieval
Searching unstructured documents
Typically text
  – Newspaper articles
  – Web pages
Other documents also possible
  – Images
  – Music
  – etc.

3 What does an IR system do?
Indexing: generate a representation of each document
  – Want to generate representations automatically, with little human intervention
  – Use significant terms to build representations
Process query
  – The query is a representation of the user's ‘information need’
  – If documents are processed specially, queries must be processed the same way
  – Possibly weight query words
Match queries and documents
  – Find relevant documents
Rank and sort documents
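The four steps above (index, process the query, match, rank) can be sketched in a few lines. This is a minimal illustration, not the system the lecture describes: the whitespace tokenizer, the scoring by summed term counts, and the names `tokenize`, `build_index`, and `search` are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately naive rule)."""
    return text.lower().split()

def build_index(docs):
    """Indexing: map each term to the documents that contain it, with counts."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index

def search(index, query):
    """Process the query like a document, match terms, rank by summed counts."""
    scores = Counter()
    for term in tokenize(query):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Highest score first; ties broken by document id for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ranked]

# Hypothetical toy collection.
docs = {
    "d1": "web pages and newspaper articles",
    "d2": "searching web pages on the web",
    "d3": "music and images",
}
index = build_index(docs)
results = search(index, "web pages")  # ["d2", "d1"]
```

Note that the query goes through the same `tokenize` function as the documents, which is the point of "if documents are processed specially, queries must be too".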

4 Text
No computer understanding of document or query text
Uses “bag of words” approach
  – Pays little heed to inter-word dependencies: syntax, semantics
  – The bag does characterize the document
  – Not perfect: words are ambiguous, and are used in different forms or synonymously
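A small sketch of the bag-of-words idea, assuming whitespace tokenization and lowercasing (neither is specified on the slide). It shows what "little heed to inter-word dependencies" means in practice: two sentences with opposite meanings produce the same bag.

```python
from collections import Counter

def bag_of_words(text):
    """Represent text as term counts, discarding word order."""
    return Counter(text.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

# Word order (syntax) is lost, so the two bags are identical.
same_bag = (a == b)  # True
```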

5 Document
Loosely defined; denotes a single unit of information
Can be any logical unit
  – an article,
  – a book, or
  – a manual
Can be any physical unit
  – a file,
  – an email, or
  – a Web page

6 Document has:
Syntax: dictates structure
  – Implicit, or expressed in a declarative language (e.g., TeX)
  – Powerful languages: (i) easier to parse, (ii) difficult to convert to other formats
  – Generic languages are better (interchange and flexibility)
  – Trend: languages that provide information on structure, format, and semantics while being readable by both humans and computers, e.g., SGML, XML
Semantics
  – The semantics of natural-language text is not easy for a computer to understand
Information about itself, i.e., metadata

7 Parsing
Normalizing format
  – Process different document formats: PDF, DOC, HTML
  – Input can be very noisy; a robust parser is needed
  – Brin, S., Page, L. (1998) The Anatomy of a Large-Scale Hypertextual Web Search Engine
Word segmentation
Word normalization
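As one illustration of format normalization, the sketch below reduces an HTML page to plain text using Python's standard `html.parser` module, which tolerates unclosed tags. The class name `TextExtractor`, the helper `html_to_text`, and the choice to skip `script` and `style` content are assumptions for the example; PDF and DOC would need other tools.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)

def html_to_text(html):
    """Return the page text with runs of whitespace collapsed to one space."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

# Hypothetical noisy page: the second <p> and <html> are never closed.
page = ("<html><head><style>p {color: red}</style></head>"
        "<body><p>Hello <b>IR</b> world<p>unclosed</body>")
text = html_to_text(page)  # "Hello IR world unclosed"
```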

8 Text Representation
Text can be represented as
  – a string
  – words (statistical IR)
  – linguistic units (e.g., nouns, phrases)
Simple representations (single terms, or individual words) perform surprisingly well
  – Many previous studies showed that phrase indexing performs worse than word indexing
  – Phrases may be too specific (different phrases with the same meaning)

9 Word Segmentation
English is easy
  – Space character? Well…
  – It is said that Google indexes not just words but common queries too, e.g., “mp3 downloads”
Other languages present problems
  – Chinese: no space character
  – Japanese: four writing systems (Romaji, Hiragana, Katakana, Kanji)
  – German, Finnish, URLs, etc.: compound words, e.g., “Donaudampfschiffahrtsgesellschaftsoberkapitän”
  – Arabic, Latvian, etc.: large number of cases to normalize
European languages
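The contrast between English and Chinese can be seen directly with whitespace splitting. The sample sentences are made up for illustration, and the character-level fallback shown last is only one simple baseline assumed here, not a method the slide proposes.

```python
english = "information retrieval is fun"
chinese = "信息检索很有趣"  # roughly the same sentence, written without spaces

# Splitting on whitespace works for English...
english_tokens = english.split()  # ["information", "retrieval", "is", "fun"]

# ...but returns the whole Chinese sentence as a single "word".
chinese_tokens = chinese.split()

# Assumed fallback: treat every character as a token.
char_tokens = list(chinese)
```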

10 Indexes
Indexing choices (there is no “right” answer)
What is a word?
  – Embedded punctuation (e.g., DC-10, long-term)
      Break into distinct terms: long-term => long and term
      Or keep as a single term with the hyphen
  – Chemical Abstracts Service: treats it as a single term
  – LEXIS/NEXIS: breaks it into 2 terms if they occur in a title or abstract
  – Punctuation and case folding
      Punctuation is sometimes important
        “command.com”
        “OS/2”
      Case folding: convert to lower case or not
        Smith vs smith
        Apple vs apple
        New vs new
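The choices on this slide can be expressed as switches on a tokenizer. The sketch below is one hypothetical way to do it with regular expressions; the function name `tokenize` and its `split_hyphens` and `fold_case` parameters are invented for the example, and it only handles ASCII letters and digits.

```python
import re

def tokenize(text, split_hyphens=True, fold_case=True):
    """Tokenize under two of the indexing choices discussed above."""
    if fold_case:
        text = text.lower()
    if split_hyphens:
        # "long-term" becomes "long" and "term".
        pattern = r"[A-Za-z0-9]+"
    else:
        # Keep hyphenated forms such as "DC-10" as one term.
        pattern = r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*"
    return re.findall(pattern, text)

split_and_folded = tokenize("Long-term DC-10 Apple")
# ["long", "term", "dc", "10", "apple"]

kept_as_is = tokenize("Long-term DC-10 Apple", split_hyphens=False, fold_case=False)
# ["Long-term", "DC-10", "Apple"]

# Dropping all punctuation breaks terms where it matters.
broken = tokenize("OS/2 command.com")
# ["os", "2", "command", "com"]
```

Neither setting is correct in general: folding case merges Apple with apple, and splitting on punctuation loses “OS/2”, which is why the slide says there is no “right” answer.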

11 Indexes
Indexing choices (there is no “right” answer)
What is a word?
  – Stopwords (e.g., the, a, its)
  – Morphology (e.g., computer, computers, computing, computed)
  – Numbers? Not good discriminators
      But … important in some contexts
      Usually systems allow tokens to include digits but not to begin with one
      So B6 (vitamin) but not 6
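The stopword and number rules can be sketched as a single filter. The stopword set below contains only the three examples from the slide, and the regular expression is one assumed reading of "tokens may include digits but not begin with one"; real systems use longer stopword lists and their own token rules.

```python
import re

# Only the slide's examples; an actual stoplist would be much longer.
STOPWORDS = {"the", "a", "its"}

# A token starts with a letter and may continue with letters or digits.
TOKEN = re.compile(r"\b[a-z][a-z0-9]*\b")

def index_terms(text):
    """Lowercase, keep letter-initial tokens, and drop stopwords."""
    return [t for t in TOKEN.findall(text.lower()) if t not in STOPWORDS]

terms = index_terms("The vitamin B6 dose is 6 mg")
# ["vitamin", "b6", "dose", "is", "mg"]
```

Here "b6" survives because it begins with a letter, while the bare "6" is discarded, matching the B6 example on the slide.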

12 Generating Document Representations
Indexing: use significant terms to build representations of documents
Manual indexing: professional indexers
  – Usually from a controlled vocabulary
  – Typically phrases
Automatic indexing: machine selects
  – Machine selects the non-objective terms
  – Terms can be single words or phrases