What is a document?
Information need: Where did the metaphor "doing X is like herding cats" come from?
A quotation? “Managing senior programmers is like herding cats.” (Dave Platt)
Or… a paper or article? A video?

Basic IR: Documents
Assume:
- free text, from a quotation through a book (unstructured or semi-structured data)
- English
- available electronically (on-line repositories)
- generally, too many documents to store locally in an index
- generally, infer semantics through low-level units (e.g., terms) and metadata

Logical View of Documents
[Figure labels: Docs; structure; accents, spacing; stopwords; noun groups; stemming; automatic or manual indexing. Resulting views: structure, full text, index terms.]
(Figure taken from on-line course resources for Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto)

Structure
Metadata is information on the organization of the data.
- external to meaning: length, author, date…
- subject matter: subject codes, keywords, taxonomic indicators
Organizational conventions:
- articles have a title, author list, abstract, sections, etc.
- web pages have headings, title, keywords, etc.

Markup Languages
Markup is extra syntax that describes formatting, attributes, semantics, etc. Tags provide direction and delineate the beginning and end of each marked span.
Examples: TeX, Standard Generalized Markup Language (SGML), eXtensible Markup Language (XML), HyperText Markup Language (HTML).

Term Separators: Accents, Spacing, etc…
Lexical analysis divides text into distinct terms. It usually disregards punctuation, numbers, and spaces.
Decisions:
- how to treat case and hyphens?
- whether to disregard comments?
- whether and how to use formatting directives?
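
A minimal sketch of lexical analysis, offered as one possible reading of the decisions above rather than a reference implementation. The function name `tokenize` and its flags (`fold_case`, `split_hyphens`, `keep_numbers`) are hypothetical; each flag stands for one of the choices listed on the slide.

```python
import re

# A token is a run of letters/digits, optionally joined by single hyphens.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

def tokenize(text, fold_case=True, split_hyphens=True, keep_numbers=False):
    """Split free text into terms, discarding punctuation and spacing."""
    terms = []
    for token in TOKEN.findall(text):
        if fold_case:
            token = token.lower()
        # Hyphen decision: treat "web-based" as one term or as two.
        parts = token.split("-") if split_hyphens else [token]
        for part in parts:
            # Number decision: drop terms made only of digits unless asked to keep them.
            if keep_numbers or not part.isdigit():
                terms.append(part)
    return terms

if __name__ == "__main__":
    print(tokenize("Web-based systems: 48 words!"))
```

Exposing each decision as a flag makes it easy to compare how the term set changes under different choices.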

Information in Terms
Information entropy quantifies information content:
$E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$
where there is a set of $\sigma$ terms and $p_i$ is the relative frequency of the $i$-th term (as a fraction of all term occurrences).
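
A small sketch of the entropy calculation, assuming the standard Shannon form with base-2 logarithms (so the result is in bits) and relative frequencies estimated from term counts. The function name `term_entropy` is hypothetical.

```python
import math
from collections import Counter

def term_entropy(terms):
    """Shannon entropy, in bits, of the term distribution in a list of terms."""
    counts = Counter(terms)
    total = sum(counts.values())
    # Sum p * log2(p) over the distinct terms, where p is the relative frequency.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

if __name__ == "__main__":
    print(term_entropy(["theory", "practice", "theory", "course"]))
```

A text that repeats one term has zero entropy; a text whose terms are all equally frequent has the maximum entropy for that vocabulary size.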

Term Distribution
Zipf’s Law approximates the distribution of term frequencies in a text. The frequency of the $i$-th most frequent term is $1/i^{\theta}$ times that of the most frequent term, where $1.5 < \theta < 2.0$.
[Plot axes: Freq, Terms]
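
A rough sketch for comparing observed term counts with a Zipf-style prediction, assuming the usual form in which the i-th most frequent term occurs 1/i^θ times as often as the most frequent one. The names `ranked_frequencies` and `zipf_frequency`, and the default `theta`, are illustrative choices.

```python
from collections import Counter

def ranked_frequencies(terms):
    """Observed term counts, sorted from most to least frequent."""
    return [count for _, count in Counter(terms).most_common()]

def zipf_frequency(rank, top_frequency, theta=1.75):
    """Predicted frequency of the term at a given rank (rank 1 is the most frequent)."""
    if rank < 1:
        raise ValueError("rank must be at least 1")
    return top_frequency / rank ** theta

if __name__ == "__main__":
    observed = ranked_frequencies("the cat and the dog and the bird".split())
    for rank, count in enumerate(observed, start=1):
        print(rank, count, zipf_frequency(rank, observed[0]))
```

On a text this short the fit is meaningless; the comparison only becomes informative on a large collection.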

Stop Words
Stop words are words that either appear so frequently that they do not distinguish documents (e.g., “www”) or have more of a syntactic than a semantic role (e.g., “the”).
Advantage: filtering out stop words shrinks the document description and focuses attention on terms that convey more information.
Disadvantage: may reduce recall.
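
A minimal sketch of stop word filtering. The stop list here is a tiny illustrative set chosen for this example, not a standard list, and `remove_stop_words` is a hypothetical name.

```python
# Illustrative stop list only; real systems use much longer lists.
STOP_WORDS = frozenset(
    {"a", "and", "as", "be", "into", "is", "of", "such", "the", "through", "to", "will"}
)

def remove_stop_words(terms, stop_words=STOP_WORDS):
    """Keep only the terms that are not in the stop list (case-insensitive)."""
    return [term for term in terms if term.lower() not in stop_words]

if __name__ == "__main__":
    print(remove_stop_words("The purpose of the course is to teach theory".split()))
```

The recall risk is visible here: a query made up entirely of stop words would match nothing after filtering.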

Vocabulary Size
Heaps’ Law models the size of the vocabulary as $V = K n^{\beta}$, a function of the size of the text ($n$), a baseline ($10 < K < 100$), and a growth factor ($0 < \beta < 1$).
[Plot axes: Voc, Text Size]
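
A short sketch of the vocabulary-size model, assuming the usual power-law form V = K · n^β. The names `heaps_vocabulary` and `vocabulary_growth` and the default values of `k` and `beta` are illustrative, picked from inside the ranges given above.

```python
def heaps_vocabulary(n, k=50.0, beta=0.5):
    """Predicted number of distinct terms in a text of n terms."""
    return k * n ** beta

def vocabulary_growth(terms):
    """Observed vocabulary size after each successive term of the text."""
    seen = set()
    sizes = []
    for term in terms:
        seen.add(term)
        sizes.append(len(seen))
    return sizes

if __name__ == "__main__":
    print(heaps_vocabulary(10_000))
    print(vocabulary_growth("the course the theory the practice".split()))
```

Plotting `vocabulary_growth` against text position for a large collection gives the sublinear curve that the model describes.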

Noun Groups
Further focus the term set by filtering for particular subsets selected manually (e.g., classifications or index terms).
- Discard terms that are not nouns*.
- Fix spelling errors.
- Use a thesaurus to combine similar words.
*From the Google web site: the top 20 gaining queries of 2002 contain only nouns.

Stemming
Grammars permit minor modifications of terms that change their type rather than their meaning, e.g., plurals, gerunds, some prefixes and suffixes…
Stemming reduces a term to just its core (stem).
Advantages: reduces the set of terms; combines terms with the same meaning.
Disadvantage: may reduce precision by incorrectly combining meanings (e.g., “skies” and “ski”).
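
A deliberately naive suffix-stripping sketch, included only to show how stemming can merge unrelated terms. It is not the Porter algorithm or any other published stemmer; the suffix list and the name `naive_stem` are assumptions made for this example.

```python
# Illustrative suffixes, tried in this order; real stemmers use far richer rules.
SUFFIXES = ("ing", "es", "s")

def naive_stem(term, min_stem_length=3):
    """Strip the first matching suffix if enough of the term remains."""
    term = term.lower()
    for suffix in SUFFIXES:
        stem = term[: -len(suffix)]
        if term.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    return term

if __name__ == "__main__":
    for word in ["topics", "engineering", "skies", "ski"]:
        print(word, "->", naive_stem(word))
```

With these rules "skies" and "ski" end up with the same stem, which is the kind of incorrect merge the slide warns about.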

Putting it together: Document
The purpose of the course is to teach theory and practice underlying the construction of Web based information systems. As such, the course will devote equal time to information retrieval and software engineering topics. The theory will be put into practice through a semester long team programming project.
(48 words, 307 characters)

Putting it together: Stop Word Removal
purpose course teach theory practice underlying construction Web based information course devote equal time information retrieval software engineering topics theory practice semester long team programming project
(26 words, 213 characters)

Putting it together: Only Nouns
purpose course theory practice construction Web information course equal time information retrieval software engineering topics theory practice semester team programming project
(21 words, 179 characters)

Putting it together: Stemming & Alphabetizing
construct course course engineer equal informat informat practice practice program project purpose retrieve semester software team theory theory time topic web
(21 words, 161 characters)

Indexing
Terms remaining after document processing must be stored to facilitate retrieval. Typically, they are stored in an inverted index. More on that later…
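
As a hedged preview, an inverted index can be sketched as a mapping from each term to the documents that contain it. The function name `build_inverted_index` and the toy document identifiers are assumptions for illustration; production indexes also store positions, frequencies, and compressed postings.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it.

    docs: dict of document id -> list of already-processed terms.
    """
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

if __name__ == "__main__":
    docs = {
        1: ["course", "theory", "practice"],
        2: ["theory", "web", "information"],
    }
    index = build_inverted_index(docs)
    print(index["theory"])
```

Looking up a query term is then a single dictionary access, rather than a scan over every document.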