

Properties of Text: CS336 Lecture 3

2 Generating Document Representations
Want to generate representations automatically, with little human intervention
Use significant terms to build representations
  –referred to as indexing

3 Indexes
Indexing choices (there is no “right” answer)
What is a word?
  –Embedded punctuation (e.g., DC-10, long-term)
  –Case folding (e.g., New vs new, Apple vs apple)
  –Stopwords (e.g., the, a, its)
  –Morphology (e.g., computer, computers, computing, computed)

4 Conclusions
Text is the main form of communicating knowledge
Documents have syntax, structure, and semantics
Metadata: information about data
Formats of text
Modeling natural language
  –Statistical properties: entropy, distribution of symbols
  –Structural properties, e.g., morphology
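The entropy of a symbol distribution can be made concrete with a few lines of Python. This is a minimal sketch, assuming the symbols are individual characters and using the standard Shannon formula H = -Σ p(s) log2 p(s); the function name `symbol_entropy` is made up for illustration.

```python
import math
from collections import Counter

def symbol_entropy(text: str) -> float:
    """Shannon entropy, in bits per symbol, of the character distribution of text."""
    counts = Counter(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum -p * log2(p) over every distinct symbol that occurs in the text.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A text made of one repeated symbol has zero entropy, while a text in which four symbols are equally likely has two bits per symbol.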

5 Generating Document Representations
Use significant terms to build representations of documents
  –referred to as indexing
Manual indexing: professional indexers
  –Usually from a controlled vocabulary
  –Typically phrases
Automatic indexing: machine selects
  –Machine selects the non-objective terms
  –Terms can be single words or phrases

6 Indexes
Indexing choices (there is no “right” answer)
  –What is a word?
      Embedded punctuation (e.g., DC-10, long-term)
      Case folding (e.g., New vs new, Apple vs apple)
      Stopwords (e.g., the, a, its)
      Morphology (e.g., computer, computers, computing, computed)
Three basic steps:
  1. Lexical analysis
  2. Stopword removal
  3. Morphological analysis/stemming

7 Lexical analysis
Turn a stream of characters into a stream of words
What is a word?
  –Strings separated by white space / punctuation?
      Languages like Chinese need segmentation
      Record positional information for proximity operators
  –Embedded punctuation?
  –Case sensitive?
  –Numbers, dates?
  –Other special cases?
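A minimal sketch of lexical analysis with positional information, assuming the simplest definition of a word: a maximal run of ASCII letters or digits, with everything else treated as a separator. The names `WORD_RE`, `tokenize`, and `within` are illustrative, and this definition would not segment Chinese text.

```python
import re

# Assumption: a word is a maximal run of ASCII letters/digits; all else separates words.
WORD_RE = re.compile(r"[A-Za-z0-9]+")

def tokenize(text):
    """Return (position, token) pairs; position is the token's ordinal in the stream."""
    return [(pos, m.group()) for pos, m in enumerate(WORD_RE.finditer(text))]

def within(tokens, a, b, k):
    """Proximity operator: True if terms a and b occur within k word positions."""
    pos_a = [p for p, t in tokens if t == a]
    pos_b = [p for p, t in tokens if t == b]
    return any(abs(x - y) <= k for x in pos_a for y in pos_b)
```

Storing the ordinal position with each token is what lets a later query ask whether two terms occur near each other, at the cost of a larger index.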

8 Include hyphens?
e.g., long-term, DC-10
Break into distinct terms
  –long and term
Single term with hyphen
  –Chemical Abstracts Service: treats as a single term
  –LEXIS/NEXIS: breaks apart into two terms if they occur in a title or abstract
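The two hyphen policies can be compared side by side. The sketch below is an assumption-laden illustration, not how any of the named services implements it: `tokenize_hyphen` is a hypothetical helper, and a word is taken to be runs of ASCII letters or digits joined by single hyphens.

```python
import re

# Runs of letters/digits, optionally joined by single hyphens ("long-term", "DC-10").
HYPHEN_WORD_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

def tokenize_hyphen(text, split_hyphens):
    """Tokenize, either keeping hyphenated forms whole or breaking them apart."""
    tokens = HYPHEN_WORD_RE.findall(text)
    if split_hyphens:
        # Policy 1: break each hyphenated form into distinct terms.
        tokens = [part for tok in tokens for part in tok.split("-")]
    # Policy 2 (split_hyphens=False): keep the hyphenated form as a single term.
    return tokens
```

Splitting helps a query for "term" match "long-term", but it turns "DC-10" into the much less specific pair "DC" and "10".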

9 Punctuation and Case
Punctuation is sometimes important
  –“command.com”
  –“OS/2”
Case folding: convert to lower case or not
  –Smith vs smith
  –Apple vs apple
  –New vs new
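One way to keep meaningful punctuation while still dropping sentence punctuation is sketched below. It assumes that "." and "/" matter only when they sit between letters or digits, and it makes case folding a switch; `PUNCT_WORD_RE` and `tokens_with_case_option` are illustrative names.

```python
import re

# Assumption: '.' and '/' are kept only between alphanumerics ("command.com", "OS/2").
PUNCT_WORD_RE = re.compile(r"[A-Za-z0-9]+(?:[./][A-Za-z0-9]+)*")

def tokens_with_case_option(text, fold_case=True):
    """Tokenize, keeping embedded '.' and '/', and optionally fold to lower case."""
    tokens = PUNCT_WORD_RE.findall(text)
    return [t.lower() for t in tokens] if fold_case else tokens
```

With folding on, "Apple" and "apple" become the same index term, which improves matching but loses the distinction between the company and the fruit.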

10 Include numbers?
Numbers are not good discriminators
But … important in some contexts
Systems usually allow tokens to include digits but not to begin with one
  –So B6 (vitamin) but not 6
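The rule that a token may contain digits but not begin with one fits in a single regular expression. This is a sketch under the assumption that tokens are ASCII letters and digits only; `INDEX_TOKEN_RE` and `index_tokens` are made-up names.

```python
import re

# A token starts with a letter and may continue with letters or digits.
# The lookbehind prevents a match from starting inside a digit-led string such as "6B".
INDEX_TOKEN_RE = re.compile(r"(?<![A-Za-z0-9])[A-Za-z][A-Za-z0-9]*")

def index_tokens(text):
    """Return tokens that may contain digits but do not begin with one."""
    return INDEX_TOKEN_RE.findall(text)
```

For example, `index_tokens("Take 6 mg of vitamin B6 daily")` keeps "B6" and drops the bare "6".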

11 Stop lists
List of terms which are not included in an index
Why use stop words?
  –Luhn (1957) observed that many of the most frequently occurring words are worthless as index terms
  –The 10 most frequently occurring terms account for 20-30% of the word occurrences (Zipf)
  –Eliminating them saves index space and computation time
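The share of word occurrences covered by the most frequent terms can be measured on any tokenized collection. The sketch below assumes the text has already been split into tokens; `top_k_share` is a hypothetical helper, and the value it returns depends entirely on the collection supplied.

```python
from collections import Counter

def top_k_share(tokens, k=10):
    """Fraction of all token occurrences accounted for by the k most frequent terms."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # most_common(k) returns the k (term, count) pairs with the highest counts.
    return sum(n for _, n in counts.most_common(k)) / total
```

The same count table also gives a first candidate stop list: the terms returned by `most_common` are the ones whose removal saves the most index space.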

12 Stop list
Typically the most frequently occurring words
  –a, about, at, and, etc, it, is, the, or, …
Among the top 200 are words such as “time”, “war”, “home”, etc.
  –May be collection specific
      “computer, machine, program, source, language” in a computer science collection
Removal can be problematic (e.g., “Mr. The”, “and-or gates”)
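Stopword removal, including a collection-specific extension, can be sketched as a set lookup. The two sets below contain only the words given as examples on the slide, so they are far shorter than a real stop list; `remove_stopwords` is an illustrative name.

```python
# Stop words taken from the slide's examples; a real list would be longer.
GENERAL_STOPWORDS = {"a", "about", "at", "and", "etc", "it", "is", "the", "or"}
# Collection-specific additions, as for a computer science collection.
CS_STOPWORDS = {"computer", "machine", "program", "source", "language"}

def remove_stopwords(tokens, stopwords):
    """Drop every token whose lower-cased form is in the stop list."""
    return [t for t in tokens if t.lower() not in stopwords]
```

The last case on the slide shows up directly: `remove_stopwords(["and", "or", "gates"], GENERAL_STOPWORDS)` leaves only "gates", so a query about and-or gates loses the words that made it specific.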

13 Stop lists
Commercial systems use only a few stop words
ORBIT uses only 8: “and, an, by, from, of, the, with”
  –patents, scientific and technical (sci-tech) information, trademarks and Internet domain names

14 Special Cases?
Name recognition
  –People’s names: “Bill Clinton”
  –Company names: IBM and Big Blue
  –Places: New York City, NYC, the Big Apple
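A very simple form of name recognition is a lookup table of known phrases. The sketch below assumes a hand-built alias table containing only the slide's examples and maps each alias to one canonical name; `ALIASES` and `tag_names` are hypothetical, and real name recognition needs far more than a table.

```python
# Hypothetical alias table built from the slide's examples; keys are lower-cased token tuples.
ALIASES = {
    ("bill", "clinton"): "Bill Clinton",
    ("ibm",): "IBM",
    ("big", "blue"): "IBM",
    ("new", "york", "city"): "New York City",
    ("nyc",): "New York City",
    ("the", "big", "apple"): "New York City",
}
MAX_ALIAS_LEN = max(len(key) for key in ALIASES)

def tag_names(tokens):
    """Replace known name phrases with a canonical name, preferring the longest match."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest possible phrase starting at position i first.
        for n in range(min(MAX_ALIAS_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in ALIASES:
                out.append(ALIASES[key])
                i += n
                break
        else:
            # No alias starts here; keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out
```

Matching names before stopword removal matters here, because "the Big Apple" contains a stop word that would otherwise be discarded.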

Stemming
Commonly used to conflate morphological variants
  –combine non-identical words referring to the same concept
      compute, computation, computer, …
Stemming is used to:
  –Enhance query formulation (and improve recall) by providing term variants
  –Reduce the size of index files by combining term variants into a single index term
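Conflation by stemming can be illustrated with a crude suffix stripper. This is a sketch only: the suffix list is an assumption chosen to cover the slide's examples, it is not the Porter algorithm or any other published stemmer, and `crude_stem` and `conflate` are made-up names.

```python
# Illustrative suffix list, checked in order with longer suffixes first; not the Porter algorithm.
SUFFIXES = ("ation", "ing", "ers", "er", "ed", "es", "s", "e")

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, provided at least min_stem characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

def conflate(terms):
    """Group terms by stem, giving one index entry per stem."""
    groups = {}
    for term in terms:
        groups.setdefault(crude_stem(term), []).append(term)
    return groups
```

Under these rules compute, computation, and computer all reduce to the stem "comput", so they share one index entry and a query on any of them retrieves documents containing the others.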