
1 Properties of Text (CS336 Lecture 3)

2 Information Retrieval
Searching unstructured documents
Typically text
  – Newspaper articles
  – Web pages
Other documents also possible
  – Images
  – Music
  – etc.

3 What does an IR system do?
Indexing: generate a representation of each document
  – Want to generate it automatically, with little human intervention
  – Use significant terms to build representations
Process query
  – Representation of the ‘information need’
  – If documents are processed specially, queries must be processed the same way
  – Possibly weight query words
Match queries and documents
  – Find relevant documents
Rank and sort documents
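The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own implementation: the function names (`tokenize`, `build_index`, `search`), the toy documents, and the scoring rule (sum of term frequencies of matching query terms) are all assumptions chosen for brevity.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Shared processing for documents and queries: lowercase, split on whitespace."""
    return text.lower().split()


def build_index(docs):
    """Indexing: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, query):
    """Process the query, match it against the index, then rank the matches."""
    scores = Counter()
    for term in tokenize(query):  # the query is processed like the documents
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf  # hypothetical score: summed term frequency
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy collection (invented for this example).
docs = {
    "d1": "web pages and newspaper articles",
    "d2": "newspaper articles about music",
    "d3": "images and music",
}
index = build_index(docs)
results = search(index, "Newspaper music")
print(results)
```

Note that the query "Newspaper music" is lowercased by the same `tokenize` function used at indexing time, which is what lets "Newspaper" match "newspaper".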

4 Text
No computer understanding of document or query text
Uses “bag of words” approach
  – Pays little heed to inter-word dependencies: syntax, semantics
  – The bag does characterize the document
  – Not perfect: words are ambiguous, are used in different forms, or are used synonymously
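A small sketch of the bag-of-words idea, assuming lowercase whitespace tokenization (the helper name `bag_of_words` and the example sentences are hypothetical). It shows what the approach discards: two sentences with opposite meanings get identical representations because word order is ignored.

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercase whitespace-separated tokens, ignoring word order."""
    return Counter(text.lower().split())


a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

print(a)       # term counts for the first sentence
print(a == b)  # the two bags are equal even though the meanings differ
```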

5 Document
Loosely defined, denotes a single unit of information
Can be any logical unit
  – an article,
  – a book, or
  – a manual
Can be any physical unit
  – a file,
  – an email, or
  – a Web page

6 Document has:
Syntax: dictates structure
  – Implicit, or expressed in a declarative language (e.g., TeX)
  – Powerful languages: (i) easier to parse, (ii) difficult to convert to other formats
  – Generic languages are better (interchange and flexibility)
  – Trend: languages that describe structure, format, and semantics while remaining readable by both humans and computers, e.g., SGML, XML
Semantics
  – Semantics of natural-language text is not easy for a computer to understand
Information about itself, i.e., meta-data

7 Parsing
Normalizing format
  – Process different document formats: PDF, DOC, HTML
  – Can be very noisy; need a robust parser
  – Brin, S., Page, L. (1998) The Anatomy of a Large-Scale Hypertextual Web Search Engine
  – http://www-db.stanford.edu/pub/papers/google.pdf
Word segmentation
Word normalization

8 Text Representation
Text can be represented as
  – a string
  – words (statistical IR)
  – linguistic units (e.g., nouns, phrases)
Simple representations (single terms, or individual words) perform surprisingly well
  – Many previous studies showed that phrase indexing performs worse than word indexing
  – Phrases may be too specific (different phrases can have the same meaning)

9 Word Segmentation
English is easy
  – Space character? Well…
  – It is said that Google is indexing not just words, but common queries too, e.g., “mp3 downloads”
Other languages present problems
  – Chinese: no space character (http://www.sighan.org/bakeoff2003/)
  – Japanese: four alphabets (Romaji, Hiragana, Katakana, Kanji)
  – German, Finnish, URLs, etc.: compound words, e.g., “Donaudampfschiffahrtsgesellschaftsoberkapitän”
  – Arabic, Latvian, etc.: large number of cases to normalize
European languages
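The contrast between English and Chinese can be seen directly: splitting on whitespace yields words for English but returns a Chinese phrase as one unsegmented token. The sketch below also shows overlapping character bigrams, a common dictionary-free fallback for text without spaces. That fallback is an assumption added here for illustration and is not mentioned in the slides; the function names and the sample string are likewise hypothetical.

```python
def whitespace_tokens(text):
    """Split on runs of whitespace; adequate only when words are space-delimited."""
    return text.split()


def char_bigrams(text):
    """Overlapping two-character windows, usable when there are no word boundaries."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


english = whitespace_tokens("information retrieval")
chinese = whitespace_tokens("信息检索")  # "information retrieval" in Chinese
bigrams = char_bigrams("信息检索")

print(english)  # two tokens
print(chinese)  # a single token: whitespace splitting finds no boundaries
print(bigrams)  # three overlapping bigrams
```

Bigrams over-generate (the middle bigram spans a word boundary), which is one reason dedicated segmenters exist.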

10 Indexes
Indexing choices (there is no “right” answer)
What is a word?
  – Embedded punctuation (e.g., DC-10, long-term)
      Break into distinct terms: long-term => long and term
      Single term with hyphen
        – Chemical Abstracts Service: treats as a single term
        – LEXIS/NEXIS: breaks into 2 terms if they occur in a title or abstract
  – Punctuation and case folding
      Punctuation is sometimes important
        – “command.com”
        – “OS/2”
      Case folding: convert to lower case or not
        – Smith vs smith
        – Apple vs apple
        – New vs new
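These choices can be exposed as switches on a tokenizer. The sketch below is one possible arrangement, with hypothetical parameter names (`split_hyphens`, `fold_case`) and deliberately simple regular expressions. It does not handle cases such as “command.com” or “OS/2”, where the dot or slash would need to be kept.

```python
import re

# Runs of ASCII letters and digits; hyphens act as separators.
SPLIT_PATTERN = re.compile(r"[A-Za-z0-9]+")
# The same runs, optionally joined by single internal hyphens.
KEEP_PATTERN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize(text, split_hyphens=True, fold_case=True):
    """Tokenize with two indexing choices made explicit as parameters."""
    if fold_case:
        text = text.lower()
    pattern = SPLIT_PATTERN if split_hyphens else KEEP_PATTERN
    return pattern.findall(text)


split_folded = tokenize("long-term DC-10")
kept_unfolded = tokenize("long-term DC-10", split_hyphens=False, fold_case=False)
folded_names = tokenize("Apple apple")

print(split_folded)   # hyphenated terms broken apart and lowercased
print(kept_unfolded)  # hyphenated terms kept whole, case preserved
print(folded_names)   # case folding merges the company and the fruit
```

Whichever setting is chosen, it has to be applied to queries as well as documents, or the terms will not match.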

11 Indexes
Indexing choices (there is no “right” answer)
What is a word?
  – Stopwords (e.g., the, a, its)
  – Morphology (e.g., computer, computers, computing, computed)
  – Numbers? Not good discriminators
      But … important in some contexts
      Usually systems allow tokens to include digits but not to begin with one
        – So B6 (vitamin) but not 6
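The stopword and digit rules can be sketched as follows. The stopword set contains only the three examples from the slide, and the token pattern (a letter followed by letters or digits) is one assumed reading of "may include digits but not begin with one". Morphology (stemming) is left out of this sketch; `index_terms` is a hypothetical name.

```python
import re

# Only the examples given on the slide; real stopword lists are much longer.
STOPWORDS = {"the", "a", "its"}

# A token starts with a letter and may continue with letters or digits.
# The word boundaries stop "6mg" from yielding "mg".
TOKEN = re.compile(r"\b[A-Za-z][A-Za-z0-9]*\b")


def index_terms(text):
    """Return lowercase tokens, dropping stopwords and digit-initial tokens."""
    tokens = (match.lower() for match in TOKEN.findall(text))
    return [token for token in tokens if token not in STOPWORDS]


terms = index_terms("The vitamin B6 dose is 6 mg")
print(terms)  # "b6" is kept; "6" and "the" are dropped
```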

12 Generating Document Representations
Indexing: use significant terms to build representations of documents
Manual indexing: professional indexers
  – Usually from a controlled vocabulary
  – Typically phrases
Automatic indexing: machine selects
  – Machine selects the non-objective terms
  – Terms can be single words or phrases

