CS336 Lecture 4: Properties of Text

2 Stop list
Typically the most frequently occurring words
–a, about, at, and, etc, it, is, the, or, …
–among the top 200 are words such as "time", "war", "home", etc.
–may be collection specific: "computer", "machine", "program", "source", "language" in a computer science collection
Removal can be problematic (e.g. "Mr. The", "and-or gates")
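
In practice stop-list removal is a simple set-membership filter; a minimal sketch (this stop list is a tiny illustrative sample, not a real system's list):

    STOP_WORDS = {"a", "about", "at", "and", "etc", "it", "is", "the", "or"}

    def remove_stopwords(tokens):
        """Drop tokens that appear in the stop list (case-insensitive)."""
        return [t for t in tokens if t.lower() not in STOP_WORDS]

    # remove_stopwords("the cat sat at the door".split()) -> ['cat', 'sat', 'door']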

3 Stop lists
Commercial systems use only a few stop words
–ORBIT uses only 8: "and, an, by, from, of, the, with"
–(ORBIT covers patents, scientific and technical (sci-tech) information, trademarks, and Internet domain names)

4 Special Cases?
Name recognition
–People's names: "Bill Clinton"
–Company names: IBM & "Big Blue"
–Places: New York City, NYC, "the Big Apple"

5 Text
Goal: identify what can be inferred about text based on
–structural features
–statistical features of language
Statistical language characteristics can be used to
–convert text to a form more easily manipulated by computer
–reduce storage space and processing time
–store and process in encrypted form (text compression)

6 Zipf's Law
p_r = (frequency of the word of rank r) / N
–the probability that a randomly chosen word occurrence is the word of rank r
–N = total word occurrences
–for D distinct words, Σ_{r=1..D} p_r = 1
–r · p_r ≈ A, where A ≈ 0.1
–i.e. the rank of a word is inversely proportional to its frequency
The probability of occurrence of words or other items starts high and tapers off: a few occur very often while many others occur rarely.
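
A quick empirical check of Zipf's law on any tokenized corpus (a minimal sketch; the function name is illustrative):

    from collections import Counter

    def zipf_check(tokens, top=10):
        """Print rank r, frequency f(r), and r * p_r for the most frequent
        words; Zipf's law predicts r * p_r stays near a constant A (~0.1)."""
        n = len(tokens)
        for r, (word, f) in enumerate(Counter(tokens).most_common(top), start=1):
            print(f"{r:3d}  {word:12s} f={f:7d}  r*p_r={r * f / n:.3f}")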

7 Employing Zipf's Law
Identify significant words and ineffectual words
–A few words occur very often
  the 2 most frequent words can account for 10% of occurrences
  the top 6 words for 20%
  the top 50 words for 50%
–Many words are infrequent

8 Most frequent words (N ≈ 1,000,000)
 r   Word    f(r)     r*f(r)/N
 1   the     69,971   0.070
 2   of      36,411   0.073
 3   and     28,852   0.087
 4   to      26,149   0.105
 5   a       23,237   0.116
 6   in      21,341   0.128
 7   that    10,595   0.074
 8   is      10,099   0.081
 9   was      9,816   0.088
10   he       9,543   0.095

9 Employing Zipf's Law
Estimate technical needs
–Estimate the storage space saved by excluding stop words from the index
  the 10 most frequently occurring words in English make up about 25%-30% of text
  deleting very low frequency words from the index also yields a large saving
–Estimate the number of words n(1) that occur once, n(2) that occur twice, etc.
  words that occur at most twice comprise about 2/3 of the vocabulary (see the estimate below)
–Estimate the size of a term's inverted index list
Zipf's law is quite accurate except at very high and very low ranks
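
The two-thirds figure follows from a standard Zipf-based estimate (a sketch; the derivation is implied by, not spelled out on, the slide). Under Zipf's law, the number of distinct words occurring exactly n times is approximately

    I_n ≈ D / (n(n+1)),  where D is the number of distinct words

so I_1 ≈ D/2, I_2 ≈ D/6, and I_1 + I_2 ≈ D/2 + D/6 = 2D/3: about two thirds of the vocabulary occurs at most twice.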

10 Modeling Natural Language
Length of the words
–defines the total space needed for the vocabulary (each character requires 1 byte)
–Heaps' Law: length increases logarithmically with text size

11 Vocabulary Growth
New words occur less frequently as the collection grows
Empirically t = k * N^β, where
–t is the number of unique words and N the text size in words
–k and β are constants; typically k is between 10 and 100 and β between 0.4 and 0.6
As the total text size grows, the predictions of Heaps' Law become more accurate
Sublinear growth rate
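
A minimal sketch of fitting the Heaps' law constants to measurements of vocabulary size at several collection sizes (the function name and sample numbers are illustrative):

    import math

    def fit_heaps(text_sizes, vocab_sizes):
        """Fit t = k * N**beta by least squares in log space:
        log t = log k + beta * log N. Returns (k, beta)."""
        xs = [math.log(n) for n in text_sizes]
        ys = [math.log(t) for t in vocab_sizes]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
                / sum((x - mx) ** 2 for x in xs))
        k = math.exp(my - beta * mx)
        return k, beta

    # fit_heaps([1e4, 1e5, 1e6], [3000, 12000, 48000]) -> beta comes out ~0.60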

12 Information Theory
Shannon studied the theoretical limits of data compression and transmission rate
–"…problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point."
–compression limits are given by entropy (H)
–transmission limits are given by channel capacity (C)
Many language tasks have been formulated as a "noisy channel" problem: determine the most likely input given the noisy output
–OCR
–speech recognition
–machine translation
–etc.

13 Shannon Game
How should we complete the following?
–The president of the United States is George W. …
–The winner of the $10K prize is …
–Mary had a little …
–The horse raced past the barn … (a period? etc.)

14 Information Theory
The information content of a message depends on both
–the receiver's prior knowledge
–the message itself
i.e. on how much of the receiver's uncertainty (entropy) is reduced, and on how predictable the message is

15 Information Theory
Think of information content, H, as a measure of our ability to guess the rest of a message given only a portion of it
–if we can predict it with probability 1, the information content is zero
–if we can predict it with probability 0, the information content is infinite
–H(p) = -log2 p
With logs in base 2, the unit of information content (entropy) is 1 bit
–if a message is a priori predictable with p = 0.5, its information content is -log2(1/2) = 1 bit

16 Information Theory
Given n messages, the average or expected information content gained from receiving one of them is
H = - Σ_i p_i log2 p_i
where the sum runs over the symbols of the alphabet and p_i is the probability of symbol i's appearance (freq_i / total occurrences)
–The amount of information in a message is related to the distribution of symbols in the message.
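
A minimal sketch of this computation, treating any sequence of symbols (e.g. the characters of a text) as the message:

    import math
    from collections import Counter

    def entropy(symbols):
        """H = -sum_i p_i * log2(p_i), with p_i estimated as freq_i / total."""
        counts = Counter(symbols)
        total = sum(counts.values())
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    # entropy("abracadabra") -> ~2.04 bits per character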

17 Entropy
Average entropy is maximized when messages are equally probable
–e.g. average entropy of characters, assuming equal probabilities: for the 26-letter alphabet, H = -log2(1/26) = log2 26 ≈ 4.7 bits
–with actual letter probabilities, H ≈ 4.14 bits
–with bigram probabilities, H falls to ≈ 3.56 bits
–people predict the next letter with ~40% accuracy, giving H ≈ 1.3 bits
Better models reduce the relative entropy
In text compression, entropy (H) sets the limit on how much the text can be compressed
–the more regularity (i.e. the less uncertainty) in a data sequence, the more it can be compressed

18 Information Theory
Let t = the number of unique words in a vocabulary
–for t = 10,000 equally probable words, H = log2 10,000 ≈ 13.3 bits per word
Information theory has been used for
–compression
–term weighting
–evaluation measures

20 Stemming
Commonly used to conflate morphological variants
–combine non-identical words referring to the same concept: compute, computation, computer, …
Stemming is used to:
–enhance query formulation (and improve recall) by providing term variants
–reduce the size of index files by combining term variants into a single index term

21 Stemmer correctness
Two ways to be incorrect
–Under-stemming: prevents related terms from being conflated
  stemming "consideration" to "considerat" prevents conflating it with "consider"
  under-stemming affects recall
–Over-stemming: terms with different meanings are conflated
  "considerate", "consider", and "consideration" should not be stemmed to "con" and thereby conflated with "contra", "contact", etc.
  over-stemming can reduce precision

22 The Concept of Relevance
Relevant => does the document fulfill the query?
Relevance of a document D to a query Q is subjective
–different users will make different judgments
–the same user may judge differently at different times
–the degree of relevance of different documents will vary
In IR system evaluation it is assumed that:
–a subset of the database (DB) documents is relevant
–a document is either relevant or not relevant

23 Recall and precision
The most common measures for evaluating IR systems
–Recall: % of relevant documents retrieved; measures the ability to get ALL of the good documents
–Precision: % of retrieved documents that are in fact relevant; measures the amount of junk included in the results
Ideal retrieval results:
–100% recall (all good documents are retrieved)
–100% precision (no bad document is retrieved)
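
A minimal sketch, treating the retrieved and relevant documents as Python sets of document ids:

    def precision_recall(retrieved, relevant):
        """Precision = hits / |retrieved|; recall = hits / |relevant|,
        where hits is the size of the intersection of the two sets."""
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    # precision_recall({1, 2, 3, 4}, {2, 4, 9}) -> (0.5, 0.667)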

24 Evaluating stemmers
In information retrieval, stemmers are evaluated by their
–effect on retrieval: improvements in recall or precision
–compression rate
–not by linguistic correctness

25 Stemmers
4 basic types
–affix removal stemmers
–dictionary lookup stemmers
–n-gram stemmers
–corpus analysis stemmers
Studies have shown that stemming has a positive effect on retrieval
–performance of the different algorithms is comparable
–results vary between test collections

26 Affix removal stemmers
Remove suffixes and/or prefixes, leaving a stem
–In English, remove suffixes (what might you remove if you were designing a stemmer?)
–In other languages, e.g. Hebrew, remove both prefix and suffix: keshehalachnu -> halach, nelechna -> halach
–Some languages are more difficult, e.g. Arabic
–iterative: consideration => considerat => consider
–longest match: use a set of stemming rules arranged on a 'longest match' principle (Lovins)

27 A simple stemmer (Harman)
if word ends in "ies" but not "eies" or "aies" then
  "ies" -> "y"
else if word ends in "es" but not "aes", "ees" or "oes" then
  "es" -> "e"
else if word ends in "s" but not "us" or "ss" then
  "s" -> NULL
endif
The algorithm changes:
–"skies" to "sky"
–"retrieves" to "retrieve"
–"doors" to "door"
–but not "corpus" or "wellness"
–"dies" to "dy"?
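
The three rules translate directly into code; a minimal Python sketch (the function name is illustrative):

    def s_stemmer(word):
        """Harman's three-rule 'S' stemmer; at most one rule fires per word."""
        if word.endswith("ies") and not word.endswith(("eies", "aies")):
            return word[:-3] + "y"   # skies -> sky (but also dies -> dy)
        if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
            return word[:-1]         # retrieves -> retrieve ("es" -> "e")
        if word.endswith("s") and not word.endswith(("us", "ss")):
            return word[:-1]         # doors -> door; corpus, wellness unchanged
        return word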

28 Stemming w/ Dictionaries
Avoid collapsing words with different meanings to the same root
–the word is looked up and replaced by its best stem
Typical stemmers consist of rules and/or dictionaries
–the simplest stemmer is "suffix s"
–the Porter stemmer is a collection of rules
–KSTEM uses lists of words plus rules for inflectional and derivational morphology

29 Stemming Examples
Original text: Document will describe marketing strategies carried out by U.S. companies for their agricultural chemicals, report predictions for market share of such chemicals, or report market statistics for agrochemicals, pesticide, herbicide, fungicide, insecticide, fertilizer, predicted sales, market share, stimulate demand, price cut, volume of sales
Porter stemmer: market strateg carr compan agricultur chemic report predict market share chemic report market statist agrochem pesticid herbicid fungicid insecticid fertil predict sale stimul demand price cut volum sale
KSTEM: marketing strategy carry company agriculture chemical report prediction market share chemical report market statistic agrochemic pesticide herbicide fungicide insecticide fertilizer predict sale stimulate demand price cut volume sale

30 n-grams
Fixed-length consecutive series of "n" characters
–Bigrams: sea colony -> (se ea co ol lo on ny)
–Trigrams: sea colony -> (sea col olo lon ony), or with word-boundary markers -> (#se sea ea# #co col olo lon ony ny#)
Conflate words based on overlapping series of characters
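
A minimal sketch that reproduces the examples above (pad=True adds the '#' word-boundary markers):

    def char_ngrams(text, n, pad=False):
        """Overlapping character n-grams per word; with pad=True, word
        boundaries are marked with '#'."""
        grams = []
        for w in text.lower().split():
            if pad:
                w = "#" + w + "#"
            grams.extend(w[i:i + n] for i in range(len(w) - n + 1))
        return grams

    # char_ngrams("sea colony", 2) -> ['se', 'ea', 'co', 'ol', 'lo', 'on', 'ny']
    # char_ngrams("sea colony", 3) -> ['sea', 'col', 'olo', 'lon', 'ony']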

31 Problems with Stemming
Lack of domain specificity and context can lead to occasional serious retrieval failures
Stemmers are often difficult to understand and modify
Sometimes too aggressive in conflation (over-stemming)
–e.g. "execute"/"executive", "university"/"universe", "policy"/"police", "organization"/"organ" are conflated by Porter
Miss good conflations (under-stemming)
–e.g. "European"/"Europe", "matrices"/"matrix", "machine"/"machinery" are not conflated by Porter
Stems that are not words are often difficult to interpret
–e.g. with Porter, "iteration" produces "iter" and "general" produces "gener"

32 Corpus-Based Stemming
Corpus analysis can improve or replace a stemmer
Hypothesis: word variants that should be conflated will co-occur in context
Modify the stem classes generated by a stemmer or by other "aggressive" techniques such as initial n-grams
–more aggressive classes mean fewer missed conflations
–prune a class by removing words that don't co-occur sufficiently often (see the sketch below)
Language independent
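
A minimal sketch of the pruning step; the Dice-coefficient co-occurrence score and the 0.1 threshold are illustrative assumptions, not the specific statistic used in the published work:

    def prune_stem_class(stem_class, doc_sets, threshold=0.1):
        """Keep a word only if it co-occurs (at the document level) with
        some other member of its class often enough; doc_sets maps each
        word to the set of documents it appears in."""
        def dice(a, b):
            da, db = doc_sets[a], doc_sets[b]
            return 2 * len(da & db) / (len(da) + len(db)) if (da or db) else 0.0

        return [w for w in stem_class
                if any(dice(w, v) >= threshold for v in stem_class if v != w)]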

33 Equivalence Class Examples
Some Porter stem classes for a WSJ database, refined through corpus analysis
[the example classes were shown as a figure on the slide and are not preserved in the transcript]

34 Corpus-Based Stemming Results
Both the Porter and KSTEM stemmers are improved slightly by this technique
The n-gram stemmer gives the same performance as the "linguistic" stemmers for
–English
–Spanish
–not shown to be the case for Arabic

35 Stemmer Summary
All automatic stemmers are sometimes incorrect
–over-stemming
–under-stemming
In general, stemming improves retrieval effectiveness
Stemmers may use varying levels of language-specific information
–morphological stemmers use dictionaries
–affix removal stemmers use information about prefixes, suffixes, etc.
–n-gram and corpus analysis methods can be used for different languages

36 Generating Document Representations
Use significant terms to build representations of documents
–referred to as indexing
Manual indexing: professional indexers
–assign terms from a controlled vocabulary
–typically phrases
Automatic indexing: the machine selects
–terms can be single words, phrases, or other features from the text of documents

37 Index Languages
The language used to describe documents and queries
Exhaustivity: the number of different topics indexed; completeness or breadth of indexing
–increased exhaustivity => higher recall, lower precision
–retrieved output size increases because documents are indexed by any remotely connected content information
Specificity: accuracy and detail of indexing
–increased specificity => higher precision, lower recall
When a document is represented by fewer terms, content may be lost; a query that refers to the lost content will fail to retrieve the document

38 Index Languages
Pre-coordinate indexing: combinations of terms (e.g. phrases) are used as an indexing label
Post-coordinate indexing: combinations are generated at search time
Faceted classification: groups terms into facets that describe the basic structure of a domain; less rigid than a predefined hierarchy
Enumerative classification: an alphabetic listing whose underlying order is less clear
–e.g. the Library of Congress class for "socialism, communism and anarchism" sits at the end of the schedule for the social sciences, after social pathology and criminology