7CCSMWAL Algorithmic Issues in the WWW

7CCSMWAL Algorithmic Issues in the WWW Lecture 6

Text Processing
See chapter 2 of Intro to IR (Introduction to Information Retrieval: Manning, Raghavan and Schütze; available online).

Text processing
Given a document or a collection of documents, what do we identify as index terms? We process the documents in four stages:
– Tokenization
– Token normalization
– Elimination of stop words
– Stemming

Tokenization
Convert a stream of characters (the text of the documents) into a stream of words (the candidate words to be adopted as index terms), i.e., identify the words (called tokens). The simplest way is to recognise spaces as word separators and remove punctuation characters.
E.g., input: “Friends, Romans, Countrymen, lend me your ears;”
Output: Friends Romans Countrymen lend me your ears
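
The space-as-separator approach can be sketched in a few lines of Python (a minimal illustration of the idea, not a production tokenizer; the function name is our own):

```python
import string

def tokenize(text):
    # Split on whitespace, then strip leading/trailing punctuation
    # from each token; drop anything that was pure punctuation.
    tokens = []
    for raw in text.split():
        token = raw.strip(string.punctuation)
        if token:
            tokens.append(token)
    return tokens

print(tokenize("Friends, Romans, Countrymen, lend me your ears;"))
```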

Problems with tokenization?
Tokenization is relatively easy in languages where word boundaries are marked by spaces. What about, e.g., East Asian languages?

Chinese and Japanese have no spaces between words (Sec. 2.2.1):
莎拉波娃现在居住在美国东南部的佛罗里达。
(Sharapova now lives in Florida, in the south-eastern United States.)
A unique tokenization is not always guaranteed. The problem is further complicated in Japanese, where multiple alphabets (Katakana, Hiragana, Kanji, Romaji) are intermingled and words are interchangeably written in two alphabets:
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
The end-user can express a query entirely in hiragana!

Tokenization
The major question in tokenization is: what are the correct tokens to use? The answer is language specific. Tricky cases:
– Apostrophes
– Hyphens (and white space)
– Case of the letters (lower and upper case)
– Domain-specific tokens
– HTML

Apostrophes
Consider the text: Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.
For “O’Neill”, which of the following is the desired tokenization? neill, oneill, o’neill, o’ neill, o neill
For “aren’t”: aren’t, arent, are n’t, aren t

Hyphens and white space
Hyphenation is used for various purposes:
– splitting words over lines
– splitting up vowels in words, e.g., co-education
– joining nouns as names, e.g., Hewlett-Packard
– as a copyediting device to show word grouping, e.g., the hold-him-back-and-drag-him-away manoeuvre
It is complex to handle hyphens automatically: the first two cases should be rejoined into one token, the last should be separated into words, and the name case is unclear. Splitting on white space can also split what should be regarded as a single token. This occurs most commonly with names (e.g., San Francisco, Los Angeles) and with phone numbers and dates (e.g., (800) 234-2333, 11 Mar 1983). Bad retrieval results may follow, e.g., if a search for York University mainly returns documents containing New York University.

Case of the letters
The case of letters is usually not important for the identification of index terms; normally all the text is converted to either lower or upper case. There are particular scenarios that might require the distinction to be made, e.g.:
– Windows (an operating system)
– General Motors (a company)
– Bush (a person)

Domain-specific tokens
For example:
– Programming languages: C++ and C#
– Aircraft names: B-52
– Television show names: M*A*S*H
– Email addresses: jblack@mail.yahoo.com
– Web URLs: http://stuff.big.com/new/
– Numeric IP addresses: 142.32.48.231
– Package tracking numbers: 1Z9999W99845399981

Tokenizing HTML
Should text in HTML commands not typically seen by the user be included as tokens? For example, words appearing in URLs, or in the “meta text” of images. The simplest approach is to exclude all HTML tag information (between “<” and “>”) from tokenization.
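
The “exclude everything between < and >” approach can be sketched with a regular expression (a minimal sketch; a real HTML tokenizer would also handle comments, scripts and malformed markup):

```python
import re

def strip_tags(html):
    # Replace everything between "<" and ">" with a space, the
    # simplest approach described above; the surviving text is
    # then ready for ordinary tokenization.
    return re.sub(r"<[^>]*>", " ", html)

text = strip_tags("<p>Friends, <b>Romans</b>, Countrymen</p>")
print(text.split())
```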

Token Normalization Elimination of stop words

Token normalization
Canonicalize tokens so that matches occur despite superficial differences in the character sequences of the tokens, by creating equivalence classes of terms. E.g., anti-discriminatory and antidiscriminatory are both mapped onto the term antidiscriminatory. This method can be extended to hand-constructed lists of synonyms, such as car and automobile.
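
A minimal sketch of equivalence classing in Python; the function name and the tiny synonym table are hypothetical examples, not from the source:

```python
def normalize(token, synonyms=None):
    # Map a token to the canonical form of its equivalence class:
    # lower-case it, drop hyphens, then apply an optional
    # hand-constructed synonym table.
    term = token.lower().replace("-", "")
    if synonyms:
        term = synonyms.get(term, term)
    return term

synonyms = {"automobile": "car"}          # hypothetical hand-built list
print(normalize("Anti-Discriminatory"))   # antidiscriminatory
print(normalize("Automobile", synonyms))  # car
```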

Elimination of stop words
Stop words are words which are too frequent among the documents in the collection, and hence are not good discriminators. They are normally filtered out as potential index terms. Articles, prepositions, and conjunctions are natural candidates for a stop list, e.g., a, the, and, in, but, who, that, or.
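
Filtering against a stop list is straightforward; a minimal sketch using the example stop words above:

```python
STOP_WORDS = {"a", "the", "and", "in", "but", "who", "that", "or"}

def remove_stop_words(tokens):
    # Keep only tokens that are not in the stop list
    # (comparison is case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["the", "boy", "and", "the", "car"]))
```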

Elimination of stop words Luhn (1958) suggested that both extremely common and extremely uncommon words were not very useful for indexing.

Elimination of stop words
The stop list might be extended to include words other than articles, prepositions, and conjunctions; e.g., some verbs, adverbs, and adjectives could be treated as stop words. Typically a few hundred terms take up 50% of the text, so eliminating stop words can substantially reduce the size of the indexing structure.

Elimination of stop words
But again care is required: queries like “To be or not to be” and “Let it be” consist entirely of stop words. The general trend over time has been to move from standard use of quite large stop lists (200-300 terms), to very small stop lists (7-12 terms), to no stop list whatsoever. Web search engines generally do not use stop lists (they keep a mixture of words and phrases).

Stemming
Different forms of a word are used in documents, e.g., organize, organizes, and organizing, as are families of derivationally related words with similar meanings, e.g., democracy, democratic, and democratization. It would be useful for a search for one of these words to return documents that contain another word in the set:
am, are, is → be
car, cars, car’s, cars’ → car
E.g., the text “the boy’s cars are different colors” becomes “the boy car be differ color”

Stemming
Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time; it often includes the removal of derivational affixes. The most common algorithm for stemming English, and one that has repeatedly been shown to be empirically very effective, is Porter’s algorithm. Porter’s algorithm is long; we cover the general principle only. You can make up your own stemming rules.

Porter’s algorithm
Consists of five phases of word reduction. Within each phase there are various conventions to select rules, e.g., select the rule from each rule group that applies to the longest suffix. In the first phase, this convention is used with the following rule group:

Rule           Example
SSES → SS      caresses → caress
IES  → I       ponies   → poni
SS   → SS      caress   → caress
S    → (null)  cats     → cat
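
This rule group can be implemented directly; listing the suffixes longest-first realises the “longest matching suffix” convention (a sketch of this one rule group only, not the full five-phase algorithm):

```python
def porter_phase1a(word):
    # Apply the first matching rule from the phase-1a group of
    # Porter's algorithm: SSES -> SS, IES -> I, SS -> SS, S -> "".
    # Trying longer suffixes first implements the longest-suffix rule.
    for suffix, replacement in [("sses", "ss"), ("ies", "i"),
                                ("ss", "ss"), ("s", "")]:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_phase1a(w))
```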

Porter’s algorithm
Many of the later rules use a concept of the measure of a word. This checks the number of syllables, to see if a word is long enough that it is reasonable to regard the matching portion of a rule as a suffix, rather than as part of the stem. Example: the rule (m>1) EMENT → (null) maps replacement to replac, but not cement to c. See www.tartarus.org/~martin/PorterStemmer/ for the entire algorithm, and try it for yourself (see next page).

Stemming


Example
Within each phase, there are various conventions to select rules, such as selecting the rule from each rule group that applies to the longest suffix.

Word         Stemmed
phase        phase
various      variou
conventions  convent
select       select
rules        rule
selecting    select
rule         rule
rule         rule
group        group
applies      appli
longest      longest
suffix       suffix

Google uses stemming
http://www.googleguide.com/interpreting_queries.html : “Google now uses stemming technology. Thus, when appropriate, it will search not only for your search terms, but also for words that are similar to some or all of those terms. If you search for pet lemur dietary needs, Google will also search for pet lemur diet needs, and other related variations of your terms. Any variants of your terms that were searched for will be highlighted in the snippet of text accompanying each result.” (older version of the page)
http://en.wikipedia.org/wiki/Stemming has an extensive list of stemming approaches.

Text Properties

Text properties
Text is composed of symbols from a finite alphabet. The symbols can be divided into two disjoint subsets: symbols that separate words and symbols that belong to words. Issues:
– How are the different words distributed inside each document?
– How many distinct words are there in a document?
– What is the average length of a word?

Vocabulary distribution
A few words are very common: the 2 most frequent words (e.g. “the”, “of”) can account for about 10% of word occurrences. Most words are very rare: the distribution is “heavy tailed”, with most of the words in the “tail” (a power law on rank, where rank means largest, next largest, and so on; see en.wikipedia.org/wiki/Rank-size_distribution). This is the Zipf distribution (en.wikipedia.org/wiki/Zipf's_law).

TIME magazine collection: 423 short articles, 245412 words.

Top 10 frequent words   Occurrences   % of total
the                     15861         6.46
of                       7239         2.95
to                       6331         2.58
a                        5878         2.40
and                      5614         2.29
in                       5294         2.16
that                     2507         1.02
for                      2228         0.91
was                      2149         0.88
with                     1839         0.75

Distribution of the top 50 frequent words (TIME magazine collection: 423 short articles, 245412 words).

Zipf’s law
The frequency of occurrence of some event, F(r), is a function of its rank r:
F(r) = C / r^a
where C and a are constants. Let P(r) be the probability of the rank-r word: P(r) = F(r) / N, where N is the total number of word occurrences. Then
P(r) = (C / N) / r^a

How to find the constant C
Given that we can estimate the value of a, we can get C by requiring P(r) = C / r^a to sum to 1 over all ranks: C = 1 / Zeta(a), where Zeta(a) = Σ_{r=1,2,…} 1 / r^a. Zeta functions are tabulated. E.g., if a = 2, Zeta(2) ≈ 1.645, so setting C = 1/1.645 gives P(r) = 1/(1.645 r^2).
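
The normalization can be checked numerically; a truncated sum approximates Zeta(a) well (a small illustrative sketch):

```python
def zeta(a, terms=100_000):
    # Truncated sum approximating Zeta(a) = sum over r >= 1 of 1/r^a.
    return sum(1.0 / r ** a for r in range(1, terms + 1))

def zipf_p(r, a):
    # Probability of the rank-r word under a normalized Zipf law:
    # P(r) = C / r^a with C = 1 / Zeta(a).
    return (1.0 / zeta(a)) / r ** a

print(round(zeta(2), 3))   # close to the tabulated value 1.645
print(round(zipf_p(1, 2), 3))
```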

Vocabulary (a = 1)
For the vocabulary distribution, a ≈ 1 and C/N ≈ 0.1, so P(r) ≈ 0.1/r: the probability decreases directly with rank, and these values are content independent. This means r · P(r) = r · F(r)/N ≈ 0.1. See the examples overleaf. Why might this matter? Storage requirements of the index, generating artificial test documents, spotting exceptional occurrences of rare words, general curiosity.
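
Using the TIME figures quoted earlier, we can check that r · P(r) stays near 0.1 for the top-ranked words (a small sanity check with the frequencies from the slides):

```python
# Sanity check of r * P(r) ~ 0.1 using the top TIME-collection
# frequencies quoted in the slides (N = 245412 total words).
N = 245412
time_freq = {"the": (1, 15861), "of": (2, 7239), "to": (3, 6331),
             "a": (4, 5878), "and": (5, 5614), "in": (6, 5294)}

for word, (r, f) in time_freq.items():
    print(word, round(r * f / N, 3))
```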

TIME magazine collection, top 50 words. P(r) = F(r)/N with N = 245412; values missing in the source are shown as -.

 r  Word   F(r)   r·P(r) |  r  Word    F(r)  r·P(r) |  r  Word     F(r)  r·P(r)
 1  the   15861   0.065  | 18  it      1290  0.095  | 35  week      793  0.113
 2  of     7239   0.059  | 19  from    1228  -      | 36  they      697  0.102
 3  to     6331   0.077  | 20  but     1138  0.093  | 37  govern    687  0.104
 4  a      5878   0.096  | 21  u        955  0.082  | 38  all       672  -
 5  and    5614   0.114  | 22  had      940  0.084  | 39  year      -    0.107
 6  in     5294   0.129  | 23  last     930  0.087  | 40  its       620  0.101
 7  that   2507   0.072  | 24  be       915  0.089  | 41  britain   589  0.098
 8  for    2228   0.073  | 25  have     914  -      | 42  when      579  0.099
 9  was    2149   0.079  | 26  who      894  -      | 43  out       577  -
10  with   1839   0.075  | 27  not      882  0.097  | 44  would     -    0.103
11  his    1815   0.081  | 28  has      880  0.100  | 45  new       572  0.105
12  is     1810   -      | 29  an       873  -      | 46  up        559  -
13  he     1700   0.090  | 30  s        865  0.106  | 47  been      554  -
14  as     1581   -      | 31  were     848  -      | 48  more      540  -
15  on     1551   -      | 32  their    815  -      | 49  which     539  0.108
16  by     1467   -      | 33  are      812  0.109  | 50  into      518  -
17  at     1333   0.092  | 34  one      811  0.112  |

Wall Street Journal (WSJ87) collection: 46,449 newspaper articles, N = 19 million. Values missing in the source are shown as -; the F(r) entries for "million" and "company" appear garbled in the source and are left as-is.

 r  Word   F(r)     r·P(r) |  r  Word     F(r)   r·P(r) |  r  Word     F(r)   r·P(r)
 1  the    1130021  0.059  | 18  from     96900  0.092  | 35  or       54958  0.101
 2  of      547311  0.058  | 19  he       94585  0.095  | 36  about    53713  0.102
 3  to      516635  0.082  | 20  million   3515  0.098  | 37  market   52110  -
 4  a       464736  -      | 21  year     90104  0.100  | 38  they     51359  0.103
 5  in      390819  -      | 22  its      86774  -      | 39  this     50933  0.105
 6  and     387703  0.122  | 23  be       85588  0.104  | 40  would    50828  0.107
 7  that    204351  0.075  | 24  was      83398  -      | 41  u        49281  0.106
 8  for     199340  0.084  | 25  company   3070  0.109  | 42  which    48273  -
 9  is      152483  0.072  | 26  an       76974  -      | 43  bank     47940  -
10  said    148302  0.078  | 27  has      74405  -      | 44  stock    47401  0.110
11  it      134323  -      | 28  are      74097  -      | 45  trade    47310  0.112
12  on      121173  0.077  | 29  have     73132  -      | 46  his      47116  0.114
13  by      118863  0.081  | 30  but      71887  -      | 47  more     46244  -
14  as      109135  0.080  | 31  will     71494  0.117  | 48  who      42142  -
15  at      101779  -      | 32  say      66807  0.113  | 49  one      41635  -
16  mr      101679  0.086  | 33  new      64456  -      | 50  their    40910  0.108
17  with    101210  0.091  | 34  share    63925  -      |

Average word length
TREC (the Text REtrieval Conference) has collected documents since 1992 as a test bed for participants. In the TREC (English language) collection, the average word length is:
– 5 (range 4.8-5.3 in sub-collections)
– 6-7 if stop words are eliminated
– 8-9 over all vocabulary entries (long words occur less frequently)

Vocabulary growth
How does the size of the overall vocabulary (the number of unique words) grow with the size of the corpus? This determines how the size of the inverted index will scale with the size of the corpus. The vocabulary is not really upper-bounded, due to proper names, typos, etc.

Heaps’ law
If V is the size of the vocabulary and n is the length of the corpus in words, then
V = K · n^β   with constants K and β, 0 < β < 1
Typical constants: K ≈ 10-100 and β ≈ 0.4-0.6 (approximately square-root growth).
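
A quick illustration of how V scales under Heaps’ law; K = 44 and β = 0.5 are our own illustrative choices within the typical ranges quoted above, not values from the source:

```python
def heaps_vocabulary(n, K=44, beta=0.5):
    # Predicted vocabulary size V = K * n^beta.
    # K = 44 and beta = 0.5 are illustrative values within the
    # typical ranges (K ~ 10-100, beta ~ 0.4-0.6).
    return K * n ** beta

for n in (10_000, 1_000_000, 100_000_000):
    print(n, int(heaps_vocabulary(n)))
```

With beta = 0.5, multiplying the corpus size by 100 multiplies the predicted vocabulary by 10, the square-root behaviour noted above.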

Heaps’ law data (from James Allan, UMass): a plot of words in vocabulary, V (thousands), against words in collection, n (millions).