Published by Marvin Moore
(C) 2003, The University of Michigan
Information Retrieval
Handout #2
February 3, 2003

Course Information
  Instructor: Dragomir R. Radev (radev@si.umich.edu)
  Office: 3080, West Hall Connector
  Phone: (734) 615-5225
  Office hours: M&F 11-12
  Course page: http://tangra.si.umich.edu/~radev/650/
  Class meets on Mondays, 1-4 PM in 409 West Hall

Queries and documents

Queries
  Single-word queries
  Context queries
    – Phrases
    – Proximity
  Boolean queries
  Natural language queries

Pattern matching
  Words, prefixes, suffixes, substrings, ranges, regular expressions
  Structured queries (e.g., XML)

Relevance feedback
  Query expansion
  Term reweighting
  Pseudo-relevance feedback
  Latent semantic indexing
  Distributional clustering

Document processing
  Lexical analysis
  Stopword elimination
  Stemming
  Index term identification
  Thesauri

Porter’s algorithm
  1. The measure, m, of a stem is based on its alternating sequences of vowels and consonants. If V is a sequence of vowels and C is a sequence of consonants, then every stem has the form
       $[C](VC)^{m}[V]$
     where the initial C and final V are optional and m is the number of VC repeats.
       m=0: free, why
       m=1: frees, whose
       m=2: prologue, compute
  2. *X - stem ends with the letter X
  3. *v* - stem contains a vowel
  4. *d - stem ends in a double consonant
  5. *o - stem ends with a consonant-vowel-consonant sequence where the final consonant is not w, x, or y

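The measure and the stem conditions can be sketched in Python as below. This is a minimal illustration, not a full stemmer. The function names are invented for this example, input is assumed to be lowercase ASCII, and treating "y" as a vowel only when it follows a consonant follows Porter's original definition rather than anything stated on the slide.

```python
VOWELS = "aeiou"

def is_consonant(word, i):
    """True if word[i] is a consonant; 'y' is a vowel only after a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Return m, the number of VC repeats in the form [C](VC)^m[V]."""
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_is_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_is_vowel = not cons
    return m

def contains_vowel(stem):
    """Condition *v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):
    """Condition *d: the stem ends in a double consonant."""
    n = len(stem)
    return n >= 2 and stem[-1] == stem[-2] and is_consonant(stem, n - 1)

def ends_cvc(stem):
    """Condition *o: ends consonant-vowel-consonant, last letter not w, x, or y."""
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")

# The slide's examples: free/why give 0, frees/whose give 1, prologue/compute give 2.
for w in ["free", "why", "frees", "whose", "prologue", "compute"]:
    print(w, measure(w))
```
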
Porter’s algorithm
  Suffix conditions take the form: current_suffix == pattern
  Actions take the form: old_suffix -> new_suffix
  Rules are divided into steps that define the order in which they are applied. Some example rules:

  STEP  CONDITION          SUFFIX  REPLACEMENT    EXAMPLE
  1a    NULL               sses    ss             stresses -> stress
  1b    *v*                ing     NULL           making -> mak
  1b1   NULL               at      ate            inflat(ed) -> inflate
  1c    *v*                y       i              happy -> happi
  2     m>0                aliti   al             formaliti -> formal
  3     m>0                icate   ic             duplicate -> duplic
  4     m>1                able    NULL           adjustable -> adjust
  5a    m>1                e       NULL           inflate -> inflat
  5b    m>1 and *d and *L  NULL    single letter  controll -> control

Porter’s algorithm
  Example: the word “duplicatable”
    duplicat   rule 4
    duplicate  rule 1b1
    duplic     rule 3
  Another rule in step 4, which would remove “ic,” cannot be applied, since only one rule from each step may be applied.

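As a rough sketch of how condition/suffix/replacement rules might be applied, the Python below encodes the example rules from the slides (rule 5b is left out because it drops a single letter rather than a fixed suffix) and replays the "duplicatable" trace in the order the slide lists it. The `RULES` table layout and the `apply_rule` helper are assumptions made for illustration. Lowercase ASCII input is assumed, and "y" is treated as a vowel only after a consonant, as in Porter's original definition.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant; 'y' is a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number of vowel-run-then-consonant transitions (the measure m)."""
    m, prev_is_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_is_vowel:
            m += 1
        prev_is_vowel = not cons
    return m

def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))

# step -> (condition on the stem, suffix, replacement), taken from the example rules
RULES = {
    "1a":  (lambda s: True,           "sses",  "ss"),
    "1b":  (contains_vowel,           "ing",   ""),
    "1b1": (lambda s: True,           "at",    "ate"),
    "1c":  (contains_vowel,           "y",     "i"),
    "2":   (lambda s: measure(s) > 0, "aliti", "al"),
    "3":   (lambda s: measure(s) > 0, "icate", "ic"),
    "4":   (lambda s: measure(s) > 1, "able",  ""),
    "5a":  (lambda s: measure(s) > 1, "e",     ""),
}

def apply_rule(word, step):
    """Apply one rule; return the word unchanged if the rule does not fire."""
    condition, suffix, replacement = RULES[step]
    if not word.endswith(suffix):
        return word
    stem = word[:-len(suffix)]  # the condition is tested on what precedes the suffix
    return stem + replacement if condition(stem) else word

word = "duplicatable"
for step in ["4", "1b1", "3"]:
    word = apply_rule(word, step)
    print(step, word)
# prints: 4 duplicat / 1b1 duplicate / 3 duplic
```
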
Relevance feedback
  Automatic
  Manual
  Method: identifying feedback terms
    $Q' = a_1 Q + a_2 R - a_3 N$
  Often $a_1 = 1$, $a_2 = 1/|R|$ and $a_3 = 1/|N|$

Example
  Q = “safety minivans”
  D1 = “car safety minivans tests injury statistics” - relevant
  D2 = “liability tests safety” - relevant
  D3 = “car passengers injury reviews” - non-relevant
  R = ?  N = ?  Q’ = ?

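One way to work this example is sketched below in Python. It assumes raw term counts as vector weights, takes R and the non-relevant set as sums of the judged documents' vectors, and uses the weights 1, 1/|R| and 1/|N| mentioned earlier. The function name `rocchio_update` is invented for this sketch, and other weighting schemes would give different numbers.

```python
from collections import Counter

def rocchio_update(query, relevant, nonrelevant, a1=1.0):
    """Return Q' = a1*Q + a2*R - a3*N over raw term counts.

    a2 and a3 are set to 1/|R| and 1/|N| (the number of relevant and
    non-relevant documents), or 0 when the corresponding set is empty.
    """
    q = Counter(query.split())
    r = Counter()
    for doc in relevant:
        r.update(doc.split())
    n = Counter()
    for doc in nonrelevant:
        n.update(doc.split())
    a2 = 1.0 / len(relevant) if relevant else 0.0
    a3 = 1.0 / len(nonrelevant) if nonrelevant else 0.0
    terms = set(q) | set(r) | set(n)
    # Counter returns 0 for missing terms, so every term gets all three parts.
    return {t: a1 * q[t] + a2 * r[t] - a3 * n[t] for t in terms}

q_new = rocchio_update(
    "safety minivans",
    relevant=["car safety minivans tests injury statistics",
              "liability tests safety"],
    nonrelevant=["car passengers injury reviews"],
)
for term, weight in sorted(q_new.items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {weight:+.1f}")
```

Under these assumptions "safety" ends up with the largest weight (2.0), followed by "minivans" (1.5) and "tests" (1.0), while terms that occur only in the non-relevant document ("passengers", "reviews") become negative.
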
Automatic query expansion
  Thesaurus-based expansion
  Distributional similarity-based expansion

WordNet and DistSim
  wn reason -hypen    - hypernyms
  wn reason -synsn    - synsets
  wn reason -simsn    - synonyms
  wn reason -over     - overview of senses
  wn reason -famln    - familiarity/polysemy
  wn reason -grepn    - compound nouns
  /clair3/tools/relatedwords/relate reason

Related (substitutable) words
  WordNet:
    Book: publication, product, fact, dramatic composition, record
    Computer: machine, expert, calculator, reckoner, figurer
    Fruit: reproductive structure, consequence, product, bear
    Politician: leader, schemer
    Newspaper: press, publisher, product, paper, newsprint
  Distributional clustering:
    Book: autobiography, essay, biography, memoirs, novels
    Computer: adobe, computing, computers, developed, hardware
    Fruit: leafy, canned, fruits, flowers, grapes
    Politician: activist, campaigner, politicians, intellectuals, journalist
    Newspaper: daily, globe, newspapers, newsday, paper

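Automatic query expansion from a table of related words might look roughly like the Python sketch below. The `RELATED` dictionary copies a few of the distributional entries listed above. The `expand_query` function and its cutoff `k` are hypothetical and not part of the course tools.

```python
# Toy related-words table; entries copied from the distributional lists above.
RELATED = {
    "book": ["autobiography", "essay", "biography", "memoirs", "novels"],
    "fruit": ["leafy", "canned", "fruits", "flowers", "grapes"],
    "newspaper": ["daily", "globe", "newspapers", "newsday", "paper"],
}

def expand_query(query, related, k=2):
    """Append up to k related words per query term, skipping duplicates."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for candidate in related.get(term, [])[:k]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query("fruit newspaper", RELATED))
# ['fruit', 'newspaper', 'leafy', 'canned', 'daily', 'globe']
```
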
Indexing and searching

Computing term salience
  Term frequency (TF)
  Document frequency (DF)
  Inverse document frequency (IDF)

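The three quantities can be sketched in Python as follows. The slide gives no formula for IDF, so the common definition log(N/df) with the natural logarithm is assumed here. Tokenization is plain whitespace splitting, the function names are invented for this sketch, and the three short documents are reused from the relevance feedback example as a toy collection.

```python
import math
from collections import Counter

def term_frequencies(doc):
    """TF: how many times each term occurs in one document."""
    return Counter(doc.lower().split())

def document_frequencies(docs):
    """DF: in how many documents each term occurs at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df

def inverse_document_frequencies(docs):
    """IDF, assumed here to be log(N / df) for a collection of N documents."""
    n_docs = len(docs)
    return {t: math.log(n_docs / d) for t, d in document_frequencies(docs).items()}

docs = [
    "car safety minivans tests injury statistics",
    "liability tests safety",
    "car passengers injury reviews",
]
tf = term_frequencies(docs[0])
df = document_frequencies(docs)
idf = inverse_document_frequencies(docs)
print(tf["safety"], df["safety"], round(idf["safety"], 3))
print(tf["minivans"], df["minivans"], round(idf["minivans"], 3))
```

A term that appears in fewer documents ("minivans", df = 1) gets a higher IDF than one that appears in more of them ("safety", df = 2).
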
Scripts to compute tf and idf
  cd /clair4/class/ir-w03/hw2
  ./tf.pl 053.txt | sort -nr +1 | more
  ./tfs.pl 053.txt | sort -nr +1 | more
  ./stem.pl reasonableness
  ./build-idf.pl
  ./idf.pl | sort -n +2 | more

Applications of TFIDF
  Cosine similarity
  Indexing
  Clustering

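For the first application, a minimal Python sketch of cosine similarity between two weighted term vectors is given below. Vectors are assumed to be sparse dictionaries mapping a term to its weight (for instance tf times idf). The function name and the toy weights are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical weights, for illustration only.
query = {"safety": 1.0, "minivans": 1.0}
doc_a = {"safety": 1.0, "minivans": 1.0, "tests": 1.0}
doc_b = {"passengers": 1.0, "reviews": 1.0}
print(cosine_similarity(query, doc_a))  # shares two terms with the query
print(cosine_similarity(query, doc_b))  # shares none, so the result is 0.0
```
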