COMP423: Intelligent Agent Text Representation
Menu
– Bag of words
– Phrase
– Semantics
  – Semantic distance between two words
Bag of words
Vector Space model
– Documents are term vectors
– Tf.Idf for term weights
– Cosine similarity
Limitations:
– Word order
– Word importance: key words, word ranking
– …
– Word semantics
  – Word form and word meaning: not a 1-to-1 mapping
  – Semantic distance between words
  – …
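A minimal sketch of this pipeline, using scikit-learn (my choice of toolkit, not the slides'): each document becomes a tf.idf term vector, and document similarity is the cosine between vectors. The corpus is invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",        # invented toy corpus
    "the dog sat on the log",
    "cats and dogs make good pets",
]
vectors = TfidfVectorizer().fit_transform(docs)  # one tf.idf term vector per document
print(cosine_similarity(vectors))                # pairwise cosine similarities
```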
Consider Word Order
N-grams model
– Bi-grams: two words as a phrase; some are not really phrases
– Tri-grams: three words; not worth it
Phrase based
– Use part of speech, e.g. select noun phrases
– Regular expressions, chunking: writing the patterns is expensive
– Suffix tree for shared phrases
Mixed results
One example [Furnkranz98]
– The representation is evaluated on a Web categorization task (university pages classified as STUDENT, FACULTY, STAFF, DEPARTMENT, etc.)
– A Naive Bayes (NB) classifier and Ripper (a rule induction algorithm) were used
– Results (words vs. words+phrases) are mixed
  – Accuracy improved for NB but not for Ripper
  – Precision at low recall greatly improved
  – Some phrasal features are highly predictive for certain classes, but in general have low coverage
More recent work by [Yuefeng Li 2010, KDD]
– Applied to classification, positive results
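To make the two strategies concrete, a hedged sketch with NLTK (again my choice of toolkit): raw bi-grams versus noun phrases selected by a simple POS chunk grammar. The sentence and the grammar are invented; the tokenizer and tagger models must be downloaded first via nltk.download.

```python
import nltk

sentence = "The intelligent agent classifies university web pages"
tokens = nltk.word_tokenize(sentence)

# N-gram route: every adjacent word pair, whether or not it is a real phrase
print(list(nltk.bigrams(tokens)))

# Phrase route: keep only noun phrases matched by a simple chunk grammar
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"   # optional determiner, adjectives, 1+ nouns
tree = nltk.RegexpParser(grammar).parse(nltk.pos_tag(tokens))
phrases = [" ".join(word for word, tag in subtree.leaves())
           for subtree in tree.subtrees() if subtree.label() == "NP"]
print(phrases)
```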
Word importance
Feature selection: needs a corpus for training
– Document frequency
– Information Gain (IG)
– Chi-square
– LSA, ICA
Keyword extraction
Feature extraction
Ideas
– Using Wikipedia as training data and testing data
– Using the Web
– Bringing order to words
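A small illustration of corpus-based feature selection with the chi-square criterion listed above; the scikit-learn pipeline, the toy documents, and the spam/ham labels are my assumptions, not from the slides.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["cheap pills online now", "meeting agenda attached",
        "win money online now", "quarterly project report"]
labels = [1, 0, 1, 0]           # invented spam/ham labels, purely illustrative

vec = CountVectorizer()
X = vec.fit_transform(docs)
selector = SelectKBest(chi2, k=3).fit(X, labels)   # keep 3 most predictive terms
print(vec.get_feature_names_out()[selector.get_support()])
```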
Other issues
Time
Word categories
– Common words
– Academic words
– Domain-specific words
– …
Word semantics
Using external resources
– Early work: WordNet, Cyc
– Wikipedia
– The Web
Mixed results
– Recall is usually improved, but precision is hurt
– Disambiguation is critical
WordNet
WordNet’s organization
– The basic unit is the synset = synonym set
– A synset is equivalent to a concept
– E.g. senses of “car” (synsets to which “car” belongs):
  – {car, auto, automobile, machine, motorcar}
  – {car, railcar, railway car, railroad car}
  – {cable car, car}
  – {car, gondola}
  – {car, elevator car}
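The “car” example can be reproduced directly with NLTK's WordNet interface (assuming the wordnet corpus has been downloaded via nltk.download):

```python
from nltk.corpus import wordnet as wn

# each synset is one sense of "car"; its lemma names are the synonym set
for synset in wn.synsets("car"):
    print(synset.name(), "->", synset.lemma_names())
```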
WordNet is useful for IR
Indexing with synsets has proven effective [Gonzalo98]
– It improves recall because it maps synonyms to the same indexing object
– It improves precision if only relevant senses are considered
  – E.g. a query for “jaguar” in the car sense retrieves only documents about the jaguar car
Mixed results
Concept indexing with WordNet [Scott98, Scott99]
– Using synsets and hypernyms with Ripper
– Fails because they do not perform WSD (Word Sense Disambiguation)
[Junker97]
– Using synsets and hypernyms as generalization operators in a specialized rule learner
– Fails because the proposed learning method gets lost in the hypothesis space
[Fukumoto01]
– Synsets and (limited) hypernyms for SVM, no WSD
– Improvement on less populated categories
In general
– Given that there is no reliable WSD algorithm for (fine-grained) WordNet senses, current approaches do not perform WSD
– Improvements in small categories
– But I believe full, perfect WSD is not required.
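For concreteness, the classic Lesk algorithm shipped with NLTK gives a baseline WSD run; this is purely illustrative, and it is exactly the kind of unreliable fine-grained WSD the slide is describing, not a solution.

```python
from nltk.wsd import lesk

context = "I went to the bank to deposit my money".split()
print(lesk(context, "bank"))   # returns whichever WordNet synset Lesk picks
```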
Semantic distance between word pairs
– Wikipedia based
  – Wikipedia Link similarity Measure (WLM)
  – Explicit Semantic Analysis: ESA (state of the art)
– Normalized Google Distance (NGD)
– Thesaurus based: WordNet
– Corpus based: e.g. Latent Semantic Analysis
– Statistical thesaurus: co-occurrence
NGD: Motivation and Goals
– To represent meaning in a computer-digestible form
– To establish semantic relations between common names of objects
– Utilise the largest database in the world – the web
NGD definition

\mathrm{NGD}(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log N - \min\{\log f(x), \log f(y)\}}

– x = word one (e.g. 'horse')
– y = word two (e.g. 'rider')
– N = normalising factor (often M)
– M = the cardinality of the set of all pages on the web
– f(x) = frequency with which x occurs in the total set of documents
– f(x, y) = frequency with which x and y co-occur in the total set of documents
Because of log N, NGD is stable as the web grows
Example: NGD(horse, rider)
– "horse" returns 46,700,000 hits
– "rider" returns 12,200,000 hits
– "horse rider" returns 2,630,000 hits
– Google indexed N = 8,058,044,651 pages
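Plugging these counts into the definition above as a worked check (not from the slides; note the log base cancels in the ratio, so natural log is fine):

```python
from math import log

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from page counts fx, fy, fxy and index size n."""
    return (max(log(fx), log(fy)) - log(fxy)) / (log(n) - min(log(fx), log(fy)))

print(ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651))  # about 0.44
```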
Explicit Semantic Analysis (ESA)
Use Wikipedia as the external source
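A minimal sketch of the ESA idea, with a three-article toy standing in for Wikipedia: a term's meaning is its column of tf.idf weights over articles ("concepts"), and relatedness is the cosine of two such concept vectors. The articles are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = [                                     # hypothetical Wikipedia stand-ins
    "a horse is a large animal often ridden by a rider",
    "in equestrian sport a rider trains and rides a horse",
    "a computer executes stored programs",
]
vec = TfidfVectorizer()
tfidf = vec.fit_transform(articles)              # rows: articles, columns: terms

def esa_vector(term):
    # a term's ESA vector: its tf.idf weight in every article/concept
    return tfidf[:, vec.vocabulary_[term]].T.toarray()

print(cosine_similarity(esa_vector("horse"), esa_vector("rider")))
```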
Wikipedia Link similarity measure
– Inlinks
– Outlinks
– Shared inlinks and outlinks; average of the two
  – Inlinks: formula borrowed from NGD
  – Outlinks: w(l, A), the weight of a link, similar to inverse document frequency – the total number of pages divided by the number of pages with that link
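A sketch of the inlink side under these assumptions: A and B are the sets of articles linking in to the two concepts, W is the total article count, and the formula is the NGD-style distance (lower means more related). The article IDs in the usage line are invented.

```python
from math import log

def wlm_inlink_distance(A: set, B: set, W: int) -> float:
    """NGD-style distance over in-link sets A and B in a wiki of W articles."""
    shared = len(A & B)
    if shared == 0:
        return float("inf")            # no shared in-links: no evidence of relatedness
    return (log(max(len(A), len(B))) - log(shared)) / \
           (log(W) - log(min(len(A), len(B))))

print(wlm_inlink_distance({1, 2, 3, 4}, {3, 4, 5}, W=1_000_000))
```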
Bag of concepts
WikiMiner (WLM), by Ian Witten, David Milne, Anna Huang
– Wikipedia-based approach
– Concepts are anchor texts
  – Can be phrases
  – Also a way to select important words
– Use shared inlinks and outlinks to estimate the semantic distance between concepts
– New document similarity measure
There should be other ways
– to define concepts
– to select concepts
– to compare concepts
– …
Statistical Thesaurus
– Existing human-developed thesauri are not easily available in all languages.
– Human thesauri are limited in the type and range of synonymy and semantic relations they represent.
– Semantically related terms can be discovered from statistical analysis of corpora.
Automatic Global Analysis
– Determine term similarity through a pre-computed statistical analysis of the complete corpus.
– Compute association matrices which quantify term correlations in terms of how frequently terms co-occur.
Association Matrix

        w1    w2    w3   …   wn
  w1    c11   c12   c13  …   c1n
  w2    c21
  w3    c31
  …
  wn    cn1

c_ij: correlation factor between term i and term j

c_{ij} = \sum_{d_k \in D} f_{ik} \cdot f_{jk}

f_ik: frequency of term i in document k
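With F the term-document frequency matrix, the whole association matrix is one matrix product, C = F Fᵀ. A toy NumPy check with invented frequencies:

```python
import numpy as np

# toy term-document frequencies: rows = terms, columns = documents
F = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 3, 1]])
C = F @ F.T        # c_ij = sum over documents of f_ik * f_jk
print(C)
```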
Normalized Association Matrix
Frequency-based correlation factors favor more frequent terms. Normalize association scores:

s_{ij} = \frac{c_{ij}}{c_{ii} + c_{jj} - c_{ij}}

The normalized score is 1 if two terms have the same frequency in all documents.
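Continuing the same toy matrix, the normalization in code (F is again invented; note the diagonal of S comes out exactly 1):

```python
import numpy as np

F = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 3, 1]])                 # same toy matrix as above
C = F @ F.T
d = np.diag(C)
S = C / (d[:, None] + d[None, :] - C)     # s_ij = c_ij / (c_ii + c_jj - c_ij)
print(np.round(S, 3))
```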
Metric Correlation Matrix
Association correlation does not account for the proximity of terms in documents, just co-occurrence frequencies within documents. Metric correlations account for term proximity:

c_{ij} = \sum_{k_u \in V_i} \sum_{k_v \in V_j} \frac{1}{r(k_u, k_v)}

– V_i: set of all occurrences of term i in any document
– r(k_u, k_v): distance in words between word occurrences k_u and k_v (∞ if k_u and k_v are occurrences in different documents)
Normalized Metric Correlation Matrix
Normalize scores to account for term frequencies:

s_{ij} = \frac{c_{ij}}{|V_i| \times |V_j|}
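A sketch of both metric formulas, assuming occurrences are stored as (document, position) pairs of my own devising; pairs from different documents contribute nothing, since r = ∞ there.

```python
def metric_correlation(V_i, V_j):
    """c_ij = sum of 1/r over all occurrence pairs; cross-document pairs add 0."""
    total = 0.0
    for doc_u, pos_u in V_i:
        for doc_v, pos_v in V_j:
            if doc_u == doc_v:                    # r = infinity across documents
                total += 1.0 / abs(pos_u - pos_v)
    return total

def normalized_metric(V_i, V_j):
    """s_ij = c_ij / (|V_i| * |V_j|)."""
    return metric_correlation(V_i, V_j) / (len(V_i) * len(V_j))

vi = [(0, 3), (0, 10), (1, 2)]                    # toy occurrence lists:
vj = [(0, 5), (1, 7)]                             # (document id, word position)
print(normalized_metric(vi, vj))                  # 0.9 / 6 = 0.15
```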