COMP423: Intelligent Agent Text Representation
Menu – Bag of words – Phrase – Semantics – Bag of concepts – Semantic distance between two words
Bag of words
Vector Space model
– Documents are term vectors
– Tf.Idf for term weights
– Cosine similarity
Limitations:
– Word semantics
– Semantic distance between words
– Word order
– Word importance
– ….
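
The vector space model above can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and the common weighting tf × log(N / df); the function names and the three toy documents are made up for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy documents (assumptions, not from the course material).
docs = [
    "the horse and the rider",
    "the rider fell off the horse",
    "stock markets fell sharply",
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # documents sharing several terms
sim_02 = cosine(vecs[0], vecs[2])  # documents sharing no terms
```

Documents 0 and 2 share no term, so their similarity is zero even if they were about related topics, which is the word semantics limitation listed above.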
Consider Word Order
N-grams model
– Bi-grams: two words as a phrase, some are not really phrases
– Tri-grams: three words, not worth it
Phrase based
– Use part of speech, e.g. select noun phrases
– Regular expression, chunking: expensive for writing the patterns
Mixed results
One example [Furnkranz98]
– The representation is evaluated on a Web categorization task (university pages classified as STUDENT, FACULTY, STAFF, DEPARTMENT, etc.)
– A Naive Bayes (NB) classifier and Ripper used
– Results (words vs. words+phrases) are mixed
  Accuracy improved for NB and not for Ripper
  Precision at low recall highly improved
  Some phrasal features are highly predictive for certain classes, but in general have low coverage
More recent work by [Yuefeng Li 2010, KDD]
– Applied on classification, positive results
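
A words+phrases representation of the kind compared above can be built by adding bi-grams to the bag of words. The sketch below assumes whitespace tokenization; the helper names are hypothetical.

```python
from collections import Counter

def word_ngrams(text, n):
    """All runs of n consecutive words, joined by a space."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def words_plus_bigrams(text):
    """Bag of features holding both single words and bi-grams."""
    return Counter(word_ngrams(text, 1) + word_ngrams(text, 2))

bigrams = word_ngrams("naive bayes text classifier", 2)
features = words_plus_bigrams("naive bayes text classifier")
```

Here "naive bayes" is a real phrase while "bayes text" is not, which illustrates why some bi-grams add noise rather than information.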
Word semantics
Using external resources
– Early work
  WordNet
  Cyc
– Wikipedia
– The Web
Mixed results
– Recall is usually improved, but precision is hurt
Disambiguation is critical
WordNet
WordNet’s organization
– The basic unit is the synset = synonym set
– A synset is equivalent to a concept
– E.g. senses of “car” (synsets to which “car” belongs)
  {car, auto, automobile, machine, motorcar}
  {car, railcar, railway car, railroad car}
  {cable car, car}
  {car, gondola}
  {car, elevator car}
WordNet is useful for IR
Indexing with synsets has proven effective [Gonzalo98]
It improves recall because it maps synonyms into the same indexing object
It improves precision if only relevant senses are considered
– E.g. a query for “jaguar” in the car sense retrieves only documents with “jaguar car”
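
Synset indexing can be illustrated with a hand-made sense table. This is only a sketch: the two synsets are copied from the senses of “car” listed earlier, but the synset ids, the `index_terms` helper and the way a sense is chosen are assumptions, not WordNet's actual interface.

```python
# Hypothetical sense inventory: synset id -> member words (ids are made up).
SYNSETS = {
    "car.vehicle": {"car", "auto", "automobile", "machine", "motorcar"},
    "car.rail": {"car", "railcar", "railway car", "railroad car"},
}

def index_terms(tokens, chosen_sense):
    """Replace each token by a synset id where one can be determined.

    chosen_sense maps an ambiguous word to the synset id picked for it and
    stands in for the disambiguation step. Words found in exactly one synset
    are mapped automatically; everything else is kept as a plain word.
    """
    indexed = []
    for tok in tokens:
        sense = chosen_sense.get(tok)
        if sense is None:
            matches = [sid for sid, words in SYNSETS.items() if tok in words]
            sense = matches[0] if len(matches) == 1 else None
        indexed.append(sense if sense is not None else tok)
    return indexed

doc = index_terms(["red", "automobile"], {})
query = index_terms(["car"], {"car": "car.vehicle"})
undecided = index_terms(["car"], {})
```

The query for “car” now matches the document containing “automobile” (better recall), and the rail sense is excluded (better precision), but only because a sense was supplied for the ambiguous word.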
Mixed results
Concept indexing with WordNet
[Scott98, Scott99]
– Using synsets and hypernyms with Ripper
– Fails because they do not perform WSD (Word Sense Disambiguation)
[Junker97]
– Using synsets and hypernyms as generalization operators in a specialized rule learner
– Fails because the proposed learning method gets lost in the hypothesis space
[Fukumoto01]
– Synsets and (limited) hypernyms for SVM, no WSD
– Improvement on less populated categories
In general
– Given that there is not a reliable WSD algorithm for (fine-grained) WordNet senses, current approaches do not perform WSD
– Improvements in small categories
– But I believe full, perfect WSD is not required.
Word importance
Feature selection: need a corpus for training
– Document frequency
– Information Gain (IG)
– Chi-square
Keyword extraction
Feature extraction
Others
– Using Wikipedia as training data and testing data
– Using the Web
– Bring order to words
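
Two of the feature selection scores above are easy to compute from a labelled corpus. The sketch below assumes documents are given as sets of tokens and uses the usual chi-square statistic for a 2x2 term/class table; the toy corpus and function names are invented for illustration.

```python
def document_frequency(term, docs):
    """Number of documents (given as token sets) that contain the term."""
    return sum(1 for tokens in docs if term in tokens)

def chi_square(term, labelled_docs, target):
    """Chi-square statistic of the 2x2 term/class contingency table.

    labelled_docs is a list of (token_set, label) pairs.
    """
    a = b = c = d = 0
    for tokens, label in labelled_docs:
        has_term = term in tokens
        in_class = label == target
        if has_term and in_class:
            a += 1  # term present, target class
        elif has_term:
            b += 1  # term present, other class
        elif in_class:
            c += 1  # term absent, target class
        else:
            d += 1  # term absent, other class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

# Toy labelled corpus (an assumption, not real data).
corpus = [
    ({"lecture", "exam", "student"}, "STUDENT"),
    ({"student", "homework"}, "STUDENT"),
    ({"professor", "research"}, "FACULTY"),
    ({"research", "grant", "lecture"}, "FACULTY"),
]
df_research = document_frequency("research", [tokens for tokens, _ in corpus])
chi_student = chi_square("student", corpus, "STUDENT")
chi_lecture = chi_square("lecture", corpus, "STUDENT")
```

A term that appears evenly across classes ("lecture") scores zero, while a term confined to one class ("student") scores high, so ranking by chi-square keeps the class-specific words.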
Other issues
Time
Word categories
– Common words
– Academic words
– Domain specific words
….
Semantic distance between word pairs
– Thesaurus based: WordNet
– Corpus based: e.g.
  Latent Semantic Analysis
  Statistical Thesaurus: co-occurrence
– Normalised Google Distance (NGD)
– Wikipedia based
  Wikipedia Link-based Measure (WLM)
  Explicit semantic analysis: ESA (state of the art)
NGD: Motivation and Goals
To represent meaning in a computer-digestible form
To establish semantic relations between common names of objects
Utilise the largest database in the world – the web
NGD definition
x = word one (e.g. 'horse')
y = word two (e.g. 'rider')
N = normalising factor (often M)
M = the cardinality of the set of all pages on the web
f(x) = frequency with which x occurs in the total set of documents
Because of log N, NGD is stable as the web grows
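
With these symbols, the NGD formula as published by Cilibrasi and Vitányi reads as follows. The quantity f(x, y), the number of pages containing both x and y, is not listed among the definitions above and is introduced here.

```latex
\mathrm{NGD}(x, y) =
  \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}
       {\log N - \min\{\log f(x), \log f(y)\}}
```

N enters only through its logarithm in the denominator, which is why the value changes slowly as the number of indexed pages grows.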
Example NGD(horse, rider)
“horse” returns 46,700,000 hits
“rider” returns 12,200,000 hits
“horse rider” returns 2,630,000 hits
Google indexed 8,058,044,651 pages
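
The example can be checked numerically. The sketch below assumes the standard NGD formula, (max(log f(x), log f(y)) − log f(x, y)) / (log N − min(log f(x), log f(y))), with N set to the number of indexed pages; the function name `ngd` is chosen for the example.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalised Google Distance from page counts.

    fx, fy: hits for each word alone; fxy: hits for both together;
    n: normalising factor, here the number of indexed pages.
    """
    log_fx, log_fy = math.log(fx), math.log(fy)
    numerator = max(log_fx, log_fy) - math.log(fxy)
    denominator = math.log(n) - min(log_fx, log_fy)
    return numerator / denominator

distance = ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651)
print(round(distance, 3))  # 0.443
```

The base of the logarithm cancels between numerator and denominator, so natural logarithms give the same value as any other base.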
Wikipedia Link similarity measure
Inlinks
Outlinks
Shared inlinks and outlinks, average of the two
– Inlinks: formula borrowed from NGD
– Outlinks: w(l,A) the weight of a link, similar to inverse document frequency
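
The inlink half of the measure can be sketched by applying the NGD formula to sets of linking articles instead of search hits. This is an illustration under assumptions: the article ids and the Wikipedia size are made up, the outlink half and the averaging step are left out, and returning infinity for articles with no shared inlinks is a choice made here.

```python
import math

def inlink_distance(inlinks_a, inlinks_b, total_articles):
    """NGD-style distance over the sets of articles linking to a and to b."""
    shared = inlinks_a & inlinks_b
    if not shared:
        return float("inf")  # no common inlinks: treated as unrelated
    larger = max(len(inlinks_a), len(inlinks_b))
    smaller = min(len(inlinks_a), len(inlinks_b))
    return (math.log(larger) - math.log(len(shared))) / (
        math.log(total_articles) - math.log(smaller)
    )

# Hypothetical ids of the articles that link to each of two concepts.
links_to_a = {1, 2, 3, 4}
links_to_b = {3, 4, 5}
d_ab = inlink_distance(links_to_a, links_to_b, total_articles=1000)
d_aa = inlink_distance(links_to_a, links_to_a, total_articles=1000)
d_none = inlink_distance({1, 2}, {3, 4}, total_articles=1000)
```

Two concepts linked from exactly the same articles get distance 0, and the distance grows as the overlap shrinks relative to the larger inlink set.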
Bag of concepts
WikiMiner, WLM, by Ian Witten, David Milne, Anna Huang
– Wikipedia based approach
– Concepts are anchor texts
  Can be phrases
  Also a way to select important words
– Use shared inlinks and outlinks to estimate the semantic distance between concepts
– New document similarity measure
There should be other ways
– to define concepts
– to select concepts
– to compare concepts
– …
Statistical Thesaurus
Existing human-developed thesauri are not easily available in all languages.
Human thesauri are limited in the type and range of synonymy and semantic relations they represent.
Semantically related terms can be discovered from statistical analysis of corpora.
Automatic Global Analysis
Determine term similarity through a pre-computed statistical analysis of the complete corpus.
Compute association matrices which quantify term correlations in terms of how frequently they co-occur.
Association Matrix

        w1    w2    w3    …    wn
w1      c11   c12   c13   …    c1n
w2      c21
w3      c31
…
wn      cn1

c_ij: correlation factor between term i and term j
f_ik: frequency of term i in document k
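
A common way to fill this matrix, assumed here because only the symbols are defined above, is c_ij = sum over all documents k of f_ik × f_jk. The sketch below computes it for a toy term-by-document frequency table; the counts are invented.

```python
def association_matrix(freq):
    """freq[i][k] is the frequency of term i in document k.

    Returns c with c[i][j] = sum over k of freq[i][k] * freq[j][k].
    """
    n_terms = len(freq)
    return [
        [sum(fi * fj for fi, fj in zip(freq[i], freq[j])) for j in range(n_terms)]
        for i in range(n_terms)
    ]

# Rows: terms w1..w3; columns: documents d1..d3 (toy counts).
freq = [
    [2, 0, 1],
    [1, 0, 1],
    [0, 3, 0],
]
c = association_matrix(freq)
```

The matrix is symmetric, and terms that never share a document (w1 and w3 here) get a correlation factor of zero.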
Normalized Association Matrix
Frequency based correlation factor favors more frequent terms.
Normalize the association scores to correct for this.
Normalized score is 1 if two terms have the same frequency in all documents.
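
One normalization that satisfies the last statement, and the one usually given for association matrices, is shown below. It is an assumption here, as is the definition of c_ij as a sum of frequency products over documents.

```latex
s_{ij} = \frac{c_{ij}}{c_{ii} + c_{jj} - c_{ij}},
\qquad
c_{ij} = \sum_{k} f_{ik} \, f_{jk}

% If f_{ik} = f_{jk} for every document k, then c_{ij} = c_{ii} = c_{jj}, so
s_{ij} = \frac{c_{ii}}{c_{ii} + c_{ii} - c_{ii}} = 1
```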
Metric Correlation Matrix
Association correlation does not account for the proximity of terms in documents, just co-occurrence frequencies within documents.
Metric correlations account for term proximity.
V_i: set of all occurrences of term i in any document.
r(k_u, k_v): distance in words between word occurrences k_u and k_v (∞ if k_u and k_v are occurrences in different documents).
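
With these symbols, the metric correlation factor is usually written as a sum of inverse distances over all pairs of occurrences. The formula below is that common form, assumed here rather than taken from the slide.

```latex
c_{ij} = \sum_{k_u \in V_i} \; \sum_{k_v \in V_j} \frac{1}{r(k_u, k_v)}
```

Occurrences in different documents have infinite distance and therefore contribute nothing to the sum.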
Normalized Metric Correlation Matrix
Normalize scores to account for term frequencies.
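
The sketch below computes a metric correlation and its normalized form for two terms. It assumes the common definitions c_ij = sum of 1 / r(k_u, k_v) over all occurrence pairs and s_ij = c_ij / (|V_i| × |V_j|); the function names and the two toy documents are invented, and skipping a pair made of the same occurrence (distance 0) is a choice made here.

```python
def occurrences(docs, term):
    """All (doc_index, position) pairs at which term occurs."""
    return [
        (d, pos)
        for d, tokens in enumerate(docs)
        for pos, tok in enumerate(tokens)
        if tok == term
    ]

def metric_correlation(docs, term_i, term_j):
    """Sum of 1 / r over occurrence pairs, r = distance in words.

    Pairs in different documents have r = infinity and contribute 0.
    """
    total = 0.0
    for doc_u, pos_u in occurrences(docs, term_i):
        for doc_v, pos_v in occurrences(docs, term_j):
            if doc_u == doc_v and pos_u != pos_v:
                total += 1.0 / abs(pos_u - pos_v)
    return total

def normalized_metric_correlation(docs, term_i, term_j):
    """Metric correlation divided by |V_i| * |V_j|."""
    n_i = len(occurrences(docs, term_i))
    n_j = len(occurrences(docs, term_j))
    if n_i == 0 or n_j == 0:
        return 0.0
    return metric_correlation(docs, term_i, term_j) / (n_i * n_j)

# Toy documents (assumptions, not from the course material).
docs = [
    "the horse threw the rider".split(),
    "a rider bought a horse".split(),
]
c = metric_correlation(docs, "horse", "rider")
s = normalized_metric_correlation(docs, "horse", "rider")
```

In each document the two words are three positions apart, so c is 1/3 + 1/3, and dividing by the 2 × 2 occurrence pairs gives s = 1/6.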