Probabilistic Language Processing (Chapter 23)
Probabilistic Language Models
- Goal: define a probability distribution over a set of strings
- Unigram, bigram, n-gram models
- Count using a corpus, but need smoothing:
  – add-one (Laplace)
  – linear interpolation
- Evaluate with the perplexity measure
- E.g., segment words without spaces using Viterbi
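The counting, add-one smoothing, and perplexity ideas above can be sketched in a few lines; the tiny corpus and test sentence are invented for illustration:

```python
# Bigram model with add-one smoothing, evaluated by perplexity.
from collections import Counter
import math

corpus = "the cat sat on the mat the cat ate".split()
vocab = set(corpus)
V = len(vocab)

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w2, w1):
    # Add-one smoothing: every possible bigram gets a pseudo-count of 1.
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

def perplexity(words):
    # Perplexity = 2 ** (average negative log2 probability per transition).
    log_prob = sum(math.log2(p_bigram(w2, w1))
                   for w1, w2 in zip(words, words[1:]))
    return 2 ** (-log_prob / (len(words) - 1))

print(p_bigram("cat", "the"))
print(perplexity("the cat sat".split()))
```

Lower perplexity means the model finds the test string less surprising.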
PCFGs
- Rewrite rules have probabilities
- The probability of a string is the sum of the probabilities of its parse trees
- Context-freedom means no lexical constraints
- Assigns higher probability to short sentences
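The tree-probability computation is just a product of rule probabilities over the tree; the toy grammar and parse below are invented to make that concrete:

```python
# Each rule (lhs -> rhs) carries a probability; a parse tree's probability
# is the product of the probabilities of the rules it uses, and a string's
# probability is the sum over all of its parse trees.
rules = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("she",)): 0.4,
    ("NP", ("fish",)): 0.6,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V",)): 0.3,
    ("V", ("eats",)): 1.0,
}

def tree_prob(tree):
    # tree = (label, children); leaves are plain strings
    label, children = tree
    child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rules[(label, child_labels)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)
    return p

t = ("S", (("NP", ("she",)),
           ("VP", (("V", ("eats",)), ("NP", ("fish",))))))
print(tree_prob(t))  # 1.0 * 0.4 * 0.7 * 1.0 * 0.6
```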
Learning PCFGs
- Parsed corpus: count rule uses in the trees
- Unparsed corpus:
  – rule structure known: use EM (the inside-outside algorithm)
  – rules unknown: restrict to Chomsky normal form… problems remain
Information Retrieval
- Goal: Google. Find documents relevant to the user's needs.
- An IR system has a document collection, a query in some language, a result set, and a presentation of the results
- Ideally, parse the documents into a knowledge base… too hard
IR 2
- Boolean keyword model: each document is either in or out
- Problem: a single bit of "relevance"; Boolean combinations are a bit mysterious
- How to compute P(R=true | D,Q)? Estimate a language model for each document, then compute the probability of the query given that model
- Can rank documents by P(R|D,Q) / P(~R|D,Q)
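A minimal sketch of that language-model scoring, with invented documents and query: estimate a smoothed unigram model per document and rank by the probability of the query under each model.

```python
# Query-likelihood ranking: score each document by P(query | doc model),
# using add-one smoothing so unseen query words don't zero out the score.
from collections import Counter
import math

docs = {
    "d1": "probabilistic models of language".split(),
    "d2": "language models for retrieval rank documents".split(),
}
vocab = {w for d in docs.values() for w in d}

def query_log_prob(query, doc):
    counts = Counter(doc)
    n, V = len(doc), len(vocab)
    return sum(math.log((counts[w] + 1) / (n + V)) for w in query)

query = "language models".split()
ranked = sorted(docs, key=lambda d: query_log_prob(query, docs[d]),
                reverse=True)
print(ranked)
```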
IR 3
- For this, we need a model of how queries are related to documents
- Bag of words: frequencies of words in the document; naive Bayes
- Good example pp
Evaluating IR
- Precision: the proportion of results that are relevant
- Recall: the proportion of relevant documents that appear in the results
- ROC curve (there are several varieties): the standard form plots the true positive rate against the false positive rate
- More "practical" for the web: reciprocal rank of the first relevant result, or just "time to answer"
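The three measures above are simple ratios; a sketch with invented document IDs:

```python
# Precision, recall, and reciprocal rank for one ranked result list.
results = ["d1", "d2", "d3", "d4"]
relevant = {"d2", "d4", "d7"}

hits = [d for d in results if d in relevant]
precision = len(hits) / len(results)   # fraction of results that are relevant
recall = len(hits) / len(relevant)     # fraction of relevant docs retrieved

# Reciprocal rank of the first relevant result (ranks are 1-based):
rr = next(1 / (i + 1) for i, d in enumerate(results) if d in relevant)
print(precision, recall, rr)  # 0.5 0.666... 0.5
```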
IR Refinements
- Case folding
- Stemming
- Synonyms
- Spelling correction
- Metadata (e.g., keywords)
IR Presentation
- Give a list in order of relevance; deal with duplicates
- Cluster results into classes:
  – agglomerative
  – k-means
- How to describe automatically generated clusters? A word list? The title of the centroid document?
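As one of the two clustering options above, k-means can be sketched in a few lines; the 1-D "document scores" and initial centers here are invented toy data:

```python
# Plain k-means: assign each point to its nearest center, then move each
# center to the mean of its assigned points; repeat.
def kmeans(points, centers, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans([1.0, 1.2, 0.9, 8.0, 8.3], [0.0, 10.0])
print(centers)
```

Agglomerative clustering instead starts with each result in its own cluster and repeatedly merges the closest pair.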
IR Implementation
- CSC172!
- Lexicon with a "stop list"; an "inverted" index recording where words occur
- Match with vectors: the vector of word frequencies dotted with the query terms
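A minimal sketch of that implementation, with an invented stop list and documents: build the inverted index, then score candidates by dotting term-frequency vectors with the query terms.

```python
# Inverted index with a stop list, plus dot-product scoring of
# term-frequency vectors against the query terms.
from collections import Counter, defaultdict

STOP = {"the", "a", "of", "in"}
docs = {
    "d1": "the index of the words".split(),
    "d2": "words occur in the index".split(),
}

inverted = defaultdict(set)          # word -> set of docs containing it
for doc_id, words in docs.items():
    for w in words:
        if w not in STOP:
            inverted[w].add(doc_id)

def score(query):
    q = [w for w in query if w not in STOP]
    candidates = set().union(*(inverted[w] for w in q if w in inverted))
    # dot product of each doc's term-frequency vector with the query terms
    return {d: sum(Counter(docs[d])[w] for w in q) for d in candidates}

print(sorted(inverted["index"]))
print(score("index words".split()))
```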
Information Extraction
- Goal: create database entries from documents
- Emphasis on massive data, speed, stylized expressions
- Regular-expression grammars are OK if the text is stylized enough
- Cascaded finite-state transducers: stages of grouping and structure-finding
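For stylized text, a single regular expression can pull out database-style tuples; the pattern and the sample sentence below are invented:

```python
# Regex-based information extraction: (company, price) pairs from
# stereotyped financial prose.
import re

text = "Shares of Acme Corp. closed at $12.50; Widgets Inc. closed at $7.25."
pattern = re.compile(r"([A-Z][a-z]+ (?:Corp|Inc)\.) closed at \$(\d+\.\d+)")

records = [(name, float(price)) for name, price in pattern.findall(text)]
print(records)  # [('Acme Corp.', 12.5), ('Widgets Inc.', 7.25)]
```

A cascaded system would run several such stages, each grouping the previous stage's output into larger structures.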
Machine Translation Goals
- Rough translation (e.g., p. 851)
- Restricted domain (mergers, weather)
- Pre-edited (Caterpillar or Xerox English)
- Literary translation: not yet!
- Interlingua, or a canonical semantic representation such as Conceptual Dependency
- Basic problem: different languages, different categories
MT in Practice
- Transfer: uses a database of rules for translating small units of language
- Memory-based: memorize sentence pairs
- Good diagram p. 853
Statistical MT
- Bilingual corpus; find the most likely translation given the corpus
- argmax_F P(F|E) = argmax_F P(E|F) P(F)
- P(F) is the language model; P(E|F) is the translation model
- Lots of interesting problems: fertility ("home" vs. "à la maison")
- Horribly drastic simplifications and hacks work pretty well!
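The argmax above can be sketched as a toy noisy-channel decoder; the candidate translations and all probabilities here are invented numbers, not estimates from any corpus:

```python
# Noisy-channel MT scoring: choose argmax_F P(E|F) * P(F), where "lm" is
# the language-model probability P(F) and "tm" the translation model P(E|F).
candidates = {
    "a la maison": {"lm": 0.020, "tm": 0.30},
    "maison": {"lm": 0.050, "tm": 0.10},
}

best = max(candidates,
           key=lambda f: candidates[f]["tm"] * candidates[f]["lm"])
print(best)
```

A real decoder searches a huge space of candidate strings instead of a fixed list, but the scoring rule is the same.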
Learning and MT
- Statistical MT needs: a language model, a fertility model, a word-choice model, and an offset model
- Millions of parameters
- Estimate by counting and EM