1
Probabilistic Language Processing Chapter 23
2
Probabilistic Language Models
Goal: define a probability distribution over a set of strings
Unigram, bigram, n-gram models
Count from a corpus, but counts need smoothing:
– add-one (Laplace)
– linear interpolation
Evaluate with the perplexity measure
E.g., segment text written without spaces ("segmentwordswithoutspaces") with Viterbi
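A minimal sketch of the bigram case, using a tiny invented corpus; add-one smoothing and the perplexity measure are as described above:

```python
import math
from collections import Counter

def bigram_model(corpus):
    """Count unigrams and bigrams from a list of token lists."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent              # sentence-start marker
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def prob(w_prev, w, unigrams, bigrams, vocab_size):
    """Add-one (Laplace) smoothed bigram probability P(w | w_prev)."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

def perplexity(sent, unigrams, bigrams, vocab_size):
    """Perplexity = inverse sentence probability, normalized by length."""
    toks = ["<s>"] + sent
    log_p = sum(math.log(prob(a, b, unigrams, bigrams, vocab_size))
                for a, b in zip(toks, toks[1:]))
    return math.exp(-log_p / len(sent))

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = bigram_model(corpus)
pp = perplexity(["the", "cat", "sat"], uni, bi, len(uni))
print(pp)  # lower perplexity = the model finds the sentence less surprising
```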
3
PCFGs
Rewrite rules have probabilities.
The probability of a string is the sum of the probabilities of its parse trees.
Context-freedom means no lexical constraints.
Prefers short sentences.
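A sketch of the tree-probability part, with a toy grammar invented for illustration (the rule set and probabilities are not from the text). A tree's probability is the product of its rule probabilities; a string's probability would then be the sum over all its trees:

```python
# Toy PCFG: (LHS, RHS tuple) -> probability. Probabilities for each LHS sum to 1.
rules = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("fish",)): 0.5,
    ("NP", ("people",)): 0.5,
    ("VP", ("V", "NP")): 0.6,
    ("VP", ("V",)): 0.4,
    ("V", ("fish",)): 1.0,
}

def tree_prob(tree):
    """Probability of a parse tree = product of its rule probabilities.
    A tree is (label, child, ...) where each child is a tree or a terminal string."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rules[(label, rhs)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)
    return p

# One parse of "people fish": S -> NP VP, NP -> people, VP -> V, V -> fish
t = ("S", ("NP", "people"), ("VP", ("V", "fish")))
print(tree_prob(t))  # 1.0 * 0.5 * 0.4 * 1.0 = 0.2
```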
4
Learning PCFGs
Parsed corpus: count rule uses in the trees.
Unparsed corpus:
– rule structure known: use EM (the inside-outside algorithm)
– rules unknown: Chomsky normal form… problems.
5
Information Retrieval
Goal: Google. Find docs relevant to the user's needs.
An IR system has a document collection, a query in some language, a set of results, and a presentation of results.
Ideally, parse the docs into a knowledge base… too hard.
6
IR 2
Boolean keyword model: a doc is in or out.
Problem: a single bit of "relevance"; Boolean combinations are a bit mysterious.
How to compute P(R=true | D,Q)? Estimate a language model for each doc, then compute the probability of the query given that model.
Can rank documents by P(R|D,Q)/P(~R|D,Q).
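A minimal sketch of the language-model ranking idea, with two invented toy docs: build a smoothed unigram model per doc and rank by the probability of the query under each model.

```python
from collections import Counter

docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
}
vocab = {w for d in docs.values() for w in d}

def query_likelihood(query, doc):
    """P(query | doc's unigram model), add-one smoothed over the vocabulary."""
    counts = Counter(doc)
    p = 1.0
    for w in query:
        p *= (counts[w] + 1) / (len(doc) + len(vocab))
    return p

query = "cat mat".split()
ranking = sorted(docs, key=lambda d: query_likelihood(query, docs[d]), reverse=True)
print(ranking)  # d1 ranks first: it contains both "cat" and "mat"
```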
7
IR 3
For this, we need a model of how queries are related to docs.
Bag of words: word frequencies in the doc, naïve Bayes.
Good example on pp. 842-843.
8
Evaluating IR
Precision: the proportion of results that are relevant.
Recall: the proportion of relevant docs that appear in the results.
ROC curve (there are several varieties): the standard is to plot false negatives vs. false positives.
More "practical" measures for the web: reciprocal rank of the first relevant result, or just "time to answer".
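The two slide definitions, plus reciprocal rank, as a small worked sketch (doc IDs invented):

```python
def precision_recall(results, relevant):
    """Precision: fraction of results that are relevant.
    Recall: fraction of relevant docs that appear in the results."""
    results, relevant = set(results), set(relevant)
    hits = len(results & relevant)
    return hits / len(results), hits / len(relevant)

def reciprocal_rank(results, relevant):
    """1 / rank of the first relevant result (0 if none is relevant)."""
    for rank, d in enumerate(results, 1):
        if d in relevant:
            return 1 / rank
    return 0.0

# 3 of the 4 returned docs are relevant; 3 of the 5 relevant docs were found.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d3", "d7", "d8"])
print(p, r)  # 0.75 0.6
print(reciprocal_rank(["d9", "d1", "d4"], {"d1", "d2"}))  # first hit at rank 2 -> 0.5
```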
9
IR Refinements
Case folding
Stemming
Synonyms
Spelling correction
Metadata: keywords
10
IR Presentation
Give the list in order of relevance; deal with duplicates.
Cluster results into classes:
– agglomerative
– k-means
How to describe automatically generated clusters? A word list? The title of the centroid doc?
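A naive sketch of the k-means option over bag-of-words vectors, with toy docs invented here (agglomerative clustering would instead repeatedly merge the closest pair of clusters):

```python
from collections import Counter

def dist(u, v):
    """Squared Euclidean distance between sparse word-frequency dicts."""
    return sum((u.get(w, 0) - v.get(w, 0)) ** 2 for w in set(u) | set(v))

def mean(cluster):
    """Centroid of a list of sparse frequency dicts."""
    m = Counter()
    for v in cluster:
        m.update(v)
    return {w: c / len(cluster) for w, c in m.items()}

def kmeans(vectors, k, iters=10):
    """Naive k-means; returns a cluster index for each vector."""
    centroids = vectors[:k]  # naive init: the first k docs
    assign = []
    for _ in range(iters):
        assign = [min(range(k), key=lambda i: dist(v, centroids[i]))
                  for v in vectors]
        new = []
        for i in range(k):
            members = [v for v, a in zip(vectors, assign) if a == i]
            new.append(mean(members) if members else centroids[i])
        centroids = new
    return assign

docs = ["cat cat pet", "cat pet", "dog pet", "dog dog pet"]
vectors = [Counter(d.split()) for d in docs]
print(kmeans(vectors, 2))  # cat docs land in one cluster, dog docs in the other
```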
11
IR Implementation
CSC172! A lexicon with a "stop list", and an "inverted" index recording where words occur.
Match with vectors: the vector of word frequencies dotted with the query terms.
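A compact sketch of both pieces, using invented toy docs and a tiny stop list: build the inverted index, then score candidates by the dot product of frequency vectors with the query.

```python
from collections import Counter, defaultdict

STOP = {"the", "a", "of", "on"}  # tiny "stop list"

docs = {
    "d1": "the cat sat on the mat",
    "d2": "the mat store",
}

# Inverted index: word -> set of docs in which it occurs; plus per-doc frequencies.
index = defaultdict(set)
freqs = {}
for name, text in docs.items():
    toks = [w for w in text.split() if w not in STOP]
    freqs[name] = Counter(toks)
    for w in toks:
        index[w].add(name)

def score(query):
    """Dot product of each candidate doc's frequency vector with the query vector."""
    q = Counter(w for w in query.split() if w not in STOP)
    candidates = set().union(*(index[w] for w in q if w in index))
    return {d: sum(freqs[d][w] * c for w, c in q.items()) for d in candidates}

print(score("cat mat"))  # d1 matches both terms, d2 only one
```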
12
Information Extraction
Goal: create database entries from docs.
Emphasis on massive data, speed, stylized expressions.
Regular-expression grammars are OK if the text is stylized enough.
Cascaded finite-state transducers: stages of grouping and structure-finding.
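A minimal sketch of the regular-expression case on invented, suitably stylized text: one pattern pulls (company, price) records straight into database-style tuples.

```python
import re

# Stylized text, invented for illustration.
text = ("Shares of Acme Corp. closed at $12.50. "
        "Shares of Widget Inc. closed at $7.25.")

# Lazy (.+?) stops at the first " closed at"; [\d.]+ captures the price.
pattern = re.compile(r"Shares of (.+?) closed at \$([\d.]+)\.")

records = [(company, float(price)) for company, price in pattern.findall(text)]
print(records)  # [('Acme Corp.', 12.5), ('Widget Inc.', 7.25)]
```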
13
Machine Translation Goals
Rough translation (e.g., p. 851)
Restricted domain (mergers, weather)
Pre-edited (Caterpillar or Xerox English)
Literary translation: not yet!
Interlingua: a canonical semantic representation like Conceptual Dependency
Basic problem: different languages, different categories.
14
MT in Practice
Transfer: uses a database of rules for translating small units of language.
Memory-based: memorize sentence pairs.
Good diagram on p. 853.
15
Statistical MT
Bilingual corpus; find the most likely translation given the corpus.
argmax_F P(F|E) = argmax_F P(E|F)P(F)
P(F) is the language model; P(E|F) is the translation model.
Lots of interesting problems: fertility ("home" vs. "à la maison").
Horribly drastic simplifications and hacks work pretty well!
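A sketch of the noisy-channel decision rule above, with invented toy probabilities (not from any real corpus): the translation model may slightly prefer a word-for-word candidate, but the language model P(F) penalizes disfluent French word order.

```python
# Candidate French translations F for a fixed English input E.
# All numbers below are invented toy probabilities.
translation_model = {      # P(E | F)
    "la maison bleue": 0.30,
    "la bleue maison": 0.35,
}
language_model = {         # P(F): fluent word order scores higher
    "la maison bleue": 0.20,
    "la bleue maison": 0.05,
}

# argmax_F P(E|F) * P(F)
best = max(translation_model,
           key=lambda f: translation_model[f] * language_model[f])
print(best)  # "la maison bleue": 0.30*0.20 = 0.06 beats 0.35*0.05 = 0.0175
```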
16
Learning and MT
Statistical MT needs: a language model, a fertility model, a word-choice model, and an offset model.
Millions of parameters.
Counting, estimation, EM.