Statistical NLP, Spring 2011
Lecture 3: Language Models II
Dan Klein – UC Berkeley
Smoothing
- We often want to make estimates from sparse statistics. Smoothing flattens spiky distributions so they generalize better.
- Very important all over NLP, but easy to do badly!
- We'll illustrate with bigrams today (h = previous word, but it could be anything).
- Example: observed counts for P(w | denied the): 3 allegations, 2 reports, 1 claims, 1 request (7 total).
- Smoothed counts for P(w | denied the): 2.5 allegations, 1.5 reports, 0.5 claims, 0.5 request, 2 other (7 total).
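As a concrete illustration of the flattening above, here is a minimal sketch of absolute discounting for a single context; with discount d = 0.5 it reproduces the 2.5 / 1.5 / 0.5 / 0.5 smoothed counts on the slide. Class and variable names are illustrative, not from the lecture.

```java
import java.util.HashMap;
import java.util.Map;

/** Absolute discounting for a single bigram context, here h = "denied the". */
public class AbsoluteDiscounting {
    public static void main(String[] args) {
        // Observed counts for P(w | denied the), as on the slide.
        Map<String, Integer> counts = new HashMap<>();
        counts.put("allegations", 3);
        counts.put("reports", 2);
        counts.put("claims", 1);
        counts.put("request", 1);

        double d = 0.5;  // discount subtracted from every seen count
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();  // 7
        double reserved = d * counts.size();  // 0.5 * 4 seen types = 2.0 counts of mass

        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double p = (e.getValue() - d) / total;  // e.g. (3 - 0.5) / 7 for "allegations"
            System.out.printf("P(%s | denied the) = %.3f%n", e.getKey(), p);
        }
        // The reserved mass is what gets spread over unseen words ("2 other" on the slide),
        // usually in proportion to a lower-order model.
        System.out.printf("mass reserved for unseen words = %.3f%n", reserved / total);
    }
}
```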
Kneser-Ney
- Kneser-Ney smoothing combines two ideas: absolute discounting and lower-order continuation probabilities.
- KN smoothing has repeatedly proven effective.
- Why should things work like this?
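The slide names the two ingredients but not the formula; the sketch below is a standard interpolated Kneser-Ney bigram estimator, P(w | h) = max(c(h,w) − d, 0) / c(h) + λ(h) · Pcont(w), where Pcont(w) is proportional to the number of distinct contexts that w follows. Class and field names are illustrative.

```java
import java.util.*;

/** Interpolated Kneser-Ney for bigrams (sketch; names are illustrative). */
public class KneserNeyBigram {
    private final Map<String, Map<String, Integer>> bigramCounts; // h -> (w -> count)
    private final double d;                                       // absolute discount

    public KneserNeyBigram(Map<String, Map<String, Integer>> bigramCounts, double d) {
        this.bigramCounts = bigramCounts;
        this.d = d;
    }

    /** Continuation probability: fraction of distinct bigram types that end in w. */
    private double pContinuation(String w) {
        int typesEndingInW = 0, totalTypes = 0;
        for (Map<String, Integer> ws : bigramCounts.values()) {
            totalTypes += ws.size();
            if (ws.containsKey(w)) typesEndingInW++;
        }
        return (double) typesEndingInW / totalTypes;
    }

    public double prob(String h, String w) {
        Map<String, Integer> ws = bigramCounts.getOrDefault(h, Collections.emptyMap());
        int contextTotal = ws.values().stream().mapToInt(Integer::intValue).sum();
        if (contextTotal == 0) return pContinuation(w);           // unseen context: back off fully
        double discounted = Math.max(ws.getOrDefault(w, 0) - d, 0.0) / contextTotal;
        double lambda = d * ws.size() / contextTotal;             // mass freed by discounting
        return discounted + lambda * pContinuation(w);
    }
}
```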
Predictive Distributions
- Parameter estimation from the observed sequence "abca": θ = P(w) = [a: 0.5, b: 0.25, c: 0.25].
- With a parameter variable Θ, we instead form the predictive distribution over the next word W.
- (Diagrams on the slide: the data "abca" alone, the data with a parameter node Θ, and the data with Θ and a predictive node W.)
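The slide's equations do not survive in the transcript; as a reconstruction consistent with the "abca" example, the maximum-likelihood estimate and the standard Dirichlet-multinomial predictive distribution (with an assumed symmetric pseudo-count α and vocabulary size V) are:

```latex
% MLE from the observed sequence w_{1:N} = a, b, c, a:
\hat{\theta}_w = \frac{c(w)}{N}
\quad\Rightarrow\quad
\hat{\theta} = [\,a\colon 0.5,\; b\colon 0.25,\; c\colon 0.25\,]

% Predictive distribution: integrate out the parameter variable \Theta
% under a symmetric Dirichlet(\alpha) prior over a vocabulary of size V:
P(W = w \mid w_{1:N})
  = \int P(w \mid \theta)\, p(\theta \mid w_{1:N})\, d\theta
  = \frac{c(w) + \alpha}{N + \alpha V}
```

As α → 0 this collapses to the MLE; larger α smooths toward the uniform distribution.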
"Chinese Restaurant" Processes [Teh, 06; diagrams from Teh]
- Dirichlet Process
- Pitman-Yor Process
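The restaurant diagrams are not in the transcript; for reference, the standard seating rule (not copied from the slide) for customer n + 1, with c_k customers at table k, K occupied tables, discount d, and strength θ, is:

```latex
% Seating probabilities for customer n+1:
P(\text{join table } k) = \frac{c_k - d}{\theta + n},
\qquad
P(\text{open a new table}) = \frac{\theta + d\,K}{\theta + n}
```

Setting d = 0 gives the Dirichlet process; d > 0 gives the Pitman-Yor process, whose power-law growth in the number of tables matches word-frequency statistics better.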
Hierarchical Models [MacKay and Peto, 94; Teh, 06]
(Diagram: per-context distributions Θ_a, Θ_b, …, Θ_g, each estimated from its own observed data and tied back to a shared parent distribution Θ_0.)
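The tree of distributions in the diagram corresponds to the recursion used in Teh's (2006) hierarchical Pitman-Yor language model, which backs off from a context u to its shortened context π(u); this is the standard published form, not read off the slide:

```latex
% c(u,w) = customer counts, t(u,w) = table counts,
% d, \theta = per-order discount and strength parameters.
P(w \mid u)
  = \frac{c(u,w) - d_{|u|}\, t(u,w)}{\theta_{|u|} + c(u)}
  + \frac{\theta_{|u|} + d_{|u|}\, t(u)}{\theta_{|u|} + c(u)}\; P\bigl(w \mid \pi(u)\bigr)
```

With one table per word type and fixed discounts, this recursion reduces to interpolated Kneser-Ney, which is one answer to the previous slide's "why should things work like this?"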
What Actually Works?
- Trigrams and beyond: unigrams and bigrams are generally useless on their own; trigrams are much better (when there's enough data); 4- and 5-grams are really useful in MT, but not so much for speech.
- Discounting: absolute discounting, Good-Turing, held-out estimation, Witten-Bell.
- Context counting: Kneser-Ney's construction of lower-order models.
- See the [Chen+Goodman] reading for tons of graphs! [Graphs from Joshua Goodman]
Data >> Method?
- Having more data is better…
- … but so is using a better estimator.
- Another issue: N > 3 has huge costs in speech and MT decoders.
Tons of Data? [Brants et al., 2007]
Large Scale Methods
- Language models get big, fast.
- English Gigaword corpus: 2G tokens, 0.3G trigrams, 1.2G 5-grams.
- Google N-grams: 13M unigrams, 0.3G bigrams, ~1G each of 3-, 4-, and 5-grams.
- Need to access entries very often, ideally in memory.
- What do you do when language models get too big?
  - Distributing LMs across machines
  - Quantizing probabilities (see the sketch below)
  - Random hashing (e.g. Bloom filters) [Talbot and Osborne 07]
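As a concrete illustration of the quantization item (an assumed scheme, not the one from the lecture): bucket log10 probabilities over a fixed range into 256 levels, so each value fits in one byte instead of an 8-byte double.

```java
/** Quantize log10 probabilities into 8-bit codes (sketch; range and bin count are assumed). */
public class LogProbQuantizer {
    private static final double MIN_LOGPROB = -12.0;  // assumed floor for n-gram log10 probs
    private static final double MAX_LOGPROB = 0.0;
    private static final int BINS = 256;

    public static byte encode(double logProb) {
        double clipped = Math.max(MIN_LOGPROB, Math.min(MAX_LOGPROB, logProb));
        int bin = (int) ((clipped - MIN_LOGPROB) / (MAX_LOGPROB - MIN_LOGPROB) * (BINS - 1));
        return (byte) bin;                             // 1 byte instead of an 8-byte double
    }

    public static double decode(byte code) {
        int bin = code & 0xFF;                         // undo sign extension
        return MIN_LOGPROB + bin * (MAX_LOGPROB - MIN_LOGPROB) / (BINS - 1);
    }
}
```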
A Simple Java Hashmap?
Per 3-gram, at best (assuming Strings are canonicalized):
- 1 pointer = 8 bytes
- 1 Map.Entry = 8 bytes (object) + 3x8 bytes (pointers)
- 1 Double = 8 bytes (object) + 8 bytes (double)
- 1 String[] = 8 bytes (object) + 3x8 bytes (pointers)
- Total: > 88 bytes
Obvious alternatives:
- Sorted arrays (see the sketch below)
- Open addressing
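A sketch of the sorted-array alternative mentioned above: pack the three word ids of a trigram into a single long (assuming ids fit in 21 bits each) and keep a parallel array of values, so each entry costs about 12 bytes rather than 88+. Names are illustrative.

```java
import java.util.Arrays;

/** Sorted-array trigram store (sketch). Assumes word ids fit in 21 bits each. */
public class SortedTrigramArray {
    private final long[] keys;       // packed (w1, w2, w3), sorted ascending
    private final float[] logProbs;  // parallel array of values

    public SortedTrigramArray(long[] sortedKeys, float[] logProbs) {
        this.keys = sortedKeys;
        this.logProbs = logProbs;
    }

    public static long pack(int w1, int w2, int w3) {
        return ((long) w1 << 42) | ((long) w2 << 21) | w3;
    }

    /** Returns the stored log-prob, or NaN if the trigram is unseen. */
    public float lookup(int w1, int w2, int w3) {
        int i = Arrays.binarySearch(keys, pack(w1, w2, w3));
        return i >= 0 ? logProbs[i] : Float.NaN;
    }
}
```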
Word+Context Encodings
Compression
Memory Requirements
Speed and Caching
LM Interfaces
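A minimal sketch of what an n-gram LM interface can look like; the names below are hypothetical and not taken from the lecture or any particular toolkit.

```java
/** Hypothetical minimal n-gram LM interface (sketch). */
public interface NgramLanguageModel {
    /** Maximum n-gram order, e.g. 3 for a trigram model. */
    int getOrder();

    /** Log probability of the last word in ngram[from..to), conditioned on the preceding words. */
    double getLogProb(int[] ngram, int from, int to);

    /** Map vocabulary strings to the dense integer ids the model stores internally. */
    int[] wordsToIds(String[] words);
}
```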
Approximate LMs
- Simplest option: hash-and-hope (sketched below)
  - Array of size K ~ N
  - (Optionally) store a hash of each key
  - Store values via direct addressing or open addressing
  - On collisions, store the max
  - What kind of errors can there be?
- More complex options: Bloom filters (originally for membership queries, but see Talbot and Osborne 07), perfect hashing, etc.
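A sketch of hash-and-hope with open addressing, following the bullets above: keys are never stored, only a fingerprint and a value, and fingerprint collisions keep the maximum value, so lookups can return a value for an unseen n-gram or an overestimate, but never silently lose a stored entry. The hash choice and names are illustrative.

```java
/** Hash-and-hope n-gram store (sketch): keys are never stored, only a fingerprint and a value. */
public class HashAndHopeLM {
    private final long[] fingerprints;  // 0 marks an empty slot
    private final float[] values;

    public HashAndHopeLM(int capacity) {  // pick capacity K ~ number of n-grams N
        fingerprints = new long[capacity];
        values = new float[capacity];
    }

    public void put(String ngram, float logProb) {
        long fp = fingerprint(ngram);
        int i = slot(fp);
        // Linear probing for an empty slot or a matching fingerprint
        // (assumes the table is never completely full).
        while (fingerprints[i] != 0 && fingerprints[i] != fp) i = (i + 1) % fingerprints.length;
        if (fingerprints[i] == fp) {
            values[i] = Math.max(values[i], logProb);  // fingerprint collision: keep the max
        } else {
            fingerprints[i] = fp;
            values[i] = logProb;
        }
    }

    public float get(String ngram) {
        long fp = fingerprint(ngram);
        for (int i = slot(fp); fingerprints[i] != 0; i = (i + 1) % fingerprints.length) {
            if (fingerprints[i] == fp) return values[i];  // may belong to a colliding n-gram
        }
        return Float.NEGATIVE_INFINITY;  // treat as unseen
    }

    private long fingerprint(String ngram) {
        // Weak fingerprint (32 bits of entropy) for brevity; a real system would hash harder.
        long h = ngram.hashCode() * 0x9E3779B97F4A7C15L;
        return h == 0 ? 1 : h;  // reserve 0 for "empty"
    }

    private int slot(long fp) {
        return (int) Math.floorMod(fp, (long) fingerprints.length);
    }
}
```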
Beyond N-Gram LMs
Lots of ideas we won't have time to discuss:
- Caching models: recent words are more likely to appear again
- Trigger models: recent words trigger other words
- Topic models
A few other classes of ideas:
- Syntactic models: use tree models to capture long-distance syntactic effects [Chelba and Jelinek, 98]
- Discriminative models: set n-gram weights to improve final task accuracy rather than to fit training-set density [Roark, 05, for ASR; Liang et al., 06, for MT]
- Structural zeros: some n-grams are syntactically forbidden; keep estimates at zero if they look like real zeros [Mohri and Roark, 06]
- Bayesian document and IR models [Daumé 06]