From Word Spotting to OOV Modeling
Paul Fitzpatrick (6345g11)

Title graphic:
  OOV OOV spotting OOV OOV OOV
  [f r ah m] OOV spotting [t uw] OOV OOV
  [f r ah m] [w er d] spotting [t uw] OOV [m aa d el ih ng]

Goal
  To automatically extract filler vocabulary for word-spotting

Why?
  So language model has something to work with
  May improve recognition accuracy on keywords
  Gives earlier payoff in domain-specific training

Scenario
  Start with small lexicon (e.g. 5-50 words)
  Start with weak language model
  Bootstrap by clustering filler vocabulary from large collection of untranscribed data
Methodology

Flowchart labels:
  Run recognizer
  Extract OOV fragments
  Identify competition
  Identify rarely-used additions
  Remove from lexicon
  Add to lexicon
  Update lexicon, baseforms
  Hypothesized transcript
  N-Best hypotheses
  Update Language Model
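
A minimal sketch of how the "Add to lexicon" and "Remove from lexicon" steps could feed "Update lexicon", assuming OOV fragments arrive as phone strings. The function name, its parameters, and both thresholds are hypothetical and not taken from the slides. The "Identify competition" step, the N-best hypotheses, and the language model update are not modeled here.

```python
from collections import Counter

def update_lexicon(lexicon, additions, oov_fragments, usage, add_min=2, keep_min=1):
    """One hypothetical lexicon-update pass.

    lexicon:       set of all current entries (keywords plus earlier additions)
    additions:     subset of lexicon that was added automatically
    oov_fragments: phone strings extracted from OOV regions in this pass
    usage:         dict mapping entries to how often they appeared in hypotheses
    add_min, keep_min: illustrative thresholds (assumptions)
    """
    counts = Counter(oov_fragments)
    # Add to lexicon: fragments that recur often enough to look like words.
    to_add = {frag for frag, n in counts.items()
              if n >= add_min and frag not in lexicon}
    # Remove from lexicon: earlier additions the recognizer rarely uses.
    # The initial keywords are never candidates for removal.
    to_remove = {entry for entry in additions if usage.get(entry, 0) < keep_min}
    new_lexicon = (lexicon - to_remove) | to_add
    new_additions = (additions - to_remove) | to_add
    return new_lexicon, new_additions
```

Calling this once per recognizer run, then re-running the recognizer with the returned lexicon, gives the loop suggested by the flowchart.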
Results

Initial lexicon
  email, phone, room, office, address

Top 10 OOV clusters found (ranked by frequency)
   1. n ah m b er
   2. w eh r ih z
   3. w ah t ih z
   4. t eh l m iy
   5. k ix n y uw
   6. p l iy z
   7. ae ng k y uw
   8. n ow
   9. hh aw ax b aw
  10. g r uw p

Example sentence hypothesis
  (w ah t ih z) (ih t er z uw) room (n ah m b er)
  What is Victor Zue’s room number?
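
A small sketch of the two presentation steps behind this slide: ranking OOV phone strings by frequency, and writing a hypothesis with OOV clusters in parentheses and lexicon words bare. The function names are hypothetical. The sketch assumes fragments have already been clustered into identical phone strings, which the slides do not describe in detail.

```python
from collections import Counter

def rank_clusters(fragments, top_n=10):
    """Return the top_n most frequent phone strings, most frequent first."""
    return [frag for frag, _ in Counter(fragments).most_common(top_n)]

def format_hypothesis(tokens, lexicon):
    """Write lexicon words bare and wrap everything else in parentheses."""
    return " ".join(tok if tok in lexicon else f"({tok})" for tok in tokens)

# Initial lexicon and example hypothesis taken from the slide.
lexicon = {"email", "phone", "room", "office", "address"}
tokens = ["w ah t ih z", "ih t er z uw", "room", "n ah m b er"]
print(format_hypothesis(tokens, lexicon))
# prints: (w ah t ih z) (ih t er z uw) room (n ah m b er)
```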