August 6th, ISAAC 2008. Word Prediction in Hebrew: Preliminary and Surprising Results. Yael Netzer, Meni Adler, Michael Elhadad. Department of Computer Science, Ben Gurion University, Israel.
Outline: Objectives and example. Methods of Word Prediction. Hebrew Morphology. Experiments and Results. Conclusions?
Word Prediction - Objectives: Ease word insertion in textual software – by guessing the next word – by giving a list of possible options for the next word – by completing a word given a prefix. General idea: guess the next word given the previous ones: [input w_1 w_2] [guess w_3].
Word Prediction Example: I s_____
Word Prediction Example: I s_____ (verb, adverb?)
Word Prediction Example: I s_____ (verb: sang? maybe. singularized? hopefully)
Word Prediction Example: I saw a _____
Word Prediction Example: I saw a _____ (noun / adjective)
Word Prediction Example: I saw a b____
Word Prediction Example: I saw a b____ (brown? big? bear? barometer?)
Word Prediction Example: I saw a bird in the _____
Word Prediction Example: I saw a bird in the _____ [semantics would help]
Word Prediction Example: I saw a bird in the z____
Word Prediction Example: I saw a bird in the z____ (obvious?)
(hidden) Hebrew example: הילדה שרצה כל היום התעייפה. Gloss: the-girl that-ran all the-day got-tired. Alternative reading of the opening words: the-girl swarmed all the-day.
Statistical Methods: Statistical information – Unigrams: probability of isolated words. Independent of context, offer the most likely words as candidates – More complex language models (Markov models): given w_1..w_n, determine the most likely candidate for w_{n+1} – The most common method in applications is the unigram (see references in [Garay-Vitoria and Abascal, 2004]).
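The statistical approach can be illustrated with a minimal sketch. It is not the system described in these slides: the function names, the toy corpus, and the choice to back off from bigram counts to unigram counts are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(tokens):
    """Count unigrams and, for each word, the words that follow it."""
    unigrams = Counter(tokens)
    successors = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        successors[prev][nxt] += 1
    return unigrams, successors


def predict(unigrams, successors, prev_word, prefix="", k=5):
    """Return up to k candidates that start with prefix.

    Words seen after prev_word come first (by bigram count); the rest are
    ranked by unigram count, which is a simple back-off assumed here.
    """
    seen = successors.get(prev_word, Counter())
    ranked = [w for w, _ in seen.most_common()]
    ranked += [w for w, _ in unigrams.most_common() if w not in seen]
    return [w for w in ranked if w.startswith(prefix)][:k]


# Toy corpus (hypothetical), echoing the English example from the slides.
corpus = "i saw a bird in the zoo and i saw a bear in the zoo".split()
unigrams, successors = train_bigrams(corpus)
print(predict(unigrams, successors, "a", prefix="b", k=2))
print(predict(unigrams, successors, "the", prefix="z", k=1))
```

With `prefix=""` the same function guesses the next word outright; with a non-empty prefix it performs word completion, which are two of the objectives listed earlier.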
Syntactic Methods: Syntactic knowledge – Consider sequences of part-of-speech tags: [Article] [Noun] predict [Verb] – Phrase structure: [Noun Phrase] predict [Verb] – Syntactic knowledge can be statistical or based on hand-coded rules.
Semantic Methods: Semantic knowledge – Assign semantic categories to words – Find a set of rules which constrain the possible candidates for the next word: [eat verb] predict [word of category food] – Not widely used in word prediction, mostly because it requires complex hand coding and is too inefficient for real-time operation.
Word Prediction Knowledge Sources: Corpora (texts and frequencies). Vocabularies (can be domain specific). Lexicons with syntactic and/or semantic knowledge. User’s history. Morphological analyzers. Unknown-word models.
Supporting methods: Recency promotion: prefer words that have been used recently. Trigger-target method: the occurrence of a specific word raises the rank of another word. Capitalization of proper nouns (not useful for Hebrew). Morphological support: automatically add inflections to words. Distinguish fringe/core words in prediction.
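Recency promotion can be sketched as a re-ranking step applied to the candidates a predictor has already scored. The multiplicative boost, its default value, and all names below are assumptions for illustration; the slides do not specify how recency is weighted.

```python
def rerank_with_recency(candidates, base_scores, recent_words, boost=2.0):
    """Sort candidates by score, multiplying the score of recent words by boost.

    base_scores maps word -> score from some underlying predictor
    (hypothetical here); words missing from it score 0.
    """
    recent = set(recent_words)

    def score(word):
        s = base_scores.get(word, 0.0)
        return s * boost if word in recent else s

    return sorted(candidates, key=score, reverse=True)


# Made-up scores: "bear" ranks last until the user has typed it recently.
scores = {"big": 0.5, "brown": 0.3, "bear": 0.2}
print(rerank_with_recency(["big", "brown", "bear"], scores, recent_words=[]))
print(rerank_with_recency(["big", "brown", "bear"], scores, ["bear"], boost=3.0))
```

A trigger-target rule could reuse the same shape, with the boosted set chosen from the targets of words already in the message rather than from the user's recent history.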
Drawbacks of Word Prediction: Overt action is required to verify a selection. Cognitive load: – “wrong candidates” distract the user from the message being composed – switching between two modes of operation: typing and selecting.
Evaluation of Word Prediction: Keystroke savings. Time savings. Overall satisfaction – Cognitive overload (length of choice list vs. accuracy). A predictor is considered adequate if its hit ratio stays high as the required number of selections decreases. Keystroke savings = 1 - (# of actual keystrokes / # of expected keystrokes).
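The keystroke-savings measure can be worked through on a toy case. The formula is the one given above; the typing model (type letters until the word shows up in the menu, then one keystroke to select it, with a trailing space counted in the expected total) and the `suggest` function are assumptions for illustration only.

```python
def keystroke_savings(actual_keystrokes, expected_keystrokes):
    """Return 1 - actual/expected, the keystroke-savings ratio."""
    if expected_keystrokes <= 0:
        raise ValueError("expected_keystrokes must be positive")
    return 1 - actual_keystrokes / expected_keystrokes


def keystrokes_with_prediction(word, suggest, menu_size):
    """Count keystrokes needed to enter word with a suggestion menu.

    Assumed typing model (not from the slides): the user types letters until
    the word appears among the first menu_size suggestions, then spends one
    keystroke selecting it. If it never appears, the user types every letter
    plus a trailing space.
    """
    for typed in range(len(word)):
        if word in suggest(word[:typed])[:menu_size]:
            return typed + 1
    return len(word) + 1


# Hypothetical ranked vocabulary standing in for a real predictor.
vocabulary = ["bird", "big", "bear", "brown"]


def suggest(prefix):
    return [w for w in vocabulary if w.startswith(prefix)]


actual = keystrokes_with_prediction("bear", suggest, menu_size=1)
expected = len("bear") + 1  # four letters plus a space
print(actual, expected, keystroke_savings(actual, expected))
```

Raising `menu_size` lowers the keystroke count in this model, which is the trade-off against cognitive load noted on the slide.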
Work in non-English Languages: Languages with rich morphology: – n-gram-based methods offer quite reasonable prediction [Trost et al. 2005] but can be improved with more sophisticated syntactic/semantic tools. Suggestions for inflected languages (e.g. Basque): – Use two lexicons: stems and suffixes – Add syntactic information to dictionaries and grammatical rules to the system; offer stems and suffixes – Combine these two approaches: offer inflected nouns.
Motivation for Hebrew: We need word prediction for Hebrew – No previously published research for Hebrew is known. We wanted to test our morphological analyzer in a useful application.
Initial Hypothesis: Word prediction in Hebrew will be complicated; morphological and syntactic knowledge will be needed.
Hebrew Specificity: Unvocalized writing causes a high level of ambiguity. Prefixes and suffixes: prepositions, definiteness and possessives are agglutinated. Rich morphology: inflectional, non-regular.
Hebrew Ambiguity: Unvocalized writing: most vowels are “dropped”: inherent → inhrnt. Affixation: prepositions and possessives are attached to nouns: in her note → inhrnt; in her net → inhrnt. Rich morphology – ‘inhrnt’ could be inflected into different forms according to sing/pl, masc/fem properties: inhrnti, inhrntit, inhrntiot – Other morphological properties may leave ‘inherent’ unmodified (construct/absolute forms for noun compounding).
Ambiguity Level: These variations create a high level of ambiguity: – English lexicon: inherent → inherent.adj – With Hebrew word formation rules: inhrnt → in.prep her.pro.fem.poss note.noun | in.prep her.pro.fem net.noun | inherent.adj.masc.absolute | inherent.adj.masc.construct. Parts of speech tagset: – Hebrew: theoretically ~300K, in practice ~3.6K distinct forms – English: tags. Number of possible morphological analyses per word: – English: 1.4 (average # words / sentence: 12) – Hebrew: 2.7 (average # words / sentence: 18).
(Real Hebrew) Morphological Ambiguity: בצלם bzlm – בְּצֶלֶם bzelem (name of an association) – בְּצַלֵּם b-zalem (while taking a picture) – בְּצָלָם bzalam (their onion) – בְּצִלָּם b-zila-m (under their shades) – בְּצַלָּם b-zalam (in a photographer) – בַּצַּלָּם ba-zalam (in the photographer) – בְּצֶלֶם b-zelem (in an idol) – בַּצֶּלֶם ba-zelem (in the idol)
Morphological Analysis: Given a written form, recover the following information: Lexical category (part of speech) – noun, verb, adjective, adverb, preposition… Inflectional properties – gender, number, person, tense, status… Affixes – Prefixes: מ ש ה ו כ ל ב (prepositions, conjunctions, definiteness) – Pronoun suffix: accusative, possessive, nominative.
Morphological Analysis Example: given the form בצלם propose the following analyses:
– בְּצֶלֶם: בצלם proper-noun
– בְּצַלֵּם: בצלם verb, infinitive
– בְּצָלָם: בצל-ם noun, singular, masculine
– בְּצִלָּם: ב-צל-ם noun, singular, masculine
– בְּצַלָּם, בְּצֶלֶם: ב-צלם noun, singular, masculine, absolute; ב-צלם noun, singular, masculine, construct
– בַּצַּלָּם, בַּצֶּלֶם: ב-צלם noun, definite, singular, masculine
Morphological Disambiguation: A difficult task in Hebrew: given a written form, select in context the correct morphological analysis out of all possible analyses. We have developed a successful* system to perform morphological disambiguation in Hebrew [Adler et al, ACL06, ACL07, ACL08]. (* 93% for POS tagging and 90% for full morphological analysis, which was used in this test.)
Word Prediction in Hebrew: We looked at word prediction as a sample task to show off the quality of our morphological disambiguator. But first… we checked a simple baseline.
Baseline: n-gram methods: Check n-gram methods (unigram, bigram, trigram). Four sizes of selection menus: 1, 5, 7 and 9. Various training sets of 1M, 10M and 27M words to learn the probabilities of n-grams. Various genres.
Prediction results using n-grams only: [Table not recovered: keystrokes needed to enter a message, in % (smaller is better).] For the trigram model trained on the 27M-word corpus: very good results!
Adding Syntactic Information: $P(w_n \mid w_1,\dots,w_{n-1}) = \lambda_1 P(w_{n-i},\dots,w_n \mid LM) + \lambda_2 P(w_1,\dots,w_n \mid \mu)$ – $\mu$ is the morpho-syntactic HMM (morphological disambiguator) – Combine $P(w_1,\dots,w_n \mid \mu)$ with the probabilistic language model LM in order to rank each word candidate given the previously typed words – If the user typed "I saw", and the next-word candidates are {him, hammer}, we use the HMM to calculate $P(\text{I saw him} \mid \mu)$ and $P(\text{I saw hammer} \mid \mu)$ in order to tune the probability given by the n-gram. * Trained on a 1M-word corpus.
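The interpolation above can be sketched as a re-ranking of candidates. Everything numeric here is made up: the slides do not give the values of λ1 and λ2 or any probabilities, and `lm_prob` and `hmm_prob` are hypothetical stand-ins for the n-gram model and the morpho-syntactic HMM.

```python
def interpolate(p_lm, p_hmm, lam_lm, lam_hmm):
    """Linear interpolation: lam_lm * P(... | LM) + lam_hmm * P(... | mu)."""
    return lam_lm * p_lm + lam_hmm * p_hmm


def rank_candidates(candidates, lm_prob, hmm_prob, lam_lm=0.7, lam_hmm=0.3):
    """Rank next-word candidates by their interpolated score, best first.

    lm_prob and hmm_prob map each candidate to the score that the n-gram
    model and the HMM assign to the typed words extended by that candidate.
    The default weights are arbitrary placeholders.
    """
    scores = {
        w: interpolate(lm_prob[w], hmm_prob[w], lam_lm, lam_hmm)
        for w in candidates
    }
    return sorted(candidates, key=scores.get, reverse=True)


# Toy numbers for the slide's example: the user typed "I saw".
lm_prob = {"him": 0.02, "hammer": 0.03}   # n-gram model slightly prefers "hammer"
hmm_prob = {"him": 0.5, "hammer": 0.1}    # HMM prefers pronoun after the verb
print(rank_candidates(["hammer", "him"], lm_prob, hmm_prob))
print(rank_candidates(["hammer", "him"], lm_prob, hmm_prob, lam_lm=1.0, lam_hmm=0.0))
```

Setting `lam_hmm=0.0` recovers the n-gram-only baseline, so the same function can be used to compare the two configurations.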
Results with morpho-syntactic knowledge: Model sequences of parts of speech with morphological features. [Tables not recovered: results with morpho-syntactic knowledge; results w/o syntactic knowledge.]
Some Notes on Results: n-grams perform very well (high level of keystroke saving). High rate for all genres. And the expected: – Better prediction when trained on more data – Better prediction with trigrams – Better prediction with a larger window. Morpho-syntactic information did not improve results (in fact, it hurt!).
Conclusion: Statistical data on a language with rich morphology yields good results. Keystrokes needed to enter a message, in % (smaller is better): – as low as 29% with nine word proposals – 34% for seven proposals – 54% for a single proposal. Syntactic information did not improve the prediction. Explanation: morphology did not improve results because $P(w_1,\dots,w_n \mid \mu)$ is computed on an unfinished sentence.
תודה Thank you
Technical Information: CMU – N-grams. Storage – Berkeley DB to store knowledge for word prediction (WP): mapping n-grams. More questions on technology –