javaConLib
GSLT: Java Development for HLT
Leif Grönqvist – 11. June :30
11 June 2002, Java Development for HLT: Leif Grönqvist

What have I done?
- I have implemented a library that is useful for various kinds of context-based word sense disambiguation
- From the beginning I have had a test method that tries to provoke errors in each part of the implementation
- A command-line application that uses the library and implements Yarowsky (1995)
- I have tried to write final-quality code from the start

What is left to do?
- One very simple test implementation
- Tutorial-based documentation
- Adjust the things Lars pointed out in the last iteration
- Make an Ant build script
- The final report

Project Background
- There are several methods for word sense disambiguation based on context
- For example, Yarowsky's unsupervised algorithm from 1995 is based on two general observations:
  - One sense per collocation: nearby words provide strong and consistent clues
  - One sense per discourse: the sense of a target word is highly consistent within any document

A much simpler supervised approach
- Start with a disambiguated set of occurrences
- For each sense, count all word types within a ±5 word context
- To disambiguate a new occurrence, compare its context with the distribution of each possible sense
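
The three steps above can be sketched in plain Java. This is only an illustration and does not use javaConLib: the class name `ContextCountSketch`, the sense labels, the toy sentences, and the scoring rule (summing the training counts of the words in the new context) are all assumptions made for the example. The slide does not say how the comparison is actually done.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the supervised approach; not part of javaConLib. */
class ContextCountSketch {
    /** Context size on each side of the target word, as on the slide. */
    static final int WINDOW = 5;

    /** Sense label -> (context word -> count). */
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Words within WINDOW positions of the target, excluding the target itself. */
    static List<String> window(String[] tokens, int center) {
        List<String> context = new ArrayList<>();
        int from = Math.max(0, center - WINDOW);
        int to = Math.min(tokens.length - 1, center + WINDOW);
        for (int i = from; i <= to; i++) {
            if (i != center) {
                context.add(tokens[i]);
            }
        }
        return context;
    }

    /** Add one disambiguated occurrence to the counts of its sense. */
    void train(String sense, String[] tokens, int center) {
        Map<String, Integer> dist = counts.computeIfAbsent(sense, k -> new HashMap<>());
        for (String word : window(tokens, center)) {
            dist.merge(word, 1, Integer::sum);
        }
    }

    /**
     * Pick the sense whose training counts overlap most with the new context.
     * Assumed scoring rule: sum of training counts for each context word.
     * Returns null if nothing has been trained.
     */
    String classify(String[] tokens, int center) {
        List<String> context = window(tokens, center);
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> entry : counts.entrySet()) {
            int score = 0;
            for (String word : context) {
                score += entry.getValue().getOrDefault(word, 0);
            }
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        ContextCountSketch model = new ContextCountSketch();
        // Toy training data; the second argument is the index of the target word.
        model.train("room", "the hotel room had a bed".split(" "), 2);
        model.train("space", "there is no room for more data".split(" "), 3);
        System.out.println(model.classify("a room with a bed".split(" "), 1));
    }
}
```

A real implementation would probably normalise the counts by sense frequency and smooth unseen words. The raw sum is used here only to keep the sketch short.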

javaConLib
- These two algorithms have a lot in common
- There are many more similar algorithms
- javaConLib includes classes that make implementation and tuning much simpler
- Its methods are higher-order and intuitive, so the main class reads more like an algorithm description

Typical parts of a main class

    // Fragment only: s2, testCorpus and word are not defined on this slide
    // and are presumably set up elsewhere in the main class.
    Yarowsky y = new Yarowsky(5);
    Corpus trainCorp = new Corpus("train.txt");
    SenseSet s1 = new SenseSet("äger|ägde", "Abs", y.posl1);
    DecisionList decList = y.train95(s1, s2, "rum", trainCorp);
    ContextList testCont = y.test95(decList, testCorpus, s1, s2, word);
    print(testCont.toString());

The Classes
- Context: An array of words with a specific size and the main word at position 0.
- ContextList: A set of Contexts around a certain word type, extracted from a corpus.
- Corpus: Basically a vector containing the words read from a file.
- Decision: Contains a word, a position, and a score that says how good the word is for deciding the sense of the main word in a context.
- DecisionList: A decision list like the one used in Yarowsky's algorithm from 1995.
- FreqList: A frequency list for strings in a corpus.
- Positions: Holds a list of positions (integers) relative to the center word when working with words and contexts.
- SenseSet: A set of the necessary components for each sense when using the Yarowsky 1995 algorithm for word sense disambiguation.
- Yarowsky: A class with some structures and classes useful when implementing Yarowsky's disambiguation algorithm from 1995, and similar algorithms.
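
To make the Decision and DecisionList descriptions more concrete, here is a minimal, self-contained sketch of how a decision list can be applied. It is not the javaConLib code. The names `DecisionListSketch`, `Rule`, `score`, and `classify` are hypothetical, and so are the sense label stored in each rule, the smoothing constant, and the token array with a center index. The scoring follows the general idea of a smoothed log-likelihood ratio between two senses, as used for decision lists in Yarowsky (1995).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of a decision list; not part of javaConLib. */
class DecisionListSketch {

    /** One decision: if `word` occurs at relative `position`, choose `sense`. */
    static final class Rule {
        final String word;
        final int position;   // relative to the target word, e.g. -1 or +3
        final String sense;   // assumed field; the slide lists only word, position, score
        final double score;

        Rule(String word, int position, String sense, double score) {
            this.word = word;
            this.position = position;
            this.sense = sense;
            this.score = score;
        }
    }

    /** Rules kept sorted by descending score. */
    private final List<Rule> rules = new ArrayList<>();

    /**
     * Absolute smoothed log-likelihood ratio of two sense counts for one clue.
     * alpha must be positive so that zero counts do not cause log(0).
     */
    static double score(int countA, int countB, double alpha) {
        return Math.abs(Math.log((countA + alpha) / (countB + alpha)));
    }

    /** Insert a rule and keep the strongest evidence first. */
    void add(Rule rule) {
        rules.add(rule);
        rules.sort(Comparator.comparingDouble((Rule r) -> r.score).reversed());
    }

    /**
     * Return the sense of the highest-scoring rule that matches the context,
     * or null if no rule matches. `center` is the index of the target word.
     */
    String classify(String[] tokens, int center) {
        for (Rule rule : rules) {
            int index = center + rule.position;
            if (index >= 0 && index < tokens.length && tokens[index].equals(rule.word)) {
                return rule.sense;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        DecisionListSketch list = new DecisionListSketch();
        // Toy counts: "bed" three words to the right was seen 9 times with one sense, once with the other.
        list.add(new Rule("bed", 3, "room", score(9, 1, 0.1)));
        list.add(new Rule("takes", -1, "event", score(1, 20, 0.1)));
        System.out.println(list.classify("the room with a bed".split(" "), 1));
    }
}
```

Only the single best matching rule decides the sense. That is what distinguishes a decision list from approaches that combine all the evidence in the context.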

We are done
- And probably out of time