Slide 1: Finding Word Groups in Spoken Dialogue with Narrow Context Based Similarities
Leif Grönqvist & Magnus Gunnarsson
Presentation for the GSLT course: Statistical Methods 1
Växjö University, 2002-05-02, 16:00

Slide 2: Background
- NordTalk and SweDanes: Jens Allwood, Elisabeth Ahlsén, Peter Juel Henrichsen, Leif & Magnus
- Comparable Danish and Swedish corpora
- 1.3 million tokens each, natural spoken interaction
- We are mainly working with spoken language, not written

Slide 3: Peter Juel Henrichsen's ideas
- Words with similar context distributions are called Siblings
- Some pairs (seed pairs) of Swedish and Danish words with "the same" meaning are carefully selected: Cousins
- Groups of siblings in each corpus, together with the seed pairs, give new probable cousins

Slide 4: Siblings as word groups
- Drop the Cousins for now and focus on Siblings
- Traditional parts of speech are not necessarily valid
- What we have is the corpus. Only the corpus
- We take information from the 1+1 word context (one word to the left, one word to the right)
- Nothing else, such as morphology or lexica
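
A minimal sketch of what collecting 1+1 word contexts from a token stream could look like. The function name `context_counts` and the toy sentence are made up for illustration and are not taken from the corpora or from the authors' program.

```python
from collections import Counter, defaultdict

def context_counts(tokens):
    """Count the immediate left and right neighbour of every word."""
    left = defaultdict(Counter)   # left[w][v]: how often v directly precedes w
    right = defaultdict(Counter)  # right[w][v]: how often v directly follows w
    for prev, word in zip(tokens, tokens[1:]):
        left[word][prev] += 1
        right[prev][word] += 1
    return left, right

# Toy input (invented), standing in for a transcribed utterance.
tokens = "jag tror att du vet att jag vet".split()
left, right = context_counts(tokens)
```

Here `left["att"]` holds the words seen directly before "att" and `right["jag"]` the words seen directly after "jag". Nothing beyond the token sequence is used, which matches the "only the corpus" constraint on the slide.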

Slide 5: The original Sibling formula

Slide 6: Improvements of the Sibling measure
- Symmetry: sib(x1, x2) = sib(x2, x1)
- Similarity should be possible even if the context on one of the sides is different
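
The improved measure itself is not preserved in this transcript, so the following is only a stand-in that has the two properties listed on the slide. It assumes cosine similarity on each side and takes the maximum of the two sides. The names `cosine` and `toy_sim` and the hand-made counts are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two neighbour-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def toy_sim(x1, x2, left, right):
    """Symmetric stand-in score: the better of the left-side and right-side match."""
    return max(cosine(left[x1], left[x2]), cosine(right[x1], right[x2]))

# Invented counts: the two words share their left context but not their right one.
left = {"gul": Counter({"en": 2}), "vit": Counter({"en": 1})}
right = {"gul": Counter({"bil": 1}), "vit": Counter({"sol": 1})}
score = toy_sim("gul", "vit", left, right)
```

Because cosine is symmetric and `max` does not depend on argument order, swapping the two words gives the same score. Taking the maximum lets a pair score highly when only one side of the context agrees.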

Slide 7: Trees instead of groups
- Iterative use of the ggsib similarity measure
1. Calculate ggsib between all word pairs above a frequency threshold
2. Collect the pairs with similarity above a rather high score threshold S_th in a list L
3. For each pair in L: replace the less frequent of the two words with the other, throughout the corpus

Slide 8: Trees instead of groups (cont.)
4. If L is empty: decrement S_th slightly
5. Run from step 1 again if S_th is above a lowest score threshold
- The result may be interpreted as trees
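
Steps 1 to 5 can be sketched roughly as follows. This is not the authors' program: the ggsib formula is not in the transcript, so a Jaccard overlap of left/right neighbour sets is swapped in. The threshold values, the tie-breaking, the handling of overlapping pairs within one round, and all function names are assumptions.

```python
from collections import Counter
from itertools import combinations

def neighbour_sets(tokens):
    """Map each word to its set of ('L', w) and ('R', w) neighbours."""
    ctx = {}
    for prev, word in zip(tokens, tokens[1:]):
        ctx.setdefault(word, set()).add(("L", prev))
        ctx.setdefault(prev, set()).add(("R", word))
    return ctx

def jaccard(a, b):
    """Stand-in similarity (NOT ggsib): overlap of two neighbour sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def resolve(replace, word):
    """Follow replacement links made earlier in the same round."""
    while word in replace:
        word = replace[word]
    return word

def build_trees(tokens, min_freq=1, s_start=0.9, s_min=0.3, s_step=0.1):
    tokens = list(tokens)
    parent = {}                      # merged word -> word that replaced it
    s_th = s_start
    while s_th > s_min:              # step 5: stop at the lowest threshold
        freq = Counter(tokens)
        ctx = neighbour_sets(tokens)
        words = sorted(w for w, n in freq.items() if n >= min_freq)
        # Steps 1-2: score all frequent pairs, keep those above S_th.
        pairs = [(a, b) for a, b in combinations(words, 2)
                 if jaccard(ctx.get(a, set()), ctx.get(b, set())) >= s_th]
        if not pairs:
            s_th -= s_step           # step 4: nothing found, lower S_th
            continue
        replace = {}
        for a, b in pairs:           # step 3: less frequent word is replaced
            a, b = resolve(replace, a), resolve(replace, b)
            if a == b:
                continue
            keep, drop = (a, b) if freq[a] >= freq[b] else (b, a)
            replace[drop] = keep
            parent[drop] = keep
            freq[keep] += freq[drop]
        tokens = [resolve(replace, t) for t in tokens]
    return parent

# Toy corpus (invented): three colour words in identical contexts.
parent = build_trees("en gul bil en vit bil en svart bil".split())
```

The returned `parent` mapping records which word absorbed which, and reading it as child-to-parent links gives the tree interpretation mentioned on the slide. In the toy corpus the three colour words end up in one group.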

Slide 9: An example tree

Slide 10: Implementation
- Easy to implement: Peter made a Perl script
- But one step in the iteration with ~5000 word types took 100 hours
- Our heavily optimized C program ran a step in less than 60 minutes, and 100 iterations in less than 100 hours

Slide 11: Most important optimizations
- Starting point: we have enough memory but not enough time
- A compiled low-level language instead of an interpreted high-level one
- Frequencies for words and word pairs are stored in letter trees (tries) instead of hash tables
- Try to move computation and counting outwards in the loop hierarchy

Slide 12: Optimizations (letter trees)
- Retrieving information from a letter tree takes constant time with respect to the size of the lexicon (compared to log(n) for tree-based maps; hash tables are also constant on average, but pay for hashing and collision handling)
- Lookup time is linear in the length of the word, but the average word length stays roughly constant as the lexicon grows
- Another drawback: our example needs 1 GB to run (each node in the tree is an array over all possible characters), but who cares
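
A small sketch of a letter tree in which every node is an array with one slot per possible byte value, as the slide describes. The authors' version was written in C. The class names and the choice of UTF-8 bytes as the alphabet are assumptions made for this illustration.

```python
class Node:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = [None] * 256   # one slot per possible byte value
        self.count = 0                 # frequency of the key ending here

class LetterTree:
    def __init__(self):
        self.root = Node()

    def _walk(self, key, create):
        """Follow key byte by byte; the cost depends only on the key length."""
        node = self.root
        for byte in key.encode("utf-8"):
            nxt = node.children[byte]
            if nxt is None:
                if not create:
                    return None
                nxt = node.children[byte] = Node()
            node = nxt
        return node

    def add(self, key, n=1):
        self._walk(key, True).count += n

    def get(self, key):
        node = self._walk(key, False)
        return node.count if node else 0

tree = LetterTree()
for key in ["att", "att", "att du"]:   # a word pair stored as "w1 w2"
    tree.add(key)
```

Each lookup does one array index per byte of the key, regardless of how many keys are stored. The price is the 256-slot array in every node, which is the memory cost the slide refers to.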

Slide 13: Optimizations (more)
- An example of moving computation to an outer loop: calculate the set of all context words for a word once, and reuse it in the comparisons with all other words
- The set may be stored as an array of pointers to nodes in the letter tree (the nodes at the boundary between the two words of a word pair)
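
The hoisting idea can be illustrated with a plain dictionary of word-pair counts in place of letter-tree pointers. The scoring (summed minimum of shared right-context counts), the function name `overlap_with_all`, and the toy data are assumptions, not the authors' code.

```python
from collections import Counter

def overlap_with_all(x1, candidates, pair_freq):
    """Score x1 against every candidate by shared right-context counts."""
    # Hoisted out of the inner loop: the right-context words of x1
    # are collected once ...
    ctx_x1 = [(c, n) for (w, c), n in pair_freq.items() if w == x1]
    scores = {}
    for x2 in candidates:
        # ... and reused for every comparison partner x2.
        scores[x2] = sum(min(n, pair_freq[(x2, c)]) for c, n in ctx_x1)
    return scores

# Toy corpus (invented); pair_freq counts adjacent word pairs.
tokens = "en gul bil en vit bil en gul sol".split()
pair_freq = Counter(zip(tokens, tokens[1:]))
scores = overlap_with_all("gul", ["vit", "en"], pair_freq)
```

Without the hoisting, the scan that builds `ctx_x1` would be repeated once per candidate. With it, the inner loop does only direct lookups.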

Slide 14: Personal pronouns

Slide 16: Colours

Slide 17: Problems
- Sparse data
- Homonyms
- When to stop
- Memory and time complexity

Slide 18: Conclusions
- Our method is an interesting way of finding word groups
- It works for all kinds of words (syncategorematic as well as categorematic)
- Low-frequency words and homonyms are difficult to handle
