1
2002-05-02, Växjö: Statistical Methods I
Finding Word Groups in Spoken Dialogue with Narrow Context Based Similarities
Leif Grönqvist & Magnus Gunnarsson
Presentation for the GSLT course: Statistical Methods 1
Växjö University, 2002-05-02, 16:00
2
Background
- NordTalk and SweDanes: Jens Allwood, Elisabeth Ahlsén, Peter Juel Henrichsen, Leif & Magnus
- Comparable Danish and Swedish corpora
- 1.3 million tokens each, natural spoken interaction
- We mainly work with spoken language, not written
3
Peter Juel Henrichsen's ideas
- Words with similar context distributions are called Siblings
- Some pairs (seed pairs) of Swedish and Danish words with "the same" meaning are carefully selected: Cousins
- Groups of siblings in each corpus, together with the seed pairs, give new probable cousins
4
Siblings as word groups
- Drop the Cousins for now and focus on Siblings
- Traditional parts of speech are not necessarily valid
- What we have is the corpus. Only the corpus
- We take information from the 1+1 word context (one word on each side)
- Nothing else, such as morphology or lexica
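As an illustration of what "only the corpus, 1+1 word context" can mean in practice, here is a minimal Python sketch that counts the immediate left and right neighbour of every word. The function name `collect_contexts` and the toy sentence are assumptions made for this example; the slides do not show how the counts were actually collected.

```python
from collections import Counter, defaultdict

def collect_contexts(tokens):
    """Count the immediate left and right neighbours of every word type.

    Hypothetical helper: only the token sequence is used,
    no morphology and no lexicon.
    """
    left = defaultdict(Counter)   # left[w][c]  = how often c directly precedes w
    right = defaultdict(Counter)  # right[w][c] = how often c directly follows w
    for i, word in enumerate(tokens):
        if i > 0:
            left[word][tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            right[word][tokens[i + 1]] += 1
    return left, right

# Toy corpus (invented for illustration)
tokens = "i want a red car and you want a blue car".split()
left, right = collect_contexts(tokens)
```

In this toy corpus "red" and "blue" end up with identical left and right neighbours, which is the kind of evidence a sibling measure would build on.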
5
The original Sibling formula
[Formula not preserved in this transcript]
6
Improvements of the Sibling measure
- Symmetry: sib(x_1, x_2) = sib(x_2, x_1)
- Similarity should be possible even if the context on one of the sides is different
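The two requirements above can be illustrated with a small stand-in measure. This is explicitly not the authors' improved measure, whose formula is not given in this transcript; it is an assumed toy function, `standin_sim`, that takes the larger of the left-context and right-context cosine similarities. Cosine is symmetric, and taking the maximum lets two words score high even when one side of their context differs.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two count vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def standin_sim(left1, right1, left2, right2):
    """Assumed stand-in, NOT the measure from the slides:
    best of the left-side and right-side similarities."""
    return max(cosine(left1, left2), cosine(right1, right2))

# Invented toy contexts: different left neighbours, same right neighbour
red_left, red_right = Counter({"a": 1}), Counter({"car": 1})
blue_left, blue_right = Counter({"the": 1}), Counter({"car": 1})

forward = standin_sim(red_left, red_right, blue_left, blue_right)
backward = standin_sim(blue_left, blue_right, red_left, red_right)
```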
7
Trees instead of groups
- Iterative use of the ggsib similarity measure
1. Calculate ggsib between all word pairs above a frequency threshold
2. Pairs with similarity above a rather high score threshold S_th are collected in a list L
3. For each pair in L: replace the less frequent of the words with the other, in the corpus
8
Trees instead of groups (cont.)
4. If L is empty: decrement S_th slightly
5. Run from step 1 again if S_th is above a lowest score threshold
- The result may be interpreted as trees
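Steps 1 to 5 can be sketched in Python as follows. Several things here are assumptions rather than the authors' implementation: the similarity is a stand-in (Jaccard overlap of tagged neighbour sets, not ggsib), the names `build_trees`, `context_sets` and `jaccard` are hypothetical, the thresholds are arbitrary, and a word touched by one merge is skipped for the rest of that pass so that replacements never chain within a pass.

```python
from collections import Counter
from itertools import combinations

def context_sets(tokens):
    """Map each word to the set of its tagged immediate neighbours."""
    ctx = {}
    for i, w in enumerate(tokens):
        s = ctx.setdefault(w, set())
        if i > 0:
            s.add(("L", tokens[i - 1]))
        if i + 1 < len(tokens):
            s.add(("R", tokens[i + 1]))
    return ctx

def jaccard(ctx, x1, x2):
    """Stand-in similarity (NOT ggsib): overlap of neighbour sets."""
    union = ctx[x1] | ctx[x2]
    return len(ctx[x1] & ctx[x2]) / len(union) if union else 0.0

def build_trees(tokens, min_freq=2, s_start=0.9, s_min=0.5, s_step=0.1):
    """Return merge edges (kept word, replaced word, score)."""
    tokens = list(tokens)
    merges = []
    s_th = s_start
    while s_th >= s_min:                          # step 5
        freq = Counter(tokens)
        ctx = context_sets(tokens)
        words = sorted(w for w, n in freq.items() if n >= min_freq)
        # Steps 1-2: score all frequent pairs, keep those above S_th
        scored = [(jaccard(ctx, a, b), a, b) for a, b in combinations(words, 2)]
        L = sorted((p for p in scored if p[0] >= s_th), reverse=True)
        replaced, touched = {}, set()
        for score, a, b in L:                     # step 3
            if a in touched or b in touched:      # assumption: no chained merges
                continue
            keep, drop = (a, b) if freq[a] >= freq[b] else (b, a)
            replaced[drop] = keep
            touched.update((a, b))
            merges.append((keep, drop, score))
        tokens = [replaced.get(w, w) for w in tokens]
        if not L:                                 # step 4
            s_th -= s_step
    return merges

# Invented toy corpus
corpus = "a red car . a blue car . a red house . a blue house . a big dog".split()
edges = build_trees(corpus)
```

Each edge (kept word, replaced word) can be read as a parent-child link, which is one way to arrive at the tree interpretation; on real data the kept word would later be merged into other words at lower thresholds, giving deeper trees.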
9
An example tree
[Figure not preserved in this transcript]
10
Implementation
- Easy to implement: Peter made a Perl script
- But one step in the iteration with ~5000 word types took 100 hours
- Our heavily optimized C program ran one step in less than 60 minutes, and 100 iterations in less than 100 hours
11
Most important optimizations
- Starting point: we have enough memory but not enough time
- A compiled low-level language instead of an interpreted high-level one
- Frequencies for words and word pairs are stored in letter trees instead of hash tables
- Try to move computation and counting outward in the loop hierarchy
12
Optimizations (letter trees)
- Retrieving information from a letter tree takes constant time with respect to the size of the lexicon. (Hash tables are also constant time on average, but pay for hashing and collision handling; log(n) is the cost for balanced search trees.)
- Lookup time is linear in the length of the word, but the average word length stays roughly constant as the lexicon grows
- Another drawback: our example needs 1 GB to run, since each node in the tree is an array over all possible characters, but memory is not our bottleneck
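A letter tree of the kind described can be sketched as below. The original was written in C and is not shown in the slides, so this Python version is only an illustration under assumptions: the names `LetterTree` and `Node` are hypothetical, words are indexed by their UTF-8 bytes, and each node holds a 256-slot child array to mirror "an array of all possible characters".

```python
ALPHABET_SIZE = 256  # assumption: one child slot per possible byte value

class Node:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = [None] * ALPHABET_SIZE  # direct indexing, no hashing
        self.count = 0                          # frequency of the word ending here

class LetterTree:
    def __init__(self):
        self.root = Node()

    def _walk(self, word, create):
        """Follow one child link per byte; cost depends only on word length."""
        node = self.root
        for byte in word.encode("utf-8"):
            nxt = node.children[byte]
            if nxt is None:
                if not create:
                    return None
                nxt = node.children[byte] = Node()
            node = nxt
        return node

    def add(self, word, n=1):
        self._walk(word, True).count += n

    def freq(self, word):
        node = self._walk(word, False)
        return node.count if node else 0

tree = LetterTree()
for w in "jag du jag han du jag växjö".split():
    tree.add(w)
```

The full child array in every node is what makes lookups a plain index operation and also what makes memory use large. Word pairs could be stored in the same structure, for example by joining the two words with a separator, though the slides do not say how the authors encoded pairs.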
13
Optimizations (more)
- An example of moving computation to an outer loop is to calculate the set of all context words once, and use it for comparisons with all other words
- The set may be stored as an array of pointers to nodes (between words in word pairs) in the letter tree
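The loop-hoisting idea can be shown with a small Python sketch. Everything in it is assumed for illustration: the pair frequencies live in a plain `Counter` rather than a letter tree, only left contexts are considered, and the quantity computed (shared left-context count) is a placeholder rather than the authors' score. The point is only that the context list for x1 is built once in the outer loop and reused for every x2.

```python
from collections import Counter

def left_context(pair_freq, word):
    """All (context word, count) entries seen immediately before `word`."""
    return [(c, n) for (c, w), n in pair_freq.items() if w == word]

def shared_left_mass(pair_freq, words):
    """For every ordered pair (x1, x2), sum the left-context counts they share."""
    out = {}
    for x1 in words:
        ctx1 = left_context(pair_freq, x1)   # hoisted: built once per x1
        for x2 in words:
            if x2 == x1:
                continue
            # The inner loop only does cheap lookups against the prepared list
            out[(x1, x2)] = sum(min(n, pair_freq.get((c, x2), 0)) for c, n in ctx1)
    return out

# Invented toy corpus
tokens = "a red car a blue car the red dog".split()
pair_freq = Counter(zip(tokens, tokens[1:]))
result = shared_left_mass(pair_freq, ["red", "blue"])
```

Building `ctx1` inside the inner loop would repeat the same scan once per pair instead of once per word.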
14
Personal pronouns
[Figure not preserved in this transcript]
16
Colours
[Figure not preserved in this transcript]
17
Problems
- Sparse data
- Homonyms
- When to stop
- Memory and time complexity
18
Conclusions
- Our method is an interesting way of finding word groups
- It works for all kinds of words (syncategorematic as well as categorematic)
- Low-frequency words and homonyms are difficult to handle