Leif Grönqvist 21 Jan th International Symposium on Social Communication
Finding Word Clusters in Spoken Dialogue with Narrow Context Based Similarities
Leif Grönqvist, Växjö University, School of Mathematics and Systems Engineering, Sweden
The National Graduate School of Language Technology (GSLT)
Magnus Gunnarsson, Göteborg University, Department of Linguistics, Sweden
Background
NordTalk and SweDanes: Jens Allwood, Elisabeth Ahlsén, Peter Juel Henrichsen, Leif Grönqvist, Magnus Gunnarsson
Comparable Danish and Swedish corpora, 1.3 million tokens each, natural spoken interaction
We are mainly working with spoken language, not written
Siblings as word groups
Traditional parts of speech are not necessarily valid for spoken language
Few serious attempts have been made to build a spoken language grammar (Jens Allwood’s talk tomorrow, 10 am)
What we have is the corpus, and only the corpus: no morphology, no lexica
We take information from the 1+1 word context (one word to the left and one to the right)
Words with similar context distributions are called Siblings (Peter Juel Henrichsen)
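A minimal sketch of how 1+1 word context distributions could be collected from a tokenized corpus. The function name `context_counts`, the toy utterances, and the use of `#` as padding at utterance edges are assumptions made for illustration; the slides do not specify how the counts were gathered.

```python
from collections import Counter, defaultdict

BOUNDARY = "#"  # assumption: "#" stands for the edge of an utterance


def context_counts(utterances):
    """Count the immediate left and right neighbour of every word."""
    left = defaultdict(Counter)   # left[w][v]: v occurs directly before w
    right = defaultdict(Counter)  # right[w][v]: v occurs directly after w
    for utterance in utterances:
        padded = [BOUNDARY] + list(utterance) + [BOUNDARY]
        # Slide a three-word window so each real word is the middle element.
        for prev, word, nxt in zip(padded, padded[1:], padded[2:]):
            left[word][prev] += 1
            right[word][nxt] += 1
    return left, right


# Toy data, not taken from the corpora described in the talk.
utterances = [
    ["a", "lot", "of", "people"],
    ["a", "couple", "of", "days"],
    ["at", "the", "moment"],
]
left, right = context_counts(utterances)
print(dict(left["lot"]), dict(right["lot"]))
```

Keeping the left and right tables separate makes it possible to compare words on one side only, or on both.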
Typical context distributions for: couple, lot and moment
couple: 32; couple # 2; that 3; couple of 25; a couple
lot: lot # 18; lot 's 6; lot of 110; lot more 10; 's lot 4; a 142; whole lot 5; awful lot 11
moment: 76; moment # 33; moment in 6; moment is 3; the moment 57; this moment 3; a 9; particular moment 3
ggsib(lot, couple) = 0.74
ggsib(lot, moment) = 0.15
ggsib(couple, moment) = 0.12
Typical context distributions for: we, they and I
they: # they 21.8; and they 8.6; that they 5.9; if they 5.5; they # 7.0; they've 6.1; they were 6.6; they're 11.6
we: # we 21.9; and we 6.2; that we 8.4; if we 5.1; we # 7.0; we do 5.1; we've 9.5; we have 7.1; we can 5.0; we're 6.3
I: # I 39.3; and I 7.9; I do 6.6; I've 7.1; I'm 9.1; I think 12.3; I mean 10.1
ggsib(we, they) = 0.71
ggsib(we, I) = 0.53
ggsib(they, I) = 0.51
Our use of the Sibling measure
We made it symmetric to avoid ‘sibling chains’
Another change was not to require similar context on both sides
Iterative use:
– Run the similarity check between pairs
– Collapse word pairs with similarity above a threshold
– Run again with a lower threshold, until a lowest threshold is reached
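The iterative procedure above can be sketched roughly as follows. This is not the ggsib formula: the `similarity` function here is a stand-in (the average of the shared relative frequency on the left and on the right), chosen only because it is symmetric and does not require both sides to match. The function names, the greedy one-pair-at-a-time merging, and the toy counts are all assumptions.

```python
from collections import Counter
from itertools import combinations


def overlap(c1, c2):
    """Shared relative frequency of two count distributions, in [0, 1]."""
    n1, n2 = sum(c1.values()), sum(c2.values())
    if n1 == 0 or n2 == 0:
        return 0.0
    return sum(min(c1[k] / n1, c2[k] / n2) for k in c1.keys() & c2.keys())


def similarity(a, b, left, right):
    """Stand-in symmetric score (NOT ggsib): mean of left and right overlap."""
    return 0.5 * (overlap(left[a], left[b]) + overlap(right[a], right[b]))


def cluster(left, right, thresholds):
    """Collapse the most similar pair while it scores at least the current
    threshold, then continue with the next lower threshold. Returns the
    remaining groups; merged groups are nested tuples, i.e. small trees."""
    left = {w: Counter(c) for w, c in left.items()}    # work on copies
    right = {w: Counter(c) for w, c in right.items()}

    def score(pair):
        return similarity(pair[0], pair[1], left, right)

    for threshold in sorted(thresholds, reverse=True):
        while len(left) > 1:
            a, b = max(combinations(left, 2), key=score)
            if score((a, b)) < threshold:
                break
            node = (a, b)
            left[node] = left.pop(a) + left.pop(b)
            right[node] = right.pop(a) + right.pop(b)
            # Treat a and b as one unit wherever they occur as context words.
            for table in (left, right):
                for counts in table.values():
                    moved = counts.pop(a, 0) + counts.pop(b, 0)
                    if moved:
                        counts[node] = moved
    return list(left)


# Toy counts, invented for illustration only.
left = {
    "we": Counter({"#": 2, "and": 1}),
    "they": Counter({"#": 2, "and": 1}),
    "lot": Counter({"a": 3}),
}
right = {
    "we": Counter({"have": 2, "#": 1}),
    "they": Counter({"have": 1, "were": 1, "#": 1}),
    "lot": Counter({"of": 3}),
}
trees = cluster(left, right, thresholds=[0.8, 0.5])
print(trees)
```

Because each merge produces a nested tuple, the order of merges is preserved, which is what turns flat clusters into trees.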
Henrichsen’s and our formulas
Comparison to other clustering algorithms
We take all context words into account, not just a selected set
– We get ‘natural’ similarities, in the sense that they are based only on the corpus
– But it is computationally very expensive: we had to optimize the program heavily, using tries and even arrays instead of hash tables
The iterative approach gives us trees instead of just clusters
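One way arrays can stand in for hash tables in this kind of computation is to give every word an integer id and keep each context distribution as an array sorted by id, so that two distributions can be compared in a single linear pass. The sketch below is only an illustration of that idea; the slides do not describe the actual data layout, and the function names and the overlap score are assumptions.

```python
def intern_words(tokens):
    """Assign each distinct word a dense integer id (built once, up front)."""
    ids = {}
    for tok in tokens:
        ids.setdefault(tok, len(ids))
    return ids


def to_sorted_array(counts, ids):
    """Convert a {word: count} mapping to a list of (id, count) sorted by id."""
    return sorted((ids[w], n) for w, n in counts.items())


def overlap_sorted(xs, ys):
    """Shared relative frequency of two sorted (id, count) arrays.

    Walks both arrays in step, like the merge phase of merge sort,
    so no hash lookups are needed during the comparison.
    """
    nx = sum(n for _, n in xs)
    ny = sum(n for _, n in ys)
    if nx == 0 or ny == 0:
        return 0.0
    i = j = 0
    total = 0.0
    while i < len(xs) and j < len(ys):
        if xs[i][0] == ys[j][0]:
            total += min(xs[i][1] / nx, ys[j][1] / ny)
            i += 1
            j += 1
        elif xs[i][0] < ys[j][0]:
            i += 1
        else:
            j += 1
    return total


# Toy right-context counts, invented for illustration only.
ids = intern_words(["of", "#", "more", "in"])
lot = to_sorted_array({"of": 3, "#": 1}, ids)
couple = to_sorted_array({"of": 4}, ids)
moment = to_sorted_array({"in": 1, "#": 1}, ids)
print(overlap_sorted(lot, couple), overlap_sorted(lot, moment))
```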
Some small examples
Further Research
Evaluation is difficult: there are no ‘correct’ trees, just our language intuition
Homonyms are not handled well
How can we find the interesting sections of the clustering?
When should the iteration stop? Without a stopping criterion, all words end up in one big tree
Sparse data is still a problem, and bigger contexts give other problems
Conclusions
Our method is an interesting way of finding word groups close to our language intuition
It works for all kinds of words (syncategorematic as well as categorematic)
It is to a high degree theory-independent
Low-frequency words and homonyms are difficult to handle