Leif Grönqvist 21 Jan 2003 8th International Symposium on Social Communication 1 Finding Word Clusters in Spoken Dialogue with Narrow Context Based Similarities.

Presentation transcript:

Finding Word Clusters in Spoken Dialogue with Narrow Context Based Similarities
Leif Grönqvist, Växjö University, School of Mathematics and Systems Engineering, Sweden; The National Graduate School of Language Technology (GSLT)
Magnus Gunnarsson, Göteborg University, Department of Linguistics, Sweden

Background
NordTalk and SweDanes: Jens Allwood, Elisabeth Ahlsén, Peter Juel Henrichsen, Leif Grönqvist, Magnus Gunnarsson
Comparable Danish and Swedish corpora, 1.3 million tokens each, natural spoken interaction
We work mainly with spoken language, not written

Siblings as word groups
Traditional parts of speech are not necessarily valid for spoken language
Few serious attempts have been made to build a spoken language grammar (Jens Allwood's talk tomorrow, 10 am)
What we have is the corpus and only the corpus: no morphology, no lexica
We take information from the 1+1 word context (one word to the left and one to the right)
Words with similar context distributions are called Siblings (Peter Juel Henrichsen)

Typical context distributions for: couple, lot and moment
couple: 32; couple # 2; that 3; couple of 25; a couple
lot: lot # 18; lot 's 6; lot of 110; lot more 10; 's lot 4; a 142; whole lot 5; awful lot 11
moment: 76; moment # 33; moment in 6; moment is 3; the moment 57; this moment 3; a 9; particular moment 3
ggsib(lot, couple) = 0.74
ggsib(lot, moment) = 0.15
ggsib(couple, moment) = 0.12
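Context distributions of this kind can be gathered by counting, for every word occurrence, the word immediately to its left and to its right. The following is a minimal sketch in Python, not the authors' program. It assumes that "#" stands for an utterance boundary and that the corpus is already tokenized; the function name context_counts and the toy utterances are invented for illustration.

```python
from collections import Counter, defaultdict

BOUNDARY = "#"  # assumption: "#" marks an utterance boundary


def context_counts(utterances):
    """Count the left and right neighbour of every word occurrence.

    utterances: iterable of token lists (hypothetical input format).
    Returns two dicts mapping word -> Counter of neighbouring words.
    """
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for utt in utterances:
        padded = [BOUNDARY] + list(utt) + [BOUNDARY]
        for i in range(1, len(padded) - 1):
            word = padded[i]
            left[word][padded[i - 1]] += 1
            right[word][padded[i + 1]] += 1
    return left, right


# Toy data, invented for this sketch (not taken from the corpus).
utterances = [
    ["a", "lot", "of", "people"],
    ["a", "couple", "of", "days"],
    ["an", "awful", "lot"],
]
left, right = context_counts(utterances)
print(dict(left["lot"]))
print(dict(right["lot"]))
```

The two Counters per word correspond to the left-hand and right-hand columns of the distributions shown on the slide; a similarity measure such as ggsib would then compare these distributions between two words.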

Typical context distributions for: we, they and I
they: # they 21.8; and they 8.6; that they 5.9; if they 5.5; they # 7.0; they've 6.1; they were 6.6; they're 11.6
we: # we 21.9; and we 6.2; that we 8.4; if we 5.1; we # 7.0; we do 5.1; we've 9.5; we have 7.1; we can 5.0; we're 6.3
I: # I 39.3; and I 7.9; I do 6.6; I've 7.1; I'm 9.1; I think 12.3; I mean 10.1
ggsib(we, they) = 0.71
ggsib(we, I) = 0.53
ggsib(they, I) = 0.51

Our use of the Sibling measure
We made it symmetric to avoid 'sibling chains'
Another change was not to demand similar context on both sides
Iterative use:
  - Run the similarity check between pairs
  - Collapse word pairs with similarity above a threshold
  - Run again with a lower threshold until a lowest threshold is reached
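The iterative procedure can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the ggsib formula is not reproduced in this transcript, so plain cosine similarity is swapped in as a stand-in; the sketch also merges the single best pair at a time within each threshold round, which is one possible reading of "collapse word pairs". The names cosine and cluster, the "L:"/"R:" feature keys and the toy counts are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two Counters (stand-in for ggsib, an assumption)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(vectors, thresholds):
    """Collapse similar items round by round.

    vectors: dict mapping a word to a Counter of context features.
    thresholds: decreasing similarity thresholds, one per round.
    Returns the remaining groups; a merged group is a nested tuple (a tree).
    """
    groups = {w: Counter(v) for w, v in vectors.items()}
    for t in thresholds:
        while True:
            best = None
            for a, b in combinations(list(groups), 2):
                s = cosine(groups[a], groups[b])
                if s >= t and (best is None or s > best[0]):
                    best = (s, a, b)
            if best is None:
                break  # nothing left above this threshold
            _, a, b = best
            # The merged node pools the context counts of its two children.
            groups[(a, b)] = groups.pop(a) + groups.pop(b)
    return list(groups)


# Toy context counts, invented for this sketch.
vectors = {
    "we": Counter({"L:#": 4, "R:have": 2}),
    "they": Counter({"L:#": 4, "R:were": 2}),
    "lot": Counter({"L:a": 5, "R:of": 4}),
    "couple": Counter({"L:a": 3, "R:of": 3}),
}
print(cluster(vectors, [0.9, 0.7]))
```

Because each merge produces a nested tuple, lowering the threshold round by round builds a tree over the words rather than a flat set of clusters.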

Henrichsen's and our formulas

Comparison to other clustering algorithms
We take all context words into account, not just a selected set
  - We get 'natural' similarities in the sense that they are based only on the corpus
  - But computationally it is very complex. We had to optimize the program heavily, using tries and even arrays instead of hash tables
The iterative approach gives us trees instead of just clusters

Some small examples


Further Research
Evaluation is difficult: there are no 'correct' trees, just our language intuition
Homonyms are not handled well
How can we find the interesting sections of the clustering?
When should the iteration stop? Without stopping, all words will form one big tree
Sparse data is still a problem, and bigger contexts give other problems

Conclusions
Our method is an interesting way of finding word groups close to our language intuition
It works for all kinds of words (syncategorematic as well as categorematic)
It is to a high degree theory independent
Low-frequency words and homonyms are difficult to handle