2002-05-02. Växjö: Statistical Methods I. Finding Word Groups in Spoken Dialogue with Narrow Context Based Similarities. Leif Grönqvist.

Presentation transcript:

Finding Word Groups in Spoken Dialogue with Narrow Context Based Similarities
Leif Grönqvist & Magnus Gunnarsson
Presentation for the GSLT course: Statistical Methods 1
Växjö University, 2002-05-02, 16:00

Background
- NordTalk and SweDanes: Jens Allwood, Elisabeth Ahlsén, Peter Juel Henrichsen, Leif & Magnus
- Comparable Danish and Swedish corpora
- 1.3 million tokens each, natural spoken interaction
- We are mainly working with spoken language, not written

Peter Juel Henrichsen's ideas
- Words with similar context distributions are called Siblings
- Some pairs (seed pairs) of Swedish and Danish words with "the same" meaning are carefully selected: Cousins
- Groups of siblings in each corpus, together with the seed pairs, give new probable cousins

Siblings as word groups
- Drop the Cousins for now and focus on Siblings
- Traditional parts of speech are not necessarily valid
- What we have is the corpus. Only the corpus
- We take information from the 1+1 word context (one word to the left and one word to the right)
- Nothing else, such as morphology or lexica
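As a rough illustration of what "information from the 1+1 word context" could look like in practice, the sketch below counts, for every word, which words occur directly before and directly after it. The function name `context_counts` and the toy sentence are assumptions made for this example; the slides do not show how the counts were actually collected.

```python
from collections import Counter, defaultdict

def context_counts(tokens):
    """Count, for every word, the words seen immediately to its left and right."""
    left = defaultdict(Counter)   # left[w][v]: v occurred directly before w
    right = defaultdict(Counter)  # right[w][v]: v occurred directly after w
    for i, word in enumerate(tokens):
        if i > 0:
            left[word][tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            right[word][tokens[i + 1]] += 1
    return left, right

# Toy corpus (hypothetical): "red" and "blue" share both neighbours.
tokens = "i see a red car and i see a blue car".split()
left, right = context_counts(tokens)
same_left = left["red"] == left["blue"]
same_right = right["red"] == right["blue"]
```

In this toy corpus "red" and "blue" end up with identical left and right counts, which is the kind of evidence a context-based similarity measure can build on without any morphology or lexicon.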

The original Sibling formula [formula not preserved in the transcript]

Improvements of the Sibling measure
- Symmetry: sib(x1, x2) = sib(x2, x1)
- Similarity should be possible even if the context on one of the sides is different
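The transcript does not reproduce the formulas, so the following is only a stand-in that exhibits the two properties listed on this slide. It is not the authors' measure: `toy_sim` is a hypothetical name, and cosine similarity over left and right neighbour counts is an assumption chosen purely for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def toy_sim(left1, right1, left2, right2):
    """Average of left-side and right-side similarity (illustrative stand-in).

    Symmetric in its two words, and still nonzero when only one side matches.
    """
    return 0.5 * (cosine(left1, left2) + cosine(right1, right2))

# Hypothetical counts: same left context, different right context.
x1_left, x1_right = Counter({"a": 2}), Counter({"car": 2})
x2_left, x2_right = Counter({"a": 1}), Counter({"house": 1})

forward = toy_sim(x1_left, x1_right, x2_left, x2_right)
backward = toy_sim(x2_left, x2_right, x1_left, x1_right)
```

Averaging the two sides, rather than multiplying them, is what lets a pair score above zero when only the left or only the right context agrees.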

Trees instead of groups
- Iterative use of the ggsib similarity measure
1. Calculate ggsib between all word pairs above a frequency threshold
2. Pairs with similarity above a rather high score threshold S_th are collected in a list L
3. For each pair in L: replace the less frequent of the two words with the other, throughout the corpus

Trees instead of groups (cont.)
4. If L is empty: decrement S_th slightly
5. Run from step 1 again if S_th is above a lowest score threshold
- The result may be interpreted as trees
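Steps 1 to 5 can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' program: `toy_similarity` (Jaccard overlap of neighbour pairs) merely stands in for ggsib, whose formula is not in the transcript, and the rule that each word takes part in at most one merge per round is a simplification added here. The `parent` links record which word replaced which, and that is what can be read as trees.

```python
from collections import Counter

def toy_similarity(tokens, a, b):
    """Stand-in score (not ggsib): Jaccard overlap of (left, right) neighbour pairs."""
    def contexts(w):
        return {(tokens[i - 1], tokens[i + 1])
                for i in range(1, len(tokens) - 1) if tokens[i] == w}
    ca, cb = contexts(a), contexts(b)
    return len(ca & cb) / len(ca | cb) if ca | cb else 0.0

def merge_iteratively(tokens, similarity, s_start, s_min, s_step, min_freq=2):
    """Repeatedly merge similar words; returns (final tokens, parent links)."""
    tokens = list(tokens)
    parent = {}                  # replaced word -> word that replaced it
    s_th = s_start
    while s_th > s_min:          # step 5: continue while above the lowest threshold
        freq = Counter(tokens)
        words = sorted(w for w, n in freq.items() if n >= min_freq)
        # Steps 1-2: score all frequent pairs, collect those above S_th in L.
        L = [(a, b) for i, a in enumerate(words) for b in words[i + 1:]
             if similarity(tokens, a, b) > s_th]
        if not L:
            s_th -= s_step       # step 4: lower the threshold slightly
            continue
        # Step 3: replace the less frequent word of each pair with the other.
        replace, touched = {}, set()
        for a, b in L:
            if a in touched or b in touched:
                continue         # simplification: one merge per word per round
            keep, drop = (a, b) if freq[a] >= freq[b] else (b, a)
            replace[drop] = keep
            touched.update((a, b))
        parent.update(replace)
        tokens = [replace.get(t, t) for t in tokens]
    return tokens, parent

corpus = "a red car . a blue car . a red bus . a blue bus . a red car".split()
merged, parent = merge_iteratively(corpus, toy_similarity, 0.9, 0.5, 0.1)
```

On this toy corpus "blue" is folded into "red" and "bus" into "car" in the first round. After that no pair scores above the threshold, so the threshold is lowered step by step until the loop ends.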

An example tree [figure not preserved in the transcript]

Implementation
- Easy to implement: Peter made a Perl script
- But one step in the iteration with ~5000 word types took 100 hours
- Our heavily optimized C program ran the same step in less than 60 minutes, and 100 iterations in less than 100 hours

Most important optimizations
Starting point: we have enough memory but not enough time
- A compiled low-level language instead of an interpreted high-level one
- Frequencies for words and word pairs are stored in letter trees instead of hash tables
- Try to move computation and counting outward in the loop hierarchy

Optimizations (letter trees)
- Retrieving information from a letter tree (trie) takes constant time with respect to the size of the lexicon. The original slide contrasts this with log(n) for hash tables; strictly, log(n) is the cost of a balanced search tree, while a hash table is constant on average but must hash the whole key and resolve collisions
- Retrieval is linear in the length of the word, but the average word length stays roughly constant as the lexicon grows
- Another drawback: our example needs 1 GB to run (each node in the tree is an array of all possible characters), but who cares
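To make the time and memory trade-off concrete, here is a small sketch of a letter tree in which every node holds one child slot per possible byte value. The names `Node`, `add` and `lookup`, the Latin-1 encoding (one byte per character, which covers Swedish and Danish letters), and storing a word pair as two words joined by a space are all assumptions for this example; the authors' C data structure is not shown in the slides.

```python
class Node:
    """One letter-tree node: a fixed array with one child slot per byte value."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = [None] * 256  # array of all possible characters
        self.count = 0                # frequency of the key ending at this node

def add(root, key, n=1):
    """Increase the frequency stored for `key`; returns the key's node."""
    node = root
    for byte in key.encode("latin-1"):   # assumption: one byte per character
        if node.children[byte] is None:
            node.children[byte] = Node()
        node = node.children[byte]
    node.count += n
    return node

def lookup(root, key):
    """Return the stored frequency, or 0 if the key is absent."""
    node = root
    for byte in key.encode("latin-1"):
        node = node.children[byte]
        if node is None:
            return 0
    return node.count

root = Node()
for key in ["jag", "jag", "är", "jag är"]:   # words and one word pair
    add(root, key)
```

`lookup` performs one array index per character, so its cost depends on the key length and not on how many keys are stored. The price is memory: every node reserves 256 slots even when only a few are used.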

Optimizations (more)
- An example of moving computation to an outer loop: calculate the set of all context words for a word once, and reuse it in the comparisons with all other words
- The set may be stored as an array of pointers to nodes (between words in word pairs) in the letter tree
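The effect of moving work to an outer loop can be shown with two versions of the same pairwise comparison. This is only a sketch: the function names are hypothetical, and plain Python sets replace the array of letter-tree node pointers mentioned on the slide.

```python
def context_set(tokens, w):
    """Set of (left, right) neighbour pairs observed around word w."""
    return {(tokens[i - 1], tokens[i + 1])
            for i in range(1, len(tokens) - 1) if tokens[i] == w}

def shared_contexts_naive(tokens, words):
    """Recomputes both context sets inside the innermost loop."""
    return {(x, y): len(context_set(tokens, x) & context_set(tokens, y))
            for x in words for y in words if x < y}

def shared_contexts_hoisted(tokens, words):
    """Computes every context set once, outside the pair loops."""
    ctx = {w: context_set(tokens, w) for w in words}
    result = {}
    for x in words:
        cx = ctx[x]                      # fetched once per outer iteration
        for y in words:
            if x < y:
                result[x, y] = len(cx & ctx[y])
    return result

tokens = "a red car . a blue car . a red bus . a blue bus .".split()
words = sorted(set(tokens))
naive = shared_contexts_naive(tokens, words)
hoisted = shared_contexts_hoisted(tokens, words)
```

Both functions return the same table, but the naive version scans the whole corpus twice for every pair of words, while the hoisted version scans it once per word.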

Personal pronouns [figure not preserved in the transcript]

Colours [figure not preserved in the transcript]

Problems
- Sparse data
- Homonyms
- When to stop
- Memory and time complexity

Conclusions
- Our method is an interesting way of finding word groups
- It works for all kinds of words (syncategorematic as well as categorematic)
- Low-frequency words and homonyms are difficult to handle
