Free construction of a free dictionary of synonyms using computer science Viggo Kann and Magnus Rosell KTH, Stockholm Talk given by Viggo at Amherst College.

Presentation transcript:

Free construction of a free dictionary of synonyms using computer science Viggo Kann and Magnus Rosell KTH, Stockholm Talk given by Viggo at Amherst College November 11, 2006

Examples of English synonyms. Smith: A Dictionary of Synonymous Words in the English Language [1889]: CLASS. Order. Rank. Degree. Classification. Grade. Webster’s Dictionary of Synonyms [1942]: classify. Alphabetize, pigeonhole, assort, sort. Ana. Order, arrange, systematize, methodize, marshal.

Goals: To construct a Swedish dictionary of synonyms as a list of synonymous pairs. I don’t want to work a lot. I don’t want to pay anyone to work. The resulting list should be free.

Ideas: Automatically construct a large set of word pairs that might be synonyms. Use tens of thousands of people, each willing to make a small contribution without payment, to check the word pairs.

More ideas: Use the Lexin on-line Swedish-English dictionary web site, which had 9 million (now 17 million) lookups each month. Users visit Lexin to translate words, and are thus probably motivated to help me. Each time a user makes a lookup, give her the opportunity to decide whether two words are synonyms or not.

My plan 1. Construct lots of possible synonyms 2. Sort out bad synonym pairs automatically 3. Ask lots of users if the rest of the pairs are good synonyms 4. Analyze the gradings done by the users and decide which pairs to keep

Step 1: Construct lots of possible synonyms. If we have access to a Swedish-English dictionary SE and an English-Swedish dictionary ES, try to translate each word to English and back again to Swedish: $\{(w,v) : \exists y : y \in SE(w) \wedge v \in ES(y)\}$ or $\{(w,v) : \exists y : y \in SE(w) \wedge y \in SE(v)\}$. word pairs were generated
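The two set definitions in Step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionaries are modelled as plain mappings from a word to a set of translations, the function name `candidate_pairs` and the toy entries are hypothetical, and trivial pairs with w = v are dropped even though the formulas do not exclude them.

```python
from collections import defaultdict

def candidate_pairs(se, es=None):
    """Return unordered candidate synonym pairs (w, v) with w != v.

    se: Swedish word -> set of English translations (the SE dictionary).
    es: English word -> set of Swedish translations (the ES dictionary).
    With es given, a word is translated to English and back again;
    without it, two Swedish words are paired if they share a translation.
    """
    pairs = set()
    if es is not None:
        # Variant 1: y in SE(w) and v in ES(y)
        for w, translations in se.items():
            for y in translations:
                for v in es.get(y, ()):
                    if v != w:
                        pairs.add(tuple(sorted((w, v))))
    else:
        # Variant 2: y in SE(w) and y in SE(v)
        by_english = defaultdict(set)
        for w, translations in se.items():
            for y in translations:
                by_english[y].add(w)
        for words in by_english.values():
            for w in words:
                for v in words:
                    if w < v:
                        pairs.add((w, v))
    return pairs

# Toy dictionaries, invented for illustration only.
SE = {"klass": {"class", "grade"}, "rang": {"rank", "grade"}, "sort": {"kind", "sort"}}
ES = {"class": {"klass"}, "grade": {"klass", "rang", "årskurs"}, "rank": {"rang"}}

round_trip = candidate_pairs(SE, ES)
shared_translation = candidate_pairs(SE)
```

On the toy data the round-trip variant also reaches "årskurs", which is listed only in ES, while the shared-translation variant can only pair words that are headwords in SE.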

Step 2: Sort out bad synonym pairs automatically. Use RI (Random Indexing) [Kanerva, Kristoferson, Holst 2000] to measure the distance between words represented in a large vector space. Keep pairs that have a small enough distance in the vector space.

Random Indexing: Each word w is assigned a random label vector $L_w$ of thousand elements. For each word w, construct a context vector $C_w$ by adding the random label vectors of the words appearing in the context of each occurrence of w in a large corpus.

Random Indexing settings: Context: 4 words to the left and 4 to the right. Stop words were removed. Dimensionality: corpora from different domains were used, for example newspapers and medical texts.
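A minimal sketch of Random Indexing as described on these two slides, using only the Python standard library. Several details are assumptions rather than the talk's actual settings: the label vectors are sparse with a few random +1/-1 entries (a common choice, not stated in the slides), the dimensionality of 1000 and the value `nonzero=8` are placeholders, and the corpus and stop-word list are toy data.

```python
import math
import random
from collections import defaultdict

def random_label(dim, nonzero, rng):
    """Sparse random label vector: a few +1/-1 entries, the rest 0 (assumed form)."""
    vec = [0] * dim
    for i in rng.sample(range(dim), nonzero):
        vec[i] = rng.choice((-1, 1))
    return vec

def context_vectors(sentences, stop_words=(), dim=1000, nonzero=8, window=4, seed=0):
    """Build a context vector per word by summing the label vectors of its neighbours."""
    rng = random.Random(seed)
    labels = {}
    contexts = defaultdict(lambda: [0] * dim)
    for sentence in sentences:
        # Stop words are removed before the context window is applied.
        tokens = [t for t in sentence if t not in stop_words]
        for t in tokens:
            if t not in labels:
                labels[t] = random_label(dim, nonzero, rng)
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    cw = contexts[w]
                    for k, x in enumerate(labels[tokens[j]]):
                        cw[k] += x
    return dict(contexts)

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy corpus, invented for illustration only.
CORPUS = [
    ["the", "big", "dog", "barks"],
    ["the", "large", "dog", "barks"],
    ["a", "fish", "swims", "deep"],
]
STOP = {"the", "a"}

cv = context_vectors(CORPUS, stop_words=STOP)
sim_big_large = cosine(cv["big"], cv["large"])
sim_big_fish = cosine(cv["big"], cv["fish"])
```

Here "big" and "large" occur with exactly the same neighbours, so their context vectors coincide and the cosine is essentially 1, while "big" and "fish" share no neighbours and get a much lower value.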

Number of pairs for different cos thresholds ( of pairs occurred in corpus)
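The trade-off behind this slide can be sketched as a simple count of surviving pairs per cosine threshold. The similarity values below are invented for illustration and the function names are hypothetical; a small distance in the vector space is read here as a high cosine value.

```python
def pairs_above(similarities, threshold):
    """Pairs whose cosine similarity is at least the threshold."""
    return {pair for pair, cos in similarities.items() if cos >= threshold}

def counts_by_threshold(similarities, thresholds):
    """Number of pairs kept at each candidate threshold."""
    return {t: len(pairs_above(similarities, t)) for t in thresholds}

# Hypothetical cosine values for a few candidate pairs.
SIMILARITIES = {
    ("klass", "rang"): 0.62,
    ("klass", "kategori"): 0.41,
    ("klass", "typ"): 0.18,
    ("klass", "uppdrag"): 0.03,
}

counts = counts_by_threshold(SIMILARITIES, [0.0, 0.1, 0.3, 0.5])
```

Raising the threshold removes more bad pairs but also shrinks the list shown to users, so the counts help in choosing a cut-off.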

Step 3: Ask lots of users if the rest of the pairs are good synonyms. When a user has sent a word to the Lexin dictionary, the user receives the translation followed by a question like: Are 'spread' and 'lengthen' synonyms? Answer using a scale from 0 to 5, where 0 means 'I don’t agree' and 5 means 'I do fully agree', or answer 'I don’t know'.

After answering, the user may: grade a new randomly chosen word pair; look up a word in the synonym dictionary; suggest a new synonymous word pair; download the synonym dictionary in XML.

Step 4: Analyzing the gradings done by the users. 1.2 million gradings were made in less than 2 months. Grading statistics were analyzed on several occasions. Some users sent comments.

Keeping the users happy! Many users said that there were too many bad pairs. Lots of pairs were graded 0 (not at all synonyms) by all users. After some weeks such pairs were removed. Later more pairs were removed, improving the quality of the remaining pairs considerably.
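The early clean-up described here, dropping pairs that every user graded 0, might look like the sketch below. The minimum number of gradings required before a pair is removed (`min_gradings`) is an assumption, since the slide does not say how many unanimous zeros were needed, and the data is invented.

```python
def prune_all_zero(gradings, min_gradings=5):
    """Split pairs into (kept, removed).

    gradings: pair -> list of integer grades 0..5.
    A pair is removed only if it has at least min_gradings grades
    (an assumed safeguard) and every one of them is 0.
    """
    kept, removed = {}, {}
    for pair, grades in gradings.items():
        if len(grades) >= min_gradings and all(g == 0 for g in grades):
            removed[pair] = grades
        else:
            kept[pair] = grades
    return kept, removed

# Hypothetical gradings for illustration.
GRADINGS = {
    ("klass", "uppdrag"): [0, 0, 0, 0, 0, 0],
    ("klass", "rang"): [5, 4, 5, 5, 0],
    ("klass", "utbilda"): [0, 0],  # too few gradings to judge yet
}

kept, removed = prune_all_zero(GRADINGS)
```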

User gradings first two months

More interesting gradings 2006

Distribution of mean gradings of word pairs after two months

Distribution of mean gradings of word pairs 2006

Analysis of the pairs graded 0: distance (cosine) in RI space

Some statistics (November 2006): 2.5 M user gradings done; pairs (graded ≥ 2) in dictionary; pairs suggested by users; unique pairs suggested; of them have been accepted

Example: Synonyms to klass (class)
5: rang (grade), rank (rank), slag (kind)
4: kategori (category), stånd (social class), årskurs (grade)
3: fack (sphere), grad (degree), grupp (group), kvalitet (quality), nivå (level), ordning (order), skikt (layer), sort (sort), standard (standard), stil (style)
2: storleksordning (magnitude), typ (type)
1: poäng (point), stadga (stability)
0: uppdrag (mission), utbilda (educate)

How to prevent abuse? Many gradings of a word pair are needed before it’s considered to be good. The pair to be graded is randomly picked from a very large list. Word pairs suggested by users are spell-checked before they are added to the very large list.
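The rule that a pair needs many gradings before it counts as good can be combined with the grade cut-off mentioned in the statistics slide. The sketch below makes several assumptions: "graded ≥ 2" is read as a mean grade of at least 2, the minimum of 10 gradings is a placeholder, and "I don’t know" answers are stored as `None` and ignored. The data is invented.

```python
from statistics import mean

def accepted_pairs(gradings, min_gradings=10, min_mean=2.0):
    """Return pair -> mean grade for pairs with enough gradings and a high enough mean.

    gradings: pair -> list of answers, each an int 0..5 or None for 'I don't know'.
    """
    accepted = {}
    for pair, answers in gradings.items():
        grades = [g for g in answers if g is not None]
        if len(grades) >= min_gradings and mean(grades) >= min_mean:
            accepted[pair] = mean(grades)
    return accepted

# Hypothetical answers for illustration.
ANSWERS = {
    ("klass", "rang"): [5, 4, 5, None, 5, 4, 5, 5, 4, 5, 5],
    ("klass", "uppdrag"): [0] * 12,
    ("klass", "typ"): [3, 2],  # plausible, but too few gradings so far
}

accepted = accepted_pairs(ANSWERS)
```

Requiring many independent gradings means a single user cannot push a pair into the dictionary, which is the point of the first safeguard on the slide.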

People's definition of synonymy: The exact meaning of 'synonym' wasn’t defined. Users will grade using their intuitive understanding of the concept of synonymy and of the words in the pair. The produced dictionary will use the people's own definition of synonymy. Hopefully this is exactly what they want!

The people’s synonym dictionary on the web

Lessons learned: The list of suggested synonyms should be huge. Try to improve the quality of the list automatically as much as possible. Random indexing is useful for this; also try tagging and using other dictionaries. Use the 0 answers early to remove bad pairs that only irritate the users.