Free Swedish Word Lists or Hackers’ BLARK Viggo Kann KTH, Stockholm GSLT meeting January 26, 2008.

What is a free language resource?
- Anyone can use it in an application
- Anyone can study it and modify it
- Anyone can take a copy of it
- Anyone can improve it and release the improvements to the public, so that the whole community benefits
(Based on the four freedoms of free software, Richard Stallman)

Strong free software culture
- GNU project
- FSF – Free Software Foundation
- GPL – GNU General Public License
- OSI – Open Source Initiative
- Linux, TeX, Emacs, GCC, MySQL, PHP, Java, Python, Firefox

First meeting of the Free Swedish Words group at KTH, January
Participants from around Sweden:
- Lars Aronsson: project Runeberg and Swedish Wikipedia (Wiktionary)
- Lars Törnquist and Sven Lange: Swedish thesaurus built on Bring (1930)
- Christian Mattson: Lexin dictionaries

- Niklas Johansson: Spelling error detection and correction in OpenOffice
- Göran Andersson: DSSO, the large Swedish word list
- Viggo Kann: Stava, Granskatagger, Synlex, Tvärslå Nordic dictionary
- Per Starrbäck, Leif-Jöran Olsson, Tomas Padron-McCarthy, Erik Geijer

Plans for more free words
- Swedish synonyms in OpenOffice (Niklas)
- Extending DSSO with synonyms, associations, etc. (Göran)
- Building a free Swedish-English dictionary (Viggo)
- Testing Swedish grammar checking in LanguageTool/OpenOffice (Viggo & Niklas)

Typical ways to construct a resource
…if you are a language technologist:
- Get funding
- Use resources that are free to use for researchers
- Hire linguists to do the heavy jobs
…if you are a free software hacker:
- Use other free resources
- Collect data from lots of people using e.g. a wiki or a web form

Example: Synlex
- Construct a Swedish dictionary of synonyms as a list of synonymous pairs
- I don’t want to work a lot
- I don’t want to pay anyone to work
- The resulting list should become free

Ideas
- Automatically construct a large set of word pairs that might be synonyms
- Use tens of thousands of people, each willing to make a small contribution without payment, to check the word pairs

More ideas
- Use the Lexin on-line Swedish-English dictionary web site, which had 9 million (now 25 M) lookups each month
- Users visit Lexin to translate words, and are thus probably motivated to help me
- Each time a user makes a lookup, give her the opportunity to decide whether two words are synonyms or not

My plan
1. Construct lots of possible synonyms
2. Sort out bad synonym pairs automatically
3. Ask lots of users if the rest of the pairs are good synonyms
4. Analyze the gradings done by the users and decide which pairs to keep

Step 1: Construct lots of possible synonyms
- If we have access to a Swedish-English dictionary SE and an English-Swedish dictionary ES, try to translate each word to English and back again to Swedish:
  $\{(w,v) : \exists y : y \in SE(w) \land v \in ES(y)\}$
  or
  $\{(w,v) : \exists y : y \in SE(w) \land y \in SE(v)\}$
- word pairs were generated
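
The two set definitions in Step 1 can be sketched in a few lines. This is only an illustration: the dictionaries are assumed to be plain mappings from a word to a set of translations, the function names and the toy entries are hypothetical, and trivial pairs (w, w) are dropped, which the slide does not state.

```python
from collections import defaultdict

def back_translation_pairs(se, es):
    """Pairs (w, v) where some English y has y in SE(w) and v in ES(y)."""
    pairs = set()
    for w, english in se.items():
        for y in english:
            for v in es.get(y, ()):
                if v != w:  # assumption: a word is not its own synonym
                    pairs.add((w, v))
    return pairs

def shared_translation_pairs(se):
    """Pairs (w, v) where some English y has y in SE(w) and y in SE(v)."""
    by_english = defaultdict(set)
    for w, english in se.items():
        for y in english:
            by_english[y].add(w)
    pairs = set()
    for words in by_english.values():
        for w in words:
            for v in words:
                if w != v:
                    pairs.add((w, v))
    return pairs

# Toy dictionaries (hypothetical entries, for illustration only).
SE = {"klass": {"class", "grade"}, "årskurs": {"grade"}, "rang": {"rank"}}
ES = {"class": {"klass"}, "grade": {"årskurs", "klass"}, "rank": {"rang"}}

candidates = back_translation_pairs(SE, ES) | shared_translation_pairs(SE)
```

The second definition needs only the Swedish-English dictionary, which may matter when no reverse dictionary is available.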

Step 2: Remove bad synonym pairs automatically
- Use RI (Random Indexing) [Kanerva, Kristoferson, Holst 2000] to measure the distance between words represented in a large vector space
- Keep pairs that have small enough distance in the vector space
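
A minimal sketch of the filtering idea, not the setup actually used for Synlex: each word gets a sparse random index vector, a word's context vector is the sum of the index vectors of its neighbours, and pairs are kept when the cosine distance between their context vectors is small. The dimension, window size, distance measure, threshold, and toy corpus are all assumptions made for illustration.

```python
import math
import random

def index_vector(rng, dim, nonzeros):
    """Sparse ternary random vector: a few +1/-1 entries, zeros elsewhere."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1, 1))
    return vec

def context_vectors(sentences, dim=300, nonzeros=6, window=2, seed=0):
    """Add the index vectors of nearby words to each word's context vector."""
    rng = random.Random(seed)
    index = {}
    for sentence in sentences:
        for word in sentence:
            if word not in index:
                index[word] = index_vector(rng, dim, nonzeros)
    context = {word: [0] * dim for word in index}
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    for k, value in enumerate(index[sentence[j]]):
                        context[word][k] += value
    return context

def cosine_distance(a, b):
    """1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def keep_close_pairs(pairs, context, max_distance=0.5):
    """Keep only pairs whose context vectors are close enough."""
    return [(w, v) for w, v in pairs
            if w in context and v in context
            and cosine_distance(context[w], context[v]) <= max_distance]

# Toy corpus (hypothetical): 'bil' and 'fordon' occur in the same context.
corpus = [["jag", "köper", "bil", "i", "dag"],
          ["jag", "köper", "fordon", "i", "dag"],
          ["hon", "blir", "glad", "av", "musik"]]
vectors = context_vectors(corpus)
kept = keep_close_pairs([("bil", "fordon"), ("bil", "glad")], vectors)
```

Words that occur in similar contexts end up with similar context vectors, so unrelated pairs produced by the back-translation step tend to fall outside the threshold.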

Step 3: Ask lots of users if the rest of the pairs are good synonyms
- When a user has sent a word to the Lexin dictionary he receives the translation followed by a question like:
  Are 'spread' and 'lengthen' synonyms? Answer using a scale from 0 to 5 where 0 means 'I don’t agree' and 5 means 'I do fully agree', or answer 'I don’t know'
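
The question-and-answer step might look roughly like this. It is a sketch under assumptions: the answer is taken to arrive as a form value "0" to "5" or the token "dont_know", and all function names are hypothetical rather than part of the Lexin site.

```python
import random

DONT_KNOW = "dont_know"  # assumed form value for 'I don't know'

def pick_question(candidate_pairs, rng=random):
    """Pick a random candidate pair and phrase the question for the user."""
    w, v = rng.choice(candidate_pairs)
    return (w, v), f"Are '{w}' and '{v}' synonyms?"

def parse_answer(raw):
    """Return a grade 0..5 or DONT_KNOW; raise ValueError on anything else."""
    text = raw.strip().lower()
    if text == DONT_KNOW:
        return DONT_KNOW
    grade = int(text)  # raises ValueError for non-numeric input
    if not 0 <= grade <= 5:
        raise ValueError(f"grade out of range: {grade}")
    return grade

def record(gradings, pair, answer):
    """Store a numeric grade for the pair; 'don't know' answers are skipped."""
    if answer != DONT_KNOW:
        gradings.setdefault(pair, []).append(answer)

gradings = {}
pair, question = pick_question([("spread", "lengthen")], random.Random(1))
record(gradings, pair, parse_answer("1"))
record(gradings, pair, parse_answer("dont_know"))
```

Keeping 'I don’t know' out of the stored grades means it does not pull the mean toward zero.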

Step 4: Analyzing the gradings done by the users
- 1.2 million gradings were made in less than 2 months
- Grading statistics were analyzed on several occasions
- Some users sent comments

More and more interesting gradings as time goes by

Distribution of mean gradings of word pairs

Some statistics (January 2008)
- 2.8 M user gradings done
- pairs (graded ≥ 2) in dictionary
- pairs suggested by users
- unique pairs suggested
- of them have been accepted
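
One way to turn raw gradings into dictionary entries, assuming that "graded ≥ 2" refers to the mean grade of a pair. The data layout (a mapping from a pair to its list of grades), the function names, and the toy grades are hypothetical.

```python
from statistics import mean

def mean_gradings(gradings):
    """Mean grade per word pair; pairs without any grades are skipped."""
    return {pair: mean(grades) for pair, grades in gradings.items() if grades}

def accepted_pairs(gradings, min_mean=2.0):
    """Pairs whose mean grade reaches the threshold, with their mean."""
    return {pair: m for pair, m in mean_gradings(gradings).items()
            if m >= min_mean}

# Toy gradings (hypothetical values, for illustration only).
toy = {("klass", "rang"): [5, 4, 5],
       ("klass", "uppdrag"): [0, 1, 0],
       ("klass", "typ"): []}
dictionary = accepted_pairs(toy)
```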

Example: Synonyms to klass (class)
5: rang (grade), rank (rank), slag (kind)
4: kategori (category), stånd (social class), årskurs (grade)
3: fack (sphere), grad (degree), grupp (group), kvalitet (quality), nivå (level), sort (sort), standard (standard), stil (style)
2: skikt (layer), storleksordning (magnitude), typ (type)
1: poäng (point), stadga (stability)
0: uppdrag (mission), utbilda (educate)

How to prevent abuse?
- Many gradings of a word pair are needed before it’s considered to be good
- The pair to be graded is randomly picked from a very large list
- Word pairs suggested by users are spell checked before they are added to the very large list
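
Two of these safeguards are easy to sketch. Both are simplified assumptions: the minimum number of gradings is an arbitrary value, and the spell check is reduced to a lookup in a set of known words, which is not necessarily how the real system checks spelling.

```python
def has_enough_gradings(grades, min_gradings=5):
    """A pair is only judged once enough users have graded it."""
    return len(grades) >= min_gradings

def add_suggestion(candidates, pair, known_words):
    """Add a user-suggested pair only if both words pass the spell check."""
    w, v = (word.strip().lower() for word in pair)
    if w == v or w not in known_words or v not in known_words:
        return False
    candidates.add((w, v))
    return True

# Toy word list (hypothetical), standing in for a real spell checker.
known = {"klass", "sort", "typ"}
pool = set()
ok = add_suggestion(pool, ("Klass", " sort "), known)
rejected = add_suggestion(pool, ("klass", "sorrt"), known)
```

Because the pair to grade is drawn at random from a very large list, a single user has little chance of grading the same pair often enough to pass the minimum on their own.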

People's definition of synonymy
- Exact meaning of 'synonym' wasn’t defined
- Users will grade using their intuitive understanding of the concept of synonymy and the words in the pair
- The produced dictionary will use the people's own definition of synonymy
- Hopefully this is exactly what they want!

Links
- The large Swedish word list
- Spell checker
- lexin.nada.kth.se/synlex.html (synonyms)
- sv.wiktionary.org (word dictionary)
- Hyperlexicon