Unamunda N-Gram Based Natural Language Classification for Single Novel Words December 12, 2006 Mike Buchanan.

Slides:



Advertisements
Similar presentations
How do you study for a test ?
Advertisements

Using ITV to Enable D/HH Students Meet the Challenge of State Language Arts Standards Barbara K. Strassman The College of New Jersey To contact the author.
Statistical vs Clinical Significance
Compression techniques. Why we need compression. Types of compression –Lossy and lossless Concentrate on lossless techniques. Run Length coding. Entropy.
Word List A.
Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance [1] Pirooz Chubak May 22, 2008.
Powerpoint How Not To Do It !
Distance Between Two Partitions Joe Previte Penn State Erie.
What you might do and see at each place. Why you would enjoy it.
WINNER VS LOOSER. WINNER VS LOOSER WINNER VS LOOSER The winner is always part of the answer; The loser is always part of the problem.
E4004 Surveying Computations A Two Missing Distances.
56.) Congruent Figures—figures that have the same size and shape 57.) Similar Figures—figures that have the same exact shape but different size (angles.
Transforming Context-Free Grammars to Chomsky Normal Form 1 Roger L. Costello April 12, 2014.
English Conversation Week 4 SUMMER Dreams and Ambitions  In this lesson we will talk about our dreams.  We will talk about our ambitions.  We.
Huffman Encoding 16-Apr-17.
Why I Use Facebook. (And why you should, too.) Elizabeth A. Evans UNC-Chapel Hill ITS Teaching and Learning December 2007.
Mining Time-Series Databases Mohamed G. Elfeky. Introduction A Time-Series Database is a database that contains data for each point in time. Examples:
Why I Use Facebook. (Maybe you should, too.) Elizabeth A. Evans UNC-Chapel Hill ITS Teaching and Learning December 2007.
TENSE and MOOD.
Minority Language Conference Hanasaari-The Swedish- Finnish Cultural Centre November 27th and 28th 2008.
I WILL The following text will show you some aspects of my life in the future, based on the things currently i most like, the ones i would like to do.
Could we be pen friends with you? Year 7, unit 5, lesson 5.
Turkey Italy Spain France The Netherlands. After the Comenius Experience I feel more European because I had to face and fit into different traditions,
Mixed-level English classrooms What my paper is about: Basically my paper is about confirming with my research that the use of technology in the classroom.
Franco Singh-Vigilante April 11,2011. W HAT DID I CHOOSE I chose Game programming as it has constant use of code and sometimes used to create engines,
Rules Get in groups of 4 Answer each question. If you get the question correct, you get to pick a box. At the end of the period we will open the boxes,
* Discussion: DO YOU AGREE OR DISAGREE WITH THESE STATEMENTS? WHY OR WHY NOT? 1.The difficulty of a text depends mostly on the vocabulary it contains.
Lecture 6: The Ultimate Authorship Problem: Verification for Short Docs Moshe Koppel and Yaron Winter.
1 Confounding In an unreplicated 2 K there are 2 K treatment combinations. Consider 3 factors at 2 levels each: 8 t.c’s If each requires 2 hours to run,
11.5 Similar Triangles Identifying Corresponding Sides of Similar Triangles By: Shaunta Gibson.
All Rights Reserved – Languistics Consultancy Welcome To Who Wants To Be A Virtual Millionaire – The Game In Which You Can Win A Virtual Million Pounds!
8-8 6 th grade math Similar Figures. Objective To use proportions to solve problems involving similar figures Why? To know how to solve missing sides.
10.1 DAY 2: Confidence Intervals – The Basics. How Confidence Intervals Behave We select the confidence interval, and the margin of error follows… We.
ENM 503 Block 1 Algebraic Systems Lesson 3 – Modeling with Sets Sets - Why do we care? The Application of Sets – a look ahead 1 Narrator: Charles Ebeling.
07 KM Page 1 ECEn/CS 224 Karnaugh Maps. 07 KM Page 2 ECEn/CS 224 What are Karnaugh Maps? A simpler way to handle most (but not all) jobs of manipulating.
C.W. Bednarz and W.D. Shurley University of Georgia and W.S. Anthony USDA-ARS Losses in Yield, Quality, and Profitability of Cotton From Improper Harvest.
PROBABILITY AND STATISTICS FOR ENGINEERING Hossein Sameti Department of Computer Engineering Sharif University of Technology Principles of Parameter Estimation.
6 - 1 Simplification Theorems Useful for simplification of expressions & therefore simplification of the logic network which results. XY + XY' = ( X +
LAME DUCKS AND FLYING DOGS. Information in the digital era ……. (and why you can’t always get what you want) Colin Mc Cullough, eMedia, 14 December 2001.
Similar Figures Notes. Solving Proportions Review  Before we can discuss Similar Figures we need to review how to solve proportions…. Any ideas?
EXAMPLE 2 Use the Midsegment Theorem In the kaleidoscope image, AE BE and AD CD. Show that CB DE. SOLUTION Because AE BE and AD CD, E is the midpoint of.
Ngram models and the Sparcity problem. The task Find a probability distribution for the current word in a text (utterance, etc.), given what the last.
LEARNING ENGLISH. ENGLISH IS HERE, ENGLISH IS THERE, ENGLISH IS EVERYWHERE!
Wait! Unit5 Let’s spell Let’s a game! Read and Yes or No Today Monday today is your. play say Wow! We have computer class on s. Maybe is. Ms Feng be.
Similar Figures (Not exactly the same, but pretty close!)
It’s Time to Write a strong Thesis Statement! Packet #3 Working Thesis.
The Tennis Club Problem Supplementary Notes  A tennis club has 2n members.  We want to pair up the members (by twos) to play singles matches  Question:
 If three sides of one triangle are congruent to the three sides of another triangle, then the two triangles are congruent.  If AB = DE, BC = EF, AC.
10/23/2009 Alpha Prototype. 10/23/2009 TOPICS FOR TODAY Project Schedule o Achievements o The last few weeks... System Design and Architecture (new) Prototype.
Computerizing Division 3 Tennis Rankings John Goldis Scientific Computing Spring 2003.
EXAMINERS’ COMMENTS RAPHAEL’S LONG TURN GRAMMAR Accurate use of simple grammatical structures and also of some complex sentences: ‘they could also be preparing.
Welcome to our school We are from the Czech republic and we all live in Most Jane, Hanah and Bill.
LECTURE 11: LINEAR MODEL SELECTION PT. 1 March SDS 293 Machine Learning.
Maestro AI Vision and Design Overview Definitions Maestro: A naïve Sensorimotor Engine prototype. Sensorimotor Engine: Combining sensory and motor functions.
Warm Up  Use a ruler to draw a large triangle. Measure the angles of the triangle. Make a conclusion about the sum of the measure of the angles.
FCE Speaking Test – Part 3
Relationship among the Three sides of a Triangle
Matrix Chain Multiplication
Perimeter of combined figures
Matrix Chain Multiplication
Fractional Factorial Design
Writing Algebraic Expressions
Exploring Partially ordered sets
Lesson 13.1 Similar Figures pp
2018 UGA OFT Cotton Varieties (14)
Huffman Encoding.
X y y = x2 - 3x Solutions of y = x2 - 3x y x –1 5 –2 –3 6 y = x2-3x.
ON HALF SHEET: Write the formula that was used in the video for today.
Design matrix Run A B C D E
Optimization of Strawberry Production in Fusarium Infested Soil Part 1
Presentation transcript:

Unamunda N-Gram Based Natural Language Classification for Single Novel Words December 12, 2006 Mike Buchanan

The Problem I have an undated, post-WWII photograph of the gates to a Jewish cemetery in Skopje (formerly Yugoslavia, now Republic of Macedonia). Underneath the Hebrew text are the words in block letters: "IZRAELITSЌO POKOPALIŠČE" (the diacritics being my best guess). My questions: * What language is this? * What does the text mean? -- Many thanks, Deborahjay 07:34, 11 December 2006 (UTC)

Is Letter Frequency a Solution? Letter frequency analysis of that suggests it is Bosnian, but possibly Czech, Croatian, Serbocroatian, Lithuanian, Slovak, and Slovenian. “Diacritic on "Ќ" should not be there — maybe a damage on the inscription or photo?” Duja 10:31, 12 December 2006 (UTC)

N-Grams Look at more than one letter at a time. “abcdef” becomes:  a b c d e f  ab bc cd de ef  abc bcd cde def  abcd bcde cdef

N-Gram Frequency “The rain in Spain falls mainly on the plains.” n, in, i, a, ain, n, ai, l, in, ain, he, s, p, e, h By letter frequency, that could be quite a few languages. Many fewer languages have the “ai” dipthong.

Distance Metric n in i a ain n ai l in ain he... e t a i o n s r h l c...

Misses What do you do when an n-gram is in one sample but not another? How far is “Википедија” from English?

What would Cavnar and Trenkle do? Previous N-gram based approaches have limited their frequency profiles to some small constant length (~300), so a miss obviously cost 300. My frequency profiles are not limited to finite length: the more training data, the longer the profile. No obvious answer.

Idea 1: Individual Size Misses for a given language depend on that language's frequency profile length. Misses for English cost lots, misses for Zulu almost nothing. Result: Zulu does very well. Observation: The null language always wins. This might be desirable if some possible languages have scarce training data.

Idea 2: Uniform Miss Cost We want to be fairer to languages with large profiles. Let's agree on one miss cost for all languages.  Maximum: Unfair to short profiles.  Minimum: Unfair to long profiles.  Mean: Just right?

General organization Language Corpora Parser Canonical Profiles Parser Unknown text sample Sample Profile Metric Nearest match

Where to get corpora? It's pretty easy to get corpora for English, most European languages, etc. What about our example from Macedonia? Wikipedia is in 250 languages, from Afrikaans to Zulu, including Lojban, Navajo, Manx, Assyrian Neo-Aramaic, and Klingon. Markup is a bit painful, but I'm now the master of sed.

Does it work? My network says that “ IZRAELITSЌO POKOPALIŠČE ” is Slovenian. “It's definitely Slovenian” -Duja 10:31, 12 December 2006 (UTC) No quantitative results yet, because the network is too much fun to play with. Demo!