Published by Tobias Brooks. Modified over 9 years ago.
1
CS 396 Pattern Recognition Project
Language Classifier v1.0
By Paul Troncone, David Keiper, Eugene Schvarts
2
Topics of discussion…
The Proposal – A Language Classifier
Designing the project
Implementing the project
References
Conclusion
3
The Proposal – A Language Classifier
The user inputs a text file written in a language that uses the standard A-Z alphabet. The program then takes the text file and determines which language it was written in.
4
Designing the Project
5
Input -> Cleanup -> f1, f2, f3, f4, f5, f6 -> Vector -> Classifier -> Output
6
INPUT
English, German, French, Italian, Spanish, Swedish, Polish, Dutch, Romanian, Portuguese, Danish
7
CLEANUP
Before: This is a test..... 1987 This is a test too-oo-oo.
After: This is a test. This is a test toooooo.
Removes:
Multiple Periods
Numbers
Hyphens, Slashes, Etc.
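The cleanup step on this slide can be sketched in Java roughly as follows. This is a minimal illustration, not the project's actual cleanUp method: the class name `TextCleanup`, the method name `clean`, and the exact regular expressions are assumptions. The slide's "Etc." is not specified, so only digits, hyphens, slashes, and runs of periods are handled here.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names and patterns below are assumptions for illustration.
final class TextCleanup {
    private static final Pattern DIGITS = Pattern.compile("[0-9]+");
    // Hyphens, forward slashes, and backslashes.
    private static final Pattern JOINERS = Pattern.compile("[-/\\\\]");
    private static final Pattern PERIOD_RUN = Pattern.compile("\\.{2,}");
    private static final Pattern SPACE_RUN = Pattern.compile("\\s+");

    private TextCleanup() {}

    static String clean(String text) {
        // Drop numbers entirely.
        String s = DIGITS.matcher(text).replaceAll("");
        // Drop hyphens and slashes, joining the pieces as in the slide's example.
        s = JOINERS.matcher(s).replaceAll("");
        // Collapse runs of periods into a single period.
        s = PERIOD_RUN.matcher(s).replaceAll(".");
        // Collapse the whitespace left behind by the removals.
        s = SPACE_RUN.matcher(s).replaceAll(" ");
        return s.trim();
    }
}
```

Under these assumptions, `TextCleanup.clean("This is a test..... 1987 This is a test too-oo-oo.")` returns `"This is a test. This is a test toooooo."`, which matches the before and after text on the slide.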
8
FEATURES
Feature Vector:
[0] = Average Word Length
[1] = Percent of Words Ending In Vowels
[2] = Average Sentence Length
[3] = Average Characters Per Sentence
[4] = Average Number of Vowels Per Word
[5] = Number of Words With Z’s
[6] = Number of Words That End in “ing”
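One possible way to compute the seven-element feature vector is sketched below. The slides do not define the features precisely, so several choices here are assumptions: sentences are split on periods, words on whitespace, sentence length is measured in words, characters per sentence include spaces, and only a, e, i, o, u count as vowels. The class name `FeatureExtractor` is hypothetical, and the input is assumed to be already cleaned text.

```java
import java.util.Locale;

// Hypothetical sketch of the feature vector; all definitions are assumptions.
final class FeatureExtractor {
    private static final String VOWELS = "aeiou";

    private FeatureExtractor() {}

    static boolean isVowel(char c) {
        return VOWELS.indexOf(c) >= 0;
    }

    static double[] extract(String cleaned) {
        String text = cleaned.toLowerCase(Locale.ROOT);
        int sentenceCount = 0, wordCount = 0;
        int endVowel = 0, withZ = 0, endIng = 0;
        long letters = 0, vowels = 0, sentenceChars = 0;

        for (String sentence : text.split("\\.")) {
            String trimmed = sentence.trim();
            if (trimmed.isEmpty()) {
                continue; // skip empty fragments between periods
            }
            sentenceCount++;
            sentenceChars += trimmed.length();
            for (String word : trimmed.split("\\s+")) {
                wordCount++;
                letters += word.length();
                for (int i = 0; i < word.length(); i++) {
                    if (isVowel(word.charAt(i))) {
                        vowels++;
                    }
                }
                if (isVowel(word.charAt(word.length() - 1))) endVowel++;
                if (word.indexOf('z') >= 0) withZ++;
                if (word.endsWith("ing")) endIng++;
            }
        }
        if (wordCount == 0) {
            return new double[7]; // no words: all features are zero
        }
        return new double[] {
            (double) letters / wordCount,             // [0] average word length
            100.0 * endVowel / wordCount,             // [1] percent of words ending in vowels
            (double) wordCount / sentenceCount,       // [2] average sentence length (words)
            (double) sentenceChars / sentenceCount,   // [3] average characters per sentence
            (double) vowels / wordCount,              // [4] average vowels per word
            withZ,                                    // [5] number of words with z
            endIng                                    // [6] number of words ending in "ing"
        };
    }
}
```

Features [5] and [6] are raw counts, as the slide states, so in this sketch they grow with the length of the sample while the other five do not.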
9
FEATURES
10
Implementing the Project
11
Classifier 1
Nearest Neighbor
Features Used:
Average Word Length
Percent of Words Ending In Vowels
Accuracy: 80% – 85%
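A nearest-neighbor classifier over these two features can be sketched as below. The slides do not say which distance measure was used, so Euclidean distance is an assumption, as are the names `NearestNeighbor`, `Sample`, and `classify`.

```java
import java.util.List;

// Hypothetical 1-nearest-neighbor sketch; Euclidean distance is an assumption.
final class NearestNeighbor {
    static final class Sample {
        final String language;
        final double[] features;

        Sample(String language, double[] features) {
            this.language = language;
            this.features = features;
        }
    }

    private NearestNeighbor() {}

    // Squared Euclidean distance; the square root is not needed for ranking.
    static double squaredDistance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("feature vectors differ in length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Returns the language label of the training sample closest to the query.
    static String classify(List<Sample> training, double[] query) {
        if (training.isEmpty()) {
            throw new IllegalArgumentException("training set is empty");
        }
        Sample best = training.get(0);
        double bestDist = squaredDistance(best.features, query);
        for (Sample s : training) {
            double d = squaredDistance(s.features, query);
            if (d < bestDist) {
                bestDist = d;
                best = s;
            }
        }
        return best.language;
    }
}
```

One design point worth noting: average word length and a percentage live on very different scales, so an unscaled Euclidean distance is dominated by the percentage feature. Whether the project normalized its features is not stated in the slides.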
12
Classifier 2
Artificial Neural Network
Features Used:
Average Word Length
Percent of Words Ending In Vowels
Average Sentence Length
Average Characters Per Sentence
Average Number of Vowels Per Word
Number of Words With Z’s
Number of Words That End in “ing”
Accuracy: 95%
13
Creating the Graphical User Interface
We wanted to use the Java look-and-feel.
jComboBox – holds 15 samples plus the option for a random sample
jButton – allows for the paste functionality
jTextArea – text files are read and added to this area
jRadioButton – triggered when either classifier is clicked
jTextArea – output from classifier appended here
jButton – sets all text areas to null and all buttons to false, effectively clearing the screen of text
jComboBox – holds 11 languages and an option for random
jTextArea – word count appended here
jButton – sends text sample to the feature extractors
jButton – sends text to cleanUp method
jTextArea – output from feature extractors stored in array, then appended here
14
References
15
References
Language Identification and IT: Addressing Problems of Linguistic Diversity on a Global Scale
Peter Constable and Gary Simons, SIL International
6,800 languages known

Region      # of languages
Africa      2062
Americas    1020
Asia        2202
Europe      237
Pacific     1312
16
References
What factors will form the basis of an operational definition of language?
There are many factors that may be considered, such as the following:
actual linguistic similarity between speech varieties;
intelligibility;
literacy and ability to share a common literature;
ethnic identities and self-perception of language communities;
other perceptions and attitudes based on political or social factors.
17
References
Change
Categorization: Different operational definitions of language
Inadequate definition
Scale: There are on the order of 6,800 languages known to exist
Documentation Problems
18
References
A solution to these problems would be considerably advanced by a compilation of language information
that consistently applies an operational definition of language so that all entities for which an identifier is assigned are of a comparable nature,
that encompasses all of the languages of the world,
that clearly documents the speech variety that each identifier denotes,
that is maintained and updated on an on-going basis, and
that is freely and readily accessible to the public over the Internet.
19
Conclusion
20
Conclusion
Number Of Features: 7
Size of Training Set: 165 Files
Testing Set: 100+ Files
Overall Success Rate: 93%
21
Conclusion
Given more time to extract additional features, we could achieve 99.5% accuracy for the set of eleven languages.