
To quantitatively test the quality of the spell checker, the program was executed on predefined “test beds” of words for numerous trials, ranging from thousands to millions of runs. The test beds varied widely in content because they were freshly constructed from the latest statistical data on the most commonly searched words on two products of our sponsor company. To ensure that the time-efficiency results were accurate, the data from each trial were recorded and averaged. In addition, each corrected result and the frequency of that result were recorded to show the accuracy and consistency of the program.

The spell checker works relatively well. The theoretical run time in Big-Oh notation is O(m*n^2), where m is a small constant (0 < m < 1) and n is the length of the inputted word. The scalar m is included because the Kd tree filters out many impossible matches and therefore reduces the amount of data the algorithms must process. The n^2 term comes from the use of the Damerau-Levenshtein edit distance algorithm. The measured run time on our sponsor company’s servers is lower still and is therefore within the acceptable range.

The correctness of the algorithm is relatively high. The algorithm returned the correct answer for all but four cases of the 70-word test bed on which it was run. In two of those four cases, the algorithm did not find any answer. Because the spell checker output is not displayed when no answer is returned, the spell checker suggests an incorrect word only 2.9% of the time (2 of 70 cases). Overall correctness: 94.2%.
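The trial procedure and the three outcome classes described above (correct, no answer, wrong suggestion) can be sketched roughly as follows. This is only an illustration under stated assumptions: `run_trials`, the `correct` callable, the report keys, and the toy test bed are hypothetical names and made-up data, not the project's actual harness or results.

```python
import time
from collections import Counter
from typing import Callable, Optional


def run_trials(correct: Callable[[str], Optional[str]],
               test_bed: list[tuple[str, str]],
               trials: int) -> dict:
    """Run a spell checker over a test bed many times and summarize it.

    test_bed holds (misspelled, expected) pairs with distinct misspellings.
    `correct` returns a suggestion, or None when it finds no answer.
    """
    total_seconds = 0.0
    # For every input word, count how often each result was returned.
    outcomes = {word: Counter() for word, _ in test_bed}

    for _ in range(trials):
        start = time.perf_counter()
        results = [correct(word) for word, _ in test_bed]
        total_seconds += time.perf_counter() - start
        for (word, _), result in zip(test_bed, results):
            outcomes[word][result] += 1

    report = {"correct": 0, "no_answer": 0, "wrong": 0}
    for word, expected in test_bed:
        # Classify each word by its most frequently returned result.
        most_common_result = outcomes[word].most_common(1)[0][0]
        if most_common_result == expected:
            report["correct"] += 1
        elif most_common_result is None:
            report["no_answer"] += 1
        else:
            report["wrong"] += 1

    # Average the timing over every word of every trial.
    report["mean_seconds_per_word"] = total_seconds / (trials * len(test_bed))
    report["outcomes"] = outcomes
    return report


if __name__ == "__main__":
    # Toy stand-in for the real checker: a fixed lookup table (made-up data).
    toy_checker = {"teh": "the", "recieve": "receive", "wrod": "ward"}
    toy_bed = [("teh", "the"), ("recieve", "receive"),
               ("zzzz", "jazz"), ("wrod", "word")]
    summary = run_trials(toy_checker.get, toy_bed, trials=1000)
    print(summary["correct"], summary["no_answer"], summary["wrong"])
```

Separating "no answer" from "wrong" mirrors the distinction drawn in the results: only a wrong suggestion is ever shown to the user, so it is the more costly failure.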

The purpose of the project was to create a spell checker that could search a corpus-specific dictionary of at least 100,000 words and correct a user's search query in 500 milliseconds or less. The program was required to identify incorrectly spelled phrases and to use the words in a phrase to guess the correct spelling of the remaining words.

A person's spelling mistakes can be divided into two categories: typos and guesses. Typos are random mistakes made by accident. Guesses are attempts at spelling a word by sounding it out. With typos, the correct word can be found by using a string matching algorithm, such as the Damerau-Levenshtein edit distance algorithm, to search a dictionary for the closest matching word. Correcting guesses requires the words to be translated into their phonetic equivalents, which can be done using programming libraries such as Soundex and Metaphone. The closest match for the misspelled word can then be found by comparing the phonetic equivalents with the same string matching algorithms used for typos.

However, checking an entire dictionary for each misspelled word is inefficient and costly in time. The process can therefore be optimized by filtering out dictionary words that cannot be matches. A Kd tree is a special data structure that organizes the dictionary so that only words that could possibly match the misspelled word are retrieved. It acts as a filter for the core algorithm and eliminates excess calculations. In addition to checking individual words for mismatches, words that are commonly used together as phrases can be used to “hint” at the correct spelling of the rest of the phrase. This reduces excess work further and increases efficiency.
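To illustrate what "translating a word into its phonetic equivalent" means, here is a minimal hand-written sketch of the classic American Soundex encoding. It is an assumption that this matches the behavior of whichever Soundex library the project used; the function name `soundex` and the example words are chosen only for illustration.

```python
# Consonant groups of American Soundex; vowels, y, h and w carry no code.
_CODES = {}
for _letters, _digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex code of a word ('' for no letters)."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""

    code = letters[0].upper()
    previous = _CODES.get(letters[0], "")  # the first letter's code counts too

    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit:
            # Adjacent letters with the same code are encoded only once.
            if digit != previous:
                code += digit
            previous = digit
        elif ch not in "hw":
            # A vowel separates repeated codes; h and w do not.
            previous = ""
        if len(code) == 4:
            break

    return code.ljust(4, "0")  # pad short codes with zeros


if __name__ == "__main__":
    # A sounded-out guess and its intended word share one code.
    print(soundex("definately"), soundex("definitely"))
```

Because the codes are ordinary strings, they can be compared exactly or passed to the same edit distance routine used for typos, which is the approach described above.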
The accuracy and time-efficiency results for the final spell checker program exceeded those of the spell checker previously used by our sponsor company, and also those of versions using the Landau-Vishkin algorithm and a modified Landau-Vishkin algorithm. The previous spell checker was an “out of the box” spell checker built into Lucene, an open source software package that was used to store and access webpage data. The Lucene spell checker works by ranking words using the Levenshtein edit distance. The Levenshtein edit distance algorithm runs in O(M*N) time and requires M*N memory space, where M and N are the lengths of the two strings being compared. The algorithm is very simple and is extremely fast when comparing short strings, but inefficient for long strings (Black, 2006). The Damerau-Levenshtein edit distance algorithm implemented in the final spell checker relies on the same method and is very similar; however, the Damerau-Levenshtein algorithm counts two transposed letters as one mistake, whereas the Levenshtein algorithm counts them as two. The Lucene spell checker provides quick results for small test cases and is easy to use, but it was unsatisfactory because of the low accuracy of its results and its poor time efficiency for long test cases (Black, 2006). The spell checker produced by the project far exceeds the Lucene spell checker in speed for longer test cases.

This study was designed to create a spell checker for an Internet-based search engine. Different algorithms were created and unified under a common spell checker program. The resulting program returned the correct spelling of an inputted word most of the time, and it ran fast enough that the user would not notice a delay. Users can now search the Internet using our sponsor company’s search engine and have their spelling corrected.
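The difference between the two edit distances compared above can be seen in a short sketch. It uses the standard dynamic-programming table and, when `allow_transposition` is set, adds the adjacent-transposition rule in its restricted "optimal string alignment" form; whether the project used this restricted form or the unrestricted Damerau-Levenshtein distance is not stated, so treat that choice, and the name `edit_distance`, as assumptions.

```python
def edit_distance(a: str, b: str, allow_transposition: bool = False) -> int:
    """Levenshtein distance; optionally count an adjacent swap as one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and first j of b.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (allow_transposition and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                # Two adjacent letters swapped: one edit instead of two.
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)

    return d[-1][-1]


if __name__ == "__main__":
    print(edit_distance("recieve", "receive"))                            # 2
    print(edit_distance("recieve", "receive", allow_transposition=True))  # 1
```

The table has (M+1) by (N+1) cells and each cell is filled in constant time, which is where the O(M*N) time and space figures come from.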