Shani Vered Oz Adi Advisor : Prof. Michael Elhadad


OCR With Nikud
Shani Vered Oz Adi Advisor: Prof. Michael Elhadad

Motivation
Create a free tool that converts text without nikud into text with nikud.
This will help preserve the language, since nikud usage is decreasing.
Hebrew NLP research: create a Hebrew corpus with nikud.

Already Exists
Tesseract: open-source OCR for Hebrew without nikud, with relatively good results. There are still mistakes, and accuracy depends heavily on the font being recognized.
אם תרדבה בליל דמﬠותיר שמחתי לך אבﬠיר כצרור תבןִ אם תרחפבה מקור ﬠצמותייר, אכסר ואשכב ﬠך' אבן.

With Nikud
הכווּן - בגדים נוֹחים תוֹלדוֹת הלבוּשׁ מצביﬠוֹת צל כך, שבמשך הדוֹרוֹת השכיל האדם להשתחרר מאָפנוֹת כיבוּש שלא היוּ יפוֹת לבריאוּת.
We can see that the result is pretty good, but most of the nikud is not recognized.

We tried other OCR tools available on the web, such as Hocr and Qhocr.
Results are not satisfying. More nikud features are recognized, but there are many mistakes, and the text is often run together without correct spaces.
הָעלוּ בִּמַשִׁאֵבָה שֵׁהְפִעלָה עַל יִדֵי בִּהֵמָה (.ֹבִסוֹבִבָה בִּמַעִגָל. בָּאָרְץהִשִׁתַמִשׁוּהַמִתִיַשִׁבִיםהַיִהוּדִיםהָרִאשׁוֹנִים,מֵרְאשִׁית הַיִשׁוּב,בִּבִאֵרוֹתמוּנָעוֹתבִּדְלְקאוֹבִּחַשִׁמַל,הַסוֹבֵבבִּפַרִדִסֵי הַשָׁרוֹןיִתָקֵלעַלכָּלצַעַדוִשַׁעַלבִּמִבִנְיבֵּטוֹןשְׁעַלגַגָם

How To Improve
We want to train Tesseract so that it recognizes the Britannica Hebrew letters and nikud.
The way to do this is to create an improved training data file for Tesseract.
We used a helpful tool called Moshpytt.
Bounding boxes: a letter alone and its nikud alone vs. letter + nikud in the same bounding box.
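The bounding-box choice above can be illustrated with a small sketch. It assumes the standard Tesseract box-file line layout (glyph, left, bottom, right, top, page) and assumes that each nikud box directly follows the box of its base letter; the names `Box`, `parse_box_line`, `is_nikud` and `merge_nikud` are hypothetical helpers, not part of Tesseract or Moshpytt.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class Box:
    text: str
    left: int
    bottom: int
    right: int
    top: int
    page: int = 0


def parse_box_line(line: str) -> Box:
    """Parse one box-file line: '<glyph> <left> <bottom> <right> <top> <page>'."""
    # Split from the right so the glyph may hold a letter plus combining marks.
    text, left, bottom, right, top, page = line.strip().rsplit(" ", 5)
    return Box(text, int(left), int(bottom), int(right), int(top), int(page))


def is_nikud(text: str) -> bool:
    """True if the glyph consists only of combining marks (Hebrew points are Mn)."""
    return bool(text) and all(unicodedata.category(ch) == "Mn" for ch in text)


def merge_nikud(boxes):
    """Fold each nikud-only box into the preceding box (assumed to be its letter)."""
    merged = []
    for box in boxes:
        if is_nikud(box.text) and merged:
            prev = merged[-1]
            merged[-1] = Box(
                prev.text + box.text,
                min(prev.left, box.left),
                min(prev.bottom, box.bottom),
                max(prev.right, box.right),
                max(prev.top, box.top),
                prev.page,
            )
        else:
            merged.append(box)
    return merged


# Bet, then dagesh (U+05BC) and qamats (U+05B8) as separate boxes, then alef.
lines = ["\u05d1 10 20 30 50 0", "\u05bc 18 30 22 34 0",
         "\u05b8 14 10 26 16 0", "\u05d0 40 20 60 50 0"]
merged = merge_nikud(parse_box_line(l) for l in lines)
```

With separate boxes the trainer sees the letter and the point as two symbols; after merging, each pointed letter becomes one symbol whose box covers both.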

Data Set Distribution + Box Files Example

Letter Hits    Letter Hits    Letter Hits
א 1201    י 153    ע 242
ב 755    כ 1785    פ 520
ג 1333    ך 60    ף 31
ד 212    ל 333    צ 356
ה 469    מ 1020    ץ 48
ו 163    ם 651    ק 192
ז 1720    נ 881    ר 370
ח 108    ן 168    ש 1055
ט 402    ס 522    ת 808

Data Set Distribution - with nikud
א אְ אֱ אֳ אִ אֵ אֶ אַ אָ אׂ אׁ אֹ אּ אֻ 2 11 46 45 155 94 105 75 - 4
מ מְ מֱ מֲ מֳ מִ מֵ מֶ מַ מָ מׂ מׁ מּ מֻ 191 230 103 64 115 82 18
ע עְ עֱ עֲ עֳ עִ עֵ עֶ עַ עָ עׂ עׁ עּ עֻ 6 3 98 88 21 7 137 1
פ פְ פֱ פֲ פֳ פִ פֵ פֶ פַ פָ פׂ פׁ פּ פֻ 77 23 68 14 5
ש שְ שֱ שֲ שֳ שִ שֵ שֶ שַ שָ שׂ שׁ שּ שֻ 133 69 37 243 81 139
ת תְ תֱ תֲ תֳ תִ תֵ תֶ תַ תָ תׂ תׁ תּ תֻ 136 70 9 100 10
Top letters: a good table for understanding where to improve.
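Counts like the ones in these distribution tables could be gathered from box files roughly as follows. This is only a sketch under assumptions: it expects box-file lines in the usual Tesseract layout with letter and nikud in the same glyph field, and `base_letter` and `distribution` are hypothetical names.

```python
import unicodedata
from collections import Counter


def base_letter(glyph: str):
    """Return the first Hebrew letter (alef..tav, final forms included), or None."""
    for ch in glyph:
        if "\u05d0" <= ch <= "\u05ea":
            return ch
    return None


def distribution(box_lines):
    """Count hits per base letter and per pointed form (letter + nikud)."""
    letters, forms = Counter(), Counter()
    for line in box_lines:
        line = line.strip()
        if not line:
            continue
        # The glyph is everything before the five numeric fields.
        glyph = unicodedata.normalize("NFC", line.rsplit(" ", 5)[0])
        base = base_letter(glyph)
        if base is None:
            continue  # punctuation, digits, Latin text
        letters[base] += 1
        forms[glyph] += 1
    return letters, forms


sample = ["\u05d0\u05b8 1 2 3 4 0", "\u05d0 5 6 7 8 0",
          "\u05de\u05b4 1 2 3 4 0", "", ". 1 2 3 4 0"]
letters, forms = distribution(sample)
```

Pointed forms with few or no hits are the ones the training data covers poorly.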

Project Results
Confusion Matrix: Top 10 Errors
Words ending with the letter ד often get a hirik: דִ
Mistakes between שֶ and שֻ
כַּז-וּר , כֵרוּר instead of כַּדוּר
הֶ instead of הֻ and הָ
הֵ instead of הְ
יָ instead of יֶ
בֶ instead of בָ
The letter ק needs better training
ךְ does not exist in the corpus
Holam haser is missing in the corpus for some letters
תְ instead of חְ
סַ instead of סֵ
גַ' becomes נַ
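An error list like this one can be derived from confusion counts. The sketch below assumes the OCR output has already been aligned with the ground truth into (expected, recognized) symbol pairs; the alignment step itself is not shown, and `top_confusions` is a hypothetical helper name.

```python
from collections import Counter


def top_confusions(pairs, n=10):
    """Return the n most frequent (expected, recognized) mismatches with counts."""
    errors = Counter((exp, got) for exp, got in pairs if exp != got)
    return errors.most_common(n)


# Toy aligned pairs: shin+segol read as shin+qubuts twice, he+sheva as he+tsere once.
pairs = [
    ("\u05e9\u05b6", "\u05e9\u05bb"),
    ("\u05e9\u05b6", "\u05e9\u05bb"),
    ("\u05d4\u05b0", "\u05d4\u05b5"),
    ("\u05d1\u05b8", "\u05d1\u05b8"),  # correct, not counted
]
worst = top_confusions(pairs, n=2)
```

Each entry pairs the expected symbol with what was recognized instead, which is the same information a confusion matrix holds off its diagonal.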

Project Results - Cont.
Overall accuracy: 90%

                     Precision   Recall
Plain letters        95%         93%
Letters with nikud   82.4%       80.5%
Only nikud           87.9%       87%

Explain about precision and recall.
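For reference, precision is the share of symbols reported as a given class that really are that class, TP / (TP + FP), and recall is the share of true instances of the class that were recognized, TP / (TP + FN). The sketch below computes both for one label from aligned (expected, recognized) pairs. How the slides grouped symbols into "plain letters", "letters with nikud" and "only nikud" is not stated, so the function name and the toy labels are assumptions for illustration only.

```python
def precision_recall(pairs, label):
    """Precision and recall for one label over (expected, recognized) pairs."""
    tp = sum(1 for exp, got in pairs if exp == label and got == label)
    fp = sum(1 for exp, got in pairs if exp != label and got == label)
    fn = sum(1 for exp, got in pairs if exp == label and got != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data with placeholder labels "a", "b", "c".
pairs = [("a", "a"), ("a", "b"), ("b", "b"), ("b", "b"), ("c", "a")]
p_a, r_a = precision_recall(pairs, "a")
p_b, r_b = precision_recall(pairs, "b")
```

A label that is often produced by mistake has low precision; a label that is often missed has low recall.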

Tesseract after our training
Image: לָפוּמְבְּדִיתָא, שֶׁהָיְתָה מֶרְכָּז יְהוּדִי חָשׁוּב מִיָמָיו שֶׁל הַתַנָא מַר שְׁמוּאֵל, בַּר-הַפְּלֶגְתָא שֻׁל רַבִּי יְהוּדָה הַנָשִׂיא עוֹרֵךְ הַמִשְׁנָה. מַר שְׁמוּאֵל הָיָה נֶאֱמָן לַפַּרְסִים וּפָסַק כִּי בְּעִנְיָנִים אֶזְרָחִיִים מְחַיְבִים חֻקֵי הַמְדִינָה שֶׁבָּהּ יוֹשְׁבִים הַיְהוּדִים מֵמָשׁ כְּאִלוּ הָיוּ חֻקֵי הַתוֹרָה. הוּא קָבַע אֶת הַכְּלָל: "דִינָא דְמַלְכוּתָא - דִינָא". בִּתְקוּפָה זוֹ, לְאַחַר חֲתִימַת הַמִשְׁנָה עַל ידֵי יְהוּדָה הַנָשְׂיא, פָּעֲלוּ בִּישִׁיבוֹת בָּבֶל הָאָמוֹרָאִים חַכְמֵי הַתַלְמוּדִ.
Text:

Questions? http://www.cs.bgu.ac.il/~nlpproj
Possible questions: 1. Tesseract model 2. How to improve current results 3. http://www.cs.bgu.ac.il/~nlpproj