Knowledge Center for Processing Hebrew Alon Itai – CS Technion.


1 Knowledge Center for Processing Hebrew Alon Itai – CS Technion

2 Tools for underrepresented languages
- Computer tools, and especially the Internet, are Anglophile.
- Search engines are not tooled for morphologically rich languages.

3 Search: “dog”, “dogs”, “and dogs”
Search results for the Hebrew word כלב (dog), translated:
- Dog - Wikipedia
- Trained dogs looking for a home
- Dog (mammal)
- Dogs | Dog
- Golden hamster: "The domestic dog is known scientifically as the domestic wolf-dog"
- SOS Animals association - choosing a suitable dog
- Pet board - dogs, cats, fish, parrots for adoption and rehoming - dog
- Dogs | Dog: a site offering matchmaking between breeds, babysitters, nutrition and grooming, veterinarians, boarding kennels, trainers and a notice board.
- Dog: the dog site contains a lot of information, articles, courses, pictures and video clips of dogs and everything related to them.
- dog: Dog breeds · Picture of the month · The dog and the law · Medicine and care · Courses · Articles · Notice board · Rescue dogs · Video clips · Picture of the year · Adoption corner...
Further result titles: Trained dogs looking for a home; Roni dog training; Dog category sites; Hav-Hav, Israel's pet site!; Kobi Hazan dog training; The special unit for dog training; Zulu games, puzzles - game for toddlers - cat with dog puzzle; About a widow and a dog; Newfoundland, shepherd dogs and Belgian shepherd dog - PETNET.co.il; Escort, medical assistance and guide dog

4 Tools for underrepresented languages
- Computer tools, and especially the Internet, are Anglophile.
- Search engines are not tooled for morphologically rich languages.
- Email and chats do not cope well with unfamiliar alphabets.
- Consequently, people use (pidgin) English for communication, …
- The local language is used less and less.

5 The problem
- Because of the small number of speakers, there is little economic incentive for commercial companies to develop tools.
- Even when tools are available, they are not open source.
- Tools developed at universities are not fit for general use:
  - not robust enough
  - no standard interface
  - lack of documentation

6 Duplication of Effort
- Every researcher has to redevelop her own tools before conducting original research.
- For example, in Hebrew there are many morphological analyzers:
  1. Choueka and Shapira 1964
  2. Ornan 1987; Lavie et al. 1988
  3. Bentur et al. 1992
  4. Segal 1999
  5. HSPELL
  6. Yona and Wintner 2005

7 The Knowledge Center
- In 2003, the Israeli Ministry of Science and Technology established a Knowledge Center for Processing Hebrew.
- Its aim is to develop products (software and databases) for processing Hebrew and make them available to the public, both in academia and in industry.
- Researchers from four universities are involved in the Center's activities.

8 The researchers
- Yoad Winter (Technion)
- Shuly Wintner (Haifa University)
- Michael Elhadad (Ben Gurion University)
- Arnon Cohen (Ben Gurion University)
- Yoram Singer (Hebrew University)
- Eli Shamir (Hebrew University)
- Alon Itai (Technion)

9 The model
- The ministry provides initial funds.
- The Center should be self-sustaining: it should finance itself by selling products.
The problems:
- The market is too small; had it been large, there would have been no need for the Center.
- Selling products contradicts our philosophy of open research and open code.

10 Licensing Policy
- Available under the GPL (GNU General Public License): you get it for free, provided that all products derived from it are also under the GPL.
- Payment only for special services.
- A non-exclusive license can be obtained for commercial use.

11 XML
- All products are represented in XML.
- Readable by both machines and humans.
- Enables the use of off-the-shelf tools for on-screen presentation and validation.
- Carries information for the morphological parser.

12 XML (2)
- Facilitates interfaces between tools.
- For example, the output of the morphological analyzer is the input of the morphological disambiguator.
- Thus one can match different morphological analyzers with different disambiguators and compare their results.
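The analyzer-to-disambiguator hand-off described on this slide can be sketched in Python. This is only an illustration: the element and attribute names (`corpus`, `token`, `analysis`, `gloss`, `weight`, `selected`) and the weights are invented for the example and are not the Center's actual XML schema, which the transcript does not preserve.

```python
import xml.etree.ElementTree as ET

# Hypothetical analyzer output: one token with several candidate analyses.
# Element names, attribute names and weights are made up for illustration.
ANALYZER_OUTPUT = """
<corpus>
  <token surface="šbth">
    <analysis gloss="her Saturday" weight="0.5"/>
    <analysis gloss="that in tea" weight="0.2"/>
    <analysis gloss="that her daughter" weight="0.3"/>
  </token>
</corpus>
""".strip()


def pick_first(analyses):
    """Trivial disambiguator: always keep the first analysis."""
    return analyses[0]


def pick_heaviest(analyses):
    """Disambiguator that keeps the analysis with the largest weight."""
    return max(analyses, key=lambda a: float(a.get("weight", "0")))


def run_disambiguator(xml_text, choose):
    """Read analyzer XML, mark one analysis per token, return XML again."""
    root = ET.fromstring(xml_text)
    for token in root.iter("token"):
        analyses = token.findall("analysis")
        if analyses:
            choose(analyses).set("selected", "true")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(run_disambiguator(ANALYZER_OUTPUT, pick_heaviest))
```

Because both sides only agree on the XML, either `pick_first` or `pick_heaviest` can be plugged in without touching the analyzer, which is the kind of mix-and-match comparison the slide has in mind.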

13 Products
- Morphological analyzers
- Morphological disambiguators
- Lexicon
- Corpora
- Speech database
- Tools for editing lexicons and tagging corpora
- PR: forum, …

14 The lexicon by part of speech
Part of speech    Entries
noun              10,332
verb               4,485
proper name        4,227
adjective          1,612
adverb               352
quantifier           132
preposition          100
conjunction           62
pronoun               60
interjection          40
interrogative          9
negation               6
Total             21,417

15 GUI for editing the lexicon

16

17 Morphological disambiguators
- Roy Bar-Haim constructed an HMM-based parser that partitions each word in a corpus into morphemes; its success rate is 96%.
- Erel Segal combined a Brill-like method with a priori occurrence probabilities.
- Meni Adler used an HMM on whole words.
- All three disambiguators are available at the Center.
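To give a feel for how an HMM over whole words can pick one analysis per word, here is a minimal Viterbi sketch in Python. It is not the model of any of the three disambiguators above. The tag set, the placeholder token `w1` (standing for some unambiguous pronoun) and all probabilities are invented for illustration.

```python
def viterbi(words, candidates, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words`.

    candidates[word] lists the tags the analyzer allows for that word;
    start_p[tag], trans_p[(prev, tag)] and emit_p[(word, tag)] are the
    HMM parameters. Missing entries count as probability 0.
    """
    first = words[0]
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {
        tag: (start_p.get(tag, 0.0) * emit_p.get((first, tag), 0.0), [tag])
        for tag in candidates[first]
    }
    for word in words[1:]:
        step = {}
        for tag in candidates[word]:
            options = [
                (prob * trans_p.get((prev, tag), 0.0)
                 * emit_p.get((word, tag), 0.0), path + [tag])
                for prev, (prob, path) in best.items()
            ]
            step[tag] = max(options, key=lambda option: option[0])
        best = step
    return max(best.values(), key=lambda option: option[0])[1]


# Toy parameters, all made up: taken alone, "šbth" is more likely a noun
# ("her Saturday"), but a verb ("stopped working") fits better after a pronoun.
WORDS = ["w1", "šbth"]
CANDIDATES = {"w1": ["PRON"], "šbth": ["NOUN", "VERB"]}
START_P = {"PRON": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS_P = {("PRON", "VERB"): 0.7, ("PRON", "NOUN"): 0.1}
EMIT_P = {("w1", "PRON"): 1.0, ("šbth", "NOUN"): 0.6, ("šbth", "VERB"): 0.4}

if __name__ == "__main__":
    print(viterbi(WORDS, CANDIDATES, START_P, TRANS_P, EMIT_P))
```

With these toy numbers the verb reading wins (0.6 × 0.7 × 0.4 = 0.168 against 0.6 × 0.1 × 0.6 = 0.036 for the noun), showing how context can override the a priori preference for a single word.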

18 Corpora
Corpus                        Unique tokens    Total tokens
(name missing)                319,666          11,062,232
Arutz 7                       304,160          11,216,867
Sha’ar la-matkhil (dotted)    166,780           1,300,326
Knesset                       262,338          17,732,122

19 Corpora (2)
- A manually tagged corpus of 6000 sentences (12,000 tokens).

20 Tree bank
- 6000 syntactically parsed sentences.
- Used for automatic parsing.

21 Conclusions
- The Center is an example of cooperation between researchers in several universities.
- Many users have downloaded the products.
- Ten companies have purchased licenses.

22 Conclusions (2)
- Money is running out, …
- The model requires money, experts, and commitment.
- Not suitable for languages with very few speakers, or for poor communities.

23 Modern Hebrew
- Official language of the State of Israel
- Spoken by 7 million people
- Related to, but linguistically distinct from, Biblical Hebrew
- Morphologically rich

24 Semitic Word Formation
root + pattern = word
root    pattern CaCaC        pattern yiCCoC
ktb     katab (he wrote)     yiktob (he will write)
šbr     šabar (he broke)     yišbor (he will break)
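The root-and-pattern combination in the table can be sketched in a few lines of Python. The function name `apply_pattern` is an assumption made for this example. It only handles the simple case shown on the slide, where each `C` in the pattern is a slot for the next root consonant.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Fill each 'C' slot of `pattern` with the next consonant of `root`."""
    if pattern.count("C") != len(root):
        raise ValueError("pattern needs one 'C' slot per root consonant")
    consonants = iter(root)
    # Copy vowels and affixes as-is; replace every 'C' with a root consonant.
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)


if __name__ == "__main__":
    for root in ("ktb", "šbr"):
        for pattern in ("CaCaC", "yiCCoC"):
            print(root, "+", pattern, "=", apply_pattern(root, pattern))
```

Running it reproduces the four forms in the table: katab, yiktob, šabar and yišbor.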

25 Writing System
- Most vowels are omitted.
- Particles are prepended to words. Example:
  - h: definite article
  - b: preposition (in)
  - w: conjunction (and)
  - wbbyt = w + b + ha + byt, "and in the house" (the article is pronounced but not written after b)

26 Morphological Ambiguity
- Most words are morphologically ambiguous.
- Example: šbth (שבתה)
  1. šavta = šbt + CaCCa = stopped working
  2. šavta = šbh + CaCCa = took prisoner
  3. šabatah = her Saturday
  4. še-b-te = that in tea
  5. še-b-ha-te = that in the tea
  6. še-bit-h = that her daughter
  …
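A toy analyzer makes the source of this ambiguity concrete: an unvocalized string can be split into prefix particles, a stem and a suffix in more than one way. The sketch below is written in Python with a made-up three-entry lexicon and a tiny particle inventory. It recovers only readings 3, 4 and 6, since it does not model verb patterns or the unwritten definite article.

```python
# Toy resources, assumed for illustration only.
PREFIXES = {"š": "that", "b": "in"}          # prepended particles
SUFFIXES = {"h": "her"}                      # possessive suffix
LEXICON = {"šbt": "Saturday", "bt": "daughter", "th": "tea"}


def analyze(word, taken=()):
    """Return every split of `word` into prefixes + stem (+ suffix)."""
    found = []
    # The remainder may be a bare stem ...
    if word in LEXICON:
        found.append(taken + (word,))
    # ... or a stem followed by a suffix ...
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and stem in LEXICON:
            found.append(taken + (stem, suffix))
    # ... or it may start with one more prefix particle.
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            found.extend(analyze(word[len(prefix):], taken + (prefix,)))
    return found


if __name__ == "__main__":
    for morphemes in analyze("šbth"):
        print(" + ".join(morphemes))
```

For `šbth` this yields three splits: šbt + h (her Saturday), š + bt + h (that her daughter) and š + b + th (that in tea).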

