1 Dan Cristea, Alexandru Ioan Cuza University of Iasi; Romanian Academy – Institute of Computer Science; dcristea@info.uaic.ro

2  According to Ethnologue – Languages of the World (SIL) ◦ Spoken in: Romania (22 million), Moldavia (2.7 million), Serbia, Montenegro (300,000), Ukraine (250,000), Israel (250,000), Hungary (100,000), USA, Canada, Spain, Italy, etc. ◦ Native speakers: 24 million, plus 4 million as a second language ◦ Romanian (Rumanian, Moldavian, Moldovan, Daco-Romanian) ◦ Linguistic lineage: Indo-European > Italic > Romance > Eastern ◦ Dialects: Istro-Romanian (Croatia), Macedo-Romanian (Greece), Megleno-Romanian (Greece) ◦ Lexical similarity: 77% with Italian, 75% with French, 74% with Sardinian, 73% with Catalan, 72% with Portuguese and Rhaeto-Romance, 71% with Spanish ◦ Other influences: Slavic, Hungarian, Turkish, etc. LT Days, Luxembourg, 14-15 Jan, 2009

3  Since 1900: linguistics & lexicography research (in the Academy and the universities)  1960: early trials of Machine Translation; after that – no financing for more than 45 years  1980s: first NLP models and systems ◦ semantic networks, dialogue systems (IURES, QUERNAL), paradigmatic morphology and morphological analysers, unification-based formalisms, generation, grammars and parsers, etc.  Good computer science and computer engineering schools (in Bucharest, Iasi, Cluj-Napoca, Timisoara)

4  Master level: ◦ Iasi (UAIC-FII, since 2001), University of Bucharest  PhD level: ◦ Bucharest (RACAI), Iasi (UAIC-FII) ◦ 6 PhD theses will be defended this year  Summer schools, international and national conferences  EUROLAN, since 1993, the second most significant in Europe (after ESSLLI)  SPED (since 2001) – Speech Technology and Human-Computer Dialogue conferences  ConsILR (since 2002) – the national conference of the Consortium for Informatisation of the Romanian Language  Alumni: ◦ >30 PhDs and PhD students doing LT all over the world

5  Bucharest ◦ Romanian Academy, RACAI (acad. Dan Tufis)  10 researchers (3 PhDs): Romanian resources, language independent tools, human-computer interfaces, statistical models of Romanian, NLP Web services ◦ Romanian Academy, Institute of Linguistics (acad. Marius Sala)  lexicography, old Romanian texts corpora ◦ University of Bucharest  formal models, resources ◦ Technical University of Bucharest & Military Academy  speech processing (prof. Corneliu Burileanu, prof. Olteanu)

6  Iasi ◦ Alexandru Ioan Cuza University – Dept. of Computer Science (UAIC-FII, my group)  8 PhDs (2 in co-tutelle with prof. E. Munteanu, Dept. of Letters), 4 researchers, >20 masters in CL, undergraduate projects  resources, language independent tools in written LT, NLP Web services, computational lexicography, multimodal interfaces, NL user interfaces ◦ Romanian Academy, Institute of Computer Science (acad. Horia-Neculai Teodorescu)  4 PhDs, 8 researchers  speech processing and resource building, tools and annotated resources in written language processing ◦ Romanian Academy, Institute of Philology  lexicography, old manuscripts (including in old Cyrillic)

7  Word Alignment (Ro-En): ◦ RACAI 2003, 2005: ranked first  Question Answering (CLEF - Ro, En): ◦ RACAI 2006: Ro-En 7/13, 2007: Ro-Ro 1/2 ◦ UAIC 2008: Ro-Ro 1/2  Answer Validation Exercise (CLEF - En) ◦ UAIC 2007: 1/7, 2008: 1/7  Anaphora Resolution Exercise (En): ◦ UAIC 2007: ranked first  Textual Entailment (En): ◦ UAIC 2007: 2-way task – 3/26, 3-way task – 4/10 ◦ UAIC 2008: 2-way task – 2/26, 3-way task – 1/13

8  Morphological and POS tagger (En/Ro)  Lemmatizer (En/Ro)  Dependency Linker (En/Ro)  Sentence splitting (En/Ro)  Spell checker (Ro)  Word aligner (En-Ro)  Anaphora resolver (En/Ro)  Discourse parser (En/Ro)  Summarisation (En/Ro)  Q&A (En/Ro)  SMT (En-Ro-En, En-Gr-En, En-Sl-En)  Definitions extractor (En/Ro)  Information Retrieval (Ro Wikipedia)

9  Ro WordNet aligned with Princeton En WN (ILI) ◦ the second largest in the world (55,000 synsets)  Mono and multilingual corpora ◦ various RO classical novels (about 3,000,000 words)  richest annotation: Orwell’s “1984” (110,000 words) ◦ tagged, lemmatized, chunked, word-aligned (XCES):  Semcor (En, Ro): 1,000,000 words  Ev.Zilei (En, Ro): 1,000,000 words  Acquis Communautaire (22 languages), Ro: 30,832,212 words  Wikipedia-Ro (fragment): 3,405,324 words ◦ dictionaries: Dictionary of Modern Romanian – DEX, Thesaurus Dictionary of Romanian Language (eDTLR)  Language models, grammars, NE lists, complete inflexional lists, AR models, sentence splitting models, discourse cue words, etc.

10  European past: ◦ ELSNET (ESPRIT), ELSNET-Goes-EAST (Copernicus), TELRI (COPERNICUS), FF-POIROT (FP5), Balkanet (FP5), RolTech (INTAS), LT4eL (FP6)… (more than 30 projects, see lists at www.racai.ro and www.info.uaic.ro/~dcristea)  European active: ◦ CLARIN: design & build the European LT infrastructure for HSS (representation in SB and EB, 2 partners and 5 member institutions) ◦ FlareNet: Nicoletta’s speech ◦ ALEAR: models of language evolution in humanoid agents (robots): unification optimisation and discourse modelling

11  Language Technology and preservation of national heritage – national priorities in the Ro research plan  Massive financing over the last 2 years (compared to previous years)…

12 ◦ Under the Ministry of Culture and Arts (dir. Dan Matei) ◦ Digitisation of the Ro literature

14  @ RACAI  A follow up of a successful SEE-ERA.net project (Ro, Bg, Gr, Sl, Sr)  Encouraging pilot experiments for Ro-En-Ro, Gr-En-Gr, Sl-En-Sl

Language pair         Google translation        RACAI translation
                      NIST score  BLEU score    NIST score  BLEU score
English to Greek      3.5705      0.2934        3.9730      0.3533
English to Slovene    3.5340      0.2653        3.6719      0.2450
English to Romanian   4.4057      0.4508        4.9348      0.5464
Greek to English      3.5427      0.2868        3.7733      0.2981
Slovene to English    4.0424      0.2215        4.0589      0.2293
Romanian to English   4.3573      0.2827        4.5426      0.4604
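
The scores on this slide can be compared pair by pair. The following is a minimal Python sketch that uses only the NIST and BLEU values reported above; the names `SCORES` and `bleu_gain` are illustrative and do not come from the talk.

```python
# Values copied from the slide, in the order:
# (Google NIST, Google BLEU, RACAI NIST, RACAI BLEU)
SCORES = {
    "English to Greek":    (3.5705, 0.2934, 3.9730, 0.3533),
    "English to Slovene":  (3.5340, 0.2653, 3.6719, 0.2450),
    "English to Romanian": (4.4057, 0.4508, 4.9348, 0.5464),
    "Greek to English":    (3.5427, 0.2868, 3.7733, 0.2981),
    "Slovene to English":  (4.0424, 0.2215, 4.0589, 0.2293),
    "Romanian to English": (4.3573, 0.2827, 4.5426, 0.4604),
}


def bleu_gain(pair):
    """Return RACAI BLEU minus Google BLEU for one language pair."""
    _, google_bleu, _, racai_bleu = SCORES[pair]
    return round(racai_bleu - google_bleu, 4)


if __name__ == "__main__":
    for pair in SCORES:
        # A positive value means the RACAI system scored higher on BLEU.
        print(f"{pair:22s} {bleu_gain(pair):+.4f}")
```

On these figures the RACAI system has the higher BLEU score for five of the six pairs; English to Slovene is the exception.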

15  ALPE: a model of anchoring specifications of NLP applications on XML annotation schemas (standards)  build a pipeline/parallel architecture without any need to program  just input your own file and indicate the form of the output  use the federation of tools as bricks for new applications  as in cooking: the more ingredients you have, the longer the list of recipes you can choose from
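
The idea of stating only the input and the desired output, and letting the system assemble the tools, can be illustrated with a small planner. This is a sketch under stated assumptions, not the ALPE implementation: the tool registry `TOOLS`, the annotation layer names, and the function `plan_pipeline` are all hypothetical. Each tool is described by the annotation layers it needs and the layer it adds, and a breadth-first search finds a shortest chain of tools from what the input file already contains to what the output should contain.

```python
from collections import deque

# Hypothetical registry: tool name -> (layers required, layer added).
TOOLS = {
    "sentence_splitter": (frozenset(), "sentences"),
    "tokenizer": (frozenset({"sentences"}), "tokens"),
    "pos_tagger": (frozenset({"tokens"}), "pos"),
    "lemmatizer": (frozenset({"tokens", "pos"}), "lemmas"),
    "np_chunker": (frozenset({"pos"}), "chunks"),
}


def plan_pipeline(available, wanted):
    """Return a shortest list of tool names that adds the wanted layers.

    `available` is the set of layers already present in the input,
    `wanted` the set of layers the output must contain.
    Returns None if no combination of registered tools can produce them.
    """
    start = frozenset(available)
    goal = frozenset(wanted)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        layers, steps = queue.popleft()
        if goal <= layers:
            return steps
        for name, (needs, adds) in sorted(TOOLS.items()):
            # A tool is applicable if its inputs exist and it adds something new.
            if needs <= layers and adds not in layers:
                nxt = layers | {adds}
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [name]))
    return None


if __name__ == "__main__":
    print(plan_pipeline(set(), {"lemmas"}))
    print(plan_pipeline({"tokens"}, {"chunks"}))
```

Registering one more tool in `TOOLS` enlarges the set of outputs the planner can reach, which is the point of the cooking analogy on the slide.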

16 ◦ Explosion of formats, hence difficulty of standardisation ◦ Standards are like laws: they help to organise the society, but they also reduce freedom ◦ Standards usually come late ◦ We are in a hurry to do things instantly  Invent heuristics able to guess the semantics of new formats  ‘Compute’ wrappers to transform non-standard input into standard
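
A toy version of these two steps, guessing the meaning of an unknown format and then wrapping it into a standard one, might look as follows in Python. Everything here is an assumption made for illustration: the input is taken to be whitespace-separated `word<sep>tag` tokens with an unknown separator, and the output element names `s` and `w` and the `pos` attribute are placeholders rather than any particular standard schema.

```python
import xml.etree.ElementTree as ET

# Separators commonly seen between a word and its tag (assumed list).
CANDIDATE_SEPARATORS = ["/", "_", "|"]


def guess_separator(tokens):
    """Heuristic: pick the separator that splits most tokens into word + tag."""
    def score(sep):
        return sum(
            1 for t in tokens
            if t.count(sep) == 1 and 0 < t.index(sep) < len(t) - 1
        )
    best = max(CANDIDATE_SEPARATORS, key=score)
    return best if score(best) > 0 else None


def wrap_line(line):
    """Wrap one line of tagged text into a small XML sentence element."""
    tokens = line.split()
    sep = guess_separator(tokens)
    sentence = ET.Element("s")
    for token in tokens:
        if sep is not None and sep in token:
            word, tag = token.rsplit(sep, 1)
        else:
            word, tag = token, ""
        w = ET.SubElement(sentence, "w")
        if tag:
            w.set("pos", tag)
        w.text = word
    return ET.tostring(sentence, encoding="unicode")


if __name__ == "__main__":
    print(wrap_line("The_DT cat_NN sleeps_VBZ"))
    print(wrap_line("The/DT cat/NN sleeps/VBZ"))
```

Both example lines yield the same XML, so downstream tools only ever see one format regardless of which separator the source used.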
