LT4eL - WP1: Setting the scene
WP leader: UAIC, Univ. Al. I. Cuza of Iasi, Faculty of Computer Science
Dan Cristea, Corina Forăscu, Dan Tufiş, Ionuţ Pistol, Diana Trandabăţ, Adrian Iftene
Contact:
Utrecht Review Meeting, February 1, 2007
Objectives
1. Inventory and classification of existing tools necessary for the development of the relevant functionalities (i.e. keyword extractor, glossary candidate detector);
2. Collection and normalization of the learning material related to the use of the computer in education (Humanities, Social Sciences);
3. Investigation of IPR issues;
4. Adoption of relevant standards for linguistic annotation of learning objects;
5. Dissemination of the results through a Web portal
Partners in WP1
–Utrecht University (UU), The Netherlands
–University of Hamburg (UHH), Germany
–University of Lisbon (FFCUL), Portugal
–Charles University Prague (CUP), Czech Republic
–Institute for Parallel Processing, Bulgarian Academy of Sciences (IPP-BAS), Bulgaria
–University of Tübingen (UTU), Germany
–Institute of Computer Science, Polish Academy of Sciences (ICS-PAS), Poland
–Zürich University of Applied Sciences Winterthur (ZHW), Switzerland
–University of Malta (UOM), Malta
[Diagram: LT4eL architecture. Labelled components: user documents (PDF, DOC, HTML, SCORM, XML); CONVERTOR 1 (SCORM and pseudo-structured documents); CONVERTOR 2 (HTML documents); Basic XML; LING. PROCESSOR (lemmatizer, POS, partial parser); linguistically annotated XML; metadata (keywords); ontology; lexicons for CZ, EN, PT, RO, PL, GE, MT, BG, DT; CROSSLINGUAL RETRIEVAL; LMS; user profile; REPOSITORY; glossary]
The Portal
A working space:
–Repository for resources, tools, deliverables
–Exchange of information among participants
–Statistics
Hosted by UAIC:
–January 2007: 1.15 GB (without realTimeStat, searchForm, upload/updateForm)
Address:
–Username: guestLt4eL
–Passwd: elearning
Demo version on CD
O1. Collection of language resources and tools (1)
Inventory and classification of existing tools relevant to:
–the integration of language technology resources in eLearning (WP2)
–the integration of semantic knowledge (WP3)
O1. Collection of language resources and tools (2)
Inventory and classification of existing language resources:
–corpora and frequency lists:
–lexica:
O2. Collection of LOs: the portal
Uploads, updates & real-time statistics on the portal
Criteria (→ attributes):
-Subdomains relevant for beginners in IST & e-learning → Domain
-Multilingualism → Language
-Medium-sized documents → Number of words
-Roughly clear IPR status → IPR
-Uniformity in topics → keywords selected initially
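The mapping from collection criteria to attributes can be pictured as one small metadata record per learning object. The Python sketch below is illustrative only: the class name `LearningObject`, the field names, and the word-count bounds used for "medium sized" are assumptions, not the portal's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LearningObject:
    """Hypothetical metadata record mirroring the attributes on the slide."""
    title: str
    domain: str             # sub-domain relevant for beginners in IST & e-learning
    language: str           # e.g. "EN", "RO"
    number_of_words: int
    ipr: str                # e.g. "free", "restricted", "research"
    keywords: List[str] = field(default_factory=list)


def is_medium_sized(lo: LearningObject, low: int = 1_000, high: int = 20_000) -> bool:
    """Check the 'medium sized' criterion; the bounds are placeholder assumptions."""
    return low <= lo.number_of_words <= high


lo = LearningObject(
    title="Example document",
    domain="e-Learning",
    language="EN",
    number_of_words=5_400,
    ipr="free",
    keywords=["e-learning", "LMS"],
)
print(is_medium_sized(lo))
```

Keeping the size bounds as parameters reflects that the slides do not state what counts as medium sized.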
Collection of LOs: domains
1. Use of computers in education, with sub-domains:
1.1 Teaching academic skills, with sub-domains:
–Academic skills
–Relevant computer skills for the above tasks (MS Word, Excel, PowerPoint, LaTeX, Web pages, XML)
–Basic skills (use of computer for beginners) (chats, Internet)
1.2 e-Learning, e-Marketing
1.3 The I*Teach document (Leonardo project)
1.4 Impact of use of computers in society
1.5 Studies about use of computers in schools / high schools
1.6 Impact of e-Learning on education
2. Calimera documents (parallel corpus developed in the Calimera FP5 project)
Collection of LOs: domain coverage
The hierarchy of LOs’ formats
Collection of LOs: annotation layers
1. Initial documents: doc, pdf, html, txt → Base-XML
2. Linguistic annotation: tokens, POS, lemma, chunks → WP2 XML format (LT4ELAna.dtd)
3. Keywords, definitions and ontology links annotations
Level 1 conversions
[Diagram: source formats (doc, pdf, latex, other, plain text, html) leading to Base-XML; highlighted step: doc → html]
Level 1 conversions: doc → html (UTF-8)
1. MS Office: Save As html
2. OpenOffice Writer SXW/ODT: Save As html
Level 1 conversions
[Diagram: source formats (doc, pdf, latex, other, plain text, html) leading to Base-XML; highlighted step: pdf → html]
Level 1 conversions: pdf → html (UTF-8)
1. Adobe on-line conversion tool
2. pdfbox (Windows)
3. pdftohtml (Linux)
4. OpenOffice
5. Adobe Acrobat Professional
Level 1 conversions
[Diagram: source formats (doc, pdf, latex, other, plain text, html) leading to Base-XML; highlighted step: Base-XML convertor]
Level 1 conversions: html → Base-XML
The UAIC Java converter
–keeps all the tags possibly useful (fixed)
–produces a log of all the removed tags/data
The CUP converter
–tags kept according to a DTD
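The keep-a-fixed-tag-set-and-log-the-rest behaviour described for the UAIC converter can be sketched in a few lines. This is not the UAIC Java converter: it is a Python illustration using the standard `html.parser` module, and the tag whitelist, the decision to drop attributes, and the log wording are all assumptions. It also assumes well-nested input and makes no attempt to repair malformed HTML.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

# Assumed whitelist; the real converter's fixed tag set is not given in the slides.
KEPT_TAGS = {"html", "body", "h1", "h2", "h3", "p", "ul", "ol", "li", "table", "tr", "td"}
# Elements whose text content is discarded (and logged) together with the tag.
DROP_CONTENT = {"script", "style"}


class BaseXmlSketch(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # fragments of the simplified output
        self.log = []        # one entry per removed tag or data chunk
        self._dropping = 0   # > 0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self._dropping += 1
        if tag in KEPT_TAGS and not self._dropping:
            self.out.append(f"<{tag}>")   # attributes are dropped (assumption)
        else:
            self.log.append(f"removed tag <{tag}>")

    def handle_endtag(self, tag):
        if tag in KEPT_TAGS and not self._dropping:
            self.out.append(f"</{tag}>")
        if tag in DROP_CONTENT and self._dropping:
            self._dropping -= 1

    def handle_data(self, data):
        if self._dropping:
            if data.strip():
                self.log.append(f"removed data: {data.strip()[:40]!r}")
        else:
            # Text inside a removed tag is kept; only the tag itself goes.
            self.out.append(escape(data))


html = ("<html><body><h1>Title</h1><script>var x = 1;</script>"
        "<p>Some <font color='red'>text</font> &amp; more</p></body></html>")
parser = BaseXmlSketch()
parser.feed(html)
parser.close()
base_xml = "".join(parser.out)
print(base_xml)
print(parser.log)
```

A DTD-driven variant, as described for the CUP converter, would derive the kept tag set from the DTD instead of hard-coding it.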
Collection of LOs: second level
[Diagram: annotation layers (tok, pos, lemma, morpho, NP) and the tok-pos-lemma WP2 XML format; highlighted: language-specific tools]
Collection of LOs: second level
[Diagram: annotation layers (tok, pos, lemma, morpho, NP) and the tok-pos-lemma WP2 XML format; highlighted: scripts]
Collection of LOs: KW extractor
[Diagram: Level 2 (WP2 XML format) and Level 3 (Man KD XML, Auto KD XML); highlighted: KW extractor]
Collection of LOs: KW extractor
[Diagram: Level 2 (WP2 XML format) and Level 3 (Man KD XML, Auto KD XML); highlighted: KW extractor evaluation]
Collection of LOs: third level
[Diagram: Man KD XML (incl. km.xml, dm.xml) and Auto KD XML (incl. akw, adef); highlighted: def extractor]
–km.xml: manually annotated keywords
–dm.xml: manually annotated definitions
–akw: automatically annotated keywords
–adef: automatically annotated definitions
Collection of LOs: third level
[Diagram: Man KD XML (incl. km.xml, dm.xml) and Auto KD XML (incl. akw, adef); highlighted: def extractor evaluation]
–km.xml: manually annotated keywords
–dm.xml: manually annotated definitions
–akw: automatically annotated keywords
–adef: automatically annotated definitions
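Having both manually and automatically annotated keywords and definitions makes it possible to score the extractors against the manual annotation. The slides do not say which measure is used; the Python sketch below assumes set-based precision, recall and F1 over exact string matches, and the sample keywords are invented for illustration.

```python
def evaluate(manual: set, automatic: set) -> dict:
    """Compare automatic annotations against manual ones (exact-match assumption)."""
    true_pos = len(manual & automatic)
    precision = true_pos / len(automatic) if automatic else 0.0
    recall = true_pos / len(manual) if manual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented sample data standing in for the manual and automatic keyword sets.
manual_kws = {"e-learning", "metadata", "ontology", "SCORM"}
auto_kws = {"e-learning", "ontology", "computer", "SCORM", "page"}

scores = evaluate(manual_kws, auto_kws)
print(scores)
```

The same function applies to definitions if each definition is reduced to a comparable key, for example a sentence identifier.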
Open issues
Convertors
–Tables, figures, page look…
IPRs
–Clarify the IPR status with authors, under EU and national legislation
–Define IPR categories for LOs: usage (free, restricted, for research...)
WP1 over time
[Timeline: December 05, February 06, May 06, Now. Milestones and activities shown: Beginning of project; Initial collection on Portal; Structure & functionalities added to the portal; Base-XML convertors and new LOs; Levels 2 & 3 additions; new tools, grammars, guides, docs, ontology, TermLex; D1.1; Official end of WP1; Evaluation]
Proposal: the hierarchy seen as a processing environment
[Diagram: Level 1 (doc, pdf, latex, other, html, txt, sxml); Level 2 (morpho, tok, pos, lemma, NP, tpl, wp2xml); Level 3 (akw, adef, axml)]
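Seeing the format hierarchy as a processing environment amounts to treating formats as nodes and converters as edges, so that the steps needed to bring a document to a given level can be looked up automatically. The Python sketch below is a minimal illustration under assumptions: the format identifiers (`base-xml`, `wp2-xml`, `auto-kd-xml`) and step labels are hypothetical names, and only the conversions explicitly named in the slides are included.

```python
from collections import deque

# Edges named in the slides; identifiers and labels are illustrative assumptions.
CONVERSIONS = {
    "doc": [("html", "Save As html")],
    "pdf": [("html", "pdf to html tool")],
    "html": [("base-xml", "Base-XML convertor")],
    "base-xml": [("wp2-xml", "language-specific linguistic tools")],
    "wp2-xml": [("auto-kd-xml", "keyword and definition extractors")],
}


def plan(source: str, target: str):
    """Breadth-first search for the shortest chain of conversions, or None."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, steps = queue.popleft()
        if fmt == target:
            return steps
        for nxt, label in CONVERSIONS.get(fmt, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + [f"{fmt} -> {nxt} ({label})"]))
    return None


for step in plan("pdf", "wp2-xml"):
    print(step)
```

A request for a conversion that the hierarchy does not support, such as `plan("wp2-xml", "doc")`, returns `None` rather than raising an error.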
Conclusions
–LOs, resources and tools collected
–Initially: portal seen as a repository
–Now: portal potentially integrated with the LMS as a processing environment