1
© Brigitte Jörg iConnectEU Workshop – October 16th, 2008 Brussels Project Results
Knowledge Base for RTD Competencies in IST – Results from a European SSA Project
Brigitte Jörg
German Research Center for Artificial Intelligence, Language Technology Lab, Saarbrücken, Germany
2
Introduction of Speaker
Brigitte Jörg
M.A. Information Science, Information Systems, Business Administration
Project Manager, Researcher, DFKI GmbH, Language Technology Lab, Saarbrücken, Germany
CERIF TG Leader, Board Member, euroCRIS
Contact: brigitte.joerg @ dfki.de, http://www.dfki.de/~brigitte/
3
Presentation Outline
Introduction of the Project
Information Repository
Data Collection / Data Integration / Data Cleaning
Analytic Tools
Evaluation and Results
Conclusion / Beyond the Project
4
Project Information
Funding Organization: European Commission
Funding Program: Sixth Framework Programme (FP6: IST, 3rd Call)
Project Type: Specific Support Action (SSA)
Duration: 32 Months (April 2005 – November 2007)
Project Co-ordination: DFKI GmbH
Technical Co-ordination: Jozef Stefan Institute (IJS)
Technology Partners: DFKI, IJS, Ontotext, STFC
Project Consortium: 15 partners from EU MS, NMS and ACC
5
Project Consortium
Deutsches Forschungszentrum für Künstliche Intelligenz, Germany
Institute Jozef Stefan, Slovenia
Ontotext Lab, Sirma AI EAD, Bulgaria
RTD Talos, Cyprus
Institute of Information Theory and Automation, Czech Republic
Archimedes Foundation, Estonia
Comp. and Autom. Research Inst., Hung. Academy of Sc., Hungary
Institute of Mathematics and Computer Science, University of Latvia
Lithuanian Innovation Centre, Lithuania
Projects in Motion, Malta
Technical University of Silesia, Poland
National Institute for R&D in Informatics, Romania
Slovak University of Technology, Slovakia
TUBITAK, Turkey
The Science and Technology Facilities Council, UK (formerly CCLRC, UK)
6
Technology Partners
DFKI: Co-ordinator; “LT World” Portal; Information Extraction; Semantic Web
Jozef Stefan Institute: Technical Co-ordinator; “Project Intelligence”; Data Mining; Social Network Analysis
Ontotext: “KIM Semantic Annotation Platform”
euroCRIS: “CERIF” Standard; Access to Data
7
Project Objectives
Set up and populate an information portal on IST research
Provide information about RTD actors and their expertise
Provide innovative and automated services
To promote RTD competencies in specific fields
To support partner search for IST proposals and commercial projects
8
Repository Features
Information Repository
Entities (based on the CERIF 2004 Standard*): Organisations, Persons, Projects, Publications
Data Collection: Import (based on CERIF XML) from
National CRISs (Current Research Information Systems)
National Collections (no system behind)
Web Crawlings
Community Support
* CERIF: Common European Research Information Format, http://www.euroCRIS.org/
9
Repository Challenges
Data Integration from Heterogeneous Sources
CERIF-based databases (MSSQL Server; MS Access; EPSRC database)
MSWord documents; MSExcel documents
Raw Text files; HTML files; XML files
Data crawled from the Web; from CERIF-based CRISs; from public CRISs
Data Integration into ONE single dataset to enable Analysis at European Level
Overall Data Cleaning with Supervised Machine Learning Methods (Active Learning)
10
European Research Dataset (entries), January 2008
European Research: 55078 Orgs, 30489 Proj, 58164 Exp, 165795 Pubs
Bulgaria: 794 Orgs, 73 Proj, 10940 Exp, 19023 Pubs
Cyprus: 29 Orgs
Czech Republic: 183 Orgs, 163 Proj, 164 Exp
Estonia: 75 Orgs, 1256 Proj, 6726 Exp, 51376 Pubs
Hungary: 2665 Orgs, 1297 Proj, 2425 Exp
Latvia: 106 Orgs, 830 Proj, 701 Exp
Lithuania: 102 Orgs
Malta: 58 Orgs, 27 Proj, 898 Exp, 180 Pubs
Poland: 1451 Orgs, 2179 Proj, 7392 Exp, 16086 Pubs
Romania: 169 Orgs, 68 Proj, 87 Exp
Serbia: 60 Orgs, 2278 Exp, 79130 Pubs
Slovenia: 1723 Orgs, 3748 Proj, 11655 Exp
Slovakia: 56 Orgs, 432 Proj, 683 Exp
Turkey: 285 Orgs
EPRI-start: 286 Orgs, 275 Exp
Cordis FP5+FP6: 48988 Orgs, 20436 Proj, 13941 Exp
Community: 61 Orgs, 41 Proj, 435 Exp
11
Collection Method Analysis
From National CRISs / Collections: complete and comprehensive; often bilingual; quickly and easily exported and transformed into CERIF XML; a technical contact with the relevant expertise is mostly available
Crawled from public CRISs / CERIF-based CRISs: as complete as what is publicly available; needs data transformation / re-structuring efforts into CERIF XML; technical expertise not related to domain knowledge; depends on static website structures
Crawled from the Web (Google Scholar Publication Data): not usable for quality analysis
Community Contributions: a lot of interest; entries incomplete, only basic personal data, not many relations
12
Repository Analysis before Data Integration
Analysis of Obvious Errors:
Duplicate records inherent in single datasets
Even more duplicate records after merging of datasets
Most obvious duplicates for organisations and persons
No significant number of duplicate projects
Publications have been ignored
Duplicate records are a known problem!
13
Repository Analysis after Data Integration
Heuristic Analysis of Random Samples in National Datasets / Cordis Datasets
most obvious duplicates found inside and across the Cordis FP5 and FP6 datasets (the largest sets)
not so many duplicates found in national datasets
a lot of duplicate person records across all datasets
no duplicate records found in project datasets
only some duplicate records across project datasets
publications have not been examined
Decision with Respect to the IST World Scope
not touching project records
ignore publication records
let the community resolve person records (IST World Community)
concentrate on cleaning organisation records
14
Data Cleaning Application
Problems with Organisation Records
Most entries had slightly different names caused by additional special characters or character modifications
Capitalization, Lowercase Letters
Blanks, extra Spaces
Hyphens
Quotes
Comma in Different Places
Article in Name
Full stop in Name
Incomplete Names
English Translation
Word Order
Language Specific Characters (Jorg instead of Jörg)
Special Characters (wrong encoding &, ?, )
Mixture of Organisation Names and Department Names
Differences in Addresses
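Several of the name variations listed above (capitalisation, extra blanks, hyphens, quotes, commas, full stops, a leading article, language-specific characters) can be reduced by mapping each name to a comparison key before matching. The following is only a minimal sketch of that idea, not the cleaning method used in IST World; the function name `normalize_org_name` and the `ARTICLES` set are assumptions made for illustration. It does not address incomplete names, translations, word order, or mixed organisation and department names.

```python
import re
import unicodedata

# Hypothetical set of leading articles to drop; extend per language as needed.
ARTICLES = {"the", "der", "die", "das", "le", "la"}


def normalize_org_name(name):
    """Return a comparison key for an organisation name (illustrative only)."""
    # Decompose accented characters and drop the combining marks (Jörg -> Jorg).
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold so that capitalisation differences disappear.
    lowered = unaccented.casefold()
    # Replace hyphens, quotes, commas, full stops and other punctuation by spaces.
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    # Splitting on whitespace collapses extra blanks; then drop a leading article.
    tokens = no_punct.split()
    if tokens and tokens[0] in ARTICLES:
        tokens = tokens[1:]
    return " ".join(tokens)


print(normalize_org_name("The  Jozef-Stefan Institute."))
print(normalize_org_name('"Jožef Stefan" Institute,'))
```

Both calls yield the same key, so the two spellings would be grouped as candidate duplicates for later review.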
15
Active Learning Application
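The content of this slide is not preserved in the transcript. As a generic illustration of active learning for duplicate detection, the sketch below shows only the query-selection step of uncertainty sampling: the record pairs whose similarity score lies closest to the decision threshold are the ones handed to a human for labelling. The similarity measure (`difflib.SequenceMatcher`), the threshold of 0.5, and the function names are assumptions for illustration, not the application described in the talk.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1], computed with the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def most_uncertain_pairs(pairs, threshold=0.5, k=2):
    """Return the k pairs whose score lies closest to the decision threshold.

    These are the pairs an automatic matcher is least sure about, so a human
    label for them is assumed to be the most informative.
    """
    scored = [(abs(similarity(a, b) - threshold), a, b) for a, b in pairs]
    scored.sort(key=lambda item: item[0])
    return [(a, b) for _, a, b in scored[:k]]


# Made-up toy strings, not taken from the repository.
candidates = [("abc", "abc"), ("abc", "xyz"), ("abcd", "abxy")]
print(most_uncertain_pairs(candidates, k=1))
```

In a full loop, the human labels for the selected pairs would be used to retrain the matcher before the next round of selection.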
16
Evaluation of Automated Matching Results in the CORDIS FP6 dataset
Human evaluation of 1000 organisation record pairs:
30 Matches correct
934 Non-Matches correct
1 Match incorrect
35 Non-Matches incorrect
Integration approach worked well
Can be used for large-scale integration tasks
Result: semi-automated identification of 4000 duplicates with high accuracy and reasonable recall
97% precision
46% recall
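The reported precision and recall follow from the four counts of the human evaluation, reading "Matches correct" as true positives, "Match incorrect" as false positives, and "Non-Matches incorrect" as false negatives (this mapping is an interpretation of the slide). A short sketch of the arithmetic:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Counts from the human evaluation of 1000 organisation record pairs.
tp, tn, fp, fn = 30, 934, 1, 35

precision, recall = precision_recall(tp, fp, fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.1%} recall={recall:.1%} accuracy={accuracy:.1%}")
# prints: precision=96.8% recall=46.2% accuracy=96.4%
```

Rounded to whole percentages, these are the 97% precision and 46% recall stated on the slide.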
17
Analytic Tools
publicly available at: http://www.ist-world.org/
Advanced Tools: Competence Diagram, Collaboration Diagram
Experimental Tools: Collaboration Trends, Competence Trends, Consortia Prediction, Semantic Search
18
How to Analyze or Generate a Diagram
Define a query in the IST World Portal
Get a list of result records matching the query
Generate diagrams based on the results
19
Competence Diagram
Query: IST SSA projects within FP6
Aim: investigate the thematic range of SSA projects in FP6
Thematic Areas (Blue Clouds): SEMANTIC HEALTH LEGAL CHANGING ROADMAP SOFTWARE
Projects (Red Dots): Linked with Full Record in Repository
20
Competence Diagram
Query: IST SSA projects within FP6
Aim: investigate the thematic range of SSA projects in FP6
Goals (List of Keywords): DEMENTIA PEOPLE MEDICAL STANDARDS …
Configuration of Result Space: 40% of result list, 30 topics
21
Competence Diagram
Query: IST SSA projects within FP6
Aim: investigate the thematic range of SSA projects in FP6
Configuration of Result Space: 40% of result list, 30 topics
Diagram labels: Goals, Themes
22
Collaboration Diagram
Query: IST SSA projects within FP6
Aim: investigate the collaboration of SSA partners in FP6
Configuration of Result Space: 20% of result list
Diagram labels: Project, Number of joint partners
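The "number of joint partners" shown in a collaboration diagram can be understood as an edge weight between two projects. The sketch below counts shared partners for every project pair; it is an assumption about how such a weight could be computed, not the IST World implementation, and the project and organisation names are made-up placeholders.

```python
from itertools import combinations


def joint_partner_counts(project_partners):
    """Map each pair of projects to the number of partners they share."""
    edges = {}
    for p1, p2 in combinations(sorted(project_partners), 2):
        shared = len(set(project_partners[p1]) & set(project_partners[p2]))
        if shared:  # keep only pairs that actually share a partner
            edges[(p1, p2)] = shared
    return edges


# Made-up example data, not taken from the IST World repository.
projects = {
    "Project A": {"Org 1", "Org 2", "Org 3"},
    "Project B": {"Org 2", "Org 3"},
    "Project C": {"Org 4"},
}

print(joint_partner_counts(projects))
# prints: {('Project A', 'Project B'): 2}
```

Each resulting entry corresponds to one edge in the diagram; projects without shared partners, such as "Project C" here, stay unconnected.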
23
Evaluation of Analytic Tools
… are very powerful
… are themselves a powerful means of dissemination
… strongly depend on the data behind them
More evaluation details and results can be found in the CRIS 2008 Proceedings at http://www.eurocris.org
24
Overall Conclusion
Data Collection:
Data should be updated at their origin, independent of their collection method, to avoid repeating data cleaning with every update
Updates have to happen in the processes
Needs backwards-communication with data providers
CRISs support systematic data collections and updates
A Lingua Franca for communication and interchange between systems is needed for large-scale integration and large-scale analyses across single sets
CERIF was crucial for IST World
Crawlings/CRISs do not easily distinguish between topics (IST only)
Web Crawlings (GoogleScholar) considerably lacked quality
Automated Data Integration:
Semi-automatically learned models can be re-used with new data
25
Overall Conclusion
Evaluation of large Datasets: very difficult; needs expert knowledge
Analytics and Tools: depend heavily on quality data; are very powerful for investigation of large datasets; are much appreciated by the community (many registered users)
Common Interest: Very High! Even from outside the project: Hungary, Serbia, Croatia, Russia, …; epriStart project
Needs professional Authority: legalization; not available within the scope of a project
26
Beyond the Project
IST World is public: http://www.ist-world.org/
Registration is free
Create your own Profile, Competence Map, Collaboration Map
Currently FP7 Data are being prepared
Continuation is planned …