Published by Cameron Rose. Modified over 9 years ago.
1
© Brigitte Jörg June 4th, 2008 in Maribor, Slovenia
Project Results
Analyzing European Research Competencies in IST
– Results from a European SSA Project –
Brigitte Jörg (DFKI), Jure Ferlez (IJS), Hans Uszkoreit (DFKI), Mitja Jermol (IJS)
2
Project Information
  Funding Organization: European Commission
  Funding Program: Sixth Framework Programme (FP6: IST (3rd Call))
  Project Type: Specific Support Action (SSA)
  Duration: 32 Months (April 2005 – November 2007)
  Project Co-ordination: DFKI GmbH
  Technical Co-ordination: Jozef Stefan Institute (IJS)
  Technology Partners: DFKI, IJS, Ontotext, CCLRC
  Project Consortium: 15 partners from EU MS, NMS and ACC
3
Project Consortium
  Deutsches Forschungszentrum für Künstliche Intelligenz, Germany
  Institute Jozef Stefan, Slovenia
  Ontotext Lab, Sirma AI EAD, Bulgaria
  RTD Talos, Cyprus
  Institute of Information Theory and Automation, Czech Republic
  Archimedes Foundation, Estonia
  Computer and Automation Research Institute, Hungarian Academy of Sciences, Hungary
  Institute of Mathematics and Computer Science, University of Latvia
  Lithuanian Innovation Centre, Lithuania
  Projects in Motion, Malta
  Technical University of Silesia, Poland
  National Institute for R&D in Informatics, Romania
  Slovak University of Technology, Slovakia
  TUBITAK, Turkey
  The Science and Technology Facilities Council, UK (formerly CCLRC, UK)
4
Technology Partners
  DFKI
    Co-ordinator
    “LT World” Portal
    Information Extraction
    Semantic Web
  Jozef Stefan Institute
    Technical Co-ordinator
    “Project Intelligence”
    Data Mining
    Social Network Analysis
  Ontotext
    “KIM Semantic Annotation Platform”
  euroCRIS
    “CERIF” Standard
    Access to Data
5
Project Objectives
  Set up and populate an information portal on IST research
  Provide information about RTD actors and their experience and expertise
  Provide innovative and automated services
    To promote RTD competencies in specific fields
    To support partner search for IST proposals and commercial projects
6
Presentation Outline
  Information Repository
  Data Collection
  Data Integration / Data Cleaning
  Evaluation of Results
  Analytic Tools
  Overall Conclusion
7
Repository Features
  Information Repository (CERIF 2004) containing
    Organisation
    Person
    Project
    Publications
  Data Collection (CERIF XML) from
    National CRISs
    National Collections
    Web Crawlings
    Community Support
  Data Integration into ONE single dataset to enable analysis at European Level
  Data Cleaning with Supervised Machine Learning Methods (Active Learning)
8
Repository Data Analysis
  Duplicate records are inherent in the single datasets
  Even more duplicate records appear after merging the single datasets
  Most obvious duplicates are among organisations and persons
  No significant number of duplicate projects
  Publications have been ignored
  Duplicate records are a known problem
9
Formal Problem Definition (Winkler 2006)
  Problem: duplicate detection in record set A
  Given: a set of records in A
  Classify: every pair (a, b) ∈ A × A into
    M (set of true matches) or
    U (set of true non-matches)
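The definition above can be illustrated with a minimal Python sketch. It is not the project's implementation: the function name `classify_pairs` and the `is_match` callback are hypothetical, and the sketch assumes that only unordered pairs of distinct records need to be compared, rather than the full product A × A.

```python
from itertools import combinations

def classify_pairs(records, is_match):
    """Split all unordered record pairs into M (matches) and U (non-matches).

    `is_match` is a hypothetical decision function supplied by the caller;
    in practice it would be a human judgement or a trained classifier.
    """
    matches, non_matches = set(), set()
    for a, b in combinations(records, 2):
        if is_match(a, b):
            matches.add((a, b))
        else:
            non_matches.add((a, b))
    return matches, non_matches

# Toy decision rule (an assumption): names match if equal ignoring case.
M, U = classify_pairs(["DFKI", "dfki", "IJS"],
                      lambda a, b: a.lower() == b.lower())
```

With n records this enumerates n(n-1)/2 pairs, so the number of comparisons grows quadratically with the size of the record set.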
10
IST World Problem Definition
  Heuristic Analysis of Random Samples: National Datasets / Cordis Datasets
    most obvious duplicates found inside Cordis FP5 and Cordis FP6 datasets and across Cordis FP5 and FP6 datasets
    not so many duplicates found in national datasets
    a lot of duplicate person records across all datasets
    no duplicate records found in project datasets
    only some duplicate records across project datasets
    publications have not been examined
  Decision taken with respect to the IST World scope
    do not touch project records
    ignore publication records
    find a solution for person records (IST World Community)
    concentrate on cleaning organisation records
11
Problems with Organisation Records
  Most entries had slightly different names caused by additional special characters or character modifications
    Capitalization, Lowercase Letters
    Blanks, extra Spaces
    Hyphens
    Quotes
    Comma in Different Places
    Article in Name
    Full stop in Name
    Incomplete Names
    English Translation
    Word Order
    Language Specific Characters (Jorg instead of Jörg)
    Special Characters (wrong encoding &, ?, )
    Mixture of Organisation Names and Department Names
    Differences in Addresses
  Data Cleaning Application
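Several of the surface variations listed above (case, extra spaces, hyphens, quotes, commas, full stops, articles, language-specific characters) can be reduced by normalising names before comparison. The following Python sketch shows one possible way to do this. It is an illustration only, not the IST World cleaning application: the function name `normalize_name` and the `ARTICLES` set are assumptions, and it does not address word order, translations, incomplete names or wrongly encoded characters.

```python
import re
import unicodedata

# Hypothetical article list; a real system would need one per language.
ARTICLES = {"the"}

def normalize_name(name: str) -> str:
    """Reduce an organisation name to a simplified comparison key."""
    # Decompose accented letters and drop combining marks (Jörg -> Jorg).
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Ignore capitalisation.
    lowered = stripped.casefold()
    # Replace hyphens, quotes, commas, full stops and other punctuation by spaces.
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    # Collapse blanks and drop every token that is a listed article.
    tokens = [t for t in no_punct.split() if t not in ARTICLES]
    return " ".join(tokens)
```

For example, `normalize_name('The  "Jozef-Stefan"  Institute.')` and `normalize_name("jozef stefan institute")` produce the same key, while "Institute Jozef Stefan" still differs because the word order is preserved.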
12
IST World Dataset Integration
  Organisation Names: Fulltext Indexing, Querying
  Organisation Names + Location
    (1) Name/Location Strings (Bag of Words)
    (2) Word/Character Order (String Kernels)
    (3) Spelling Errors (Edit Distance Measure)
    (4) Normalization of (1-3)
  Human Decision: M = Match, U = Non-Match, - = unknown
  Machine Learning (Support Vector Machine): M = Match, U = Non-Match, - = unknown
  Machine Decision: M = Match, U = Non-Match
  Knowledge about Records
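The three similarity measures named above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the project's feature extraction: the slide does not say which string kernel was used, so a character trigram spectrum kernel is assumed here, and all function names are hypothetical. Each feature is scaled to the range 0 to 1, which is one possible reading of step (4). The Support Vector Machine itself is not shown.

```python
from collections import Counter
from math import sqrt

def _cosine(ca: Counter, cb: Counter) -> float:
    """Cosine similarity between two count vectors."""
    dot = sum(ca[k] * cb[k] for k in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def bag_of_words_similarity(a: str, b: str) -> float:
    """(1) Compare the words of both strings, ignoring their order."""
    return _cosine(Counter(a.split()), Counter(b.split()))

def spectrum_kernel_similarity(a: str, b: str, p: int = 3) -> float:
    """(2) Normalised p-spectrum kernel over character n-grams (order sensitive)."""
    grams_a = Counter(a[i:i + p] for i in range(len(a) - p + 1))
    grams_b = Counter(b[i:i + p] for i in range(len(b) - p + 1))
    return _cosine(grams_a, grams_b)

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def edit_similarity(a: str, b: str) -> float:
    """(3) Edit distance rescaled so that 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0

def pair_features(a: str, b: str) -> list:
    """Feature vector for one record pair, each value between 0 and 1."""
    return [bag_of_words_similarity(a, b),
            spectrum_kernel_similarity(a, b),
            edit_similarity(a, b)]
```

A vector such as `pair_features("jozef stefan institute", "institute jozef stefan")` could then be passed to a classifier together with the human M/U labels. In this example the bag-of-words value is close to 1 while the other two are lower, which shows why the measures complement each other.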
13
Active Learning Application
14
Evaluation of Results in CORDIS FP6 dataset
  human evaluation of 1000 organisation record pairs
    30 M correct; 934 U correct
    1 M incorrect; 35 U incorrect
    97% precision
    46% recall
  integration approach worked well
  can be used for large scale integration tasks
  Result: semi-automated identification of 4000 duplicates with high accuracy and a reasonable recall
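The reported precision and recall can be reproduced from the four counts above. The short Python sketch below assumes the usual reading of the counts: "M correct" as true positives, "M incorrect" as false positives and "U incorrect" as false negatives (missed matches). The function name is hypothetical.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int):
    """Return (precision, recall) for the match class M."""
    predicted = true_pos + false_pos   # pairs the machine labelled M
    actual = true_pos + false_neg      # pairs that really are matches
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

# Counts from the evaluation of 1000 record pairs:
# 30 M correct, 1 M incorrect, 35 U incorrect (934 U correct are not needed).
precision, recall = precision_recall(30, 1, 35)
print(f"precision={precision:.0%} recall={recall:.0%}")  # precision=97% recall=46%
```

Under this reading precision is 30/31 and recall is 30/65, which agrees with the rounded figures on the slide.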
15
Analytic Tools
  Advanced Tools
    Collaboration Diagram
    Competence Diagram
  Experimental Tools
    Collaboration Trends
    Competence Trends
    Consortia Prediction
    Semantic Search
16
How to analyze or generate a Diagram
  define a query in the IST World Portal
  get a list of result records matching the query
  generate diagrams based on the results
17
Competence Diagram
  Query: IST SSA projects within FP6
  Aim: investigate the thematic range of SSA projects in FP6
  Thematic Areas (Blue Clouds): SEMANTIC, HEALTH, LEGAL, CHANGING, ROADMAP, SOFTWARE
  Projects (Red Dots): Linked with Full Record in Repository
18
Competence Diagram
  Query: IST SSA projects within FP6
  Aim: investigate the thematic range of SSA projects in FP6
  Goals (List of Keywords): DEMENTIA, PEOPLE, MEDICAL, STANDARDS, …
  Configuration of Result Space: 40% of result list, 30 topics
19
Competence Diagram
  Query: IST SSA projects within FP6
  Aim: investigate the thematic range of SSA projects in FP6
  Configuration of Result Space: 40% of result list, 30 topics
  Diagram labels: Goals, Themes
20
Collaboration Diagram
  Query: IST SSA projects within FP6
  Aim: investigate the collaboration of SSA partners in FP6
  Configuration of Result Space: 20% of result list
  Diagram labels: Number of joint partners, Project
21
Evaluation of Analytic Tools
  IST World made it possible to perform the defined tasks
    for more details see the full paper in the Proceedings
  All analytics depend on the underlying data
  The analytic tools are very powerful
22
Evaluation of Queries
  Query execution performed in March 2008
  Queried datasets: IST World / Cordis
    IST World Portal: http://www.ist-world.org/
    CORDIS Search: http://cordis.europa.eu/en/home.html
23
Results of Query Evaluation
  Discovered inconsistencies with Cordis data:
    “FP6” string: 30 of 80 relevant records lacked the string
    “SSA” string: 15 of 208 relevant records lacked the string
    “Specific Support Action” string: 15 of 208 relevant records lacked the string
    Dates (Year of the call): not consistently recorded
  Query 1: 22 projects contained the string “Coordination Action”, “Specific Targeted Action”, “Integrated Project”, others
  An investigation of the results of Query 1 in Cordis revealed:
    80 projects of the result list are missing in IST World
24
Overall Conclusion
  Integration Method: could be further developed
    Test data could be used to generate a better classification model
    Feature generation could be improved by using ontological knowledge
    Transfer learning methods might be helpful for re-use of the learned model
  Evaluation of large Datasets:
    very difficult
    needs expert knowledge
  Analytic Tools:
    depend on the quality of the underlying data
    are very powerful for investigation of large datasets
25
European Research Dataset (entries), January 2008

  Dataset              Orgs    Proj     Exp     Pubs
  European Research   55078   30489   58164   165795
  Bulgaria              794      73   10940    19023
  Cyprus                 29       -       -        -
  Czech Republic        183     163     164        -
  Estonia                75    1256    6726    51376
  Hungary              2665    1297    2425        -
  Latvia                106     830     701        -
  Lithuania             102       -       -        -
  Malta                  58      27     898      180
  Poland               1451    2179    7392    16086
  Romania               169      68      87        -
  Serbia                 60       -    2278    79130
  Slovenia             1723    3748   11655        -
  Slovakia               56     432     683        -
  Turkey                285       -       -        -
  EPRI-start            286       -     275        -
  Cordis FP5+FP6      48988   20436   13941        -
  Community              61      41     435        -
26
Beyond the Project
  IST World is online: http://www.ist-world.org/
  Registration is free
  Create your Competence Map / Collaboration Map
  Continuation is planned …