GESIS Robert Strötgen Social Science Information Centre, Bonn euroCRIS 2002, 29th August 2002... using Meta-Data Extraction and Query Translation Treatment.

Presentation transcript:

GESIS Treatment of Semantic Heterogeneity using Meta-Data Extraction and Query Translation
Robert Strötgen, Social Science Information Centre, Bonn
euroCRIS 2002, 29th August 2002

2 GESIS Outline
- What is semantic heterogeneity?
- Meta-Data extraction
- Semantic relations
- Query translation
- Outlook

3 GESIS Project CARMEN
- Metadata (Dublin Core Element Set in RDF, “Meta-Maker”, digital signatures)
- Retrieval on structured documents and heterogeneous data types (search engine and gatherer for XML documents)
- Methods for the treatment of remaining semantic heterogeneity in CARMEN

4 GESIS Semantic Heterogeneity
- Technical heterogeneity (different platforms, databases, formats) is not the concern of CARMEN
- Semantic heterogeneity appears in different data collections using
  - different thesauri or classifications for content description
  - varying or no metadata at all
  - or when intellectually indexed documents meet completely un-indexed Internet pages

5 GESIS Material: Social Sciences
- SOLIS/FORIS vs. Internet documents from social sciences
  - specialized documentation databases with high-quality content description like abstract, controlled keywords and classification
  - Internet documents in the majority of cases without any metadata, high semantic and formal heterogeneity

6 GESIS Extraction of Meta-Data

7 GESIS Meta-Data in Test Corpus
- Size: 3,661 documents
- File format: only HTML documents
- TITLE:
  - Correct title tags: 96 %
  - Title, but incorrectly coded: 17.7 % of the rest
- KEYWORD:
  - Correct keyword tags: 25.5 %
- ABSTRACT:
  - Correct description tags: 21 %
  - Abstract, but incorrectly coded: 39.4 % of the rest

8 GESIS Extraction from HTML files - Some Problems
- Missing or irregular use of Meta tags (author, keywords, DC-Tags)
- Inconsistent use of semantic HTML tags (title, h1, h2, address etc.)
- Irregular formatting style for context information (type size, type style, horizontal orientation etc.)
- Missing context information (date, author, institution, etc.)
- HTML is not used in conformance with the specification!

9 GESIS Converting HTML → XML
- Advantages:
  - (syntactical) homogenisation of HTML files
  - XML allows the use of many existing tools for document analysis, particularly the query language XPath.
- Disadvantage:
  - Poor performance of the conversion process (not a big issue: extraction runs during the gathering process, not at retrieval time)
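The conversion step can be illustrated with a minimal Python sketch. It is not the CARMEN converter: the class `HtmlToXml`, the synthetic `document` root element and the sample page are assumptions made for illustration. The sketch only shows the idea from the slide, namely that sloppy HTML (unclosed tags, void elements) is first turned into a well-formed tree, which can then be queried with XPath-style expressions (here the XPath subset of the standard library's ElementTree).

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}


class HtmlToXml(HTMLParser):
    """Build a well-formed ElementTree from possibly sloppy HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        # Synthetic root (an assumption) so stray top-level content stays legal.
        self.builder.start("document", {})
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, {name: value or "" for name, value in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)  # close void elements immediately
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close everything that was left open inside this element.
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.builder.end(open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def finish(self):
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        self.builder.end("document")
        return self.builder.close()


def html_to_xml(html):
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    return parser.finish()


SAMPLE = """<html><head><title>Migration Research</title>
<meta name="keywords" content="migration, labour market">
</head><body><h1>Migration Research</h1><p>Unclosed paragraph<br>second line</body></html>"""

tree = html_to_xml(SAMPLE)
title = tree.findtext(".//title")
keywords = tree.find(".//meta[@name='keywords']").get("content")
```

Once the page is a tree, each extraction heuristic reduces to a path expression plus some string handling, which is the advantage the slide attributes to the XML route.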

10 GESIS HTML Heuristic: Title (part)
If (<title>-tag exists && <title> does not contain "untitled" && HMAX exists) {
    /* 'does not contain "untitled"' is to be searched as case insensitive substring in <title> */
    If (<title> == HMAX) { Title[1] = <title> }
    elsif (<title> contains HMAX) { /* 'contains' does always mean case insensitive substring */
        Title[0.8] = <title> }
    elsif (HMAX contains <title>) { Title[0.8] = HMAX }
    else { Title[0.8] = <title> + HMAX }
} elsif (<title> exists && S exists) { /* i.e. <title> exists AND an item //p/b, //i/p etc. exists */
    Title[0.5] = <title> + S
} elsif (<title> exists) { Title[0.5] = <title> }
elsif (HMAX exists) { Title[0.3] = HMAX }
elsif (S exists) { Title[0.1] = S }
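The heuristic on this slide can be sketched as a small Python function. Several points are assumptions, since the slide does not spell them out: the bracketed values are read as confidence weights, HMAX is taken to be the text of the highest-level heading, S is the text of a specially formatted passage (such as //p/b), "+" is rendered as concatenation with a space, and the function name `title_candidate` is hypothetical. The three inputs are expected to be already extracted strings, or `None` when missing.

```python
def title_candidate(title, hmax, s):
    """Return (text, weight) for the best title guess, or None.

    title: content of the <title> tag, or None
    hmax:  text of the highest-level heading, or None (assumed meaning of HMAX)
    s:     text of a specially formatted passage, or None (assumed meaning of S)
    """
    if title and "untitled" not in title.lower() and hmax:
        t, h = title.lower(), hmax.lower()
        if title == hmax:
            return title, 1.0
        if h in t:  # <title> contains HMAX (case-insensitive substring)
            return title, 0.8
        if t in h:  # HMAX contains <title>
            return hmax, 0.8
        return f"{title} {hmax}", 0.8
    if title and s:
        return f"{title} {s}", 0.5
    if title:
        return title, 0.5
    if hmax:
        return hmax, 0.3
    if s:
        return s, 0.1
    return None


best = title_candidate("Migration Research Home", "migration research", None)
```

Note that a page whose `<title>` contains "untitled" still falls through to the later branches and is used with weight 0.5, exactly as in the pseudocode.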

11 GESIS Results and Outlook
- Extraction of Meta-Data
  - TITLE: 80 % extracted with medium or high quality
  - KEYWORDS: nearly 100 % extracted with high quality
  - ABSTRACTS: 90 % extracted with medium/high quality
- Conclusion
  - In principle transferable to other domains
  - Expensive maintenance
  - Only a compromise solution until builders of web pages use Dublin Core or another Meta-Data standard

12 GESIS Semantic Relations
- Intellectual transfer relations (cross-concordances)
  - Tools for creation: SIS-TMS for thesauri, CarmenX for classifications
- Statistical transfer relations (co-occurrence analysis)

13 GESIS Cross-Concordances in SIS-TMS

14 GESIS SIS-TMS Correlation Editor

15 GESIS Parallel Corpus

16 GESIS Corpus with Internet Documents
Social sciences' Internet documents are not indexed using a thesaurus or classification

17 GESIS Simulating a Parallel Corpus

18 GESIS Result: Simulated Parallel Corpus

19 GESIS Term-Term-Matrix
a: x (0.8); y (0.4)
b: x (0.3); z (0.3)
c: a (0.2); y (0.4)
d: x (0.6); y (0.7)
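How such a weighted term-term matrix might be derived from a parallel corpus can be sketched in Python. The slides do not say which association measure is used, so the conditional relative frequency below (share of documents indexed with a source term that also carry a given target term) is an assumption, as are the function name `transfer_relations`, the `min_weight` cut-off and the toy corpus.

```python
from collections import Counter, defaultdict


def transfer_relations(corpus, min_weight=0.3):
    """corpus: iterable of (source_terms, target_terms) pairs, one per document.

    Returns {source_term: {target_term: weight}} where weight is the share of
    documents with source_term that also contain target_term.
    """
    source_count = Counter()
    pair_count = defaultdict(Counter)
    for source_terms, target_terms in corpus:
        for s in set(source_terms):
            source_count[s] += 1
            for t in set(target_terms):
                pair_count[s][t] += 1

    relations = {}
    for s, targets in pair_count.items():
        weighted = {t: n / source_count[s] for t, n in targets.items()}
        relations[s] = {t: w for t, w in weighted.items() if w >= min_weight}
    return relations


# Toy parallel corpus: thesaurus terms on the left, free terms on the right.
corpus = [
    ({"a"}, {"x", "y"}),
    ({"a"}, {"x"}),
    ({"b"}, {"z"}),
]
relations = transfer_relations(corpus)
```

Each row of the resulting dictionary corresponds to one row of the matrix: a source term followed by its weighted target terms.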

20 GESIS Tool: Jester
Java Environment for Statistical TransfERs: support and assistance for creating statistical transfer relations from a parallel corpus

21 GESIS Query Transformation

22 GESIS Binding of Query Languages
Pluggable QueryParsers and QueryPrinters for different query languages make reuse in other contexts easy.

23 GESIS CARMEN Transfer Architecture
- Retrieval server (HyRex) identifies transferable parts of a query and sends them to the transfer service
- Exchange of partial queries using XML/XIRQL
- Transfer service runs as a servlet on a Tomcat server
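The combination of pluggable parsers and printers with a transfer step can be sketched in Python. This is not the CARMEN Java code and not XIRQL: the classes `FieldColonParser` and `OrPrinter`, the two query syntaxes and the field names `thesaurus` and `freetext` are hypothetical. The transfer terms for "Leiharbeit" are the ones listed in the evaluation slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    field: str
    value: str


class FieldColonParser:
    """Parses 'field:value field:value' (a made-up source syntax)."""

    def parse(self, text):
        terms = []
        for token in text.split():
            field, _, value = token.partition(":")
            terms.append(Term(field, value))
        return terms


class OrPrinter:
    """Renders terms as a quoted OR list (a made-up target syntax)."""

    def render(self, terms):
        return " OR ".join(f'{t.field}:"{t.value}"' for t in terms)


def transfer(terms, relations, source_field="thesaurus", target_field="freetext"):
    """Replace each transferable term by its transfer terms; pass the rest through."""
    result = []
    for term in terms:
        if term.field == source_field and term.value in relations:
            result.extend(Term(target_field, v) for v in relations[term.value])
        else:
            result.append(term)
    return result


def translate(text, parser, printer, relations):
    return printer.render(transfer(parser.parse(text), relations))


RELATIONS = {
    "Leiharbeit": ["Leiharbeit", "Arbeitsphysiologie",
                   "Organisationsmodell", "Risikoabschätzung"],
}
translated = translate("thesaurus:Leiharbeit", FieldColonParser(), OrPrinter(), RELATIONS)
```

Because parsing, transfer and printing are separate, supporting another query language only means supplying another parser or printer; the transfer step itself is unchanged.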

24 GESIS Evaluation of Transfer Modules
- Retrieval tests using transfer modules (using a corpus of Internet documents indexed with Fulcrum SearchServer)
- Limitation: the weight information of the transfer relations is not used
- Tested transfer: SOLIS/IZ-Thesaurus → SoWi Internet documents/free terms
- Comparison: search using IZ-Thesaurus terms vs. search using free terms from the transfer
- 2 exemplary searches in each of 3 domains (women's studies, migration, sociology of industry)

25 GESIS Exemplary Search: “Dominanz”
- “Dominanz” (“dominance”): 16 relevant documents
- 10 transfer terms (Dominanz, Messen, Mongolei, Nichtregierungsorganisation, Flugzeug, Datenaustausch, Kommunikationsraum, Kommunikationstechnologie, Medienpädagogik, Wüste): 14 additional documents, 7 of them relevant (50%, increase 44%)
- Precision: 77%

26 GESIS Exemplary Search: “Leiharbeit”
- “Leiharbeit” (“temporary work”): 10 relevant documents
- 4 transfer terms (Leiharbeit, Arbeitsphysiologie, Organisationsmodell, Risikoabschätzung): 10 additional documents, 2 of them relevant (20%, increase 20%)
- Precision: 60%
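The percentages on the two example slides can be reproduced with a short Python calculation. It rests on one assumption not stated on the slides: the search without transfer returned only the relevant documents (16 and 10 respectively). Under that reading, the reported precision values of 77% and 60% come out exactly. The function name `transfer_gain` is hypothetical.

```python
def transfer_gain(base_relevant, added_docs, added_relevant):
    """Return (share of relevant among new docs, increase, precision) in percent.

    Assumes the base search (without transfer) returned only relevant documents.
    """
    share_new = added_relevant / added_docs
    increase = added_relevant / base_relevant
    precision = (base_relevant + added_relevant) / (base_relevant + added_docs)
    return tuple(round(100 * value) for value in (share_new, increase, precision))


dominanz = transfer_gain(base_relevant=16, added_docs=14, added_relevant=7)
leiharbeit = transfer_gain(base_relevant=10, added_docs=10, added_relevant=2)
```

For “Dominanz” this gives 7/14 = 50%, 7/16 ≈ 44% and 23/30 ≈ 77%; for “Leiharbeit” 2/10 = 20%, 2/10 = 20% and 12/20 = 60%.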

27 GESIS Results
- All exemplary searches using transfers led to additional relevant documents compared with a search without transfer
- Share of relevant documents among all new documents: between 13% and 55%
- Transfer terms are not always self-evident (example: “Wüste” (“desert”))
- Sometimes a very large number of transfer terms (user parametrization or better algorithms needed)

28 GESIS Outlook (What needs to be done?)
- Improvement of parallel corpora:
  - Kind of documents
  - Diversity of document types
  - Diversity of institutions / web sites
  - Domain
  - Corpus size
- Comparison of transfers using statistical relations vs. intellectual relations
- Improvement of algorithms
- Effect of interactive, iterative retrieval and user parametrization / adjustment
- User tests

29 GESIS Exploitation
- Services (transfer)
- Software (Java classes)
- Projects:
  - Virtuelle Fachbibliothek Sozialwissenschaften (ViBSoz)
  - European Schools Treasury Browser (ETB)
  - Informationsverbund Bildung – Sozialwissenschaften – Psychologie (InfoConnex)
- Contact: