GESIS Robert Strötgen Social Science Information Centre, Bonn euroCRIS 2002, 29th August 2002... using Meta-Data Extraction and Query Translation Treatment.

GESIS Robert Strötgen Social Science Information Centre, Bonn euroCRIS 2002, 29th August 2002... using Meta-Data Extraction and Query Translation Treatment of Semantic Heterogeneity...

2 GESIS Outline
- What is semantic heterogeneity?
- Meta-Data extraction
- Semantic relations
- Query translation
- Outlook

3 GESIS Project CARMEN
- Metadata (Dublin Core Element Set in RDF, "Meta-Maker", digital signatures)
- Retrieval on structured documents and heterogeneous data types (search engine and gatherer for XML documents)
- Methods for the treatment of remaining semantic heterogeneity in CARMEN

4 GESIS Semantic Heterogeneity
- Technical heterogeneity (different platforms, databases, formats) is not the focus of CARMEN
- Semantic heterogeneity appears in different data collections using
  - different thesauri or classifications for content description
  - varying metadata or no metadata at all
- or when intellectually indexed documents meet completely unindexed Internet pages

5 GESIS Material: Social Sciences
- SOLIS/FORIS vs. Internet documents from the social sciences
  - SOLIS/FORIS: specialized documentation databases with high-quality content description such as abstracts, controlled keywords and classification
  - Internet documents: in most cases without any metadata, with high semantic and formal heterogeneity

6 GESIS Extraction of Meta-Data

7 GESIS Meta-Data in Test Corpus
- Size: 3,661 documents
- File format: only HTML documents
- TITLE:
  - Correct title tags: 96 %
  - Title, but incorrectly coded: 17.7 % of the rest
- KEYWORD:
  - Correct keyword tags: 25.5 %
- ABSTRACT:
  - Correct description tags: 21 %
  - Abstract, but incorrectly coded: 39.4 % of the rest

8 GESIS Extraction from HTML files - Some Problems
- Missing or irregular use of meta tags (author, keywords, DC tags)
- Inconsistent use of semantic HTML tags (title, h1, h2, address etc.)
- Irregular formatting style for context information (type size, type style, horizontal orientation etc.)
- Missing context information (date, author, institution, etc.)
- Use of HTML that does not conform to the specification!

9 GESIS Converting HTML → XML
- Advantages:
  - (Syntactical) homogenisation of HTML files
  - XML allows the use of many existing tools for document analysis, particularly the query language XPath.
- Disadvantage:
  - Poor performance of the conversion process (not a big issue: extraction runs during the gathering process, not at retrieval time)
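To illustrate why the conversion pays off, here is a minimal sketch of pulling metadata candidates out of an already converted, well-formed document with XPath-style paths. It uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath; the sample document, the function name `extract_candidates` and the returned keys are assumptions for illustration and are not taken from CARMEN.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed document (the result of an HTML -> XML conversion).
XHTML = """<html>
  <head>
    <title>Migration Research Group</title>
    <meta name="keywords" content="migration, labour market"/>
    <meta name="description" content="Working papers on migration."/>
  </head>
  <body>
    <h1>Migration Research Group</h1>
    <p><b>Working papers</b></p>
  </body>
</html>"""


def extract_candidates(xml_text):
    """Collect raw metadata candidates using ElementTree's XPath subset."""
    root = ET.fromstring(xml_text)
    keywords_el = root.find("./head/meta[@name='keywords']")
    description_el = root.find("./head/meta[@name='description']")
    keywords = []
    if keywords_el is not None:
        # Split the comma-separated keyword list and drop empty entries.
        keywords = [k.strip() for k in keywords_el.get("content", "").split(",") if k.strip()]
    return {
        "title": root.findtext("./head/title"),
        "keywords": keywords,
        "description": description_el.get("content") if description_el is not None else None,
        # Emphasised text inside paragraphs, e.g. //p/b.
        "bold_in_paragraph": [b.text for b in root.findall(".//p/b")],
    }


result = extract_candidates(XHTML)
```

The same paths would fail on typical real-world HTML, which is rarely well-formed; that is the homogenisation the conversion step provides.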

10 GESIS HTML Heuristic: Title (part)
If (<title>-tag exists && <title> does not contain "untitled" && HMAX exists) {
    /* 'does not contain "untitled"' is to be searched as case insensitive substring in <title> */
    If (<title> == HMAX) {
        Title[1] = <title>
    } elsif (<title> contains HMAX) {
        /* 'contains' does always mean case insensitive substring */
        Title[0.8] = <title>
    } elsif (HMAX contains <title>) {
        Title[0.8] = HMAX
    } else {
        Title[0.8] = <title> + HMAX
    }
} elsif (<title> exists && S exists) {
    /* i.e. <title> exists AND an item //p/b, //i/p etc. exists */
    Title[0.5] = <title> + S
} elsif (<title> exists) {
    Title[0.5] = <title>
} elsif (HMAX exists) {
    Title[0.3] = HMAX
} elsif (S exists) {
    Title[0.1] = S
}
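The heuristic above can be sketched as a small Python function. This is an illustrative reading of the slide, not the CARMEN implementation: the function name `title_candidate`, the `(text, weight)` return value, joining two parts with a single space for `+`, and treating "a heading exists" as "HMAX is non-empty" are all assumptions.

```python
def title_candidate(title, hmax, s):
    """Return (text, weight) for the best title candidate, or None.

    title: content of the <title> tag, or None
    hmax:  text of the highest-level heading found, or None
    s:     emphasised text such as //p/b or //i/p, or None
    """

    def contains(outer, inner):
        # 'contains' always means case-insensitive substring.
        return inner.lower() in outer.lower()

    if title and hmax and not contains(title, "untitled"):
        if title == hmax:
            return title, 1.0
        if contains(title, hmax):
            return title, 0.8
        if contains(hmax, title):
            return hmax, 0.8
        return f"{title} {hmax}", 0.8  # assumption: '+' joins with a space
    if title and s:
        return f"{title} {s}", 0.5
    if title:
        return title, 0.5
    if hmax:
        return hmax, 0.3
    if s:
        return s, 0.1
    return None


example = title_candidate("GESIS - Home", "Home", None)
```

Returning the weight alongside the text lets a later step decide whether a low-confidence title (0.1 or 0.3) should be shown or indexed at all.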

11 GESIS Results and Outlook
- Extraction of Meta-Data
  - TITLE: 80 % extracted with medium or high quality
  - KEYWORDS: nearly 100 % extracted with high quality
  - ABSTRACTS: 90 % extracted with medium/high quality
- Conclusion
  - In principle transferable to other domains
  - Expensive maintenance
  - Only a compromise solution until authors of web pages use Dublin Core or another Meta-Data standard

12 GESIS Semantic Relations
- Intellectual transfer relations (cross-concordances)
  - Tools for creation: SIS-TMS for thesauri, CarmenX for classifications
- Statistical transfer relations (co-occurrence analysis)

13 GESIS Cross-Concordances in SIS-TMS

14 GESIS SIS-TMS Correlation Editor

15 GESIS Parallel Corpus

16 GESIS Corpus with Internet Documents
- Internet documents from the social sciences are not indexed using a thesaurus or classification

17 GESIS Simulating a Parallel Corpus

18 GESIS Result: Simulated Parallel Corpus

19 GESIS Term-Term-Matrix
- a → x (0.8); y (0.4)
- b → x (0.3); z (0.3)
- c → a (0.2); y (0.4)
- d → x (0.6); y (0.7)
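As a rough sketch of how weighted term-term relations of this kind can be derived from a parallel corpus by co-occurrence analysis: each document carries both controlled terms and free terms, and the weight of a relation is estimated here as the conditional probability of the free term given the controlled term. That measure, the function name `transfer_relations` and the toy corpus are assumptions for illustration; the slides do not state which association measures Jester uses.

```python
from collections import Counter, defaultdict


def transfer_relations(parallel_corpus, min_weight=0.0):
    """Map each controlled term to {free_term: weight}.

    parallel_corpus: iterable of (controlled_terms, free_terms) pairs,
    one pair per document. The weight is the share of documents with the
    controlled term that also contain the free term (assumed measure).
    """
    source_count = Counter()
    pair_count = defaultdict(Counter)
    for controlled_terms, free_terms in parallel_corpus:
        for a in set(controlled_terms):
            source_count[a] += 1
            for x in set(free_terms):
                pair_count[a][x] += 1

    relations = {}
    for a, targets in pair_count.items():
        weighted = {x: n / source_count[a] for x, n in targets.items()}
        # Keep relations at or above the threshold, strongest first.
        kept = [(x, w) for x, w in weighted.items() if w >= min_weight]
        relations[a] = dict(sorted(kept, key=lambda kv: (-kv[1], kv[0])))
    return relations


# Toy parallel corpus: (controlled terms, free terms) per document.
corpus = [
    ({"a"}, {"x", "y"}),
    ({"a"}, {"x"}),
    ({"a", "b"}, {"x", "z"}),
    ({"b"}, {"z"}),
]
relations = transfer_relations(corpus)
strong_only = transfer_relations(corpus, min_weight=0.5)
```

The `min_weight` threshold is one simple way to keep the number of transfer terms per query term manageable.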

20 GESIS Tool: Jester
- Java Environment for Statistical TransfERs: support and assistance for creating statistical transfer relations from a parallel corpus

21 GESIS Query Transformation

22 GESIS Binding of Query Languages
- Pluggable QueryParsers and QueryPrinters for different query languages make reuse in other contexts easy.
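The idea of pluggable parsers and printers can be sketched as two small interfaces around a shared internal query representation. Everything below is hypothetical: the class names, the method names `parse` and `render`, and both toy query syntaxes are invented for illustration and do not reproduce the CARMEN Java classes or any real query language.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass(frozen=True)
class Term:
    """One field/value condition of the internal query representation."""
    field: str
    value: str


class QueryParser(Protocol):
    def parse(self, text: str) -> List[Term]: ...


class QueryPrinter(Protocol):
    def render(self, terms: List[Term]) -> str: ...


class FieldColonParser:
    """Parses a hypothetical 'field:value AND field:value' syntax."""

    def parse(self, text: str) -> List[Term]:
        terms = []
        for part in text.split(" AND "):
            field, _, value = part.partition(":")
            terms.append(Term(field.strip(), value.strip()))
        return terms


class FieldEqualsPrinter:
    """Prints a hypothetical 'field="value" & field="value"' syntax."""

    def render(self, terms: List[Term]) -> str:
        return " & ".join(f'{t.field}="{t.value}"' for t in terms)


def convert(text: str, parser: QueryParser, printer: QueryPrinter) -> str:
    """Translate a query between syntaxes via the shared representation."""
    return printer.render(parser.parse(text))


converted = convert("keyword:Dominanz AND lang:de", FieldColonParser(), FieldEqualsPrinter())
```

Because term transfer operates only on the internal representation, supporting another query language means adding one parser and one printer rather than changing the transfer logic.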

23 GESIS CARMEN Transfer Architecture
- Retrieval server (HyRex) identifies transferable parts of a query and sends them to the transfer service
- Exchange of partial queries using XML/XIRQL
- Transfer service runs as a servlet on a Tomcat server

24 GESIS Evaluation of Transfer Modules
- Retrieval tests using transfer modules (using a corpus of Internet documents indexed with Fulcrum SearchServer)
- Limitation: the weight information of the transfer relations was not used
- Tested transfer: SOLIS/IZ-Thesaurus → SoWi Internet documents/free terms
- Comparison: search using IZ-Thesaurus terms vs. search using free terms from the transfer
- 2 exemplary searches in each of 3 domains (women's studies, migration, sociology of industry)

25 GESIS Exemplary Search: “Dominanz”
- “Dominanz” (“dominance”): 16 relevant documents
- 10 transfer terms (Dominanz, Messen, Mongolei, Nichtregierungsorganisation, Flugzeug, Datenaustausch, Kommunikationsraum, Kommunikationstechnologie, Medienpädagogik, Wüste): 14 additional documents, 7 of them relevant (50 %, increase 44 %)
- Precision: 77 %

26 GESIS Exemplary Search: “Leiharbeit”
- “Leiharbeit” (“temporary work”): 10 relevant documents
- 4 transfer terms (Leiharbeit, Arbeitsphysiologie, Organisationsmodell, Risikoabschätzung): 10 additional documents, 2 of them relevant (20 %, increase 20 %)
- Precision: 60 %
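The percentages on the two example slides can be reproduced with a few lines of arithmetic. The sketch assumes one particular reading of the figures: all documents found without transfer count as relevant, "increase" is the number of new relevant documents relative to those, and precision is computed over the combined result set. The function name `transfer_gain` is made up for this illustration.

```python
def transfer_gain(relevant_before, new_docs, new_relevant):
    """Return (share of new docs that are relevant, increase, precision).

    relevant_before: relevant documents found without transfer
    new_docs:        additional documents found via transfer terms
    new_relevant:    how many of the additional documents are relevant
    """
    share_new_relevant = new_relevant / new_docs
    increase = new_relevant / relevant_before
    # Assumption: the documents found without transfer are all relevant.
    precision = (relevant_before + new_relevant) / (relevant_before + new_docs)
    return share_new_relevant, increase, precision


dominanz = transfer_gain(relevant_before=16, new_docs=14, new_relevant=7)
leiharbeit = transfer_gain(relevant_before=10, new_docs=10, new_relevant=2)

for name, values in (("Dominanz", dominanz), ("Leiharbeit", leiharbeit)):
    share, increase, precision = (round(v * 100) for v in values)
    print(f"{name}: {share} % of new documents relevant, "
          f"increase {increase} %, precision {precision} %")
```

Under this reading both slides are internally consistent: 7/14, 7/16 and 23/30 for “Dominanz”, and 2/10, 2/10 and 12/20 for “Leiharbeit”.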

27 GESIS Results
- All exemplary searches using transfers led to additional relevant documents compared with a search without transfer
- Share of relevant documents among all new documents: between 13 % and 55 %
- Transfer terms are not always evident (example: “Wüste” (“desert”))
- Sometimes a very large number of transfer terms (user parametrization or better algorithms needed)

28 GESIS Outlook (What needs to be done?)
- Improvement of parallel corpora:
  - Kind of documents
  - Diversity of document types
  - Diversity of institutions / web sites
  - Domain
  - Corpus size
- Comparison of transfers using statistical relations vs. intellectual relations
- Improvement of algorithms
- Effect of interactive, repetitive retrieval and user parametrization / adjustment
- User tests

29 GESIS Exploitation
- Services (transfer)
- Software (Java classes)
- Projects:
  - Virtuelle Fachbibliothek Sozialwissenschaften (ViBSoz)
  - European Schools Treasury Browser (ETB)
  - Informationsverbund Bildung – Sozialwissenschaften – Psychologie (InfoConnex)
- Contact: soe@bonn.iz-soz.de
