Published by Jewel Daniel. Modified over 9 years ago.
1
Treatment of Semantic Heterogeneity using Meta-Data Extraction and Query Translation
Robert Strötgen
GESIS, Social Science Information Centre, Bonn
euroCRIS 2002, 29th August 2002
2
Outline
- What is semantic heterogeneity?
- Meta-Data extraction
- Semantic relations
- Query translation
- Outlook
3
Project CARMEN
- Metadata (Dublin Core Element Set in RDF, "Meta-Maker", digital signatures)
- Retrieval on structured documents and heterogeneous data types (search engine and gatherer for XML documents)
- Methods for treating the remaining semantic heterogeneity in CARMEN
4
Semantic Heterogeneity
Technical heterogeneity (different platforms, databases, formats) is not the subject of CARMEN.
Semantic heterogeneity appears when different data collections
- use different thesauri or classifications for content description,
- have varying metadata or none at all,
- or when intellectually indexed documents meet completely un-indexed Internet pages.
5
Material: Social Sciences
SOLIS/FORIS vs. Internet documents from the social sciences
- SOLIS/FORIS: specialized documentation databases with high-quality content description (abstract, controlled keywords, classification)
- Internet documents: in most cases without any metadata; high semantic and formal heterogeneity
6
Extraction of Meta-Data
7
Meta-Data in Test Corpus
Size: 3,661 documents
File format: only HTML documents
TITLE:
- Correct title tags: 96 %
- Title, but incorrectly coded: 17.7 % of the rest
KEYWORD:
- Correct keyword tags: 25.5 %
ABSTRACT:
- Correct description tags: 21 %
- Abstract, but incorrectly coded: 39.4 % of the rest
8
Extraction from HTML files: Some Problems
- Missing or irregular use of meta tags (author, keywords, DC tags)
- Inconsistent use of semantic HTML tags (title, h1, h2, address etc.)
- Irregular formatting style for context information (type size, type style, horizontal orientation etc.)
- Missing context information (date, author, institution etc.)
- HTML that does not conform to the specification!
9
Converting HTML to XML
Advantages:
- (Syntactical) homogenisation of HTML files
- XML allows the use of many existing tools for document analysis, particularly the query language XPath.
Disadvantage:
- Poor performance of the conversion process (not a big issue: extraction runs during the gathering process, not at retrieval time)
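As an illustration of the conversion idea, here is a minimal sketch in Python (not the CARMEN implementation, whose tooling the slides do not name). It turns loosely written HTML into a well-formed element tree and then queries it with the XPath subset supported by Python's standard library. The class name HtmlToXml, the synthetic root element "document" and the sample page are assumptions made for this example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class HtmlToXml(HTMLParser):
    """Builds a well-formed element tree from loosely written HTML."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []
        # Synthetic root so that stray text or several top-level tags still fit.
        self.builder.start("document", {})

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, {name: value or "" for name, value in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag, or end tag of a void element
        # Close everything that was left open inside this element.
        while self.open_tags:
            closed = self.open_tags.pop()
            self.builder.end(closed)
            if closed == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def tree(self):
        # Close whatever is still open, then the synthetic root.
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        self.builder.end("document")
        return self.builder.close()


def html_to_xml(html):
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    return parser.tree()


# Sample page with an unclosed <h1>, an unclosed <p> and void elements.
page = """<html><head><title>Migration Research</title>
<meta name="keywords" content="migration, labour market">
</head><body><h1>Migration Research<p>An unclosed paragraph<br>with a break
</body></html>"""

root = html_to_xml(page)
title = root.findtext(".//title")
keywords = root.find(".//meta[@name='keywords']").get("content")
heading = root.findtext(".//h1")
```

Once the page is well-formed, the same path expressions work on every document regardless of how sloppy the original markup was, which is the homogenisation benefit named on the slide.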
10
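HTML Heuristic: Title (part)
(Tag names such as <title> were stripped from this slide during extraction and are restored here from context. The value in brackets is the weight, written with a decimal comma: Title[0,8] means weight 0.8.)

    If (<title>-tag exists && <title> does not contain "untitled" && HMAX exists) {
        /* 'does not contain "untitled"' is to be searched as
           case insensitive substring in <title> */
        If (<title> == HMAX) {
            Title[1] = <title>
        } elsif (<title> contains HMAX) {
            /* 'contains' does always mean case insensitive substring */
            Title[0,8] = <title>
        } elsif (HMAX contains <title>) {
            Title[0,8] = HMAX
        } else {
            Title[0,8] = <title> + HMAX
        }
    } elsif (<title> exists && S exists) {
        /* i.e. <title> exists AND an item //p/b, //i/p etc. exists */
        Title[0,5] = <title> + S
    } elsif (<title> exists) {
        Title[0,5] = <title>
    } elsif (HMAX exists) {
        /* condition lost in extraction; a test on HMAX is assumed */
        Title[0,3] = HMAX
    } elsif (S exists) {
        Title[0,1] = S
    }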
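The heuristic above can be sketched as a small Python function. This is an illustrative reading of the slide, not the project's code. It makes three assumptions: title is the text of the <title> tag, hmax is the text of the highest-ranking heading, and s is emphasized text such as //p/b. It also treats a title containing "untitled" as unusable in every branch, and assumes that the branch with weight 0.3 tests for HMAX. The function name extract_title is hypothetical.

```python
from typing import Optional, Tuple


def contains(text: str, part: str) -> bool:
    """Case-insensitive substring test, as required by the slide."""
    return part.lower() in text.lower()


def extract_title(
    title: Optional[str], hmax: Optional[str], s: Optional[str]
) -> Tuple[Optional[str], float]:
    """Return (title candidate, weight) following the slide's branches."""
    # Assumption: an "untitled" <title> counts as missing everywhere.
    usable_title = bool(title) and not contains(title, "untitled")

    if usable_title and hmax:
        if title == hmax:
            return title, 1.0
        if contains(title, hmax):
            return title, 0.8
        if contains(hmax, title):
            return hmax, 0.8
        return f"{title} {hmax}", 0.8
    if usable_title and s:
        return f"{title} {s}", 0.5
    if usable_title:
        return title, 0.5
    if hmax:  # assumed condition, see lead-in
        return hmax, 0.3
    if s:
        return s, 0.1
    return None, 0.0


# A page whose <title> is a generic placeholder falls back to the heading.
candidate, weight = extract_title("Untitled Document", "Results 2001", None)
```

Returning the weight together with the text lets a later step prefer high-confidence titles or flag low-confidence ones for review.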
11
Results and Outlook: Extraction of Meta-Data
- TITLE: 80 % extracted with medium or high quality
- KEYWORDS: nearly 100 % extracted with high quality
- ABSTRACTS: 90 % extracted with medium or high quality
Conclusion
- In principle transferable to other domains
- Expensive maintenance
- Only a compromise solution until authors of web pages use Dublin Core or another Meta-Data standard
12
Semantic Relations
- Intellectual transfer relations (cross-concordances)
  Tools for creation: SIS-TMS for thesauri, CarmenX for classifications
- Statistical transfer relations (co-occurrence analysis)
13
Cross-Concordances in SIS-TMS
14
SIS-TMS Correlation Editor
15
Parallel Corpus
16
Corpus with Internet Documents
Internet documents from the social sciences are not indexed using a thesaurus or classification.
17
Simulating a Parallel Corpus
18
Result: Simulated Parallel Corpus
19
Term-Term-Matrix
a: x(0,8); y(0,4)
b: x(0,3); z(0,3)
c: a(0,2); y(0,4)
d: x(0,6); y(0,7)
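A term-term matrix of this kind can be derived from a parallel corpus by counting co-occurrences. The sketch below is an assumption-laden illustration, since the slides do not state which association measure was used. Here the weight of a relation is the share of documents carrying a source term that also carry the target term, a simple conditional probability. The function name transfer_relations and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple

# One corpus entry: (terms from vocabulary A, terms from vocabulary B)
# assigned to the same document.
Document = Tuple[Set[str], Set[str]]


def transfer_relations(
    corpus: Iterable[Document], min_weight: float = 0.0
) -> Dict[str, Dict[str, float]]:
    """Estimate weight(s -> t) = docs with s and t / docs with s."""
    source_count: Counter = Counter()
    pair_count: Dict[str, Counter] = defaultdict(Counter)

    for source_terms, target_terms in corpus:
        for s in source_terms:
            source_count[s] += 1
            for t in target_terms:
                pair_count[s][t] += 1

    matrix: Dict[str, Dict[str, float]] = {}
    for s, targets in pair_count.items():
        row = {}
        for t, n in targets.items():
            weight = n / source_count[s]
            if weight >= min_weight:
                row[t] = weight
        matrix[s] = row
    return matrix


# Toy parallel corpus: thesaurus terms on the left, free terms on the right.
toy_corpus = [
    ({"a"}, {"x", "y"}),
    ({"a"}, {"x"}),
    ({"b"}, {"z"}),
]
matrix = transfer_relations(toy_corpus)
```

The min_weight parameter is one simple way to suppress weak relations; other measures and cut-offs are equally possible.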
20
Tool: Jester
Java Environment for Statistical TransfERs: support and assistance for creating statistical transfer relations from a parallel corpus
21
Query Transformation
22
Binding of Query Languages
Pluggable QueryParsers and QueryPrinters for different query languages make reuse in other contexts easy.
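The plug-in idea can be pictured with a small registry of parsers and printers around one internal query form. This is only a sketch of the pattern, not the project's Java classes. The two query syntaxes ("boolean-and" and "plus-prefix"), the registry names and the list-of-terms internal form are all invented for this example.

```python
from typing import Callable, Dict, List

# Assumed internal form: a conjunction of search terms.
Query = List[str]

PARSERS: Dict[str, Callable[[str], Query]] = {}
PRINTERS: Dict[str, Callable[[Query], str]] = {}


def parse_boolean_and(text: str) -> Query:
    """Parse 'a AND b AND c' into the internal form."""
    return [term.strip() for term in text.split(" AND ") if term.strip()]


def print_plus_prefix(query: Query) -> str:
    """Print the internal form as '+a +b +c'."""
    return " ".join("+" + term for term in query)


# New query languages are added by registering further functions.
PARSERS["boolean-and"] = parse_boolean_and
PRINTERS["plus-prefix"] = print_plus_prefix


def convert(text: str, source: str, target: str) -> str:
    """Translate query syntax by parsing with one plug-in, printing with another."""
    return PRINTERS[target](PARSERS[source](text))


converted = convert("Leiharbeit AND Migration", "boolean-and", "plus-prefix")
```

Because every parser and printer meets at the same internal form, supporting a new query language takes one parser and one printer rather than a converter for every pair of languages.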
23
CARMEN Transfer Architecture
- The retrieval server (HyRex) identifies transferable parts of a query and sends them to the transfer service
- Partial queries are exchanged using XML/XIRQL
- The transfer service runs as a servlet on a Tomcat server
24
Evaluation of Transfer Modules
- Retrieval tests using transfer modules (using a corpus of Internet documents indexed with Fulcrum SearchServer)
- Limitation: the weight information of the transfer relations is not used
- Tested transfer: SOLIS/IZ-Thesaurus to SoWi Internet documents/free terms
- Comparison: search using IZ-Thesaurus terms vs. search using free terms from the transfer
- 2 exemplary searches in each of 3 domains (women's studies, migration, sociology of industry)
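The tested transfer can be pictured as replacing each thesaurus term by the free terms related to it, without looking at relation weights, which matches the stated limitation. The sketch below is illustrative only. The function names and the AND/OR output syntax are assumptions, and the relation list uses the four transfer terms that the slides report for "Leiharbeit".

```python
from typing import Dict, List


def translate_term(term: str, relations: Dict[str, List[str]]) -> List[str]:
    """Return the transfer terms for a term; weights are deliberately ignored."""
    targets = relations.get(term, [])
    # A term without any transfer relation is passed through unchanged.
    return targets if targets else [term]


def translate_query(terms: List[str], relations: Dict[str, List[str]]) -> str:
    """Turn a conjunction of thesaurus terms into a free-term query string."""
    groups = []
    for term in terms:
        alternatives = translate_term(term, relations)
        groups.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(groups)


relations = {
    "Leiharbeit": [
        "Leiharbeit",
        "Arbeitsphysiologie",
        "Organisationsmodell",
        "Risikoabschätzung",
    ],
}
translated = translate_query(["Leiharbeit", "Migration"], relations)
```

A version that did use the weights could drop weak alternatives or pass the weights on to the search engine's ranking.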
25
Exemplary Search: "Dominanz"
- "Dominanz" ("dominance"): 16 relevant documents
- 10 transfer terms (Dominanz, Messen, Mongolei, Nichtregierungsorganisation, Flugzeug, Datenaustausch, Kommunikationsraum, Kommunikationstechnologie, Medienpädagogik, Wüste): 14 additional documents, 7 of them relevant (50 %, increase 44 %)
- Precision: 77 %
26
Exemplary Search: "Leiharbeit"
- "Leiharbeit" ("temporary work"): 10 relevant documents
- 4 transfer terms (Leiharbeit, Arbeitsphysiologie, Organisationsmodell, Risikoabschätzung): 10 additional documents, 2 of them relevant (20 %, increase 20 %)
- Precision: 60 %
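The percentages in the two exemplary searches can be reproduced with a few lines of arithmetic. The slides do not spell out the formulas, so this is an inferred reading: it assumes that the search without transfer returned only the relevant documents (16 and 10), which is the reading under which all reported figures come out. The function name transfer_metrics is hypothetical.

```python
from typing import Tuple


def transfer_metrics(
    base_relevant: int, added: int, added_relevant: int
) -> Tuple[int, int, int]:
    """Return (share of relevant among added, increase, precision) in percent."""
    # Share of the newly found documents that are relevant.
    share = added_relevant / added
    # Gain in relevant documents compared with the search without transfer.
    increase = added_relevant / base_relevant
    # Assumption: all documents of the base search were relevant.
    precision = (base_relevant + added_relevant) / (base_relevant + added)
    return round(share * 100), round(increase * 100), round(precision * 100)


dominanz = transfer_metrics(base_relevant=16, added=14, added_relevant=7)
leiharbeit = transfer_metrics(base_relevant=10, added=10, added_relevant=2)
# dominanz   -> (50, 44, 77)
# leiharbeit -> (20, 20, 60)
```

For "Dominanz" this gives 7/14 = 50 %, 7/16 = 44 % and 23/30 = 77 %. For "Leiharbeit" it gives 2/10 = 20 %, 2/10 = 20 % and 12/20 = 60 %.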
27
Results
- All exemplary searches using transfers lead to additional relevant documents compared with a search without transfer
- Share of relevant documents among all new documents: between 13 % and 55 %
- Transfer terms are not always self-evident (example: "Wüste" ("desert"))
- In some cases a very large number of transfer terms (user parametrization or better algorithms needed)
28
Outlook (What needs to be done?)
- Improvement of parallel corpora:
  - Kind of documents
  - Diversity of document types
  - Diversity of institutions / web sites
  - Domain
  - Corpus size
- Comparison of transfers using
  - statistical relations
  - intellectual relations
- Improvement of algorithms
- Effect of interactive, iterative retrieval and user parametrization / adjustment
- User tests
29
Exploitation
- Services (transfer)
- Software (Java classes)
- Projects:
  - Virtuelle Fachbibliothek Sozialwissenschaften (ViBSoz)
  - European Schools Treasury Browser (ETB)
  - Informationsverbund Bildung – Sozialwissenschaften – Psychologie (InfoConnex)
Contact: soe@bonn.iz-soz.de