Presentation is loading. Please wait.

Presentation is loading. Please wait.

Search Engine Technology for Digital Libraries State of the Art and Future 7th International Bielefeld Conference Jürgen Oesterle

Similar presentations


Presentation on theme: "Search Engine Technology for Digital Libraries State of the Art and Future 7th International Bielefeld Conference Jürgen Oesterle"— Presentation transcript:

1 Search Engine Technology for Digital Libraries State of the Art and Future 7th International Bielefeld Conference Jürgen Oesterle Juergen.oesterle@fastsearch.com Fast Search & Transfer Deutschland GmbH

2 Most prominent problems with digital libraries „Multiple content sources“ problem In a typical digital library, you have to provide a combined search on many different collections at a time The format of the content varies between these collections The availability of structure varies between these collections The availability of external reference data varies between these collections The availability of meta data varies between these collections The kind of content might vary between these collections On these grounds, it‘s extremely difficult to provide equal ranking among the documents in a results set, coming from different content sources.

3 Most prominent problems with digital libraries „Meta data“ problem You can‘t tell how the meta data was generated (Author? Editor? Automatically assigned?) You can‘t tell in advance what meta data is available (Title, author, keywords, date, publisher, place, etc.) You don‘t know the original purpose of the meta data (Quick summary for reader? Condensed description? Normalization of content for search?) You can‘t assume uniform availability and quality of meta data even on one collection

4 Most prominent problems with digital libraries „Distributed documents“ problem Documents are often really hypertext, i.e. their parts are distributed over a site, with links between them „Multiple languages“ problem Documents are often in many different languages „Availability of classification schemas“ problem If classification is of interest (and of help while searching), the underlying classification taxonomies are not standardized across collections

5 Most prominent problems with digital libraries „Inaccurate queries“ problem Users typically lack domain specific knowledge Users don‘t have proper terminology to hand Users don‘t include all potential synonyms and variations in the query Users have a problem but aren‘t sure how to phrase it (i.e. how the same problem is phrased in the documents) On these grounds, it‘s extremely difficult to provide a perfectly relevant result set as first response. Intelligent suggestions for refinement or expansion are needed.

6 Technologies are underway to solve the problems.. Meta data extraction Automatic extraction of keywords Structural analysis Normalization of existing meta data Use external reference data & citation analysis Part of speech tagging & normalization Extraction of specific syntactic patterns Statistic analysis of the extracted patterns Suffering from chronical rhinitis, the patient was treated...... Vpart Prep Adj N Det N Vcop Vpart P(„chronical rhinitis“) P(„chronical “) * P(„rhinits“) log Identification of new terminologychronical rhinitis

7 Technologies are underway to solve the problems.. Meta data extraction Automatic extraction of keywords Structural analysis Normalization of existing meta data Use external reference data & citation analysis Investigations in E. coli B. C. Abracadabra Department of Molecular Medicine University of Wisconsin S. Miheev Analytical Laboratory Russian Academy of Scieneces Moscow Journal of Cancer Research Issue 5, 2003 -12 Abstract 1. Introduction 2. Materials and Methods In this study we investigate……… Investigations in E. coli B. C. Abracadabra Department of Molecular Medicine University of Wisconsin S. Miheev Analytical Laboratory Russian Academy of Scieneces Moscow Journal of Cancer Research Issue 5, 2003 -12 Abstract 1. Introduction 2. Materials and Methods In this study we investigate……… Journal title Article title Affiliation 1. Analyse structure 2. Determine text block features 3. Classify text blocks 4. Apply structure grammar

8 Technologies are underway to solve the problems.. Meta data extraction Automatic extraction of keywords Structural analysis Normalization of existing meta data Use external reference data & citation analysis Artikel 1 Artikel 6 Artikel 7 Artikel 8 Artikel 2 Artikel 5 Artikel 4 Artikel 3 Citation graph Infer relative importance of Article 5 and Use textual context of citation to obtain good descriptors of it

9 Technologies are underway to solve the problems.. Meta data extraction Automatic extraction of keywords Structural analysis Normalization of existing meta data Use external reference data & citation analysis Artikel 1 Artikel 6 Artikel 7 Artikel 8 Artikel 2 Artikel 5 Artikel 4 Artikel 3 Citation graph Infer relatedness of Article 8 and Article 7 because they are cited by the same articles

10 Technologies are underway to solve the problems.. Equal ranking Test runs with representative queries Check typical ranking position per content source Assign static rank boosts per content source, based on results Retrieval Engine Content source A Content source E Content source D Content source B Content source C only abstracts rich meta data no external references few meta data indexed in citation index full text articles web data hard to crawl, distributed documents unreliable meta data web anchor text as external reference full text documents PDF, DOC  conversion problems full text documents indexed in citation index rich meta data High boost Medium boost Low boost

11 Technologies are underway to solve the problems.. Proper treatment of queries Deal with orthographic variation Deal with morphological variation Deal with vocabulary variation Deal with special-interest queries (e.g. restrict on user homepages, find definitions, narrow down on articles) Cerebral infarct Cerebral infarcts Apoplexy Apoplectic insult Stroke “Cerebral infarct” Cerebral infarkt Serebral infarct Cetebral ingarct Cerebral disease Infarction Cerebral infarct / medicine Cerebral infarct / biology Cerebral infarct / conferences Infarctus cérébral Phrasing Doc type classification Spellchecking SynonymyThesaurus support Refinement Character normalization Lemmatization Topic classification Ambigue queries

12 Technologies are underway to solve the problems.. Investigations in E. coli B. C. Abracadabra [author info] S. Miheev [author info] Journal of Cancer Research Issue 5, 2003 -12 Journal contents Current Issue This issue Personal Profile While crawling for documents Abstract Chapter 1 Chapter 2 Chapter 3 Chapter 4 Introduction recognize links that point to „other parts of the document“ Abstract Introduction Chapter 1 Chapter 2 Chapter 3 Chapter 4 follow these links and put together a complete document. Smart data aggregation

13 Technologies are underway to solve the problems.. Crawling Document processing INDEX Results Query processing Result processing Query Doc Smart data aggregation (e.g. restoring distributed documents) Advanced linguistic processing (e.g. terminology extraction, classification, structural analysis) Proper treatment of queries (e.g. covering morphol.+semant. variation) Query refinement suggestions (e.g. covering morphol.+semant. variation) Citation index

14 Scirus

15

16

17 Evolution of Digital Libraries Traditional DL Full text search engine Next generation DL data base pure predefined meta data exact match data is heterogenuous not normalized incomplete unreliable inverted index full text exact match data is heterogenuous not normalized redundant unreliable inverted index + linguistics + smart data aggregation extracted information fuzzy search data is homogenuous auto-normalized auto-completed reliable


Download ppt "Search Engine Technology for Digital Libraries State of the Art and Future 7th International Bielefeld Conference Jürgen Oesterle"

Similar presentations


Ads by Google