Search Engine Technology for Digital Libraries: State of the Art and Future. 7th International Bielefeld Conference. Jürgen Oesterle

Presentation transcript:

Search Engine Technology for Digital Libraries: State of the Art and Future
7th International Bielefeld Conference
Jürgen Oesterle, Fast Search & Transfer Deutschland GmbH

Most prominent problems with digital libraries
"Multiple content sources" problem
- In a typical digital library, you have to provide a combined search over many different collections at a time.
- The format of the content varies between these collections.
- The availability of structure varies between these collections.
- The availability of external reference data varies between these collections.
- The availability of meta data varies between these collections.
- The kind of content might vary between these collections.
On these grounds, it is extremely difficult to rank documents from different content sources on an equal footing within one result set.

Most prominent problems with digital libraries
"Meta data" problem
- You can't tell how the meta data was generated (Author? Editor? Automatically assigned?).
- You can't tell in advance what meta data is available (title, author, keywords, date, publisher, place, etc.).
- You don't know the original purpose of the meta data (Quick summary for the reader? Condensed description? Normalization of content for search?).
- You can't assume uniform availability and quality of meta data, even within one collection.

Most prominent problems with digital libraries
"Distributed documents" problem
- Documents are often really hypertext, i.e. their parts are distributed over a site, with links between them.
"Multiple languages" problem
- Documents are often in many different languages.
"Availability of classification schemas" problem
- If classification is of interest (and of help while searching), the underlying classification taxonomies are not standardized across collections.

Most prominent problems with digital libraries
"Inaccurate queries" problem
- Users typically lack domain-specific knowledge.
- Users don't have the proper terminology to hand.
- Users don't include all potential synonyms and variations in the query.
- Users have a problem but aren't sure how to phrase it (i.e. how the same problem is phrased in the documents).
On these grounds, it is extremely difficult to provide a perfectly relevant result set as the first response. Intelligent suggestions for refinement or expansion are needed.

Technologies are underway to solve the problems
Meta data extraction
- Automatic extraction of keywords
- Structural analysis
- Normalization of existing meta data
- Use of external reference data & citation analysis
Automatic extraction of keywords:
- Part-of-speech tagging & normalization
- Extraction of specific syntactic patterns
- Statistical analysis of the extracted patterns
Example sentence: "Suffering from chronic rhinitis, the patient was treated"
Part-of-speech tags: Vpart Prep Adj N Det N Vcop Vpart
Score: log( P("chronic rhinitis") / ( P("chronic") * P("rhinitis") ) )
Result: identification of new terminology: "chronic rhinitis"
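The statistical step on this slide scores a candidate word pair by the log of its joint probability over the product of the individual word probabilities. A minimal sketch of that score follows, assuming plain relative frequencies as probability estimates. The function name `association_scores` and the toy corpus are invented for illustration, and the part-of-speech filtering step (keeping only patterns such as Adj N) is left out.

```python
import math
from collections import Counter

def association_scores(tokens):
    """Score each adjacent word pair by log(P(w1 w2) / (P(w1) * P(w2)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log(p_pair / (p_w1 * p_w2))
    return scores

# Toy corpus, invented for illustration only.
tokens = ("chronic rhinitis was treated . the patient had chronic rhinitis . "
          "the patient was treated .").split()
scores = association_scores(tokens)

# Pairs that co-occur more often than chance get a higher score.
print(scores[("chronic", "rhinitis")] > scores[("rhinitis", "was")])
```

On a real collection the counts would come from the whole index, and a threshold on the score would decide which pairs are kept as new terminology.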

Technologies are underway to solve the problems
Meta data extraction
- Automatic extraction of keywords
- Structural analysis
- Normalization of existing meta data
- Use of external reference data & citation analysis
Structural analysis, example document:
  Investigations in E. coli
  B. C. Abracadabra, Department of Molecular Medicine, University of Wisconsin
  S. Miheev, Analytical Laboratory, Russian Academy of Sciences, Moscow
  Journal of Cancer Research, Issue 5,
  Abstract
  1. Introduction
  2. Materials and Methods
  In this study we investigate...
Labels assigned to text blocks: journal title, article title, affiliation
Steps:
1. Analyse structure
2. Determine text block features
3. Classify text blocks
4. Apply structure grammar

Technologies are underway to solve the problems
Meta data extraction
- Automatic extraction of keywords
- Structural analysis
- Normalization of existing meta data
- Use of external reference data & citation analysis
Citation graph (nodes: Article 1 to Article 8)
- Infer the relative importance of Article 5.
- Use the textual context of each citation to obtain good descriptors of the cited article.
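Both ideas on this slide can be sketched in a few lines, assuming importance is approximated by the number of incoming citations and descriptors by the most frequent words in the citing sentences. The citation records, the stopword list, and the function names below are hypothetical. The slide does not show the actual edges of the graph.

```python
from collections import Counter

# Hypothetical records: (citing article, cited article, citing sentence).
CITATIONS = [
    ("Article 1", "Article 5", "stroke risk factors were analysed in [5]"),
    ("Article 2", "Article 5", "the cohort study on stroke risk [5]"),
    ("Article 3", "Article 5", "risk factors reported in [5]"),
    ("Article 4", "Article 6", "see [6] for methods"),
]
STOPWORDS = frozenset({"the", "in", "on", "were", "for", "see"})

def citation_counts(records):
    """Number of incoming citations per article (a simple importance proxy)."""
    return Counter(cited for _, cited, _ in records)

def context_descriptors(records, target, top=3):
    """Most frequent non-stopwords in sentences that cite `target`."""
    words = Counter()
    for _, cited, sentence in records:
        if cited != target:
            continue
        for raw in sentence.lower().split():
            word = raw.strip("[]0123456789.,")
            if word and word not in STOPWORDS:
                words[word] += 1
    return [word for word, _ in words.most_common(top)]

counts = citation_counts(CITATIONS)
descriptors = context_descriptors(CITATIONS, "Article 5")
print(counts["Article 5"], descriptors)
```

A production system would more likely use a link-analysis score than a raw count, but the raw count is enough to show where the signal comes from.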

Technologies are underway to solve the problems
Meta data extraction
- Automatic extraction of keywords
- Structural analysis
- Normalization of existing meta data
- Use of external reference data & citation analysis
Citation graph (nodes: Article 1 to Article 8)
- Infer the relatedness of Article 8 and Article 7 because they are cited by the same articles.
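Inferring relatedness from shared citing articles is usually called co-citation analysis. A minimal sketch follows, assuming each article's reference list is already available. The reference lists and the function name `cocitation_counts` are hypothetical, since the slide does not give the edges of its graph.

```python
from collections import Counter
from itertools import combinations

# Hypothetical reference lists: citing article -> articles it cites.
REFERENCES = {
    "Article 1": ["Article 7", "Article 8"],
    "Article 2": ["Article 7", "Article 8", "Article 5"],
    "Article 3": ["Article 5"],
}

def cocitation_counts(references):
    """Count how often each pair of articles is cited by the same article."""
    pairs = Counter()
    for cited in references.values():
        # Sort so that each pair has one canonical ordering.
        for a, b in combinations(sorted(set(cited)), 2):
            pairs[(a, b)] += 1
    return pairs

pairs = cocitation_counts(REFERENCES)
print(pairs[("Article 7", "Article 8")])
```

Pairs with a high count can then be offered as "related articles" even when their texts share few words.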

Technologies are underway to solve the problems
Equal ranking
- Test runs with representative queries
- Check typical ranking position per content source
- Assign static rank boosts per content source, based on results
Diagram: content sources A, B, C, D and E feed one retrieval engine. Source profiles shown:
- only abstracts; rich meta data; no external references
- few meta data; indexed in citation index; full text articles
- web data; hard to crawl, distributed documents; unreliable meta data; web anchor text as external reference
- full text documents; PDF, DOC -> conversion problems
- full text documents; indexed in citation index; rich meta data
Boost levels: high boost, medium boost, low boost
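The three steps for equal ranking can be sketched as follows. This is only an illustration under stated assumptions: the test runs are invented, the median is used as the "typical" position, and sources that typically land lower are assumed to receive the larger compensating boost. The boost values 1.0, 1.2 and 1.5 are arbitrary placeholders, not figures from the presentation.

```python
from statistics import median

# Hypothetical test runs: for each representative query, the content source
# of every result in rank order (position 1 first).
TEST_RUNS = [
    ["C", "C", "B", "A"],
    ["C", "B", "A", "A"],
    ["C", "B", "C", "A"],
]
LEVELS = (("low", 1.0), ("medium", 1.2), ("high", 1.5))  # placeholder values

def typical_positions(test_runs):
    """Median ranking position per content source across all test runs."""
    positions = {}
    for ranked_sources in test_runs:
        for position, source in enumerate(ranked_sources, start=1):
            positions.setdefault(source, []).append(position)
    return {source: median(found) for source, found in positions.items()}

def assign_boosts(typical, levels=LEVELS):
    """Give sources with worse typical positions a higher static boost."""
    ordered = sorted(typical, key=typical.get)  # best typical position first
    boosts = {}
    for index, source in enumerate(ordered):
        level = min(index * len(levels) // len(ordered), len(levels) - 1)
        boosts[source] = levels[level]
    return boosts

boosts = assign_boosts(typical_positions(TEST_RUNS))
print(boosts)
```

The boost is static: it is computed once from the test runs and then applied to every document of that source at query time.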

Technologies are underway to solve the problems
Proper treatment of queries
- Deal with orthographic variation
- Deal with morphological variation
- Deal with vocabulary variation
- Deal with special-interest queries (e.g. restrict to user homepages, find definitions, narrow down to articles)
Example query variants: Cerebral infarct; Cerebral infarcts; Apoplexy; Apoplectic insult; Stroke; "Cerebral infarct"; Cerebral infarkt; Serebral infarct; Cetebral ingarct; Cerebral disease; Infarction; Cerebral infarct / medicine; Cerebral infarct / biology; Cerebral infarct / conferences; Infarctus cérébral
Techniques: Phrasing; Doc type classification; Spellchecking; Synonymy; Thesaurus support; Refinement; Character normalization; Lemmatization; Topic classification; Ambiguous queries
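Three of the techniques named on this slide (character normalization, spellchecking and synonym expansion) can be chained in a small sketch. It uses only the Python standard library. The vocabulary, the synonym table and the function names are hypothetical stand-ins for a real dictionary and thesaurus, and the similarity cutoff of 0.7 is an arbitrary choice.

```python
import difflib
import unicodedata

# Hypothetical index vocabulary and thesaurus, for illustration only.
VOCABULARY = ["cerebral", "infarct", "infarction", "stroke", "apoplexy"]
SYNONYMS = {"cerebral infarct": ["stroke", "apoplexy", "apoplectic insult"]}

def normalize_chars(text):
    """Lowercase the text and strip diacritics (e.g. accented e becomes e)."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()

def spellcheck(word, vocabulary=VOCABULARY):
    """Replace a word by its closest vocabulary entry, if one is close enough."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.7)
    return matches[0] if matches else word

def process_query(query):
    """Return the corrected query followed by its thesaurus synonyms."""
    words = [spellcheck(word) for word in normalize_chars(query).split()]
    corrected = " ".join(words)
    return [corrected] + SYNONYMS.get(corrected, [])

print(process_query("Serebral infarkt"))
```

The expanded list could either be searched directly or shown to the user as refinement suggestions.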

Technologies are underway to solve the problems
Smart data aggregation
Example article page:
  Investigations in E. coli
  B. C. Abracadabra [author info]
  S. Miheev [author info]
  Journal of Cancer Research, Issue 5,
Navigation links on the page: Journal contents, Current Issue, This issue, Personal Profile
Links to document parts: Abstract, Introduction, Chapter 1, Chapter 2, Chapter 3, Chapter 4
While crawling for documents:
- recognize links that point to "other parts of the document"
- follow these links and put together a complete document.
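The crawling idea on this slide can be illustrated with a small sketch, assuming that links to document parts can be recognized by their label alone. The page data, the URL paths, the label pattern and the function name `assemble_document` are all hypothetical. A real crawler would likely combine several signals, such as URL structure and link position.

```python
import re

# Hypothetical crawl result: URL -> (page text, [(link label, target URL), ...]).
PAGES = {
    "/article": ("Investigations in E. coli", [
        ("Abstract", "/article/abstract"),
        ("Chapter 1", "/article/ch1"),
        ("Current Issue", "/journal/current"),
        ("Personal Profile", "/profile"),
    ]),
    "/article/abstract": ("Abstract text", []),
    "/article/ch1": ("Chapter 1 text", []),
    "/journal/current": ("Table of contents", []),
    "/profile": ("Login form", []),
}

# Labels assumed to point to other parts of the same document.
PART_LABEL = re.compile(r"(abstract|introduction|chapter \d+)", re.IGNORECASE)

def assemble_document(pages, start):
    """Follow part links from the start page and join the texts in link order."""
    text, links = pages[start]
    parts = [text]
    seen = {start}
    for label, target in links:
        is_part = PART_LABEL.fullmatch(label.strip()) is not None
        if is_part and target in pages and target not in seen:
            seen.add(target)
            parts.append(pages[target][0])
    return "\n\n".join(parts)

document = assemble_document(PAGES, "/article")
print(document)
```

Navigation links such as "Current Issue" are skipped, so the indexed unit is the whole article rather than one page fragment.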

Technologies are underway to solve the problems
Pipeline components: Doc, Crawling, Document processing, INDEX, Query, Query processing, Result processing, Results, Citation index
Technologies applied along the pipeline:
- Smart data aggregation (e.g. restoring distributed documents)
- Advanced linguistic processing (e.g. terminology extraction, classification, structural analysis)
- Proper treatment of queries (e.g. covering morphol.+semant. variation)
- Query refinement suggestions (e.g. covering morphol.+semant. variation)

Scirus

Evolution of Digital Libraries
Traditional DL
- Storage: database
- Content: pure predefined meta data
- Matching: exact match
- Data is: heterogeneous, not normalized, incomplete, unreliable
Full text search engine
- Storage: inverted index
- Content: full text
- Matching: exact match
- Data is: heterogeneous, not normalized, redundant, unreliable
Next generation DL
- Storage: inverted index + linguistics + smart data aggregation
- Content: extracted information
- Matching: fuzzy search
- Data is: homogeneous, auto-normalized, auto-completed, reliable