Technology for integrated access and discovery Presented by: Marc Krellenstein Title: VP, Search and Discovery, Advanced Technology Group Date: February 5, 2004

Similar presentations
Alternatives to Federated Search - Presented by: Marc Krellenstein Date: July 29, 2005.

Data Science for Business: Semantic Verses Dr. Brand Niemann Director and Senior Data Scientist Semantic Community
Summon: Web-scale discovery. Agenda Web-scale Discovery Defined How Summon Works Summon User Experience (live demonstration) Additional Resources.
Information and Business Work
Information Retrieval in Practice
Best Web Directories and Search Engines Order Out of Chaos on the World Wide Web.
Basic IR: Queries Query is statement of user’s information need. Index is designed to map queries to likely to be relevant documents. Query type, content,
1.)Please visit to begin this tutorial. Note: You must register with MY NCBI before beginning tutorial. Registration is free.
Modern Information Retrieval Chapter 2 Modeling. Can keywords be used to represent a document or a query? keywords as query and matching as query processing.
Learn how to search for information the smart way Choose your own adventure!
Eric Sieverts University Library Utrecht IT Department Institute for Media & Information Management (Hogeschool van Amsterdam)
Information retrieval: overview. Information Retrieval and Text Processing Huge literature dating back to the 1950’s! SIGIR/TREC - home for much of this.
Information Retrieval
Overview of Search Engines
Ihr Logo Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization Turban, Aronson, and Liang.
Bio-Medical Information Retrieval from Net By Sukhdev Singh.
Basics of Information Retrieval Lillian N. Cassel Some of these slides are taken or adapted from Source:
Web Searching Basics Dr. Dania Bilal IS 530 Fall 2009.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
NCSU Libraries Kristin Antelman NCSU Libraries June 24, 2006.
SUMMON ® 2.0 DISCOVERY REINVENTED. What is Summon 2.0? A new, streamlined, modern interface New and enhanced features providing layers of contextual guidance.
IL Step 2: Searching for Information Information Literacy 1.
Medline on OvidSP. Medline Facts Extensive MeSH thesaurus structure with many synonyms used in mapping and multidatabase searching with Embase Thesaurus.
IL Step 3: Using Bibliographic Databases Information Literacy 1.
XP New Perspectives on The Internet, Sixth Edition— Comprehensive Tutorial 3 1 Searching the Web Using Search Engines and Directories Effectively Tutorial.
The Internet 8th Edition Tutorial 4 Searching the Web.
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
Electronic Scriptorium, Ltd. AIIM Minnesota Chapter Metadata and Taxonomy Presentation Copyright Electronic Scriptorium, Ltd. All rights reserved, 1991.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
2007. Software Engineering Laboratory, School of Computer Science S E Web-Harvest Web-Harvest: Open Source Web Data Extraction tool 이재정 Software Engineering.
The Anatomy of a Large-Scale Hypertextual Web Search Engine S. Brin, L. Page Presenter: Abhishek Taneja.
4 1 SEARCHING THE WEB Using Search Engines and Directories Effectively New Perspectives on THE INTERNET.
Data Mining for Web Intelligence Presentation by Julia Erdman.
WIRED Week 3 Syllabus Update (next week) Readings Overview - Quick Review of Last Week’s IR Models (if time) - Evaluating IR Systems - Understanding Queries.
Retrieval 1/2 BDK12-5 Information Retrieval William Hersh, MD Department of Medical Informatics & Clinical Epidemiology Oregon Health & Science University.
Building a Topic Map Repository Xia Lin Drexel University Philadelphia, PA Jian Qin Syracuse University Syracuse, NY * Presented at Knowledge Technologies.
1 Internet Research Third Edition Unit A Searching the Internet Effectively.
WEB 2.0 PATTERNS Carolina Marin. Content  Introduction  The Participation-Collaboration Pattern  The Collaborative Tagging Pattern.
Data Integration Hanna Zhong Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009.
Internet Research – Illustrated, Fourth Edition Unit A.
1 Smart Searching Techniques Fall 2006 the Library.
Information Retrieval Transfer Cycle Dania Bilal IS 530 Fall 2007.
Steve Cassidy Computing at MacquarieNo 1 Searching The Web Steve Cassidy Centre for Language Technology Department of Computing Macquarie University.
Discovery Tool Implementation: UGA Bill Clayton Assistant University Librarian for Systems University of Georgia Libraries GUGM, Macon State, May.
Jean-Yves Le Meur - CERN Geneva Switzerland - GL'99 Conference 1.
Semantic (web) activity at Elsevier Marc Krellenstein VP, Search and Discovery Elsevier October 27, 2004
Selecting Relevant Documents Assume: –we already have a corpus of documents defined. –goal is to return a subset of those documents. –Individual documents.
DbWiz Federated Search Tool Demo to Staff Carol MacDonald April 4, 2007.
Chapter 20 Asking Questions, Finding Sources. Characteristics of a Good Research Paper Poses an interesting question and significant problem Responds.
CS 440 Database Management Systems Web Data Management 1.
Research Methods in Business and Economics4 Jan Brzozowski, PhD.
Next generation search Marc Krellenstein VP, Search and Discovery Elsevier August 23, 2004
Information Retrieval in Practice
Summon® 2.0 Discovery Reinvented
Education 499-R01 Search Basics.
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
PubMed Database Interface (Basic Course Module 4 Part A)
Eric Sieverts University Library Utrecht Institute for Media &
Internet Research Third Edition
Information Retrieval
Introduction to Search Engines
Data Mining Chapter 6 Search Engines
Review Key Teaching Points
IL Step 3: Using Bibliographic Databases
IL Step 2: Searching for Information
Introduction to Information Retrieval
Introduction to Search Engines
Citation databases and social networks for researchers: measuring research impact and disseminating results - exercise Elisavet Koutzamani
Presentation transcript:

Technology for integrated access and discovery
Presented by: Marc Krellenstein
Title: VP, Search and Discovery, Advanced Technology Group
Date: February 5, 2004

Basic search is pretty good
Modern search engines are fast and scalable
– Having the data (usually lots of it) is still key
Can interpret keyword, Boolean and pseudo-natural-language queries
– Ex: "how to make an international call with my Blackberry"
Spell checking, thesauri and stemming improve recall
Users are more experienced
– More multi-term searches
Searches return lots of hits, but that's usually OK if the good ones are on top

Basic search is pretty good
Best-practice relevancy ranking is good:
– Term frequency (TF): more hits count more
– Inverse document frequency (IDF): hits of rarer search terms count more
  Ex: diabetes diagnosis and treatment
– Hits of search terms near each other count more
  Ex: penicillin allergy vs. "penicillin allergy"
– Hits on metadata (title, subject, etc.) count more
  Use anchor text – referring text – as metadata
– Items with more links/references to them count more
  Authoritative links/referrers count yet more
– Many other factors: length, date, etc.
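To make the TF and IDF factors concrete, here is a minimal Python sketch of TF-IDF scoring over a toy in-memory corpus (the documents, tokenization and smoothing constant are illustrative; a production engine would layer proximity, metadata and link-based boosts on top):

```python
import math
from collections import Counter

# Toy corpus; real engines index millions of documents.
docs = {
    "d1": "diabetes diagnosis and treatment of type 2 diabetes",
    "d2": "treatment of penicillin allergy",
    "d3": "diagnosis of seasonal allergy",
}

tokenized = {d: text.lower().split() for d, text in docs.items()}
N = len(tokenized)

def idf(term):
    # Rarer terms get higher weight: hits of rare search terms count more.
    df = sum(1 for toks in tokenized.values() if term in toks)
    return math.log((N + 1) / (df + 1)) + 1  # smoothed

def score(query, doc_id):
    toks = tokenized[doc_id]
    tf = Counter(toks)
    # TF * IDF summed over query terms: more hits count more,
    # and hits of rarer terms count more.
    return sum((tf[t] / len(toks)) * idf(t) for t in query.lower().split())

query = "diabetes treatment"
for doc_id in sorted(tokenized, key=lambda d: score(query, d), reverse=True):
    print(doc_id, round(score(query, doc_id), 3))
```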

Basic search is pretty good
Using these techniques, search engines can locate specific documents, or good documents (if not the absolute best) around general or specific topics
But challenges remain…

Current challenges
Integrated search: content still exists in separate silos
– Silos getting bigger, but there are still too many
– Library patrons have dozens of choices
– Putting even more into Google is probably not sufficient to solve the problem
Finding the best/novel documents
– Hard to perform complicated searches (e.g., research similar to one's own)
– Historians can't define a profile…
Discovery
– Hard to do more than search: summarize, uncover novelty and relationships, analyze

The integration challenge
Two approaches:
– Build even bigger databases (well, yes…)
  Not easy, but sometimes the easiest approach
  Can be difficult to manage and secure appropriate rights
– Distribute search: search separately managed (or owned) large databases as if they are one
  Technically more challenging, but a scalable and maintainable architecture

Distributed search
Index multiple (maybe geographically) separate databases with a single search engine that supports distributed search
– Use a common metadata scheme (e.g., Dublin Core) and/or determine other common fields or field mappings for each database
– Search engine provides parallel search, integrated ranking and integrated results
– The separate databases can be maintained and updated separately
– Elsevier is currently unifying its own sources in such a model with a 'web service' architecture
  Has contributed specifications to the public domain
– Such services can also be offered externally
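A minimal sketch of the parallel-search-and-merge idea, assuming each database exposes a comparable search interface and scores that can be merged directly (the Shard class, shard names and documents are hypothetical, not Elsevier's actual architecture):

```python
from concurrent.futures import ThreadPoolExecutor

class Shard:
    """One separately maintained database behind a common search interface."""
    def __init__(self, name, docs):
        self.name = name
        self.docs = docs  # {doc_id: text}

    def search(self, query):
        terms = query.lower().split()
        results = []
        for doc_id, text in self.docs.items():
            score = sum(text.lower().split().count(t) for t in terms)
            if score:
                results.append((f"{self.name}:{doc_id}", score))
        return results

shards = [
    Shard("journals", {"a1": "gene p53 and cancer", "a2": "HIV research trends"}),
    Shard("abstracts", {"b1": "p53 tumor suppressor gene"}),
]

def distributed_search(query):
    # Query every shard in parallel, then merge into one integrated ranking.
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda s: s.search(query), shards)
    merged = [hit for part in partials for hit in part]
    return sorted(merged, key=lambda hit: hit[1], reverse=True)

print(distributed_search("p53 gene"))
```

Merging rankings fairly requires globally consistent statistics (e.g., shared IDF values across shards), which is exactly what one engine over distributed indexes can provide and federated search cannot.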

Distributed search
Simplifies some business issues, but still requires a common technology platform
Where a common platform is not possible, add federated search (i.e., metasearch)
– Translate queries
– Access and perform parallel search of multiple search engines (vs. multiple databases)
– Integrate results as best as possible
– Use standards to approximate distributed search
  Uniform access, one query language (Z39.50, updated)
  Add standards for relevancy ranking and results return?
  NISO and its members are working on standards
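A minimal sketch of the metasearch pattern, assuming two hypothetical engines with different query syntaxes and result formats; the translators, stand-in engines and 1/rank normalization are all illustrative, which is exactly why standards for relevancy ranking and results return would help:

```python
def translate_boolean(query_terms):
    # Engine A expects explicit Boolean AND syntax.
    return " AND ".join(query_terms)

def translate_fielded(query_terms):
    # Engine B expects fielded key=value pairs.
    return "&".join(f"kw={t}" for t in query_terms)

def engine_a(q):  # stand-in for one remote search engine
    return [{"id": "A-17", "score": 0.9, "title": "p53 review"}]

def engine_b(q):  # stand-in for another engine that returns ranks, not scores
    return [{"id": "B-3", "rank": 1, "title": "p53 and apoptosis"}]

def federated_search(terms):
    results = engine_a(translate_boolean(terms)) + [
        # Normalize engine B's ranks into pseudo-scores as best as possible.
        {"id": r["id"], "score": 1.0 / r["rank"], "title": r["title"]}
        for r in engine_b(translate_fielded(terms))
    ]
    return sorted(results, key=lambda r: r["score"], reverse=True)

print(federated_search(["p53", "apoptosis"]))
```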

Finding the best: Navigation
More data can also make finding the best or novel documents harder
– For searches for rare items, more data is a win
– For all other searches, it's more likely your answer is in there…but it's also more likely there's lots of other stuff close but not as good
Why? Relevancy is good but…
Relevancy has its limits…there may be many 'good' documents referring to different aspects of the search…the best?
Underlying problems:
– User's needs may not be that specific
– Even long searches are under-specified

One solution: clustering documents
Group results around common themes: same subject, author, web site, journal,…
Show largest/most interesting categories
Depression → psychology, economics, meteorology, antiques…
– Psychology → treatment of depression, depression symptoms, seasonal affective…
– Psychology → Kocsis, J. (10), Berg, R. (8), …
Themes could come from static metadata or dynamically by analysis of results text
– Static: fixed, clear categories and assignments
– Dynamic: doesn't require metadata (or controlled vocabulary to draw from)
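A minimal sketch of both approaches, assuming hits carry subject metadata for the static case; the hits, subjects and stopword list are toy examples:

```python
from collections import Counter, defaultdict

hits = [
    {"title": "Treating depression with SSRIs", "subject": "psychology"},
    {"title": "Seasonal affective disorder", "subject": "psychology"},
    {"title": "The Great Depression and trade", "subject": "economics"},
    {"title": "Depression-era glass collecting", "subject": "antiques"},
]

# Static clustering: group on fixed, clear categories and assignments.
by_subject = defaultdict(list)
for h in hits:
    by_subject[h["subject"]].append(h["title"])
for subject, titles in sorted(by_subject.items(), key=lambda kv: -len(kv[1])):
    print(f"{subject} ({len(titles)}): {titles}")

# Dynamic clustering sketch: derive candidate themes from the result text
# itself, so no metadata or controlled vocabulary is required.
stopwords = {"the", "and", "with", "of", "era"}
terms = Counter(
    w for h in hits for w in h["title"].lower().replace("-", " ").split()
    if w not in stopwords
)
print("candidate themes:", terms.most_common(3))
```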

Clustering benefits
Disambiguates and refines search results to get to documents of interest quickly
Can navigate long result lists hierarchically
– Would never offer thousands of choices to choose from as input…
– Access to bottom of list…maybe just less common
Discovery – new aspects or sources
Can narrow results *after* search
– Start with the broadest area search – don't narrow by subject or other categories first
– Easier, plus can't guess wrong, miss useful, or pick unneeded, categories…results-driven
  Knee surgery → cartilage replacement, plastics, …

Finding the best: Complex search
Main problem is still short searches/under-specification…which the keyword-based 'enter a query' paradigm encourages
One solution: Relevance feedback – marking good and bad results
– A long-standing and proven search refinement technique
– More information is better than less (longer queries are better)
– Pseudo-relevance feedback is a research standard
Commercial forms – find-similar, etc. – not widely used (or well executed)…
…but successful in PubMed (different users)
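A minimal sketch of the classic Rocchio formulation of relevance feedback, assuming query and documents are already reduced to term-weight vectors; the alpha/beta/gamma values shown are conventional defaults, not prescribed ones:

```python
# Rocchio: move the query vector toward documents the user marked good
# and away from documents the user marked bad.
def rocchio(query, good_docs, bad_docs, alpha=1.0, beta=0.75, gamma=0.15):
    terms = (set(query)
             | {t for d in good_docs for t in d}
             | {t for d in bad_docs for t in d})
    new_query = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if good_docs:
            w += beta * sum(d.get(t, 0.0) for d in good_docs) / len(good_docs)
        if bad_docs:
            w -= gamma * sum(d.get(t, 0.0) for d in bad_docs) / len(bad_docs)
        if w > 0:  # negative weights are usually dropped
            new_query[t] = round(w, 3)
    return new_query

query = {"penicillin": 1.0, "allergy": 1.0}
good = [{"penicillin": 0.8, "allergy": 0.9, "rash": 0.5}]
bad = [{"penicillin": 0.7, "production": 0.9}]
print(rocchio(query, good, bad))  # expands toward "rash", away from "production"
```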

Relevance feedback
One catch: must first find a good document to be similar to
Solution: Let the user provide the ideal document – or a long query or problem statement – as input in the first place
– Can enter free text or specific documents describing the interest, e.g., article, grant proposal, experiment description, etc.
– Should provide the best possible matches
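One hedged sketch of document-as-query: reduce the supplied text to its most frequent content terms and use those as a long query (the abstract, stopword list and top_k cutoff are illustrative; a real system would weight terms by IDF as well):

```python
from collections import Counter

stopwords = {"the", "of", "and", "in", "a", "to", "we", "is", "for", "are", "its"}

def document_as_query(text, top_k=8):
    # Keep only the highest-frequency content words as the query terms.
    words = [w.strip(".,").lower() for w in text.split()]
    counts = Counter(w for w in words if w not in stopwords and len(w) > 2)
    return [term for term, _ in counts.most_common(top_k)]

abstract = ("We study the p53 tumor suppressor gene and its role in "
            "apoptosis. Mutations of p53 are common in many cancers, and "
            "p53 pathways are a target for cancer therapy.")
print(document_as_query(abstract))  # p53 dominates, as it should
```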

Discovery challenge: Beyond search
How do you summarize a corpus?
– May want to report on what's present, numbers of occurrences, trends, etc.
– Ex: What diseases are studied the most?
– Must know all diseases and look one by one
How do you find a relationship if you don't know what relationships exist?
– Ex: Does gene p53 relate to any disease?
– Must check for each possible relationship
Ad hoc analysis
– How do all genes relate to this one disease? Over time? What organisms have the gene been studied in? Show me the document evidence…

One solution: entity extraction
Identify entities (things) in a text corpus
– Examples: authors, universities…diseases, drugs, side-effects, genes…companies, lawsuits, plaintiffs, defendants…
– Use lexicons, patterns, NLP for finding any or all instances of the entity
Identify relationships:
– Through co-occurrence
  Relationship presumed from proximity
  Example: author-university affiliation
– Through limited natural language processing
  Semantic relations – causes, is-part-of, etc.
  Examples: drug-causes-disease…drug-is-treatment-for-disease…a is suing b…
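A minimal sketch of the lexicon-plus-co-occurrence approach, assuming sentence-level proximity implies a candidate relationship; the gene and disease lexicons and the sentences are toy examples:

```python
from collections import Counter
from itertools import product

genes = {"p53", "brca1"}
diseases = {"leukemia", "alzheimer's", "cancer"}

sentences = [
    "Mutations in p53 are frequent in leukemia.",
    "brca1 is linked to breast cancer.",
    "p53 expression was also measured in cancer cell lines.",
]

cooccurrence = Counter()
for sent in sentences:
    words = {w.strip(".,").lower() for w in sent.split()}
    found_genes = genes & words        # lexicon lookup for gene mentions
    found_diseases = diseases & words  # lexicon lookup for disease mentions
    for g, d in product(found_genes, found_diseases):
        cooccurrence[(g, d)] += 1      # relationship presumed from proximity

for (gene, disease), n in cooccurrence.most_common():
    print(f"{gene} – {disease}: {n} co-occurrence(s)")
```

Counting these pairs across a whole corpus is what turns search into summary: the same table answers "what diseases are studied the most?" and "does gene p53 relate to any disease?" without checking each relationship one by one.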

ClearForest pilot, Fall 2002
Goal: Demonstrate real value to a working expert in 90 days
Chose biomedical domain
Hired an expert to help define entities and relationships
Used 25,000 abstracts from 23 Elsevier journals
Worked with ClearForest to define and revise extraction of entities and relationships
Have a related partnership with Stanford for text mining

Pilot scenarios
Answered real questions using real data – not a demo or mock-up
The user:
– anyone involved in genomic academic research: a primary researcher, graduate student or post-doc
Scenario 1: Research about gene p53
– What journals should I publish in?
– Who's an expert I can ask for advice?
– What connections have been made to my gene?
– What organisms have my gene?

What journals should I publish in?

Who’s an expert?

Connections to p53?

To organisms?

Pilot scenarios
Scenario 2: Disease research
– What diseases are most researched?
– What's the time trend in HIV research?
– What are the centers of HIV research?
– Who are the author teams in HIV?
– What gene-disease relationships are there? What were they to start in 1996? Through 1997?
– (Note: Cannot answer the above with search alone)

What diseases are most researched?

Time trend in HIV research?

Centers of HIV research?

Author teams in HIV research?

Gene-disease relationships?

To start, in 1996?

Through 1997?

Pilot scenarios
Scenario 3: Connections between leukemia and Alzheimer's
– Are there direct connections between leukemia and Alzheimer's?
– What enzymatic activity is associated with leukemia?
– Are there indirect connections between leukemia and Alzheimer's mediated by enzymatic activity?

Direct connections between leukemia and Alzheimer’s?

Enzymes associated with leukemia?

Indirect links from leukemia to Alzheimer’s via enzymes

The power of indirect links
Almost impossible to determine manually
Can provide completely unexpected relationships between source and target
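A minimal sketch of why indirect links are hard to spot manually but easy to compute: intersect each endpoint's related entities to find the mediating bridges (the disease-enzyme links here are hypothetical):

```python
# Extracted relationships, e.g., from disease-enzyme co-occurrence counts.
disease_enzyme = {
    "leukemia": {"caspase-3", "telomerase"},
    "alzheimer's": {"caspase-3", "acetylcholinesterase"},
}

def indirect_links(source, target, links):
    # Entities related to both endpoints mediate an indirect connection.
    return links[source] & links[target]

print(indirect_links("leukemia", "alzheimer's", disease_enzyme))
# -> {'caspase-3'}: an unexpected bridge to verify in the document evidence
```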

The value of analytics
Goes beyond search – summarizes, shows relationships, answers complex questions
A significant value-added service
– Value of one new drug discovery?

Summary
Need to search more broadly, more easily
– Larger databases
– Distributed search
Need to locate best/novel documents in even larger (distributed) databases
– Clustering to find documents of real interest
– Find-similar, descriptive search
Need to go beyond search for overviews, relationships and discovery
– Text-based data mining and entity extraction