

1 Major Issues
• Information is mostly online
• Information is increasingly available in full-text (full-content)
• There is an explosion in the amount of information being produced.

2 Manual Indexing
• Humans read and index content.
• Fairly good, although not consistent (interobserver, or even intraobserver).
• Certain fields support costly manual indexing (primary example is Medline).

3 Major Issues
• This is a problem for all fields unable to afford manual indexing, and even for biomedicine: the huge amount of literature being produced holds so much knowledge that we cannot keep track of it or use it.
• Research example: Swanson’s undiscovered literature

4 What this means
• Need ways to index without requiring paid experts
  – Automatic indexing, classification, keyword extraction, and even relationship and fact extraction.
  – Need to take advantage of experts who are reading the materials to comment on them and provide rankings, summarizations, keywords, “factoids” (like Amazon).

5 Why Automatic Classification?
• Classification is time consuming and expensive
• Knowledge structuring
  – Too much information
• Status of automatic classification
  – Approaching level of human indexing (NLM’s MetaMap).

6 What is Automatic Classification?
• Automatic manipulation of a document’s contents to support logical grouping with other similar documents for organization and/or retrieval activities. Can include the assignment of, or manipulation of, classification notation.

7 Approaches and Methods
• Initial approach
  – Create an inverted file
  – On-the-fly (natural language processing)
• Methods
  – All words, remove stop words
  – Word frequencies (Wilson’s objective method of determining aboutness)
  – More sophisticated IR methods: semantic/linguistic analysis, co-occurrence/similarity measures, etc.
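The two simplest methods on this slide (keep all words, drop stop words, then count word frequencies) can be sketched in a few lines. This is a minimal illustration, not the method used in any system named in the slides; the stop-word list, the tokenizer, and the sample text are assumptions made up for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real systems use longer ones.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "with"}


def index_terms(text, top_n=3):
    """Return the top_n most frequent non-stop-word terms with their counts."""
    # Lowercase and split on anything that is not a letter or digit.
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


# Made-up sample document.
doc = ("The gene is expressed in the root. Expression of the gene "
       "changes with root growth, and the gene is conserved.")

print(index_terms(doc, top_n=2))
```

The most frequent surviving terms serve as candidate index terms. Note that "expressed" and "expression" are counted separately here; stemming would be one of the more sophisticated steps the slide alludes to.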

8 Simple automatic indexes
• Inverted file: contains all the index terms automatically drawn from the document records according to the indexing technique used.
  – Position of term
    - Record number
    - Field number
    - Number of occurrences
    - Position in the field (digits 45-57)
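A rough sketch of such an inverted file follows. It records, for each term, the record number, the field, the number of occurrences, and the positions in the field. The record layout and field names are hypothetical, and positions are word offsets rather than the character (digit) offsets mentioned on the slide.

```python
from collections import defaultdict


def build_inverted_file(records):
    """Map each term to postings: record number, field, count, word positions."""
    positions = defaultdict(list)  # (term, record_no, field) -> word positions
    for record_no, fields in records.items():
        for field, text in fields.items():
            for pos, word in enumerate(text.lower().split()):
                positions[(word, record_no, field)].append(pos)

    inverted = defaultdict(list)
    # Sorting the keys gives a stable posting order: term, record, field.
    for (term, record_no, field), pos_list in sorted(positions.items()):
        inverted[term].append({
            "record": record_no,
            "field": field,
            "count": len(pos_list),
            "positions": pos_list,
        })
    return dict(inverted)


# Made-up document records with two fields each.
records = {
    1: {"title": "gene expression in roots",
        "abstract": "the gene controls root growth"},
    2: {"title": "root growth",
        "abstract": "growth of the root depends on the gene"},
}

inverted = build_inverted_file(records)
for posting in inverted["gene"]:
    print(posting)
```

Looking up a term is then a single dictionary access, which is what makes the inverted file the usual starting point for retrieval.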

9 Pros and Cons of Automatic Indexing
• Pros
  – Consistency
  – Cost reduction
  – Time reduction
• Cons / limitations
  – Human intellect
  – Term relationships
  – Misleading in retrieval
  – Good algorithms, but generally domain-specific

10 How to gauge effectiveness?
Recall: number of relevant documents retrieved out of all the possible relevant documents in the system. [quantity: did you get it all?]
Precision: percentage of documents retrieved that were relevant. [quality of what you found]
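The two measures can be computed directly from the set of retrieved documents and the set of relevant documents. The sketch below uses made-up document identifiers; the function name is an assumption for illustration.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for a set of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Made-up example: four relevant documents exist, three were retrieved.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d5"}

recall, precision = recall_and_precision(retrieved, relevant)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Here two of the four relevant documents were found (recall 0.50) and two of the three retrieved documents were relevant (precision 0.67).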

11 Tradeoff between Recall and Precision
• We can easily recall everything that matches a particular text string or pattern; however, we cannot search through all the matching results (too many).
• We can do an OK job limiting to the most relevant, but as we “tune” results to be more relevant, we leave out more and more matching results.

12 Future Search
• Full text searching of content, of associated annotations on content, and of metadata (including reader rankings, tags, etc.). Like Connotea, NeoNote, etc.
• Facet-based searching (Endeca, e.g. Home Depot, NCSU library).
• Cluster-based searching (Clusty)

13 Study on gene name searching
• Looks at full text searching
• Tradeoff between precision and recall
• (Hemminger 2007).

14 Article Discovery Study

                                        Schizophrenia +       Schizophrenia Gene   Arabidopsis Gene
                                        Schizophrenia Gene
Genes Found in Metadata Only            %                     %                    %
Genes Found in Full-text Only           %                     %                    %
Genes Found in Metadata and Full-text   %                     %                    %
Totals for Found Genes

15 Article Review Study
• Two literature cohorts
  – Schizophrenia (Pat Sullivan)
  – Arabidopsis (Todd Vision)
• Each cohort had three readers
• Readers are asked to “review the article and judge its relevance to them as someone new to the gene in this biological setting, trying to build an understanding of the state of knowledge in that research area.”

16 Metadata Articles More Valuable
• In both cases and for all observers, the mean quality rating values were lower (more useful) for the metadata discovered articles. There were statistically significant differences between the mean quality rating for the metadata discovered articles versus the full-text discovered articles for both the Arabidopsis and Schizophrenia sets at the p < 0.05 level.

17 Precision and Recall

                            Schizophrenia              Arabidopsis
                            Recall          Precision  Recall          Precision
Metadata discovered         15.7% (16.6%)   94.7%      84.1% (84.1%)   100%
Full-text only discovered   100%            63.7%      100%            69%

18 Article Features that correlate with Value: Number of Hits
• The number of hits or matches of the search term within the returned document is a commonly used feature to rank returned articles. To test the value of this feature, the number of hits was correlated with the mean quality ranking for each article (averaged across all observers). The results clearly show a relationship where articles with many matches of the search term tend to be much more highly valued.

19 Improving Relevance for Metadata Searching
• Repeating the calculations on the schizophrenia and Arabidopsis article review sets, but limited to only matches with high hit counts (Schizophrenia ≥ 20 hits and Arabidopsis ≥ 15 hits), shows that precision for the full text is now the same as (100% in Arabidopsis) or slightly better than (95% versus 94.4% in schizophrenia) that of the metadata retrieved articles. However, the number of additional cases discovered by full-text searching is now only slightly better, finding 5% more cases in schizophrenia and 28% more in Arabidopsis.

20 Conclusions
• This suggests that rather than accepting metadata searching as a surrogate for full-text searching, it may be time to make the transition to direct full-text searching as the standard. This could be accomplished by using certain features of the full-text article, such as the number of hits of the search string or whether the search string is found in the metadata (i.e. our current metadata search), as filters that allow us to increase the precision of our results and put the user in control of the filtering.
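The filtering idea in this conclusion can be sketched as follows: start from all full-text matches, then let the user narrow them by a minimum hit count or by requiring that the search string also appears in the metadata. The `Result` class, the field names, and the toy data are all hypothetical; the threshold of 20 hits simply echoes the schizophrenia cutoff quoted on the previous slide, and the relevance flags are invented for the example, not study data.

```python
from dataclasses import dataclass


@dataclass
class Result:
    doc_id: str
    hits: int          # matches of the search string in the full text
    in_metadata: bool  # search string also appears in the metadata
    relevant: bool     # invented reader judgment, for this toy evaluation only


def filter_results(results, min_hits=0, require_metadata=False):
    """Keep full-text matches that pass the user-chosen filters."""
    return [
        r for r in results
        if r.hits >= min_hits and (r.in_metadata or not require_metadata)
    ]


def precision(results):
    """Fraction of the given results judged relevant."""
    return sum(r.relevant for r in results) / len(results) if results else 0.0


# Made-up full-text search results.
results = [
    Result("a", 42, True, True),
    Result("b", 25, False, True),
    Result("c", 3, False, False),
    Result("d", 1, False, False),
    Result("e", 2, True, True),
]

for label, kept in [
    ("all full-text matches", filter_results(results)),
    ("at least 20 hits", filter_results(results, min_hits=20)),
    ("found in metadata", filter_results(results, require_metadata=True)),
]:
    print(f"{label}: {[r.doc_id for r in kept]} precision={precision(kept):.2f}")
```

In this toy data both filters raise precision, but each drops a relevant article that the other keeps, which is the recall cost described on the tradeoff slide. Leaving the threshold as a parameter is what puts the user in control of that tradeoff.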