Outline
– Quick review of the Gene Summarizer (GS)
– Current problems with GS
– Our solutions
– Future work
– Discussion …


A quick review of the Gene Summarizer (1)
Six predefined generic aspects for summarizing genes:
– GP (Gene Product): the product (protein, rRNA, etc.) of the target gene;
– EL (Expression Location): where the target gene is mainly expressed;
– SI (Sequence Information): the sequence information of the target gene and its product;
– WFPI (Wild-Type Function & Phenotypic Information): the wild-type functions of, and the phenotypic information about, the target gene and its product;
– MP (Mutant Phenotype): the mutant phenotypes of the target gene;
– GI (Genetic Interaction): the genetic interactions of the target gene with other molecules.

A quick review of the Gene Summarizer (2)
Two-stage approach for generating a semi-structured gene summary:
– Retrieve sentences about a gene (keyword match/NER → precision, recall)
– Extract sentences for each specified semantic aspect (categorization → training sentences, classification algorithm)

A quick review of the Gene Summarizer (3)
Potentially an Entity Summarizer? Yes. Ex. Pathway Summarizer
– Define categories: (1) pathway type (Metabolism, Genetic Information Processing, Environmental Information Processing, Human Diseases, …); (2) molecules involved; (3) molecular interactions/reactions/relations; …
– Collect example sentences for each category
[Diagram labels: Gene Summarizer; Special entities + Pre-defined categories; Entity Summarizer]

Current Problems (1)
Keyword match: precision is not high
– Ex. Screening a cDNA library prepared from silk-producing glands of the black widow spider,…
NER component: recall is not high
– Ex. …beta-alanine biosynthesis is regulated by black.
– Take advantage of known synonyms
– A simpler problem than NER, as we already know the gene name and its synonyms

Current Problems (2)
Categorization component: no high-quality example sentences
– Training sentences are mostly downloaded automatically from model organism databases (FlyBase, SGD) and are not really sentences from the literature.
– Classification algorithms can be further improved.
Irrelevant gene mentions
– Some popular gene names are frequently mentioned in abstracts for comparison/reference purposes.

Our Solutions (1)
Strategies used for V1's keyword match
– Recall: use dictionary-based keyword retrieval to retrieve all documents containing any synonym of the target gene
– Precision: require retrieved abstracts to contain at least one long synonym. Ex. …SmZF1 binds both ds and ss DNA oligonucleotides,…
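The two V1 rules above can be sketched roughly as follows. This is a minimal illustration, not the GS implementation: the function names, the whole-word matching rule, and the length threshold `min_long_len` that decides what counts as a "long" synonym are all assumptions, since the slide does not define them.

```python
import re

def contains_term(text, term):
    """Return True if `term` occurs in `text` as a whole word or phrase."""
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def retrieve_abstracts(abstracts, synonyms, min_long_len=5):
    """Keep abstracts that mention any synonym (recall step) and that
    also contain at least one long synonym (precision step).

    `min_long_len` is a hypothetical threshold. If the gene has no
    synonym of that length, nothing is returned.
    """
    long_synonyms = [s for s in synonyms if len(s) >= min_long_len]
    hits = []
    for abstract in abstracts:
        # Recall: dictionary-based match on any synonym.
        if not any(contains_term(abstract, s) for s in synonyms):
            continue
        # Precision: demand a long, less ambiguous synonym as well.
        if any(contains_term(abstract, s) for s in long_synonyms):
            hits.append(abstract)
    return hits
```

With the illustrative synonym list `["ss", "spineless"]`, the sentence "SmZF1 binds both ds and ss DNA oligonucleotides" passes the recall step through the short symbol "ss" but is dropped by the precision step.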

Our Solutions (2)
NER component for V2
– Recall: require an exact match, but allow it to cross the tag boundary
– Ex. Query: ABC-a
  xxxxxxxxxxxxxx ABC a gene xxxxxxxxxxxxxxx
  xxxxxxxxxxxxxxxxx ABC a xxxxxxxxxxxxxxxxxxxxxxxx
  xxxxxxxxxxxxxx ABC a encodes xxxxxxxxxxxx
  xxx gene ABC a xxxxxxxxxxxxxxx
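One possible reading of "exact match but allowing crossing the tag boundary" is sketched below. The slide does not spell out the rule, so everything here is an assumption: the NER tagger is taken to return half-open token spans, the query "ABC-a" is taken to be tokenized as `["ABC", "a"]`, and a query match is accepted whenever it overlaps a tagged span, even if the two spans do not coincide.

```python
def find_mentions(text, query_tokens, tagged_spans):
    """Return token spans where the query matches exactly and overlaps
    at least one NER-tagged span.

    `tagged_spans` is a list of half-open (start, end) token index pairs,
    a hypothetical representation of the tagger output.
    """
    tokens = text.split()
    n = len(query_tokens)
    hits = []
    for i in range(len(tokens) - n + 1):
        # Exact match on the token sequence of the query.
        if tokens[i:i + n] != query_tokens:
            continue
        # Accept the match if it overlaps any tagged span, so the tag may
        # be longer, shorter, or shifted relative to the query.
        if any(i < end and start < i + n for start, end in tagged_spans):
            hits.append((i, i + n))
    return hits
```

Under this reading, a tag covering "ABC a gene" and a tag covering only "a gene" both count as a mention of the query, while an exact string match with no nearby tag does not.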

Our Solutions (3)
NER component for V3
– Recall: take advantage of known synonyms, i.e., expand the query with the known synonyms to improve recall.
– Ex. Coexpression of Ss and Tgo in Drosophila SL2 cells…

Future Work (1)
Further improvement of the gene name recognition component
– A simpler problem than NER, as we already know the gene name and its synonyms
– Build a classifier that focuses on contextual features to identify false gene mentions
  Ex. The purpose of this study was to investigate the black gene, and protein,…
  Ex. Screening a cDNA library prepared from silk-producing glands of the black widow spider,…
– Use only contextual features, because the term/phrase already matches a gene name
– Need some “good” negative examples (ambiguous gene names) in the training data
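As a rough sketch of what "contextual features" could look like, the function below collects the words around each occurrence of an already-matched gene name and leaves the name itself out. The window size, the feature names, and the restriction to single-token gene names are assumptions made for illustration; the slide does not specify a feature set or a classifier.

```python
import re

def context_features(sentence, gene_name, window=2):
    """Return one feature dict per occurrence of `gene_name`.

    Only neighbouring words are used as features; the matched term is
    excluded because it already matches a gene name by construction.
    Handles single-token gene names only.
    """
    tokens = re.findall(r"\w+", sentence.lower())
    target = gene_name.lower()
    occurrences = []
    for i, token in enumerate(tokens):
        if token != target:
            continue
        feats = {}
        for k in range(1, window + 1):
            # Pad with boundary markers at the sentence edges.
            feats[f"prev_{k}"] = tokens[i - k] if i - k >= 0 else "<s>"
            feats[f"next_{k}"] = tokens[i + k] if i + k < len(tokens) else "</s>"
        occurrences.append(feats)
    return occurrences
```

For the two example sentences, the word following "black" is "gene" in the true mention and "widow" in the false one, which is the kind of signal such a classifier could learn from.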

Future Work (2)
Enable synonym selection by the user
– Use synonyms for potential query expansion, to increase recall
– Allow the user to choose which synonyms, for which species, are searched, to control precision
– Ex. “black” is also used as a synonym for the D. ananassae "ebony" gene, so GS will extract all the sentences mentioning "ebony" when summarizing "black".
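A minimal sketch of species-aware query expansion follows. The shape of `synonym_table` (gene name mapped to per-species synonym lists) and the function name are hypothetical; the table entries in the usage note are taken only from the example on the slide.

```python
def expand_query(gene, synonym_table, selected_species=None):
    """Return the query terms for `gene`, expanded with synonyms.

    `synonym_table` maps a gene name to {species: [synonyms]} (an assumed
    layout). If `selected_species` is None, all species are used, which
    favours recall; passing a set of species restricts the expansion,
    which favours precision.
    """
    terms = [gene]
    for species, synonyms in synonym_table.get(gene, {}).items():
        if selected_species is not None and species not in selected_species:
            continue
        for synonym in synonyms:
            if synonym not in terms:
                terms.append(synonym)
    return terms
```

For instance, with `{"black": {"D. melanogaster": ["black"], "D. ananassae": ["ebony"]}}`, the unrestricted query expands to `["black", "ebony"]`, while restricting to `{"D. melanogaster"}` keeps only `["black"]`.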

Future Work (3)
Integrate gene-mention relevance into the sentence ranking
– Some popular gene names are frequently mentioned in abstracts for comparison/reference purposes.
– Ex. It encodes a protein predicted to contain 688 amino acid residues, including 11 zinc finger motifs of the C-2H-2 type in the C-terminal region, that are Kruppel-like in the conservation of the H/C link sequence connecting them.
– The number of occurrences of query gene mentions in a retrieved abstract gives some indication of how relevant the abstract is to the query gene.
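The occurrence-count idea can be sketched as below. This is only an illustration: the whole-word, case-insensitive matching and the plain descending sort are assumptions, and the slide does not say how the count would be combined with the existing sentence scores.

```python
import re

def mention_count(abstract, synonyms):
    """Count whole-word occurrences of any synonym in the abstract.

    Overlapping synonyms (one contained in another) are counted
    separately; a real system would need to deduplicate them.
    """
    total = 0
    for synonym in synonyms:
        pattern = r"(?<!\w)" + re.escape(synonym) + r"(?!\w)"
        total += len(re.findall(pattern, abstract, flags=re.IGNORECASE))
    return total

def rank_by_mentions(abstracts, synonyms):
    """Return (count, abstract) pairs, most mentions first, dropping
    abstracts that never mention the gene."""
    counted = [(mention_count(a, synonyms), a) for a in abstracts]
    counted = [(c, a) for c, a in counted if c > 0]
    counted.sort(key=lambda pair: pair[0], reverse=True)
    return counted
```

An abstract that names the query gene once in passing, as in the Kruppel-like example, would then rank below an abstract that discusses the gene throughout.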

Future Work (4)
Speed up the GS for huge collections, such as the entire Medline (needs ~3 mins)!
– Three major steps for the GS:
  a) Get sentence IDs for the query gene
  b) Compute relevance scores for all six aspects
  c) Retrieve original sentences for result display
– Step (a) can be precomputed for all genes in our dictionary.
– Step (b) can be precomputed for all sentences in our collection.
– Only step (c) has to be done on the fly.
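The precomputation plan above might be organized along these lines. This is a sketch under stated assumptions: `matcher` and `scorer` are hypothetical callables standing in for the GS retrieval and categorization components, the in-memory dictionaries stand in for whatever index storage is actually used, and the query-time step here also sorts the precomputed scores before fetching sentences.

```python
ASPECTS = ("GP", "EL", "SI", "WFPI", "MP", "GI")

def build_gene_index(sentences, gene_dictionary, matcher):
    """Step (a), offline: map every gene in the dictionary to the IDs of
    the sentences that mention any of its synonyms."""
    index = {gene: [] for gene in gene_dictionary}
    for sid, text in sentences.items():
        for gene, synonyms in gene_dictionary.items():
            if any(matcher(text, s) for s in synonyms):
                index[gene].append(sid)
    return index

def build_score_table(sentences, scorer):
    """Step (b), offline: score every sentence for all six aspects."""
    return {
        sid: {aspect: scorer(text, aspect) for aspect in ASPECTS}
        for sid, text in sentences.items()
    }

def summarize(gene, gene_index, score_table, sentences, top_k=1):
    """Step (c), at query time: look up precomputed IDs and scores, then
    fetch the top sentences per aspect for display."""
    sids = gene_index.get(gene, [])
    summary = {}
    for aspect in ASPECTS:
        ranked = sorted(sids, key=lambda sid: score_table[sid][aspect], reverse=True)
        summary[aspect] = [sentences[sid] for sid in ranked[:top_k]]
    return summary
```

The design point is that both offline tables depend only on the collection and the gene dictionary, not on the query, so query time is reduced to lookups over a small candidate set.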

Future Work (5)
Enable summarization over any dynamically generated collection
– Users may only be interested in seeing summarized information about a honeybee gene in a behavior-related collection. E.g., people may want to know what has been discovered about this gene in relation to mating behavior.
– Add a filter between steps (a) and (b) → return only sentences within the current collection for summarization.
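The filter between steps (a) and (b) could be as small as the sketch below. It assumes that both the retrieved sentences and the user's current collection are represented as sentence IDs, which the slide does not state.

```python
def filter_to_collection(sentence_ids, collection_ids):
    """Keep only the retrieved sentence IDs that belong to the current
    collection, preserving the original retrieval order."""
    allowed = set(collection_ids)
    return [sid for sid in sentence_ids if sid in allowed]
```

For example, if step (a) returns sentence IDs `[4, 8, 15, 16]` for a honeybee gene and the behavior-related collection contains `{8, 16, 23}`, only `[8, 16]` go on to aspect scoring.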

Discussion … Thoughts? Questions?