CS276B Text Information Retrieval, Mining, and Exploitation Lecture 16 Bioinformatics II March 13, 2003 (includes slides borrowed from J. Chang, R. Altman, L. Hirschman, A. Yeh, S. Raychaudhuri)

Bioinformatics Topics
Last week
  Basic biology
  Why text about biology is special
  Text mining case studies
    Microarray analysis, abbreviation mining
Today
  Combined text mining and data mining I
    Text-enhanced homology search
  Text mining in biological databases
  KDD Cup: information extraction for bio-journals
  Combining text mining and data mining II

Text-Enhanced Homology Search (Chang, Raychaudhuri, Altman)

Sequence Homology Detection
Obtaining sequence information is easy; characterizing sequences is hard.
Organisms share a common basis of genes and pathways.
Information can be predicted for a novel sequence based on sequence similarity:
  Function
  Cellular role
  Structure

PSI-BLAST
Used to detect protein sequence homology (an iterated version of the universally used BLAST program).
Searches a database for sequences with high sequence similarity to a query sequence.
Creates a profile from similar sequences and iterates the search to improve sensitivity.

PSI-BLAST Problem: Profile Drift
At each iteration, the search could find non-homologous (false positive) proteins.
False positives create a poor profile, leading to more false positives.

Addressing Profile Drift
PROBLEM: Sequence similarity is only one indicator of homology.
  More clues, e.g. protein functional role, exist in the literature.
SOLUTION: We incorporate MEDLINE text into PSI-BLAST.

Modification to PSI-BLAST
Before including a sequence, measure similarity of literature.
Throw away sequences with least similar literatures to avoid drift.
Literature is obtained from SWISS-PROT gene annotations to MEDLINE (text, keywords).
Define domain-specific “stop” words ( 85,000 sequences) = 80,479 out of 147,639.
Use similarity metric between literatures (for genes) based on word vector cosine.
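
The word vector cosine mentioned on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tokenizer, the tiny STOP_WORDS set, and the example texts are all assumptions made up for the sketch (the slide's own stop list is domain-specific and far larger).

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the slides define a much larger domain-specific one.
STOP_WORDS = {"the", "of", "and", "in", "a", "is", "protein"}

def word_vector(text, stop_words=STOP_WORDS):
    """Count lower-cased word tokens, skipping stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * v[word] for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up literature snippets for a query sequence and two candidate hits.
query_lit = word_vector("ATP binding kinase involved in cell cycle")
hit_lit = word_vector("A kinase of the cell cycle with ATP binding")
unrelated_lit = word_vector("Membrane transport of glucose")

print(cosine(query_lit, hit_lit))        # high: most words are shared
print(cosine(query_lit, unrelated_lit))  # zero: no shared words
```

A candidate whose literature scores low against the query's literature is the kind of sequence the modified search would be inclined to discard.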

Evaluation
Created families of homologous proteins based on SCOP (gold standard site for homologous proteins).
Select one sequence per protein family:
  Families must have >= five members
  Associated with at least four references
  Select sequence with worst performance on a non-iterated BLAST search

Evaluation
Compared homology search results from the original and our modified PSI-BLAST.
Dropped the lowest 5%, 10% and 20% of literature-similar genes during PSI-BLAST iterations.
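
The dropping step described above might look like the sketch below. The function name, the (id, score) pair layout, and the choice to round the number of dropped candidates down are assumptions for illustration; the slides only state the percentages.

```python
import math

def drop_least_similar(candidates, fraction):
    """Return candidates minus the lowest-scoring `fraction` of them.

    `candidates` is a list of (sequence_id, literature_similarity) pairs.
    The number dropped is rounded down (an assumption of this sketch).
    """
    n_drop = math.floor(len(candidates) * fraction)
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return ranked[: len(ranked) - n_drop]

# Made-up candidate hits from one search iteration.
hits = [("seqA", 0.82), ("seqB", 0.10), ("seqC", 0.55),
        ("seqD", 0.47), ("seqE", 0.31)]
kept = drop_least_similar(hits, 0.20)
print([seq_id for seq_id, _ in kept])
```

Only the surviving candidates would then contribute to the profile for the next iteration.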

Results
46/54 families had identical performance.
2 families suffered from PSI-BLAST drift, avoided with text-PSI-BLAST.
3 families did not converge for PSI-BLAST, but converged well with text-PSI-BLAST.
2 families converged for both, with slightly better performance by regular PSI-BLAST.

Discussion
Profile drift is rare in this test set and can sometimes be alleviated when it occurs.
Overall PSI-BLAST precision can be increased using text information.

Mining Text in Biological Databases

Where is the Information? What is the Data?
GenBank – genetic sequences
Swiss-Prot – protein sequences
DNA chips / microarrays
Metabolic pathways
Signaling pathways / regulatory networks
Medline – biomedical literature
Taxonomies / Ontologies

Genetic Information in GenBank
Numbers are for all species.
Biology is fundamentally an information science.

Species represented in GenBank (table columns: Entries, Bases, Species)
Homo sapiens
Mus musculus
Drosophila melanogaster
Arabidopsis thaliana
Caenorhabditis elegans
Tetraodon nigroviridis
Oryza sativa
Rattus norvegicus
Bos taurus
Glycine max
Lycopersicon esculentum
Hordeum vulgare
Medicago truncatula
Trypanosoma brucei
Giardia intestinalis
Strongylocentrotus purpuratus
Entamoeba histolytica
Danio rerio
Zea mays
Xenopus laevis

Complete Genomes
Aquifex aeolicus
Archaeoglobus fulgidus
Bacillus subtilis
Borrelia burgdorferi
Chlamydia trachomatis
Escherichia coli
Haemophilus influenzae
Methanobacterium thermoautotrophicum
Caulobacter crescentus
Helicobacter pylori
Methanococcus jannaschii
Mycobacterium tuberculosis
Mycoplasma genitalium
Mycoplasma pneumoniae
Pyrococcus horikoshii
Treponema pallidum
Saccharomyces cerevisiae
Drosophila melanogaster
Arabidopsis thaliana
Homo sapiens

Protein Sequences
Swiss-Prot (as of 3/03)
  122,564 sequences
  Almost 45,000,000 total amino acids
  103,486 references

Three-Dimensional Structures
Protein three-dimensional structures
Protein Data Bank (PDB), as of March 27, ,158 proteins
939 nucleic acids
616 protein/nucleic acid complexes
18 carbohydrates

Complete yeast genome (6000 genes) on a chip.

Online access to DNA chip data
www4.stanford.edu/MicroArray/SMD/
O(10) data sets available from Stanford site
10,000 to 40,000 genes per chip
Each set of experiments involves 3 to 40 “conditions”
Each data set is therefore near 1 million data points.
People gearing up for these measurements everywhere…

A Reaction in EcoCYC

KEGG

Signaling Pathways

Where’s the Information?
Medical literature online.
Online database of published literature since 1966 = Medline = PubMed resource
4,000 journals
10,000,000+ articles (most with abstracts)

PubMed

SwissProt
103,000 references
100s of MB of text
100,000s of unique words

Abstracts Referenced in SP37
Number of abstracts associated with sequences in Swiss-Prot (# sequences truncated at 100; as of 2001).

MeSH = Medical Subject Headings
Controlled vocabulary for indexing biomedical articles.
19,000 “main headings” organized hierarchically
Browser at html

MESH

UMLS: Semantic Model of Biomedical Language
Representing more of the semantics of words and more relationships.
UMLS = Unified Medical Language System
mls/

UMLS Elements
Semantic concepts (475K) = specific terms connected to semantic categories (e.g. Munchausen syndrome linked to Behavioral-Dysfunction)
Concept maps (1,000K) = mapping from a terminology to a semantic concept (e.g. ICD-9 billing code to Munchausen syndrome)
Categorizations = relate semantic concepts
Conceptual links (7K) = relate two semantic concepts with a semantic relationship

Gene Ontology
A controlled listing of three types of function:
  Molecular Function
  Biological Process
  Cellular Component
Vision: universal language for molecular biology across species

Molecular Function
<molecular_function ; GO:
%anti-toxin ; GO:
%lipoprotein anti-toxin ; GO:
%anticoagulant ; GO:
%antifreeze ; GO:
%ice nucleation inhibitor ; GO:
%antioxidant ; GO:
%glutathione reductase (NADPH) ; GO: ; EC: % flavin-containing electron transporter ; GO: % oxidoreductase\, acting on NADH or NADPH\, disulfide as acceptor ; GO:
%thioredoxin reductase (NADPH) ; GO: ; EC: % flavin-containing electron transporter ; GO: % oxidoreductase\, acting on NADH or NADPH\, disulfide as acceptor ; GO:
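
Lines in the layout shown on this slide could be split apart with a short parser like the one below. It is only a sketch that assumes what is visible here: a relationship character (% or <), a term name with commas escaped as "\,", and identifiers separated by " ; ". The function name is made up, and the identifiers in the example are placeholders, since the real numbers did not survive in this transcript.

```python
import re

# One relationship character (% or <), then everything up to the next one.
ENTRY = re.compile(r"([%<])\s*([^%<]+)")

def parse_go_line(line):
    """Split one flat-file line into its term and any further listed terms."""
    entries = []
    for relation, chunk in ENTRY.findall(line):
        fields = [field.strip() for field in chunk.split(";")]
        entries.append({
            "relation": relation,
            "name": fields[0].replace("\\,", ","),  # undo the escaped commas
            "ids": [f for f in fields[1:] if f],
        })
    return entries

# The identifiers below are placeholders, not real GO or EC numbers.
line = (r"%glutathione reductase (NADPH) ; GO:0000001 ; EC:0.0.0.0 "
        r"% oxidoreductase\, acting on NADH or NADPH\, disulfide as acceptor ; GO:0000002")
entries = parse_go_line(line)
for entry in entries:
    print(entry["relation"], entry["name"], entry["ids"])
```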

Current Genome Annotations

KDD Cup 2002: Information Extraction for Biological Text

Task Background: FlyBase
FlyBase project
  Curates biomedical publications on the fruit fly
  Uses GO (Gene Ontology) as its ontology
  The fruit fly (Drosophila melanogaster) is one of the key “model organisms”
FlyBase goals
  Distillation of literature on the fruit fly
  Table of contents function
  Support search of literature
Current methodology: manual curation
  Curators read the literature and manually update FlyBase
Goal of KDD Cup 2002: Can this be (partially) automated?

FlyBase: Example of Data Curation

Curators Cannot Keep Up with the Literature!
FlyBase References By Year

Task Rationale and Description
FlyBase provided the
  Data annotation (plus biological expertise)
  Input on the task formulation
    What can be useful to the curators
Start fairly simple. Try to help automate part of what one group of FlyBase curators needs to do:
  Determine which papers need to be curated for fruit fly gene expression information
  Want to curate those papers containing experimental results on gene products (RNA transcripts and proteins)

Some Data (Text) Preparation Challenges
Abstracts are not enough; the full papers are needed.
  E.g., for one paper on Appl proteins (PubMed ID # ), FlyBase lists 19 “when-where” pairs for Appl protein expression.
  A “when-where” pair indicates when in the life cycle and where in the body some transcript or protein is found.
  “When-where” pair example: adult-brain
  Only 2 of the 19 pairs (11%) are mentioned in the abstract. The rest are only mentioned in the body of the full paper.
So full papers are needed in electronic form.

Some Data (Text) Preparation Challenges
Full papers are copyrighted by publishers.
  For the contest, only use “free” papers.
As a result of all these complications, out of the ~7100 papers in FlyBase that were of interest, only ~1100 were used.

Some Data (Text) Preparation Challenges (Continued)
Plain text is not enough; things like superscripts, subscripts, italics, and Greek letters (in English text) are also needed.
E.g., alleles (variants of a gene) are represented with superscripts.
  Some Appl gene alleles: Applᵈ, Applˢ, Applˢᵈ
  If the superscripts are lost, these appear as: Appld, Appls, Applsd
  This would make it harder to determine that these refer to the same gene.
  Need to know what suffixes to remove before trying to match.

Some Data (Text) Preparation Challenges (Continued)
FlyBase has certain conventions to represent superscripts, etc. in ASCII.
  E.g., those alleles are represented as Appl[d], Appl[s], Appl[sd]
In general, gene and protein names are already hard to match because they often have a complicated word structure (morphology).
  One needs to know what morphological transformations (like prefix or suffix removal) to perform before attempting to match the names.
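
With the bracket convention above, recovering the base gene symbol from an allele name is a simple suffix removal. The sketch below assumes the superscript is always a single trailing [...] group; the function name is made up for illustration.

```python
import re

# FlyBase-style ASCII superscript: a trailing [...] group, e.g. Appl[sd].
SUPERSCRIPT = re.compile(r"\[[^\[\]]*\]$")

def base_gene_symbol(name):
    """Strip one trailing bracketed superscript from an allele name."""
    return SUPERSCRIPT.sub("", name)

alleles = ["Appl[d]", "Appl[s]", "Appl[sd]"]
symbols = {base_gene_symbol(a) for a in alleles}
print(symbols)  # all three alleles reduce to the same symbol

# Without the brackets there is nothing to strip, which is the
# matching problem the slides describe.
print(base_gene_symbol("Appld"))
```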

Information Extraction Task
Given for each paper
  The full text of that paper
  A list of the genes mentioned in that paper
Determine for each paper
  For each gene mentioned in the paper, does that paper have experimental results for
    Transcript(s) of that gene (Yes/No)?
    Protein(s) of that gene (Yes/No)?

Task is Harder Than It First Appears
Interested in results applicable to “regular” (found in the wild) flies, not mutants
Genes have multiple names (synonyms)
  Given a list of the known synonyms
  But list may be incomplete
Some names can refer to multiple genes
  E.g., “Clk” is a symbol for one gene (Clock) and is also a synonym for another gene (period, symbol is “per”)
Contestants given evidence of experimental results found in the training data,
  but only in the form that is recorded in the FlyBase database
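
The ambiguity in the “Clk” example can be made concrete by inverting a synonym list into a name index. The table below is a toy built only from the example on this slide, and the function name is an assumption; real synonym lists are far larger and, as noted, possibly incomplete.

```python
from collections import defaultdict

# Toy synonym table taken from the slide's example: gene -> other names.
SYNONYMS = {
    "Clock": ["Clk"],
    "period": ["per", "Clk"],
}

def build_name_index(synonyms):
    """Map each name (gene name, symbol or synonym) to the genes it may denote."""
    index = defaultdict(set)
    for gene, names in synonyms.items():
        index[gene].add(gene)
        for name in names:
            index[name].add(gene)
    return index

index = build_name_index(SYNONYMS)
ambiguous = sorted(name for name, genes in index.items() if len(genes) > 1)
print(ambiguous)             # names that point at more than one gene
print(sorted(index["Clk"]))  # the candidate genes for "Clk"
```

A matcher that meets an ambiguous name needs additional context from the paper to decide which gene is meant.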

Training Data in FlyBase
Database (DB) records what evidence is found in a training paper, but not where in that paper.
The evidence is often recorded in a “normalized” form, and domain knowledge is needed to find the corresponding text, e.g.,
  DB: Assay mode: “immunolocalization”
  Text (PubMed ID# ): “Figure 12. …Whole-mount tissue staining using an affinity-purified anti-PHM antibody in the CNS … This view displays only a portion of the CNS”
  The term “immunolocalization” is not in the text.
  Instead, the text describes the process of performing an immunolocalization.
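
One simple way to bridge a normalized database term and the wording of a paper is a hand-built list of cue words. The sketch below is purely illustrative: the cue list is hypothetical (chosen from words in the quoted figure caption), as are the names ASSAY_CUES and supporting_passages, and nothing here is taken from any contest entry.

```python
# Hypothetical cue words for one normalized assay term.
ASSAY_CUES = {"immunolocalization": ["antibody", "staining"]}

def supporting_passages(assay, passages, cues=ASSAY_CUES):
    """Return (passage, matched cues) for passages that hint at the assay."""
    hits = []
    for passage in passages:
        lowered = passage.lower()
        found = [cue for cue in cues.get(assay, []) if cue in lowered]
        if found:
            hits.append((passage, found))
    return hits

# Fragments of the caption quoted on the slide.
passages = [
    "Figure 12.",
    "Whole-mount tissue staining using an affinity-purified anti-PHM antibody in the CNS",
    "This view displays only a portion of the CNS",
]
hits = supporting_passages("immunolocalization", passages)
for passage, found in hits:
    print(found, "->", passage)
```

The normalized term itself never appears in the passages; only the domain-specific cues connect the record to the text.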

Typical NLP Training Data: More Detailed
These systems assume every mention of an entity or relation of interest in the text is annotated
  So anything not annotated is not a mention
E.g., annotations to train a “Northern blot” detector:
  Paper # :
  ... transcripts on Northern analyses, raising questions
  Northern blots were carried out as described
  Analysis of Adult
  Figure 3: Northern blot analysis of transcripts in adult...
  This paper has a total of 19 mentions.
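
Mention-level annotation of the kind shown above marks every occurrence in the text. A crude pattern-based stand-in for such a detector is sketched below; the regular expression covers only the surface forms visible on the slide and is an assumption, not a trained model.

```python
import re

# Assumed surface forms; a real detector would need many more variants.
NORTHERN = re.compile(r"\bNorthern\s+(?:blots?|analys[ie]s)\b", re.IGNORECASE)

def find_mentions(text):
    """Return (character offset, matched text) for every mention found."""
    return [(m.start(), m.group(0)) for m in NORTHERN.finditer(text)]

sample = ("... transcripts on Northern analyses, raising questions ... "
          "Northern blots were carried out as described ... "
          "Figure 3: Northern blot analysis of transcripts in adult ...")
mentions = find_mentions(sample)
for offset, text in mentions:
    print(offset, text)
```

Fully annotated training data would provide offsets like these for every mention; the FlyBase records provide only the paper-level fact.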

Task Details
Task has 3 sub-tasks that contribute equally to the overall score:
  1. Ranked list of papers (curatable before non-curatable)
  2. Yes/No decisions on the papers being curatable (having any results of interest)
  3. Yes/No decisions for having results for each type of product (transcript, protein) for each gene mentioned in a paper

Some Numbers
Training set: 862 articles
Test set: 213 articles (non-public!)
Time allowed
  Release training set, wait ~6 weeks
  Release test set, results due ~2 weeks later
18 teams submitted 32 entries
  Entries from 7 “countries”: Japan, Taiwan, Singapore, India, UK, Portugal, USA
  About equal numbers of universities and companies
Evaluation measure: F measure
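
For reference, an F measure combines precision and recall into one number. The sketch below uses the standard F-beta formula with beta = 1 as the default; whether the contest used the balanced form is an assumption here, and the counts in the example are made up.

```python
def f_measure(true_positives, false_positives, false_negatives, beta=1.0):
    """F-beta score from raw counts; beta=1 weights precision and recall equally."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Made-up counts: 30 correct "yes" decisions, 10 spurious, 20 missed.
score = f_measure(true_positives=30, false_positives=10, false_negatives=20)
print(round(score, 3))
```

With these counts precision is 0.75 and recall is 0.60, and the balanced F measure falls between the two, closer to the lower value.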

Results
Winner: a team from ClearForest and Celera
  Used manually generated rules and patterns to perform information extraction
  Also had the best score in each of the 3 sub-tasks
Sub-task                 Best   Median
Ranked-list              84%    69%
Yes/No curate paper      78%    58%
Yes/No gene products     67%    35%

Summary
Reliance on partial annotations is key.
“Information retrieval” task is easiest to solve and immediately useful.
Electronic availability of full text is a big issue.
Mundane format problems (subscripts etc.) are a big issue.
Best results were 67% for information extraction.

Curated Databases
FlyBase is an example of a curated database.
A lot of biological research is organized around such databases (cf. building and publishing software packages in CS).
There are hundreds (thousands?) of curated databases.
  13 important databases just for one area: nuclear receptors.
Maintaining curated databases is labor-intensive.

Curated Databases
Text mining can be used for:
  Cost savings
  Time savings
  Consistency
  Freshness

Curated Databases: Uses
Protein-protein interactions
  Which proteins interact with X?
Support information retrieval
  Find all transcription factors that are involved in cell death
Interpretation of data-intensive experiments
  Microarray case study presented last week
In silico biology

E-Cell

Curated Databases: Uses (cont.)
Summary/selection of what is known
Support search
Knowledge discovery
Contradictory findings
Nobel Prize
  He/She who points out a critical gene-disease link first wins the Nobel Prize.
  You had better do a thorough literature search.

Combining Text Mining and Data Mining

Combining Text and Links
Recall: Classifying a web document based on
  The text it contains
  The categories of other pages pointing to it
  The categories of other pages it is pointing to
Also
  Usage information (Pitkow et al.)

Clustering: Example (Eisen et al.)

Combining Gene Expression & Text
Clustering of genes in a microarray experiment
Last week
  Clustering based on text only, or
  Clustering based on gene expression only
What about combining the two?
  There is a large number of “good clusterings” for a particular problem
  Use literature to guide clustering

Comments
Yeast: genes were grouped by expression.
  Functional labels guided us to find key subgroups.
  Once key subgroups are identified, supervised approaches can refine the identification process.
Cancer: cell lines were grouped by semantic category (hypoxia versus normoxia).
  Used supervised approaches to refine the identification process.

Literature as a guide
Free text documentation is widely available
  Patient records to describe pathological specimens
  ~20,000 documents describing specific yeast genes
May have the information to guide us in searching for similarities in genes and expression

Goal of algorithm
To identify subgroups of genes with commonalities in gene expression and in biological function.
Literature is the means by which we identify functional commonalities.

Projections in Linear Discriminant Analysis
A normal distribution is estimated for the features of each population of the training set.
Each distribution is centered at the mean of the population.
Linear discriminant analysis assumes a pooled covariance matrix.
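
A two-class, two-feature version of this idea can be written out directly. The sketch below estimates each class mean, pools the two covariance estimates, and projects points onto the direction given by the inverse pooled covariance applied to the difference of the means. The data points and helper names are made up, and a real analysis of expression data would use a numerical library and many more dimensions.

```python
def mean(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]

def scatter(points, mu):
    """2x2 sum of outer products of deviations from the mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mu[0], p[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def lda_direction(class_a, class_b):
    """Projection direction: inverse pooled covariance times (mean_a - mean_b)."""
    mu_a, mu_b = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, mu_a), scatter(class_b, mu_b)
    dof = len(class_a) + len(class_b) - 2
    pooled = [[(sa[i][j] + sb[i][j]) / dof for j in range(2)] for i in range(2)]
    det = pooled[0][0] * pooled[1][1] - pooled[0][1] * pooled[1][0]
    inv = [[pooled[1][1] / det, -pooled[0][1] / det],
           [-pooled[1][0] / det, pooled[0][0] / det]]
    diff = [mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def project(point, w):
    """Scalar projection of a point onto direction w."""
    return point[0] * w[0] + point[1] * w[1]

# Two made-up populations with the same spread but different means.
class_a = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
class_b = [(5.0, 2.0), (6.0, 1.0), (6.0, 3.0), (7.0, 2.0)]
w = lda_direction(class_a, class_b)
print(w)
print([project(p, w) for p in class_a])
print([project(p, w) for p in class_b])
```

On this toy data the two populations land in disjoint ranges along the projection, which is the kind of separation a good projection is meant to produce.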

Our approach
Look for projections that separate specific groups of genes
  In a good projection, the separated genes have some functional commonalities
  These commonalities should be evident in the gene literature

Challenges
C1: Can we identify biologically meaningful concepts from simple text representations?
C2: In a group of genes with some biological similarity, can we detect that similarity in the literature?
C3: Can we then find projections in the expression data that group genes appropriately?

Resources
NLP sessions of PSB: psb.stanford.edu
bioperl.org, biopython.org
National Library of Medicine: bm.html (out of date, but still comprehensive)
bm.html

Links to Today’s Topics Pac Symp Biocomput. 2001;: PMID: Blast: Genome Res 2002 Oct;12(10): Using text analysis to identify functionally coherent gene groups. Raychaudhuri S, Schutze H, Altman RB b=Genome (complete genomes) b=Genome
