Opportunities & Challenges in Applying IR Techniques to Bioinformatics ChengXiang (“Cheng”) Zhai Department of Computer Science Institute for Genomic Biology.

Presentation transcript:

Opportunities & Challenges in Applying IR Techniques to Bioinformatics
ChengXiang (“Cheng”) Zhai
Department of Computer Science, Institute for Genomic Biology, Graduate School of Library & Information Science
University of Illinois at Urbana-Champaign
Includes slides from NCBI training tutorials and from the website of the book “An Introduction to Bioinformatics Algorithms”

DIR 2006 Keynote Talk, 2006 © ChengXiang Zhai
Where in the US is UIUC?
[Figure: picture from the Netherlands Consulate website, with Chicago labeled]

Outline
What is Bioinformatics?
Typical Problems in Bioinformatics
Information Retrieval & Bioinformatics
Biomedical Literature Access & Mining
Summary

What is Bioinformatics?
Management & exploitation of biological data/information
–Biological information (DNA, gene expression, proteins, literature, …)
–Information management (search, organization, classification)
–Information exploitation (pattern analysis, data mining)

Why is Bioinformatics Important?
Biology perspective
–More and more biological information is available
–Need to effectively access and use the information
–Information analysis supplements (and may even replace) wet lab experiments
Computer science perspective
–Excellent application domain
–Poses special computational challenges
–Brings computer science closer to scientific discovery
Currently growing …

Bioinformatics and Other Fields
[Diagram relating Bioinformatics to: Biology, Molecular Biology, Biophysics, Biochemistry, Computer Science, Theoretical CS, Machine Learning, Data Mining, Information Management, Optimization, Applied Mathematics & Statistics]

Some background about molecular biology…

Life Begins with the Cell
A cell is the smallest structural unit of an organism that is capable of independent functioning
Cells have some common features

All Cells Have Common Cycles
They are born, eat, replicate, and die

Example of cell signaling

Some Terminology
Genome
–An organism’s complete set of DNA
–A small bacterial genome contains about 600,000 DNA base pairs
–Human and mouse genomes have some 3 billion
Gene
–Basic physical and functional unit of heredity
–Specific sequence of DNA bases that encodes instructions on how to make a protein
Protein
–Makes up the cellular structure
–Large, complex molecule made up of smaller subunits called amino acids

Life Depends on 3 Critical Molecules
DNA
–Holds information on how the cell works
RNA
–Acts to transfer short pieces of information to different parts of the cell
–Provides templates to synthesize proteins
Proteins
–Form enzymes that send signals to other cells and regulate gene activities
–Form the body’s major components (e.g., hair, skin)

The Central Dogma
Central Dogma: DNA → RNA → protein
Transcription: DNA → mRNA
Translation: mRNA → protein
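The two steps of the central dogma can be sketched in a few lines of Python. This is only an illustration, not part of the talk: the function names are hypothetical, `transcribe` makes the common simplification of reading the coding strand (so transcription reduces to replacing T with U), and `translate` uses the standard genetic code starting at the first AUG. The example input is the start of the apple AFS1 coding sequence from the GenBank record shown later in this transcript.

```python
from itertools import product

# Standard genetic code, enumerated over codons in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Transcription (simplified): coding-strand DNA -> mRNA."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translation: read codons from the first AUG until a stop codon."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# First bases around the start codon of the AFS1 mRNA (from the ORIGIN block).
dna = "actatggaattcagagttcacttgcaagct"
print(translate(transcribe(dna)))
```

The printed peptide should agree with the opening residues of the `/translation` qualifier in that record.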

Biology Research Questions
Are two genes the same?
How are genes regulated?
What are the relations between gene functions and transcription factors?
How can we detect gene regulation networks?
How can we determine protein structures?
How can we determine protein functions?
…

The Central Dogma & Biological Data
Original DNA sequences (genomes)
Expressed DNA sequences (= mRNA sequences = cDNA sequences); Expressed Sequence Tags (ESTs)
Protein sequences
–Inferred
–Direct sequencing
Protein structures
–Experiments
–Models (homologues)
Literature information

Entrez Integrates Most Biological DBs
[Diagram: Entrez at the center, linked to Nucleotide, PubMed, Protein, Taxonomy, Structure, Domains, 3D Domains, Journals, PMC, OMIM, Books, PopSet, SNP, UniGene, UniSTS, Genome, Gene, GEO, GEO Datasets, MeSH, Cancer Chromosomes, HomoloGene]

Web Access:

Number of Users and Hits Per Day
[Chart: users and hits per day, with the Christmas & New Year’s days annotated]
Currently 10,000,000 to 50,000,000 hits per day!

Types of Databases
Primary databases
–Original submissions by experimentalists
–Content controlled by the submitter
–Examples: GenBank, SNP, GEO
Derivative databases
–Built from primary data
–Content controlled by a third party (NCBI)
–Examples: RefSeq, RefSNP, GEO Datasets, UniGene, TPA, NCBI Protein, Structure, Conserved Domain

Primary vs. Derivative Sequence Databases
[Diagram: sequencing centers and labs submit sequences to GenBank, which is updated ONLY by submitters; algorithms and curators build UniGene, RefSeq, and genome assemblies from it, which are updated continually by NCBI]

A Traditional GenBank Record: The Flatfile Format
(The record has three parts: header, feature table, and sequence.)

LOCUS       AY bp mRNA linear PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY
VERSION     AY GI:
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1 (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, (2004)
REFERENCE   2 (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3 (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:
FEATURES             Location/Qualifiers
     source          /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            /gene="AFS1"
     CDS             /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO "
                     /db_xref="GI: "
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
//

Bioinformatics Tools

Topics in Bioinformatics
[Diagram labels: DNA Sequences (example: AATTCATGAAAATCGTATACTGGTCTGGTACCGGC…), Protein Sequences (example: MKIVYWSGTGNTEKMAELIAKGIIESGKDV…), Genes, Proteins (Function), Gene expression & regulation, Microarray data, Biology Literature (example: “…In this paper, we report the discovery of a new gene that affects DNA reproduction in …”), Genomics, Transcriptomics, Proteomics, Retrieval & Text Mining]

Typical Problems in Bioinformatics
Sequence alignment
–Pairwise
–Multisequence
Motif finding
Gene finding
Protein structure/function prediction
Protein motif function prediction
Literature access & mining
…
This is quite an incomplete list…

Topic 1: Sequence Search
Biological problems
–How do we know whether two genes are similar?
–Given a gene, how can we find similar genes in the genome of another organism?
–Given a protein, how can we find similar proteins?
Computational problems
–Sequence matching/alignment/search
–Given a query sequence, retrieve similar sequences from a database of sequences
Related IR techniques
–Inverted index, sequence similarity, sequence retrieval
The biologists’ sequence search engine is BLAST, arguably the most useful bioinformatics tool, and biologists use it routinely!
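To make the inverted-index analogy concrete, here is a minimal sketch of sequence retrieval in which overlapping k-mers play the role of words. It is an illustration under stated assumptions, not how BLAST works: BLAST extends seed matches into scored local alignments, while this sketch only counts shared k-mers. The function names, the toy database, and the choice k = 3 are all hypothetical.

```python
from collections import Counter, defaultdict


def kmers(seq: str, k: int) -> list[str]:
    """All overlapping substrings of length k (the 'words' of a sequence)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(database: dict[str, str], k: int = 3) -> dict[str, set[str]]:
    """Inverted index: k-mer -> ids of the sequences that contain it."""
    index = defaultdict(set)
    for seq_id, seq in database.items():
        for kmer in kmers(seq, k):
            index[kmer].add(seq_id)
    return index


def search(query: str, index: dict[str, set[str]], k: int = 3) -> list[tuple[str, int]]:
    """Rank database sequences by the number of distinct query k-mers they share."""
    hits = Counter()
    for kmer in set(kmers(query, k)):
        for seq_id in index.get(kmer, ()):
            hits[seq_id] += 1
    return hits.most_common()


# Toy database (made up for illustration).
database = {"s1": "ACGTACGT", "s2": "TTTTGGGG", "s3": "ACGTTTTT"}
index = build_index(database)
print(search("ACGTAC", index))
```

A real system would follow the candidate lookup with an alignment step, since the matching criterion for sequences is alignment quality, not word overlap.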

Topic 2: Multiple Sequence Alignment
Biological problem: Given a family of proteins, how can we characterize their function domains/motifs?
Computational problem: Given a set of sequences, find the best alignment
Related IR techniques: summarization

Topic 3: Motif Finding
Biological problem:
–Given a set of genes with similar functions
–Find the common transcription factor binding site
Computational problem:
–Given a positive set of sequences and a background set of sequences, find a common pattern that is shared by all/many of the positive sequences, but not common in the background
Related IR techniques: relevance feedback

Motif Discovery
Motif = subsequence pattern
Motif discovery
–Given a target set of sequences (and possibly a background set of sequences)
–Find motifs that characterize the target set
… G-G-T-C-C-T-G-G …

AlignACE Example: Input Data Set
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
… bp of upstream sequence per gene are searched in Saccharomyces cerevisiae.

AlignACE Example: The Target Motif
(Same seven upstream sequences as in the input data set: HIS7, ARO4, ILV6, THR4, ARO1, HOM2, PRO3.)
Aligned motif sites:
AAAAGAGTCA
AAATGACTCA
AAGTGAGTCA
AAAAGAGTCA
GGATGAGTCA
AAATGAGTCA
GAATGAGTCA
AAAAGAGTCA
**********
MAP score = (maximum)
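The eight aligned sites on this slide can be summarized as per-position base counts and a consensus string. The sketch below only post-processes the reported sites; it does not implement AlignACE's own search procedure, and the function names are hypothetical.

```python
from collections import Counter

# The eight aligned motif sites listed on the slide.
SITES = [
    "AAAAGAGTCA",
    "AAATGACTCA",
    "AAGTGAGTCA",
    "AAAAGAGTCA",
    "GGATGAGTCA",
    "AAATGAGTCA",
    "GAATGAGTCA",
    "AAAAGAGTCA",
]


def position_counts(sites: list[str]) -> list[Counter]:
    """One Counter of base frequencies per alignment column."""
    return [Counter(column) for column in zip(*sites)]


def consensus(sites: list[str]) -> str:
    """Most frequent base in each column."""
    return "".join(counts.most_common(1)[0][0] for counts in position_counts(sites))


for position, counts in enumerate(position_counts(SITES), start=1):
    print(position, dict(counts))
print("consensus:", consensus(SITES))
```

Columns 5, 6, and 8 to 10 are fully conserved in these sites, while column 4 is split between T and A.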

Topic 4: Microarray Data Analysis
Biological problem: Given expression values of a set of genes in different conditions, how do we detect genes that are co-expressed/co-regulated?
Computational problem: Given 2-D matrix data, perform clustering
Related IR techniques: clustering
[Figure label: Functional group]
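As a rough illustration of clustering a gene-by-condition matrix, the sketch below groups genes whose expression profiles have a high Pearson correlation, using single-linkage grouping via union-find. The talk does not say which clustering method is meant; the method, the 0.9 threshold, the function names, and the toy expression values are all assumptions made for this example.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def cluster_genes(expression: dict[str, list[float]], threshold: float = 0.9) -> list[list[str]]:
    """Single-linkage groups: genes joined whenever correlation >= threshold."""
    genes = list(expression)
    parent = {g: g for g in genes}

    def find(g: str) -> str:
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            if pearson(expression[a], expression[b]) >= threshold:
                parent[find(a)] = find(b)

    groups: dict[str, list[str]] = {}
    for g in genes:
        groups.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in groups.values())


# Toy matrix: rows are genes, columns are four conditions (made-up values).
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.0, 4.0, 2.1],
}
print(cluster_genes(expression))
```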

Topic 5: Profile HMM
Biological problem: How do we know if a new protein has the same function as any known proteins?
Computational problem:
–Given a set of proteins in the same function family, build an HMM profile for the family
–Given examples of proteins in different families, learn a classifier to classify new proteins
Related IR techniques: text categorization, HMMs

An Example Profile HMM
3 kinds of states: Match, Insertion, Deletion
Output symbols: amino acids
Can be trained with aligned multiple sequences
[Diagram: Begin state, then match (M_j), insertion (I_j), and deletion (D_j) states, then End state]

Uses of Profile HMMs
Detecting potential membership in a family
–Match a sequence to the profile HMMs
–Score a sequence S by p(S|HMM)/p(S|Random)
–Return the top k best-matching profile HMMs for a given sequence
–Given an HMM, find additional sequences in the family
Aligning a sequence to an existing family
–Decode the sequence using Viterbi
–Use the state transition path to align the sequence with the existing sequences in the family
HMMs have many other uses (e.g., gene finding)
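The score p(S|HMM)/p(S|Random) is usually computed in log space. The sketch below shows the idea for a deliberately simplified model with match states only (no insertion or deletion states), so the sequence must have exactly one residue per match state; a full profile HMM would instead sum or maximize over state paths with the forward or Viterbi algorithm. The emission probabilities, the four-letter alphabet, and the function name are made up for illustration.

```python
from math import log


def log_odds(sequence: str,
             match_emissions: list[dict[str, float]],
             background: dict[str, float]) -> float:
    """log [ p(S | match-only profile) / p(S | random background) ]."""
    if len(sequence) != len(match_emissions):
        raise ValueError("this gapless sketch needs one residue per match state")
    return sum(
        log(match_emissions[j][residue] / background[residue])
        for j, residue in enumerate(sequence)
    )


# Hypothetical three-state profile over a toy alphabet {K, P, D, G}.
background = {"K": 0.25, "P": 0.25, "D": 0.25, "G": 0.25}
match_emissions = [
    {"K": 0.7, "P": 0.1, "D": 0.1, "G": 0.1},
    {"K": 0.1, "P": 0.7, "D": 0.1, "G": 0.1},
    {"K": 0.1, "P": 0.1, "D": 0.7, "G": 0.1},
]

print(log_odds("KPD", match_emissions, background))  # positive: family-like
print(log_odds("GGG", match_emissions, background))  # negative: background-like
```

A positive log-odds score means the family model explains the sequence better than the random model, which is the basis for ranking candidate family members.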

Protein Motif Function Prediction
[Diagram: protein sequences (MAPVRKPDMRGLAVFIS, DIRNCKPDSKGLEAEVKR, …, MLQPAKPDLPGLCIYPSVKE, FMLKPDKMGLLTDFGQIA) are passed to SPLASH, which outputs potential motif patterns (KPD.. GL, LQ.. D.. FTD, …, KAVFS.... GQIA, TEIRESIAS); functions = ? (Tyrosine kinase? Signal transducer? …)]
How do we determine the function of a new protein motif?

Motif Function Prediction Method
[Diagram: a sequence (fb|FBgn) linked to GO terms (GO: "intracellular cyc. n. a. action channel", GO: "voltage-gated ion channel", GO: "plasma membrane") and to motifs (APLL..VQY, KCI..SP..LR, GSGSGS)]
Exploit the correlation between motif matching and GO assignment
Which GO term is most strongly correlated with “APLL..VQY”?
Related IR techniques: cross-lingual IR, mutual information
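One way to quantify the correlation between motif matching and GO assignment is the mutual information between two binary variables observed over a set of proteins: "matches the motif" and "is annotated with the GO term". The sketch below computes it from a list of observations. The talk names mutual information but gives no formula, so this is the standard definition rather than the author's exact method; the function name and the toy observations are hypothetical.

```python
from math import log2


def mutual_information(observations: list[tuple[bool, bool]]) -> float:
    """I(M; T) in bits, where each observation is (has_motif, has_go_term)."""
    n = len(observations)
    mi = 0.0
    for motif_value in (True, False):
        for term_value in (True, False):
            joint = sum(1 for m, t in observations
                        if m == motif_value and t == term_value) / n
            p_motif = sum(1 for m, _ in observations if m == motif_value) / n
            p_term = sum(1 for _, t in observations if t == term_value) / n
            if joint > 0:  # 0 * log(0) is taken as 0
                mi += joint * log2(joint / (p_motif * p_term))
    return mi


# Toy data: motif and GO term always co-occur, or are independent.
perfectly_correlated = [(True, True), (True, True), (False, False), (False, False)]
independent = [(True, True), (True, False), (False, True), (False, False)]

print(mutual_information(perfectly_correlated))
print(mutual_information(independent))
```

Ranking all candidate GO terms by this score for a fixed motif gives a simple answer to "which GO term is most strongly correlated with the motif".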

Opportunities for Applying IR Techniques
Data: DNA sequences, protein sequences, protein structures, microarray data, text, functional annotations, …
Techniques: search, filtering, clustering, classification, summarization, text mining, …

Text Information Management
[Diagram labels: Text; Natural Language Content Analysis; Search, Filtering, Categorization, Summarization, Clustering, Extraction, Mining, Visualization; Information Access, Information Organization, Knowledge Acquisition; Search Applications, Mining Applications]

Sequence Information Management
[Diagram labels: Sequences; Sequence Content Analysis; Search, Filtering, Categorization, Summarization, Clustering, Extraction, Mining, Visualization; Information Access, Information Organization, Knowledge Acquisition; Search Applications, Mining Applications]

Challenges in Applying IR to Bioinformatics
Domain expertise barrier
–Causes difficulty in problem definition & evaluation
Signal/noise ratio is poor
–Unlike English, which we know well, the “DNA language” is largely unknown
–Techniques working well for English text may not work well for DNA sequences
Inaccuracy and errors inevitably exist in biological data
–Measurement errors (e.g., sequencing errors)
–Very few derived data (e.g., annotations) have been validated

Challenges in Applying IR to Bioinformatics (continued)
Exploiting all available information about a problem is critical
–How to incorporate domain/prior knowledge? (Need to formalize a biologist’s knowledge)
–Many resources are available, but figuring out how to appropriately take advantage of them is a challenge
Variation of problem formulation
–While a problem may be similar to one in IR at the high level, it is often quite different at the low level
–E.g., sequence search differs from text search in two ways: the query is different, and the matching criterion is different (alignment is needed)
–Direct applications of standard IR techniques may not be effective

Biomedical Literature Access & Mining: General Problems
Basic literature search
–High accuracy, vocabulary matching/switching, entity recognition
Integrative information access
–Literature linked with databases
Information/knowledge extraction
–Genes, relations, networks, inferences
Hypothesis generation/testing
–Exploratory analysis & QA

Some of Our Work
Applying language models to biomedical literature retrieval (TREC 2003 & 2005)
Applying entity recognition to gene name recognition (HLT/NAACL 2006)
Applying summarization to gene summarization (PSB 2006)
Developing an integrated and exploratory biological information system ($5M NSF project)

Biomedical Literature Retrieval
Task: Given an ad hoc query, find relevant literature abstracts from Medline
Challenge:
–Semi-structured queries
–Standard language models are not directly applicable
Solution: Semi-structured query language models

Semi-Structured Queries
TREC-2003 Genomics Track, Topic 1: Find articles about the following gene:
OFFICIAL_GENE_NAME activating transcription factor 2
OFFICIAL_SYMBOL ATF2
ALIAS_SYMBOL HB16
ALIAS_SYMBOL CREB2
ALIAS_SYMBOL TREB7
ALIAS_SYMBOL CRE-BP1
Bag-of-words representation: activating transcription factor 2, ATF2, HB16, CREB2, TREB7, CRE-BP1
Problems with the unstructured representation
–Intuitively, matching “ATF2” should count more than matching “transcription”
–Such a query is not a natural sample of a unigram language model, violating the assumption of the language modeling retrieval approach

Semi-Structured Language Models
[Diagram: semi-structured query → semi-structured query model]
Semi-structured LM estimation: fit a mixture model to pseudo feedback documents using EM
[Table: MAP of the unstructured (Unstruct) vs. semi-structured (Semi-struct) query models, with the relative improvement (Imp.), for TREC 2003 (uniform weights) and TREC 2005 (estimated weights); the numeric values are missing from this transcript]

Biomedical Named Entity Recognition
Task: Recognizing gene names in biomedical literature
Challenge:
–Irregular name variations
–Standard machine learning suffers from overfitting
Solution: Domain-aware adaptation

Challenges in Recognizing Gene Names
No complete dictionaries
–Biologists constantly name newly discovered genes
Long, descriptive gene names
–muscle-specific Xenopus cardiac actin gene promoter
Ambiguity
–Synonyms: octopamine receptor (oa1, oar, amoa1)
–Lexical variations: MIP-1-alpha, MIP-1alpha, (MIP)-1alpha
–Confused with common English words: for (foraging), at (arctops)

Domain Overfitting Problem
When a learning-based gene tagger is applied to a domain different from the training domain(s), the performance tends to decrease significantly.
The same problem occurs in other types of text, e.g., named entities in news articles.

Training domain   Test domain   F1
mouse             …             …
fly               mouse         0.281
Reuters           …             …
Reuters           WSJ           0.643

Adapting Biological Named Entity Recognizer
[Diagram: from the training data of domains T_1, …, T_m, individual domain feature ranking produces rankings O_1, …, O_m; feature re-ranking produces O′, separating generalizable features from domain-specific features. Feature selection takes the top d_0 features for D_0, the top d_1 features for D_1, …, and the top d_m features for D_m, giving d features in total, with parameters λ_0, λ_1, …, λ_m:
d = λ_0 d_0 + (1 – λ_0)(λ_1 d_1 + … + λ_m d_m)
The entity recognizer E is learned on these d features and then tested on the test data.]

Preliminary Evaluation Results
Recognizing gene names
Maximum entropy/logistic regression recognizer
Text data from BioCreAtIvE (Medline)
3 organisms (Fly, Mouse, Yeast), each contributing 5,000 sentences, 2,500 of which contain gene mentions

Training set   Fly, Mouse   Fly, Yeast   Mouse, Yeast
Test set       Yeast        Mouse        Fly
Baseline       …            …            …
Domain         …            …            …
Improvement    +2.2%        +8.6%        +45%

Gene Summarization
Task: Automatically generate a text summary for a given gene
Challenge:
–Need to summarize different aspects of a gene
–Standard summarization methods would generate an unstructured summary
Solution: A new method for generating semi-structured summaries

An Ideal Gene Summary
[Figure: example gene summary annotated with the aspect labels GP, EL, SI, GI, MP, WFPI]

Semi-structured Text Summarization

Summary example (Abl)

BeeSpace Project
$5M National Science Foundation project
A campus-wide collaborative project involving computer scientists, biologists, and information scientists
Develop an integrative exploratory information system that lets a user navigate from biological experiment data to the literature for functional information about the honeybee’s social behavior
URL:

Summary
Bioinformatics already involves a lot of IR
–PubMed search
–Entrez integrated search (e.g., find related articles)
–BLAST
Many IR techniques can be either directly applied or adapted to biomedical literature access & mining
There are strong similarities between bioinformatics problems and text mining problems
There are strong similarities between the methods used in bioinformatics and in IR (which should be mutually beneficial)
There are many opportunities for an IR researcher to contribute to bioinformatics research/development

Stepping into Bioinformatics
Find a biologist partner and treat him/her as your “customer”
Learn basic molecular biology to eliminate the “language barrier”
Attend bioinformatics conferences (ISMB, ECCB, RECOMB, PSB, CSB, …)
Start with biomedical literature access & mining
–Participate in the TREC Genomics Track
–Apply/adapt existing IR techniques
–Develop new IR techniques
Move to information integration (text + databases)
Look for methodology connections
–Language models (especially HMMs, translation models)
–IR heuristics (TF-IDF, pseudo feedback)
–Machine learning
Build systems! (Biologists love easy-to-use software tools)

The Road to Bioinformatics…
[Diagram labels: IR, Biomedical science, Human health, …; $10, $100, $1x10^9, $1x10^6, …]
Good luck!

Thank You!