Opportunities & Challenges in Applying IR Techniques to Bioinformatics
ChengXiang (“Cheng”) Zhai
Department of Computer Science, Institute for Genomic Biology, Graduate School of Library & Information Science
University of Illinois at Urbana-Champaign
Includes slides from NCBI training tutorials and slides from the website of the book “An Intro. to Bioinformatics Algorithms”
DIR 2006 Keynote Talk, 2006 © ChengXiang Zhai 2 Where in the US is UIUC? Picture from Netherlands Consulate Website Chicago
Outline
What is Bioinformatics?
Typical Problems in Bioinformatics
Information Retrieval & Bioinformatics
Biomedical Literature Access & Mining
Summary

What is Bioinformatics?
Management & Exploitation of Biological Data/Info
–Biological information (DNA, gene expression, proteins, literature, …)
–Information management (search, organization, classification)
–Information exploitation (pattern analysis, data mining)

Why is Bioinformatics Important?
Biology perspective
–More and more biological information is available
–Need to effectively access and use the information
–Information analysis supplements (and may even replace) wet lab experiments
Computer science perspective
–Excellent application domain
–Poses special computational challenges
–Brings computer science closer to scientific discovery
Currently growing …

Bioinformatics and Other Fields
[Diagram relating Bioinformatics to: Biology, Molecular Biology, Biophysics, Biochemistry, Computer Science, Theoretical CS, Machine Learning, Data Mining, Information Management, Optimization, Applied Mathematics & Statistics]

Some background about molecular biology…
Life Begins with the Cell
A cell is the smallest structural unit of an organism that is capable of independent functioning
Cells have some common features

All Cells Have Common Cycles
Born, eat, replicate, and die

Example of cell signaling

Some Terminology
The genome is an organism’s complete set of DNA.
–a small bacterial genome contains about 600,000 DNA base pairs
–human and mouse genomes have some 3 billion.
Gene
–basic physical and functional unit of heredity.
–specific sequence of DNA bases that encodes instructions on how to make a protein.
Protein
–makes up the cellular structure
–large, complex molecule made up of smaller subunits called amino acids.

Life Depends on 3 Critical Molecules
DNA
–Holds information on how the cell works
RNA
–Transfers short pieces of information to different parts of the cell
–Provides templates to synthesize proteins
Proteins
–Form enzymes that send signals to other cells and regulate gene activities
–Form the body’s major components (e.g., hair, skin)

The Central Dogma
Central Dogma: DNA → RNA → protein
Transcription: DNA → mRNA
Translation: mRNA → protein

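The two steps of the central dogma can be sketched in a few lines of Python. This is a minimal illustration, not part of the original talk: it assumes the input is the coding strand of the DNA, uses the standard genetic code, and makes the simplifying assumption that translation starts at the first AUG. The function names `transcribe` and `translate` are hypothetical. The test fragment is taken from the start of the apple AFS1 mRNA in the GenBank record shown in these slides.

```python
# Standard genetic code, written compactly with bases in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna):
    """Transcription (DNA -> mRNA) for the coding strand: replace T with U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translation (mRNA -> protein): read codons from the first AUG
    until a stop codon or the end of the sequence (simplifying assumption)."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


fragment = "ACTATGGAATTCAGAGTTCACTTGCAAGCTGATAATGAGCAG"
print(translate(transcribe(fragment)))
```

The printed peptide should agree with the first residues of the /translation field in that GenBank record.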
Biology Research Questions
Are two genes the same?
How are genes regulated?
What are the relations between gene functions and transcription factors?
How can we detect gene regulation networks?
How can we determine protein structures?
How can we determine protein functions?
…

The Central Dogma & Biological Data
Original DNA sequences (genomes)
Expressed DNA sequences (= mRNA sequences = cDNA sequences)
–Expressed Sequence Tags (ESTs)
Protein sequences
–Inferred
–Direct sequencing
Protein structures
–Experiments
–Models (homologues)
Literature information

Entrez Integrates Most Biological DBs
[Diagram of Entrez databases: Nucleotide, PubMed, Protein, Taxonomy, Structure, Domains, 3D Domains, Journals, PMC, OMIM, Books, PopSet, SNP, UniGene, UniSTS, Genome, Gene, GEO, GEO Datasets, MeSH, Cancer Chromosomes, HomoloGene]

Web Access:
Number of Users and Hits Per Day
[Plot of users and hits per day, with an annotation marking Christmas & New Year’s Days]
Currently 10,000,000 to 50,000,000 hits per day!

Types of Databases
Primary Databases
–Original submissions by experimentalists
–Content controlled by the submitter
–Examples: GenBank, SNP, GEO
Derivative Databases
–Built from primary data
–Content controlled by third party (NCBI)
–Examples: RefSeq, RefSNP, GEO Datasets, UniGene, TPA, NCBI Protein, Structure, Conserved Domain

Primary vs. Derivative Sequence Databases
[Diagram: sequencing centers and labs submit sequences to GenBank (updated ONLY by submitters); algorithms, curators, and genome assembly build UniGene and RefSeq from GenBank data (updated continually by NCBI)]

A Traditional GenBank Record: The Flatfile Format

Header
LOCUS       AY bp mRNA linear PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA, complete cds.
ACCESSION   AY
VERSION     AY GI:
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1 (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, (2004)
REFERENCE   2 (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab, USDA-ARS, Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USA
REFERENCE   3 (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab, USDA-ARS, Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:

Feature Table
FEATURES    Location/Qualifiers
  source    /organism="Malus x domestica"
            /mol_type="mRNA"
            /cultivar="'Law Rome'"
            /db_xref="taxon:3750"
            /tissue_type="peel"
  gene      /gene="AFS1"
  CDS       /gene="AFS1"
            /note="terpene synthase"
            /codon_start=1
            /product="(E,E)-alpha-farnesene synthase"
            /protein_id="AAO "
            /db_xref="GI: "
            /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
            NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
            EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
            DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
            GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
            LSLLFQPLVN"

Sequence
ORIGIN
    1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
   61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
  121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
  181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
  241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
//

Bioinformatics Tools

Topics in Bioinformatics
[Diagram: DNA sequences (Genomics); gene expression & regulation, microarray data (Transcriptomics); protein sequences, genes, proteins and their function (Proteomics); biology literature, e.g. “…In this paper, we report the discovery of a new gene that affects DNA reproduction in …” (Retrieval & Text Mining)]

Typical Problems in Bioinformatics
Sequence alignment
–Pairwise
–Multiple-sequence
Motif finding
Gene finding
Protein structure/function prediction
Protein motif function prediction
Literature access & mining
…
This is a quite incomplete list…

Topic 1: Sequence Search
Biological problems
–How do we know whether two genes are similar?
–Given a gene, how can we find similar genes in the genome of another organism?
–Given a protein, how can we find similar proteins?
Computational problems
–Sequence matching/alignment/search
–Given a query sequence, retrieve similar sequences from a database of sequences
Related IR techniques
–Inverted index, sequence similarity, sequence retrieval
The biologists’ sequence search engine is BLAST, arguably the most useful bioinformatics tool, and one that biologists use routinely!

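To make the link between sequence search and the inverted index concrete, here is a minimal sketch that treats overlapping k-mers as "words". It is an illustration under stated assumptions, not BLAST itself: BLAST uses word hits only as seeds and then extends them into scored alignments. The function names and the toy database are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(db, k=3):
    """Inverted index: k-mer -> set of ids of sequences that contain it."""
    index = defaultdict(set)
    for seq_id, seq in db.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index


def search(query, index, k=3):
    """Rank database sequences by the number of distinct k-mers shared
    with the query (ties broken by sequence id)."""
    hits = defaultdict(int)
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    for kmer in query_kmers:
        for seq_id in index.get(kmer, ()):
            hits[seq_id] += 1
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy database (made-up sequences, for illustration only).
db = {"a": "ACGTACGT", "b": "TTTTTTTT", "c": "ACGTTTTT"}
index = build_kmer_index(db, k=3)
print(search("ACGTAC", index, k=3))
```

As in text retrieval, only sequences that share at least one "term" with the query are ever touched, which is what makes the index useful on large databases.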
Topic 2: Multiple Sequence Alignment
Biological problem: Given a family of proteins, how can we characterize their function domains/motifs?
Computational problem: Given a set of sequences, find the best alignment
Related IR techniques: Summarization

Topic 3: Motif Finding
Biological problem:
–Given a set of genes with similar functions
–Find the common transcription factor binding site
Computational problem:
–Given a positive set of sequences and a background set of sequences, find a common pattern that is shared by all/many of the positive sequences, but not common in the background
Related IR techniques: relevance feedback

Motif Discovery
Motif = subsequence pattern
Motif discovery
–Given a target set of sequences (and possibly a background set of sequences)
–Find motifs that characterize the target set
Example: … G-G-T-C-C-T-G-G …

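The target-versus-background formulation can be sketched with plain k-mer counting, in the spirit of selecting feedback terms that are frequent in relevant documents but rare in the collection. This is a deliberately simple stand-in under stated assumptions: exact k-mers only, no mismatches or gaps, and it is not the Gibbs-sampling approach used by tools such as AlignACE. The function names and toy sequences are hypothetical.

```python
from collections import Counter


def kmer_doc_freq(seqs, k):
    """Fraction of sequences that contain each k-mer at least once."""
    counts = Counter()
    for s in seqs:
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return {kmer: c / len(seqs) for kmer, c in counts.items()}


def enriched_kmers(target, background, k, top=3):
    """Score each k-mer by (target frequency - background frequency)
    and return the top-scoring ones as (k-mer, score) pairs."""
    t = kmer_doc_freq(target, k)
    b = kmer_doc_freq(background, k)
    scored = [(t[m] - b.get(m, 0.0), m) for m in t]
    scored.sort(key=lambda x: (-x[0], x[1]))
    return [(m, round(s, 3)) for s, m in scored[:top]]


# Toy data: every target sequence carries GAGTCA, the background does not.
target = ["AAGAGTCATT", "CCGAGTCAGG", "TTGAGTCACC"]
background = ["AAAAAAAAAA", "CCCCCCCCCC", "GGTTGGTTGG"]
print(enriched_kmers(target, background, k=6, top=1))
```

Real motifs are degenerate, so practical methods replace the exact k-mer with a position weight matrix and search for it probabilistically.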
AlignACE Example: Input Data Set
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA
Genes: …HIS7, …ARO4, …ILV6, …THR4, …ARO1, …HOM2, …PRO3
… bp of upstream sequence per gene are searched in Saccharomyces cerevisiae.

AlignACE Example: The Target Motif
[The same seven upstream sequences as in the input data set, with the motif occurrences marked]
Aligned motif sites:
AAAAGAGTCA
AAATGACTCA
AAGTGAGTCA
AAAAGAGTCA
GGATGAGTCA
AAATGAGTCA
GAATGAGTCA
AAAAGAGTCA
**********
MAP score = (maximum)

Topic 4: Microarray Data Analysis
Biological problem: Given expression values of a set of genes in different conditions, how do we detect genes that are co-expressed/co-regulated?
Computational problem: Given 2-D matrix data, perform clustering
Related IR techniques: clustering
[Figure label: Functional group]
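A minimal sketch of the clustering formulation: rows of the expression matrix are genes, columns are conditions, and genes whose profiles correlate strongly are grouped together. The greedy single-pass grouping, the Pearson threshold of 0.9, and the toy expression values are all assumptions chosen for brevity; real analyses typically use hierarchical clustering or k-means.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def cluster_genes(expr, threshold=0.9):
    """Greedy single-pass clustering: a gene joins the first cluster whose
    representative (first member) correlates with it at or above the
    threshold; otherwise it starts a new cluster."""
    clusters = []
    for gene in sorted(expr):
        for members in clusters:
            if pearson(expr[gene], expr[members[0]]) >= threshold:
                members.append(gene)
                break
        else:
            clusters.append([gene])
    return clusters


# Toy matrix: gene -> expression values across four conditions (made up).
expr = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],
    "g3": [4, 3, 2, 1],
    "g4": [8, 6, 4, 2.5],
}
print(cluster_genes(expr))
```

Correlation rather than Euclidean distance is used here so that genes with the same expression pattern at different magnitudes (g1 and g2) still land in the same group.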
Topic 5: Profile HMM
Biological problem: How do we know if a new protein has the same function as any known proteins?
Computational problem:
–Given a set of proteins in the same function family, build an HMM profile for the family
–Given examples of proteins in different families, learn a classifier to classify new proteins
Related IR techniques: text categorization, HMMs

An Example Profile HMM
3 kinds of states: Match, Insertion, Deletion
Output symbols: amino acids
Can be trained with aligned multiple sequences
[Diagram: Begin state, then Match (Mj), Insertion (Ij), and Deletion (Dj) states for each position j, then End state]

Uses of Profile HMMs
Detecting potential membership in a family
–Matching a sequence to the profile HMMs
–Score a sequence S by p(S|HMM)/p(S|Random)
–Return top k best matching profile HMMs for a given sequence
–Given an HMM, find additional sequences in the family
Aligning a sequence to an existing family
–Decoding the sequence using Viterbi
–Using the state transition path to align the sequence with the existing sequences in the family
HMMs have many other uses (e.g., gene finding)

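The score p(S|HMM)/p(S|Random) can be illustrated with a deliberately reduced model: a profile with match states only, which is equivalent to a position-specific scoring matrix. Dropping the insertion and deletion states, the uniform background, and the add-one pseudocount are all simplifying assumptions made here; a full profile HMM would compute p(S|HMM) with the forward algorithm. Function names and the toy family are hypothetical.

```python
from math import log

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Per-column emission probabilities estimated from gap-free aligned
    sequences of equal length (match states only)."""
    profile = []
    for column in zip(*aligned):
        total = len(column) + pseudocount * len(alphabet)
        profile.append(
            {a: (column.count(a) + pseudocount) / total for a in alphabet}
        )
    return profile


def log_odds(seq, profile, background):
    """log [ p(S | profile) / p(S | random) ] for a sequence whose length
    equals the profile length."""
    if len(seq) != len(profile):
        raise ValueError("sequence length must equal profile length")
    return sum(log(profile[j][a] / background[a]) for j, a in enumerate(seq))


# Toy family and a uniform random model (both assumptions).
family = ["KPDGL", "KPDGL", "KPEGL"]
profile = build_profile(family)
background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}

print(log_odds("KPDGL", profile, background))  # positive: looks like the family
print(log_odds("WWWWW", profile, background))  # negative: looks random
```

Ranking profiles by this score for a given sequence corresponds to the "return top k best matching profile HMMs" use above.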
Protein Motif Function Prediction
Protein sequences: MAPVRKPDMRGLAVFIS, DIRNCKPDSKGLEAEVKR, …, MLQPAKPDLPGLCIYPSVKE, FMLKPDKMGLLTDFGQIA
Potential motif patterns (found by SPLASH): KPD..GL, LQ..D..FTD, …, KAVFS....GQIA, TEIRESIAS
Functions = ? (Tyrosine kinase? Signal transducer? …)
How to determine the function of a new protein motif?

Motif Function Prediction Method
[Diagram linking motifs, sequences, and GO terms. Motifs: APLL..VQY, KCI..SP..LR, GSGSGS. Sequence: fb|FBgn. GO terms: "intracellular cyc. n. a. action channel", "voltage-gated ion channel", “plasma membrane"]
Exploit the correlation between motif matching and GO assignment
Which GO term is most strongly correlated with “APLL..VQY”?
Related IR techniques: cross-lingual IR, mutual information

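The correlation between motif matching and GO assignment can be measured with mutual information over two binary variables: "protein matches the motif" and "protein is annotated with the term". The sketch below is an illustration of that measure only, not the method evaluated in the talk; the protein ids, annotations, and function names are hypothetical.

```python
from math import log2


def mutual_information(pairs):
    """Mutual information (in bits) between two binary variables, given a
    list of (x, y) observations with x, y in {0, 1}."""
    n = len(pairs)
    mi = 0.0
    for x in (0, 1):
        for y in (0, 1):
            pxy = sum(1 for p in pairs if p == (x, y)) / n
            px = sum(1 for p in pairs if p[0] == x) / n
            py = sum(1 for p in pairs if p[1] == y) / n
            if pxy > 0:
                mi += pxy * log2(pxy / (px * py))
    return mi


def rank_go_terms(proteins, motif_hits, go_annotations):
    """Rank GO terms by mutual information with motif matching.
    motif_hits: set of protein ids matching the motif.
    go_annotations: GO term -> set of protein ids annotated with it."""
    scores = {}
    for term, annotated in go_annotations.items():
        pairs = [(int(p in motif_hits), int(p in annotated)) for p in proteins]
        scores[term] = mutual_information(pairs)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data (made up): 8 proteins, one motif, two GO terms.
proteins = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"]
motif_hits = {"p1", "p2", "p3", "p4"}
go_annotations = {
    "voltage-gated ion channel": {"p1", "p2", "p3", "p4"},
    "plasma membrane": {"p1", "p2", "p5", "p6"},
}
print(rank_go_terms(proteins, motif_hits, go_annotations))
```

Note that mutual information is symmetric and also rewards strong negative association, so in practice one would check the direction of the correlation before assigning a function.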
Opportunities for Applying IR Techniques
Data: DNA sequences, protein sequences, protein structures, microarray data, text, functional annotations, …
Techniques: search, filtering, clustering, classification, summarization, text mining, …

Text Information Management
[Diagram: natural language content analysis of text supports search, filtering, categorization, summarization, clustering, extraction, mining, and visualization; these serve information access (search applications), information organization, and knowledge acquisition (mining applications)]

Sequence Information Management
[Diagram: the same picture with sequences in place of text: sequence content analysis supports search, filtering, categorization, summarization, clustering, extraction, mining, and visualization; these serve information access (search applications), information organization, and knowledge acquisition (mining applications)]

Challenges in Applying IR to Bioinformatics
Domain expertise barrier
–Causes difficulty in problem definition & evaluation
Signal/noise ratio is poor
–Unlike English, which we know well, the “DNA language” is largely unknown
–Techniques working well for English text may not work well for DNA sequences
Inaccuracy and errors inevitably exist in biological data
–Measurement errors (e.g., sequencing errors)
–Very few derived data (e.g., annotations) have been validated

Challenges in Applying IR to Bioinformatics (cont.)
Exploiting all available information about a problem is critical
–How to incorporate domain/prior knowledge? (Need to formalize a biologist’s knowledge)
–Many resources are available, but figuring out how to appropriately take advantage of them is a challenge
Variation of problem formulation
–While a problem may be similar to one in IR at the high level, it is often quite different at the low level
–E.g., sequence search differs from text search in two ways: the query is different, and the matching criterion is different (need alignment)
–Direct applications of standard IR techniques may not be effective

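To see why the matching criterion differs from term matching, consider global alignment scoring by dynamic programming (the Needleman-Wunsch recurrence). The sketch below is illustrative: the scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions, and real tools use substitution matrices and affine gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score, keeping only two rows of
    the dynamic programming table."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = cur[j - 1] + gap   # b[j-1] aligned to a gap
            cur.append(max(diag, up, left))
        prev = cur
    return prev[-1]


print(global_alignment_score("ACGT", "ACGT"))
print(global_alignment_score("ACGT", "AGT"))
```

Unlike bag-of-words matching, the score depends on the order of symbols and tolerates insertions and deletions, at a cost quadratic in the sequence lengths.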
Biomedical Literature Access & Mining: General Problems
Basic literature search
–High accuracy, vocabulary matching/switching, entity recognition
Integrative information access
–Literature linked with databases
Information/Knowledge extraction
–Genes, relations, networks, inferences
Hypothesis generation/testing
–Exploratory analysis & QA

Some of Our Work
Applying language models to biomedical literature retrieval (TREC 2003 & 2005)
Applying entity recognition to gene name recognition (HLT/NAACL 2006)
Applying summarization to gene summarization (PSB 2006)
Developing an integrated and exploratory biological information system ($5M NSF project)

Biomedical Literature Retrieval
Task: Given an ad hoc query, find relevant literature abstracts from Medline
Challenge: Semi-structured queries; standard language models are not directly applicable
Solution: Semi-structured query language models

Semi-Structured Queries
TREC-2003 Genomics Track, Topic 1: Find articles about the following gene:
OFFICIAL_GENE_NAME activating transcription factor 2
OFFICIAL_SYMBOL ATF2
ALIAS_SYMBOL HB16
ALIAS_SYMBOL CREB2
ALIAS_SYMBOL TREB7
ALIAS_SYMBOL CRE-BP1
Bag-of-words representation: activating transcription factor 2, ATF2, HB16, CREB2, TREB7, CRE-BP1
Problems with unstructured representation
–Intuitively, matching “ATF2” should be counted more than matching “transcription”
–Such a query is not a natural sample of a unigram language model, violating the assumption of the language modeling retrieval approach

Semi-Structured Language Models
Semi-structured query → semi-structured query model
Semi-structured LM estimation: fit a mixture model to pseudo feedback documents using EM
[Table: MAP of the unstructured vs. semi-structured query model and the improvement (%), for TREC 2003 (uniform weights) and TREC 2005 (estimated weights)]

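The idea of a semi-structured query model can be sketched as a mixture of per-field unigram models, p(w|Q) = Σ_i λ_i p(w|Q_i), with documents scored by the cross entropy between the query model and a smoothed document model. This is a hedged illustration under stated assumptions, not the system evaluated at TREC: the weights are uniform rather than estimated with EM, the Dirichlet prior `mu` and the add-one collection smoothing are arbitrary choices, and the documents are made up.

```python
from collections import Counter
from math import log


def field_model(text):
    """Maximum-likelihood unigram model of one query field."""
    words = text.lower().split()
    return {w: c / len(words) for w, c in Counter(words).items()}


def query_model(fields, weights):
    """Mixture p(w|Q) = sum_i weights[i] * p(w|field_i).
    The weights are assumed to sum to 1."""
    model = Counter()
    for name, text in fields.items():
        for w, p in field_model(text).items():
            model[w] += weights[name] * p
    return dict(model)


def score(qmodel, doc, collection, mu=10.0):
    """sum_w p(w|Q) * log p(w|D), with Dirichlet-smoothed p(w|D)."""
    d = Counter(doc.lower().split())
    c = Counter(collection.lower().split())
    dlen, clen = sum(d.values()), sum(c.values())
    total = 0.0
    for w, pq in qmodel.items():
        pc = (c[w] + 1) / (clen + len(c) + 1)  # add-one: unseen words get mass
        pd = (d[w] + mu * pc) / (dlen + mu)
        total += pq * log(pd)
    return total


fields = {"name": "activating transcription factor 2", "symbol": "ATF2"}
qmodel = query_model(fields, {"name": 0.5, "symbol": 0.5})

doc1 = "atf2 binds dna"
doc2 = "transcription of dna"
collection = doc1 + " " + doc2
print(qmodel["atf2"], qmodel["transcription"])
print(score(qmodel, doc1, collection) > score(qmodel, doc2, collection))
```

With uniform field weights, the single-word symbol field gives "atf2" four times the probability of "transcription", which matches the intuition on the slide; a flat bag of words would weight the two equally.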
Biomedical Named Entity Recognition
Task: Recognizing gene names in biomedical literature
Challenge: Irregular name variations; standard machine learning suffers from over-fitting
Solution: Domain-aware adaptation

Challenges in Recognizing Gene Names
No complete dictionaries
–Biologists constantly name newly discovered genes
Long, descriptive gene names
–muscle-specific Xenopus cardiac actin gene promoter
Ambiguity
–Synonyms: octopamine receptor (oa1, oar, amoa1)
–Lexical variations: MIP-1-alpha, MIP-1alpha, (MIP)-1alpha
–Confused with common English words: for (foraging), at (arctops)

Domain Overfitting Problem
When a learning-based gene tagger is applied to a domain different from the training domain(s), the performance tends to decrease significantly.
The same problem occurs in other types of text, e.g., named entities in news articles.
Training domain    Test domain    F1
mouse              mouse          …
fly                mouse          0.281
Reuters            Reuters        …
Reuters            WSJ            0.643

Adapting Biological Named Entity Recognizer
[Diagram: training data T1, …, Tm; individual domain feature ranking O1, …, Om (domain-specific features); feature re-ranking O’ (generalizable features); feature selection for D0, D1, …, Dm giving the top d0, d1, …, dm features; learning on the d selected features yields the entity recognizer E, which is then tested on the test data]
d = λ0 d0 + (1 – λ0) (λ1 d1 + … + λm dm), with weights λ0, λ1, …, λm

Preliminary Evaluation Results
Recognizing gene names
Maximum entropy/logistic regression recognizer
Text data from BioCreAtIvE (Medline)
3 organisms (Fly, Mouse, Yeast), each contributing 5,000 sentences, 2,500 of them with gene mentions
Training set    Fly, Mouse    Fly, Yeast    Mouse, Yeast
Test set        Yeast         Mouse         Fly
Baseline        …             …             …
Domain          …             …             …
Improvement     +2.2%         +8.6%         +45%

Gene Summarization
Task: Automatically generate a text summary for a given gene
Challenge: Need to summarize different aspects of a gene; standard summarization methods would generate an unstructured summary
Solution: A new method for generating semi-structured summaries

An Ideal Gene Summary
[Figure: an example gene summary annotated with the aspects GP, EL, SI, GI, MP, WFPI]

Semi-structured Text Summarization

Summary example (Abl)

BeeSpace Project
$5M National Science Foundation project
A campus-wide collaborative project involving computer scientists, biologists, and information scientists
Develop an integrative exploratory information system allowing a user to navigate from biological experiment data to literature for functional information about the honeybee’s social behavior
URL:

Summary
Bioinformatics already involves a lot of IR
–PubMed search
–Entrez integrated search (e.g., find related articles)
–BLAST
Many IR techniques can be either directly applied or adapted to biomedical literature access & mining
Bioinformatics problems and text mining problems are highly similar
The methods used in bioinformatics and in IR are highly similar (the exchange should be mutually beneficial)
Many opportunities for an IR researcher to contribute to bioinformatics research/development

Stepping into Bioinformatics
Find a biologist partner and treat him/her as your “customer”
Learn basic molecular biology to eliminate the “language barrier”
Attend bioinformatics conferences (ISMB, ECCB, RECOMB, PSB, CSB, …)
Start with biomedical literature access & mining
–Participate in TREC Genomics Track
–Apply/adapt existing IR techniques
–Develop new IR techniques
Move to information integration (Text + Databases)
Look for methodology connections
–Language models (especially HMMs, translation models)
–IR heuristics (TF-IDF, pseudo feedback)
–Machine learning
Build systems! (Biologists love easy-to-use software tools)

The Road to Bioinformatics…
[Diagram: a road from IR to biomedical science to human health, marked with the dollar amounts $10, $100, $1x10^6, $1x10^9, …]
Good luck!

Thank You!