AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright.

Slides:

Advertisements

Similar presentations

Proteins: Structure reflects function….. Fig. 5-UN1 Amino group Carboxyl group carbon.

Advertisements

François Fages MPRI Bio-info 2007 Formal Biology of the Cell Protein structure prediction with constraint logic programming François Fages, Constraint.

Review of Basic Principles of Chemistry, Amino Acids and Proteins Brian Kuhlman: The material presented here is available on the.

Fundamentals of Protein Structure August, 2006 Tokyo University of Science Tadashi Ando.

Proteins Function and Structure.

• Exam II Tuesday 5/10 – Bring a scantron with you!

5’ C 3’ OH (free) 1’ C 5’ PO4 (free) DNA is a linear polymer of nucleotide subunits joined together by phosphodiester bonds - covalent bonds between.

A Zero-Knowledge Based Introduction to Biology Cory McLean 26 Sep 2008 Thanks to George Asimenos.

© 2010 Pearson Education, Inc. Lectures by Chris C. Romero, updated by Edward J. Zalisko PowerPoint ® Lectures for Campbell Essential Biology, Fourth Edition.

Exciting Developments in Molecular Biology As seen by an amateur.

DNA and RNA. I. DNA Structure Double Helix In the early 1950s, American James Watson and Britain Francis Crick determined that DNA is in the shape of.

You Must Know How the sequence and subcomponents of proteins determine their properties. The cellular functions of proteins. (Brief – we will come back.

Copyright  2003 limsoon wong Diagnosis of Childhood Acute Lymphoblastic Leukemia and Optimization of Risk-Benefit Ratio of Therapy Limsoon Wong Institute.

Copyright © 2004 by Jinyan Li and Limsoon Wong Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong.

Proteins and Enzymes Nestor T. Hilvano, M.D., M.P.H. (Images Copyright Discover Biology, 5 th ed., Singh-Cundy and Cain, Textbook, 2012.)

Proteins account for more than 50% of the dry mass of most cells

Exciting Bioinformatics Adventures Limsoon Wong Institute for Infocomm Research.

Unit 7 RNA, Protein Synthesis & Gene Expression Chapter 10-2, 10-3

How does DNA work? What is a gene?

Protein Synthesis. DNA RNA Proteins (Transcription) (Translation) DNA (genetic information stored in genes) RNA (working copies of genes) Proteins (functional.

Proteins account for more than 50% of the dry mass of most cells

Knowledge Discovery in Biomedicine Limsoon Wong Institute for Infocomm Research.

CHAPTER 12 PROTEIN SYNTHESIS AND MUTATIONS -RNA -PROTEIN SYNTHESIS -MUTATIONS.

©CMBI 2006 Amino Acids “ When you understand the amino acids, you understand everything ”

Copyright © 2004 by Limsoon Wong A Biology Review.

How Proteins Are Made Mrs. Wolfe. DNA: instructions for making proteins Proteins are built by the cell according to your DNA What kinds of proteins are.

. Sequence Alignment. Sequences Much of bioinformatics involves sequences u DNA sequences u RNA sequences u Protein sequences We can think of these sequences.

Molecular Biology Primer. Starting 19 th century… Cellular biology: Cell as a fundamental building block 1850s+: ``DNA’’ was discovered by Friedrich Miescher.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong For written notes on this lecture, please read Chapters 4 and 7 of The Practical Bioinformatician.

Selection of Patient Samples and Genes for Disease Prognosis Limsoon Wong Institute for Infocomm Research Joint work with Jinyan Li & Huiqing Liu.

© 2010 Pearson Education, Inc. Lectures by Chris C. Romero, updated by Edward J. Zalisko PowerPoint ® Lectures for Campbell Essential Biology, Fourth Edition.

LESSON 4: Using Bioinformatics to Analyze Protein Sequences PowerPoint slides to accompany Using Bioinformatics : Genetic Research.

Knowledge Discovery from Biological and Clinical Data: BASIC BACKGROUND.

Genetics in ~1920: 1. Cells have chromosomes Sketch of Drosophila chromosomes (Bridges, C. 1913)

Learning Targets “I Can...” -State how many nucleotides make up a codon. -Use a codon chart to find the corresponding amino acid.

Copyright  2004 limsoon wong A Practical Introduction to Bioinformatics Limsoon Wong Institute for Infocomm Research Lecture 2, May 2004 For written notes.

Bertinoro, Nov 2005 Some Data Mining Challenges Learned From Bioinformatics & Actions Taken Limsoon Wong National University of Singapore.

Outline 1.What is an amino acid / protein naturally occurring amino acids 3.Codon – triplet coding for an amino acid 1.How are proteins synthesized.

Copyright  2003 limsoon wong From Informatics to Bioinformatics: The Knowledge Discovery Perspective Limsoon Wong Institute for Infocomm Research Singapore.

NOTES: 2.3 part 2 Nucleic Acids & Proteins. So far, we’ve covered… the following MACROMOLECULES: ● CARBOHYDRATES… ● LIPIDS… Let’s review…

Exploring Alternative Splicing Features using Support Vector Machines Feature for Alternative Splicing Alternative splicing is a mechanism for generating.

Outline What is an amino acid / protein

Macromolecules of Life Proteins and Nucleic Acids

CELL REPRODUCTION: MITOSIS INTERPHASE: DNA replicates PROPHASE: Chromatin condenses into chromosomes, centrioles start migrating METAPHASE: chromosomes.

Chap. 1 basic concepts of Molecular Biology Introduction to Computational Molecular Biology Chapter 1.

End Show Slide 1 of 39 Copyright Pearson Prentice Hall 12-3 RNA and Protein Synthesis 12–3 RNA and Protein Synthesis.

RNA 2 Translation.

Amino Acids ©CMBI 2001 “ When you understand the amino acids, you understand everything ”

Classification (slides adapted from Rob Schapire) Eran Segal Weizmann Institute.

Chapter 3 Proteins.

Copyright © 2005 by Limsoon Wong Some Interesting Issues in Constructing Gene/Protein Networks Limsoon Wong Institute for Infocomm Research Singapore.

TOX680 Unveiling the Transcriptome using RNA-seq Jinze Liu.

CS273a A Zero-Knowledge Based Introduction to Biology Courtesy of George Asimenos.

Limsoon Wong Laboratories for Information Technology Singapore From Datamining to Bioinformatics.

Copyright © 2004 by Jinyan Li and Limsoon Wong Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong.

GOVERNMENT ENGINEERING COLLEGE, BHARUCH Subject : Organic Chemistry and Unit Process.

Prepared By: Syed Khaleelulla Hussaini. Outline Proteins DNA RNA Genetics and evolution The Sequence Matching Problem RNA Sequence Matching Complexity.

Genomics Lecture 3 By Ms. Shumaila Azam. Proteins Proteins: large molecules composed of one or more chains of amino acids, polypeptides. Proteins are.

Protein Synthesis. Review Questions What is the function of DNA? Stores genetic information and holds the instructions for building proteins Why is DNA.

NUCLEIC ACIDS AND PROTEIN SYNTHESIS. DNA complex molecule contains the complete blueprint for every cell in every living thing Amount of DNA that would.

Copyright © 2004 by Jinyan Li and Limsoon Wong Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong.

BIOLOGY 12 Protein Synthesis.

Fanfan Zeng & Roland Yap National University of Singapore Limsoon Wong

Protein Sequence Alignments

Chapter 3 Proteins.

Proteins Genetic information in DNA codes specifically for the production of proteins Cells have thousands of different proteins, each with a specific.

Presentation transcript:

AAAI05 Tutorial on Bioinformatics & Machine Learning Jinyan Li & Limsoon Wong Institute for Infocomm Research 21 Heng Mui Keng Terrace Singapore Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

1. Intro to Biology & Bioinformatics 2. TIS Prediction feature generation & feature selection 3. Tx Optimization for Childhood ALL emerging patterns & PCL 4. Analysis of Proteomics Data decision tree ensembles & CS4 5. Prognosis based on Gene Expression extreme sample selection & ERCOF 6. Subgraphs in Protein Interactions closed patterns 7. Mining Errors from Bio Databases association rules Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Tutorial Plan

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Coverage Data: –DNA sequences, gene expression profiles, proteomics mass-spectra, protein sequences, protein structures, & biomedical text/annotations Feature selection: –  2, t-statistics, signal-to-noise, entropy, CFS Machine learning and data mining: –Decision trees, SVMs, emerging patterns, closed patterns, graph theories

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Body and Cell Our body consists of a number of organs Each organ composes of a number of tissues Each tissue composes of cells of the same type Cells perform two types of function –Chemical reactions needed to maintain our life –Pass info for maintaining life to next generation In particular –Protein performs chemical reactions –DNA stores & passes info –RNA is intermediate between DNA & proteins

DNA Stores instructions needed by the cell to perform daily life function Consists of two strands interwoven together and form a double helix Each strand is a chain of some small molecules called nucleotides Francis Crick shows James Watson the model of DNA in their room number 103 of the Austin Wing at the Cavendish Laboratories, Cambridge Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

ACGT U Classification of Nucleotides 5 diff nucleotides: adenine(A), cytosine(C), guanine(G), thymine(T), & uracil(U) A, G are purines. They have a 2-ring structure C, T, U are pyrimidines. They have a 1-ring structure DNA only uses A, C, G, & T Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

A T  10Å G C Watson-Crick Rule DNA is double stranded in a cell One strand is reverse complement of the other Complementary bases: –A with T (two hydrogen-bonds) –C with G (three hydrogen-bonds) Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Chromosome DNA is usually tightly wound around histone proteins and forms a chromosome The total info stored in all chromosomes constitutes a genome In most multi-cell organisms, every cell contains the same complete set of chromosomes –May have some small diff due to mutation Human genome has 3G bases, organized in 23 pairs of chromosomes

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Gene A gene is a sequence of DNA that encodes a protein or an RNA molecule About 30,000 – 35,000 (protein-coding) genes in human genome For gene that encodes protein –In Prokaryotic genome, one gene corresponds to one protein –In Eukaryotic genome, one gene may correspond to more than one protein because of the process “alternative splicing”

Central Dogma Gene expression consists of two steps –Transcription DNA  mRNA –Translation mRNA  Protein Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Genetic Code Start codon: ATG (code for M) Stop codon: TAA, TAG, TGA Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Protein A sequence composed from an alphabet of 20 amino acids –Length is usually 20 to 5000 amino acids –Average around 350 amino acids Folds into 3D shape, forming the building block & performing most of the chemical reactions within a cell Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Classification of Amino Acids Amino acids can be classified into 4 types Positively charged (basic) –Arginine (Arg, R) –Histidine (His, H) –Lysine (Lys, K) Negatively charged (acidic) –Aspartic acid (Asp, D) –Glutamic acid (Glu, E)

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Classification of Amino Acids Polar (overall uncharged, but uneven charge distribution. can form hydrogen bonds with water. they are called hydrophilic) –Asparagine (Asn, N) –Cysteine (Cys, C) –Glutamine (Gln, Q) –Glycine (Gly, G) –Serine (Ser, S) –Threonine (Thr, T) –Tyrosine (Tyr, Y) Nonpolar (overall uncharged and uniform charge distribution. cant form hydrogen bonds with water. they are called hydrophobic) –Alanine (Ala, A) –Isoleucine (Ile, I) –Leucine (Leu, L) –Methionine (Met, M) –Phenylalanine (Phe, F) –Proline (Pro, P) –Tryptophan (Trp, W) –Valine (Val, V)

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Bioinformatics Applications Bio Data Searching Gene finding Cis-regulatory DNA Gene/Protein Network Protein/RNA Structure Prediction Evolutionary Tree reconstruction Infer Protein Function Disease Diagnosis, Prognosis, & Treatment Optimization,...

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Part 2: Translation Initiation Site Recognition

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong 299 HSU CAT U27655 Homo sapiens CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80 CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160 GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240 CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160 EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240 EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE A Sample cDNA What makes the second ATG the TIS?

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Approach Training data gathering Signal generation –k-grams, distance, domain know-how,... Signal selection –Entropy,  2, CFS, t-test, domain know-how... Signal integration –SVM, ANN, PCL, CART, C4.5, kNN,...

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Training & Testing Data Vertebrate dataset of Pedersen & Nielsen [ISMB’97] 3312 sequences ATG sites 3312 (24.5%) are TIS (75.5%) are non-TIS Use for 3-fold x-validation expts

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Signal Generation K-grams (ie., k consecutive letters) –K = 1, 2, 3, 4, 5, … –Window size vs. fixed position –Up-stream, downstream vs. any where in window –In-frame vs. any frame

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Too Many Signals For each k, there are 4 k * 3 * 2 k-grams If we use k = 1, 2, 3, 4, 5, we have = 8184 features! This is too many for most machine learning algorithms  Need to do signal selection –t-stats,  2, CFS, signal-to-noise, entropy, gini index, info gain, info gain ratio,...

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Signal Selection Basic Idea Choose a signal w/ low intra-class distance Choose a signal w/ high inter-class distance

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Signal Selection by T-Statistics

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Signal Selection by  2

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Sample K-Grams Selected by CFS Position – 3 in-frame upstream ATG in-frame downstream –TAA, TAG, TGA, –CTG, GAC, GAG, and GCC Kozak consensus Leaky scanning Stop codon Codon bias?

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Signal Integration kNN –Given a test sample, find the k training samples that are most similar to it. Let the majority class win SVM –Given a group of training samples from two classes, determine a separating plane that maximises the margin of error Naïve Bayes, ANN, C4.5, CS4,...

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Results (3-fold x-validation)

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Technique Comparisons Pedersen&Nielsen [ISMB’97] –Neural network –No explicit features Zien [Bioinformatics’00] –SVM+kernel engineering –No explicit features Hatzigeorgiou [Bioinformatics’02] –Multiple neural networks –Scanning rule –No explicit features This approach –Explicit feature generation –Explicit feature selection –Use any machine learning method w/o any form of complicated tuning –Scanning rule is optional

F L I M V S P T A Y H Q N K D E C W R G A T E L R S stop How about using k-grams from the translation? mRNA  protein Copyright © 2003, 2004, 2005 by Jinyan Li and Limsoon Wong

Amino-Acid Features Copyright © 2003, 2004, 2005 by Jinyan Li and Limsoon Wong

Amino-Acid Features Copyright © 2003, 2004, 2005 by Jinyan Li and Limsoon Wong

Amino Acid K-Grams Discovered (by entropy) Copyright © 2003, 2004, 2005 by Jinyan Li and Limsoon Wong

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong ATGpr Our method Validation Results (on Chr X and Chr 21) Using top 100 features selected by entropy and train SVM on Pedersen & Nielsen’s

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Part 3: Treatment Optimization of Childhood Leukemia Image credit: FEER

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Childhood ALL Major subtypes are: –T-ALL, E2A-PBX, TEL- AML, MLL genome rearrangements, BCR- ABL, Hyperdiploid>50 Diff subtypes respond differently to same Tx  Over-intensive Tx –Development of secondary cancers –Reduction of IQ  Under-intensiveTx –Relapse The subtypes look similar Conventional diagnosis –Immunophenotyping –Cytogenetics –Molecular diagnostics  Unavailable in most ASEAN countries

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Image credit: Affymetrix Single-Test Platform of Microarray & Machine Learning

Overall Strategy Diagnosis of subtype Subtype- dependent prognosis Risk- stratified treatment intensity For each subtype, select genes to develop classification model for diagnosing that subtype For each subtype, select genes to develop prediction model for prognosis of that subtype Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Subtype Diagnosis by PCL Gene expression data collection Gene selection by  2 PCL Classifier training by emerging pattern Classifier tuning (optional for some machine learning methods) Apply PCL for diagnosis of future cases

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Emerging Patterns An emerging pattern is a set of conditions –usually involving several features –that most members of a class satisfy –but none or few of the other class satisfy

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong PCL: Prediction by Collective Likelihood

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Childhood ALL Subtype Diagnosis Workflow A tree-structured diagnostic workflow was recommended by our doctor collaborator

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Training and Testing Sets

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong The classifiers are all applied to the 20 genes selected by  2 at each level of the tree Accuracy of Various Classifiers

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Understandability of EP & PCL E.g., for T-ALL vs. OTHERS, one ideally discriminatory gene 38319_at was found, inducing these 2 EPs These give us the diagnostic rule

Conclusions Conventional Tx: intermediate intensity to everyone  10% suffers relapse  50% suffers side effects  costs US$150m/yr Our optimized Tx: high intensity to 10% intermediate intensity to 40% low intensity to 50% costs US$100m/yr Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong High cure rate of 80% Less relapse Less side effects Save US$51.6m/yr

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Part 4: Topology of Protein Interaction Networks: Hubs, Cores, Bipartites

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Courtesy of JE Wampler Protein Representations Four types of representations for proteins: primary, secondary, tertiary, and quaternay Protein interactions play an important role in inter cellular communication, in signal transduction, in the regulation of gene expression

Complex based Computational Methods Motif Discovery Domain Interactions Sequence based Structure based (Docking) Experimental Methods (Phage Display, mutagenesis, two hybrid) Interaction Determination Approaches Copyright © 2005 by Jinyan Li and Limsoon Wong

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Protein Interaction Databases PDB---Protein Databank ( reference article: H.M. Berman, etal: The Protein Data Bank. NAR, 28 pp (2000) ) structure info; complex info. Protein Data Bank. BIND---The Biomolecular Interaction Network Database ( Bader et al. NAR, ) The GRID: The General Repository for Interaction Datasets DIP---Database of Interacting Proteins ( Xml format.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Maslov & Sneppen, Science, v296, 2002 A power law: Most neighbors of highly connected nodes have rather low connectivity. The highly connected nodes are called hubs 318 edges 329 nodes In nucleus of S. cerevisiae Graph Representation of Protein Interactions

Yeast SH3 domain-domain Interaction network: 394 edges, 206 nodes Tong et al. Science, v proteins containing SH3 5 binding at least 6 of them K-cores in Protein Interaction Graphs Copyright © 2005 by Jinyan Li and Limsoon Wong

Bipartite Subgraphs Copyright © 2005 by Jinyan Li and Limsoon Wong SH3 Proteins SH3-Binding Proteins

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Some Definitions A graph G=  V G, E G , where V G is a set of vertices and E G is a set of edges. No self loops, no parallel edges, undirected A graph H=  V H, E H  is a subgraph of G, if V H is contained in V G and E H is contained in E G H =  V 1  V 2, E H  is bipartite subgraph of G if –H is a subgraph of G, and –V 1  V 2 = {}, and –every edge in E H joins a vertex in V 1 and a vertex in V 2

V1V1 V2V2 Example Bipartite Subgraph of Protein Interaction Network Copyright © 2005 by Jinyan Li and Limsoon Wong

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Maximal Complete Bipartite Subgraphs A complete bipartite subgraph H =  V 1  V 2, E H  of G is said to be maximal if there is no other complete bipartite subgraph H’ =  V’ 1  V’ 2, E H’  of G such that V 1  V’ 1 and V 2  V’ 2

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong The problem previously studied by Epptein (1994, Info Proc Lett); Makino & Uno (2004, SWAT); Zaki & Ogihara (1998). Study of Protein Interaction Network Topology List all maximal complete bipartite subgraphs from a protein interaction network (Step 1) Discover motifs from these protein groups (Step 2) Form binding motif pairs (Step 3)

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Efficient Mining of Maximal Complete Bipartite Subgraphs Equiv to mining of closed patterns from adjacency matrix of G Why: –the number of closed patterns is even, –is precisely double of the number of the maximal complete bipartite subgraphs, and –there is one-to-one correspondence betw a maximal complete bipartite subgraph & a closed pattern pair Many state-of-the-art algorithms for mining closed patterns

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong A is symmetrical, main diagonal is 0 Adjacency Matrix: A Special Transactional Database (TDB) Let G be a graph with V G = {v 1, v 2, …, v p } Adjacency matrix A of G is a p x p matrix defined by A[i,j] = 1 if (v i, v j )  E G 0 otherwise

Support threshold = 2 Pasquier etal, ICDT 99; Grahne & Zhu, FIMI03 TDB, Freq Patterns, Equiv Classes, & Closed Patterns Copyright © 2005 by Jinyan Li and Limsoon Wong

A protein interaction network taken from GRID Breitkreutz, Genome Biol proteins, interactions, ave size of neighborhood is 3.56 Some Expt Results Mining algorithm---FPclose* Grahne & Zhu, 2003 Copyright © 2005 by Jinyan Li and Limsoon Wong

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Fixed Point Idea The idea: f(x)=x Instantiation in life science, f(DNA)=DNA, f(proteinType) = proteinType (Meng et al. Bioinformatics, 2004) The variable x can be a pair of AA segment patterns, then after the transformations by f, the stable pattern is binding motif pair. (Li and Li, Bioinformatics, 2005)

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Part 5: Treatment Prognosis for DLBC Lymphoma Image credit: Rosenwald et al, 2002

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Diffuse Large B-Cell Lymphoma DLBC lymphoma is the most common type of lymphoma in adults Can be cured by anthracycline-based chemotherapy in 35 to 40 percent of patients  DLBC lymphoma comprises several diseases that differ in responsiveness to chemotherapy Intl Prognostic Index (IPI) –age, “Eastern Cooperative Oncology Group” Performance status, tumor stage, lactate dehydrogenase level, sites of extranodal disease,... Not very good for stratifying DLBC lymphoma patients for therapeutic trials  Use gene-expression profiles to predict outcome of chemotherapy?

“extreme” sample selection ERCOF Copyright © 2005 by Jinyan Li and Limsoon Wong Our Approach

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Short-term Survivors v.s. Long-term Survivors T: sample F(T): follow-up time E(T): status (1:unfavorable; 0: favorable) c 1 and c 2 : thresholds of survival time Short-term survivors who died within a short period F(T) < c 1 and E(T) = 1 Long-term survivors who were alive after a long follow-up time F(T) > c 2 Extreme Sample Selection

ERCOF Entropy- Based Rank Sum Test & Correlation Filtering Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

llm(X) = sign(  k  k *Y k * (X k X) + b) svm(X) = sign(  k  k *Y k * (  X k  X) + b)  svm(X) = sign(  k  k *Y k * K(X k,X) + b) where K(X k,X) = (  X k  X) SVM Kernel Function Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Linear Learning Machine SVM = LLM + Kernel

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Linear Kernel SVM regression function T: test sample, x(i): support vector, y i : class label (1: short-term survivors; -1: long-term survivors) Transformation function (posterior probability) S(T): risk score of sample T Risk Score Construction

Knowledge Discovery from Gene Expression of “Extreme” Samples “extreme” sample selection: 8 yrs knowledge discovery from gene expression 240 samples 80 samples 26 long- term survivors 47 short- term survivors 7399 genes 84 genes T is long-term if S(T) < 0.3 T is short-term if S(T) > 0.7 Copyright © 2005 by Jinyan Li and Limsoon Wong From Rosenwald et al., NEJM 2002

80 test cases p-value of log-rank test: < Risk score thresholds: 0.7, 0.3 Kaplan-Meier Plots Copyright © 2005 by Jinyan Li and Limsoon Wong Improvement over IPI Cases rated as IPI low can still be split p-value =

W/o sample selection (p =0.38) No clear difference on the overall survival of the 80 samples in the validation group of DLBCL study, if no training sample selection conducted Merit of “Extreme” Samples Copyright © 2005 by Jinyan Li and Limsoon Wong W/ sample selection (p < )

Effectiveness of ERCOF Copyright © 2005 by Jinyan Li and Limsoon Wong 22 Total wins

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Conclusions Selecting extreme cases as training samples is an effective way to improve patient outcome prediction based on gene expression profiles and clinical information ERCOF is a systematic feature selection method that is very useful –ERCOF is very suitable for SVM, 3-NN, CS4, Random Forest, as it gives these learning algos highest no. of wins –ERCOF is suitable for Bagging also, as it gives this classifier the lowest no. of errors

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Part 6: Decision Tree Based In-Silico Cancer Diagnosis Bagging, Boosting, Random Forest, Randomization Trees, & CS4

AB A BA x1x1 x2x2 x4x4 x3x3 > a 1 > a 2 Structure of Decision Trees If x 1 > a 1 & x 2 > a 2, then it’s A class C4.5 & CART, two of the most widely used Easy interpretation, but accuracy generally unattractive Leaf nodes Internal nodes Root node B A Copyright © 2005 by Jinyan Li and Limsoon Wong

Elegance of Decision Trees Diagnosis of central nervous system in hematooncologic patients (Lossos et al, Cancer, 2000) Prediction of neurobehavioral outcome in head-injury survivors (Temin et al, J. of Neurosurgery, 1995) Reconstruction of molecular networks from gene expression data (Soinov et al, Genome Biology, 2003) A B BA A Copyright © 2005 by Jinyan Li and Limsoon Wong

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Single coverage & Fragmentation Problem For Decision trees Decision Tree Construction Steps –Select the “best” feature as the root node of the whole tree –After partition by this feature, select the best feature (wrt the subset of training data) as the root node of this sub-tree –Recursively, until the partitions become pure or almost pure 3 Measures to Evaluate Which Feature is Best –Gini index –Information gain –Information gain ratio

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong h 1, h 2, h 3 are indep classifiers w/ accuracy = 60% C 1, C 2 are the only classes t is a test instance in C 1 h(t) = argmax C  {C1,C2} |{h j  {h 1, h 2, h 3 } | h j (t) = C}| Then prob(h(t) = C 1 ) = prob(h 1 (t)=C 1 & h 2 (t)=C 1 & h 3 (t)=C 1 ) + prob(h 1 (t)=C 1 & h 2 (t)=C 1 & h 3 (t)=C 2 ) + prob(h 1 (t)=C 1 & h 2 (t)=C 2 & h 3 (t)=C 1 ) + prob(h 1 (t)=C 2 & h 2 (t)=C 1 & h 3 (t)=C 1 ) = 60% * 60% * 60% + 60% * 60% * 40% + 60% * 40% * 60% + 40% * 60% * 60% = 64.8% Motivating Example

Bagging: Bootstrap Aggregating, Breiman p + 50 n Original training set 48 p + 52 n49 p + 51 n53 p + 47 n … A base inducer such as C4.5 A committee H of classifiers: h 1 h 2 …. h k Draw 100 samples with replacement (injecting radomness) Equal voting Copyright © 2005 by Jinyan Li and Limsoon Wong

AdaBoost: Adaptive Boosting, Freund & Schapire instances with equal weight A classifier h 1 error If error is 0 or >0.5 stop Otherwise re-weight correctly classified instances by: e1/(1-e1) Renormalize total weight to instances with different weights A classifier h 2 error Samples misclassified by h i Copyright © 2005 by Jinyan Li and Limsoon Wong

Random Forest Breiman p + 50 n Original training set 48 p + 52 n49 p + 51 n53 p + 47 n … A base inducer (not C4.5 but revised) A committee H of classifiers: h 1 h 2 …. h k Equal voting Copyright © 2005 by Jinyan Li and Limsoon Wong

Root node Original n number of features Selection is from m try number of randomly chosen features A Revised C4.5 as Base Classifier Similar to Bagging, but the base inducer is not the standard C4.5 Make use twice of randomness Copyright © 2005 by Jinyan Li and Limsoon Wong

Root node Original n number of features Select one randomly from {feature 1: choice 1,2,3 feature 2: choice 1, 2,. feature 8: choice 1, 2, 3 } Total 20 candidates Equal voting on the committee of such decision trees Randomization Trees Dietterich 2000 Make use of randomness in the selection of the best split point Copyright © 2005 by Jinyan Li and Limsoon Wong

1 2 k tree-1 tree-2 tree-k total k trees root nodes CS4: Cascading & Sharing for Decision Trees, Li et al 2003 Selection of root nodes is in a cascading manner! Doesn’t make use of randomness Not equal voting Copyright © 2005 by Jinyan Li and Limsoon Wong

Bagging Random Forest AdaBoost.M1 Randomization Trees CS4 Rules may not be correct when applied to training data Rules correct Summary of Ensemble Classifiers Copyright © 2005 by Jinyan Li and Limsoon Wong

Example of Decision Tree Yeoh et al., Cancer Cell 1: , 2002; Differentiating MLL subtype from other subtypes of childhood leukemia Training data (14 MLL vs 201 others), Test data (6 MLL vs 106 others), Number of features: Given a test sample, at most 3 of the 4 genes’ expression values are needed to make a decision! Copyright © 2005 by Jinyan Li and Limsoon Wong

Performance of CS4 Whether top-ranked features have similar gain ratios Whether cascading trees have similar training performance Whether the trees have similar structure Whether the expanding tree committees can reduce the test errors gradually Gain ratios are: 0.39, 0.36, 0.35, 0.33, 0.33, 0.33, 0.33, 0.32, 0.31, 0.30; 0.30, 0.30, 0.30, 0.29, 0.29, 0.28, 0.28, 0.28, 0.28, Copyright © 2005 by Jinyan Li and Limsoon Wong

Discovery of Diagnostic Biomarkers for Ovarian Cancer Motivation: cure rate ~ 95% if correct diagnosis at early stage Proteomic profiling data obtained from patients’ serum samples The first data set by Petricoin et al was published in Lancet, 2002 Data set of June samples: 91 controls and 162 patients suffering from the disease; features (proteins, peptides, precisely, mass/charge identities) SVM: 0 errors; Na ï ve Bayes: 19 errors; k-NN: 15 errors. Copyright © 2005 by Jinyan Li and Limsoon Wong

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Part 7: Mining Errors from Bio Databases

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Data Cleansing, Koh et al, DBiDB types & 28 subtypes of data artifacts –Critical artifacts (vector contaminated sequences, duplicates, sequence structure violations) –Non-critical artifacts (misspellings, synonyms) > 20,000 seq records in public contain artifacts Identification of these artifacts are impt for accurate knowledge discovery Sources of artifacts –Diverse sources of data Repeated submissions of seqs to db’s Cross-updating of db’s –Data Annotation Db’s have diff ways for data annotation Data entry errors can be introduced Different interpretations –Lack of standardized nomenclature Variations in naming Synonyms, homonyms, & abbrevn –Inadequacy of data quality control mechanisms Systematic approaches to data cleaning are lacking

A Classification of Errors Copyright © 2005 by Limsoon Wong. Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic

RECORD SINGLE SOURCE DATABASE Invalid values Ambiguity Incompatible schema ATTRIBUTE Uninformative sequences Undersized sequences Annotation error Dubious sequences Sequence redundancy Data provenance flaws Cross- annotation error Sequence structure violation Vector contaminated sequence Erroneous data transformation MULTIPLE SOURCE DATABASE Among the 5,146,255 protein records queried using Entrez to the major protein or translated nucleotide databases, 3,327 protein sequences are shorter than four residues (as of Sep, 2004). In Nov 2004, the total number of undersized protein sequences increases to 3,350. Among 43,026,887 nucleotide records queried using Entrez to major nucleotide databases, 1,448 records contain sequences shorter than six bases (as of Sep, 2004). In Nov 2004, the total number of undersized nucleotide sequences increases to 1,711. Copyright © 2005 by Limsoon Wong. Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic Example Meaningless Seqs

RECORD SINGLE SOURCE DATABASE Invalid values Ambiguity Incompatible schema ATTRIBUTE Overlapping intron/exon Annotation error Dubious sequences Sequence redundancy Data Provenance flaws Cross- annotation error Sequence structure violation Vector contaminated sequence Erroneous data transformation MULTIPLE SOURCE DATABASE Syn7 gene of putative polyketide synthase in NCBI TPA record BN has overlapping intron 5 and exon 6. rpb7+ RNA polymerase II subunit in GENBANK record AF has overlapping exon 1 and exon 2. Example Overlapping Intron/Exon Copyright © 2005 by Limsoon Wong. Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic

RECORD SINGLE SOURCE DATABASE Invalid values Ambiguity Incompatible schema ATTRIBUTE Annotation error Dubious sequences Sequence redundancy Data provenance flaws Cross- annotation error Sequence structure violation Vector contaminated sequence Erroneous data transformation MULTIPLE SOURCE DATABASE Replication of sequence information Different views Overlapping annotations of the same sequence Submission of the same sequence to different databases Repeated submission of the same sequence to the same database Initially submitted by different groups Protein sequences may be translated from duplicate nucleotide sequences =protein&list_uids= &dopt=GenPept Example Seqs w/ Identical Info Copyright © 2005 by Limsoon Wong. Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic

Association Rule Mining for De-duplication Compute similarity scores from known duplicate pairs Select matching criteria Generate association rules Detect duplicates using the rules Copyright © 2005 by Jinyan Li and Limsoon Wong.

Features to Match Copyright © 2005 by Limsoon Wong. Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong AAG39642 AAG39643 AC0.9 LE1.0 DE1.0 DB1 SP1 RF1.0 PD0 FT1.0 SQ1.0 AAG39642 Q9GNG8 AC0.1 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT0.1 SQ1.0 P00599 PSNJ1W AC0.2 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0 P01486 NTSREB AC0.0 LE1.0 DE0.3 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0 O57385 CAA11159 AC0.1 LE1.0 DE0.5 DB0 SP1 RF0.0 PD0 FT0.1 SQ1.0 S32792 P24663 AC0.0 LE1.0 DE0.4 DB0 SP1 RF0.5 PD0 FT1.0 SQ1.0 P45629 S53330 AC0.0 LE1.0 DE0.2 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0 LE1.0 PD0 SQ1.0 (99.7%) SP1 PD0 SQ1.0 (97.1%) SP1 LE1.0 PD0 SQ1.0 (96.8%) DB0 PD0 SQ1.0 (93.1%) DB0 LE1.0 PD0 SQ1.0 (92.8%) DB0 SP1 PD0 SQ1.0 (90.4%) DB0 SP1 LE1.0 PD0 SQ1.0 (90.1%) RF1.0 SP1 LE1.0 PD0 SQ1.0 (47.6%) RF1.0 DB0 LE1.0 PD0 SQ1.0 (44.0%) Similarity scores of known duplicate pairs Association rule mining Frequent item-set with support Association Rule Mining

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Scorpion venom dataset containing 520 records 695 duplicate pairs are collectively identified. Snake PLA2 venom dataset containing 780 records Entrez (GenBank, GenPept, SwissProt, DDBJ, PIR, PDB) 251 duplicate pairs 444 duplicate pairs scorpion AND (venom OR toxin) serpentes AND venom AND PLA2 Expert annotation Dataset

S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.1%)Rule 7 S(Seq)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.4%)Rule 6 S(Seq)=1 ^ M(Seq Length)=1 ^ M(PDB)=0 ^ M(DB)=0 (92.8%)Rule 5 S(Seq)=1^ M(PDB)=0 ^ M(DB)=0 (93.1%)Rule 4 S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 (96.8%)Rule 3 S(Seq)=1 ^ M(PDB)=0 ^ M(Species)=1 (97.1%)Rule 2 S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0 (99.7%)Rule 1 Results Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic Rule 1. Identical sequences with the same sequence length and not originated from PDB are 99.7% likely to be duplicates. Rule 2. Identical sequences with the same sequence length and of the same species are 97.1% likely to be duplicates. Rule 3. Identical sequences with the same sequence length, of the same species and not originated from PDB are 96.8% likely to be duplicates.

Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong Acknowledgements TIS Prediction Huiqing Liu, Roland Yap, Fanfan Zeng Treatment Optimization for Childhood ALL James Downing, Huiqing Liu, Allen Yeoh Prognosis based on Gene Expression Profiling Huiqing Liu Analysis of Proteomics Data Huiqing Liu Subgraphs in Protein Interactions Haiquan Li, Donny Soh, Chris Tan, See Kiong Ng Mining Errors from Bio Databases Vladimir Brusic, Judice Koh, Mong Li Lee