Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong Copyright © 2004 by Jinyan Li and Limsoon Wong.



Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains
Part 1: Background and Introduction

Outline
Practical Introduction to Biology
Brief Introduction to Bioinformatics
Overview of Gene Expression Profiling
Overview of Proteomic Profiling
A Motivating Biomedical Application
Brief Introduction to Some Data Sets

Practical Introduction to Biology

History
1866: Mendel discovered genetics
1869: DNA discovered
1944: Avery & McCarty demonstrated DNA as carrier of genetic info
1953: Watson & Crick deduced 3D struct of DNA
1960: Elucidation of genetic code, mapping DNA to protein
1970: Development of DNA sequencing techniques: sequence segmentation and electrophoresis
1980: Development of PCR: exploiting natural replication to amplify DNA samples so that there is enough for doing expt
1990: Human Genome Project
2002: Human genome published
Now: Understanding the detailed mechanisms of the cell

Body
Our body consists of a number of organs
Each organ is composed of a number of tissues
Each tissue is composed of cells of the same type

Cell
Performs two types of function
– Chemical reactions necessary to maintain our life
– Pass info for maintaining life to next generation
In particular
– Protein performs chemical reactions
– DNA stores & passes info
– RNA is intermediate between DNA & proteins

Protein
A sequence composed from an alphabet of 20 amino acids
– Length is usually 20 to 5000 amino acids
– Average around 350 amino acids
Folds into 3D shape, forming the building blocks & performing most of the chemical reactions within a cell

Amino Acid
Each amino acid consists of
– Amino group
– Carboxyl group
– R group
[Figure: amino acid structure, showing the amino group (NH2), the carboxyl group (COOH), and the R group attached to the central carbon]

Classification of Amino Acids
Amino acids can be classified into 4 types.
Positively charged (basic)
– Arginine (Arg, R)
– Histidine (His, H)
– Lysine (Lys, K)
Negatively charged (acidic)
– Aspartic acid (Asp, D)
– Glutamic acid (Glu, E)

Classification of Amino Acids
Polar (overall uncharged, but uneven charge distribution; can form hydrogen bonds with water; called hydrophilic)
– Asparagine (Asn, N)
– Cysteine (Cys, C)
– Glutamine (Gln, Q)
– Glycine (Gly, G)
– Serine (Ser, S)
– Threonine (Thr, T)
– Tyrosine (Tyr, Y)
Nonpolar (overall uncharged and uniform charge distribution; cannot form hydrogen bonds with water; called hydrophobic)
– Alanine (Ala, A)
– Isoleucine (Ile, I)
– Leucine (Leu, L)
– Methionine (Met, M)
– Phenylalanine (Phe, F)
– Proline (Pro, P)
– Tryptophan (Trp, W)
– Valine (Val, V)

Protein & Polypeptide Chain
Formed by joining amino acids via peptide bonds
One end is the amino group, called the N-terminus
The other end is the carboxyl group, called the C-terminus
[Figure: two amino acids (side chains R and R’) joining to form a peptide bond]

DNA
Stores the instructions needed by the cell to perform its daily life functions
Consists of two strands interwoven together to form a double helix
Each strand is a chain of small molecules called nucleotides
[Photo: Francis Crick shows James Watson the model of DNA in their room, number 103 of the Austin Wing at the Cavendish Laboratories, Cambridge]

Nucleotide
Consists of three parts:
– Deoxyribose
– Phosphate (bound to the 5’ carbon)
– Base (bound to the 1’ carbon)
[Figure: a nucleotide with base (adenine), deoxyribose with carbons numbered 1’ to 5’, and phosphate]

Classification of Nucleotides
5 diff nucleotides: adenine (A), cytosine (C), guanine (G), thymine (T), & uracil (U)
A, G are purines. They have a 2-ring structure
C, T, U are pyrimidines. They have a 1-ring structure
DNA only uses A, C, G, & T

Watson-Crick rules
Complementary bases:
– A with T (two hydrogen bonds)
– C with G (three hydrogen bonds)
[Figure: A-T and G-C base pairs, labelled 10Å]

Orientation of a DNA
One strand of DNA is generated by chaining together nucleotides, forming a phosphate-sugar backbone
It has direction: from 5’ to 3’, because DNA always extends from the 3’ end
– Upstream: towards the 5’ end
– Downstream: towards the 3’ end
[Figure: strand ACGTA on its phosphate-sugar backbone, labelled 5’ to 3’]

Double Stranded DNA
DNA is double stranded in a cell
The two strands are anti-parallel
One strand is the reverse complement of the other
The double strands are interwoven to form a double helix
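
A minimal Python sketch of the reverse-complement relation between the two strands, using the Watson-Crick pairs (A with T, C with G). The function name is illustrative only, and the strand ACGTA is used purely as an example input.

```python
# Watson-Crick pairing for DNA bases: A<->T, C<->G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    # Reverse first (the strands are anti-parallel), then pair each base.
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


print(reverse_complement("ACGTA"))  # TACGT
```

Applying the function twice returns the original strand, which is one quick sanity check on the pairing table.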

Location of DNA in a Cell
Two types of organisms
– Prokaryotes (single-celled organisms with no nuclei, e.g., bacteria)
– Eukaryotes (organisms with single or multiple cells whose cells have nuclei, e.g., plants & animals)
In prokaryotes, DNA floats freely within the cell
In eukaryotes, DNA is located within the nucleus

Chromosome
DNA is usually tightly wound around histone proteins and forms a chromosome
The total info stored in all chromosomes constitutes a genome
In most multi-cell organisms, every cell contains the same complete set of chromosomes
– May have some small differences due to mutation
Human genome has 3G base pairs, organized in 23 pairs of chromosomes

Gene
A gene is a sequence of DNA that encodes a protein or an RNA molecule
About 30,000 – 35,000 (protein-coding) genes in human genome
For a gene that encodes protein
– In a prokaryotic genome, one gene corresponds to one protein
– In a eukaryotic genome, one gene can correspond to more than one protein because of the process called “alternative splicing”

Complexity of Organism vs. Genome Size
Human Genome: 3G base pairs
Amoeba dubia (a single cell organism): 600G base pairs
⇒ Genome size has no relationship with the complexity of the organism

Number of Genes vs. Genome Size
Prokaryotic genome (e.g., E. coli)
– Number of base pairs: 5M
– Number of genes: 4k
– Average length of a gene: 1000 bp
Eukaryotic genome (e.g., human)
– Number of base pairs: 3G
– Estimated number of genes: 30k – 35k
– Estimated average length of a gene: bp
~90% of the E. coli genome consists of coding regions
< 3% of the human genome is believed to be coding regions
⇒ Genome size has no relationship with the number of genes!

RNA
RNA has properties of both DNA & protein
– Similar to DNA, it can store & transfer info
– Similar to protein, it can form complex 3D structure & perform some functions
Nucleotide for RNA has three parts:
– Ribose sugar (has an extra OH group at 2’)
– Phosphate (bound to 5’ carbon)
– Base (bound to 1’ carbon)
[Figure: an RNA nucleotide with base (adenine), ribose sugar with carbons numbered 1’ to 5’, and phosphate]

RNA vs DNA
RNA is single stranded
Nucleotides of RNA are similar to those of DNA, except that they have an extra OH at position 2’
– Due to this extra OH, RNA can form more hydrogen bonds than DNA
– So RNA can form complex 3D structure
RNA uses the base U instead of T
– U is chemically similar to T
– In particular, U is also complementary to A

Mutation
Sudden change of genome
Basis of evolution
Cause of cancer
Can occur in DNA, RNA, & Protein

Central Dogma
Gene expression consists of two steps
– Transcription: DNA → mRNA
– Translation: mRNA → Protein

Transcription
Synthesize mRNA from one strand of DNA
– An enzyme, RNA polymerase, temporarily separates double-stranded DNA
– It begins transcription at transcription start site
– A → A, C → C, G → G, & T → U
– Once RNA polymerase reaches transcription stop site, transcription stops
Additional “steps” for Eukaryotes
– Transcription produces pre-mRNA that contains both introns & exons
– 5’ cap & poly-A tail are added to pre-mRNA
– RNA splicing removes introns & mRNA is made
– mRNA are transported out of nucleus
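
The base mapping above (A → A, C → C, G → G, T → U) can be sketched in a few lines of Python. This is a simplification that assumes the input is the strand whose sequence matches the mRNA; start and stop sites, capping, and splicing are not modelled, and the function name is illustrative.

```python
def transcribe(dna: str) -> str:
    """Map a DNA sequence to mRNA using A->A, C->C, G->G, T->U."""
    return dna.upper().replace("T", "U")


print(transcribe("ATGGCT"))  # AUGGCU
```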

Translation
Synthesize protein from mRNA
Each amino acid is encoded by a consecutive seq of 3 nucleotides, called a codon
The decoding table from codon to amino acid is called the genetic code
4^3 = 64 diff codons
⇒ Codons are not 1-to-1 corr to 20 amino acids
All organisms use the same decoding table
Recall that amino acids can be classified into 4 groups. A single-base change in a codon is usually not sufficient to cause a codon to code for an amino acid in a different group

Genetic Code
Start codon: ATG (codes for M)
Stop codons: TAA, TAG, TGA
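
A hedged Python sketch of decoding with the genetic code. It builds the standard 64-codon table from a compact string (bases ordered T, C, A, G, with `*` marking a stop codon), then translates from the first ATG until a stop codon. Scanning for the first ATG is a simplifying assumption, not a claim about how a real translation start site is chosen, and the names below are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate from the first ATG up to (not including) a stop codon."""
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # TAA, TAG, or TGA
            break
        protein.append(amino_acid)
    return "".join(protein)


print(len(CODON_TABLE))               # 64
print(translate("CCATGGCTTGGTAAGG"))  # MAW
```

The table has 64 entries for 20 amino acids plus stop, so several codons necessarily map to the same amino acid.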

Gene Structure
[Figure: gene structure, with the coding region marked. Image credit: Bajic]

Ribosome
Translation is handled by a molecular complex, the ribosome, which consists of both proteins & ribosomal RNA (rRNA)
Ribosome reads mRNA & the translation starts at a start codon (the translation start site)
With help of tRNA, each codon is translated to an amino acid
Translation stops once ribosome reads a stop codon (the translation stop site)

tRNA
61 diff tRNAs, each corresponding to a non-termination codon
Each tRNA folds to form a cloverleaf-shaped structure
– One side holds an anticodon
– The other side holds the appropriate amino acid

Introns and exons
Eukaryotic genes contain introns & exons
– Introns are seq that are ultimately spliced out of mRNA
– Introns normally satisfy the GT-AG rule, viz. begin w/ GT & end w/ AG
– Each gene can have many introns & each intron can have thousands of bases
Introns can be very long
An extreme example is a gene associated with cystic fibrosis in human:
– Length of 24 introns ~1Mb
– Length of exons ~1kb
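
A small Python sketch of the GT-AG rule and of splicing, assuming the intron positions are already known and given as half-open (start, end) index pairs. The function names and the example sequence are hypothetical; finding introns in a real gene is a much harder problem than this check.

```python
def follows_gt_ag_rule(intron: str) -> bool:
    """True if the intron begins with GT and ends with AG."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def splice(pre_mrna: str, introns: list) -> str:
    """Remove the given (start, end) intron intervals and join the exons."""
    exons, position = [], 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])
        position = end
    exons.append(pre_mrna[position:])
    return "".join(exons)


# Hypothetical example: exon ATGGC, intron GTAAGTTTCAG, exon TTGA.
pre_mrna = "ATGGC" + "GTAAGTTTCAG" + "TTGA"
print(follows_gt_ag_rule(pre_mrna[5:16]))  # True
print(splice(pre_mrna, [(5, 16)]))         # ATGGCTTGA
```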

Basic Biotechnology Tools
Cutting & breaking DNA
– Restriction Enzymes
– Shotgun method
Copying DNA
– Cloning
– Polymerase Chain Reaction (PCR)
Measuring length of DNA
– Gel Electrophoresis

Restriction Enzymes
Recognize a certain point in DNA w/ a particular pattern, called a restriction site, & break the DNA there. This process is called digestion
In nature, restriction enzymes are used to break foreign DNA to avoid infection
Example
– EcoRI is the 1st restriction enzyme discovered that cuts DNA wherever the sequence GAATTC is found
– As with most other restriction enzymes, its restriction site GAATTC is a palindrome (it reads the same as its reverse complement)
More than 300 restriction enzymes are known
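
A Python sketch of digestion at GAATTC sites, plus a check of the palindrome property. The cut position within the site is taken to be just after the leading G, which is how EcoRI is usually described; the slide itself does not state it, so `cut_offset` is a labelled assumption here. All names and the example sequence are illustrative.

```python
PAIRING = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.upper().translate(PAIRING)[::-1]


def is_palindromic_site(site: str) -> bool:
    """A restriction site is a palindrome if it equals its reverse complement."""
    return site.upper() == reverse_complement(site)


def find_sites(dna: str, site: str = "GAATTC") -> list:
    """Return the start index of every occurrence of the site."""
    positions, i = [], dna.find(site)
    while i != -1:
        positions.append(i)
        i = dna.find(site, i + 1)
    return positions


def digest(dna: str, site: str = "GAATTC", cut_offset: int = 1) -> list:
    """Cut the sequence at every site; cut_offset=1 is an assumption (G^AATTC)."""
    fragments, previous = [], 0
    for position in find_sites(dna, site):
        cut = position + cut_offset
        fragments.append(dna[previous:cut])
        previous = cut
    fragments.append(dna[previous:])
    return fragments


print(is_palindromic_site("GAATTC"))  # True
print(digest("TTGAATTCAAGGAATTCCC"))  # ['TTG', 'AATTCAAGG', 'AATTCCC']
```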

Shotgun Method
Break DNA molecule into small pieces randomly like this:
– Solution w/ a large amount of purified DNA
– Apply high vibration to break each molecule randomly into small fragments

Cloning by Plasmid Vector
Insert DNA X into plasmid vector w/ antibiotic-resistance gene & a recombinant DNA molecule is formed
Insert recombinants into host cells
Grow host cells in presence of antibiotic
– Only cells w/ antibiotic-resistance gene can grow
– When we duplicate host cell, X is also duplicated
Select cells w/ antibiotic-resistance genes
Kill them & extract X

Polymerase Chain Reaction (PCR)
Allows rapid replication of a selected region of a DNA w/o need for a living cell
Inputs for PCR:
– Two oligonucleotides are synthesized, each complementary to one of the two ends of the region. They are used as primers
– Thermostable DNA polymerase (Taq polymerase)
Repeats a cycle w/ 3 phases multiple times. Each cycle takes ~5min
– Phase 1: separate double stranded DNA by heat
– Phase 2: cool; add synthesis primers
– Phase 3: add Taq DNA polymerase to catalyze 5’ to 3’ DNA synthesis
⇒ Selected region has been amplified exponentially
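
A back-of-the-envelope Python sketch of exponential amplification, assuming the idealized case in which every cycle doubles the number of copies of the selected region. The cycle count of 30 in the example is a hypothetical value chosen for illustration; the ~5 min per cycle comes from the slide.

```python
def copies_after(cycles: int, initial_copies: int = 1) -> int:
    """Idealized PCR yield: each cycle doubles the copy count."""
    return initial_copies * 2 ** cycles


def total_minutes(cycles: int, minutes_per_cycle: int = 5) -> int:
    """Running time at roughly 5 minutes per cycle."""
    return cycles * minutes_per_cycle


cycles = 30  # hypothetical cycle count, for illustration only
print(copies_after(cycles))   # 1073741824
print(total_minutes(cycles))  # 150
```

Real reactions fall short of perfect doubling, so these figures are an upper bound rather than a prediction.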

Gel Electrophoresis
Used to separate a mixture of DNA fragments of different lengths
Apply an electrical field to mixture of DNA
Note that DNA is negatively charged. Small molecules travel faster than large molecules
Mixture is separated into bands, each containing DNA molecules of same length
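
The separation into bands can be mimicked in Python by grouping fragments by length. This sketch only models the ordering (shorter fragments travel farther); it makes no attempt to model actual migration distances, and the function name and fragments are made up for the example.

```python
from collections import defaultdict


def separate_into_bands(fragments: list) -> list:
    """Group fragments by length; shortest (farthest-travelling) band first."""
    bands = defaultdict(list)
    for fragment in fragments:
        bands[len(fragment)].append(fragment)
    return [(length, bands[length]) for length in sorted(bands)]


mixture = ["ACGTAC", "AC", "TTGG", "GG", "CATG"]
for length, band in separate_into_bands(mixture):
    print(length, band)
# 2 ['AC', 'GG']
# 4 ['TTGG', 'CATG']
# 6 ['ACGTAC']
```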

Sequencing by Gel Electrophoresis
An application of gel electrophoresis is to reconstruct a DNA sequence within a few hours
– Generate all sequences ending with A
– Separate sequences ending with A into diff bands using gel electrophoresis
⇒ Such info tells us positions of A’s in the DNA
– Similarly for C, G, & T

Sequencing by Gel Electrophoresis: Reading the Sequence
Four groups of fragments: A, C, G, & T
Fragments are placed at the negative end
Fragments move to the positive end
From relative distances of fragments, reconstruct the sequence
[Figure: gel with four lanes, read off as ……TCAACATGT]
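
A Python sketch of reading a sequence from the four groups of fragments. It assumes each group is given as the list of fragment lengths observed in its lane, so a fragment of length n ending in A means position n of the sequence is A. The helper `lanes_for` merely simulates that input from a known sequence; both function names are hypothetical.

```python
def lanes_for(sequence: str) -> dict:
    """Simulate the four lanes: for each base, the lengths of prefixes ending in it."""
    lanes = {base: [] for base in "ACGT"}
    for position, base in enumerate(sequence.upper(), start=1):
        lanes[base].append(position)
    return lanes


def read_sequence(lanes: dict) -> str:
    """Order all fragments by length and read off the final base of each."""
    base_at_length = {}
    for base, lengths in lanes.items():
        for length in lengths:
            base_at_length[length] = base
    return "".join(base_at_length[n] for n in sorted(base_at_length))


lanes = lanes_for("TCAACATGT")
print(lanes["A"])            # [3, 4, 6]
print(read_sequence(lanes))  # TCAACATGT
```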

Hybridization
Among thousands of DNA fragments, biologists routinely need to find a DNA fragment that contains a particular DNA subsequence
This can be done based on hybridization
– Suppose we need to find DNA fragments that contain ACCGAT
– Make probes that are the reverse complement of ACCGAT
– Mix probes w/ DNA fragments
– Due to the hybridization rule (A=T, C≡G), DNA fragments containing ACCGAT will hybridize with the probes
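
A simplified Python sketch of probe-based search, assuming exact matching against single-stranded fragments. Real hybridization tolerates mismatches and depends on experimental conditions, none of which is modelled here; the fragment list is invented for the example.

```python
PAIRING = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.upper().translate(PAIRING)[::-1]


def make_probe(target: str) -> str:
    """The probe is the reverse complement of the subsequence we look for."""
    return reverse_complement(target)


def hybridizes(fragment: str, probe: str) -> bool:
    """Exact-match model: the probe binds where its reverse complement occurs."""
    return reverse_complement(probe) in fragment.upper()


probe = make_probe("ACCGAT")
fragments = ["TTACCGATGG", "GGGTTTAAAC", "CACCGATA"]  # made-up fragments
print(probe)                                          # ATCGGT
print([f for f in fragments if hybridizes(f, probe)])
# ['TTACCGATGG', 'CACCGATA']
```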

Brief Introduction to Bioinformatics

Some Bioinformatics Problems
Biological Data Searching
Gene/Promoter finding
Cis-regulatory DNA
Gene/Protein Network
Protein/RNA Structure Prediction
Evolutionary Tree reconstruction
Infer Protein Function
Disease Diagnosis
Disease Prognosis
Disease Treatment Optimization, ...

Biological Data Searching
Biological data is increasing rapidly
Biologists need to locate required info
Difficulties:
– Too much
– Too heterogeneous
– Too distributed
– Too many errors
– Due to mutation, need approximate search

Cis-Regulatory DNAs
Cis-regulatory DNAs control whether genes should be expressed or not
Cis-regulatory DNAs may be located in the promoter region, an intron, or an exon
Finding and understanding cis-regulatory DNAs is one of the key problems in coming years
Image credit: US DOE

Gene Networks
The inside of a cell is a complex system
Expression of one gene depends on expression of another gene
Such interactions can be represented using a gene network
Understanding such networks helps identify associations betw genes & diseases

Protein/RNA structure prediction
Structure of a protein/RNA is essential to its functionality
Important to have some way to predict the structure of a protein/RNA given its sequence
This problem is important & is widely considered a “grand challenge” problem in bioinformatics
Image credit: Kolatkar

Evolutionary Tree Reconstruction
Protein/RNA/DNA mutates
An evolutionary tree studies the evolutionary relationship among a set of proteins/RNAs/DNAs
Figures out origin of species
[Figure: evolutionary tree from the root to the present, with African, Asian, Papuan, and European branches. Image credit: Sykes]

Gene Expression Profiling

What’s a Microarray?
Idea of hybridization leads to DNA array tech
In the past, “one gene in one experiment”
⇒ Hard to get whole picture
A DNA array is a technology that contains a large number of “DNA molecules” spotted on glass slides, nylon membranes, or silicon wafers
⇒ Measure expression of thousands of genes simultaneously

Affymetrix GeneChip™ Array
Image credit: Affymetrix

Making GeneChip™ Array
Quartz is washed to ensure uniform hydroxylation across its surface and to attach linker molecules
Exposed linkers become deprotected and are available for nucleotide coupling
Image credit: Affymetrix

Gene Expression Measurement by GeneChip™ Array
Image credit: Affymetrix

A Sample GeneChip™ File

Applications of DNA arrays
Sequencing by hybridization
– Promising alternative to sequencing by gel electrophoresis
– May be able to reconstruct longer DNA sequences in shorter time
⇒ SNP discovery
Profiling of gene expression
– DNA arrays allow us to monitor activities within a cell
– Each spot contains complement of a gene
– Due to hybridization, we can measure concentration of diff mRNAs within cell
⇒ disease diagnosis
⇒ disease prognosis
⇒ target discovery

Proteomic Profiling

What is Proteomics?
It is proteins that are directly involved in both normal and disease-associated biochemical processes
A more complete understanding of disease may be gained by looking directly at the proteins present within a diseased cell or tissue
Achieved thru the proteome & proteomics
Proteomics is the scientific discipline that detects proteins associated with a disease by means of their altered levels of expression betw control & disease states
Proteome research permits discovery of new protein markers for diagnostic purposes & of novel molecular targets for drug discovery

Expt Procedures in Proteomics
Separation
– electrophoresis (1-D, 2-D)
– chromatography (SEC, ion exchange, reversed phase)
Digestion
– chemical (BrCN)
– enzymatic (trypsin, Lys-C, Asp-C)
– reduction (Di-Thio-Threitol, β-Mercapto-Ethanol)
– alkylation (IodoAcAcid, IodoAcAmide, Vinyl Pyridine)
Sample clean-up
– chromatography (rev phase)
– solid phase extraction (Zip Tip)
MS analysis
– protein identification (peptide mass fingerprinting)
– peptide structural information (post source decay)

Ciphergen Protein Chip® System Image credit: Ciphergen

Protein Chip® Processes
SELDI-TOF-MS
Image credit: Ciphergen

Schematic of Protein Chip® Reader
Image credit: Ciphergen

Protein Chip® Array Surfaces
Image credit: Ciphergen

A sample proteomic profile
Image credit: Petricoin

Applications of Proteomics
⇒ Disease diagnosis
⇒ Disease prognosis
⇒ Target discovery

A Motivating Application: Optimizing Treatment of Childhood ALL
Image credit: FEER

Childhood ALL
Major subtypes are: T-ALL, E2A-PBX, TEL-AML, MLL genome rearrangements, Hyperdiploid>50, BCR-ABL
Diff subtypes respond differently to same Tx
Over-intensive Tx
– Development of secondary cancers
– Reduction of IQ
Under-intensive Tx
– Relapse
The subtypes look similar
Conventional diagnosis
– Immunophenotyping
– Cytogenetics
– Molecular diagnostics
Unavailable in most ASEAN countries

Single-Test Platform of Microarray & Machine Learning
Image credit: Affymetrix

Impact
Conventional Tx: intermediate intensity to everyone
⇒ 10% suffer relapse
⇒ 50% suffer side effects
⇒ costs US$150m/yr
Our optimized Tx:
high intensity to 10%
intermediate intensity to 40%
low intensity to 50%
costs US$100m/yr
High cure rate of 80%
Less relapse
Fewer side effects
Save US$51.6m/yr

Some Sample Data at star.edu.sg/rp

Kent Ridge Biomedical Data Set Repository
Stores high-dimensional biomedical data sets:
– gene expression data
– protein profiling data
– genomic sequence data
All data are for classification purposes:
– disease diagnosis
– subtype classification
– relapse study
– genomic feature prediction
– ....

Convenient File Formats
Original raw data are in formats diff from the common ones used in machine learning software
All original raw data files have been converted into plain text data files under the same schema, where every row in the new file is a comma-separated string, like a vector, representing a labelled data sample
Re-formatted data files have extension .data; feature names are saved in a separate file w/ extension .names. Equivalent .arff format is also provided
Such transformed data files can be directly fed to common machine learning s/w packages such as C5.0, MLC++, & WEKA
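
A hedged Python sketch of reading rows in this comma-separated layout. The slide does not say where the class label sits in each row, so the code assumes it is the last field; that assumption, the function name, and the two sample rows are all hypothetical and should be checked against the actual .data and .names files.

```python
import csv
import io


def load_samples(text: str):
    """Parse comma-separated rows into (feature vectors, labels).

    Assumption: the class label is the LAST field of each row.
    """
    features, labels = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines
            continue
        *values, label = [field.strip() for field in row]
        features.append([float(value) for value in values])
        labels.append(label)
    return features, labels


# Made-up rows for illustration; not taken from any real data set.
sample = "1.5,-0.2,relapse\n0.3,2.0,non-relapse\n"
features, labels = load_samples(sample)
print(features)  # [[1.5, -0.2], [0.3, 2.0]]
print(labels)    # ['relapse', 'non-relapse']
```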

Breast Cancer Outcome Prediction Gene Expression Data
Van't Veer et al., Nature 415: , 2002
Training set contains 78 patient samples
– 34 patients develop distant metastases in 5 yrs
– 44 patients remain healthy from the disease after initial diagnosis for >5 yrs
Testing set contains 12 relapse & 7 non-relapse samples
No of genes is
Image credit: Veer

CNS Embryonal Tumour Outcome Prediction Gene Expression Data
Pomeroy et al., Nature 415: , 2002
Survivors are patients who are alive after treatment, while the failures are those who succumb to their disease
60 patient samples
– 21 are survivors
– 39 are failures
There are 7129 genes in the dataset
Image credit: Pomeroy

Ovarian Cancer Diagnosis Proteomic Profiling Data
Petricoin et al., Lancet 359: ,
serum samples of proteomic spectra generated by mass spec
– 91 controls (Normal)
– 162 ovarian cancers
Each sample contains M/Z identities
Image credit: Petricoin

Lung Cancer Diagnosis Gene Expression Data
Gordon et al., Cancer Res 62: , 2002
Lung malignant pleural mesothelioma (MPM) vs adenocarcinoma (ADCA)
149 testing samples
– 15 MPM
– 134 ADCA
32 training samples
– 16 MPM
– 16 ADCA
Each sample described by genes
Image credit: Gordon

Translation Initiation Site Prediction Data
Liu & Wong, JBCB 1: , 2003
Used to find Translation Initiation Site (TIS), where translation from mRNA to protein initiates
seq in raw data
– 3312 true ATG
– 10063 false ATG
927 features
Image credit: Liu

Childhood Acute Lymphoblastic Leukemia Gene Expression Data
Yeoh et al., Cancer Cell 1: , 2002
Classifying 6 subtypes of childhood acute lymphoblastic leukemia
215 training samples
112 testing samples
Each sample has expression value of genes
Image credit: Yeoh

Any Questions?

Acknowledgements
Many slides presented in this part of the tutorial are derived from a course jointly taught at NUS SOC with Ken Sung