STAT 254, Lecture 1: An overview

STAT 254, Lecture 1. An overview: cell biology, microarray, statistics. Bioinformatics and statistics. Topics to cover. Keep a skeptical eye on everything you read or hear. Keep an eye on the bigger picture while working on specifics. The shaping of bioinformatics falls on your shoulders. What to take home: not just microarray or high-throughput data analysis methods, but a set of skills and ways of thinking about quantitative biology.

Exploratory data analysis: multivariate, high-dimensional (20 min)

Study of Gene Expression: Statistics, Biology, and Microarrays. Ker-Chau Li, Statistics Department, UCLA. IMS ENAR Conference. Time: March 31, 2003. Place: Tampa, FL.

Outline
■ Review of cell biology
■ Microarray gene expression data collection
■ Cell-cycle gene expression (main data set)
■ PCA/nested regression; SIR (dimension reduction)
■ Similarity analysis: clustering (why popular?)
■ Liquid association
■ Closing remarks
Notes: new statistical concept, fueled by Stein's lemma; justification for IMS.

PART I. Cellular Biology. Macromolecules: DNA, mRNA, protein

Why is biology hot? Because of

Human Genome Project
Begun in 1990, the U.S. Human Genome Project is a 13-year effort coordinated by the U.S. Department of Energy and the National Institutes of Health. The project originally was planned to last 15 years, but effective resource and technological advances have accelerated the expected completion date.
Project goals are to
■ identify all the approximately 30,000 genes in human DNA,
■ determine the sequences of the 3 billion chemical base pairs that make up human DNA,
■ store this information in databases,
■ improve tools for data analysis,
■ transfer related technologies to the private sector, and
■ address the ethical, legal, and social issues (ELSI) that may arise from the project.
Recent milestones:
■ June 2000: completion of a working draft of the entire human genome
■ February 2001: analyses of the working draft are published
Source: Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001

Future Challenges: What We Still Don't Know
■ Gene number, exact locations, and functions
■ DNA sequence organization
■ Chromosomal structure and organization
■ Noncoding DNA types, amount, distribution, information content, and functions
■ Interaction of proteins in complex molecular machines
■ Evolutionary conservation among organisms
■ Protein conservation (structure and function)
■ Proteomes (total protein content and function) in organisms
■ Correlation of SNPs (single-base DNA variations among individuals) with health and disease
■ Disease-susceptibility prediction based on gene sequence variation
■ Genes involved in complex traits and multigene diseases
■ Complex systems biology, including microbial consortia useful for environmental restoration
■ Developmental genetics, genomics
■ Predicted vs experimentally determined gene function
■ Gene regulation (upstream regulatory region)
■ Coordination of gene expression, protein synthesis, and post-translational events
Source: Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001

Medicine and the New Genomics: Anticipated Benefits
Gene testing. Gene therapy. Pharmacogenomics.
■ improved diagnosis of disease
■ earlier detection of genetic predispositions to disease
■ rational drug design
■ gene therapy and control systems for drugs
■ personalized, custom drugs
Source: Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001

Agriculture, Livestock Breeding, and Bioprocessing: Anticipated Benefits
■ disease-, insect-, and drought-resistant crops
■ healthier, more productive, disease-resistant farm animals
■ more nutritious produce
■ biopesticides
■ edible vaccines incorporated into food products
■ new environmental cleanup uses for plants like tobacco
Source: Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001

How does the cell work? The guiding principle is the so-called

Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001

Gene to protein. 4 nucleotides and 20 amino acids. Protein is synthesized from amino acids by the ribosome.

Gene to protein: transcription, translation

Transcription and translation
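The two steps named on these slides can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the codon table is deliberately partial (the full genetic code has 4^3 = 64 codons for 20 amino acids plus stop signals), and the DNA string, function names, and the `"???"` marker for codons missing from the partial table are all made up for the example.

```python
# Partial codon table for illustration only; the full genetic code has 64 codons.
CODON_TABLE = {
    "AUG": "Met",  # also the start codon
    "UUU": "Phe", "UUC": "Phe",
    "GGU": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
    "GCU": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "AAA": "Lys", "AAG": "Lys",
    "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
}


def transcribe(coding_strand):
    """Transcription: the mRNA matches the DNA coding strand with T replaced by U."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translation: read triplets from the first AUG until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return []
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE.get(mrna[i:i + 3], "???")  # "???" = not in the partial table
        if residue == "STOP":
            break
        peptide.append(residue)
    return peptide


dna = "ATGTTTGGCGCAAAATAA"   # hypothetical coding-strand fragment
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna, protein)
```

A microarray measures the intermediate product of this chain, the mRNA level, which is why expression data are only an indirect view of protein activity.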

PART II. Microarray. Genome-wide expression profiling

Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale Joseph L. DeRisi, Vishwanath R. Iyer, Patrick O. Brown*

Microarray

Microarray
■ Allows measuring the mRNA level of thousands of genes in one experiment: a system-level response
■ Data generation can be fully automated by robots
■ Common experimental themes:
– Time course (when)
– Tissue type (where)
– Response (under what conditions)
– Perturbation: mutation/knockout, knock-in, over-expression

Reverse transcription. Colors: Cy3 (green), Cy5 (red)

Example 1: Comparative expression. Normal versus cancer cells; ALL versus AML. E. Lander's group at MIT. (5 min)

PART III. Statistics: low-level analysis; comparative expression; feature extraction; clustering/classification; Pearson correlation; liquid association

Issues related to image quality
■ Convert an image into a number representing the ratio of the expression levels between the red and green channels
■ Color bias
■ Spatial, tip, and spot effects
■ Background noise
■ cDNA and oligonucleotide arrays (not to be covered)
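As a rough illustration of the image-to-number step, the sketch below turns per-spot red and green intensities into a background-corrected log2 ratio and then removes a global color bias by median centering. The slide does not prescribe a procedure; the background subtraction, the `floor` guard against non-positive values, the median centering, and the intensity values are all assumptions chosen for the example.

```python
import math


def log_ratio(red_fg, red_bg, green_fg, green_bg, floor=1.0):
    """log2(red/green) for one spot after subtracting local background.

    `floor` (an assumed guard) keeps corrected intensities positive.
    """
    red = max(red_fg - red_bg, floor)
    green = max(green_fg - green_bg, floor)
    return math.log2(red / green)


def median_center(log_ratios):
    """Subtract the array-wide median log ratio, a crude fix for global color bias."""
    ordered = sorted(log_ratios)
    n = len(ordered)
    mid = n // 2
    median = ordered[mid] if n % 2 else 0.5 * (ordered[mid - 1] + ordered[mid])
    return [value - median for value in log_ratios]


# Hypothetical spots: (red foreground, red background, green foreground, green background)
spots = [
    (1200.0, 200.0, 700.0, 200.0),
    (900.0, 100.0, 500.0, 100.0),
    (300.0, 100.0, 900.0, 100.0),
]
raw = [log_ratio(*spot) for spot in spots]
centered = median_center(raw)
print(raw, centered)
```

Spatial, tip, and spot effects would need position-aware corrections that this sketch does not attempt.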

Genome-wide expression profile: a basic structure

         cond1   cond2   ...   condp
Gene1    x11     x12     ...   x1p
Gene2    x21     x22     ...   x2p
...      ...     ...     ...   ...
Genen    xn1     xn2     ...   xnp

Cond1, cond2, …, condp denote the various environmental conditions, time points, cell types, etc. under which mRNA samples are taken. Note: numerous cells are involved. Data quality issues: 1. chip (manufacturer); 2. mRNA sample (user). It is important to have a homogeneous sample so that cellular signals can be amplified. Yeast cell-cycle data: ideally all cells are engaged in the same activities (synchronization).

An application: a two-class problem. ALL (acute lymphoblastic leukemia) versus AML (acute myeloid leukemia).

Which genes to select? For each gene (row), compute a score defined by (sample mean of X − sample mean of Y) / (standard deviation of X + standard deviation of Y), where X = ALL and Y = AML. Genes (rows) with the highest scores are selected. Seems to work! On 34 new leukemia samples, 29 are predicted with 100% accuracy; 5 are weak-prediction cases. That seems to work well. They have a method. Improvement?
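The score on this slide is simple enough to compute directly. The sketch below is one reading of it, assuming the sample standard deviation with the n − 1 denominator (the slide does not say which) and using invented toy expression values; the function names are hypothetical.

```python
from statistics import mean, stdev


def class_separation_score(x, y):
    """(mean(x) - mean(y)) / (sd(x) + sd(y)) for one gene; x = ALL, y = AML."""
    return (mean(x) - mean(y)) / (stdev(x) + stdev(y))


def top_genes(expr_all, expr_aml, k):
    """Return indices of the k highest-scoring genes and the full score list."""
    scores = [class_separation_score(x, y) for x, y in zip(expr_all, expr_aml)]
    order = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    return order[:k], scores


# Toy data (invented): rows are genes, columns are samples within a class.
expr_all = [[5.0, 6.0, 7.0], [2.0, 2.5, 3.0], [1.0, 2.0, 3.0]]
expr_aml = [[1.0, 2.0, 3.0], [2.0, 3.0, 2.5], [6.0, 7.0, 8.0]]

selected, scores = top_genes(expr_all, expr_aml, k=1)
print(selected, scores)
```

Sorting on the signed score picks genes that are high in ALL relative to AML; ranking on the absolute value instead would also pick up genes that are high in AML, which may be closer to what a classifier needs.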

Study of cell-cycle regulated genes. The rate of cell growth and division varies: yeast (120 min); insect egg (15-30 min); nerve cell (none); fibroblast (healing wounds). Regulation: irregular growth causes cancer. Goal: find which genes are expressed at each stage of the cell cycle. Yeast cells: Spellman et al (2000). Fourier analysis: cyclic pattern.
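One simple way to look for a cyclic pattern is to project each gene's time course onto a sine and a cosine at the cell-cycle frequency and use the magnitude as a periodicity score. The sketch below illustrates that idea only; it is not a reproduction of the published analysis, and the sampling times, the 120-minute period, and the synthetic profiles are assumptions made for the example.

```python
import math


def periodicity_score(profile, times, period):
    """Magnitude and peak time of the Fourier component at the given period."""
    omega = 2.0 * math.pi / period
    center = sum(profile) / len(profile)
    a = sum((v - center) * math.cos(omega * t) for v, t in zip(profile, times))
    b = sum((v - center) * math.sin(omega * t) for v, t in zip(profile, times))
    amplitude = math.hypot(a, b)
    # a*cos(wt) + b*sin(wt) = R*cos(wt - phi), which peaks at t = phi / w
    peak_time = (math.atan2(b, a) / omega) % period
    return amplitude, peak_time


times = list(range(0, 240, 10))   # hypothetical sampling every 10 min, two cycles
period = 120.0                    # assumed cycle length in minutes

cyclic = [math.cos(2.0 * math.pi * (t - 30) / period) for t in times]  # peaks at 30 min
flat = [1.0 for _ in times]                                            # no cyclic signal

amp_cyclic, peak_cyclic = periodicity_score(cyclic, times, period)
amp_flat, _ = periodicity_score(flat, times, period)
print(amp_cyclic, peak_cyclic, amp_flat)
```

Ranking genes by amplitude gives candidate cell-cycle regulated genes, and the peak time suggests the phase in which each one is most expressed.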

Yeast Cell Cycle (adapted from Molecular Cell Biology, Darnell et al) Most visible event

Example of a time curve. Histone gene HHT2 (ORF: YNL031C). Time course: histone.

EBP2: YKL172W TSM1: YCR042C YOR263C

Why does clustering make sense biologically? Rationale behind massive gene expression analysis: genes with a high degree of expression similarity are likely to be functionally related and may participate in common pathways. They may be co-regulated by common upstream regulatory factors. Simply put, profile similarity implies functional association.

Some protein complexes. Proteins rarely work as single units.

Gene profiles and correlation. Pearson's correlation coefficient, a simple way of describing the strength of linear association between a pair of random variables, has become the most popular measure of gene expression similarity. 1. Cluster analysis: average linkage, self-organizing map, K-means, … 2. Classification: nearest neighbor, linear discriminant analysis, support vector machine, … 3. Dimension reduction methods: PCA (SVD).
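To make the similarity measure concrete, the sketch below computes Pearson's correlation between expression profiles and ranks genes by similarity to a query gene. The gene names and values are invented, and `most_similar` is a hypothetical helper rather than anything from the lecture.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def most_similar(profiles, query):
    """Rank all other genes by correlation with the query gene, highest first."""
    ranked = [
        (name, pearson(profiles[query], profile))
        for name, profile in profiles.items()
        if name != query
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Toy profiles (invented): one row per gene, one value per condition.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [3.0, 5.0, 7.0, 9.0],   # same shape as geneA, different scale
    "geneC": [4.0, 3.0, 2.0, 1.0],   # mirror image of geneA
}
ranked = most_similar(profiles, "geneA")
print(ranked)
```

Because the coefficient ignores scale and offset, geneB counts as perfectly similar to geneA even though its values are larger. It also ignores the order of the conditions, so shuffling the time points of both profiles in the same way leaves the coefficient unchanged.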

The correlation coefficient (CC) has been used by Gauss, Bravais, Edgeworth, … Its sweeping impact in data analysis is due to Galton, “Typical laws of heredity in man”. Karl Pearson modified and popularized its use. It is a building block in multivariate analysis, of which clustering, classification, and dimension reduction are recurrent themes. As a statistician, how can you ignore the time order? (Isn't it true that the use of sample correlation relies on the assumption that the data are i.i.d.?)

Other methods for finding gene clusters. Bayesian clustering: normal mixture, (hidden) indicator. PCA plot, projection pursuit, grand tour. Multidimensional scaling (bi-plot for categorical responses, showing both cases (genes) and variables (different clustering methods), displaying results from many different clustering procedures). Generalized association plot (Chen 2001, Statistica Sinica). PLAID model (Lazzeroni and Owen 2002, Statistica Sinica).

[Figure: 1st, 2nd, and 3rd PCA directions; eigenvalues]
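For readers who want to reproduce this kind of plot, here is a minimal sketch of PCA via the SVD. It assumes NumPy is available and treats genes as observations and conditions as variables, so each direction is a pattern across conditions; that orientation, the function name, and the synthetic data are assumptions, not details taken from the lecture.

```python
import numpy as np


def pca_svd(expr, n_components=3):
    """PCA of a genes x conditions matrix via SVD of the column-centered data."""
    centered = expr - expr.mean(axis=0, keepdims=True)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    eigenvalues = s ** 2 / (expr.shape[0] - 1)   # variance along each direction
    directions = vt[:n_components]               # rows: directions in condition space
    scores = centered @ directions.T             # coordinates of each gene
    return directions, eigenvalues, scores


# Synthetic data (invented): 50 genes, 6 time points, one dominant cyclic pattern.
rng = np.random.default_rng(0)
times = np.arange(6)
base = np.sin(2.0 * np.pi * times / 6.0)
expr = rng.normal(size=(50, 1)) * base + 0.1 * rng.normal(size=(50, 6))

directions, eigenvalues, scores = pca_svd(expr)
print(eigenvalues)
```

The eigenvalues are the variances along each direction, so a sharp drop after the first few suggests that a low-dimensional summary captures most of the variation.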

[Figure: phase assignment for smooth and non-smooth profiles; phases S, G1, S/G2, G2/M, M/G]

ARG1 ARG2 Book a flight from LA to KEGG, JAPAN in less than 10 seconds Glutamate

ARG1 (adapted from KEGG). Given X and Y, compute LA(X,Y|Z) for all Z; rank and find the leading genes. 8th place, negative.
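The ranking step can be sketched as follows. The slides do not spell out the estimator, so this follows my understanding of the liquid association paper cited in the closing slides: each profile is replaced by its normal scores and standardized, and LA(X,Y|Z) is estimated by the average of the product X·Y·Z, which Stein's lemma justifies when Z is standard normal. The function names, the simulated data, and the random seed are assumptions made for illustration.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def normal_scores(values):
    """Replace each value by the standard normal quantile of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = _STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def standardize(values):
    """Center to mean 0 and scale to variance 1."""
    n = len(values)
    m = sum(values) / n
    sd = (sum((v - m) ** 2 for v in values) / n) ** 0.5
    return [(v - m) / sd for v in values]


def liquid_association(x, y, z):
    """Estimate LA(X,Y|Z) as the mean of X*Y*Z on transformed profiles."""
    xs, ys, zs = (standardize(normal_scores(v)) for v in (x, y, z))
    return sum(a * b * c for a, b, c in zip(xs, ys, zs)) / len(xs)


def rank_mediators(x, y, candidates):
    """Rank candidate genes Z by |LA(X,Y|Z)|, largest first."""
    scored = [(name, liquid_association(x, y, z)) for name, z in candidates.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Simulated profiles (invented): X and Y are positively related when Z is high
# and negatively related when Z is low; W is unrelated to the pair.
rng = random.Random(0)
n = 500
z = [rng.gauss(0, 1) for _ in range(n)]
w = [rng.gauss(0, 1) for _ in range(n)]
x = [rng.gauss(0, 1) for _ in range(n)]
y = [(xi if zi > 0 else -xi) + 0.3 * rng.gauss(0, 1) for xi, zi in zip(x, z)]

ranked = rank_mediators(x, y, {"Z_mediator": z, "W_unrelated": w})
print(ranked)
```

In this simulation the overall correlation between X and Y is close to zero, so a correlation-based search would miss the pair, while the liquid association score singles out Z as the gene whose level switches their relationship.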

Coverage of bioinformatics by areas | topics: sequence analysis, microarray, linkage, pedigree, DNA, RNA, protein, EST, drug, evolution, promoter, 3-D structure, functional prediction, pathway discovery, system modeling, SNP, alternative splicing, motif, domain, drug-gene-protein, protein-protein, TRANSFAC, protein-gene

Coverage of bioinformatics by expertise (hat, not person) Biologist Computer scientist Statistician/mathematician (huge data volume) (raw data provider) Literature searching Make researcher's life easier (pipeline) Data cleaning Data mining (bio-information distilling/bio-data refining) Web page browsing Pattern searching/comparison Physical/math/prob/stat models, computer optimization Gene Ontology Database/visualization Oil refining (crude oil) Generalization/inference (noise, garbage, or ignorance?)

Current Next mRNA protein kinase Nutrients: carbon, nitrogen sources Temperature Water ATP, GTP, cAMP, etc. localization DNA methylation, chromatin structure Math. modeling: a nightmare FITNESS FUNCTION mRNA Cytoplasm Nucleus Mitochondria Vacuolar Observed hidden Statistical methods become useful

Bioinformatics (knowledge integration center) When Where Who What Why Cell level Organ level Organism level Species level Ecology system level

Want to get a quick start? Special issue on bioinformatics: Statistica Sinica, January 2002. My paper on liquid association: “Genome-wide co-expression dynamics: theory and application,” PNAS 2002, 99. Classification: Biological Science, Genetics; Physical Science, Statistics.

END

Cautionary notes for seriation and row-column sorting. Hierarchical clustering is popular, but sharp boundaries may be artifacts due to “clever” permutation. How to fine-tune user-specified parameters needs some theoretical guidance. What is a cluster? Criteria are needed.

Popular methods for clustering/data mining. Linkage: Eisen et al; Alon et al. K-means: Tavazoie et al. Self-organizing map: Tamayo et al. SVD: Holter et al; Alter, Brown, Botstein.

Can statisticians take the lead? Difficult, but not impossible. The key: willingness to learn more biology. February 2002: talk at UCLA Biochemistry, with feedback from David Eisenberg. March 2002: David gave an inspiring review talk about several of his works (Nature, similarity).