STAT 254, Lecture 1: An overview. Cell biology, microarray, statistics.

1

2 STAT 254, Lecture 1: An overview. Cell biology, microarray, statistics. Bioinformatics and Statistics. Topics to cover. Keep a skeptical eye on everything you read or hear. Keep an eye on the bigger picture while working on specifics. The shaping of bioinformatics falls on your shoulders. What to take home: not just microarray or high-throughput data analysis methods, but a set of skills and ways of thinking about quantitative biology.

3 Exploratory data analysis: multivariate, high-dimensional (20 min)

4 Study of Gene Expression: Statistics, Biology, and Microarrays. Ker-Chau Li, Statistics Department, UCLA, kcli@stat.ucla.edu. IMS ENAR Conference. Time: March 31, 2003. Place: Tampa, FL.

5 Outline: Review of cell biology; Microarray gene expression data collection; Cell-cycle gene expression (main data set); PCA/nested regression, SIR (dimension reduction); Similarity analysis: clustering (why popular?); Liquid association; Closing remarks. Slide annotations: New statistical concept, fueled by Stein's lemma; Justification for IMS.

6 PART I. Cellular Biology Macromolecules: DNA, mRNA, protein

7 Why is biology hot? Because of

8 Human Genome Project
Begun in 1990, the U.S. Human Genome Project is a 13-year effort coordinated by the U.S. Department of Energy and the National Institutes of Health. The project originally was planned to last 15 years, but effective resource and technological advances have accelerated the expected completion date to 2003.
Project goals are to
■ identify all the approximately 30,000 genes in human DNA,
■ determine the sequences of the 3 billion chemical base pairs that make up human DNA,
■ store this information in databases,
■ improve tools for data analysis,
■ transfer related technologies to the private sector, and
■ address the ethical, legal, and social issues (ELSI) that may arise from the project.
Recent Milestones:
■ June 2000: completion of a working draft of the entire human genome
■ February 2001: analyses of the working draft are published
Source: Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001

9 Future Challenges: What We Still Don't Know
■ Gene number, exact locations, and functions
■ DNA sequence organization
■ Chromosomal structure and organization
■ Noncoding DNA types, amount, distribution, information content, and functions
■ Interaction of proteins in complex molecular machines
■ Evolutionary conservation among organisms
■ Protein conservation (structure and function)
■ Proteomes (total protein content and function) in organisms
■ Correlation of SNPs (single-base DNA variations among individuals) with health and disease
■ Disease-susceptibility prediction based on gene sequence variation
■ Genes involved in complex traits and multigene diseases
■ Complex systems biology, including microbial consortia useful for environmental restoration
■ Developmental genetics, genomics
■ (1) Predicted vs experimentally determined gene function
■ (2) Gene regulation (upstream regulatory region)
■ (3) Coordination of gene expression, protein synthesis, and post-translational events
Source: Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001

10 Medicine and the New Genomics
Topics: Gene Testing; Gene Therapy; Pharmacogenomics
Anticipated Benefits:
■ improved diagnosis of disease
■ earlier detection of genetic predispositions to disease
■ rational drug design
■ gene therapy and control systems for drugs
■ personalized, custom drugs
Source: Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001

11 Agriculture, Livestock Breeding, and Bioprocessing
Anticipated Benefits:
■ disease-, insect-, and drought-resistant crops
■ healthier, more productive, disease-resistant farm animals
■ more nutritious produce
■ biopesticides
■ edible vaccines incorporated into food products
■ new environmental cleanup uses for plants like tobacco
Source: Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001

12 How does the cell work? The guiding principle is the so-called

16

17 Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001

18 Gene to protein: 4 nucleotides and 20 amino acids. Protein is synthesized from amino acids by the ribosome.

19 Gene to Protein Transcription Translation

20 Transcription and translation
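
The gene-to-protein flow on the slides above (transcription, then translation) can be sketched in a few lines. This is only an illustration: the function names and the example sequence are made up here, the input is assumed to be the coding strand of the DNA, and only a handful of the 64 codons of the standard genetic code are included.

```python
# Partial codon table (standard genetic code); a full table has 64 entries.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the start codon
    "UUU": "F", "UUC": "F",
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",
    "GCU": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "AAA": "K", "AAG": "K",
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}


def transcribe(coding_dna):
    """Return the mRNA matching the coding strand: T is replaced by U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Read codons from the first AUG until a stop codon or the sequence end."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        # "X" marks codons that are missing from the partial table above.
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGTTTGGCAAATAA")  # hypothetical 15-base sequence
protein = translate(mrna)             # one-letter amino acid codes
```

Each group of three nucleotides maps to one amino acid, which is how an alphabet of 4 nucleotides can encode 20 amino acids.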

21 PART II. Microarray Genome-wide expression profiling

22 Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale Joseph L. DeRisi, Vishwanath R. Iyer, Patrick O. Brown*

23 Microarray

24 Microarray
Allows measuring the mRNA level of thousands of genes in one experiment: a system-level response.
The data generation can be fully automated by robots.
Common experimental themes:
– Time course (when)
– Tissue type (where)
– Response (under what conditions)
– Perturbation: mutation/knockout, knock-in, over-expression

25 Reverse transcription. Color: Cy3 (green), Cy5 (red)

26 Example 1: Comparative expression. Normal versus cancer cells; ALL versus AML (5 min). E. Lander's group at MIT.

27 PART III. Statistics: Low-level analysis; Comparative expression; Feature extraction; Clustering/classification; Pearson correlation; Liquid association

28 Issues related to image quality: Convert an image into a number representing the ratio of the levels of expression between the red and green channels. Color bias; spatial, tip, and spot effects; background noise; cDNA vs oligonucleotide arrays (not to be covered).
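
As a rough sketch of the "image to number" step on the slide above, the snippet below turns one spot's red and green intensities into a log ratio. It assumes foreground and background intensities have already been extracted for the spot; the function name, the `floor` safeguard, and the plain background subtraction are illustrative assumptions, not the low-level method used in the lecture, and none of the bias corrections listed on the slide are attempted.

```python
import math


def log_ratio(red_fg, red_bg, green_fg, green_bg, floor=1.0):
    """Background-corrected log2(red / green) for a single spot.

    `floor` is an assumed safeguard: it keeps corrected intensities
    positive so that the logarithm is always defined.
    """
    red = max(red_fg - red_bg, floor)
    green = max(green_fg - green_bg, floor)
    return math.log2(red / green)


# Hypothetical spot: after background removal, red is four times green.
example = log_ratio(red_fg=900.0, red_bg=100.0, green_fg=300.0, green_bg=100.0)
```

A log scale treats up- and down-regulation symmetrically: a fourfold increase and a fourfold decrease give values of the same size with opposite signs.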

29 Genome-wide expression profile: a basic structure
         cond1   cond2   ...   condp
Gene1    x11     x12     ...   x1p
Gene2    x21     x22     ...   x2p
...      ...     ...     ...   ...
Genen    xn1     xn2     ...   xnp

30 Cond1, cond2, ..., condp denote various environmental conditions, time points, cell types, etc. under which mRNA samples are taken. Note: numerous cells are involved. Data quality issues: 1. chip (manufacturer); 2. mRNA sample (user). It is important to have a homogeneous sample so that cellular signals can be amplified. Yeast cell cycle data: ideally all cells are engaged in the same activities (synchronization).

31 Two-class problem: ALL (acute lymphoblastic leukemia) versus AML (acute myeloid leukemia). An application.

32 Which genes to select? For each gene (row), compute a score defined as (sample mean of X - sample mean of Y) divided by (standard deviation of X + standard deviation of Y), where X = ALL and Y = AML. Genes (rows) with the highest scores are selected. Seems to work! Improvement? Of 34 new leukemia samples, 29 are predicted with 100% accuracy; 5 are weak prediction cases. That seems to work well. They have a method
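
The gene-selection score on the slide above is simple enough to write out directly. The sketch below is an illustration only: the function names and the toy expression values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the slide does not say which form is meant.

```python
from statistics import mean, stdev


def class_separation_score(x, y):
    """(mean(x) - mean(y)) / (sd(x) + sd(y)) for one gene's expression values."""
    return (mean(x) - mean(y)) / (stdev(x) + stdev(y))


def rank_genes(expr_x, expr_y):
    """Return gene names ordered by score, highest first."""
    scores = {
        gene: class_separation_score(expr_x[gene], expr_y[gene])
        for gene in expr_x
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: gene -> expression levels across samples of each class.
expr_all = {"geneA": [5.0, 6.0, 7.0], "geneB": [2.0, 2.5, 3.0]}
expr_aml = {"geneA": [1.0, 2.0, 3.0], "geneB": [2.0, 3.0, 2.5]}
ranking = rank_genes(expr_all, expr_aml)
```

As written, the score favors genes that are higher in X (ALL) than in Y (AML); sorting by the most negative scores would pick genes that are higher in AML.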

33 Study of cell-cycle regulated genes. Rate of cell growth and division varies: yeast (120 min); insect egg (15-30 min); nerve cell (no division); fibroblast (healing wounds). Regulation: irregular growth causes cancer. Goal: find which genes are expressed at each stage of the cell cycle. Yeast cells: Spellman et al. (1998). Fourier analysis: cyclic pattern.
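
One generic way to look for a cyclic pattern with Fourier analysis is to project each gene's time course onto a cosine and a sine at an assumed cell-cycle period. The sketch below is a simplified illustration, not the published procedure; the function name, the sampling times, and the 60-minute period are all hypothetical.

```python
import math


def fourier_score(values, times, period):
    """Magnitude and phase of the Fourier component at the given period."""
    omega = 2.0 * math.pi / period
    mean_value = sum(values) / len(values)
    # Project the mean-centered profile onto cosine and sine at frequency omega.
    a = sum((v - mean_value) * math.cos(omega * t) for v, t in zip(values, times))
    b = sum((v - mean_value) * math.sin(omega * t) for v, t in zip(values, times))
    return math.hypot(a, b), math.atan2(b, a)


times = list(range(0, 120, 10))  # hypothetical sampling: every 10 min
cyclic = [math.cos(2.0 * math.pi * t / 60.0) for t in times]  # period 60 min
flat = [1.0] * len(times)  # a gene with no cyclic behavior

cyclic_score, cyclic_phase = fourier_score(cyclic, times, period=60.0)
flat_score, _ = fourier_score(flat, times, period=60.0)
```

A large magnitude suggests the profile oscillates at the chosen period, and the phase indicates when in the cycle the gene peaks, which is the kind of information needed to assign genes to cell-cycle stages.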

34 Yeast Cell Cycle (adapted from Molecular Cell Biology, Darnell et al) Most visible event

35 Example of the time curve. Histone genes: HHT2 (ORF: YNL031C). Time course: histone.

36 EBP2: YKL172W; TSM1: YCR042C; YOR263C

37

38 Why does clustering make sense biologically? Rationale behind massive gene expression analysis: genes with a high degree of expression similarity are likely to be functionally related and may participate in common pathways. They may be co-regulated by common upstream regulatory factors. Simply put, profile similarity implies functional association.

39 Some protein complexes: a protein rarely works as a single unit.

40 Gene profiles and correlation. Pearson's correlation coefficient, a simple way of describing the strength of linear association between a pair of random variables, has become the most popular measure of gene expression similarity. 1. Cluster analysis: average linkage, self-organizing map, K-means, ... 2. Classification: nearest neighbor, linear discriminant analysis, support vector machine, ... 3. Dimension reduction methods: PCA (SVD).
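
For concreteness, here is a small sketch of Pearson's correlation used as a similarity measure between gene expression profiles. The profiles and gene names are made-up toy data, and the helper `most_similar_pair` is a hypothetical illustration of a first step toward clustering, not a method from the lecture.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def most_similar_pair(profiles):
    """Return the pair of gene names whose profiles have the highest correlation."""
    return max(
        combinations(sorted(profiles), 2),
        key=lambda pair: pearson(profiles[pair[0]], profiles[pair[1]]),
    )


# Hypothetical expression profiles over four conditions.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],  # same shape as g1 at a different scale
    "g3": [4.0, 3.0, 2.0, 1.0],  # mirror image of g1
}
best = most_similar_pair(profiles)
```

Because the correlation ignores scale and offset, g1 and g2 count as perfectly similar even though their expression levels differ, which is one reason it suits profile comparison.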

41 The correlation coefficient (CC) has been used by Gauss, Bravais, Edgeworth, ... Its sweeping impact in data analysis is due to Galton (1822-1911), "Typical laws of heredity in man". Karl Pearson modified and popularized its use. It is a building block in multivariate analysis, of which clustering, classification, and dimension reduction are recurrent themes. As a statistician, how can you ignore the time order? (Isn't it true that the use of sample correlation relies on the assumption that the data are i.i.d.?)

42 Other methods for finding gene clusters: Bayesian clustering (normal mixture, hidden indicator); PCA plot, projection pursuit, grand tour; Multidimensional scaling (bi-plot for categorical responses, showing both cases (genes) and variables (different clustering methods), displaying results from many different clustering procedures); Generalized association plot (Chen 2001, Statistica Sinica); PLAID model (Lazzeroni and Owen, Statistica Sinica 2002).

43

44 [Figure panels: 1st PCA direction; 2nd PCA direction; 3rd PCA direction; Eigenvalues]

45 Phase assignment [table residue: gene counts for smooth and non-smooth profiles by phase (S, G1, S/G2, G2/M, M/G1). Values as extracted: smooth 108, 31, 352, 90, 295; non-smooth 103, 27, 255, 239, 90, 165]

46 Book a flight from LA to KEGG, Japan, in less than 10 seconds. [Pathway figure labels: ARG1, ARG2, Glutamate]

47 ARG1 (adapted from KEGG). Given X and Y: compute LA(X,Y|Z) for all Z; rank and find the leading genes. 8th place, negative.
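
The "compute LA(X,Y|Z) for all Z, then rank" step can be sketched as follows. This assumes the three-product-moment form of the liquid association score: X and Y are standardized, Z is replaced by normal scores of its ranks, and the score is the average of the products X*Y*Z (the form that Stein's lemma justifies when Z is standard normal). All function names and the toy data are hypothetical, and ties in Z are not handled.

```python
from statistics import NormalDist, mean, pstdev


def standardize(values):
    """Center to mean 0 and scale to standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def normal_scores(values):
    """Replace values by standard normal quantiles of their ranks (no tie handling)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = NormalDist().inv_cdf(rank / (n + 1))
    return scores


def liquid_association(x, y, z):
    """Average of X*Y*Z after standardizing X, Y and normal-scoring Z."""
    xs, ys, zs = standardize(x), standardize(y), normal_scores(z)
    return mean([a * b * c for a, b, c in zip(xs, ys, zs)])


def rank_mediators(x, y, candidates):
    """Sort candidate genes Z by |LA(X,Y|Z)|, largest first."""
    scored = [(name, liquid_association(x, y, z)) for name, z in candidates.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Toy data: X and Y move in opposite directions when zA is low and together
# when zA is high, so their correlation changes with zA but not with zB.
x = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
y = [-1.0, 1.0, -1.0, 1.0, 1.0, -1.0, 1.0, -1.0]
candidates = {
    "zA": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "zB": [1.0, 8.0, 2.0, 7.0, 3.0, 6.0, 4.0, 5.0],
}
ranked = rank_mediators(x, y, candidates)
```

A positive score means X and Y become more positively correlated as Z increases, and a negative score means the opposite, so genes are ranked by absolute value and the sign is reported alongside the rank.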

48 Coverage of bioinformatics by areas | topics: Sequence analysis; Microarray; Linkage, pedigree; DNA; RNA; Protein; EST; Drug; Evolution; Promoter; 3-D structure; Functional prediction; Pathway discovery; System modeling; SNP; Alternative splicing; Motif; Domain; Drug-gene-protein; Protein-protein; TRANSFAC; Protein-gene

49 Coverage of bioinformatics by expertise (hat, not person): Biologist; Computer scientist; Statistician/mathematician; (huge data volume); (raw data provider); Literature searching; Make researcher's life easier (pipeline); Data cleaning; Data mining (bio-information distilling / bio-data refining); Web page browsing; Pattern searching/comparison; Physical/math/prob/stat models, computer optimization; Gene Ontology; Database/visualization; Oil refining (crude oil); Generalization/inference (noise, garbage, or ignorance?)

50 Current, Next; mRNA; protein kinase; Nutrients (carbon, nitrogen sources); Temperature; Water; ATP, GTP, cAMP, etc.; localization; DNA methylation, chromatin structure; Math. modeling: a nightmare; FITNESS FUNCTION; mRNA; Cytoplasm, Nucleus, Mitochondria, Vacuolar; Observed, hidden; Statistical methods become useful

51 Bioinformatics (knowledge integration center): When, Where, Who, What, Why. Cell level; Organ level; Organism level; Species level; Ecology system level.

52 Want to get a quick start? Special issue on bioinformatics: Statistica Sinica, January 2002. My paper on liquid association: "Genome-wide co-expression dynamics: theory and application", PNAS 2002, 99, 16875-16880. Classification: Biological Science, Genetics; Physical Science, Statistics.

53 END

54 Cautionary notes for seriation and row-column sorting. Hierarchical clustering is popular, but: sharp boundaries may be artifacts due to "clever" permutation; how to fine-tune user-specified parameters needs some theoretical guidance; what is a cluster? Criteria are needed.

55 Popular methods for clustering/data mining. Linkage: Eisen et al., Alon et al. K-means: Tavazoie et al. Self-organizing map: Tamayo et al. SVD: Holter et al.; Alter, Brown, Botstein.

56 Can statisticians take the lead? Difficult, but not impossible. The key: willingness to learn more biology. February 2002: talk at UCLA Biochemistry, feedback from David Eisenberg. March 2002: David gave an inspiring review talk about several of his works (Nature, similarity).

