Multivariate Statistical Analysis Methods. Ahmed Rebaï, Centre of Biotechnology of Sfax

Basic statistical concepts and tools

Statistics
• Statistics is concerned with the 'optimal' methods of analyzing data generated by some chance mechanism (random phenomena).
• 'Optimal' means an appropriate choice of what is to be computed from the data to carry out the statistical analysis.

Random variables
• A random variable is a numerical quantity that, in an experiment involving some degree of randomness, takes one value from a set of possible values.
• The probability distribution is the set of values that the random variable can take, together with their associated probabilities.

The Normal distribution
• Proposed by Gauss (1777-1855) as the distribution of errors in astronomical observations (error function).
• Arises in many biological processes.
• Limiting distribution of the standardized sum (or mean) of a large number of independent observations with finite variance (central limit theorem).
• Whenever a natural phenomenon is the result of many contributing factors, each making a small contribution, its distribution is approximately Normal.

The Quincunx Bell-shaped distribution

Distribution function
• The distribution function is defined as $F(x) = \Pr(X \le x)$.
• $F$ is called the cumulative distribution function (cdf) and its derivative $f = F'$ the probability density function (pdf) of $X$.
• $\mu$ and $\sigma^2$ are respectively the mean and the variance of the distribution.
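
As a sketch, assuming the distribution meant here is the Normal introduced on the previous slides (the slide text does not say so explicitly), the pdf with mean $\mu$ and variance $\sigma^2$ and its link to the cdf can be written as:

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}}
       \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\},
\qquad
F(x) = \int_{-\infty}^{x} f(t)\,dt .
```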

Moments of a distribution
• The k-th moment is defined as $m_k = E(X^k)$.
• The first moment is the mean $\mu = E(X)$.
• The k-th moment about the mean is $\mu_k = E[(X - \mu)^k]$.
• The second moment about the mean is called the variance, $\sigma^2 = \mu_2$.

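Kurtosis: a useful function of the moments
• Kurtosis (excess): $\gamma_4 = \mu_4 / \mu_2^2 - 3$, where $\mu_k$ is the k-th moment about the mean.
• $\gamma_4 = 0$ for a Normal distribution, so it is a measure of departure from Normality.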

Observations
• Observations $x_i$ are realizations of a random variable $X$.
• The pdf of $X$ can be visualized by a histogram: a graph showing the frequency of observations in classes.

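Estimating moments
• The mean of $X$ is estimated from a set of $n$ observations $(x_1, x_2, \ldots, x_n)$ as $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
• The variance is estimated (without bias) by $\widehat{\mathrm{Var}}(X) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$.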
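
The estimators above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function name `sample_moments` and the data are hypothetical, the variance uses the n - 1 divisor, and the excess kurtosis is the simple plug-in ratio of central moments minus 3.

```python
def sample_moments(xs):
    """Return (mean, unbiased variance, excess kurtosis) of a list of numbers."""
    n = len(xs)
    mean = sum(xs) / n
    # Unbiased variance estimate: divide by n - 1.
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    # Plug-in central moments (divide by n) for the kurtosis ratio.
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    excess_kurtosis = m4 / m2 ** 2 - 3
    return mean, var, excess_kurtosis

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical observations
mean, var, kurt = sample_moments(data)
print(mean, var, kurt)
```

A value of the excess kurtosis near 0 is compatible with Normality; with so few observations the estimate is very noisy.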

The fundamentals of statistics
• Drawing conclusions about a population on the basis of a set of measurements or observations on a sample from that population.
• Descriptive: draw conclusions from summary measures and graphics (data driven).
• Inferential: test hypotheses we have in mind before collecting the data (hypothesis driven).

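What about having many variables?
• Let $X = (X_1, X_2, \ldots, X_p)$ be a set of $p$ variables.
• What is the marginal distribution of each variable $X_i$, and what is their joint distribution?
• If $f(x_1, x_2, \ldots, x_p)$ is the joint pdf, then the marginal pdf of $X_i$ is obtained by integrating out all the other variables: $f_i(x_i) = \int \cdots \int f(x_1, \ldots, x_p) \, dx_1 \cdots dx_{i-1} \, dx_{i+1} \cdots dx_p$.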

Independence
• Variables are said to be independent if $f(x_1, x_2, \ldots, x_p) = f_1(x_1) \, f_2(x_2) \cdots f_p(x_p)$, that is, if the joint pdf is the product of the marginal pdfs.

Covariance and correlation
• Covariance is the joint first moment about the means of two variables, that is $\mathrm{Cov}(X,Y) = E[(X - \mu_X)(Y - \mu_Y)] = E(XY) - E(X)E(Y)$.
• Correlation is a standardized covariance: $\rho = \mathrm{Cov}(X,Y) / (\sigma_X \sigma_Y)$.
• $\rho$ is a number between -1 and +1.

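For example: a bivariate Normal
• Two variables $X$ and $Y$ have a bivariate Normal distribution if their joint pdf is
$f(x,y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \frac{(x-\mu_X)^2}{\sigma_X^2} - \frac{2\rho (x-\mu_X)(y-\mu_Y)}{\sigma_X \sigma_Y} + \frac{(y-\mu_Y)^2}{\sigma_Y^2} \right] \right\}$
• $\rho$ is the correlation between $X$ and $Y$.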

Uncorrelatedness and independence
• If $\rho = 0$ (that is, $\mathrm{Cov}(X,Y) = 0$) we say that the variables are uncorrelated.
• Two independent variables are necessarily uncorrelated.
• The converse does not hold in general: uncorrelated variables need not be independent. It does hold when the joint distribution is bivariate Normal.
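
A small worked example may help to see that zero covariance does not imply independence. The sketch below uses a hypothetical discrete pair, X uniform on {-1, 0, 1} and Y = X², and exact fractions so that no rounding is involved.

```python
from fractions import Fraction

# Hypothetical discrete example: X is uniform on {-1, 0, 1} and Y = X**2.
support = [-1, 0, 1]
p = Fraction(1, 3)  # probability of each value of X

e_x = sum(p * x for x in support)            # E(X)
e_y = sum(p * x ** 2 for x in support)       # E(Y) = E(X^2)
e_xy = sum(p * x * x ** 2 for x in support)  # E(XY) = E(X^3)
cov_xy = e_xy - e_x * e_y                    # Cov(X,Y) = E(XY) - E(X)E(Y)

# Independence would require Pr(X=0, Y=0) == Pr(X=0) * Pr(Y=0).
p_joint = p        # Y = 0 exactly when X = 0
p_product = p * p  # Pr(X=0) * Pr(Y=0) = 1/3 * 1/3

print("Cov(X,Y) =", cov_xy)
print("Pr(X=0,Y=0) =", p_joint, "but Pr(X=0)Pr(Y=0) =", p_product)
```

The covariance is exactly zero, yet Y is a deterministic function of X, so the two variables are clearly dependent.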

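Bivariate Normal
• If $\rho = 0$ then $f(x,y) = \frac{1}{2\pi\sigma_X\sigma_Y} \exp\left\{-\frac{(x-\mu_X)^2}{2\sigma_X^2}\right\} \exp\left\{-\frac{(y-\mu_Y)^2}{2\sigma_Y^2}\right\}$
• So $f(x,y) = f(x) \, f(y)$: the two variables are thus independent.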

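Many variables
• We can calculate the covariance or correlation matrix of $(X_1, X_2, \ldots, X_p)$.
• $C = \mathrm{Var}(X) = (c_{ij})$ with $c_{ij} = \mathrm{Cov}(X_i, X_j)$, so that the diagonal entries $c_{ii}$ are the variances $\mathrm{Var}(X_i)$.
• $C$ is a square ($p \times p$) and symmetric matrix.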
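
As an illustration, the covariance and correlation matrices can be computed directly from a data matrix. The sketch below is plain Python with hypothetical function names and made-up data; in practice a numerical library would normally be used.

```python
def covariance_matrix(rows):
    """Sample covariance matrix of an n x p data matrix given as a list of rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def correlation_matrix(cov):
    """Standardize a covariance matrix: r_ij = c_ij / sqrt(c_ii * c_jj)."""
    p = len(cov)
    return [[cov[i][j] / (cov[i][i] * cov[j][j]) ** 0.5 for j in range(p)]
            for i in range(p)]

# Hypothetical data: 4 observations (rows) of 3 variables (columns).
# The second variable is exactly twice the first.
X = [[1.0, 2.0, 5.0],
     [2.0, 4.0, 3.0],
     [3.0, 6.0, 4.0],
     [4.0, 8.0, 0.0]]
C = covariance_matrix(X)
R = correlation_matrix(C)
for row in R:
    print([round(v, 3) for v in row])
```

Both matrices are p x p and symmetric, and the correlation between the first two variables is 1 because one is a linear function of the other.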

A Short Excursion into Matrix Algebra

What is a matrix?

Operations on matrices Transpose

Properties

Some important properties

Other particular operations

Eigenvalues and Eigenvectors

Singular value decomposition

Multivariate Data

• Data for which each observation consists of values for more than one variable.
• For example: each observation is a measure of the expression level of a gene i in a tissue j.
• Usually displayed as a data matrix.

Biological profile data

The data matrix: n observations (rows) for p variables (columns), an n x p matrix

Contingency tables
• Arise when observations on two categorical variables are cross-classified.
• The entry in each cell is the number of individuals with the corresponding combination of variable values.
Example table: eye colour (rows: Blue, Medium, Dark, Light) by hair colour (columns: Fair, Red, Medium, Dark).
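
The cross-classification described above can be sketched as follows. The individuals listed are hypothetical and only illustrate how cell counts are obtained; they are not the counts of the eye colour by hair colour table.

```python
from collections import Counter

# Hypothetical observations: (eye colour, hair colour) for each individual.
individuals = [
    ("Blue", "Fair"), ("Blue", "Fair"), ("Blue", "Red"),
    ("Dark", "Dark"), ("Dark", "Dark"), ("Dark", "Medium"),
    ("Light", "Fair"), ("Medium", "Medium"),
]

counts = Counter(individuals)  # cell counts keyed by (row, column)
eyes = sorted({e for e, _ in individuals})
hair = sorted({h for _, h in individuals})

# One row per eye colour, one column per hair colour.
table = [[counts[(e, h)] for h in hair] for e in eyes]

print("eye/hair", *hair)
for e, row in zip(eyes, table):
    print(e, *row)
```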

Multivariate data analysis

Exploratory Data Analysis
• Data analysis that emphasizes the use of informal graphical procedures not based on prior assumptions about the structure of the data or on formal models for the data.
• Data = smooth + rough, where the smooth is the underlying regularity or pattern in the data. The objective of EDA is to separate the smooth from the rough with minimal use of formal mathematical or statistical methods.

Reduce dimensionality without losing much information

Overview of the techniques
• Factor analysis
• Principal components analysis
• Correspondence analysis
• Discriminant analysis
• Cluster analysis

Factor analysis
• A procedure that postulates that the correlations between a set of p observed variables arise from the relationship of these variables to a small number k of underlying, unobservable, latent variables, usually known as common factors, where k < p.

Principal components analysis
• A procedure that transforms a set of variables into new ones that are uncorrelated and account for decreasing proportions of the variance in the data.
• The new variables, named principal components (PCs), are linear combinations of the original variables.

PCA
• If the first few PCs account for a large percentage of the variance (say > 70%), then we can display the data in a low-dimensional plot that depicts the original observations quite well.
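
To make the idea concrete, here is a minimal sketch of PCA for the special case of two variables, where the eigenvalues of the 2 x 2 covariance matrix have a closed form. The function name `pca_2d` and the data are hypothetical; for more than two variables a numerical eigen-decomposition or SVD routine would be used instead.

```python
import math

def pca_2d(points):
    """PCA of two variables via the eigen-decomposition of their 2x2 covariance matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    c = sum((y - my) ** 2 for _, y in points) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    half_trace = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    lam1, lam2 = half_trace + radius, half_trace - radius
    # Unit eigenvector for lam1 (direction of the first PC).
    if b != 0:
        vx, vy = b, lam1 - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    pc1 = (vx / norm, vy / norm)
    # Scores: projection of the centred observations on PC1.
    scores = [(x - mx) * pc1[0] + (y - my) * pc1[1] for x, y in points]
    explained = lam1 / (lam1 + lam2)  # proportion of variance carried by PC1
    return pc1, explained, scores

# Hypothetical, strongly correlated observations of two variables.
points = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2), (5.0, 4.8)]
pc1, explained, scores = pca_2d(points)
print("PC1 direction:", pc1)
print("proportion of variance explained by PC1:", round(explained, 4))
```

Here the first PC carries almost all of the variance, so the one-dimensional scores summarize the two original variables with little loss.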

Example

Correspondence Analysis
• A method for displaying relationships between categorical variables in a scatter plot.
• The new factors are combinations of rows and columns.
• A small number of these derived coordinate values (usually two) are then used to allow the table to be displayed graphically.

Example: analysis of codon usage and gene expression in E. coli (McInerny, 1997)
• A gene can be represented by a 59-dimensional vector (universal code).
• A genome consists of hundreds (thousands) of these genes.
• Variation in the variables (RSCU values) might be governed by only a small number of factors.
• For each gene and each codon, calculate RSCU = (observed number of occurrences of the codon) / (number expected if all synonymous codons for that amino acid were used equally).

Codon usage in bacterial genomes

Evidence that not all synonymous codons are used with equal frequency: Fiers et al., 1975, A-protein gene of bacteriophage MS2, Nature 256
UUU Phe  6   UCU Ser  5   UAU Tyr  4   UGU Cys  0
UUC Phe 10   UCC Ser  6   UAC Tyr 12   UGC Cys  3
UUA Leu  8   UCA Ser  8   UAA Ter  *   UGA Ter  *
UUG Leu  6   UCG Ser 10   UAG Ter  *   UGG Trp 12
CUU Leu  6   CCU Pro  5   CAU His  2   CGU Arg  7
CUC Leu  9   CCC Pro  5   CAC His  3   CGC Arg  6
CUA Leu  5   CCA Pro  4   CAA Gln  9   CGA Arg  6
CUG Leu  2   CCG Pro  3   CAG Gln  9   CGG Arg  3
AUU Ile  1   ACU Thr 11   AAU Asn  2   AGU Ser  4
AUC Ile  8   ACC Thr  5   AAC Asn 15   AGC Ser  3
AUA Ile  7   ACA Thr  5   AAA Lys  5   AGA Arg  3
AUG Met  7   ACG Thr  6   AAG Lys  9   AGG Arg  4
GUU Val  8   GCU Ala  6   GAU Asp  8   GGU Gly 15
GUC Val  7   GCC Ala 12   GAC Asp  5   GGC Gly  6
GUA Val  7   GCA Ala  7   GAA Glu  5   GGA Gly  2
GUG Val  9   GCG Ala 10   GAG Glu 12   GGG Gly  5
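
The RSCU ratio used in the codon usage example can be illustrated with the four glycine codons from the MS2 counts above. This is a minimal sketch; the function name `rscu` is hypothetical, and the expected count assumes equal use of all synonymous codons for the amino acid.

```python
# Codon counts for glycine, read from the MS2 A-protein table.
gly_counts = {"GGU": 15, "GGC": 6, "GGA": 2, "GGG": 5}

def rscu(counts):
    """RSCU for one amino acid: observed count / count expected under equal usage."""
    expected = sum(counts.values()) / len(counts)
    return {codon: n / expected for codon, n in counts.items()}

gly_rscu = rscu(gly_counts)
for codon, value in gly_rscu.items():
    print(codon, round(value, 2))
```

An RSCU above 1 (GGU here) marks a codon used more often than expected under equal usage, and a value below 1 marks one used less often.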

Multivariate reduction
• Attempts to reduce a high-dimensional space to a lower-dimensional one; in other words, it tries to simplify the data set.
• Many of the variables might co-vary, so there might be only one, or a few, sources of variation in the data set.
• A gene can be represented by a 59-dimensional vector (universal code).
• A genome consists of hundreds (thousands) of these genes.
• Variation in the variables (RSCU values) might be governed by only a small number of factors.

Plot of the two most important axes (groups labelled in the figure: highly expressed genes, lowly expressed genes, recently acquired genes)

Discriminant analysis
• Techniques that aim to assess whether or not a set of variables distinguishes or discriminates between two or more groups of individuals.
• Linear discriminant analysis (LDA): uses linear functions of the variables (called canonical discriminant functions) giving maximal separation between groups; assumes that the covariance matrices within the groups are the same.
• If they are not, use quadratic discriminant analysis (QDA).
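
For two groups and two variables, the linear discriminant function can be sketched with Fisher's rule, w = S⁻¹(mean_A - mean_B), where S is the pooled within-group covariance matrix. The code below is an illustrative sketch under that assumption, with hypothetical function names and data, and a cut-off at the midpoint of the group means (which corresponds to equal prior probabilities).

```python
def mean_vec(rows):
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

def scatter(rows, m):
    """2x2 within-group sum of squares and cross-products about the mean m."""
    sxx = sum((r[0] - m[0]) ** 2 for r in rows)
    syy = sum((r[1] - m[1]) ** 2 for r in rows)
    sxy = sum((r[0] - m[0]) * (r[1] - m[1]) for r in rows)
    return sxx, sxy, syy

def fisher_lda(group_a, group_b):
    """Return weights w and threshold c; classify x as group A when w.x > c."""
    ma, mb = mean_vec(group_a), mean_vec(group_b)
    axx, axy, ayy = scatter(group_a, ma)
    bxx, bxy, byy = scatter(group_b, mb)
    # Pooled within-group covariance (the equal-covariance assumption of LDA).
    dof = len(group_a) + len(group_b) - 2
    sxx, sxy, syy = (axx + bxx) / dof, (axy + bxy) / dof, (ayy + byy) / dof
    det = sxx * syy - sxy ** 2
    dx, dy = ma[0] - mb[0], ma[1] - mb[1]
    # w = S^{-1} (mean_a - mean_b), using the explicit 2x2 inverse.
    w = [(syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det]
    # Threshold: projection of the midpoint between the two group means.
    c = w[0] * (ma[0] + mb[0]) / 2 + w[1] * (ma[1] + mb[1]) / 2
    return w, c

def classify(x, w, c):
    return "A" if w[0] * x[0] + w[1] * x[1] > c else "B"

# Hypothetical training data: two groups measured on two variables.
group_a = [(4.0, 2.0), (5.0, 3.0), (6.0, 2.5), (5.5, 3.5)]
group_b = [(1.0, 1.0), (2.0, 2.0), (1.5, 0.5), (2.5, 1.5)]
w, c = fisher_lda(group_a, group_b)
print("weights:", w, "threshold:", c)
print(classify((5.0, 2.0), w, c), classify((1.0, 2.0), w, c))
```

QDA would drop the pooling step and keep a separate covariance matrix per group, which makes the boundary quadratic rather than linear.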

Example: internal exon prediction
• Data: a set of exons and non-exons.
• Variables: a set of features:
  - donor/acceptor site recognizers
  - octonucleotide preferences for the coding region
  - octonucleotide preferences for intron interiors on either side

LDA or QDA

Cluster analysis
• A set of methods (hierarchical clustering, K-means clustering, ...) for constructing a sensible and informative classification of an initially unclassified set of data.
• Can be used to cluster individuals or variables.
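
As one concrete instance of the methods named above, K-means can be sketched in plain Python. The function name `kmeans`, the data, and the starting centroids are hypothetical; real analyses usually try several random starts, because the result depends on the initial centroids.

```python
def kmeans(points, centroids, n_iter=20):
    """Plain K-means (Lloyd's algorithm) on 2-D points with given starting centroids."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda k: (p[0] - centroids[k][0]) ** 2
                                    + (p[1] - centroids[k][1]) ** 2)
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids

# Hypothetical data with two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.3)]
labels, centroids = kmeans(points, [(0.0, 0.0), (6.0, 6.0)])
print(labels)
print(centroids)
```

The same function clusters variables instead of individuals if it is given the transposed data, with one point per variable.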

Example: Microarray data

Other Methods
• Independent component analysis (ICA): similar to PCA, but the components are required to be statistically independent and not only uncorrelated; moreover, unlike PCs, they are not constrained to be orthogonal and are defined only up to order and scale.
• Multidimensional scaling (MDS): a technique that constructs a low-dimensional geometrical representation of a distance matrix (classical MDS is also known as principal coordinates analysis).

Useful books: Data analysis

Useful book: R language