CSE182 L14 Mass Spec Quantitation MS applications Microarray analysis.

Slides:



Advertisements
Similar presentations
Original Figures for "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"
Advertisements

Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data.
MN-B-C 2 Analysis of High Dimensional (-omics) Data Kay Hofmann – Protein Evolution Group Week 5: Proteomics.
Microarray technology and analysis of gene expression data Hillevi Lindroos.
CSE182 CSE182-L13 Mass Spectrometry Quantitation and other applications.
Principal Component Analysis
L15:Microarray analysis (Classification) The Biological Problem Two conditions that need to be differentiated, (Have different treatments). EX: ALL (Acute.
L16: Micro-array analysis Dimension reduction Unsupervised clustering.
Fa 06CSE182 CSE182-L13 Mass Spectrometry Quantitation and other applications.
L15:Microarray analysis (Classification). The Biological Problem Two conditions that need to be differentiated, (Have different treatments). EX: ALL (Acute.
Microarrays: Tools for Proteomics
Bioinformatics Challenge  Learning in very high dimensions with very few samples  Acute leukemia dataset: 7129 # of gene vs. 72 samples  Colon cancer.
Fa 05CSE182 CSE182-L8 Mass Spectrometry. Fa 05CSE182 Bio. quiz What is a gene? What is a transcript? What is translation? What are microarrays? What is.
PROTEOMICS LECTURE. Genomics DNA (Gene) Functional Genomics TranscriptomicsRNA Proteomics PROTEIN Metabolomics METABOLITE Transcription Translation Enzymatic.
Whole Genome Assembly Microarray analysis. Mate Pairs Mate-pairs allow you to merge islands (contigs) into super-contigs.
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.
Linear Discriminant Functions Chapter 5 (Duda et al.)
Previous Lecture: Regression and Correlation
CSE182-L12 Mass Spectrometry Peptide identification CSE182.
Proteomics Informatics Workshop Part III: Protein Quantitation
Fa 05CSE182 CSE182-L9 Mass Spectrometry Quantitation and other applications.
CSE182 CSE182-L13 Mass Spectrometry Quantitation and other applications.
1 Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data Presented by: Tun-Hsiang Yang.
Whole Genome Expression Analysis
Classification (Supervised Clustering) Naomi Altman Nov '06.
CSE182 L14 Mass Spec Quantitation MS applications Microarray analysis.
The dynamic nature of the proteome
Scenario 6 Distinguishing different types of leukemia to target treatment.
Ch 4. Linear Models for Classification (1/2) Pattern Recognition and Machine Learning, C. M. Bishop, Summarized and revised by Hee-Woong Lim.
ECE 8443 – Pattern Recognition LECTURE 10: HETEROSCEDASTIC LINEAR DISCRIMINANT ANALYSIS AND INDEPENDENT COMPONENT ANALYSIS Objectives: Generalization of.
Microarrays and Gene Expression Analysis. 2 Gene Expression Data Microarray experiments Applications Data analysis Gene Expression Databases.
Quantification of Membrane and Membrane- Bound Proteins in Normal and Malignant Breast Cancer Cells Isolated from the Same Patient with Primary Breast.
Genome of the week - Enterococcus faecalis E. faecalis - urinary tract infections, bacteremia, endocarditis. Organism sequenced is vancomycin resistant.
Sequencing DNA 1. Maxam & Gilbert's method (chemical cleavage) 2. Fred Sanger's method (dideoxy method) 3. AUTOMATED sequencing (dideoxy, using fluorescent.
Non-Bayes classifiers. Linear discriminants, neural networks.
Linear Models for Classification
Class 23, 2001 CBCl/AI MIT Bioinformatics Applications and Feature Selection for SVMs S. Mukherjee.
Classification (slides adapted from Rob Schapire) Eran Segal Weizmann Institute.
Gene expression & Clustering. Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species –Dynamic.
1  The Problem: Consider a two class task with ω 1, ω 2   LINEAR CLASSIFIERS.
CSE182 CSE182-L11 Protein sequencing and Mass Spectrometry.
Multiple flavors of mass analyzers Single MS (peptide fingerprinting): Identifies m/z of peptide only Peptide id’d by comparison to database, of predicted.
METU Informatics Institute Min720 Pattern Classification with Bio-Medical Applications Part 7: Linear and Generalized Discriminant Functions.
CZ5225: Modeling and Simulation in Biology Lecture 7, Microarray Class Classification by Machine learning Methods Prof. Chen Yu Zong Tel:
ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition LECTURE 12: Advanced Discriminant Analysis Objectives:
Classification Course web page: vision.cis.udel.edu/~cv May 14, 2003  Lecture 34.
Case Study: Characterizing Diseased States from Expression/Regulation Data Tuck et al., BMC Bioinformatics, 2006.
Computational Biology Group. Class prediction of tumor samples Supervised Clustering Detection of Subgroups in a Class.
Oct 2011 SDMBT1 Lecture 11 Some quantitation methods with LC-MS a.ICAT b.iTRAQ c.Proteolytic 18 O labelling d.SILAC e.AQUA f.Label Free quantitation.
CSE182 L14 Mass Spec Quantitation MS applications Microarray analysis.
Linear Classifiers Dept. Computer Science & Engineering, Shanghai Jiao Tong University.
1 Microarray Clustering. 2 Outline Microarrays Hierarchical Clustering K-Means Clustering Corrupted Cliques Problem CAST Clustering Algorithm.
Giansalvo EXIN Cirrincione unit #4 Single-layer networks They directly compute linear discriminant functions using the TS without need of determining.
Proteomics Informatics (BMSC-GA 4437) Course Directors David Fenyö Kelly Ruggles Beatrix Ueberheide Contact information
Protein quantitation I: Overview (Week 5). Fractionation Digestion LC-MS Lysis MS Sample i Protein j Peptide k Proteomic Bioinformatics – Quantitation.
Linear Discriminant Functions Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis.
EQTLs.
Semi-Supervised Clustering
LECTURE 11: Advanced Discriminant Analysis
Whole Genome Assembly.
Proteomics Informatics David Fenyő
A perspective on proteomics in cell biology
Generally Discriminant Analysis
NoDupe algorithm to detect and group similar mass spectra.
Expression profiling Journal of Allergy and Clinical Immunology
Is Proteomics the New Genomics?
Brandon Ho, Anastasia Baryshnikova, Grant W. Brown  Cell Systems 
Proteomics Informatics David Fenyő
Kuen-Pin Wu Institute of Information Science Academia Sinica
Clustering.
Presentation transcript:

CSE182 L14 Mass Spec Quantitation MS applications Microarray analysis

CSE182 LC-MS Maps time m/z I Peptide 2 Peptide 1 x x x x x x x x x x x x x x time m/z Peptide 2 elution A peptide/feature can be labeled with the triple (M,T,I): –monoisotopic M/Z, centroid retention time, and intensity An LC-MS map is a collection of features

CSE182 Time scaling: Approach 1 (geometric matching) Match features based on M/Z, and (loose) time matching. Objective  f (t 1 -t 2 ) 2 Let t 2 ’ = a t 2 + b. Select a,b so as to minimize  f (t 1 -t’ 2 ) 2

CSE182 Geometric matching Make a graph. Peptide a in LCMS1 is linked to all peptides with identical m/z. Each edge has score proportional to t 1/ t 2 Compute a maximum weight matching. The ratio of times of the matched pairs gives a. Rescale and compute the scaling factor T M/Z

CSE182 Approach 2: Scan alignment Each time scan is a vector of intensities. Two scans in different runs can be scored for similarity (using a dot product) S 11 S 12 S 22 S 21 M(S 1i,S 2j ) =  k S 1i (k) S 2j (k) S 1i = S 2j =

CSE182 Scan Alignment Compute an alignment of the two runs Let W(i,j) be the best scoring alignment of the first i scans in run 1, and first j scans in run 2 Advantage: does not rely on feature detection. Disadvantage: Might not handle affine shifts in time scaling, but is better for local shifts S 11 S 12 S 22 S 21

CSE182 Chemistry based methods for comparing peptides

CSE182 ICAT The reactive group attaches to Cysteine Only Cys-peptides will get tagged The biotin at the other end is used to pull down peptides that contain this tag. The X is either Hydrogen, or Deuterium (Heavy) –Difference = 8Da

CSE182 ICAT ICAT reagent is attached to particular amino-acids (Cys) Affinity purification leads to simplification of complex mixture “diseased” Cell state 1 Cell state 2 “Normal” Label proteins with heavy ICAT Label proteins with light ICAT Combine Fractionate protein prep - membrane - cytosolic Proteolysis Isolate ICAT- labeled peptides Nat. Biotechnol. 17: ,1999

CSE182 Differential analysis using ICAT ICAT pairs at known distance heavy light Time M/Z

CSE182 ICAT issues The tag is heavy, and decreases the dynamic range of the measurements. The tag might break off Only Cysteine containing peptides are retrieved Non-specific binding to strepdavidin

CSE182 Serum ICAT data MA13_02011_02_ALL01Z3I9A* Overview (exhibits ’stack-ups’)

CSE182 Serum ICAT data Instead of pairs, we see entire clusters at 0, +8,+16,+22 ICAT based strategies must clarify ambiguous pairing.

CSE182 ICAT problems Tag is bulky, and can break off. Cys is low abundance MS 2 analysis to identify the peptide is harder.

CSE182 SILAC A novel stable isotope labeling strategy Mammalian cell-lines do not ‘manufacture’ all amino-acids. Where do they come from? Labeled amino-acids are added to amino-acid deficient culture, and are incorporated into all proteins as they are synthesized No chemical labeling or affinity purification is performed. Leucine was used (10% abundance vs 2% for Cys)

CSE182 SILAC vs ICAT Leucine is higher abundance than Cys No affinity tagging done Fragmentation patterns for the two peptides are identical –Identification is easier Ong et al. MCP, 2002

CSE182 Incorporation of Leu-d3 at various time points Doubling time of the cells is 24 hrs. Peptide = VAPEEHPVLLTEAPLNPK What is the charge on the peptide?

CSE182 Quantitation on controlled mixtures

CSE182 Identification MS/MS of differentially labeled peptides

CSE182 Peptide Matching Computational: Under identical Liquid Chromatography conditions, peptides will elute in the same order in two experiments. –These peptides can be paired computationally SILAC/ICAT allow us to compare relative peptide abundances in a single run using an isotope tag.

CSE182 MS quantitation Summary A peptide elutes over a mass range (isotopic peaks), and a time range. A ‘feature’ defines all of the peaks corresponding to a single peptide. Matching features is the critical step to comparing relative intensities of the same peptide in different samples. The matching can be done chemically (isotope tagging), or computationally (LCMS map comparison)

CSE182 Biol. Data analysis: Review Protein Sequence Analysis Sequence Analysis/ DNA signals Gene Finding Assembly

CSE182 Other static analysis is possible Protein Sequence Analysis Sequence Analysis Gene Finding Assembly ncRNA Genomic Analysis/ Pop. Genetics

CSE182 A Static picture of the cell is insufficient Each Cell is continuously active, –Genes are being transcribed into RNA –RNA is translated into proteins –Proteins are PT modified and transported –Proteins perform various cellular functions Can we probe the Cell dynamically? –Which transcripts are active? –Which proteins are active? –Which proteins interact? Gene Regulation Proteomic profiling Transcript profiling

CSE182 Micro-array analysis

CSE182 The Biological Problem Two conditions that need to be differentiated, (Have different treatments). EX: ALL (Acute Lymphocytic Leukemia) & AML (Acute Myelogenous Leukima) Possibly, the set of expressed genes is different in the two conditions

CSE182 Supplementary fig. 2. Expression levels of predictive genes in independent dataset. The expression levels of the 50 genes most highly correlated with the ALL-AML distinction in the initial dataset were determined in the independent dataset. Each row corresponds to a gene, with the columns corresponding to expression levels in different samples. The expression level of each gene in the independent dataset is shown relative to the mean of expression levels for that gene in the initial dataset. Expression levels greater than the mean are shaded in red, and those below the mean are shaded in blue. The scale indicates standard deviations above or below the mean. The top panel shows genes highly expressed in ALL, the bottom panel shows genes more highly expressed in AML.

CSE182 Gene Expression Data Gene Expression data: –Each row corresponds to a gene –Each column corresponds to an expression value Can we separate the experiments into two or more classes? Given a training set of two classes, can we build a classifier that places a new experiment in one of the two classes. g s1s1 s2s2 s

CSE182 Three types of analysis problems Cluster analysis/unsupervised learning Classification into known classes (Supervised) Identification of “marker” genes that characterize different tumor classes

CSE182 Supervised Classification: Basics Consider genes g 1 and g 2 –g 1 is up-regulated in class A, and down-regulated in class B. –g 2 is up-regulated in class A, and down-regulated in class B. Intuitively, g1 and g2 are effective in classifying the two samples. The samples are linearly separable. g1g1 g2g

CSE182 Basics With 3 genes, a plane is used to separate (linearly separable samples). In higher dimensions, a hyperplane is used.

CSE182 Non-linear separability Sometimes, the data is not linearly separable, but can be separated by some other function In general, the linearly separable problem is computationally easier.

CSE182 Formalizing of the classification problem for micro-arrays Each experiment (sample) is a vector of expression values. –By default, all vectors v are column vectors. –v T is the transpose of a vector The genes are the dimension of a vector. Classification problem: Find a surface that will separate the classes v vTvT

CSE182 Formalizing Classification Classification problem: Find a surface (hyperplane) that will separate the classes Given a new sample point, its class is then determined by which side of the surface it lies on. How do we find the hyperplane? How do we find the side that a point lies on? g1g1 g2g

CSE182 Basic geometry What is ||x|| 2 ? What is x/||x|| Dot product? x=(x 1,x 2 ) y

End of L14 CSE182

Dot Product Let  be a unit vector. –||  || = 1 Recall that –  T x = ||x|| cos  What is  T x if x is orthogonal (perpendicular) to  ?  x   T x = ||x|| cos 

CSE182 Hyperplane How can we define a hyperplane L? Find the unit vector that is perpendicular (normal to the hyperplane)

CSE182 Points on the hyperplane Consider a hyperplane L defined by unit vector , and distance  0 Notes; –For all x  L, x T  must be the same, x T  =  0 –For any two points x 1, x 2, (x 1 - x 2 ) T  =0 x1x1 x2x2

CSE182 Hyperplane properties Given an arbitrary point x, what is the distance from x to the plane L? –D(x,L) = (  T x -  0 ) When are points x1 and x2 on different sides of the hyperplane? x 00

CSE182 Separating by a hyperplane Input: A training set of +ve & -ve examples Goal: Find a hyperplane that separates the two classes. Classification: A new point x is +ve if it lies on the +ve side of the hyperplane, -ve otherwise. The hyperplane is represented by the line {x:-  0 +  1 x 1 +  2 x 2 =0} x2x2 x1x1 + -

CSE182 Error in classification An arbitrarily chosen hyperplane might not separate the test. We need to minimize a mis-classification error Error: sum of distances of the misclassified points. Let y i =-1 for +ve example i, –y i =1 otherwise. Other definitions are also possible. x2x2 x1x1 + - 

CSE182 Gradient Descent The function D(  ) defines the error. We follow an iterative refinement. In each step, refine  so the error is reduced. Gradient descent is an approach to such iterative refinement. D(  )  D’(  )

CSE182 Rosenblatt’s perceptron learning algorithm

CSE182 Classification based on perceptron learning Use Rosenblatt’s algorithm to compute the hyperplane L=( ,  0 ). Assign x to class 1 if f(x) >= 0, and to class 2 otherwise.

CSE182 Perceptron learning If many solutions are possible, it does no choose between solutions If data is not linearly separable, it does not terminate, and it is hard to detect. Time of convergence is not well understood

CSE182 Linear Discriminant analysis Provides an alternative approach to classification with a linear function. Project all points, including the means, onto vector . We want to choose  such that –Difference of projected means is large. –Variance within group is small x2x2 x1x1 + - 

CSE182 LDA Cont’d Fisher Criterion

CSE182 Maximum Likelihood discrimination Suppose we knew the distribution of points in each class. –We can compute Pr(x|  i ) for all classes i, and take the maximum

CSE182 ML discrimination Suppose all the points were in 1 dimension, and all classes were normally distributed.

CSE182 ML discrimination recipe We know the distribution for each class, but not the parameters Estimate the mean and variance for each class. For a new point x, compute the discrimination function g i (x) for each class i. Choose argmax i g i (x) as the class for x