Patrick Kemmeren Using EP:NG.


1 Using EP:NG
Patrick Kemmeren
http://www.ebi.ac.uk/expressionprofiler

2 Aims
Demonstrate the use of the following components in EP:NG:
– Data Selection
– Data Transformation
– Missing Value Imputation
– Principal Components Analysis
– Hierarchical Clustering
– Clustering Comparison
for the purposes of microarray data analysis (here, tumor line classification), by examining the following paper:
M. Crescenzi and A. Giuliani, The main biological determinants of tumor line taxonomy elucidated by a principal component analysis of microarray data, FEBS Letters 507 (2001) 114-118.

3 Overview
PCA: Principal Component Analysis
– Unsupervised approach
– Reduces the complexity of the data by reducing its dimensionality
– Computes a new, smaller set of uncorrelated variables that best represent the original data
Orientation by Genes
– Genes are the statistical variables; cell-line samples are the statistical observations
– Covariance matrix: records the covariance of each gene with every other gene, computed over the samples

4 PCA (cont.)
Principal Components
– A principal component is a mathematical entity, computed from the data, equivalent to a characteristic vector (eigenvector) of the covariance matrix
– In other words, PCA rotates the original coordinate axes so that they align with the directions of maximum variance in the scatter of points
Summary
– Eigenvalues are the characteristic values associated with the principal components: the higher the eigenvalue, the more of the dataset's variability that component describes
– The first few components can thus describe a large proportion of the variance in the data
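The definition above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not EP:NG's own implementation; the function name `pca` and the toy matrix are assumptions made for the example.

```python
import numpy as np

def pca(data, n_components):
    """PCA by eigendecomposition of the covariance matrix.

    data: 2-D array, rows = observations (e.g. cell lines),
          columns = variables (e.g. genes).
    """
    centered = data - data.mean(axis=0)          # centre each variable
    cov = np.cov(centered, rowvar=False)         # variable-by-variable covariance
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs                  # observations in component space
    explained = eigvals / np.trace(cov)          # fraction of total variance
    return eigvals, eigvecs, scores, explained

# Synthetic stand-in for an expression matrix: 60 samples, 8 variables
# driven by 2 hidden factors plus a little noise (made-up data).
rng = np.random.default_rng(0)
latent = rng.normal(size=(60, 2))
mixing = rng.normal(size=(2, 8))
toy = latent @ mixing + 0.05 * rng.normal(size=(60, 8))

eigvals, eigvecs, scores, explained = pca(toy, 2)
print("variance explained by each component:", explained)
```

Because the toy data has only two underlying factors, two components account for almost all of its variance, which is the behaviour the Summary bullets describe.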

5 Materials and Methods
Data
– http://discover.nci.nih.gov/nature2000/data/selected_data/t_matrix1375.txt
– cDNA from 60 cancer cell lines, hybridized to ~8000 individual gene cDNAs
– T-matrix: 1416 variables, corresponding to selected genes of highest variance (1375); values are log ratios between the gene expression level and a reference mixture
Strategy
– Perform PCA
– Select the components that explain the most variance
– Project the cell lines into component space and cluster them there
– Choose K (the number of clusters) by visual observation and clustering comparison

6 Component: Data Upload
“Provide URL”
– The data matrix URL: http://discover.nci.nih.gov/nature2000/data/selected_data/t_matrix1375.txt
– Data Format: number of annotation columns after column 1 => 3
– Species: Homo sapiens
>> Data Selection

7 Component: Data Selection
Select columns:
– Only the cell-line columns (format XX:Cell Line)
– Filter: .*:.*
>> Data Transformation
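To see what the filter `.*:.*` selects, here is a small sketch using Python's `re` module. The header names are made up for illustration, and matching the pattern against the whole column name is an assumption about how the filter is applied.

```python
import re

def select_columns(header, pattern):
    """Return the indices of the columns whose name matches the pattern."""
    rx = re.compile(pattern)
    return [i for i, name in enumerate(header) if rx.fullmatch(name)]

# Hypothetical header: one ID column, three annotation columns,
# then cell-line columns in the "XX:Cell Line" format.
header = ["Gene", "Description", "Clone", "Source",
          "BR:MCF7", "CNS:SNB-19", "LE:K-562"]

selected = select_columns(header, r".*:.*")
print([header[i] for i in selected])
```

Only names containing a colon pass the filter, so the annotation columns are dropped and the cell-line columns are kept.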

8 Component: Data Transformation
KNN imputation
– 10 neighbours
>> Data Transformation - transpose
>> Data Selection
>> Side menu: Ordination
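The imputation and transpose steps can be sketched as follows. This is a simplified row-wise KNN imputation written for illustration; the distance measure (root mean squared difference over the columns both rows have observed) and the plain averaging of neighbours are assumptions, and EP:NG's own routine may differ.

```python
import numpy as np

def knn_impute(matrix, k=10):
    """Fill NaNs in each row with the mean of its k nearest rows."""
    filled = matrix.copy()
    rows_with_gaps = np.where(np.isnan(matrix).any(axis=1))[0]
    for i in rows_with_gaps:
        missing = np.isnan(matrix[i])
        dists = []
        for j in range(matrix.shape[0]):
            if j == i:
                continue
            # Compare only columns observed in both rows.
            shared = ~missing & ~np.isnan(matrix[j])
            if not shared.any():
                continue
            diff = matrix[i, shared] - matrix[j, shared]
            dists.append((float(np.sqrt(np.mean(diff ** 2))), j))
        dists.sort()
        neighbours = np.array([j for _, j in dists[:k]], dtype=int)
        for col in np.where(missing)[0]:
            values = matrix[neighbours, col]
            values = values[~np.isnan(values)]
            if values.size:                      # leave NaN if no neighbour has a value
                filled[i, col] = values.mean()
    return filled

# Toy matrix (made-up numbers): rows = genes, columns = cell lines.
toy = np.array([[1.0, 2.0, 3.0],
                [1.0, 2.0, np.nan],
                [1.1, 2.1, 3.2],
                [9.0, 9.0, 9.0]])

filled = knn_impute(toy, k=2)
transposed = filled.T    # after transposing, rows = cell lines, columns = genes
print(filled)
```

Transposing after imputation puts the cell lines in rows, which is the orientation the later clustering of cell lines works on.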

9 Component: Ordination
Analysis Options:
– Principal Components
– Save 5 eigenvalues
– Output: Graphs of Eigenvalues
– Output: Summary and Eigenvalues, Arrays and Genes Co-ordinates
>> Output Display
– Examine outputs…
Save the row (cell line) co-ordinates, keeping the top 5 eigenvalues, on the local hard drive (using the original column annotations as row annotations here). Import the file into Excel and paste in the original column annotations (the cell lines) as row labels. Save this file as tab-delimited and upload it again.
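If a spreadsheet is not at hand, the relabel-and-save step can also be scripted. The sketch below uses only the Python standard library; the function name, the `CellLine` header, the labels and the co-ordinates are all hypothetical, and the exact layout EP:NG expects on upload should be checked against the Data Upload form.

```python
import csv
import io

def write_coordinates(handle, labels, coords, n_components=5):
    """Write one row per cell line: label, then the first n co-ordinates."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["CellLine"] + [f"Comp{i + 1}" for i in range(n_components)])
    for label, row in zip(labels, coords):
        writer.writerow([label] + [f"{value:.6g}" for value in row[:n_components]])

# Made-up labels and co-ordinates; the sixth value is dropped on output.
labels = ["BR:MCF7", "LE:K-562"]
coords = [[0.5, -1.25, 2.0, 0.0, 3.5, 9.9],
          [-0.75, 0.25, 1.5, -2.0, 0.125, 4.2]]

buffer = io.StringIO()
write_coordinates(buffer, labels, coords)
lines = buffer.getvalue().splitlines()
print("\n".join(lines))
```

Writing to a real file instead of `io.StringIO` only requires passing an open file handle (opened with `newline=""`) as the first argument.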

10 Component: Hierarchical Clustering
Cluster the cell lines (now in the 5-component space)
– Euclidean Distance
– Average Linkage
>> Output Display
– How many clusters can you see?
– Try to zoom in
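For readers who want to see what these two options mean, here is a naive sketch of agglomerative clustering with Euclidean distance and average linkage. It is written for clarity rather than speed, stops at a requested number of clusters instead of building a full tree, and uses made-up 2-D points in place of real component co-ordinates.

```python
import math

def average_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Cluster-to-cluster distance is the mean Euclidean distance
    over all pairs of members (average linkage).
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        total = sum(math.dist(points[a], points[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters.
        _, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy points: two tight pairs and one outlier (made-up values).
points = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
result = average_linkage(points, 3)
print(result)
```

Cutting the merge process at different cluster counts corresponds to cutting the dendrogram at different heights in the Output Display.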

11 Components: K-Groups Clustering, Clustering Comparison
Cluster the tumor cell lines in component space with the K-means algorithm several times… try K=10, K=6, K=5, K=4
Run Clustering Comparison several times
– Which K seems most fitting?
– An automated method for this process is being developed
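The idea of running K-means and then comparing clusterings can be sketched as below. The K-means here is a plain Lloyd's iteration with random initial centres, and the Rand index is used as a simple stand-in agreement score; it is not the method implemented in EP:NG's Clustering Comparison component. The points and the reference grouping are made up.

```python
import math
import random
from itertools import combinations

def kmeans(points, k, seed=0, n_iter=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centres[c]))
                      for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, new_labels) if l == c]
            if members:  # keep the old centre if a cluster is empty
                centres[c] = tuple(sum(x) / len(members) for x in zip(*members))
        if new_labels == labels:
            break
        labels = new_labels
    return labels

def rand_index(a, b):
    """Fraction of point pairs on which two clusterings agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

# Two well-separated toy groups (made-up values).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
reference = [0, 0, 0, 1, 1, 1]

labels = kmeans(points, k=2)
score = rand_index(labels, reference)
print(labels, score)
```

Repeating the run for several values of K and comparing each result against the hierarchical clustering gives a rough, scriptable version of the exercise on this slide.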

12 Obtaining genes strongly correlated with components
From the PCA results screen, import the column (gene) co-ordinates into Excel
– Sort (Ascending) on the first component column (Comp1)
– What are the top genes there?
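The same sort can be done without a spreadsheet. The sketch below reads a tab-delimited table and returns the first few genes after sorting on `Comp1`; the `Gene` column name, the gene identifiers and the values are hypothetical.

```python
import csv
import io

def top_genes(tsv_text, column="Comp1", n=3, ascending=True):
    """Sort gene co-ordinates on one component and return the first n rows."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    rows.sort(key=lambda r: float(r[column]), reverse=not ascending)
    return [(r["Gene"], float(r[column])) for r in rows[:n]]

# Made-up gene co-ordinates in the layout described above.
table = (
    "Gene\tComp1\tComp2\n"
    "GENE_A\t0.12\t0.5\n"
    "GENE_B\t-0.80\t0.1\n"
    "GENE_C\t0.45\t-0.3\n"
    "GENE_D\t-0.05\t0.2\n"
)

lowest = top_genes(table, n=2)
highest = top_genes(table, n=1, ascending=False)
print(lowest, highest)
```

Genes at either end of the sorted list have the largest co-ordinates in absolute value on that component, so it can be worth inspecting the descending sort as well as the ascending one.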

13 Acknowledgements
– Original EP Development: Jaak Vilo (Tartu), Patrick Kemmeren (Utrecht), Misha Kapushesky
– EP:NG Framework Development: Patrick Kemmeren (Utrecht), Misha Kapushesky
– Visualization Components (under development): Steffen Durinck (Leuven)
– Clustering Comparison: Aurora Torrente, Christine Körner (Leipzig)
– PCA/COA/BGA: Aedín Culhane (Cork)
– Gene Ordering: Karlis Freivalds (Riga)
– Normalization (under development): Tom Bogaert (Leuven)
– Discussions: EBI Microarray Informatics Team; contributors from the open source community
EP:NG is an open source project. If you are interested in contributing, testing or just discussing ideas, let us know!

