Bioinformatics for Stem Cell Lecture 2 Debashis Sahoo, PhD.

Outline: Lecture 1 recap; Multivariate analysis; Microarray data analysis; Boolean analysis; Sequencing data analysis

MULTIVARIATE ANALYSIS

Identify Markers of Human Colon Cancer and Normal Colon (Piero Dalerba, Tomer Kalisky)

Single Cell Analysis of Normal Human Colon Epithelium

Hierarchical Clustering

Cluster 3.0
Distance metric – Euclidean, squared Euclidean, Manhattan, maximum, cosine, Pearson’s correlation
Linkage – Single, complete, average, median, centroid
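To make the distance and linkage choices concrete, here is a minimal pure-Python sketch of agglomerative clustering with average linkage. It is not Cluster 3.0's implementation; the function names (`pearson_distance`, `euclidean_distance`, `average_linkage`) are hypothetical, and the quadratic search is only meant for small examples.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 - Pearson correlation: 0 for perfectly correlated profiles, 2 for anti-correlated."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def average_linkage(profiles, distance, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Cluster-to-cluster distance is the mean of all pairwise member distances.
    Returns a list of clusters, each a list of row indices into profiles.
    """
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pair_sum = sum(distance(profiles[a], profiles[b])
                               for a in clusters[i] for b in clusters[j])
                d = pair_sum / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Swapping `pearson_distance` for `euclidean_distance` changes which genes group together: correlation ignores the overall expression level and compares only the shape of the profile.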

Multivariate Analysis - PCA (Principal Component Analysis): X = data matrix, V = loading matrix, U = scores matrix

Fundamentals of PCA: Reduces dimensions of the data. PCA uses an orthogonal linear transformation. The first principal component has the largest possible variance. Exploratory tool to uncover unknown trends in the data.
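As an illustration of "the first principal component has the largest possible variance", the sketch below finds that component by power iteration on the sample covariance matrix. This is one simple way to compute it, not necessarily what any particular PCA tool does; the function name is hypothetical, and the code assumes the data are not constant and that the start vector is not orthogonal to the leading component.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (loading, scores, variance) for the first principal component.

    rows: list of samples, each a list of p measurements.
    """
    n, p = len(rows), len(rows[0])
    # Center each variable (column) on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration: repeated multiplication converges to the leading eigenvector.
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Scores are the projections of the centered samples onto the loading vector.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    variance = sum(s * s for s in scores) / (n - 1)
    return v, scores, variance
```

The returned loading vector corresponds to one column of the loading matrix V, and the scores to the matching column of the scores matrix U (up to scaling conventions, which differ between tools).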

PCA Analysis

HIGH-THROUGHPUT DATA ANALYSIS

MICROARRAY ANALYSIS

Microarray: Spotted vs. in situ; two channel vs. one channel; probe vs. probeset vs. gene

Quantile Normalization: Sort the values of each array (#1, #2, #3), average the sorted values across arrays rank by rank to obtain SortedAvg, then set Val(Probe_i) = SortedAvg[Rank(Probe_i)].
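The sort, average, and rank lookup steps can be sketched as follows. This is a minimal version that assumes every array has the same number of probes and breaks ties by position; the function name is hypothetical.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of arrays (one list of probe values per array).

    Implements Val(Probe_i) = SortedAvg[Rank(Probe_i)].
    """
    n_probes = len(arrays[0])
    # Sort each array independently.
    sorted_arrays = [sorted(a) for a in arrays]
    # Average across arrays at each rank.
    sorted_avg = [sum(s[r] for s in sorted_arrays) / len(arrays)
                  for r in range(n_probes)]
    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest value in this array.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = sorted_avg[rank]
        normalized.append(out)
    return normalized
```

After normalization every array has exactly the same set of values; only the assignment of values to probes differs between arrays.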

Invariant Set Normalization (figure: before normalization and after normalization, with the invariant set highlighted)

Good to Check the Image

SAM Two-Class Unpaired
1. Assign experiments to two groups, e.g., in an expression matrix with rows Gene 1 to Gene 6 and columns Exp 1 to Exp 6, assign Experiments 1, 2 and 5 to group A, and Experiments 3, 4 and 6 to group B.
2. Question: Is the mean expression level of a gene in group A significantly different from the mean expression level in group B?

SAM Two-Class Unpaired: Permutation tests
i) For each gene, compute the d-value (analogous to a t-statistic). This is the observed d-value for that gene.
ii) Rank the genes in ascending order of their d-values.
iii) Randomly shuffle the values of the genes between groups A and B, such that the reshuffled groups A and B respectively have the same number of elements as the original groups A and B. Compute the d-value for each randomized gene.
(Figure: Gene 1 under the original grouping, Exp 1 to Exp 6, and under a randomized grouping, Exp 1, Exp 4, Exp 5, Exp 2, Exp 3, Exp 6.)

iv) Rank the permuted d-values of the genes in ascending order.
v) Repeat steps iii) and iv) many times, so that each gene has many randomized d-values corresponding to its rank from the observed (unpermuted) d-value. Take the average of the randomized d-values for each gene. This is the expected d-value of that gene.
vi) Plot the observed d-values vs. the expected d-values.
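Steps i) to vi) can be sketched in a few lines. This is a simplified illustration, not the SAM software itself. Assumptions not stated on the slides: the d-value is computed as the difference in group means divided by a pooled standard error plus a small constant `s0`, the same shuffle of experiment labels is applied to every gene within one permutation, and all function and parameter names are hypothetical.

```python
import math
import random

def d_value(values, group_a, group_b, s0=0.1):
    """t-like statistic for one gene; positive when mean(B) > mean(A).

    s0 is an assumed small constant that keeps d stable for low-variance genes.
    """
    a = [values[i] for i in group_a]
    b = [values[i] for i in group_b]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    ss = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    s = math.sqrt((1 / len(a) + 1 / len(b)) * ss / (len(a) + len(b) - 2))
    return (mean_b - mean_a) / (s + s0)

def observed_vs_expected(matrix, group_a, group_b, n_perm=200, seed=0):
    """Return (observed, expected) d-values, both sorted in ascending order.

    matrix: one row of expression values per gene.
    group_a, group_b: column indices of the experiments in each group.
    """
    rng = random.Random(seed)
    # Steps i) and ii): observed d-values, ranked.
    observed = sorted(d_value(row, group_a, group_b) for row in matrix)
    columns = list(group_a) + list(group_b)
    totals = [0.0] * len(matrix)
    for _ in range(n_perm):
        # Step iii): shuffle experiments between groups, keeping group sizes.
        shuffled = columns[:]
        rng.shuffle(shuffled)
        perm_a = shuffled[:len(group_a)]
        perm_b = shuffled[len(group_a):]
        # Step iv): rank the permuted d-values.
        permuted = sorted(d_value(row, perm_a, perm_b) for row in matrix)
        for rank, d in enumerate(permuted):
            totals[rank] += d
    # Step v): the expected d-value at each rank is the average over permutations.
    expected = [t / n_perm for t in totals]
    return observed, expected
```

Plotting `observed` against `expected` gives the plot from step vi); genes far from the diagonal are the candidates for significance.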

SAM Two-Class Unpaired (plot of observed vs. expected d-values, with the “observed d = expected d” line)
Significant positive genes: mean expression of group B > mean expression of group A.
Significant negative genes: mean expression of group A > mean expression of group B.
The more a gene deviates from the “observed = expected” line, the more likely it is to be significant. Any gene beyond the first gene in the +ve or –ve direction on the x-axis (including the first gene), whose observed exceeds the expected by at least delta, is considered significant.

GenePattern

AutoSOME [Aaron Newman and James Cooper, BMC Bioinformatics, 2010, 11:117]

Gene Set Analysis: Compute enrichment of your gene set in pathways and networks (examples shown: cell cycle, transcription factor, TGF-beta signaling pathway, Wnt-signaling pathway, protein-protein interaction network). Tools: GSEA, DAVID, ToppFun, MSigDB, and STRING
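One common way to score enrichment of a gene set in a pathway is a hypergeometric (one-sided Fisher) test on the overlap. The sketch below shows that test only; it is an assumption that this is a useful stand-in here, since the listed tools each use their own statistics (GSEA, for example, uses a rank-based score). The function name is hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_universe, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) if n_selected genes were drawn at random.

    n_universe: number of genes measured.
    n_pathway:  number of those genes annotated to the pathway.
    n_selected: size of your gene set.
    n_overlap:  genes in both your gene set and the pathway.
    """
    total = comb(n_universe, n_selected)
    upper = min(n_pathway, n_selected)
    tail = sum(comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
               for k in range(n_overlap, upper + 1))
    return tail / total
```

A small p-value means the overlap is larger than expected by chance; when many pathways are tested, the p-values still need a multiple-testing correction.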

BOOLEAN ANALYSIS

Boolean Implication [Sahoo et al. Genome Biology 08]
Analyze pairs of genes. Analyze the four different quadrants. Identify sparse quadrants. Record the Boolean relationships.
– If ACPP high, then GABRB1 low
– If GABRB1 high, then ACPP low
(Figure: scatter plot of ACPP vs. GABRB1; 45,000 Affymetrix microarrays.)
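The quadrant analysis starts from simple counts. The sketch below assumes each gene's expression has already been discretized per array into "low", "high", or "intermediate", and that arrays where either gene is intermediate are skipped; both the labels and the skipping rule are assumptions for illustration, and the function name is hypothetical.

```python
def quadrant_counts(levels_a, levels_b):
    """Count arrays in each of the four (A level, B level) quadrants.

    levels_a, levels_b: per-array labels for genes A and B.
    Arrays with an "intermediate" label for either gene are ignored.
    """
    counts = {("low", "low"): 0, ("low", "high"): 0,
              ("high", "low"): 0, ("high", "high"): 0}
    for a, b in zip(levels_a, levels_b):
        if (a, b) in counts:
            counts[(a, b)] += 1
    return counts
```

For the pair above, a nearly empty ("high", "high") quadrant is what corresponds to "If ACPP high, then GABRB1 low".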

Threshold Calculation [Sahoo et al. 07]
A threshold is determined for each gene. The arrays are sorted by gene expression. StepMiner is used to determine the threshold.
(Figure: CDH expression across sorted arrays, with the threshold separating High, Intermediate, and Low.)
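The idea of fitting a step to the sorted expression values can be sketched as below. This is a simplified illustration in the spirit of StepMiner, not the published implementation: it assumes the best step is the split that minimizes the sum of squared errors around the two segment means, takes the threshold as the midpoint of those means, and uses a hypothetical `margin` parameter to define the intermediate band.

```python
def step_threshold(values):
    """Fit a single low-to-high step to the sorted values and return a threshold."""
    xs = sorted(values)
    n = len(xs)
    best = None
    for k in range(1, n):
        low, high = xs[:k], xs[k:]
        m_low = sum(low) / k
        m_high = sum(high) / (n - k)
        # Squared error of approximating each segment by its mean.
        sse = (sum((x - m_low) ** 2 for x in low)
               + sum((x - m_high) ** 2 for x in high))
        if best is None or sse < best[0]:
            best = (sse, (m_low + m_high) / 2)
    return best[1]

def discretize(value, threshold, margin=0.5):
    """Label a value as low, high, or intermediate (within margin of the threshold)."""
    if value < threshold - margin:
        return "low"
    if value > threshold + margin:
        return "high"
    return "intermediate"
```

Values labeled intermediate are too close to the threshold to call reliably, which is why the figure shows three bands rather than two.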

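BooleanNet Statistics [Sahoo et al. Genome Biology 08]
For genes A and B, $a_{ij}$ is the number of arrays in each quadrant, with the first index for A and the second for B (0 = low, 1 = high). To test whether the low-low quadrant is sparse:
$n_{A\,low} = a_{00} + a_{01}$, $n_{B\,low} = a_{00} + a_{10}$
$total = a_{00} + a_{01} + a_{10} + a_{11}$, $observed = a_{00}$
$expected = \left( \frac{n_{A\,low}}{total} \cdot \frac{n_{B\,low}}{total} \right) \cdot total$
$statistic = \frac{expected - observed}{\sqrt{expected}}$
$error\ rate = \frac{1}{2} \left( \frac{a_{00}}{a_{00} + a_{01}} + \frac{a_{00}}{a_{00} + a_{10}} \right)$
Boolean Implication = (statistic > 3, error rate < 0.1)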
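The sparseness test for the low-low quadrant translates directly into code. The sketch below follows the formulas on this slide; the function and parameter names are hypothetical, and it assumes both genes are low in at least one array (otherwise the divisions fail).

```python
import math

def low_low_quadrant_test(a00, a01, a10, a11, stat_cutoff=3.0, error_cutoff=0.1):
    """Return (statistic, error_rate, is_sparse) for the A-low, B-low quadrant."""
    n_a_low = a00 + a01
    n_b_low = a00 + a10
    total = a00 + a01 + a10 + a11
    observed = a00
    # Count expected in the quadrant if A and B were independent.
    expected = (n_a_low / total) * (n_b_low / total) * total
    statistic = (expected - observed) / math.sqrt(expected)
    error_rate = 0.5 * (a00 / n_a_low + a00 / n_b_low)
    is_sparse = statistic > stat_cutoff and error_rate < error_cutoff
    return statistic, error_rate, is_sparse
```

A sparse low-low quadrant means arrays with A low almost always have B high. The other three quadrants can be tested the same way by relabeling the counts.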

Six Boolean Implications [Sahoo et al. Genome Biology 08]

MiDReG Algorithm [Sahoo et al. PNAS 2010] MiDReG = (Mining Developmentally Regulated Genes)

B Cell Genes [Sahoo et al. PNAS 2010] (figure labels: CD19, KIT, Boolean Implications)

Jun Seita [Seita, Sahoo et al. PLoS ONE, 2012]

SEQUENCING DATA ANALYSIS

Sequencing Data

FASTQ
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNT
+HWI-EAS209:5:58:5894:21141#ATCACG/1
efcfffffcfeefffcffffffddf`feed]`]_Ba

FASTA
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

FASTQ quality encodings
S - Sanger: Phred+33, (0, 40)
X - Solexa: Solexa+64, (-5, 40)
I - Illumina 1.3+: Phred+64, (0, 40)
J - Illumina 1.5+: Phred+64, (3, 40)
L - Illumina 1.8+: Phred+33, (0, 41)
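The Phred encodings differ only in the ASCII offset, so decoding a quality string is a one-liner. The sketch below handles the Phred+33 and Phred+64 variants (Solexa+64 uses a different score definition and is not covered); the function names are hypothetical. The characters in the FASTQ quality line above (from `B` to `f`) would be consistent with a +64 offset, but that is an inference from the character range, not something stated on the slide.

```python
def decode_quality(quality_string, offset):
    """Convert a FASTQ quality string to a list of Phred scores.

    offset: 33 for Sanger / Illumina 1.8+, 64 for Illumina 1.3+ and 1.5+.
    """
    return [ord(ch) - offset for ch in quality_string]

def error_probability(phred_score):
    """Probability that the base call is wrong: 10 ** (-Q / 10)."""
    return 10 ** (-phred_score / 10)
```

For example, a Phred score of 20 corresponds to a 1 in 100 chance that the base is wrong, and a score of 30 to 1 in 1000.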

Mapping

Mapping Software
Long reads – BLAST, HMMER, SSEARCH
Short reads – BLAT – Bowtie, BWA, Partek, SOAP, Tophat, Olego, BarraCUDA

Visualizations

UCSC Genome Browser, GenoViewer, Samtools tview, MaqView, rtracklayer, BamView, gbrowse2, Integrative Genomics Viewer (IGV)

Quantification
Peak calling – QuEST, MACS, PeakSeq, T-PIC, SIPeS, GLITR, SICER, SiSSRs, OMT
Expression quantification – Cufflinks, NEUMA, RSEM, ABySS, ERANGE, RSAT, Velvet, MISO, RSEQ
SNP calling – samtools, VarScan, GATK, SOAP2, realSFS, Beagle, QCall, MaCH

Peak Discovery [Pepke et al. Nature Methods 2009]

Transcript Quantification [Pepke et al. Nature Methods 2009] RPKM, FPKM
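RPKM (reads per kilobase of transcript per million mapped reads) normalizes a raw read count for both transcript length and sequencing depth. A minimal sketch of the standard formula follows, with hypothetical function and parameter names; FPKM uses the same formula with fragments (read pairs) counted instead of individual reads.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)
```

The factor 1e9 combines the per-kilobase (1e3) and per-million (1e6) scalings, so two transcripts with the same RPKM have the same read density regardless of their length or the depth of the run.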

SNP Calling

Typical RNA-seq Workflow [Trapnell et al. Nature Biotech 2010]