Bioinformatics for Stem Cell Lecture 2

Presentation transcript:

Bioinformatics for Stem Cell Lecture 2 Debashis Sahoo, PhD

Outline
Lecture 1 Recap
Multivariate analysis
Microarray data analysis
Boolean analysis
Sequencing data analysis

Multivariate Analysis

Identify Markers of Human Colon Cancer and Normal Colon
Piero Dalerba, Tomer Kalisky

Single Cell Analysis of Normal Human Colon Epithelium

Hierarchical Clustering

Hierarchical Clustering
http://bonsai.hgc.jp/~mdehoon/software/cluster/
Distance metric: Euclidean, squared Euclidean, Manhattan, maximum, cosine, Pearson’s correlation
Linkage: single, complete, average, median, centroid
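To make the distance metric and linkage choices concrete, here is a minimal sketch of agglomerative clustering in plain Python. It is not the Cluster software linked on the slide; the function names (`euclidean`, `manhattan`, `pearson_distance`, `agglomerate`) and the toy data are assumptions made for illustration, and only three of the listed metrics and three of the linkages are shown.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson_distance(a, b):
    # 1 - Pearson correlation: 0 for perfectly correlated profiles, 2 for anti-correlated.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return 1.0 - cov / (sa * sb)

# Linkage = how pairwise point distances are summarized into a cluster distance.
LINKAGE = {
    "single": min,
    "complete": max,
    "average": lambda d: sum(d) / len(d),
}

def agglomerate(points, dist=euclidean, linkage="average", k=2):
    """Repeatedly merge the two closest clusters until k clusters remain.

    Returns a list of clusters, each a list of indices into `points`.
    """
    link = LINKAGE[linkage]
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairwise = [dist(points[p], points[q])
                            for p in clusters[i] for q in clusters[j]]
                d = link(pairwise)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical toy data: two tight groups of 2-D points.
toy = [[0, 0], [0, 1], [5, 5], [5, 6]]
print(agglomerate(toy, dist=euclidean, linkage="average", k=2))
```

Swapping `dist` or `linkage` changes which clusters merge first, which is why the same expression matrix can produce different dendrograms under different settings.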

Multivariate Analysis - PCA
Principal Component Analysis
X = data matrix
V = loading matrix
U = scores matrix

Fundamentals of PCA
Reduces the dimensions of the data.
PCA uses an orthogonal linear transformation.
The first principal component has the largest possible variance.
Exploratory tool to uncover unknown trends in the data.
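As an illustration of "the first principal component has the largest possible variance", the sketch below finds the first loading vector by power iteration on the covariance matrix. Power iteration is swapped in here for simplicity; the slides do not say how the decomposition is computed, and real analyses would normally use an SVD routine. The function name and the toy matrix are assumptions, with rows as samples and columns as variables.

```python
import math

def first_principal_component(X, iters=200):
    """Return (loading vector, scores, variance explained) for the first PC.

    X is a list of rows (samples) with equal numbers of columns (variables).
    """
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    centered = [[x - m for x, m in zip(row, means)] for row in X]

    # Sample covariance matrix (p x p).
    cov = [[sum(centered[i][a] * centered[i][b] for i in range(n)) / (n - 1)
            for b in range(p)] for a in range(p)]

    # Power iteration: repeated multiplication converges to the eigenvector
    # with the largest eigenvalue, unless the start vector is orthogonal to it.
    v = [1.0] + [0.5] * (p - 1)
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # data has no variance
        v = [x / norm for x in w]

    # Scores: projection of each centered sample onto the loading vector.
    scores = [sum(c * l for c, l in zip(row, v)) for row in centered]
    variance = sum(s * s for s in scores) / (n - 1)
    return v, scores, variance

# Hypothetical toy data lying on the line y = x.
loading, scores, variance = first_principal_component([[1, 1], [2, 2], [3, 3], [4, 4]])
print(loading, variance)
```

For this toy matrix all of the variance lies along one direction, so a single component captures it; in expression data the first few components usually capture only part of it.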

PCA Analysis

High-throughput data analysis

Microarray analysis

Microarray
Spotted vs. in situ
Two channel vs. one channel
Probe vs. probeset vs. gene

Quantile Normalization
Sort the values within each array (#1, #2, #3).
Average the sorted arrays rank by rank to obtain SortedAvg.
Val(Probe_i) = SortedAvg[Rank(Probe_i)]
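The rule Val(Probe_i) = SortedAvg[Rank(Probe_i)] can be sketched in a few lines. This is a minimal version under stated assumptions: every array has the same number of probes, ties are broken by probe position rather than averaged, and the function name and toy values are made up for illustration.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of arrays (one list of probe values per array)."""
    n_probes = len(arrays[0])

    # Sort each array independently, then average across arrays at each rank.
    sorted_arrays = [sorted(a) for a in arrays]
    sorted_avg = [sum(s[r] for s in sorted_arrays) / len(arrays)
                  for r in range(n_probes)]

    normalized = []
    for a in arrays:
        # order[rank] = index of the probe holding that rank in this array.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = sorted_avg[rank]
        normalized.append(out)
    return normalized

# Hypothetical toy data: three arrays, three probes each.
arrays = [[5, 2, 3], [4, 1, 6], [3, 4, 8]]
for column in quantile_normalize(arrays):
    print(column)
```

After normalization every array contains exactly the same set of values, only assigned to different probes, which is what forces the distributions to match.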

Invariant Set Normalization
Figure panels: before normalization; after normalization; invariant set

Good to Check the Image

SAM Two-Class Unpaired
1. Assign experiments to two groups, e.g., in an expression matrix of Gene 1 to Gene 6 by Exp 1 to Exp 6, assign Experiments 1, 2 and 5 to group A, and Experiments 3, 4 and 6 to group B.
2. Question: Is the mean expression level of a gene in group A significantly different from its mean expression level in group B?

SAM Two-Class Unpaired
Permutation tests
i) For each gene, compute the d-value (analogous to the t-statistic). This is the observed d-value for that gene.
ii) Rank the genes in ascending order of their d-values.
iii) Randomly shuffle the values of the genes between groups A and B, such that the reshuffled groups A and B respectively have the same number of elements as the original groups A and B. Compute the d-value for each randomized gene.
Figure: Gene 1 under the original grouping (Exp 1 to Exp 6 split into groups A and B) and under a randomized grouping (shown in the order Exp 1, Exp 4, Exp 5, Exp 2, Exp 3, Exp 6).

SAM Two-Class Unpaired
iv) Rank the permuted d-values of the genes in ascending order.
v) Repeat steps iii) and iv) many times, so that each gene has many randomized d-values corresponding to its rank from the observed (unpermuted) d-value. Take the average of the randomized d-values for each gene. This is the expected d-value of that gene.
vi) Plot the observed d-values vs. the expected d-values.

SAM Two-Class Unpaired
Plot annotations: the “Observed d = expected d” line; significant positive genes (i.e., mean expression of group B > mean expression of group A); significant negative genes (i.e., mean expression of group A > mean expression of group B).
The more a gene deviates from the “observed = expected” line, the more likely it is to be significant. Any gene beyond the first gene in the +ve or -ve direction on the x-axis (including the first gene) whose observed d-value exceeds the expected d-value by at least delta is considered significant.
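Steps i) to vi) and the delta rule can be sketched as follows. This is a simplified illustration, not the SAM software: the d-value here is (mean of B - mean of A) / (s + s0) with a pooled standard error s and a fixed fudge factor `s0`, and the defaults for `s0`, `delta`, `n_perm`, the function names, and the toy matrix are all assumptions. The real method also chooses s0 from the data and estimates a false discovery rate, which is omitted.

```python
import math
import random

def d_value(values, in_b, s0=0.1):
    """d-value of one gene; in_b[i] is True when experiment i belongs to group B."""
    a = [v for v, flag in zip(values, in_b) if not flag]
    b = [v for v, flag in zip(values, in_b) if flag]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    ss = sum((v - mean_a) ** 2 for v in a) + sum((v - mean_b) ** 2 for v in b)
    s = math.sqrt((1 / len(a) + 1 / len(b)) * ss / (len(a) + len(b) - 2))
    return (mean_b - mean_a) / (s + s0)

def sam(matrix, in_b, n_perm=200, delta=1.0, s0=0.1, seed=0):
    """Return (positive_genes, negative_genes) as lists of row indices."""
    rng = random.Random(seed)
    n = len(matrix)

    # i), ii) observed d-values, ranked in ascending order.
    observed = sorted((d_value(row, in_b, s0), g) for g, row in enumerate(matrix))

    # iii) to v) shuffle group labels, rank the permuted d-values, average per rank.
    sums = [0.0] * n
    labels = list(in_b)
    for _ in range(n_perm):
        rng.shuffle(labels)
        permuted = sorted(d_value(row, labels, s0) for row in matrix)
        sums = [s + d for s, d in zip(sums, permuted)]
    expected = [s / n_perm for s in sums]

    # vi) walk outwards from the rank whose expected d is closest to zero and
    # find the first gene that departs from the diagonal by at least delta.
    origin = min(range(n), key=lambda i: abs(expected[i]))
    up = next((i for i in range(origin, n)
               if observed[i][0] - expected[i] >= delta), n)
    down = next((i for i in range(origin, -1, -1)
                 if expected[i] - observed[i][0] >= delta), -1)
    positive = [g for _, g in observed[up:]]
    negative = [g for _, g in observed[:down + 1]]
    return positive, negative

# Hypothetical toy matrix: Experiments 1, 2 and 5 in group A; 3, 4 and 6 in group B.
in_b = [False, False, True, True, False, True]
matrix = [
    [0, 0, 2, 2, 0, 2],                   # higher in B
    [2, 2, 0, 0, 2, 0],                   # higher in A
    [1.0, 1.1, 1.0, 1.1, 1.05, 1.0],      # no clear difference
    [1.1, 1.0, 1.05, 1.0, 1.0, 1.1],      # no clear difference
]
print(sam(matrix, in_b))
```

With only three experiments per group there are just 20 distinct groupings, so the expected d-values are coarse; real datasets with more samples give a smoother diagonal.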

GenePattern http://genepattern.broadinstitute.org/

AutoSOME
http://jimcooperlab.mcdb.ucsb.edu/autosome/
Aaron Newman
[Aaron Newman and James Cooper, BMC Bioinformatics, 2010, 11:117]

Gene Set Analysis
Your gene set: cell cycle, transcription factor
Compute enrichment in pathways and networks: TGF-beta signaling pathway, Wnt signaling pathway, protein-protein interaction network
Tools: GSEA, DAVID, ToppFun, MSigDB, and STRING
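One common way to "compute enrichment" of a gene set in a pathway is an over-representation test based on the hypergeometric distribution. The sketch below shows that calculation only; it is not the algorithm of any specific tool named on the slide (GSEA, for example, uses a rank-based statistic instead), and the function and parameter names and the numbers in the example are hypothetical.

```python
from math import comb

def enrichment_p_value(universe_size, pathway_size, list_size, overlap):
    """P(at least `overlap` pathway genes in a random gene list of `list_size`).

    universe_size: number of genes that could have been selected
    pathway_size:  number of those genes annotated to the pathway
    list_size:     number of genes in your gene set
    overlap:       number of genes in your set that belong to the pathway
    """
    upper = min(list_size, pathway_size)
    total = comb(universe_size, list_size)
    tail = sum(comb(pathway_size, i) * comb(universe_size - pathway_size, list_size - i)
               for i in range(overlap, upper + 1))
    return tail / total

# Hypothetical numbers: 20,000 genes, a 150-gene pathway, a 300-gene list, 12 shared genes.
print(enrichment_p_value(20000, 150, 300, 12))
```

When many pathways are tested at once, the resulting p-values need a multiple-testing correction before they are interpreted.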

Boolean Analysis

Boolean Implication
Analyze pairs of genes.
Analyze the four different quadrants.
Identify sparse quadrants.
Record the Boolean relationships.
Example (scatter plot of ACPP vs. GABRB1 across 45,000 Affymetrix microarrays): If ACPP high, then GABRB1 low. If GABRB1 high, then ACPP low.
[Sahoo et al. Genome Biology 08]

Threshold Calculation
A threshold is determined for each gene.
The arrays are sorted by gene expression.
StepMiner is used to determine the threshold.
Figure: CDH expression across the sorted arrays, with High, Intermediate, and Low regions marked around the threshold.
[Sahoo et al. 07]
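The idea of fitting a step to the sorted expression values can be sketched as below. This is a rough illustration rather than the StepMiner implementation: it assumes a single rising step chosen by least squares, takes the threshold as the midpoint between the two fitted levels, and treats values within a `margin` of the threshold as intermediate. The midpoint rule, the default margin of 0.5, and the function names are assumptions made for this sketch.

```python
def step_threshold(values):
    """Fit a one-step function to the sorted values; return the threshold."""
    xs = sorted(values)
    n = len(xs)
    best = None
    for k in range(1, n):  # the step sits between xs[k-1] and xs[k]
        low, high = xs[:k], xs[k:]
        mean_low, mean_high = sum(low) / k, sum(high) / (n - k)
        sse = (sum((v - mean_low) ** 2 for v in low)
               + sum((v - mean_high) ** 2 for v in high))
        if best is None or sse < best[0]:
            best = (sse, (mean_low + mean_high) / 2)
    return best[1]

def discretize(value, threshold, margin=0.5):
    """Label an expression value as high, low, or intermediate."""
    if value > threshold + margin:
        return "high"
    if value < threshold - margin:
        return "low"
    return "intermediate"

# Hypothetical log-scale expression values for one gene across six arrays.
expression = [1, 9, 1, 9, 1, 9]
t = step_threshold(expression)
print(t, [discretize(v, t) for v in expression])
```

Arrays labelled intermediate would be left out when counting quadrants, so that noise near the threshold does not decide whether a quadrant looks sparse.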

BooleanNet Statistics
Quadrant counts for genes A and B: a00 (A low, B low), a01 (A low, B high), a10 (A high, B low), a11 (A high, B high)
nAlow = (a00 + a01), nBlow = (a00 + a10)
total = a00 + a01 + a10 + a11, observed = a00
expected = (nAlow / total * nBlow / total) * total
$$\text{statistic} = \frac{\text{expected} - \text{observed}}{\sqrt{\text{expected}}}$$
$$\text{error rate} = \frac{1}{2}\left(\frac{a_{00}}{a_{00} + a_{01}} + \frac{a_{00}}{a_{00} + a_{10}}\right)$$
Boolean Implication = (statistic > 3, error rate < 0.1)
[Sahoo et al. Genome Biology 08]
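The sparsity test for the low-low quadrant translates directly into code. The sketch below follows the definitions on this slide (expected count under independence, the statistic, the error rate, and the cutoffs 3 and 0.1); the function names and the example counts are assumptions, and the other three quadrants would be tested the same way with the counts relabelled.

```python
import math

def low_low_statistics(a00, a01, a10, a11):
    """Return (statistic, error_rate) for the quadrant where A and B are both low."""
    total = a00 + a01 + a10 + a11
    n_a_low = a00 + a01
    n_b_low = a00 + a10
    observed = a00
    expected = (n_a_low / total) * (n_b_low / total) * total
    statistic = (expected - observed) / math.sqrt(expected)
    error_rate = 0.5 * (a00 / n_a_low + a00 / n_b_low)
    return statistic, error_rate

def is_sparse(a00, a01, a10, a11, min_statistic=3.0, max_error=0.1):
    """True when the low-low quadrant is significantly emptier than expected."""
    statistic, error_rate = low_low_statistics(a00, a01, a10, a11)
    return statistic > min_statistic and error_rate < max_error

# Hypothetical counts: almost no arrays have both genes low.
print(low_low_statistics(1, 99, 99, 1), is_sparse(1, 99, 99, 1))
# Hypothetical counts: arrays spread evenly over the four quadrants.
print(is_sparse(50, 50, 50, 50))
```

A sparse low-low quadrant corresponds to the implication "if A low, then B high"; the statistic checks that the quadrant is emptier than independence would predict, and the error rate checks that the few points in it are a small fraction of the low arrays.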

Six Boolean Implications
Sparse quadrants are highlighted.
Panels: asymmetric and symmetric implications.
[Sahoo et al. Genome Biology 08]

MiDReG Algorithm
MiDReG = Mining Developmentally Regulated Genes
[Sahoo et al. PNAS 2010]

B Cell Genes: Boolean Implications
KIT, CD19
Cancer datasets can predict normal differentiation steps.
[Sahoo et al. PNAS 2010]

http://gexc.stanford.edu
Jun Seita
[Seita, Sahoo et al. PLoS ONE, 2012]

Sequencing data analysis

Sequencing Data Format
FASTA:
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
FASTQ:
@HWI-EAS209:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNT
+HWI-EAS209:5:58:5894:21141#ATCACG/1
efcfffffcfeefffcffffffddf`feed]`]_Ba
FASTQ quality encodings (offset, score range):
S - Sanger: Phred+33, (0, 40)
X - Solexa: Solexa+64, (-5, 40)
I - Illumina 1.3+: Phred+64, (0, 40)
J - Illumina 1.5+: Phred+64, (3, 40)
L - Illumina 1.8+: Phred+33, (0, 41)
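To show how the quality line is read, the sketch below parses the FASTQ record from this slide and converts each quality character to a score by subtracting the encoding offset. The helper names are assumptions, and `guess_offset` is only a heuristic derived from the score ranges listed above: characters below ASCII 59 can only occur with a +33 offset, characters above ASCII 74 only with a +64 offset, and anything else is ambiguous.

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into (read_id, sequence, quality_string)."""
    header, sequence, plus, quality = [line.strip() for line in lines[:4]]
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], sequence, quality

def decode_quality(quality, offset):
    """Convert quality characters to integer scores for a given ASCII offset."""
    return [ord(c) - offset for c in quality]

def guess_offset(quality):
    """Heuristic: return 33, 64, or None when the characters fit both offsets."""
    codes = [ord(c) for c in quality]
    if min(codes) < 59:
        return 33
    if max(codes) > 74:
        return 64
    return None

record = [
    "@HWI-EAS209:5:58:5894:21141#ATCACG/1",
    "TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNT",
    "+HWI-EAS209:5:58:5894:21141#ATCACG/1",
    "efcfffffcfeefffcffffffddf`feed]`]_Ba",
]
read_id, sequence, quality = parse_fastq_record(record)
offset = guess_offset(quality)
print(read_id, offset, decode_quality(quality, offset))
```

For this record the characters only fit a +64 offset, so decoding with +33 would inflate every score by 31; this is why the encoding has to be known or detected before quality filtering.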

Mapping

Mapping Software
Long reads: BLAST, HMMER, SSEARCH, BLAT
Short reads: Bowtie, BWA, Partek, SOAP, Tophat, Olego, BarraCUDA

Visualizations

Visualizations
UCSC Genome Browser, GenoViewer, Samtools tview, MaqView, rtracklayer, BamView, gbrowse2, Integrative Genomics Viewer (IGV)

Quantification
Peak calling: QuEST, MACS, PeakSeq, T-PIC, SIPeS, GLITR, SICER, SiSSRs, OMT
Expression quantification: Cufflinks, NEUMA, RSEM, ABySS, ERANGE, RSAT, Velvet, MISO, RSEQ
SNP calling: samtools, VarScan, GATK, SOAP2, realSFS, Beagle, QCall, MaCH

Peak Discovery [Pepke et al. Nature Methods 2009]

Transcript Quantification RPKM, FPKM [Pepke et al. Nature Methods 2009]
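RPKM (reads per kilobase of transcript per million mapped reads) normalizes a raw read count for both transcript length and sequencing depth; FPKM applies the same formula to fragments, so that a read pair counts once. A minimal sketch of the calculation follows; the function name and the example numbers are hypothetical.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    Pass fragment counts instead of read counts to obtain FPKM.
    """
    kilobases = transcript_length_bp / 1e3
    millions = total_mapped_reads / 1e6
    return read_count / (kilobases * millions)

# Hypothetical example: 500 reads on a 2,000 bp transcript in a library of
# 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))  # 25.0
```

Because the value depends on the total mapped reads of each library, RPKM values are comparable between genes within a sample more readily than between samples with very different compositions.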

SNP Calling

Typical RNA-seq Workflow [Trapnell et al. Nature Biotech 2010]

[Trapnell et al. Nature Biotech 2010]