1
Bioinformatics for Stem Cell Lecture 2
Debashis Sahoo, PhD
2
Outline
Lecture 1 Recap
Multivariate analysis
Microarray data analysis
Boolean analysis
Sequencing data analysis
3
Multivariate Analysis
4
Identify Markers of Human Colon Cancer and Normal Colon
Piero Dalerba Tomer Kalisky
5
Single Cell Analysis of Normal Human Colon Epithelium
6
Hierarchical Clustering
7
Hierarchical Clustering
Distance metric: Euclidean, squared Euclidean, Manhattan, maximum, cosine, Pearson’s correlation
Linkage: single, complete, average, median, centroid
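A minimal sketch of agglomerative clustering may help make the roles of the distance metric and the linkage concrete. It assumes Euclidean distance and supports only single, complete, and average linkage. The function names and the toy `profiles` data are illustrative, not taken from the lecture, and the naive search is meant for small examples only.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def linkage_distance(c1, c2, points, method):
    """Distance between two clusters, each given as a list of point indices."""
    dists = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if method == "single":      # closest pair of members
        return min(dists)
    if method == "complete":    # farthest pair of members
        return max(dists)
    if method == "average":     # mean over all member pairs
        return sum(dists) / len(dists)
    raise ValueError("unsupported linkage: " + method)

def agglomerate(points, method="average"):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage_distance(clusters[a], clusters[b], points, method)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Toy expression profiles (hypothetical): two tight pairs far from each other.
profiles = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
history = agglomerate(profiles, method="single")
```

Each entry of `history` records the two clusters that were merged and the distance at which they merged, which is the information a dendrogram draws.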
8
Multivariate Analysis - PCA
Principal Component Analysis
X = data matrix
V = loading matrix
U = scores matrix
9
Fundamentals of PCA
Reduces the dimensions of the data.
PCA uses an orthogonal linear transformation.
The first principal component has the largest possible variance.
Exploratory tool to uncover unknown trends in the data.
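The claim that the first principal component has the largest possible variance can be illustrated with a small sketch. It assumes rows are samples and columns are variables. It finds the first loading vector by power iteration on the sample covariance matrix, which is one simple technique and not necessarily how any particular PCA tool works. The function name and the toy `data` are illustrative.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (loading vector, scores) for the first principal component."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration: repeated multiplication converges to the top eigenvector,
    # provided the start vector is not orthogonal to it (an assumption here).
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Scores: projection of each centered sample onto the loading vector.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores

# Toy data (hypothetical): the second variable is roughly twice the first.
data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
loading, scores = first_principal_component(data)
```

For this toy data the loading vector points roughly along the direction (1, 2), and the variance of `scores` exceeds the variance of either original column.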
10
PCA Analysis
11
High-throughput data analysis
12
Microarray analysis
13
Microarray
Spotted vs. in situ
Two channel vs. one channel
Probe vs. probeset vs. gene
14
Quantile Normalization
[Figure: arrays #1, #2, #3 are each sorted, then averaged rank by rank to give SortedAvg]
Val(Probe_i) = SortedAvg[Rank(Probe_i)]
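The rule Val(Probe_i) = SortedAvg[Rank(Probe_i)] can be sketched in a few lines. The sketch assumes every array lists the same probes in the same order and breaks ties by position, which is a simplification. The function name and the toy values are illustrative.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of equal-length arrays (one list per microarray)."""
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # SortedAvg[r] = mean of the r-th smallest value across all arrays.
    sorted_avg = [sum(s[r] for s in sorted_arrays) / len(arrays)
                  for r in range(n_probes)]
    normalized = []
    for a in arrays:
        # order[rank] = index of the probe holding that rank in this array.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = sorted_avg[rank]
        normalized.append(out)
    return normalized

# Three toy arrays with four probes each (hypothetical values).
arrays = [[5.0, 2.0, 3.0, 4.0],
          [4.0, 1.0, 4.5, 2.0],
          [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(arrays)
```

After normalization every array contains exactly the same set of values, so their distributions are identical, while each probe keeps its rank within its own array.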
15
Invariant Set Normalization
[Figure: before normalization vs. after invariant set normalization]
16
Good to Check the Image
17
SAM Two-Class Unpaired
1. Assign experiments to two groups, e.g., in the expression matrix below, assign Experiments 1, 2 and 5 to group A, and Experiments 3, 4 and 6 to group B.
[Figure: expression matrix of Gene 1 to Gene 6 (rows) by Exp 1 to Exp 6 (columns), with each column labeled Group A or Group B]
2. Question: Is the mean expression level of a gene in group A significantly different from its mean expression level in group B?
18
SAM Two-Class Unpaired
Permutation tests
i) For each gene, compute the d-value (analogous to the t-statistic). This is the observed d-value for that gene.
ii) Rank the genes in ascending order of their d-values.
iii) Randomly shuffle the values of the genes between groups A and B, such that the reshuffled groups A and B have the same number of elements as the original groups A and B, respectively. Compute the d-value for each randomized gene.
[Figure: Gene 1 across Exp 1 to Exp 6 under the original grouping, and in the order Exp 1, Exp 4, Exp 5, Exp 2, Exp 3, Exp 6 under a randomized grouping, each split into Group A and Group B]
19
SAM Two-Class Unpaired
iv) Rank the permuted d-values of the genes in ascending order.
v) Repeat steps iii) and iv) many times, so that each gene has many randomized d-values corresponding to its rank from the observed (unpermuted) d-value. Take the average of the randomized d-values for each gene. This is the expected d-value of that gene.
vi) Plot the observed d-values vs. the expected d-values.
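Steps i) to v) can be sketched as follows. This is a simplified illustration, not the SAM software itself. The d-value is taken as (mean of B minus mean of A) divided by (pooled standard error plus a small constant `s0`), the value of `s0` is an assumption, and shuffling is done by permuting the experiment labels. The function names and the toy matrix are illustrative.

```python
import math
import random

def d_value(values, group_a, group_b, s0=0.1):
    """SAM-style relative difference for one gene; s0 is an assumed constant."""
    a = [values[i] for i in group_a]
    b = [values[i] for i in group_b]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    ss = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    s = math.sqrt((1 / len(a) + 1 / len(b)) * ss / (len(a) + len(b) - 2))
    return (mean_b - mean_a) / (s + s0)

def sam_observed_expected(matrix, group_a, group_b, n_perm=200, seed=0):
    """Return (observed, expected) d-values, both in ascending rank order."""
    rng = random.Random(seed)
    # Steps i) and ii): observed d-values, ranked.
    observed = sorted(d_value(g, group_a, group_b) for g in matrix)
    columns = list(group_a) + list(group_b)
    totals = [0.0] * len(matrix)
    for _ in range(n_perm):
        # Step iii): reshuffle experiments, keeping the group sizes.
        shuffled = columns[:]
        rng.shuffle(shuffled)
        perm_a = shuffled[:len(group_a)]
        perm_b = shuffled[len(group_a):]
        # Step iv): rank the permuted d-values.
        permuted = sorted(d_value(g, perm_a, perm_b) for g in matrix)
        for rank, d in enumerate(permuted):
            totals[rank] += d
    # Step v): average the permuted d-values at each rank.
    expected = [t / n_perm for t in totals]
    return observed, expected

# Toy matrix (hypothetical): rows are genes, columns are Exp 1 to Exp 6.
matrix = [
    [1.0, 1.2, 5.0, 5.3, 0.9, 4.8],  # higher in group B
    [3.0, 2.9, 3.1, 3.0, 3.2, 2.8],  # no clear difference
    [6.0, 6.4, 1.1, 0.8, 6.2, 1.0],  # higher in group A
    [2.0, 2.5, 2.2, 2.4, 2.1, 2.6],
]
group_a = [0, 1, 4]  # Experiments 1, 2 and 5 (zero-based indices)
group_b = [2, 3, 5]  # Experiments 3, 4 and 6
observed, expected = sam_observed_expected(matrix, group_a, group_b)
```

Plotting `observed` against `expected` gives the plot of step vi).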
20
SAM Two-Class Unpaired
[Figure: plot of observed d-values vs. expected d-values, with the “Observed d = expected d” line]
Significant positive genes: mean expression of group B > mean expression of group A.
Significant negative genes: mean expression of group A > mean expression of group B.
The more a gene deviates from the “observed = expected” line, the more likely it is to be significant. Any gene beyond the first gene in the +ve or –ve direction on the x-axis (including the first gene) whose observed d-value differs from the expected d-value by at least delta (above it for positive genes, below it for negative genes) is considered significant.
21
GenePattern
22
AutoSOME http://jimcooperlab.mcdb.ucsb.edu/autosome/ Aaron Newman
Aaron Newman and James Cooper, BMC Bioinformatics, 2010, 11:117
23
Gene Set Analysis
Compute enrichment of your gene set in pathways and networks.
[Figure: your gene set compared against gene sets such as cell cycle, transcription factor, TGF-beta signaling pathway, Wnt signaling pathway, and protein-protein interaction network]
Tools: GSEA, DAVID, ToppFun, MSigDB, and STRING
24
Boolean Analysis
25
Boolean Implication
Analyze pairs of genes.
Analyze the four different quadrants.
Identify sparse quadrants.
Record the Boolean relationships.
[Figure: scatter plot of ACPP and GABRB1 expression across 45,000 Affymetrix microarrays; each point is one microarray]
If ACPP high, then GABRB1 low.
If GABRB1 high, then ACPP low.
[Sahoo et al. Genome Biology 08]
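Counting the four quadrants for a gene pair can be sketched as below. The sketch assumes one threshold per gene is already known and ignores any intermediate zone around the threshold. The function name, thresholds, and expression values are hypothetical.

```python
def quadrant_counts(expr_a, expr_b, thr_a, thr_b):
    """Count arrays per quadrant; key (i, j): i = 1 if A is high, j = 1 if B is high."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for a, b in zip(expr_a, expr_b):
        counts[(int(a > thr_a), int(b > thr_b))] += 1
    return counts

# Hypothetical expression values for two genes across seven arrays.
acpp = [9.1, 8.7, 2.0, 1.5, 2.2, 8.9, 1.8]
gabrb1 = [1.0, 1.4, 7.5, 8.0, 1.2, 0.9, 7.7]
counts = quadrant_counts(acpp, gabrb1, thr_a=5.0, thr_b=5.0)
```

In this toy example the high-high quadrant is empty, which corresponds to "if ACPP high, then GABRB1 low" and, equivalently, "if GABRB1 high, then ACPP low".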
26
Threshold Calculation
A threshold is determined for each gene.
The arrays are sorted by gene expression.
StepMiner is used to determine the threshold.
[Figure: CDH expression across the sorted arrays, divided into High, Intermediate, and Low regions around the threshold]
[Sahoo et al. 07]
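The idea of fitting a step to sorted expression values can be sketched as follows. This is a simplified stand-in rather than the StepMiner implementation. It fits a single rising step by least squares and, as an assumption, reports the midpoint of the two fitted levels as the threshold. The function name and the toy values are illustrative.

```python
def one_step_threshold(values):
    """Fit one rising step to the sorted values by least squares; return a threshold."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    best = None
    left_sum = 0.0
    left_sq = 0.0
    for k in range(1, n):  # candidate step between xs[k-1] and xs[k]
        left_sum += xs[k - 1]
        left_sq += xs[k - 1] ** 2
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        # Sum of squared errors when each side is replaced by its own mean.
        sse = (left_sq - left_sum ** 2 / k) + (right_sq - right_sum ** 2 / (n - k))
        if best is None or sse < best[0]:
            best = (sse, left_sum / k, right_sum / (n - k))
    _, low_mean, high_mean = best
    # Assumed rule: the threshold is the midpoint of the two step levels.
    return (low_mean + high_mean) / 2

# Hypothetical expression values for one gene across eight arrays.
expression = [1.0, 1.2, 0.9, 1.1, 7.8, 8.1, 8.0, 7.9]
threshold = one_step_threshold(expression)
```

Arrays with values above `threshold` would be called high for this gene and those below it low.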
27
BooleanNet Statistics
[Figure: scatter plot of gene A vs. gene B divided into four quadrants with counts a00, a01, a10, a11 (first index: A low = 0 or high = 1; second index: B low = 0 or high = 1)]
nAlow = (a00 + a01), nBlow = (a00 + a10)
total = a00 + a01 + a10 + a11, observed = a00
expected = (nAlow / total * nBlow / total) * total
\text{statistic} = \frac{\text{expected} - \text{observed}}{\sqrt{\text{expected}}}
\text{error rate} = \frac{1}{2} \left( \frac{a_{00}}{a_{00} + a_{01}} + \frac{a_{00}}{a_{00} + a_{10}} \right)
Boolean Implication = (statistic > 3, error rate < 0.1)
[Sahoo et al. Genome Biology 08]
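The sparsity test for the low-low quadrant can be sketched directly from the definitions on this slide: expected = (nAlow / total * nBlow / total) * total, statistic = (expected - observed) / sqrt(expected), and error rate = 1/2 * (a00 / (a00 + a01) + a00 / (a00 + a10)). The function names and the example counts are illustrative, and the sketch assumes that neither gene is high in every array, so no denominator is zero.

```python
import math

def low_low_sparsity(a00, a01, a10, a11):
    """Return (statistic, error_rate) for the quadrant where A and B are both low."""
    total = a00 + a01 + a10 + a11
    n_a_low = a00 + a01
    n_b_low = a00 + a10
    observed = a00
    expected = (n_a_low / total) * (n_b_low / total) * total
    statistic = (expected - observed) / math.sqrt(expected)
    error_rate = 0.5 * (a00 / n_a_low + a00 / n_b_low)
    return statistic, error_rate

def is_sparse(a00, a01, a10, a11, min_statistic=3.0, max_error=0.1):
    """Apply the slide's cutoffs: statistic > 3 and error rate < 0.1."""
    statistic, error_rate = low_low_sparsity(a00, a01, a10, a11)
    return statistic > min_statistic and error_rate < max_error

# Hypothetical quadrant counts over 1,000 arrays.
statistic, error_rate = low_low_sparsity(2, 400, 300, 298)
```

With these counts the low-low quadrant holds 2 arrays where about 121 would be expected, so it passes both cutoffs. A sparse low-low quadrant corresponds to the implication "if A low, then B high".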
28
Six Boolean Implications
Sparse quadrants are highlighted.
[Sahoo et al. Genome Biology 08]
29
MiDReG Algorithm MiDReG = (Mining Developmentally Regulated Genes)
[Sahoo et al. PNAS 2010]
30
MiDReG Algorithm MiDReG = (Mining Developmentally Regulated Genes)
[Sahoo et al. PNAS 2010]
31
MiDReG Algorithm MiDReG = (Mining Developmentally Regulated Genes)
[Sahoo et al. PNAS 2010]
32
B Cell Genes
Boolean implications: KIT, CD19
Cancer datasets can predict normal differentiation steps.
[Sahoo et al. PNAS 2010]
33
http://gexc.stanford.edu Jun Seita
[Seita, Sahoo et al. PLoS ONE, 2012]
34
Sequencing data analysis
35
Sequencing Data Format
FASTA
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

FASTQ
@HWI-EAS209:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNT
+HWI-EAS209:5:58:5894:21141#ATCACG/1
efcfffffcfeefffcffffffddf`feed]`]_Ba

FASTQ quality score encodings
S - Sanger: Phred+33, (0, 40)
X - Solexa: Solexa+64, (-5, 40)
I - Illumina 1.3+: Phred+64, (0, 40)
J - Illumina 1.5+: Phred+64, (3, 40)
L - Illumina 1.8+: Phred+33, (0, 41)
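Reading a FASTQ record and decoding its quality string can be sketched as below, using the record shown on this slide. The offset of 64 is an inference rather than something the slide states: characters such as `e` and `f` lie above the Phred+33 range, which points to a +64 encoding. The function names are illustrative, and the parser assumes plain four-line records.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality_string) from four-line FASTQ records."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (i + 1))
        yield header[1:], seq, qual

def decode_quality(qual, offset):
    """Convert a quality string to integer scores by subtracting the ASCII offset."""
    return [ord(ch) - offset for ch in qual]

record = [
    "@HWI-EAS209:5:58:5894:21141#ATCACG/1",
    "TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNT",
    "+HWI-EAS209:5:58:5894:21141#ATCACG/1",
    "efcfffffcfeefffcffffffddf`feed]`]_Ba",
]
read_id, seq, qual = next(parse_fastq(record))
scores = decode_quality(qual, offset=64)  # assumed Phred+64 encoding
```

For a Phred+33 file (Sanger or Illumina 1.8+) the same function would be called with `offset=33`.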
36
Mapping
37
Mapping Software
Long reads: BLAST, HMMER, SSEARCH, BLAT
Short reads: Bowtie, BWA, Partek, SOAP, TopHat, Olego, BarraCUDA
38
Visualizations
39
Visualizations
UCSC Genome Browser
Integrative Genomics Viewer (IGV)
GenoViewer, Samtools tview, MaqView, rtracklayer, BamView, gbrowse2
40
Quantification
Peak calling: QuEST, MACS, PeakSeq, T-PIC, SIPeS, GLITR, SICER, SiSSRs, OMT
Expression quantification: Cufflinks, NEUMA, RSEM, ABySS, ERANGE, RSAT, Velvet, MISO, RSEQ
SNP calling: samtools, VarScan, GATK, SOAP2, realSFS, Beagle, QCall, MaCH
41
Peak Discovery [Pepke et al. Nature Methods 2009]
42
Transcript Quantification
RPKM, FPKM [Pepke et al. Nature Methods 2009]
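As a brief illustration of the unit, RPKM (reads per kilobase of transcript per million mapped reads) can be computed as 10^9 * C / (N * L), where C is the number of reads mapped to the gene, N the total number of mapped reads, and L the gene length in base pairs. FPKM uses the same formula with fragments counted instead of reads. The function name and the example numbers are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

# Hypothetical gene: 500 reads, 2,000 bp long, 10 million mapped reads in total.
value = rpkm(500, 2000, 10_000_000)
```

Dividing by gene length and by sequencing depth makes values comparable across genes of different lengths and across samples sequenced to different depths.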
43
SNP Calling
44
Typical RNA-seq Workflow
[Trapnell et al. Nature Biotech 2010]
45
[Trapnell et al. Nature Biotech 2010]