
Bioinformatics for Stem Cell Lecture 2 Debashis Sahoo, PhD.


1 Bioinformatics for Stem Cell Lecture 2 Debashis Sahoo, PhD

2 Outline
- Lecture 1 Recap
- Multivariate analysis
- Microarray data analysis
- Boolean analysis
- Sequencing data analysis

3 MULTIVARIATE ANALYSIS

4 Identify Markers of Human Colon Cancer and Normal Colon. Piero Dalerba, Tomer Kalisky

5 Single Cell Analysis of Normal Human Colon Epithelium

6 Hierarchical Clustering

7 Cluster 3.0: http://bonsai.hgc.jp/~mdehoon/software/cluster/
Distance metric: Euclidean, squared Euclidean, Manhattan, maximum, cosine, Pearson's correlation
Linkage: single, complete, average, median, centroid
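As a rough illustration of the distance metrics and linkage rules listed on this slide, here is a minimal pure-Python sketch. The function names are hypothetical and this is not Cluster 3.0's code; in particular, "Pearson distance" is taken here to mean 1 minus the correlation coefficient, which is one common convention and an assumption on our part.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson_distance(x, y):
    """1 - Pearson correlation (assumed convention); 0 for perfectly correlated profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def linkage_distance(cluster1, cluster2, metric, linkage="average"):
    """Distance between two clusters of profiles under single, complete, or average linkage."""
    pairwise = [metric(x, y) for x in cluster1 for y in cluster2]
    if linkage == "single":
        return min(pairwise)          # closest pair
    if linkage == "complete":
        return max(pairwise)          # farthest pair
    if linkage == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError("unsupported linkage: " + linkage)

gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
print(euclidean(gene_a, gene_b), manhattan(gene_a, gene_b), pearson_distance(gene_a, gene_b))
```

The two example profiles differ in magnitude but have the same shape, so the correlation-based distance treats them as close while the Euclidean and Manhattan distances do not. That difference is often the reason to choose one metric over another for expression data.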

8 Multivariate Analysis - PCA (Principal Component Analysis)
X = data matrix, V = loading matrix, U = scores matrix

9 Fundamentals of PCA
- Reduces the dimensions of the data.
- Uses an orthogonal linear transformation.
- The first principal component has the largest possible variance.
- Exploratory tool to uncover unknown trends in the data.
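To make the "largest possible variance" property concrete, the sketch below estimates the first principal component by power iteration on the sample covariance matrix. This is an illustrative stand-in, not the method used in the lecture; the function name, the starting vector, and the convention that scores are the centered data projected onto the loading vector are all assumptions.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (loading vector, scores) for the first PC of a samples-by-features table.

    Assumes at least two samples and non-zero variance. Power iteration can
    fail if the start vector is exactly orthogonal to the top eigenvector.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [float(j + 1) for j in range(p)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Scores: projection of each centered sample onto the loading vector.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores

data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
loading, scores = first_principal_component(data)
print(loading, scores)
```

In the example, every sample lies on the line y = 2x, so a single component captures all of the variance and the loading vector points along that line.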

10 PCA Analysis

11 HIGH-THROUGHPUT DATA ANALYSIS

12 MICROARRAY ANALYSIS

13 Microarray
- Spotted vs. in situ
- Two channel vs. one channel
- Probe vs. probeset vs. gene

14 Quantile Normalization
Sort each array (#1, #2, #3), then average across arrays at each rank to obtain SortedAvg.
Val(Probe_i) = SortedAvg[Rank(Probe_i)]
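The rule Val(Probe_i) = SortedAvg[Rank(Probe_i)] can be sketched in a few lines of Python. The function name and the tie-breaking behavior (ties keep their original order) are assumptions made for illustration; production implementations often average tied ranks instead.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equal-length arrays (one list of probe values per array)."""
    n = len(arrays[0])
    sorted_cols = [sorted(col) for col in arrays]
    # SortedAvg[r] = mean of the r-th smallest value across all arrays.
    sorted_avg = [sum(col[r] for col in sorted_cols) / len(arrays) for r in range(n)]
    normalized = []
    for col in arrays:
        # Probe indices ordered by value; position in this list is the probe's rank.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, probe in enumerate(order):
            out[probe] = sorted_avg[rank]
        normalized.append(out)
    return normalized

arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0], [3.0, 4.0, 8.0]]
for col in quantile_normalize(arrays):
    print(col)
```

After normalization, every array contains exactly the same set of values (the entries of SortedAvg), and only their assignment to probes differs.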

15 Invariant Set Normalization [Figure: before normalization vs. after normalization, with the invariant set marked]

16 Good to Check the Image

17 SAM Two-Class Unpaired
1. Assign experiments to two groups. For example, in an expression matrix of Genes 1-6 by Experiments 1-6, assign Experiments 1, 2 and 5 to group A, and Experiments 3, 4 and 6 to group B.
2. Question: Is the mean expression level of a gene in group A significantly different from its mean expression level in group B?

18 SAM Two-Class Unpaired: Permutation tests
i) For each gene, compute the d-value (analogous to the t-statistic). This is the observed d-value for that gene.
ii) Rank the genes in ascending order of their d-values.
iii) Randomly shuffle the values of the genes between groups A and B, such that the reshuffled groups A and B have the same number of elements as the original groups A and B, respectively. Compute the d-value for each randomized gene.
[Figure: Gene 1 row under the original grouping (Exp 1, Exp 2, Exp 3, Exp 4, Exp 5, Exp 6) and under a randomized grouping (Exp 1, Exp 4, Exp 5, Exp 2, Exp 3, Exp 6), each split into Group A and Group B]

19 iv) Rank the permuted d-values of the genes in ascending order.
v) Repeat steps iii) and iv) many times, so that each gene has many randomized d-values corresponding to its rank from the observed (unpermuted) d-value. Take the average of the randomized d-values for each gene. This is the expected d-value of that gene.
vi) Plot the observed d-values vs. the expected d-values.
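Steps i) through v) can be sketched as follows. This is a simplified illustration, not the SAM software: the d-value is assumed to be (mean of B minus mean of A) divided by a pooled standard error plus a small constant s0, and s0 is taken as a fixed hypothetical parameter here rather than estimated from the data.

```python
import math
import random

def d_value(values, group_a, group_b, s0=0.1):
    """t-like statistic for one gene; s0 is an assumed fixed fudge factor."""
    a = [values[i] for i in group_a]
    b = [values[i] for i in group_b]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    ss = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    s = math.sqrt((1 / len(a) + 1 / len(b)) * ss / (len(a) + len(b) - 2))
    return (mean_b - mean_a) / (s + s0)

def sam_observed_expected(matrix, group_a, group_b, n_perm=200, seed=0):
    """Return (observed, expected) d-values, both sorted in ascending rank order."""
    rng = random.Random(seed)
    observed = sorted(d_value(row, group_a, group_b) for row in matrix)
    columns = list(group_a) + list(group_b)
    totals = [0.0] * len(matrix)
    for _ in range(n_perm):
        rng.shuffle(columns)  # reshuffle experiments, keeping group sizes fixed
        perm_a, perm_b = columns[:len(group_a)], columns[len(group_a):]
        permuted = sorted(d_value(row, perm_a, perm_b) for row in matrix)
        for rank, d in enumerate(permuted):
            totals[rank] += d  # accumulate by rank, not by gene identity
    expected = [t / n_perm for t in totals]
    return observed, expected

# Rows are genes, columns are Experiments 1-6 (0-based indices below).
matrix = [
    [1.0, 1.2, 3.1, 2.9, 0.9, 3.0],
    [2.0, 2.1, 1.9, 2.0, 2.2, 2.1],
    [5.0, 4.8, 1.0, 1.2, 5.1, 0.9],
]
observed, expected = sam_observed_expected(matrix, group_a=[0, 1, 4], group_b=[2, 3, 5])
print(list(zip(observed, expected)))
```

Plotting `observed` against `expected` gives the kind of plot described in step vi). Genes far from the diagonal are the candidates for significance.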

20 SAM Two-Class Unpaired
[Plot: observed d-values vs. expected d-values, with the "observed d = expected d" line]
- Significant positive genes: mean expression of group B > mean expression of group A.
- Significant negative genes: mean expression of group A > mean expression of group B.
- The more a gene deviates from the "observed = expected" line, the more likely it is to be significant.
- Any gene beyond the first gene in the +ve or -ve direction on the x-axis (including the first gene) whose observed d-value exceeds the expected d-value by at least delta is considered significant.

21 GenePattern http://genepattern.broadinstitute.org/

22 AutoSOME http://jimcooperlab.mcdb.ucsb.edu/autosome/ Aaron Newman and James Cooper, BMC Bioinformatics, 2010, 11:117 Aaron Newman

23 Gene Set Analysis
Your Gene Set: compute enrichment in pathways and networks
Examples: Cell Cycle, Transcription factor, TGF-beta Signaling Pathway, Wnt-signaling Pathway, Protein-protein interaction network
Tools: GSEA, DAVID, Toppfun, MSigDB, and STRING
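One simple way to "compute enrichment" of a gene set in a pathway is a hypergeometric over-representation test, sketched below. This is a generic illustration and not the specific statistic used by any of the tools named on the slide (GSEA, for instance, uses a rank-based score). The function and parameter names are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(universe_size, pathway_size, gene_set_size, overlap):
    """P(at least `overlap` pathway genes in a random gene set of the same size)."""
    max_k = min(pathway_size, gene_set_size)
    total = comb(universe_size, gene_set_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, gene_set_size - k)
        for k in range(overlap, max_k + 1)
    )
    return tail / total

# Hypothetical numbers: 20,000 genes, a 100-gene pathway, a 50-gene list, 5 genes shared.
p = hypergeom_enrichment_p(20000, 100, 50, 5)
print(p)
```

A small p-value suggests that the overlap is larger than expected by chance. When many pathways are tested, the p-values need a multiple-testing correction.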

24 BOOLEAN ANALYSIS

25 Boolean Implication [Sahoo et al. Genome Biology 08]
- Analyze pairs of genes.
- Analyze the four different quadrants.
- Identify sparse quadrants.
- Record the Boolean relationships.
  - If ACPP high, then GABRB1 low
  - If GABRB1 high, then ACPP low
[Scatter plot: ACPP vs. GABRB1, 45,000 Affymetrix microarrays]

26 Threshold Calculation [Sahoo et al. 07]
- A threshold is determined for each gene.
- The arrays are sorted by gene expression.
- StepMiner is used to determine the threshold.
[Plot: CDH expression across sorted arrays, with the threshold separating High, Intermediate, and Low]
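The idea of fitting a step to sorted expression values can be sketched as below. This is a simplified stand-in for StepMiner, not its actual implementation: it tries every split of the sorted values, keeps the one with the smallest squared error, and places the threshold midway between the two fitted levels. The `margin` that defines the intermediate band is a hypothetical parameter chosen for illustration.

```python
def step_threshold(values):
    """Fit a one-step function to the sorted values (needs at least two values)."""
    xs = sorted(values)
    n = len(xs)
    best = None
    for k in range(1, n):
        low, high = xs[:k], xs[k:]
        m_low, m_high = sum(low) / k, sum(high) / (n - k)
        sse = sum((x - m_low) ** 2 for x in low) + sum((x - m_high) ** 2 for x in high)
        if best is None or sse < best[0]:
            best = (sse, (m_low + m_high) / 2)  # threshold midway between the levels
    return best[1]

def classify(value, threshold, margin=0.5):
    """Label a value as high, low, or intermediate (margin is an assumed band width)."""
    if value > threshold + margin:
        return "high"
    if value < threshold - margin:
        return "low"
    return "intermediate"

expression = [1.1, 0.9, 1.0, 9.2, 8.8, 9.0]
t = step_threshold(expression)
print(t, [classify(v, t) for v in expression])
```

Values inside the intermediate band are typically left out when counting quadrants, so that measurement noise near the threshold does not create spurious high or low calls.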

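27 BooleanNet Statistics [Sahoo et al. Genome Biology 08]
Quadrant counts for genes A and B: $a_{00}$, $a_{01}$, $a_{10}$, $a_{11}$
$$n_{A\,\text{low}} = a_{00} + a_{01}, \quad n_{B\,\text{low}} = a_{00} + a_{10}$$
$$\text{total} = a_{00} + a_{01} + a_{10} + a_{11}, \quad \text{observed} = a_{00}$$
$$\text{expected} = \left(\frac{n_{A\,\text{low}}}{\text{total}} \cdot \frac{n_{B\,\text{low}}}{\text{total}}\right) \cdot \text{total}$$
$$\text{statistic} = \frac{\text{expected} - \text{observed}}{\sqrt{\text{expected}}}$$
$$\text{error rate} = \frac{1}{2}\left(\frac{a_{00}}{a_{00} + a_{01}} + \frac{a_{00}}{a_{00} + a_{10}}\right)$$
Boolean Implication = (statistic > 3, error rate < 0.1)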
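The sparsity test for the low-low quadrant on this slide translates directly into code. The sketch below follows the slide's formulas and cutoffs; the function names are hypothetical, and the reading of the indices (first index for gene A, second for gene B, 0 meaning low) is inferred from the definition nA low = a00 + a01.

```python
import math

def low_low_statistics(a00, a01, a10, a11):
    """Return (statistic, error_rate) for the A-low / B-low quadrant.

    Assumes the A-low row and the B-low column are both non-empty,
    otherwise the divisions below are undefined.
    """
    total = a00 + a01 + a10 + a11
    n_a_low = a00 + a01
    n_b_low = a00 + a10
    expected = (n_a_low / total) * (n_b_low / total) * total
    observed = a00
    statistic = (expected - observed) / math.sqrt(expected)
    error_rate = 0.5 * (a00 / (a00 + a01) + a00 / (a00 + a10))
    return statistic, error_rate

def is_sparse(a00, a01, a10, a11, stat_cutoff=3.0, error_cutoff=0.1):
    """Apply the slide's cutoffs: statistic > 3 and error rate < 0.1."""
    statistic, error_rate = low_low_statistics(a00, a01, a10, a11)
    return statistic > stat_cutoff and error_rate < error_cutoff

print(low_low_statistics(2, 100, 100, 100))
print(is_sparse(2, 100, 100, 100), is_sparse(100, 100, 100, 100))
```

The same computation can be repeated for each of the other three quadrants by relabeling the counts; a sparse quadrant is what gives rise to a Boolean implication between the two genes.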

28 Six Boolean Implications [Sahoo et al. Genome Biology 08]

29 MiDReG Algorithm [Sahoo et al. PNAS 2010] MiDReG = (Mining Developmentally Regulated Genes)

30 MiDReG Algorithm [Sahoo et al. PNAS 2010] MiDReG = (Mining Developmentally Regulated Genes)

31 MiDReG Algorithm [Sahoo et al. PNAS 2010] MiDReG = (Mining Developmentally Regulated Genes)

32 B Cell Genes [Sahoo et al. PNAS 2010] CD19 KIT Boolean Implications

33 Jun Seita [Seita, Sahoo et al. PLoS ONE, 2012] http://gexc.stanford.edu

34 SEQUENCING DATA ANALYSIS

35 Sequencing Data Format
FASTQ:
@HWI-EAS209:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNT
+HWI-EAS209:5:58:5894:21141#ATCACG/1
efcfffffcfeefffcffffffddf`feed]`]_Ba
FASTA:
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
FASTQ quality encodings:
S - Sanger: Phred+33, range (0, 40)
X - Solexa: Solexa+64, range (-5, 40)
I - Illumina 1.3+: Phred+64, range (0, 40)
J - Illumina 1.5+: Phred+64, range (3, 40)
L - Illumina 1.8+: Phred+33, range (0, 41)
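The quality line of a FASTQ record encodes one score per base as an ASCII character, and the offset (33 or 64) depends on the encoding. The sketch below decodes the example record from this slide. The helper names are hypothetical, and treating this record as Phred+64 is an inference from its characters (all at or above 'B'), not something the slide states.

```python
def parse_fastq_record(lines):
    """Split a 4-line FASTQ record into (read id, sequence, quality string)."""
    header, sequence, separator, quality = [line.strip() for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], sequence, quality

def phred_scores(quality, offset):
    """Convert quality characters to integer scores for a given ASCII offset."""
    return [ord(ch) - offset for ch in quality]

def error_probability(q):
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

record = [
    "@HWI-EAS209:5:58:5894:21141#ATCACG/1",
    "TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNT",
    "+HWI-EAS209:5:58:5894:21141#ATCACG/1",
    "efcfffffcfeefffcffffffddf`feed]`]_Ba",
]
read_id, seq, qual = parse_fastq_record(record)
scores = phred_scores(qual, offset=64)  # assumed Phred+64 for this record
print(read_id, len(seq), min(scores), max(scores))
```

Decoding the same string with offset 33 would give scores well above 41, outside every range listed on the slide, which is a quick sanity check when the encoding of a file is unknown.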

36 Mapping

37 Mapping Software
Long reads
- BLAST, HMMER, SSEARCH
Short reads
- BLAT
- Bowtie, BWA, Partek, SOAP, Tophat, Olego, BarraCUDA

38 Visualizations

39 UCSC Genome Browser, GenoViewer, Samtools tview, MaqView, rtracklayer, BamView, gbrowse2, Integrative Genomics Viewer (IGV)

40 Quantification
Peak calling: QuEST, MACS, PeakSeq, T-PIC, SIPeS, GLITR, SICER, SiSSRs, OMT
Expression quantification: Cufflinks, NEUMA, RSEM, ABySS, ERANGE, RSAT, Velvet, MISO, RSEQ
SNP calling: samtools, VarScan, GATK, SOAP2, realSFS, Beagle, QCall, MaCH

41 Peak Discovery [Pepke et al. Nature Methods 2009]

42 Transcript Quantification [Pepke et al. Nature Methods 2009] RPKM, FPKM
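RPKM (reads per kilobase of transcript per million mapped reads) normalizes a raw read count by both transcript length and sequencing depth. FPKM is the same calculation with fragments (read pairs) counted in place of individual reads. The sketch below uses the standard definition; the function and argument names are hypothetical and the numbers are made up.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)

# Hypothetical example: 500 reads on a 2,000 bp transcript, 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
```

The factor of 1e9 combines the two scalings: 1e3 converts base pairs to kilobases and 1e6 converts reads to millions of reads.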

43 SNP Calling

44 Typical RNA-seq Workflow [Trapnell et al. Nature Biotech 2010]


