Published by Janet Worrick. Modified over 9 years ago.
1
Biological question: differentially expressed genes, sample class prediction, etc.
Experimental design (Churchill, March 15)
Microarray experiment (Bult, Lecture 5)
Image analysis (Bult, Lecture 6)
Normalization (Hibbs, Lectures 10 and 11)
Estimation, Testing, Clustering, Discrimination
Biological verification and interpretation (Blake, Lectures 16 and 17)
2
Project Steps
Find and download array data
Normalize array data
Analyze data, i.e., generate gene lists
  Differentially expressed genes, genes in clusters, etc.
Interpret gene lists
  Use the annotations of genes in your lists
  Gene Ontology terms are available for many organisms, but not all
3
Getting the Data
Search GEO (or another repository) for a data set of interest.
Download the data files, e.g., Affy .CEL files, Affy .CDF files, etc.
Upload them to your home directory.
4
Normalize the Data
I sent you all a script (2/23/2012) to RMA-normalize the Ackerman array data, which are available from my home directory.
5
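library(affy)        # Bioconductor package for Affymetrix array data
library(makecdfenv)  # builds a CDF environment from a .cdf file
# Build the probe-layout (CDF) environment for the array
Array.CDF = make.cdf.env("MoGene-1_0-st-v1.cdf")
# Read the .CEL files in the working directory
CELData = ReadAffy()
# RMA-normalize and summarize the probe-level data
rma.CELData = rma(CELData)
# Extract the expression values and add a ProbeID column
rma.expr = exprs(rma.CELData)
rma.expr.df = data.frame(ProbeID=row.names(rma.expr), rma.expr)
# Write a tab-delimited file without row names or quotes
write.table(rma.expr.df, "rma.expr.dat", sep="\t", row.names=FALSE, quote=FALSE)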
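RMA combines background correction, quantile normalization, and probeset summarization. To give a feel for the normalization step only, here is a minimal base-R sketch of quantile normalization on a made-up 4 x 3 matrix. The function name `quantile_normalize` and the toy values are assumptions for illustration; this is not the code inside rma(), and it ignores refinements such as averaging tied values.

```r
# Quantile normalization sketch: force every array (column) to share
# the same distribution of values.
quantile_normalize <- function(x) {
  # Rank of each value within its own column (ties broken by position)
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Sort each column, then average across columns at each rank
  sorted <- apply(x, 2, sort)
  target <- unname(rowMeans(sorted))
  # Replace each value by the average value at its rank
  normalized <- apply(ranks, 2, function(r) target[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Made-up intensities: 4 probes (rows) on 3 arrays (columns)
toy <- matrix(c(5, 2, 3, 4,
                4, 1, 4, 2,
                3, 4, 6, 8),
              nrow = 4,
              dimnames = list(paste0("probe", 1:4),
                              c("array1", "array2", "array3")))

qn <- quantile_normalize(toy)
print(qn)
```

After normalization, sorting any column of `qn` gives the same set of values, which is why the normalized boxplots line up.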
6
What is a library?
What does the ReadAffy() function do?
What are possible arguments for the ReadAffy() function?
What class of R object is rma.CELData?
What class of R object is rma.expr?
What class of R object is rma.expr.df?
7
slotNames(CELData)
phenoData(CELData)
8
This is what rma.expr.df looks like in Excel……
9
Plotting summarized probeset intensities across the Ackerman arrays (non-normalized)
jpeg("boxplot.jpeg")
boxplot(CELData, names=CELData$sample, col="blue")
dev.off()
10
Plotting summarized probeset intensities across the Ackerman arrays (normalized)
mydata = rma.expr.df
jpeg("normal_boxplot.jpg")
boxplot(mydata[-1], main="Normalized Intensities", xlab="Array", ylab="Intensities", col="blue")
dev.off()
11
Next time
Posted articles from Gary Churchill. If you only read one article, read Churchill 2004.
See also Gary’s web site: look at Sample Data and Tutorial.
After that lecture we will begin analysis of microarray data (MAANOVA).
13
Figure: sequencing cost (cost per Kb) and throughput (gigabases).
Lucinda Fulton, The Genome Center at Washington University
14
Sequencing Technologies
15
Sequence “Space”
Roche 454: Flow space
  Measures pyrophosphate released by a nucleotide when it is added to a growing DNA chain
  Flow space describes sequence in terms of these base incorporations
AB SOLiD: Color space
  Sequencing by DNA ligation via synthetic DNA molecules that contain two nested known bases with a fluorescent dye
  Each base is sequenced twice
Illumina/Solexa: Base space
  Single-base extensions of fluorescent-labeled nucleotides with protected 3' OH groups
  Sequencing via cycles of base addition/detection followed by deprotection of the 3' OH
GenomeTV: Next Generation Sequencing (lecture)
16
“Standard” File formats
Sequence containers: FASTA, FASTQ, BAM/SAM
Alignments: BAM/SAM, MAF
Annotation: BED, GFF/GTF/GFF3, WIG
Variation: VCF, GVF
17
Tools
Alignments: BLAST (not for NGS), BWA, Bowtie, Maq, …
Transcriptomics: TopHat, Cufflinks, …
Variant calling: ssahaSNP, Mosaic, …
Counting (ChIP-Seq, etc.): FindPeaks, PeakSeq
18
FASTQ: Data Format
Text based
Encodes sequence calls and quality scores with ASCII characters
Stores minimal information about the sequence read
4 lines per sequence
  Line 1: begins with "@", followed by a sequence identifier and an optional description
  Line 2: the sequence
  Line 3: begins with "+", optionally followed by the sequence identifier and description
  Line 4: encoding of quality scores for the sequence in line 2
References/Documentation
  Cock et al. (2009). Nuc Acids Res 38:
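The four-line layout can be checked with a few lines of base R. This is a minimal sketch that assumes unwrapped records (exactly four lines each); the function name `parse_fastq` and the two example reads are made up for illustration. Records are split by position rather than by searching for "@", because "@" is also a legal quality character.

```r
# Split a character vector of FASTQ lines into one row per read.
parse_fastq <- function(lines) {
  stopifnot(length(lines) > 0, length(lines) %% 4 == 0)
  first <- seq(1, length(lines), by = 4)   # index of line 1 of each record
  header   <- lines[first]
  sequence <- lines[first + 1]
  plus     <- lines[first + 2]
  quality  <- lines[first + 3]
  # Basic format checks taken from the four-line description
  stopifnot(all(startsWith(header, "@")),
            all(startsWith(plus, "+")),
            all(nchar(sequence) == nchar(quality)))
  data.frame(id = sub("^@", "", header),
             sequence = sequence,
             quality = quality,
             stringsAsFactors = FALSE)
}

# Two made-up reads
fastq_lines <- c("@read1 example description", "GATTACA", "+", "IIIIIII",
                 "@read2",                     "ACGT",    "+read2", "!!II")

reads <- parse_fastq(fastq_lines)
print(reads)
```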
19
FASTQ Example
For analysis, it may be necessary to convert to the Sanger form of FASTQ. For example, Illumina stores quality scores ranging from 0 to 62; Sanger quality scores range from 0 to 93. Solexa quality scores have to be converted to PHRED quality scores.
FASTQ example from: Cock et al. (2009). Nuc Acids Res 38:
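The conversions mentioned here can be sketched in base R. The helper names below are made up for illustration. The sketch assumes the encodings described by Cock et al.: Sanger FASTQ stores PHRED scores with an ASCII offset of 33, Illumina 1.3+ stores PHRED scores with an offset of 64, and Solexa scores map to PHRED scores as Q_PHRED = 10 log10(10^(Q_Solexa/10) + 1).

```r
# Decode an ASCII quality string into integer scores for a given offset
decode_quality <- function(qual, offset) {
  utf8ToInt(qual) - offset
}

# Encode integer scores back into an ASCII quality string
encode_quality <- function(scores, offset) {
  intToUtf8(scores + offset)
}

# Illumina 1.3+ (offset 64) to Sanger (offset 33): same PHRED scores,
# different ASCII characters
illumina_to_sanger <- function(qual) {
  encode_quality(decode_quality(qual, 64L), 33L)
}

# Solexa score to PHRED score, rounded to the nearest integer
solexa_to_phred <- function(q_solexa) {
  round(10 * log10(10^(q_solexa / 10) + 1))
}

sanger_qual <- illumina_to_sanger("hhh")   # three bases with PHRED score 40
phred_low   <- solexa_to_phred(-5)         # lowest Solexa score
phred_high  <- solexa_to_phred(40)
print(sanger_qual)
print(c(phred_low, phred_high))
```

The two scales differ most at low quality and agree closely at high quality, which is why the Solexa conversion matters mainly for poor bases.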
20
SAM (Sequence Alignment/Map)
It may not be necessary to align reads from scratch; you can instead use existing alignments in SAM format
SAM is the output of aligners that map reads to a reference genome
Tab delimited, with a header section and an alignment section
Header lines begin with "@" (the header section is optional)
Alignment section has 11 mandatory fields
BAM is the binary format of SAM
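As a rough illustration of the layout just described, the base-R sketch below separates header lines from alignment lines and labels the 11 mandatory fields. The field names follow the current SAM specification (older versions used different names for fields 7 to 9); the function name `parse_sam` and the example record are illustrative assumptions, and the sketch expects at least one alignment line.

```r
# Names of the 11 mandatory alignment fields, in order
sam_fields <- c("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
                "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")

parse_sam <- function(lines) {
  is_header <- startsWith(lines, "@")          # header lines begin with "@"
  body <- lines[!is_header]
  parts <- strsplit(body, "\t", fixed = TRUE)  # fields are tab delimited
  stopifnot(length(parts) > 0, all(lengths(parts) >= 11))
  # Keep the mandatory fields; optional tags (field 12 onward) are dropped
  mandatory <- do.call(rbind, lapply(parts, function(p) p[1:11]))
  aln <- as.data.frame(mandatory, stringsAsFactors = FALSE)
  names(aln) <- sam_fields
  for (col in c("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")) {
    aln[[col]] <- as.integer(aln[[col]])
  }
  list(header = lines[is_header], alignments = aln)
}

# Illustrative input: two header lines and one alignment line
sam_lines <- c(
  "@HD\tVN:1.6\tSO:coordinate",
  "@SQ\tSN:ref\tLN:45",
  "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"
)

sam <- parse_sam(sam_lines)
print(sam$alignments[, c("QNAME", "RNAME", "POS", "CIGAR")])
```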
21
Mandatory Alignment Fields
22
Alignment Examples Alignments in SAM format
23
Valid BED files
Two example BED listings (coordinates not recoverable): one names its features nsv433165 through nsv433171, the other names them by position (chr1:…).
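For reference, a BED line needs at least three tab-separated columns (chrom, chromStart, chromEnd), with an optional name in column 4; chromStart is 0-based and chromEnd is not included in the feature. The base-R sketch below reads and checks such lines. The function name `read_bed`, the coordinates, and the feature names are made up for illustration and are not the values from the slide.

```r
# Made-up BED lines: chrom, chromStart, chromEnd, name
bed_text <- c(
  "chr1\t1000\t5000\tfeatureA",
  "chr1\t7000\t9000\tfeatureB"
)

read_bed <- function(lines) {
  bed <- read.table(text = lines, sep = "\t", header = FALSE,
                    stringsAsFactors = FALSE)
  stopifnot(ncol(bed) >= 3)                    # three required columns
  names(bed)[1:3] <- c("chrom", "chromStart", "chromEnd")
  if (ncol(bed) >= 4) names(bed)[4] <- "name"
  # Coordinates must be numeric, non-negative, and correctly ordered
  stopifnot(is.numeric(bed$chromStart), is.numeric(bed$chromEnd),
            all(bed$chromStart >= 0),
            all(bed$chromEnd >= bed$chromStart))
  # 0-based, half-open coordinates: length is simply end minus start
  bed$length <- bed$chromEnd - bed$chromStart
  bed
}

bed <- read_bed(bed_text)
print(bed)
```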
24
Galaxy
http://main.g2.bx.psu.edu/
See Tutorial 1
Build and share data and analysis workflows
No programming experience required
Strong and growing development and user community
25
Galaxy interface panels: Tools, Dialog/Parameter Selection, History
26
Tutorial Web Site
Tutorial 5
27
RNA Seq Workflow
Convert data to FASTQ
Upload files to Galaxy
Quality control
  Throw out low-quality sequence reads, etc.
Map reads to a reference genome
  Many algorithms available
  Trade-off between speed and sensitivity
Data summarization
  Associating alignments with genome annotations
  Counts
Data visualization
Statistical analysis
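The data summarization step (associating alignments with annotations and counting) can be illustrated with a small base-R sketch. Everything here is a simplifying assumption: the gene intervals and read positions are made up, coordinates are treated as 1-based and inclusive, and a read is assigned by its start position only. Real counting tools handle read length, strand, and spliced alignments.

```r
# Made-up gene annotation
genes <- data.frame(gene  = c("geneA", "geneB"),
                    chrom = c("chr1", "chr1"),
                    start = c(100L, 500L),
                    end   = c(300L, 900L),
                    stringsAsFactors = FALSE)

# Made-up alignment start positions
reads <- data.frame(chrom = c("chr1", "chr1", "chr1", "chr2"),
                    pos   = c(150L, 290L, 600L, 150L),
                    stringsAsFactors = FALSE)

# Count the reads whose start falls inside each gene interval
count_reads <- function(genes, reads) {
  counts <- vapply(seq_len(nrow(genes)), function(i) {
    sum(reads$chrom == genes$chrom[i] &
        reads$pos >= genes$start[i] &
        reads$pos <= genes$end[i])
  }, integer(1))
  setNames(counts, genes$gene)
}

counts <- count_reads(genes, reads)
print(counts)
```

A table of counts like this, one row per gene and one column per sample, is the usual input to the statistical analysis step.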
28
Typical RNA-Seq Project Work Flow
Tissue sample -> Total RNA -> mRNA -> cDNA -> Sequencing -> FASTQ file -> QC -> TopHat -> Cufflinks -> Gene/Transcript/Exon expression -> Visualization -> Statistical analysis
JAX Computational Sciences Service
29
TopHat
http://tophat.cbcb.umd.edu/
TopHat is better suited to aligning RNA-Seq data than general-purpose aligners (Maq, BWA) because it takes splicing into account during the alignment process.
Figure from: Trapnell et al. (2010). Nature Biotechnology 28:
Trapnell et al. (2009). Bioinformatics 25:
30
TopHat is built on the Bowtie alignment algorithm.
The TopHat pipeline. RNA-Seq reads are mapped against the whole reference genome, and those reads that do not map are set aside as initially unmapped (IUM) reads. An initial consensus of mapped regions is computed by Maq. Sequences flanking potential donor/acceptor splice sites within neighboring regions are joined to form potential splice junctions. The IUM reads are indexed and aligned to these splice junction sequences.
Trapnell C et al. Bioinformatics 2009;25:
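The idea of joining flanking sequences into a candidate junction can be shown with a toy example in base R. This is only a sketch of the concept, not TopHat's implementation: the sequences, the splice-site positions, the flank length, and the function name `junction_sequence` are all made up, and exact string matching stands in for a real aligner.

```r
# Toy "genome": exon 1, an intron (starts with GT, ends with AG), exon 2
exon1  <- "ACGTTGCAAC"
intron <- "GTAAGTCCCCTTTTCAG"
exon2  <- "TTGACCATGG"
genome <- paste0(exon1, intron, exon2)

# Hypothetical candidate splice sites (1-based positions in `genome`)
donor_start  <- nchar(exon1) + 1              # first base of the intron
acceptor_end <- nchar(exon1) + nchar(intron)  # last base of the intron

# Join `flank` bases upstream of the donor site with `flank` bases
# downstream of the acceptor site
junction_sequence <- function(genome, donor_start, acceptor_end, flank) {
  left  <- substr(genome, donor_start - flank, donor_start - 1)
  right <- substr(genome, acceptor_end + 1, acceptor_end + flank)
  paste0(left, right)
}

junction <- junction_sequence(genome, donor_start, acceptor_end, flank = 6)

# A read that spans the exon-exon boundary does not match the genome
# directly, but it does match the joined junction sequence
read <- "CAACTTGA"
maps_to_genome   <- grepl(read, genome, fixed = TRUE)
maps_to_junction <- grepl(read, junction, fixed = TRUE)
print(c(genome = maps_to_genome, junction = maps_to_junction))
```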
31
Cufflinks
Assembles transcripts,
Estimates their abundances, and
Tests for differential expression and regulation in RNA-Seq samples
Trapnell et al. (2010). Nature Biotechnology 28:
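Cufflinks reports abundance as FPKM (fragments per kilobase of transcript per million mapped fragments). The base-R sketch below shows only the arithmetic behind that unit, with made-up numbers and a made-up function name; Cufflinks itself estimates the fragment assignments with a statistical model, which this does not attempt.

```r
# FPKM = fragments * 1e9 / (transcript length in bp * total mapped fragments)
fpkm <- function(fragments, length_bp, total_mapped) {
  fragments * 1e9 / (length_bp * total_mapped)
}

# Made-up example: 500 fragments on a 2,000 bp transcript,
# in a library with 10 million mapped fragments
example_fpkm <- fpkm(fragments = 500, length_bp = 2000, total_mapped = 1e7)
print(example_fpkm)
```

Dividing by transcript length and library size makes values comparable between long and short transcripts and between samples sequenced to different depths.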