Download presentation
1
RNAseq analyses -- methods
June 7, 2015
2
Lecture Objectives Terminology for RNA-seq Steps in RNAseq analysis
Downstream interpretation of expression and differential estimates multiple testing, clustering, heatmaps, classification, pathway analysis, etc.
3
Steps in RNAseq workflow
Obtain raw data from sequencer Align/assemble reads Process alignment to quantify reads/gene feature Conduct differential analysis Summarize and visualize Create gene lists, prioritize candidates for validation, etc. Conduct gene enrichment or pathway analysis
4
Some nomenclature Sequencing file types: Fasta (sequence only) Fastq (contains quality scores) >SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT @SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
5
Alignment files SAM – Sequence Alignment/MAP coordinate file
Tab-delimited ASCII file with all of the information needed for the alignment of the reads to reference HWI-ST155_0544:7:2:7658:34048#0 16 Chr M * 0 0 GTAGAGGTAGGACCAACAAGGACCAAGTTTCCCTGTTCCAAC ghghhhhgcghh_hhhhhhhhhhhhhhhhghhhhhghhhhhh NM:i:0 NH:i:4 CC:Z:Chr10 CP:i: HWI-ST155_0544:7:61:11040:141129#0 0 Chr M * 0 0 CTTTTCTGGCGTAACTTGGTTCCCTTTAGTTTGGAACAGATA hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhghhchfhhhh NM:i:0 NH:i:3 CC:Z:Chr10 CP:i: HWI-ST155_0544:7:8:11214:130793#0 0 Chr M * 0 0 CAAATAGTGGTTTGAAACCTATCAATCAAGTCACTTTCAAGT gggggggggggggggggeggdffffggggggggggggdggfc NM:i:0 NH:i:3 CC:Z:Chr10 CP:i: BAM – Binary version of SAM Machine readable, smaller more compact version More commonly used by viewers and analysis programs
6
Reference guided RNAseq alignment
You have a sequenced genome and a file with the gene model annotations Genome reference file is usually a fasta file with each chromosome as a separate entry GTF = gene transcript file GFF = gene feature file (GFF3) The GTF/GFF chromosome name must match the chromosome name in the genome reference file In this case we are using the human hg19 genome build and the Refseq transcript file from
7
RNAseq alignment
8
Number of reads mapped in the sample
RPKM (FPKM) Reads (fragments) per Kilobase Per Million RPKM = raw number of reads exon length X 1,000,000 Number of reads mapped in the sample In RNA-Seq, the relative expression of a transcript is proportional to the number of cDNA fragments that originate from it. However: The number of fragments is biased towards larger genes Total number of fragments is related to total library depth Use of FPKM/RPKM normalizes for gene size and library depth
9
Alternatives to FPKM Raw read counts HTSeq (htseq-count)
Instead of calculating FPKM, simply assign reads/fragments to a defined set of genes/transcripts and determine “raw counts” HTSeq (htseq-count) Python code that converts aligned reads to counts You give it an alignment file and the associated transcript file and it will output a list of counts by feature FeatureCounts (part of Subread package) BAM file + GTF file => # reads/feature PGS also reports reads/feature We can try differential expression analysis with the RPKM and the reads file to explore the different outcomes
10
How to quantify reads per feature?
11
Counts vs FPKM(RPKM)? Count files can be analyzed by a number of different R programs EdgeR DESeq2 NOISeq More robust statistical methods for differential expression Accommodates more sophisticated experimental designs with appropriate statistical tests
12
Gene expression differences
Once you have your data in RPKM or counts format, you can use a variety of tools to compare the different conditions and determine what genes are differentially expressed Authors used Cufflinks/CuffDiff
13
How does cuffdiff work? More replicates = more confidence
The variability in fragment count for each gene across replicates is modeled. Fragment count for each isoform is estimated in each replicate along with a measure of uncertainty in this estimate arising from ambiguously mapped reads transcripts with more shared exons and few uniquely assigned fragments will have greater uncertainty The algorithm combines estimates of uncertainty and cross-replicate variability under a negative binomial model of fragment count variability to estimate count variances for each transcript in each library These variance estimates are used during statistical testing to report significantly differentially expressed genes and transcripts. More replicates = more confidence
14
Multiple approaches advisable
15
Multiple testing correction
As more attributes are compared, it becomes more likely that the treatment and control groups will appear to differ on at least one attribute by random chance alone. Well known from array studies 10,000s genes/transcripts 100,000s exons With RNA-seq, more of a problem than ever All the complexity of the transcriptome Almost infinite number of potential features Genes, transcripts, exons, juntions, retained introns, microRNAs, lncRNAs, etc, etc Bioconductor multtest
16
Lessons learned from microarray days
Hansen et al. “Sequencing Technology Does Not Eliminate Biological Variability.” Nature Biotechnology 29, no. 7 (2011): 572–573. Power analysis for RNA-seq experiments RNA-seq need for biological replicates RNA-seq study design
17
Today in computer lab Analyze the BAM files downloaded from Google drive Do QA/Qc Do mRNA quantification Do comparisons between groups The analysis will take several hours…plan accordingly
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.