Learning to count: quantifying signal

The two steps of abundance estimation:
1) Read counting
2) Count normalization

However, this process is non-trivial. Several features of transcription can complicate it:
- 5’ and 3’ UTR boundaries
- Alternative splicing
- Cryptic exons and TSSs (transcription start sites)

Read counting and count normalizing:
The two steps of abundance estimation, in more detail.

Read counting:
- Requires that the reads have been confidently mapped to locations in the genome.
- Requires prior knowledge of well-defined regions over which one wants to count, i.e. features (exons, genes, isoforms, promoters, etc.). We call this information the genome “annotation.”

Count normalizing:
- There are numerous methods of normalizing, all of which have trade-offs.
- Normalization is critical for drawing any biologically meaningful conclusions from one’s data.

Summarizing read coverage (basics):
The goal: to determine the expression level of a particular genomic feature (gene/exon/transcript), using a set of reads that have been mapped to a reference genome.

The (simplest) answer: define the boundaries of your genomic feature and count the total number of reads that overlap that region.

[Figure: reads aligned along a genome track that contains one genomic feature. The reads in green map to genomic locations that overlap the genomic feature, while those in red do not.]
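The simplest answer above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular counting tool is implemented: the coordinates are made up, reads and the feature are assumed to lie on the same chromosome, and intervals are treated as closed and 1-based.

```python
def overlaps(read_start, read_end, feat_start, feat_end):
    """Return True if two closed intervals share at least one base."""
    return read_start <= feat_end and feat_start <= read_end


def count_overlapping_reads(reads, feature):
    """Count reads (start, end) that overlap a single feature (start, end)."""
    feat_start, feat_end = feature
    return sum(overlaps(start, end, feat_start, feat_end) for start, end in reads)


# Hypothetical mapped reads and one hypothetical gene, all on one chromosome.
reads = [(90, 139), (120, 169), (480, 529), (200, 249), (600, 649)]
gene = (100, 500)

n_reads = count_overlapping_reads(reads, gene)
print(n_reads)
```

Only the last read starts beyond the end of the gene, so it is the only one left out of the count.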

Comparing read coverage (expression):
Gene 1 coverage: 100 reads
Gene 2 coverage: 250 reads

Is the expression of Gene 1 < expression of Gene 2? Raw counts alone cannot answer this, because the number of reads is (roughly) proportional to:
- the length of the gene
- the total number of reads in the library
- AND the expression level of the gene

Reads versus Fragments:
Fragments are pieces of cDNA generated from your original RNA sample; they are a direct reflection of the biological expression of your sample.

Reads are the sequences of bases read from a fragment and recorded in your fastq file.
- For single-end data there is one read per fragment.
- For paired-end data there are two reads per fragment.

[Figure: a cDNA fragment with Read 1 marked.]

We want our summarization of coverage to reflect the number of fragments present in our sample. This means that, for paired-end data, the two reads (forward and reverse) come from the same fragment and should be counted once, not twice.
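A small sketch of the difference between counting reads and counting fragments. The fragment IDs and the (fragment_id, mate) layout are hypothetical stand-ins for the read names found in real alignment files.

```python
def count_fragments(alignments):
    """Count distinct fragments in a list of (fragment_id, mate) records.

    Both mates of a pair share the same fragment_id, so a set of IDs
    counts each fragment once regardless of how many mates were mapped.
    """
    return len({fragment_id for fragment_id, _mate in alignments})


# Hypothetical paired-end alignments: frag3 has only one mapped mate.
paired = [
    ("frag1", 1), ("frag1", 2),
    ("frag2", 1), ("frag2", 2),
    ("frag3", 1),
]

n_reads = len(paired)                  # every mate counted separately
n_fragments = count_fragments(paired)  # each fragment counted once
print(n_reads, n_fragments)
```

Whether a fragment with only one mapped mate (like frag3 here) should be counted at all is a separate policy decision; this sketch counts it.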

Basic schema for normalizing counts:
Reads Per Million mapped (RPM/CPM):
The counts for each feature are divided by the total number of mapped reads, in millions. This adjusts for differences in sequencing depth.

$$\mathrm{RPM} = \frac{\text{counts per feature}}{\text{total number of mapped reads}} \times 10^{6}$$

Reads Per Kilobase per Million mapped (RPKM/FPKM):
1) Count the total number of reads in a feature.
2) Divide by the total number of millions of reads mapped in the sample (this gives RPM).
3) Divide the RPM value by the length of the gene/feature, in kilobases.
Can be used to compare different genes within the same sample.

Transcripts Per Million mapped (TPM):
1) Count the total number of reads in a feature.
2) Divide the count by the length of the gene/feature, in kilobases. This results in reads per kilobase (RPK).
3) Sum up all RPK values within the sample and divide by 1,000,000. This is your “per million” scaling factor.
4) Divide each RPK value by the “per million” scaling factor. This gives you TPM for each feature.
The sum of TPMs is the same in every sample (one million), which makes TPM better suited to comparing genes between different samples.
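The three recipes above can be written out as a short Python sketch. The gene names, gene lengths and library size below are hypothetical; the read counts reuse the 100 and 250 from the earlier Gene 1 / Gene 2 comparison.

```python
def rpm(counts, total_mapped_reads):
    """Reads per million: count divided by millions of mapped reads."""
    millions = total_mapped_reads / 1e6
    return {gene: c / millions for gene, c in counts.items()}


def rpkm(counts, lengths_bp, total_mapped_reads):
    """RPKM: RPM divided by feature length in kilobases."""
    per_million = rpm(counts, total_mapped_reads)
    return {gene: per_million[gene] / (lengths_bp[gene] / 1e3) for gene in counts}


def tpm(counts, lengths_bp):
    """TPM: length-normalize first (RPK), then scale so the sample sums to 1e6."""
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1e3) for gene in counts}
    scaling = sum(rpk.values()) / 1e6  # the "per million" scaling factor
    return {gene: value / scaling for gene, value in rpk.items()}


# Hypothetical sample: two genes, assumed lengths, assumed library size.
counts = {"gene1": 100, "gene2": 250}
lengths_bp = {"gene1": 1000, "gene2": 2500}
total_mapped = 1_000_000

rpm_values = rpm(counts, total_mapped)
rpkm_values = rpkm(counts, lengths_bp, total_mapped)
tpm_values = tpm(counts, lengths_bp)
print(rpm_values, rpkm_values, tpm_values)
```

With these assumed lengths, gene2 has 2.5 times the reads of gene1 but is also 2.5 times longer, so both genes end up with the same RPKM and the same TPM. The TPM values always sum to one million, whatever the library size.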

Effect of RPKM normalization
[Figure: RPKM normalization. Panels: comparing low and high read count for the same gene; comparing low and high read count for different genes.]

Does transcript #4 actually have the same abundance as transcript #2 and greater abundance than transcript #3?

Gene-wise vs. Isoform abundance
However, these simple methods of summarizing coverage don’t apply if we’re interested in the abundance of gene isoforms (i.e., transcripts).

[Figure: three isoforms assembled from Exon 1, Exon 2 and Exon 3. Isoforms 1, 2 and 3 have abundances $x_1$, $x_2$, $x_3$; the exons have lengths $l_1$, $l_2$, $l_3$ and read counts $n_1$, $n_2$, $n_3$.]

RPKM/TPM combine the read coverage across all exons and thus are not sensitive to the contributions of the various isoforms. In other words, for a given read, it is not clear from which isoform it originated.

One method for read summarizing – featureCounts()
featureCounts is a counting method that performs this kind of gene-level summarizing of reads. Its output is then used in differential expression analysis (e.g. as input to DESeq, DESeq2, edgeR, etc.).

featureCounts is part of the Rsubread package for R, available from the Bioconductor repository.

featureCounts input:
- SAM or BAM file(s) containing mapped and sorted reads
- Annotation file containing features and meta-features (GTF or SAF)

featureCounts output:
- A count table (an integer matrix in R) with rows corresponding to features and columns corresponding to samples

Defining “feature” and “overlap”:
Features and meta-features:
- featureCounts performs read summarization at the feature level or the meta-feature level.
- A feature is a continuous region in the genome, such as an exon.
- A meta-feature is an aggregation of one or more features, such as a gene or transcript.
- Features and meta-features must be provided to featureCounts (in the form of a GTF or SAF file).

Overlap between reads and features:
- A read is said to overlap with a feature if there is at least 1 base of overlap between them.
- A read is said to overlap with a meta-feature if it overlaps with at least one of its features.

Multi-overlapping:
- A read is a multi-overlapping read if it overlaps with more than one feature when summarization is performed at the feature level, or if it overlaps with more than one meta-feature when summarization is performed at the meta-feature level.
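The definitions above can be made concrete with a toy annotation. This is an illustrative Python sketch of the definitions only, not the featureCounts implementation; the gene IDs, exon coordinates and reads are invented, and everything is assumed to be on one chromosome.

```python
# Hypothetical annotation: one row per exon (feature), tagged with its gene
# (meta-feature). Coordinates are closed intervals on a single chromosome.
exons = [
    ("geneA", 100, 200),
    ("geneA", 300, 400),
    ("geneB", 380, 500),
]


def overlapping_features(read, features):
    """Indices of features sharing at least 1 base with the read."""
    start, end = read
    return [i for i, (_gene, fs, fe) in enumerate(features)
            if start <= fe and fs <= end]


def overlapping_meta_features(read, features):
    """Genes with at least one feature overlapping the read."""
    return sorted({features[i][0] for i in overlapping_features(read, features)})


read_1 = (150, 199)  # touches one exon of geneA
read_2 = (190, 310)  # touches both geneA exons
read_3 = (390, 410)  # touches the second geneA exon and the geneB exon

for read in (read_1, read_2, read_3):
    print(read, overlapping_features(read, exons),
          overlapping_meta_features(read, exons))
```

Here read_2 is multi-overlapping at the feature level (two exons) but not at the meta-feature level (one gene), while read_3 is multi-overlapping at both levels.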

featureCounts algorithm:
- Features are sorted by their 5’ coordinate and arranged into a 2-level hierarchy: bins and blocks.
- The reference sequence is divided into non-overlapping 128kb bins.
- Within each bin, equal numbers of consecutive features are grouped into blocks.
- The number of blocks in a bin is the square root of the number of features in the bin. Thus, $\#\,\text{features in block} \approx \#\,\text{blocks in bin}$, which is optimal for hierarchical search.
- Reads are first assigned to their bins; then, within each bin, the reads are assigned to their blocks.
- Finally, the reads in each block are assigned to any feature with which they overlap.
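To make the bin/block idea concrete, here is a rough Python sketch of a two-level search built from the description above. It is not the featureCounts source code. In particular, how features that cross a bin boundary are handled, how the square root is rounded, and all names (build_index, find_overlaps) are assumptions made for illustration.

```python
import math
from collections import defaultdict

BIN_SIZE = 128_000  # 128kb bins, as described above


def build_index(features):
    """Index (start, end, name) features on one chromosome into bins and blocks."""
    bins = defaultdict(list)
    for feat in sorted(features):  # sorted by start coordinate
        start, end, _name = feat
        # Assumption: a feature is listed in every bin it touches.
        for b in range(start // BIN_SIZE, end // BIN_SIZE + 1):
            bins[b].append(feat)

    index = {}
    for b, feats in bins.items():
        n_blocks = max(1, round(math.sqrt(len(feats))))
        block_size = math.ceil(len(feats) / n_blocks)
        blocks = []
        for i in range(0, len(feats), block_size):
            chunk = feats[i:i + block_size]
            # Store each block's span so whole blocks can be skipped quickly.
            span = (min(f[0] for f in chunk), max(f[1] for f in chunk))
            blocks.append((span[0], span[1], chunk))
        index[b] = blocks
    return index


def find_overlaps(index, read_start, read_end):
    """Names of features overlapping a read: bin first, then block, then feature."""
    hits = set()
    for b in range(read_start // BIN_SIZE, read_end // BIN_SIZE + 1):
        for blk_start, blk_end, chunk in index.get(b, []):
            if read_start <= blk_end and blk_start <= read_end:
                for fs, fe, name in chunk:
                    if read_start <= fe and fs <= read_end:
                        hits.add(name)
    return sorted(hits)


# Hypothetical features: nine exons in the first bin, one in the second.
features = [(i * 1000 + 1, i * 1000 + 500, f"exon{i}") for i in range(9)]
features.append((130_000, 130_500, "exon_far"))
index = build_index(features)

print(find_overlaps(index, 1400, 2100))
print(find_overlaps(index, 129_990, 130_010))
```

With nine features in a bin, the index holds three blocks of three features each, so a read is compared against at most three block spans and then only the features inside the matching blocks.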

featureCounts inputs:
As inputs, featureCounts requires, at minimum, 1) an annotation file (either GTF or SAF) containing information about your features and 2) a list of SAM or BAM files to summarize.

Gene Transfer Format (GTF) file:
- Refinement of the General Feature Format (GFF) file.
- File format: <seq name> <source> <feature> <start> <end> <score> <strand> <frame> [attributes]

Simplified Annotation Format (SAF) file:
- Similar to the BED file format.
- Required fields (5): <GeneID> <Chr> <Start> <End> <Strand>
- Contains a header line: “GeneID Chr Start End Strand”
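As an illustration of the SAF layout, the following Python sketch reads a tiny tab-delimited SAF table and groups its rows by GeneID. The gene IDs and coordinates are invented, and parse_saf is a hypothetical helper written for this example, not part of Rsubread.

```python
import csv
import io
from collections import defaultdict

# A made-up SAF annotation: header line plus one row per feature (exon).
SAF_TEXT = (
    "GeneID\tChr\tStart\tEnd\tStrand\n"
    "geneA\tchr1\t100\t200\t+\n"
    "geneA\tchr1\t300\t400\t+\n"
    "geneB\tchr1\t380\t500\t-\n"
)


def parse_saf(handle):
    """Parse tab-delimited SAF rows into dicts with integer coordinates."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "GeneID": row["GeneID"],
            "Chr": row["Chr"],
            "Start": int(row["Start"]),
            "End": int(row["End"]),
            "Strand": row["Strand"],
        })
    return rows


features = parse_saf(io.StringIO(SAF_TEXT))

# Rows sharing a GeneID form one meta-feature (gene).
meta_features = defaultdict(list)
for feat in features:
    meta_features[feat["GeneID"]].append((feat["Start"], feat["End"]))

print(dict(meta_features))
```

Each row is one feature, and the GeneID column is what ties features together into meta-features.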

featureCounts options:
featureCounts has many options for tailoring how it summarizes reads. Some of the most relevant ones are as follows:
- files= a character vector giving the names of input files containing read mapping results
- annot.ext= a character string giving the name of a user-provided annotation file, or a data frame
- isGTFAnnotationFile= logical indicating whether the annotation (annot.ext) is in GTF format (default: FALSE)
- useMetaFeatures= logical indicating whether the read summarization should be performed at the meta-feature level
- allowMultiOverlap= logical indicating whether a read is allowed to be assigned to more than one feature it overlaps
- largestOverlap= if TRUE, a read (or pair) is assigned to the feature with the largest overlap
- minOverlap= integer giving the minimum number of overlapped bases required for a read to be assigned to a feature
- isPairedEnd= logical indicating whether the input files contain paired-end reads
- requireBothEndsMapped= logical indicating whether both ends of the same read pair are required to be successfully aligned
- nthreads= integer giving the number of threads used to run this function

You will need to specify at least files, annot.ext, isGTFAnnotationFile, useMetaFeatures, and isPairedEnd.

featureCounts execution:
In this example we have 10 BAM files to summarize, for which we’re specifying sample IDs (“PH01” through “PH10”).
- The variable bamfilelist is a vector of character strings, each string being the full path to the corresponding BAM file.
- The sampleIDs are then converted to factors, for later use in differential expression analysis (sampleIDs must be in the same order as the files in bamfilelist).
- We call featureCounts with a SAF annotation file, for paired-end data, summarizing by meta-features, etc.
- Finally, we assign the sampleIDs to the columns of the output counts matrix (by default featureCounts will assign the path character strings in bamfilelist to the columns).

featureCounts output:
Here we see the summary of the featureCounts output:
- The primary component of the output we’re interested in is the one named counts.
- If we look at the contents of output$counts, we see it is a matrix of integer values, where the column names are the samples (in this case PH01 through PH10) and the row names are the features over which we summarized the reads (in this case gene IDs).
- This count matrix is (at minimum) the necessary input for downstream differential expression analysis programs, such as DESeq or edgeR.

featureCounts performance:
featureCounts is both faster and more memory efficient than other common read summarizing programs.

[Table: number of reads or fragments counted, running time and memory usage for featureCounts, summarizeOverlaps and htseq-count.]

NOTES:
- Results are given for genewise counts of either single-end reads or paired-end fragments.
- featureCounts yields the same read counts as summarizeOverlaps but is much faster and more memory efficient.
- summarizeOverlaps counts fewer fragments because it excludes read pairs with only one end successfully mapped.
- htseq-count counts slightly fewer reads or fragments than featureCounts because it interprets GFF annotation differently and calls more ambiguously assigned fragments.
- The table gives the total number of reads counted when using single-end reads and the total number of fragments counted when using paired-end reads.
- Running time and memory usage are for fragment summarization.
- featureCounts was set to exclude reads or fragments overlapping multiple genes.
- summarizeOverlaps and htseq-count were run in ‘union’ mode.
- Results are shown for countOverlaps (i) when run on the whole genome at once and (ii) when run chromosome by chromosome.

featureCounts algorithm complexity:
featureCounts algorithm complexity relative to other common read summarizing methods.

[Table: time and space complexity of each algorithm.]

Symbols:
- $f$ = number of features
- $r$ = number of reads
- $k_i$ = number of features per bin
- $b_i$ = number of bins

NOTES:
- The table gives proportionality factors for the number of computations (time complexity) and memory locations (space complexity) required by each algorithm.
- Time complexities depend on the number of features $f$, the number of reads $r$ and the number of features included in genomic bins overlapping the query read, $k$.
- Space complexity also depends on the number of bins, $b$.
- Complexities are interpreted as $O(x)$, where $x$ is the expression given in the table.
- The number of bins used by coverageBED, $b_2$, is greater than the number of bins used by featureCounts, $b_1$.
- The number of within-bin features k2 for coverageBED is typically 4k1 for featureCounts.
