VCF format: variants c.f. S. Brown NYU
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT (sample)
chr1 901559 . G A 23 . DP=3;VDB=0.0298;AF1=1;AC1=2;DP4=0,0,3,0;MQ=60;FQ=-36 GT:PL:GQ 1/1:55,9,0:15
INFO
- AA: ancestral allele
- AC: allele count in genotypes, for each ALT allele, in the same order as listed
- AF: allele frequency for each ALT allele, in the same order as listed; use this when estimated from primary data, not called genotypes
- AN: total number of alleles in called genotypes
- BQ: RMS base quality at this position
- CIGAR: CIGAR string describing how to align an alternate allele to the reference allele
- DB: dbSNP membership
- DP: combined depth across samples, e.g. DP=154
- END: end position of the variant described in this record (esp. for CNVs)
- H2: membership in HapMap2
- MQ: RMS mapping quality, e.g. MQ=52
- MQ0: number of MAPQ == 0 reads covering this record
- NS: number of samples with data
- SB: strand bias at this position
- SOMATIC: indicates that the record is a somatic mutation, for cancer genomics
- VALIDATED: validated by follow-up experiment
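To make the column layout concrete, here is a minimal Python sketch that splits the example record into its fields. The function name `parse_vcf_record` is hypothetical, the column names follow the VCF specification, and the sketch assumes a single sample column; real VCF files are tab-delimited and are usually read with a dedicated library.

```python
def parse_vcf_record(line):
    """Split one VCF data line into fixed columns, an INFO map and a genotype map."""
    # Real VCF is tab-delimited; split() also accepts the spaces used on the slide.
    chrom, pos, vid, ref, alt, qual, flt, info, fmt, sample = line.split()[:10]
    info_map = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # Flag entries such as DB or SOMATIC carry no value.
        info_map[key] = value if sep else True
    # FORMAT names the per-sample fields; pair them with the sample's values.
    genotype = dict(zip(fmt.split(":"), sample.split(":")))
    return {"CHROM": chrom, "POS": int(pos), "ID": vid, "REF": ref, "ALT": alt,
            "QUAL": float(qual), "FILTER": flt, "INFO": info_map, "GENOTYPE": genotype}


record = parse_vcf_record(
    "chr1 901559 . G A 23 . "
    "DP=3;VDB=0.0298;AF1=1;AC1=2;DP4=0,0,3,0;MQ=60;FQ=-36 "
    "GT:PL:GQ 1/1:55,9,0:15"
)
print(record["INFO"]["DP"], record["GENOTYPE"]["GT"])
```

Here the genotype `1/1` says the sample is homozygous for the ALT allele A, which agrees with AF1=1 and AC1=2 in the INFO column.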
Alignment for NGS: what are the challenges? c.f. S. Brown NYU
Aligning millions of short reads
- Computationally intensive
- Aligning to a reference genome => mapping the reads
- Aligning de novo => genome assembly
- Either way, using something like Smith-Waterman (S-W) or BLAST would take too long, so aligners modify them to take shortcuts (heuristics)
- Heuristics include using indexing methods
- Length of each read = 50-300 bases
- Number of reads = 10^7 to 10^8
- Genome length = 3 x 10^9 bases, or double if diploid
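A back-of-the-envelope calculation shows why full dynamic programming is out of reach at these sizes. The sketch below assumes Smith-Waterman fills one matrix cell per (read base, genome base) pair; the read length of 100 and the throughput of 10^9 cells per second are illustrative assumptions, not measurements.

```python
reads = 10**7               # lower end of the range on the slide
read_length = 100           # assumed; the slide gives 50-300
genome_length = 3 * 10**9   # haploid genome length from the slide

# One dynamic-programming cell per (read base, genome base) pair.
cells = reads * read_length * genome_length

cells_per_second = 10**9    # assumed throughput of a single core
seconds = cells / cells_per_second
years = seconds / (3600 * 24 * 365)

print(f"{cells:.1e} cells, roughly {years:.0f} years on one core")
```

Even with generous assumptions the estimate lands at decades of compute time, which is why mappers rely on indexes and heuristics.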
A short word about indexing
- Just as the index of a book lists words with their page numbers, a genome index lists short sequences with their locations on the genome.
- The index file contains index entries, each made up of a search key value and a pointer to the block in the data file.
- Hashing (or keying) means recording positions (like line numbers) and describing the contents at each position by a minimal number of characters, called seeds.
- The techniques are very well established in computer science.
- The end result is that the files take up a lot less space and take a lot less time to search.
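The book-index analogy can be sketched in a few lines of Python. This is a toy under stated assumptions: the function name `build_index`, the seed length of 4 and the example sequence are all made up for illustration, and production aligners use far more compact structures.

```python
from collections import defaultdict


def build_index(genome, k):
    """Map every k-base seed to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


genome = "ACGTACGTGA"
index = build_index(genome, 4)

# Looking up a seed is now a dictionary access instead of a scan of the genome.
print(index.get("ACGT", []))
print(index.get("TTTT", []))
```

The seed ACGT is reported at both of its positions, and a seed that never occurs returns an empty list.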
Aligning millions of short reads
- Computationally intensive
- Aligning to a reference genome => mapping the reads
- Aligning de novo => genome assembly
- Heuristics include using indexing methods; some use ungapped alignment with short words
- BWT (Burrows-Wheeler Transformation) greatly reduces computational time
- Approaches include using hash tables, spaced seeds, contiguous seeds, etc.
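The difference between contiguous and spaced seeds can be illustrated with a small Python sketch. The template string `1101011` and the two sequences are hypothetical examples: a `1` means the base at that position is part of the hash key, a `0` means it is ignored.

```python
def seed_key(window, template):
    """Keep only the bases at positions where the template has a '1'."""
    return "".join(base for base, keep in zip(window, template) if keep == "1")


spaced = "1101011"       # hypothetical spaced seed, ignores positions 2 and 4
contiguous = "1111111"   # contiguous seed of the same span

reference_window = "ACGTACG"
read_window = "ACTTTCG"  # differs from the reference at positions 2 and 4

print(seed_key(reference_window, spaced), seed_key(read_window, spaced))
print(seed_key(reference_window, contiguous) == seed_key(read_window, contiguous))
```

With the spaced seed both windows hash to the same key even though they contain two mismatches, whereas the contiguous seed of the same span misses the hit.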
Mapping
The step of aligning the reads to the reference genome.
- index the reads, then scan them against the reference
- find the reference match with the fewest mismatches
- calculate the p-value, and assign each read its location
Problems: accuracy, splice junctions, variants
Challenges: false positives, repeats, parental alleles, HapMap
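The "fewest mismatches" step can be sketched as a brute-force scan in Python. This is only an illustration of the criterion: it skips the indexing and the p-value calculation, and the function names and sequences are made up.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_hit(read, reference):
    """Return (position, mismatches) of the first best ungapped placement."""
    best_pos, best_mm = None, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        mm = hamming(read, reference[pos:pos + len(read)])
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm


reference = "TTGACCATGCGTAAGT"
read = "CATGAG"  # one base differs from the reference at its best placement
print(best_hit(read, reference))
```

In a repeat, several positions tie for the fewest mismatches and the read cannot be placed uniquely, which is one of the challenges listed above.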
NGS alignment algorithms
- Smith-Waterman
- BLAST
- Enter BWT
- BLAT: BLAST-like, but searches a precomputed index of the genome
MAQ algorithm
Mapping and Assembly with Quality
MAQ first indexes read sequences and scans the reference genome to find hits that are extended and scored (first 28 bp with 2 mismatches max), minimizing the sum of quality scores of mismatches.
Scheme of indexing in hash tables (six noncontiguous seed templates): e.g., for an 8-base-long fragment with 1 or 2 variants, you still get segments which are fully matched (2 and 6).
c.f. Shamir 2011
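Why six templates are enough for up to two mismatches can be checked exhaustively. The Python sketch below assumes the 8-base fragment is cut into four blocks of two bases and that each template indexes one pair of blocks, which gives six templates; this is a scaled-down toy, not MAQ's code.

```python
from itertools import combinations

FRAGMENT_LENGTH = 8
BLOCKS = [set(range(start, start + 2)) for start in range(0, FRAGMENT_LENGTH, 2)]

# Each template indexes the bases of two of the four blocks: C(4, 2) = 6 templates.
TEMPLATES = [BLOCKS[i] | BLOCKS[j] for i, j in combinations(range(4), 2)]


def clean_templates(mismatch_positions):
    """Templates whose indexed bases contain none of the mismatches."""
    mismatches = set(mismatch_positions)
    return [template for template in TEMPLATES if not template & mismatches]


# Try every placement of 0, 1 or 2 mismatches in the fragment.
always_hit = all(
    clean_templates(positions)
    for count in (0, 1, 2)
    for positions in combinations(range(FRAGMENT_LENGTH), count)
)
print(len(TEMPLATES), always_hit)
```

Two mismatches can spoil at most two blocks, so at least two blocks stay clean and the template built from that pair still matches exactly. With three mismatches in three different blocks, for example at positions 0, 2 and 4, no template survives.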
MAQ algorithm
Heuristic yet thoughtful:
- MAQ loads all reads into memory.
- For each read, MAQ creates an integer from the 28 bp templates and stores the read ID and the integer in a hash table.
- Once all reads are processed, many reads share the same integer as key.
- The reference is then scanned using 28 bp subsequences. Using the corresponding hashes of the subsequences, MAQ finds the reads that match and extends them.
- MAQ calculates a score for each extended match and retains the best-scoring hits.
- Once the reference is scanned, the next template is used, until no more templates are left.
c.f. Shamir 2011
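The index-the-reads-then-scan-the-reference loop can be sketched in Python as follows. The simplifications are deliberate and are assumptions of this sketch: a single contiguous 4-base seed stands in for the 28 bp templates, a plain mismatch count stands in for the sum of quality scores, and all names and sequences are invented.

```python
from collections import defaultdict

SEED = 4  # toy stand-in for the 28 bp seed region


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def map_reads(reads, reference, max_mismatches=1):
    """Index reads by their seed, scan the reference, keep the best hit per read."""
    table = defaultdict(list)
    for read_id, read in enumerate(reads):
        table[read[:SEED]].append(read_id)       # many reads may share one key

    best = {}
    for pos in range(len(reference) - SEED + 1):
        for read_id in table.get(reference[pos:pos + SEED], []):
            read = reads[read_id]
            window = reference[pos:pos + len(read)]
            if len(window) < len(read):
                continue                         # read would run off the reference
            mismatches = hamming(read, window)   # extend and score the seed hit
            if mismatches <= max_mismatches and (
                read_id not in best or mismatches < best[read_id][1]
            ):
                best[read_id] = (pos, mismatches)
    return best


reference = "TTGACCATGCGTAAGT"
reads = ["CCATGC", "GCGTTA", "AAAAAA"]
print(map_reads(reads, reference))
```

The first read maps exactly, the second maps with one mismatch outside its seed, and the third has no seed hit and stays unmapped. Repeating the scan with further templates would also recover reads whose mismatches fall inside this particular seed.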
Burrows-Wheeler Transformation
- The Burrows-Wheeler transform is an algorithm used in data compression.
- Its output is easier to compress because it contains many runs of repeated characters.
- In this example, the transformed string contains a total of eight runs of identical characters: XX, II, XX, SS, PP, .., II, and III, which together make up 17 of the 44 characters.
- It would be great if we could compress our reads like that. How does it work?
Burrows-Wheeler Transformation
First we transform. The resulting string can now be compressed nicely (i.e., we can do that with our reads too!)
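A minimal Python sketch of the forward transform, using the textbook construction: append an end marker, sort all rotations of the string, and read off the last column. The end marker `$` and the word `banana` are assumptions of this sketch; the marker must not occur in the text and must sort before every other character.

```python
def bwt(text, end="$"):
    """Burrows-Wheeler transform via sorted rotations (quadratic, for illustration)."""
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("banana"))  # annb$aa
```

In the output the two n's and two of the a's end up next to each other, which is the kind of run that makes the transformed string easier to compress.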
Burrows-Wheeler Transformation
However, how do we get back to the original reads (i.e., how does the inverse transformation work)?
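One simple way to invert the transform, sketched in Python: repeatedly prepend the transformed string as a new first column and sort the rows, which rebuilds the sorted rotation matrix one column at a time. The sketch assumes the text ended with a unique `$` marker that sorts before all other characters; it is the naive textbook method, not what fast tools do.

```python
def inverse_bwt(last_column, end="$"):
    """Rebuild the original text from its Burrows-Wheeler transform."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        # Prepend the last column to every row, then sort the rows again.
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The row that ends with the marker is the original text plus marker.
    original = next(row for row in table if row.endswith(end))
    return original[:-1]


print(inverse_bwt("annb$aa"))  # banana
```

No information is lost by the transform: the last column alone is enough to recover every row of the matrix.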
BowTie
- The matrix has the characteristic that similar rows are clustered together.
- Starting with the query, we find the range of rows that match it.
- As characters are successively added to the query, the range becomes smaller.
- A match is found if the query matches the complete row; there is no match if the range is empty.
- Problems arise when mismatches occur.
- Remedy: backtracking (next slide)
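The range-narrowing idea can be sketched in Python for exact matches. This is a naive illustration and not Bowtie's implementation: the suffix array is built by plain sorting, occurrence counts are recomputed on the fly, backtracking for mismatches is left out, and all function names are invented. The query is extended one character at a time starting from its last character.

```python
def build_fm(text, end="$"):
    """Return (suffix array, BWT string, first-column offsets) for text + end."""
    s = text + end
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    bwt = "".join(s[i - 1] for i in suffix_array)  # character preceding each suffix
    first = {}                                     # first row starting with each character
    for row, start in enumerate(suffix_array):
        first.setdefault(s[start], row)
    return suffix_array, bwt, first


def exact_match(query, suffix_array, bwt, first):
    """Narrow the row range character by character; return sorted match positions."""
    low, high = 0, len(bwt)                        # half-open range of matching rows
    for char in reversed(query):
        if char not in first:
            return []
        low = first[char] + bwt[:low].count(char)
        high = first[char] + bwt[:high].count(char)
        if low >= high:                            # empty range: no match
            return []
    return sorted(suffix_array[low:high])


suffix_array, bwt, first = build_fm("banana")
print(exact_match("ana", suffix_array, bwt, first))
print(exact_match("nab", suffix_array, bwt, first))
```

For the query `ana` the range shrinks from all 7 rows to 3, then 2, and stays at 2 rows, giving the two occurrences in `banana`; for `nab` the range becomes empty after two characters.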
BowTie
Dealing with splice junctions (from Garber et al., Nature Methods, 2011)
TopHat
Pros and Cons
- Identifying splice junctions correctly is non-trivial
- Splice aligners that use exon-first methods are computationally less intensive (e.g., TopHat, which starts with the results of Bowtie2)
- Seed-extend methods are better at finding new isoforms, but could equally come up with false positive splice junctions (e.g., GSNAP)
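The exon-first idea can be illustrated with a toy Python sketch: first try to place the read unspliced, and only if that fails split it into two parts and look for them on either side of a gap. This is not TopHat's algorithm; the function name, the exact-match search, the minimum anchor of 3 bases and the sequences are all assumptions made for illustration.

```python
def find_junction(read, genome, min_anchor=3):
    """Toy exon-first mapping: unspliced match first, then a two-part split."""
    pos = genome.find(read)
    if pos != -1:
        return ("unspliced", pos)
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        i = genome.find(left)                 # only the first occurrence is tried
        if i == -1:
            continue
        donor = i + len(left)                 # first base after the left part
        j = genome.find(right, donor + 1)     # right part must lie downstream of a gap
        if j != -1:
            return ("spliced", donor, j)
    return ("unmapped",)


exon1 = "TTGACCATGC"
intron = "GTAAG" + "T" * 8 + "CAG"
exon2 = "CGATCGGATA"
genome = exon1 + intron + exon2

read = exon1[-5:] + exon2[:5]                 # read spanning the exon-exon junction
result = find_junction(read, genome)
print(result, genome[result[1]:result[2]] == intron)
```

The skipped stretch between the two parts is exactly the intron, which starts with GT and ends with AG. The cheap unspliced step handles most reads, so the more expensive split search only runs on the remainder.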
Identifying transcripts
- For RNA-Seq, we might want to identify novel transcripts
- Not trivial
- Ignore isoforms
- Use genome-guided reconstruction, i.e., using reads that span the potential splice sites of known genomes (e.g., Cufflinks)
- Use genome-independent reconstruction, i.e., de novo approaches (e.g., Velvet, Trinity)
Finding the real transcripts