Published by Theodore Pierce. Modified over 9 years ago.
1
Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data
2
Next Generation Sequencing
Today: Illumina NGS platform, Fastq files
Sequence bioinformatics: hash tables, suffix arrays, Burrows-Wheeler transform
3
Illumina
Slides from Kurt Strueber, Genome Center, MPIPZ Cologne
4
Short Read Applications
Genotyping. Goal: identify variations.
RNA-seq, ChIP-seq, Methyl-seq. Goal: classify, measure significant peaks.
[Figure: short reads aligned along the reference genome …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…]
5
Illumina
17
Hash tables
Slides taken from Michael Main, University of Colorado
18
Hash tables: The simplest kind of hash table is an array of records. This example has 701 records, indexed [0] to [700].
19
Hash tables: Each record has a special field, called its key. In this example, the key is a long integer field called Number. For instance, the record at [4] has Number 506643548.
20
Hash tables: The number might be a person's identification number, and the rest of the record has information about the person.
21
Hash tables: When a hash table is in use, some spots contain valid records, and other spots are "empty". Here the valid records have the Numbers 506643548, 233667136, 281942902 and 155778322.
22
Inserting a new record: In order to insert a new record, the key must somehow be converted to an array index. The index is called the hash value of the key. The new record here has Number 580625685. In our case: the keys are short sequences, and the records contain their location in the genome.
23
Inserting a new record: Typical way to create a hash value: (Number mod 701). What is (580625685 mod 701)?
24
Inserting a new record: Typical way to create a hash value: (Number mod 701). What is (580625685 mod 701)? It is 3.
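The modulo step can be checked directly. The following is a minimal sketch; the function name `hash_value` and the constant `TABLE_SIZE` are illustrative names, not part of the slides.

```python
TABLE_SIZE = 701  # number of slots in the example array


def hash_value(number: int, table_size: int = TABLE_SIZE) -> int:
    """Map an integer key to an array index using the modulo operation."""
    return number % table_size


# The new record from the slide.
print(hash_value(580625685))
```

The result is always in the range 0 to 700, so it can be used as an index into the array of 701 records.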
25
Inserting a new record: The hash value is used for the location of the new record. Number 580625685 goes to [3].
26
Inserting a new record: The hash value is used for the location of the new record. Number 580625685 is now stored at [3].
27
Collisions: Here is another new record to insert, Number 701466868, with a hash value of 2.
28
Collisions: This is called a collision, because there is already another valid record at [2]. When a collision occurs, move forward until you find an empty spot.
31
Collisions: This is called a collision, because there is already another valid record at [2]. The new record goes in the empty spot.
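The "move forward until you find an empty spot" rule is usually called linear probing. The sketch below replays the slide example under the assumption that the table holds exactly the five records shown and that the hash value is Number mod 701; the function name `insert` and the record layout are illustrative choices.

```python
TABLE_SIZE = 701


def insert(table, number, info=None):
    """Insert a record with linear probing and return the index that was used."""
    index = number % TABLE_SIZE
    for _ in range(TABLE_SIZE):
        if table[index] is None:              # empty spot found
            table[index] = (number, info)
            return index
        index = (index + 1) % TABLE_SIZE      # move forward, wrapping at the end
    raise RuntimeError("hash table is full")


table = [None] * TABLE_SIZE
for key in (506643548, 233667136, 281942902, 155778322, 580625685):
    insert(table, key)

# Hash value 2 is taken, and so are [3] and [4], so the record moves forward.
slot = insert(table, 701466868)
print(slot)
```

The sketch does not check for duplicate keys; a fuller version would compare keys while probing.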
32
A Quiz: If the keys were short sequences, where would you place the sequence ATACCG? (NB: this is an ill-posed question)
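The question is ill-posed because no hash function for sequences has been specified. One possible choice, shown here purely as an assumption, is to read the sequence as a base-4 number (A=0, C=1, G=2, T=3) and then reduce it modulo the table size as before. The names `BASE_CODE`, `encode` and `sequence_hash` are hypothetical.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding


def encode(sequence: str) -> int:
    """Interpret a DNA sequence as a base-4 integer."""
    value = 0
    for base in sequence:
        value = value * 4 + BASE_CODE[base]
    return value


def sequence_hash(sequence: str, table_size: int = 701) -> int:
    """Hash a DNA sequence to an array index (one of many possible choices)."""
    return encode(sequence) % table_size


print(encode("ATACCG"), sequence_hash("ATACCG"))
```

A different encoding or table size would place ATACCG somewhere else, which is exactly why the quiz has no single answer.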
33
Searching for a Key: The data that's attached to a key can be found fairly quickly. The key searched for here is Number 701466868.
34
Searching for a Key: Calculate the hash value. Check that location of the array for the key. The hash value is [2], but the record stored there has a different key ("Not me").
35
Searching for a Key: Keep moving forward until you find the key, or you reach an empty spot.
37
Searching for a Key: Keep moving forward until you find the key, or you reach an empty spot. Here the key is found ("Yes!").
38
Searching for a Key: When the item is found, the information can be copied to the necessary location.
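The search procedure mirrors insertion: start at the hash value and probe forward. This is a minimal sketch under the same assumptions as the example (701 slots, hash value Number mod 701); `build_table` and `search` are illustrative names.

```python
TABLE_SIZE = 701


def build_table(numbers):
    """Fill a table with linear probing (records hold only their key here)."""
    table = [None] * TABLE_SIZE
    for number in numbers:
        index = number % TABLE_SIZE
        while table[index] is not None:
            index = (index + 1) % TABLE_SIZE
        table[index] = number
    return table


def search(table, number):
    """Return the index of the key, or None if an empty spot is reached first."""
    index = number % TABLE_SIZE
    for _ in range(TABLE_SIZE):
        if table[index] is None:       # empty spot: the key is not in the table
            return None
        if table[index] == number:     # "Yes!"
            return index
        index = (index + 1) % TABLE_SIZE  # "Not me", keep moving forward
    return None


table = build_table([506643548, 233667136, 281942902,
                     155778322, 580625685, 701466868])
print(search(table, 701466868))
```

Note that this simple scheme does not support deleting records, since an emptied slot would cut a probe sequence short.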
39
Summary
– Hash tables store a collection of records with keys.
– The location of a record depends on the hash value of the record's key.
– When a collision occurs, the next available location is used.
– Searching for a particular key is generally quick.
The End
41
Suffix Arrays
– Suffix arrays were introduced by Manber and Myers in 1993
– More space efficient than suffix trees
– A suffix array for a string x of length m is an array of size m that specifies the lexicographic ordering of the suffixes of x
– Idea: Every substring is a prefix of a suffix
42
Suffix Arrays
Example of a suffix array for acaaacatat$:
3 4 1 5 7 9 2 6 8 10 11
Each entry is the starting position of that suffix in the search string (positions are 1-based, and $ sorts after the letters in this example).
43
Suffix Array Construction
Naive in place construction
– Similar to insertion sort
– Insert all the suffixes into the array one by one making sure that the new inserted suffix is in its correct place
– Running time complexity: O(m^2) where m is the length of the string
Manber and Myers give an O(m log m) construction in their 1993 paper.
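A naive construction can be sketched in a few lines by sorting the suffix start positions. This is not the Manber and Myers algorithm; it simply sorts whole suffixes. To reproduce the example array for acaaacatat$, the sketch assumes that $ sorts after every letter, so it is remapped to '~' for comparison. The function name `suffix_array` is illustrative.

```python
def suffix_array(text: str) -> list[int]:
    """Return the 1-based start positions of the suffixes in sorted order.

    Assumption: '$' is the largest character (as in the acaaacatat$ example),
    implemented by replacing it with '~' in the sort key.
    """
    def sort_key(start: int) -> str:
        return text[start:].replace("$", "~")

    return [start + 1 for start in sorted(range(len(text)), key=sort_key)]


print(suffix_array("acaaacatat$"))
```

Each key is a full suffix, so this sketch uses far more time and memory than a dedicated construction algorithm; it is meant only to make the definition concrete.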
44
Suffix Array Construction
– O(n) space where n is the size of the database string
– Space efficient. However, there's an increase in query time
– Lookup query: binary search
  – O(m log n) time; m is the size of the query
  – Can reduce time to O(m + log n) using a more efficient implementation
45
Suffix Array Search
(S is the indexed text, A its suffix array)

find(Pattern P in SuffixArray A):
    lo = 0, hi = length(A)
    for 0 <= i < length(P):
        Binary search for x, y where P[i] = S[A[j]+i] for lo <= x <= j < y <= hi
        lo = x, hi = y
    return {A[lo], A[lo+1], ..., A[hi-1]}
46
Suffix Array Search
Search ‘is’ in mississippi$

Index  Position  Suffix
0      11        i$
1      8         ippi$
2      5         issippi$
3      2         ississippi$
4      1         mississippi$
5      10        pi$
6      9         ppi$
7      7         sippi$
8      4         sissippi$
9      6         ssippi$
10     3         ssissippi$
11     12        $

Examine the pattern letter by letter, reducing the range of occurrence each time.
- First letter i: occurs in indices from 0 to 3
- Second letter s: occurs in indices from 2 to 3
Done. Output: issippi$ and ississippi$
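The letter-by-letter narrowing can be sketched as follows. The sketch uses 0-based positions and Python's default string order, in which $ sorts before the letters, so its row numbers are shifted by one relative to the table above; the reported suffixes are the same. The helper names `build_suffix_array` and `find` are illustrative.

```python
def build_suffix_array(text):
    """Naive suffix array (0-based positions, '$' sorts first in ASCII)."""
    return sorted(range(len(text)), key=lambda start: text[start:])


def find(pattern, text, sa):
    """Return the sorted 0-based start positions of pattern in text."""
    lo, hi = 0, len(sa)
    for i, ch in enumerate(pattern):
        def char_at(row):
            pos = sa[row] + i
            return text[pos] if pos < len(text) else ""

        def first_row(accept):
            # Binary search for the first row in [lo, hi) whose i-th
            # character satisfies the (monotone) predicate.
            a, b = lo, hi
            while a < b:
                mid = (a + b) // 2
                if accept(char_at(mid)):
                    b = mid
                else:
                    a = mid + 1
            return a

        x = first_row(lambda c: c >= ch)  # first row with character >= ch
        y = first_row(lambda c: c > ch)   # first row with character > ch
        lo, hi = x, y
        if lo == hi:                      # range is empty: no occurrence
            break
    return sorted(sa[lo:hi])


text = "mississippi$"
sa = build_suffix_array(text)
hits = find("is", text, sa)
print(hits, [text[h:] for h in hits])
```

The number of occurrences is simply the size of the final range, which is what makes counting queries cheap.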
47
Summary
– It can be built very fast.
– It can answer queries very fast: e.g. how many times does ATG appear?
Disadvantages:
– Can't do approximate matching
– Hard to insert new stuff dynamically (need to rebuild the array)
48
Links
http://pauillac.inria.fr/~quercia/documents-info/Luminy-98/albert/JAVA+html/SuffixTreeGrow.html
http://home.in.tum.de/~maass/suffix.html
http://homepage.usask.ca/~ctl271/857/suffix_tree.shtml
http://homepage.usask.ca/~ctl271/810/approximate_matching.shtml
http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic7/
http://dogma.net/markn/articles/suffixt/suffixt.htm
http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/Suffix/
50
Bowtie: A Highly Scalable Tool for Post-Genomic Datasets (Slides by Ben Langmead)
51
Short Read Applications
Genotyping. Goal: identify variations.
RNA-seq, ChIP-seq, Methyl-seq. Goal: classify, measure significant peaks.
[Figure: short reads aligned along the reference …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…]
52
Short Read Applications
Finding the alignments is typically the performance bottleneck.
[Figure: short reads aligned along the reference …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…]
53
Short Read Alignment
Given a reference and a set of reads, report at least one “good” local alignment for each read if one exists
– Approximate answer to: where in genome did read originate?
What is “good”? For now, we concentrate on:
– Fewer mismatches is better
  Example: read GATCAA on …TGATCATA… is better than read GAGAAT on …TGATCATA…
– Failing to align a low-quality base is better than failing to align a high-quality base
  Example: read GATcaT on …TGATATTA… is better than read GTACAT on …TGATcaTA…
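The two criteria can be made concrete with a small scoring sketch. It assumes an ungapped comparison of a read against an equally long window of the reference and sums the base qualities at mismatched positions; this is an illustration of the idea, not Bowtie's actual scoring code, and the function names and quality values are hypothetical.

```python
def mismatches(ref_window: str, read: str) -> list[int]:
    """Return the positions at which the read differs from the reference window."""
    if len(ref_window) != len(read):
        raise ValueError("window and read must have the same length")
    return [i for i, (r, q) in enumerate(zip(ref_window, read)) if r != q]


def mismatch_penalty(ref_window: str, read: str, qualities: list[int]) -> int:
    """Sum the base qualities at mismatched positions (lower is better)."""
    return sum(qualities[i] for i in mismatches(ref_window, read))


# Fewer mismatches is better: window GATCAT taken from ...TGATCATA...
print(len(mismatches("GATCAT", "GATCAA")), len(mismatches("GATCAT", "GAGAAT")))

# Same number of mismatches, but at low-quality versus high-quality bases.
low = mismatch_penalty("GATATT", "GATCAT", [30, 30, 30, 5, 5, 30])
high = mismatch_penalty("GATCAT", "GTACAT", [30, 30, 30, 30, 30, 30])
print(low, high)
```

Under this penalty, the read whose mismatches fall on low-quality bases is preferred even though both reads have two mismatches.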
54
Indexing
Genomes and reads are too large for direct approaches like dynamic programming
Indexing is required
Choice of index is key to performance:
– Suffix tree
– Suffix array
– Seed hash tables
– Many variants, incl. spaced seeds
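As an illustration of the seed hash table idea, the sketch below indexes every k-mer of a toy genome by its start positions, using Python's built-in dictionary as the hash table. The toy sequence, the value of k and the name `build_kmer_index` are assumptions chosen for the example.

```python
from collections import defaultdict


def build_kmer_index(genome: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer to the list of 0-based positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


index = build_kmer_index("GATTACAGATTACA", k=4)
print(index["GATT"], index["TACA"])
```

A read is then located by looking up one of its k-mers (the seed) and checking the candidate positions, which is why the size of such an index grows with the genome.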
55
Indexing
Genome indices can be big. For human, the sizes shown are > 35 GBs and > 12 GBs.
Large indices necessitate painful compromises:
1. Require big-memory machine
2. Use secondary storage
3. Build new index each run
4. Subindex and do multiple passes
56
Burrows-Wheeler Transform
Text T: acaacg$
Rotate string one by one in each row, then sort the rows lexicographically.

Burrows Wheeler Matrix:
$ a c a a c g
a a c g $ a c
a c a a c g $
a c g $ a c a
c a a c g $ a
c g $ a c a a
g $ a c a a c

BWT(T), the last column: g c $ a a a c
Last column contains the characters preceding the characters in the first column.
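The construction can be reproduced with a short sketch that builds all rotations, sorts them, and reads off the last column. It relies on $ sorting before the letters, which holds for Python's default string order. The names `bw_matrix` and `bwt` are illustrative, and building the full matrix is only practical for short texts.

```python
def bw_matrix(text: str) -> list[str]:
    """Return the sorted rotations of text (the Burrows-Wheeler matrix)."""
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    return sorted(rotations)


def bwt(text: str) -> str:
    """Return the last column of the Burrows-Wheeler matrix."""
    return "".join(row[-1] for row in bw_matrix(text))


for row in bw_matrix("acaacg$"):
    print(row)
print(bwt("acaacg$"))
```

Real tools avoid the quadratic memory of the explicit matrix, typically by deriving the transform from a suffix array.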
57
Burrows-Wheeler Transform
– Reversible permutation used originally in compression
– Once BWT(T) is built, all else shown here is discarded
  – Matrix will be shown for illustration only
– In long texts, BWT(T) contains more repeated character occurrences than the original text, so it is easier to compress!
58
Burrows-Wheeler Transform
Property that makes BWT(T) reversible is “LF Mapping”
– The i-th occurrence of a character in the Last column is the same text occurrence as the i-th occurrence in the First column
Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Digital Equipment Corporation, Palo Alto, CA 1994, Technical Report 124; 1994
59
Burrows-Wheeler Transform
To recreate T from BWT(T), repeatedly apply rule:
T = BWT[ LF(i) ] + T; i = LF(i)
– Where LF(i) maps row i to the row whose first character corresponds to i’s last per LF Mapping
Could be called “unpermute” or “walk-left” algorithm
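The walk-left idea can be sketched as follows, assuming the text ends in a unique $ that sorts before all other characters, so that row 0 of the matrix starts with $. The sketch prepends the last character of the current row and then follows the LF mapping, which is equivalent to the rule above up to where the walk starts. The names `lf_mapping` and `inverse_bwt` are illustrative.

```python
def lf_mapping(last: str) -> list[int]:
    """For each row i, return the row whose first character is last[i]."""
    counts: dict[str, int] = {}
    ranks = []
    for ch in last:                      # rank = number of earlier occurrences
        ranks.append(counts.get(ch, 0))
        counts[ch] = counts.get(ch, 0) + 1
    starts, total = {}, 0
    for ch in sorted(counts):            # first row of each character block
        starts[ch] = total
        total += counts[ch]
    return [starts[ch] + rank for ch, rank in zip(last, ranks)]


def inverse_bwt(last: str) -> str:
    """Recreate T from BWT(T) by walking left through the text."""
    lf = lf_mapping(last)
    text, row = "$", 0                   # row 0 is the rotation starting with '$'
    while last[row] != "$":
        text = last[row] + text          # character preceding the current one
        row = lf[row]
    return text


print(inverse_bwt("gc$aaac"))
```

Only the last column is needed for the reconstruction, which is why the matrix itself can be discarded once BWT(T) is built.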
60
BWT in Bioinformatics
Oligomer counting
– Healy J et al: Annotating large genomes with exact word matches. Genome Res 2003, 13(10):2306-2315.
Whole-genome alignment
– Lippert RA: Space-efficient whole genome comparisons with Burrows-Wheeler transforms. J Comp Bio 2005, 12(4):407-415.
Smith-Waterman alignment to large reference
– Lam TW et al: Compressed indexing and local alignment of DNA. Bioinformatics 2008, 24(6):791-797.
61
Comparison to Maq & SOAP
PC: 2.4 GHz Intel Core 2, 2 GB RAM
Server: 2.4 GHz AMD Opteron, 32 GB RAM
Bowtie v0.9.6, Maq v0.6.6, SOAP v1.10
SOAP not run on PC due to memory constraints
Reads: FASTQ 8.84 M reads from 1000 Genomes (Acc: SRR001115)
Reference: Human (NCBI 36.3, contigs)

Program               CPU time      Wall clock time  Reads per hour  Peak virtual memory footprint  Bowtie speedup  Reads aligned (%)
Bowtie –v 2 (server)  15m:07s       15m:41s          33.8 M          1,149 MB                       -               67.4
SOAP (server)         91h:57m:35s   91h:47m:46s      0.08 M          13,619 MB                      351x            67.3
Bowtie (PC)           16m:41s       17m:57s          29.5 M          1,353 MB                       -               71.9
Maq (PC)              17h:46m:35s   17h:53m:07s      0.49 M          804 MB                         59.8x           74.7
Bowtie (server)       17m:58s       18m:26s          28.8 M          1,353 MB                       -               71.9
Maq (server)          32h:56m:53s   32h:58m:39s      0.27 M          804 MB                         107x            74.7

Bowtie delivers about 30 million alignments per CPU hour
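The derived columns can be recomputed from the timings. The sketch below assumes that "Bowtie speedup" is the ratio of wall clock times and that "Reads per hour" is the 8.84 M reads divided by the wall clock time in hours; both assumptions reproduce the figures in the first two rows. The helper `to_seconds` is an illustrative name.

```python
def to_seconds(hours: int = 0, minutes: int = 0, seconds: int = 0) -> int:
    """Convert an h:m:s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


TOTAL_READS = 8.84e6  # reads in the test set

bowtie_server = to_seconds(minutes=15, seconds=41)          # Bowtie -v 2 (server)
soap_server = to_seconds(hours=91, minutes=47, seconds=46)  # SOAP (server)

speedup = soap_server / bowtie_server
reads_per_hour = TOTAL_READS / (bowtie_server / 3600)

print(round(speedup), round(reads_per_hour / 1e6, 1))
```

The same calculation with the Maq rows gives the other speedup entries when each is paired with the Bowtie run on the same machine.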
62
TopHat: Bowtie for RNA-seq
TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads using Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
– Contact: Cole Trapnell (cole@cs.umd.edu)
– http://tophat.cbcb.umd.edu
63
Acknowledgements
NGS Exercises were designed by Nicolas Delhomme, EMBL Heidelberg, University of Umeå