Published by Logan Farrell. Modified over 11 years ago.
1
A Proposed Solution to the Short Read Reassembly Problem
Carl Ebeling and Corey Olson
2
Outline
Background
Indexing Solution
Architecture
3
Motivation
Solexa/Illumina and SOLiD sequencers
  Billions of base pairs sequenced in hours
  Hundreds of millions of short reads (30-70 bp) read in parallel
Computational cost is rising
Needed: a hardware solution to improve speed and usability
4
Background
Goal: quickly align millions of reads to the reference genome
Read errors and SNPs prevent simple indexing
Solutions
  Brute-force comparison of all reads to the reference
  Index-based, using seeds
  Burrows-Wheeler Transform
5
Index Based Solution
Reference Index Table (RIT)
  Maps all seeds to positions in the reference
Read Position Table (RPT)
  Maps reads to regions in the reference for comparison
Smith-Waterman (SW) Comparison
  Streams the reference genome into SW units for scoring of reads
6
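RIT Creation
[Figure: the mask 11101101011 is applied to a reference window (CAT_GC_T_AT) to produce the seed CATGCTAT, which addresses a row of the RIT holding reference positions. Rows shown: CATGCTAT, CATGCTAA, CATGCTAC, CATGCTAG, CATGCCGG.]
Note: the first column is the number of entries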
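The masking step in the figure can be sketched as follows. This is a minimal illustration, not the authors' hardware design: the function names, the filler bases at the masked-out positions of the example window, and the 2-bit base encoding used to form a table address are all assumptions.

```python
# Assumed 2-bit encoding of bases; the slides do not specify one.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def apply_mask(window: str, mask: str) -> str:
    """Keep only the characters of `window` where `mask` has a '1'."""
    if len(window) != len(mask):
        raise ValueError("window and mask must have the same length")
    return "".join(base for base, bit in zip(window, mask) if bit == "1")


def seed_to_address(seed: str) -> int:
    """Pack a seed into an integer address, 2 bits per base."""
    address = 0
    for base in seed:
        address = (address << 2) | BASE_CODE[base]
    return address


MASK = "11101101011"
# The bases at the three masked-out positions (A, C, G here) are arbitrary.
seed = apply_mask("CATAGCCTGAT", MASK)
print(seed, seed_to_address(seed))
```

A seed of k characters packed this way needs 2k address bits, which is where the 2^(2k) table locations come from.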
7
RPT Creation
[Figure: the mask 11101101011 is applied to a window (CAT_GC_T_AT) of the read ATACATTGCGTAATCG (positions 0 to 15) to produce the seed CATGCTAT. The seed's RIT row supplies reference positions, which select RPT buckets. RIT rows shown: CATGCTAT, CATGCTAA, CATGCTAC, CATGCTAG, CATGCCGG. RPT buckets shown: 0:31, 32:63, 64:95, 96:127, 128:159.]
8
Read Scoring
[Figure labels: SW Unit; sequence TAGTGTGATCGAA; RPT buckets 0:31, 32:63, 64:95, 96:127, 128:159; Read #6.]
9
Buckets
Buckets combine hits for a read along the reference
Reduces the number of SW units required
Optimal bucket length is unknown
10
Entries Per Location in RIT
N = number of base pairs in the reference genome
k = number of characters in the seed (number of 1s in the mask)
Note: each RIT entry is ~4 bytes; there are 2^(2k) locations and N entries in total
N=2^31, k=11: RIT = 2^31 * 2^2 bytes = 8 GB
N=2^32, k=14: RIT = 2^32 * 2^2 bytes = 16 GB
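The sizing arithmetic on this slide can be checked with a few lines of Python. The table size follows directly from N entries of 4 bytes each. The average number of entries per location, N / 2^(2k), is inferred here from the stated counts of entries and locations and assumes seeds are spread uniformly, which a real genome will not satisfy exactly. The function names are hypothetical.

```python
def rit_size_bytes(n_bases: int, entry_bytes: int = 4) -> int:
    """Total RIT size: one entry of `entry_bytes` per reference position."""
    return n_bases * entry_bytes


def avg_entries_per_location(n_bases: int, k: int) -> float:
    """Average row length, assuming seeds are uniformly distributed."""
    return n_bases / 4 ** k  # 4**k == 2**(2k) table locations


GIB = 2 ** 30
for n_bases, k in [(2 ** 31, 11), (2 ** 32, 14)]:
    size_gb = rit_size_bytes(n_bases) / GIB
    per_row = avg_entries_per_location(n_bases, k)
    print(f"N=2^{n_bases.bit_length() - 1}, k={k}: "
          f"{size_gb:.0f} GB, ~{per_row:.0f} entries per location")
```

Note that the table size depends only on N, while k controls how the N entries are spread across locations.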
11
Entries in RPT
R = number of reads
Seff = effective number of seeds per read
Ex: R=2^27, Seff=2: 2^20 * 2048 * 4 = 8 GB
12
Entries per Bucket
b = bucket size
Note: this determines the number of SW units required
13
Architecture
Memory required: 8 GB for the RIT, 8 GB for the RPT
Creation of the RIT and RPT is random access
  Access time can be masked with buffering and multiple memory banks
High-bandwidth communication is required between FPGAs
14
RIT Creation Algorithm
1. Move to the next reference character
2. Generate the next seed with the mask
3. Using the seed as address, open the DRAM row
   a) Read the current array length
   b) Increment the array length and write back
   c) Write the reference position to array[length]
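The steps above can be sketched in software as follows. This is only a functional model under stated assumptions: a Python dictionary of lists stands in for the DRAM rows, so appending to a list plays the role of the read-length, increment, and write steps. The function name `build_rit` and the toy reference and mask are hypothetical.

```python
from collections import defaultdict


def build_rit(reference: str, mask: str) -> dict:
    """Map each masked seed to the list of reference positions where it occurs."""
    rit = defaultdict(list)  # stand-in for DRAM rows addressed by seed
    span = len(mask)
    # Step 1: advance one reference character at a time.
    for pos in range(len(reference) - span + 1):
        window = reference[pos:pos + span]
        # Step 2: generate the seed by keeping the bases under the mask's 1s.
        seed = "".join(b for b, bit in zip(window, mask) if bit == "1")
        # Step 3: append the position to the seed's row (length grows by one).
        rit[seed].append(pos)
    return rit


rit = build_rit("ACGTACGT", "101")
for seed, positions in sorted(rit.items()):
    print(seed, len(positions), positions)
```

Each printed row shows the seed, its entry count, and the positions, mirroring the "first column is number of entries" layout noted on the RIT Creation slide.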
15
Memory Distribution
[Figure: the RIT is distributed across memories by seed prefix (AA.., AC.., AG.., AT.., CA.., CC.., CG.., CT.., ..., TA.., TC.., TG.., TT..). The RPT buckets are partitioned across memory modules by reads (RPT part 0, part 1, ..., part n-1).]
16
RPT Creation Algorithm
1. Clear the bucket set P in the FPGA assigned to the read
2. For each seed in the read
   a) Using the seed as address, read all reference positions from the RIT
   b) Add the current read to the bucket associated with each position
3. After all seeds in the read, for each bucket in P
   a) Using the reference position as address, read the current array length
   b) Increment the array length and write back
   c) Write the read ID to array[length]
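A software model of these steps might look like the sketch below. Several things are assumptions made for illustration: the bucket for a hit is taken to be the reference position divided by the bucket size, dictionaries of lists stand in for the memory arrays, and all function names and the toy inputs are hypothetical.

```python
from collections import defaultdict


def masked_seeds(sequence: str, mask: str):
    """Yield (offset, seed) for every window of `sequence` under `mask`."""
    span = len(mask)
    for offset in range(len(sequence) - span + 1):
        window = sequence[offset:offset + span]
        yield offset, "".join(b for b, bit in zip(window, mask) if bit == "1")


def build_rit(reference: str, mask: str) -> dict:
    """Reference Index Table: seed -> list of reference positions."""
    rit = defaultdict(list)
    for pos, seed in masked_seeds(reference, mask):
        rit[seed].append(pos)
    return rit


def build_rpt(reads, rit, mask: str, bucket_size: int) -> dict:
    """Read Position Table: bucket index -> list of read IDs."""
    rpt = defaultdict(list)
    for read_id, read in enumerate(reads):
        buckets = set()  # step 1: clear the bucket set P for this read
        for _, seed in masked_seeds(read, mask):  # step 2: each seed in the read
            for ref_pos in rit.get(seed, ()):  # 2a: positions from the RIT
                buckets.add(ref_pos // bucket_size)  # 2b: assumed bucket mapping
        for bucket in sorted(buckets):  # step 3: one entry per distinct bucket
            rpt[bucket].append(read_id)
    return rpt


reference = "AAAACCCCGGGGTTTT"
rit = build_rit(reference, "11")
rpt = build_rpt(["CCC", "GTT"], rit, "11", bucket_size=4)
print(dict(rpt))
```

Collecting buckets in a set before writing means a read with several hits in the same bucket is recorded there once, which is how buckets reduce the number of comparisons.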
17
Reassembly Process with Architecture
Reference is streamed from the host source
Reads are loaded from the RPT into SW units at the start comparison point
The max score and its location for each read are recorded by the SW unit at the end comparison point
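What a single SW unit computes can be modeled in software with a standard Smith-Waterman local alignment that consumes the reference one character at a time and tracks the best score and where it ends. This is a sketch, not the hardware design: the scoring values (match +2, mismatch -1, linear gap -1) and the function name are assumptions, since the slides do not give them.

```python
def smith_waterman(read: str, reference: str,
                   match: int = 2, mismatch: int = -1, gap: int = -1):
    """Return (best local alignment score, 1-based reference end position)."""
    best_score, best_end = 0, 0
    prev = [0] * (len(read) + 1)  # scores for the previous reference character
    # The outer loop models the reference being streamed past the unit.
    for ref_pos, ref_base in enumerate(reference, start=1):
        curr = [0] * (len(read) + 1)
        for i, read_base in enumerate(read, start=1):
            diag = prev[i - 1] + (match if read_base == ref_base else mismatch)
            curr[i] = max(0, diag, prev[i] + gap, curr[i - 1] + gap)
            if curr[i] > best_score:  # keep the first occurrence of the maximum
                best_score, best_end = curr[i], ref_pos
        prev = curr
    return best_score, best_end


print(smith_waterman("GATC", "TTGATCAA"))
```

Only two columns of the scoring matrix are kept, because recording the maximum score and its location does not require a traceback.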
18
Active SW Units at One Time
Lr = read length
e = error window size
19
Performance Estimates
Construction of RIT = 16 seconds
  Assuming 128 MHz and processing 1 reference character per clock
Construction of RPT = 10 minutes
  Assuming R=130M, Lr=64, N=2^31, k=14, 4 FPGAs
Reassembly phase = 16 seconds
  Assuming 128 MHz, N=2^31
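The two 16-second figures follow from streaming N reference characters at one character per clock. The short check below reproduces that arithmetic; the function name is hypothetical, and the 10-minute RPT estimate is not reproduced because the slides do not give enough detail to derive it.

```python
def stream_seconds(n_bases: int, clock_hz: float, bases_per_clock: int = 1) -> float:
    """Time to stream `n_bases` reference characters through the pipeline."""
    return n_bases / (clock_hz * bases_per_clock)


N = 2 ** 31
print(f"{stream_seconds(N, 128e6):.1f} s at 128 MHz (128 * 10^6 Hz)")
print(f"{stream_seconds(N, 2 ** 27):.1f} s if 128 MHz is read as 2^27 Hz")
```

Reading 128 MHz as 2^27 Hz gives exactly 16 seconds; with 128 * 10^6 Hz the result is slightly higher, so the slide's figure is best taken as a rounded estimate.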