Genomic Data Clustering on FPGAs for Compression Andreas Zingg 24.10.2017
Background - Bioinformatics: Important tool to guide therapeutic intervention. Improves the knowledge available to researchers interested in evolutionary biology. -> May lay the foundation for predicting disease susceptibility and drug response.
Background - Bioinformatics: Genome: the entirety of an organism's hereditary information, encoded in DNA. DNA: consists of nitrogenous bases; bases appear in pairs.
Background - Bioinformatics: Base Pairs
Genomic Data: DNA is cut into small sequences. Sequences are read by machine, e.g. ACTGATTG, GCCTATCGATGAC, TGAT, TATCGACG.
The Problem: Generated data is really big. One human genome generates data on the order of 300 GB. This might take a while.
The Solution: Compress the data! But how?
The Solution: Exploit data redundancy. Map the data to the human reference genome. About 90% of genomic sequences share similarities with the human reference genome.
Mapping to the reference genome (figure labels: Human Reference Genome, Aligned reads)
Mapping to the reference genome: About 90% of sequences can be mapped to the reference genome. Mapped sequences are compressed using their location relative to the reference. What about the remaining 10%?
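To make the idea of compressing a mapped sequence by its location relative to the reference concrete, here is a minimal sketch. It assumes the read is already aligned at a known offset and that only substitutions occur; the `encode_read` / `decode_read` names and the (offset, mismatch list) encoding are hypothetical illustrations, not the format used by any particular tool.

```python
from typing import List, Tuple

# Hypothetical encoding: (offset into the reference, [(position in read, base), ...]).
Encoded = Tuple[int, List[Tuple[int, str]]]


def encode_read(read: str, reference: str, offset: int) -> Encoded:
    """Encode a read that is assumed to be aligned at `offset` in `reference`."""
    window = reference[offset:offset + len(read)]
    if len(window) != len(read):
        raise ValueError("read does not fit inside the reference at this offset")
    # Store only the positions where the read differs from the reference.
    diffs = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
    return offset, diffs


def decode_read(encoded: Encoded, reference: str, length: int) -> str:
    """Rebuild the read from the reference plus the stored differences."""
    offset, diffs = encoded
    bases = list(reference[offset:offset + length])
    for i, base in diffs:
        bases[i] = base
    return "".join(bases)


reference = "ACGTACGTTAGCCTATCGATGACGGA"  # toy reference, not real data
read = "GCCTATCGTTGAC"                    # differs from the reference in one base
encoded = encode_read(read, reference, offset=10)
print(encoded)
print(decode_read(encoded, reference, len(read)) == read)
```

The closer a read is to the reference, the shorter its mismatch list, which is why well-mapped reads compress well and unmapped reads need a different approach.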
Clustering: What about the remaining 10%? Find clusters and map sequences to these clusters. Using what algorithm? K-means? What should our K be?
No Useful Clustering Algorithm: There is no useful clustering algorithm for compression of genomic data. The exact value of K does not matter: as long as there are highly correlated clusters, compression is possible. Instead of searching for exactly K clusters, find clusters using a neighbourhood function with a small threshold. -> Present a clustering algorithm.
Matching function: For two sequences s1 and s2, a matching function is defined in terms of le (the sequence size), d (the distance between the sequences), and N (the distance threshold).
Matching function: Example with N = 1 and le = 8, including a reverse-complement case. Outcomes shown: Match! Match! Match! No Match!
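The slides do not reproduce the exact formula, so the following is only a sketch of what such a matching function could look like. It assumes that d is the Hamming distance between equal-length sequences and that a sequence also matches when its reverse complement is within the threshold; both points are assumptions drawn from the example labels (N = 1, le = 8, "Reverse Complement"), and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def matches(s1: str, s2: str, n: int) -> bool:
    """Assumed rule: match if s2 or its reverse complement is within distance n of s1."""
    return hamming(s1, s2) <= n or hamming(s1, reverse_complement(s2)) <= n


s1 = "ACTGATTG"                      # le = 8
print(matches(s1, "ACTGATTG", 1))    # identical sequence
print(matches(s1, "ACTGATTC", 1))    # one differing base
print(matches(s1, "CAATCAGT", 1))    # reverse complement of s1
print(matches(s1, "ACTGAAAG", 1))    # two differing bases
```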
Basic Clustering Idea: Complexity: O(n^2). More than 2 years on an Intel Core i7-4790. Not practical.
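The figure for the basic idea is not preserved, so the sketch below is one plausible reading rather than the presented algorithm: each sequence is compared against the reference of every existing cluster, joins the first cluster it matches, and otherwise founds a new cluster with itself as the reference. The `close` helper uses a plain Hamming-distance threshold as a stand-in for the matching function. When few sequences match, the number of comparisons grows quadratically with the input size, which is consistent with the quoted O(n^2).

```python
from typing import List


def close(a: str, b: str, n: int) -> bool:
    """Stand-in matching function: Hamming distance of at most n (an assumption)."""
    return sum(x != y for x, y in zip(a, b)) <= n


def cluster(sequences: List[str], n: int) -> List[List[str]]:
    """Greedy clustering; the first member of each cluster is its reference."""
    clusters: List[List[str]] = []
    for seq in sequences:
        for members in clusters:
            if close(seq, members[0], n):
                members.append(seq)
                break
        else:
            # No existing reference matched: this sequence founds a new cluster.
            clusters.append([seq])
    return clusters


reads = ["ACTGATTG", "ACTGATTC", "GGGGCCCC", "ACTGATTG", "GGGGCCCA"]
result = cluster(reads, n=1)
print(result)
```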
Parallel Clustering: Compare sequences with multiple cluster references at the same time. Use an FPGA board to implement the parallel clustering algorithm. To compare sequences, the FPGA can use 6-bit lookup tables.
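As a software analogy for comparing one sequence with several references using simple bit-level logic, the sketch below packs each base into 2 bits and counts mismatching bases with XOR and a mask. The encoding, the function names, and the use of a Python list comprehension are all assumptions for illustration; the slides do not describe the FPGA datapath at this level, and in hardware the per-reference comparisons would run concurrently rather than in a loop.

```python
from typing import List

# Assumed 2-bit encoding of the four bases.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(seq: str) -> int:
    """Pack a sequence into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def mismatches(a: int, b: int, length: int) -> int:
    """Count differing bases between two packed sequences of `length` bases."""
    diff = a ^ b
    low_bits = int("01" * length, 2)             # selects the low bit of each 2-bit base
    per_base = (diff | (diff >> 1)) & low_bits   # 1 where either bit of a base differs
    return bin(per_base).count("1")


def compare_with_references(seq: str, references: List[str], n: int) -> List[bool]:
    """One result per reference; conceptually these comparisons are independent."""
    packed = pack(seq)
    return [mismatches(packed, pack(ref), len(seq)) <= n for ref in references]


flags = compare_with_references("ACTGATTG", ["ACTGATTG", "ACTGATTC", "GGGGCCCC"], n=1)
print(flags)
```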
Setup: Modular interface to cluster sequences. CPU and FPGA are interchangeable, which allows for performance and result comparison.
FPGA top hierarchy
Matching Unit
FPGA initialization phase
FPGA main phase (multiple possible)
Shortcomings: Limited number of parallel clustering units. This requires phase repetitions: the number of executions and the memory latency increase, and the clustering process is slowed down. Worst case: none of the sequences match the references of the current clusters, so the cache must be able to store all sequences.
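A small software model can illustrate why a limited number of units forces phase repetitions. It assumes, hypothetically, that each phase can hold at most `units` cluster references, that a sequence matching none of them takes a free unit if one exists, and that it is otherwise kept in a cache for the next phase. None of these scheduling details are confirmed by the slides; the point is only that poorly matching input drives up both the number of phases and the cache size.

```python
from typing import List, Tuple


def close(a: str, b: str, n: int) -> bool:
    """Stand-in matching function: Hamming distance of at most n (an assumption)."""
    return sum(x != y for x, y in zip(a, b)) <= n


def phased_cluster(sequences: List[str], n: int, units: int) -> Tuple[List[List[str]], int]:
    """Cluster with at most `units` active references per phase; returns (clusters, phases)."""
    if units < 1:
        raise ValueError("at least one clustering unit is required")
    clusters: List[List[str]] = []
    pending = list(sequences)          # plays the role of the cache
    phases = 0
    while pending:
        phases += 1
        active: List[List[str]] = []   # clusters handled in this phase
        leftover: List[str] = []       # sequences deferred to the next phase
        for seq in pending:
            for members in active:
                if close(seq, members[0], n):
                    members.append(seq)
                    break
            else:
                if len(active) < units:
                    active.append([seq])
                else:
                    leftover.append(seq)
        clusters.extend(active)
        pending = leftover
    return clusters, phases


# Worst case for this model: no two sequences match each other.
reads = ["AAAAAAAA", "CCCCCCCC", "GGGGGGGG", "TTTTTTTT"]
print(phased_cluster(reads, n=1, units=2))
print(phased_cluster(reads, n=1, units=4))
```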
Proposed Workarounds: Cache not big enough -> increase memory capacity, or cut the input into smaller pieces that fit in the cache and handle those (parallelizable, but the solution might be sub-optimal). Phase repetitions slow down the clustering process -> use HMC modules and the maximum number of parallel clustering units. The maximum number of parallel units is limited by FPGA size and by the latency of sequence distribution over the FPGA surface.
Test Setup: Unmapped paired sequences of 126 bases from a real human sample. FPGA-based version at 125 MHz. Software version on an Intel Core i7-4790 (Haswell, 4 cores) at 4 GHz.
Runtime dependent on input size
Times needed to cluster a real case file (software configuration vs. FPGA hardware configuration; some values extrapolated)
Results: The software solution takes 2.6 years. FPGAs take ~12 hours, which makes the task practical. Speed gain: ~1000x. Energy saved: ~700x.
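As a rough sanity check on the quoted figures, the ratio of the two runtimes can be computed directly. This takes the rounded values from the slide at face value (2.6 years, ~12 hours, 365-day years), so it is only an order-of-magnitude estimate; the slides do not state how the quoted speed gain was derived.

```python
HOURS_PER_YEAR = 365 * 24          # assumes 365-day years

software_hours = 2.6 * HOURS_PER_YEAR
fpga_hours = 12
ratio = software_hours / fpga_hours

print(f"software: {software_hours:.0f} h, FPGA: {fpga_hours} h, ratio: {ratio:.0f}x")
```

With these rounded inputs the ratio comes out at roughly 1900x, the same order of magnitude as the quoted ~1000x.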
Conclusion: Goal achieved. Opens a path for new clustering-based compression algorithms. Shows that even on large datasets, high-complexity algorithms (O(n^2)) can run in a reasonable amount of time when provided with specialized hardware.
My Take: Well structured. Easy to read and understand. Interesting insight into a new field. Speedup is not explained well.
Questions