Genomic Data Clustering on FPGAs for Compression

Andreas Zingg

Background - Bioinformatics
An important tool to guide therapeutic intervention. Improves the knowledge available to researchers interested in evolutionary biology. May lay the foundation for predicting disease susceptibility and drug response.

Genome
The entirety of an organism's hereditary information, encoded in DNA. DNA consists of nitrogenous bases. Bases appear in pairs.

Base Pairs

Genomic Data
DNA is cut into small sequences. Sequences are read by machine, for example: ACTGATTG, GCCTATCGATGAC, TGAT, TATCGACG.

The Problem
The generated data is really big. One human genome generates data on the order of 300 GB. This might take a while.

The Solution
Compress the data! But how?

Exploit data redundancy
Map the data to the human reference genome. About 90% of genomic sequences share similarities with the human reference genome.

Mapping to the reference genome
[Figure: reads aligned against the human reference genome]

About 90% of sequences can be mapped to the reference genome. Mapped sequences are compressed using their location relative to the reference. What about the remaining 10%?
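For the mapped reads, the idea of storing a location relative to the reference can be sketched as follows. This is a minimal illustration, not the format used in the presented work: the names `encode_read` and `decode_read` and the toy reference string are assumptions, and only substitutions are handled (no insertions or deletions).

```python
def encode_read(reference, read, pos):
    """Encode a read as (position, length, mismatches) relative to the reference.

    Hypothetical helper: assumes the read aligns at `pos` with substitutions only.
    """
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read does not fit in the reference at this position")
    # Keep only the bases that differ from the reference.
    mismatches = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
    return pos, len(read), mismatches


def decode_read(reference, encoded):
    """Rebuild the read from the reference and its relative encoding."""
    pos, length, mismatches = encoded
    bases = list(reference[pos:pos + length])
    for i, b in mismatches:
        bases[i] = b
    return "".join(bases)


reference = "ACGTACGTTAGCCTATCGATGACGGA"  # toy reference, made up for the example
read = "GCCTATCGTTGAC"                    # differs from the reference in one base
encoded = encode_read(reference, read, 10)
restored = decode_read(reference, encoded)
```

A read that matches the reference closely shrinks to a position, a length, and a short mismatch list, which is where the compression gain would come from.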

Clustering
What about the remaining 10%? Find clusters and map sequences to these clusters. Using what algorithm? K-means? What should our K be?

No Useful Clustering Algorithm
There is no useful existing clustering algorithm for compressing genomic data. The exact value of K does not matter: as long as there are highly correlated clusters, compression is possible. Instead of searching for exactly K clusters, find clusters using a neighbourhood function with a small threshold. Present a clustering algorithm based on this idea.

Matching function
For two sequences s1 and s2, a matching function is defined (formula shown on the slide), where le is the sequence size, d is the distance between sequences, and N is the distance threshold.

Matching function
Example with N = 1 and le = 8.
[Figure: example comparisons, including a reverse complement, labelled Match!, Match!, Match!, No Match!]
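The slide's formula is not preserved in the transcript, so the following is only a sketch of what such a matching function might look like. It assumes d is the Hamming distance between equal-length sequences and that a sequence also matches if its reverse complement is within the threshold N. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def hamming(s1, s2):
    """Number of positions at which two equal-length sequences differ (assumed distance d)."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length le")
    return sum(a != b for a, b in zip(s1, s2))


def matches(s1, s2, n):
    """Assumed rule: match if s2 or its reverse complement is within distance n of s1."""
    return hamming(s1, s2) <= n or hamming(s1, reverse_complement(s2)) <= n


# Toy cases with N = 1 and le = 8, mirroring the slide's parameters.
identical = matches("ACTGATTG", "ACTGATTG", 1)
one_off = matches("ACTGATTG", "ACTGATTC", 1)
rev_comp = matches("ACTGATTG", "CAATCAGT", 1)
too_far = matches("ACTGATTG", "ACTGAAAC", 1)
```

With these inputs the first three comparisons match and the last one does not, since it differs in three positions and its reverse complement is no closer.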

Basic Clustering Idea
Complexity: O(n^2). More than 2 years on an Intel Core i7-4790. Not practical.
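The slide's diagram is not in the transcript, so the sketch below is one plausible reading of the basic idea: each sequence is compared against the reference of every existing cluster and opens a new cluster if nothing matches. The greedy strategy, the choice of the first member as cluster reference, and the simplified `hamming_match` predicate are all assumptions made for illustration.

```python
def hamming_match(s1, s2, n=1):
    """Simplified stand-in for the matching function: Hamming distance at most n."""
    return len(s1) == len(s2) and sum(a != b for a, b in zip(s1, s2)) <= n


def cluster(sequences, match=hamming_match):
    """Greedy threshold clustering; returns references, assignments, comparison count."""
    references = []  # one reference sequence per cluster (its first member)
    assignment = []  # cluster index for each input sequence
    comparisons = 0
    for seq in sequences:
        for idx, ref in enumerate(references):
            comparisons += 1
            if match(seq, ref):
                assignment.append(idx)
                break
        else:
            # No existing reference matched: this sequence starts a new cluster.
            references.append(seq)
            assignment.append(len(references) - 1)
    return references, assignment, comparisons


refs, assign, count = cluster(["ACTGATTG", "ACTGATTC", "GGGGCCCC", "ACTGATTG", "TTTTAAAA"])
# Worst case: no sequence matches any reference, giving n * (n - 1) / 2 comparisons.
_, _, worst = cluster(["AAAAAAAA", "CCCCCCCC", "GGGGGGGG", "TTTTTTTT"])
```

In the worst case every sequence is compared with every earlier one, which is where the quadratic growth comes from.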

Parallel Clustering
Compare sequences with multiple cluster references at the same time. Use an FPGA board to implement the parallel clustering algorithm. To compare sequences, the FPGA can use 6-bit lookup tables.
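To give a feel for why this comparison suits hardware, the sketch below packs each base into 2 bits and finds mismatches with XOR and a mask, then checks one sequence against several references. It is a software model only: the 2-bit encoding, the function names, and the list comprehension standing in for concurrent matching units are assumptions, not the presented FPGA design.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # assumed 2-bit base encoding


def pack(seq):
    """Pack a sequence into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def packed_distance(a, b, length):
    """Count differing bases between two packed sequences of `length` bases."""
    diff = a ^ b
    mask = int("01" * length, 2)            # low bit of every 2-bit group
    mismatch = (diff | (diff >> 1)) & mask  # 1 wherever either bit of a group differs
    return bin(mismatch).count("1")


def match_all(seq, references, n=1):
    """One result per reference; in hardware each comparison could run in its own unit."""
    packed = pack(seq)
    return [packed_distance(packed, pack(ref), len(seq)) <= n for ref in references]


results = match_all("ACTGATTG", ["ACTGATTG", "ACTGATTC", "ACTGAAAC"])
```

Each per-base comparison only depends on a few input bits, which is the kind of logic a small lookup table can evaluate directly.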

Setup
Modular interface to cluster sequences. CPU and FPGA are interchangeable, which allows for performance and result comparison.

FPGA top hierarchy

Matching Unit

FPGA initialization phase

FPGA main phase (multiple possible)

Shortcomings
The number of parallel clustering units is limited, which requires phase repetitions. The number of executions and the memory latency increase, and the clustering process is slowed down. Worst case: none of the sequences match the references of the current clusters, so the cache must be able to store all sequences.

Proposed Workarounds
Cache not big enough: increase memory capacity, or cut the input into smaller pieces that fit in the cache and handle those. This is parallelizable; however, the solution might be sub-optimal.
Phase repetitions slow down the clustering process: use HMC modules and the maximum number of parallel clustering units.
The maximum number of parallel units is limited by the FPGA size and by the latency of sequence distribution over the FPGA surface.

Test Setup
Unmapped paired sequences of 126 bases from a real human sample. FPGA-based version at 125 MHz. Software version on an Intel Core i Haswell 4-core at 4 GHz.

Runtime dependent on input size

Times needed to cluster a real case file
[Table: software configuration and FPGA hardware configuration; some entries are marked as extrapolated]

Results
The software solution takes 2.6 years; the FPGAs take ~12 hours, which makes the task practical. Speed gain: ~1000x. Energy saved: ~700x.
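A quick back-of-the-envelope check of the runtime figures, taking the rounded values on the slide at face value and assuming a 365-day year:

```python
HOURS_PER_YEAR = 365 * 24             # assumption: non-leap year

software_hours = 2.6 * HOURS_PER_YEAR  # reported software runtime
fpga_hours = 12                        # reported FPGA runtime (approximate)

speedup = software_hours / fpga_hours
print(f"implied speedup: about {round(speedup)}x")
```

The ratio of the two rounded runtimes comes out a little under 2000, which agrees with the stated ~1000x in order of magnitude only; the slide does not say how its speed-gain figure was derived.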

Conclusion
Goal achieved. Opens a path for new clustering-based compression algorithms. Proved that even on large datasets, high-complexity algorithms (O(n^2)) can run in a reasonable amount of time when provided with specialized hardware.

My Take
Well structured. Easy to read and understand. Interesting insight into a new field. Speedup is not explained well.

Questions
