Published by Gervase Shelton. Modified over 6 years ago.
1
Genomic Data Clustering on FPGAs for Compression
Andreas Zingg
2
Background - Bioinformatics
An important tool to guide therapeutic intervention. Improves the knowledge available to researchers interested in evolutionary biology. -> May lay the foundation for predicting disease susceptibility and drug response.
4
Background - Bioinformatics
Genome: the entirety of an organism's hereditary information, encoded in DNA.
DNA: consists of nitrogenous bases. Bases appear in pairs.
5
Background - Bioinformatics
Base Pairs
8
Genomic Data
DNA is cut into small sequences. Sequences are read by machine.
Example reads: ACTGATTG GCCTATCGATGAC TGAT TATCGACG
9
The Problem
Generated data is really big: one human genome generates data on the order of 300 GB. This might take a while.
11
The Solution
Compress the data! But how?
12
The Solution
Exploit data redundancy: map the data to the human reference genome. About 90% of genomic sequences share similarities with the human reference genome.
13
Mapping to the reference genome
[Figure: aligned reads mapped onto the human reference genome]
15
Mapping to the reference genome
About 90% of sequences can be mapped to the reference genome. Mapped sequences are compressed using their location relative to the reference. What about the remaining 10%?
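Compressing a mapped read by its location relative to the reference can be illustrated with a small sketch. It assumes the alignment position is already known and that differences are stored as (offset, base) pairs; the names `encode_read` and `decode_read` and the toy reference are hypothetical and do not come from the presentation.

```python
def encode_read(reference, read, position):
    """Encode a read as (position, length, mismatches) relative to the reference.

    Assumption: the read aligns without gaps at `position`.
    """
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read does not fit at this position")
    # Store only the bases that differ from the reference.
    mismatches = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if ref_base != base
    ]
    return position, len(read), mismatches


def decode_read(reference, encoded):
    """Rebuild the original read from the reference and its encoding."""
    position, length, mismatches = encoded
    bases = list(reference[position:position + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTTAGC"
encoded = encode_read(reference, "TACGATAG", 3)
restored = decode_read(reference, encoded)
```

A read that maps well needs only a position, a length, and a short list of differences, which is why mapped reads compress well and unmapped reads need another strategy.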
19
Clustering
What about the remaining 10%? Find clusters and map sequences to these clusters.
Using what algorithm? K-Means? What should our K be?
20
No Useful Clustering Algorithm
There is no useful existing clustering algorithm for compression of genomic data.
The exact value of K does not matter: as long as there are highly correlated clusters, compression is possible.
Instead of searching for exactly K clusters, find clusters using a small-threshold neighbourhood function.
Present a clustering algorithm.
21
Matching function
For two sequences s1 and s2, a matching function is defined in terms of:
le: sequence size
d: distance between sequences
N: distance threshold
22
Matching function
N = 1, le = 8
[Figure: example comparisons, including a reverse complement case; three are labelled "Match!" and one "No Match!"]
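The transcript does not preserve the formula itself, so the following is only a sketch of what such a matching function could look like. It assumes that d is the Hamming distance, that both sequences have the same size le, and that a sequence also matches when its reverse complement is within the threshold N. The names `hamming`, `reverse_complement` and `matches` are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def hamming(s1, s2):
    """Number of positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same size")
    return sum(a != b for a, b in zip(s1, s2))


def matches(s1, s2, n=1):
    """Assumed rule: match if s2 or its reverse complement is within distance n of s1."""
    return hamming(s1, s2) <= n or hamming(s1, reverse_complement(s2)) <= n


# Examples with N = 1 and le = 8, mirroring the slide's parameters.
identical = matches("ACTGATTG", "ACTGATTG")
one_off = matches("ACTGATTG", "ACTGATTC")
rev_comp = matches("ACTGATTG", "CAATCAGT")
too_far = matches("ACTGATTG", "ACTGAAAC")
```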
25
Basic Clustering Idea
Complexity: $O(n^2)$
More than 2 years on an Intel Core i7 4790. Not practical.
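The slide's diagram is not preserved, so the sketch below shows one common form of threshold-based clustering that has this quadratic behaviour: each sequence is compared against the reference of every existing cluster and opens a new cluster when nothing matches. The greedy strategy, the name `greedy_cluster`, and the Hamming-based `is_match` used in the example are assumptions, not the presenter's algorithm.

```python
def hamming(s1, s2):
    """Mismatch count; assumes both sequences have the same size."""
    return sum(a != b for a, b in zip(s1, s2))


def greedy_cluster(sequences, is_match):
    """Assign each sequence to the first cluster whose reference it matches.

    Returns (references, members, comparisons). In the worst case, when
    nothing matches, n sequences need n * (n - 1) / 2 comparisons.
    """
    references = []   # one reference sequence per cluster
    members = []      # members[i] lists the sequences of cluster i
    comparisons = 0
    for seq in sequences:
        for idx, ref in enumerate(references):
            comparisons += 1
            if is_match(seq, ref):
                members[idx].append(seq)
                break
        else:
            # No existing cluster matched: the sequence becomes a new reference.
            references.append(seq)
            members.append([seq])
    return references, members, comparisons


def within_one(a, b):
    return hamming(a, b) <= 1


refs, groups, count = greedy_cluster(
    ["AAAAAAAA", "AAAAAAAT", "CCCCCCCC", "CCCCCCCG", "GGGGGGGG"], within_one
)
worst_refs, _, worst_count = greedy_cluster(["AAAA", "CCCC", "GGGG", "TTTT"], within_one)
```

The second call is the worst case: four mutually dissimilar sequences produce four clusters and 6 = 4 * 3 / 2 comparisons.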
26
Parallel Clustering
Compare sequences with multiple cluster references at the same time.
Use an FPGA board to implement the parallel clustering algorithm.
To compare sequences, the FPGA can use 6-bit lookup tables.
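Comparing one sequence with several references can be modelled in software with a bitwise encoding. The sketch assumes a 2-bit code per base and counts mismatches from the XOR of two packed sequences. This is a generic technique chosen for illustration, not the presentation's lookup-table design, and the names `pack`, `mismatches` and `match_against_all` are hypothetical. The Python loop runs sequentially; on the FPGA each reference would sit in its own matching unit and all comparisons would happen at once.

```python
ENCODING = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(seq):
    """Pack a DNA sequence into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | ENCODING[base]
    return value


def mismatches(packed_a, packed_b, length):
    """Count differing bases: each non-zero 2-bit group of the XOR is one mismatch."""
    diff = packed_a ^ packed_b
    count = 0
    for _ in range(length):
        if diff & 0b11:
            count += 1
        diff >>= 2
    return count


def match_against_all(seq, references, threshold):
    """Return one match flag per reference (one per matching unit in hardware)."""
    packed = pack(seq)
    return [
        mismatches(packed, pack(ref), len(seq)) <= threshold
        for ref in references
    ]


flags = match_against_all("ACTGATTG", ["ACTGATTG", "ACTGATTC", "TTTTTTTT"], 1)
```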
27
Setup
Modular interface to cluster sequences.
CPU and FPGA are interchangeable, which allows for performance and result comparison.
28
FPGA top hierarchy
29
Matching Unit
30
FPGA initialization phase
31
FPGA main phase (multiple possible)
35
Shortcomings
Limited number of parallel clustering units.
This requires phase repetitions: the number of executions and the memory latency increase, and the clustering process is slowed down.
Worst case: none of the sequences match the references of the current clusters, so the cache must be able to store all sequences.
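The effect of a limited number of units can be illustrated with a small simulation. It assumes that, within one phase, an unmatched sequence occupies a free unit as a new reference, and that once all units are taken the remaining unmatched sequences are cached for the next phase. The transcript does not describe the phase logic in this detail, so the scheme and the name `cluster_in_phases` are hypothetical.

```python
def cluster_in_phases(sequences, num_units, is_match):
    """Cluster with at most `num_units` references active per phase.

    Returns (clusters, phases), where clusters is a list of
    (reference, members) pairs.
    """
    if num_units < 1:
        raise ValueError("at least one clustering unit is required")
    pending = list(sequences)
    clusters = []
    phases = 0
    while pending:
        phases += 1
        units = []   # clusters held by the parallel units in this phase
        cache = []   # sequences deferred to the next phase
        for seq in pending:
            for reference, members in units:
                if is_match(seq, reference):
                    members.append(seq)
                    break
            else:
                if len(units) < num_units:
                    units.append((seq, [seq]))   # free unit: new reference
                else:
                    cache.append(seq)            # no free unit: defer
        clusters.extend(units)
        pending = cache
    return clusters, phases


def within_one(a, b):
    return sum(x != y for x, y in zip(a, b)) <= 1


distinct = ["AAAA", "CCCC", "GGGG", "TTTT"]
_, phases_two_units = cluster_in_phases(distinct, 2, within_one)
_, phases_four_units = cluster_in_phases(distinct, 4, within_one)
mixed_clusters, mixed_phases = cluster_in_phases(["AAAA", "AAAT", "CCCC"], 1, within_one)
```

With mutually dissimilar sequences, halving the number of units doubles the number of phases, and every deferred sequence has to be held in `cache` in the meantime.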
36
Proposed Workarounds
Cache not big enough: increase memory capacity, or cut the input into smaller pieces that fit in the cache and handle those. This is parallelizable; however, the solution might be sub-optimal.
Phase repetitions slow down the clustering process: use HMC modules, and use the maximum number of parallel clustering units.
The maximum number of parallel units is limited by the FPGA size and by the latency of sequence distribution over the FPGA surface.
37
Test Setup
Unmapped paired sequences of 126 bases from a real human sample.
FPGA-based version at 125 MHz.
Software version on an Intel Core i Haswell 4-core at 4 GHz.
38
Runtime dependent on input size
39
Times needed to cluster a real case file
[Figure: clustering times for the software configuration and the FPGA hardware configuration; some values are extrapolated]
40
Results
The software solution takes 2.6 years. FPGAs take ~12 hours, which makes the task practical.
Speed gain: ~1000x
Energy saved: ~700x
41
Conclusion
Goal achieved.
Opens a path for new clustering-based compression algorithms.
Proved that even on large datasets, high-complexity algorithms ($O(n^2)$) can run in a reasonable amount of time when provided with specialized hardware.
42
My Take
Well structured. Easy to read and understand.
Interesting insight into a new field.
Speedup is not explained well.
43
Questions