Modeling and Optimization for Customized Computing: Performance, Energy and Cost Perspective
Peipei Zhou, Ph.D. Final Defense, June 10th, 2019
Committee: Jason Cong (Chair), Glenn Reinman, Jae Hoon Sul, Tony Nowatzki
Conference Publications
- [ICCAD '18] Y. Chi, J. Cong, P. Wei, and P. Zhou, “SODA: Stencil with Optimized Dataflow Architecture”, IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2018. Best Paper Candidate.
- [FCCM '18] J. Cong, P. Wei, C.H. Yu, and P. Zhou, “Latte: Locality Aware Transformation for High-Level Synthesis”, IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2018.
- [FCCM '18] Z. Ruan, T. He, B. Li, P. Zhou, and J. Cong, “ST-Accel: A High-Level Programming Platform for Streaming Applications on FPGA”, IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2018.
- [ISPASS '18] P. Zhou, Z. Ruan, Z. Fang, M. Shand, D. Roazen, and J. Cong, “Doppio: I/O-Aware Performance Analysis, Modeling and Optimization for In-Memory Computing Framework”, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2018. Best Paper Candidate (4 out of 21 accepted).
- [DAC '17] J. Cong, P. Wei, C.H. Yu, and P. Zhou, “Bandwidth Optimization Through On-Chip Memory Restructuring for HLS”, Design Automation Conference (DAC), 2017.
- [ICCAD '16] C. Zhang, Z. Fang, P. Zhou, P. Pan, and J. Cong, “Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks”, IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2016.
- [FCCM '16] P. Zhou, H. Park, Z. Fang, J. Cong, and A. DeHon, “Energy Efficiency of Full Pipelining: A Case Study for Matrix Multiplication”, IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2016.
- [FCCM '14] J. Cong, H. Huang, C. Ma, B. Xiao, and P. Zhou, “A Fully Pipelined and Dynamically Composable Architecture of CGRA”, IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2014.
Journal and Poster Publications
- [TCAD '18] C. Zhang, G. Sun, Z. Fang, P. Zhou, P. Pan, and J. Cong, “Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2018. TCAD Donald O. Pederson Best Paper Award, 2019.
- [DAC '18] Y. Chi, P. Zhou, and J. Cong, “An Optimal Microarchitecture for Stencil Computation with Data Reuse and Fine-Grained Parallelism” (work-in-progress accepted), Design Automation Conference (DAC), 2018.
- [FPGA '18] Y. Chi, P. Zhou, and J. Cong, “An Optimal Microarchitecture for Stencil Computation with Data Reuse and Fine-Grained Parallelism” (abstract only), International Symposium on Field-Programmable Gate Arrays (FPGA), 2018.
- [FPGA '16] Y. Chen, J. Cong, Z. Fang, and P. Zhou, “ARAPrototyper: Enabling Rapid Prototyping and Evaluation for Accelerator-Rich Architecture” (abstract only), ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2016.
- [HITSEQ '15] Y. Chen, J. Cong, J. Lei, S. Li, M. Peto, P. Spellman, P. Wei, and P. Zhou, “CS-BWAMEM: A Fast and Scalable Read Aligner at the Cloud Scale for Whole Genome Sequencing” (poster), High Throughput Sequencing, Algorithms and Applications (HiTSeq), 2015. Best Poster Award.
Accelerators
- CPU core scaling is coming to an end (the i386 dates back to 1985, 33 years ago), and CPUs are power hungry
- Accelerators: FPGAs (Field-Programmable Gate Arrays) deliver higher energy efficiency than CPUs through parallelization and customization
Accelerator at Different Levels
- Single-chip level: System-on-Chip (SoC) FPGAs in IoT devices and self-driving cars; battery life and performance matter!
- Example: Metoak Technology visual sensors for ADAS and autonomous driving, powered by Xilinx (demo video)
Accelerator at Different Levels
- Node level: CPU <-> FPGA accelerators connected through PCIe; the host offloads kernels onto the FPGAs; performance and dollars matter!
- The FPGA handles compute-intensive, deeply pipelined, hardware-accelerated operations; the CPU handles the rest; data is transferred to and from the FPGA via PCIe
Accelerator at Different Levels
- Cluster level: FPGA accelerators in various cloud services; performance and dollars matter!
- IBM Cloud and Microsoft deployed FPGAs and GPUs in 2014 (Bing web search and image search engine); Amazon AWS, Baidu, Huawei, and Alibaba started in 2016; by 2019, also Nimbix and Tencent
Accelerator at Different Levels
- Cluster level (performance and dollars matter!): fully pipelining CPUs and FPGAs gives more cost benefit
Outline
- Introduction
- Chip-Level Performance and Energy Modeling
  - Energy Efficiency of Full Pipelining (Chapter 2)
  - Latte: frequency and performance optimization (Chapter 3)
- Node-Level Performance and Cost Modeling
  - Computation resources (Chapter 4)
  - Doppio: storage resources (Chapter 5)
- Cluster-Level Performance and Cost Modeling
  - Public cloud cost optimization: composable compute instances, Mocha (Chapter 6)
  - Compute + storage co-optimization (Chapter 7)
  - Private cloud throughput optimization (Chapter 8)
- Conclusion
Applications: GATK Best Practices by the Broad Institute
Genome pipeline stages:
- Alignment: BWA-MEM (Burrows-Wheeler Aligner)
- Base-quality score recalibration (BQSR): GATK (Genome Analysis Toolkit)
- Variant calling: GATK HaplotypeCaller (HTC) for germline variants, Mutect2 for tumor sequences
Applies to single-sample whole genome sequencing (WGS) or whole exome sequencing (WES) analysis.
Applications: Genome Pipeline
Genome pipeline stages with FPGA accelerators:
- Alignment: BWA-MEM, with Smith-Waterman offloaded to an FPGA accelerator
- Base-quality score recalibration: GATK
- Variant calling: HaplotypeCaller and Mutect2, with PairHMM offloaded to an FPGA accelerator
Motivation: How Does Using FPGA Accelerators Impact an Application's Out-of-Pocket Cost in Public Cloud Services?
Moving from a CPU-only solution to CPU + FPGA for HTC on the AWS cloud: latency improves 1.60x, but cost increases 2.58x.
Mocha: Multinode Optimization of Cost in Heterogeneous Cloud with Accelerators
Motivation (continued)
Moving from a CPU-only solution to CPU + FPGA for Mutect2 on the Huawei cloud: latency improves 1.49x, but cost increases 1.16x.
Why? FPGAs Are Not Cheap in the Public Cloud, and Amdahl's Law Applies
- On AWS, 1 FPGA = 25 vCPUs in price; on Huawei cloud, 1 FPGA = 23 vCPUs in price
- Amdahl's law: the theoretical speedup is always limited by the part of the task that cannot benefit from the improvement. For HTC, the PairHMM kernel takes 39% of the runtime, so application speedup is bounded by 1/(1 - 0.39) ≈ 1.64, as the sketch below illustrates.
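A minimal sketch of this bound, assuming only the kernel ratio r and the kernel speedup S quoted on the slides (the function name is illustrative):

    def amdahl_speedup(r, s):
        """Overall speedup when a fraction r of the runtime is
        accelerated by a factor s (Amdahl's law)."""
        return 1.0 / ((1.0 - r) + r / s)

    # HTC: the PairHMM kernel is r = 0.39 of the runtime.
    print(amdahl_speedup(0.39, 40))   # ~1.61 with a 40x kernel
    print(amdahl_speedup(0.39, 1e9))  # ~1.64: the 1/(1-r) asymptotic bound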
Challenges: Applications Are Different

Application (accelerated kernel) | r (kernel ratio) | S (kernel speedup)
BWA (Smith-Waterman)             | 0.4              | ~16x
Samtools (deflate)               | 0.2              | ~12x
GATK3.6 (PairHMM)                | ~0.39            | ~40x

BWA, Samtools, and GATK3.6 are all single-node, multi-threaded programs; BWA and Samtools are written in C, GATK3.6 in Java.
Performance Model on a CPU-FPGA Platform
"Well matched" FPGA vs. CPUs: the FPGA's kernel throughput matches the non-kernel throughput of the P CPU cores, e.g., r = 0.5, S = 8, so the matched core count P' = S(1-r)/r = 8 = P.
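A small sketch of this matching model, assuming P' = S(1-r)/r as implied by the worked numbers on this and the next two slides (the function name is illustrative):

    def matched_cores(r, s):
        """CPU cores whose non-kernel work rate matches one FPGA
        running the kernel with speedup s: P' = S * (1 - r) / r."""
        return s * (1.0 - r) / r

    print(matched_cores(0.5, 8))    # 8.0   -> well matched with P = 8
    print(matched_cores(0.39, 40))  # ~62.6 -> "fast FPGA", P' ~ 64 >> P = 8
    print(matched_cores(0.83, 40))  # ~8.2  -> "slow FPGA", P' ~ 8 << P = 32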
Performance Model on a CPU-FPGA Platform
Fast FPGA: e.g., r = 0.39, (1-r)/r = 1.56, S = 40, so P' ≈ 64. With only P = 8 CPU cores, 1 - 8/64 ≈ 87% of the FPGA capacity sits idle. Waste!
Performance Model on a CPU-FPGA Platform
Slow FPGA: e.g., r = 0.83, (1-r)/r = 0.2, S = 40, so P' ≈ 8. With P = 32 CPU cores, 32 - 8 = 24 CPU cores sit idle. Waste!
Proposed Optimization for a Fast FPGA: Sharing the FPGA Among Instances
Match the throughput of CPUs and FPGAs by letting multiple CPU instances share one FPGA over the network.
Slow FPGA: the FPGA Is the Bottleneck
All kernels are offloaded to the FPGA; P is the number of executor cores, starting from P = 1.
1. As P increases, runtime decreases.
2. Past a certain point (here P = 8), launching more executor cores no longer helps: the number of concurrently working cores stays at 8, the FPGA and its I/O become the bottleneck, and runtime stays flat for all P >= 8.
Proposed Optimization for a Slow FPGA: Co-execution on FPGA + CPU
Partial offload: co-execute on FPGA + CPU by splitting the kernel work into two groups, as the sketch below illustrates:
- M1 tasks: offload their acceleratable kernels to the FPGA
- M2 tasks: instead of offloading, schedule their kernels onto CPUs
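A toy calculation of the split, assuming tasks are divided so the FPGA path and the CPU path finish at the same time; this proportional rule and all names are my illustration, not the thesis' exact scheduler:

    def partial_offload_split(m, s, p_cpu):
        """Split m kernel tasks between an FPGA path (kernel speedup s,
        relative to one core) and p_cpu CPU cores so both paths finish
        together: m1 / s = m2 / p_cpu."""
        m1 = round(m * s / (s + p_cpu))
        return m1, m - m1

    # e.g., an FPGA worth s = 40 cores, plus 24 otherwise-idle cores
    print(partial_offload_split(1000, 40, 24))  # (625, 375)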
Mocha: Multinode Optimization of Cost in Heterogeneous Cloud with Accelerators
1. Profiling: kernel ratio r + speedup S -> matched core number P'
2. Setting up the cluster: for HTC on AWS, P' = 64; instances: f1.2x (8 vCPUs), m4.10x (40 vCPUs), m4.4x (16 vCPUs)
3. Mocha runtime: CPU clients <-> NAM <-> accelerator
Optimization Results: HTC on AWS EC2
- Latency improvement: 1.60x -> 1.58x, almost the same performance
- Cost: 2.58x -> 0.93x, a 2.77x cost-efficiency improvement
- f1.16x (64 vCPUs + 8 FPGAs) -> f1.2x + m4.16x (64 vCPUs + 1 FPGA): makes fuller use of a single FPGA
Optimization Results: Mutect2 on Huawei Cloud
- Latency improvement: 1.49x -> 2.27x, a 1.5x performance improvement (effective cores: original 32; with FPGA offload, 48/32 = 1.5; with Mocha, 72/32 = 2.25)
- Cost: 1.16x -> 0.76x, a 1.5x cost-efficiency improvement
- Effective CPUs: 8 -> 32, making more use of the CPUs
Problem Formulation: Public Cloud, Cost Optimization
- Samples: one genome sequence, or a batch of genome sequences?
- Deadline: 6 hours? 12 hours? 24 hours? No deadline constraint?
- Objective: minimum cost in a public cloud such as Amazon EC2; at population scale, even a small per-read saving adds up to billions of dollars!
Prior Work: ILP Formulations Are Common
- [DSL08] studied the cost of running scientific workloads (an astronomy application, Montage) on public clouds with different numbers of scheduled processors.
- [VVB10] formulated cost-optimal scheduling in public clouds as a linear program over instances with different memory, CPU, and data-communication overheads; storage is not considered.
Limitations of prior work: storage is not considered, and composable instances are not considered.
[DSL08] E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good, “The Cost of Doing Science on the Cloud: The Montage Example”, SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008.
[VVB10] R. Van den Bossche, K. Vanmechelen, and J. Broeckhove, “Cost-Optimal Scheduling in Hybrid IaaS Clouds for Deadline Constrained Workloads”, 2010 IEEE 3rd International Conference on Cloud Computing, 2010.
Public Cloud: Storage + Compute
Multiple Computation Instances
Different instances have different CPU types; within each CPU-type series, instances differ in the number of CPU cores, memory size, and price.
Multiple Storage Choices -> I/O-Aware Modeling
- SSD io1 price: $0.125 per GB-month; for 250 GB, 0.125 * 250 / 24 / 30 = $0.0434/hr
- SSD gp2 price: $0.100 per GB-month; for 250 GB, 0.1 * 250 / 24 / 30 = $0.0347/hr
- Doppio [ISPASS '18]: an I/O-aware analytic model that quantifies the I/O impact and customizes storage for optimal cost
Multiple Storage Choices: Size-Driven or IOPS-Driven?
For the genome pipeline, size matters: e.g., input fastq of 104 GB with output BAM of 110 GB for one sample, input fastq of 74 GB with output BAM of 80 GB for another, plus vcf (20 GB) and fastq (3 GB) files; a 250 GB local folder covers a run. Since capacity rather than IOPS drives the cost, choose the SSD gp2 price and update the cost table:
- SSD io1: $0.125 per GB-month, 0.125 * 250 / 24 / 30 = $0.0434/hr
- SSD gp2: $0.100 per GB-month, 0.1 * 250 / 24 / 30 = $0.0347/hr
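The GB-month-to-hourly conversion above as a one-line helper (a sketch; the names are illustrative, and the prices are the rates quoted on the slide):

    def volume_hourly_cost(price_per_gb_month, size_gb):
        """Convert a GB-month storage price into an hourly cost,
        assuming a 30-day (720-hour) month."""
        return price_per_gb_month * size_gb / (24 * 30)

    print(volume_hourly_cost(0.125, 250))  # io1: ~$0.0434/hr
    print(volume_hourly_cost(0.100, 250))  # gp2: ~$0.0347/hr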
Different Runtimes of Stages on Instances
(Table of Amazon EC2 instances and runtimes in seconds for the different stages.)
Scheduling by Solving a MILP (Mixed-Integer Linear Programming) Problem
Inputs:
- DDL: deadline to finish the whole genome-pipeline processing
- I: set of compute-instance types; M: maximum number of instances of each type
- N: set of whole genomes to be processed
- P_i: on-demand price of instance type i
- Stage_{i,t}: runtime (an integer) of instance type i on stage t
Objective: minimize total cost.
Constraints: details later.
Output: allocation variables indicating that genome s, stage t is allocated to the m-th instance of type i.
MILP Constraints
- Hardware resource constraints: if two tasks are scheduled onto the same instance, they must not overlap in time
- Deadline constraints: finish time <= DDL
- Task dependency constraints: BWA -> BQSR -> HTC
Solver: IBM CPLEX Studio. A simplified sketch of the formulation follows.
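A much-simplified sketch of this MILP in Python with the open-source PuLP modeler (the thesis uses IBM CPLEX): one genome, one instance per stage, no overlap constraints, and illustrative prices and runtimes rather than measured ones:

    import pulp

    # Illustrative data (not the thesis' measurements): price in $/s
    # and per-stage runtimes in seconds for two instance types.
    types = ["m4.4x", "f1.2x"]
    price = {"m4.4x": 0.8 / 3600, "f1.2x": 1.65 / 3600}
    runtime = {("m4.4x", "bwa"): 20000, ("f1.2x", "bwa"): 9000,
               ("m4.4x", "bqsr"): 8000, ("f1.2x", "bqsr"): 8000,
               ("m4.4x", "htc"): 30000, ("f1.2x", "htc"): 12000}
    stages = ["bwa", "bqsr", "htc"]
    DDL = 24 * 3600  # deadline in seconds

    prob = pulp.LpProblem("genome_cost", pulp.LpMinimize)
    x = pulp.LpVariable.dicts("x", (types, stages), cat="Binary")

    # each stage runs on exactly one instance type
    for t in stages:
        prob += pulp.lpSum(x[i][t] for i in types) == 1
    # deadline: stages run back to back (BWA -> BQSR -> HTC)
    prob += pulp.lpSum(x[i][t] * runtime[i, t]
                       for i in types for t in stages) <= DDL
    # objective: pay the chosen instance for each stage's duration
    prob += pulp.lpSum(x[i][t] * runtime[i, t] * price[i]
                       for i in types for t in stages)

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    print([(t, i) for t in stages for i in types
           if x[i][t].value() == 1])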
Scheduling Results for One Read
Cost under different time constraints:
- With a deadline as tight as 5.5 hours, it costs $70 to finish a WGS
- With a deadline as loose as 24.3 hours, it costs $10 to finish a WGS
Scheduling Results When Mocha Is Considered
Composable instances act as a "new" instance type (for HTC: f1.2x + m4.16x); the cost to meet the tight 5.5-hour deadline is reduced from $70 to $56.
Scheduling Results for Batches of Reads
- numSeq = 1: $10.7 per genome once ddl > 87,800 s
- numSeq = 2: $10.6 per genome
The cost saving from 1 -> 2 genomes comes from the reduced overhead of node setup, and the benefit is small. (Schedule: S0/S1 bwa -> S0/S1 bqsr -> S0/S1 htc.)
Batches of Genome Sequences (Reads)
Number of genome sequences = 2, 3, 4, 5:
- numSeq = 1: minCost reached when ddl > 87,800 s
- numSeq = 2: minCost … when ddl > …, savings …%
- numSeq = 3: minCost … when ddl > …, savings …%
- numSeq = 4: minCost … when ddl > …, savings …%
- numSeq = 5: minCost … when ddl > …, savings …%
Insight
Duplicating the numSeq = 2 configuration is good enough for any given number of sequences: the two-genome schedule (S0/S1 bwa -> S0/S1 bqsr -> S0/S1 htc) simply repeats.
Problem Formulation: Private Cloud, Throughput Optimization
- Given hardware: 56-core CPU, 2 TB disk
- Given a number of sequences (reads)
- Objective: minimum runtime
Private Cloud: Runtime Matters
The objective is to minimize runtime, under two resource limits:
- CPU cores on a server are limited (56 cores); a stage on P cores takes c0 + t0/P, a constant serial part c0 plus a parallel part t0 that shrinks with P (sketched below)
- Machine storage is limited, so parallel genomes are limited: on a 2 TB server, at most 2000 GB / 250 GB = 8 whole genomes can run in parallel
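A minimal sketch of the stage-runtime model and the storage cap; the constants below are illustrative placeholders, not the measured c0 and t0:

    def stage_runtime(c, t, p):
        """Runtime of one stage on p cores: a constant serial part c
        plus a parallelizable part t that scales as t / p."""
        return c + t / p

    # storage cap: ~250 GB of scratch space per whole genome
    MAX_PARALLEL = 2000 // 250  # 8 genomes on a 2 TB server
    print(MAX_PARALLEL, stage_runtime(5000.0, 580000.0, 56))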
Scheduling by Solving a MILP Problem
Inputs:
- R: set of cores in a server; Y: set of storage spaces, one per whole genome
- N: set of whole genomes to be processed; T: set of stages for all genomes
- c0, t0, c1, t1: serial and parallel time for stage 0 and stage 1 of genome processing; e.g., with 4 cores, a stage-0 task takes c0 + t0/4
Objective: minimize total runtime.
Constraints: details later.
Output: p_{i,j}, the number of cores allocated to genome i in stage j.
MILP Constraints
- Hardware resource constraints: if two tasks are scheduled onto the same CPU cores or the same storage space, they must not overlap in time
- Dependency constraints: if two tasks have a dependency, the start time of the latter task must be >= the end time of the former
Solver: IBM CPLEX Studio
Scheduling Results (Solver: IBM CPLEX)
Cores: 56; storage spaces: 8; number of genomes: 1, 2, 3, ... The first 4 cases each solve in under 5 minutes; beyond 4 genomes the solver runs for more than 24 hours (for #genomes = 5, still unfinished after 27 hours).

#Genomes | Optimal runtime (seconds)
1  | 26071
2  | 43074
3  | 60269
4  | 77080
5+ | unknown (prohibitive solve time)
Configuration of Optimal Results
- #genomes = 1: both stages use all 56 cores
- #genomes = 2: both stages of both genomes use 28 cores each
- #genomes = 4: both stages of all four genomes use 14 cores each
- #genomes = 3: split as evenly as possible per stage: 56 cores go 19/19/18 in stage 1 and 18/18/20 in stage 2 (explained later). Two genomes run on 19 cores (35,607 s) then 18 cores (24,662 s); the third runs on 18 cores (37,304 s) then 20 cores (22,596 s).
ILP Heuristics
For 4 < #genomes <= 8, CPLEX takes an extremely long time even in multithreaded mode. We put forward 3 heuristics, check their error rates against optimal results on 3 problem sizes, and pick the best heuristic to obtain a good runtime when 4 < #genomes <= 8. The 3 problem sizes (the first two manually constructed) are:
- Case A: 14 cores, 2 storage spaces
- Case B: 28 cores, 4 storage spaces
- Case C: 56 cores, 8 storage spaces
Heuristic 1: Construct Schedules for Many Genomes from Smaller Sequential Patterns
Computing the optimum for #genomes = 1 and 2 is fast for all 3 problem sizes, so we use these base results to sequentially construct a schedule for any number of genomes (see the sketch below):
- #genomes = 3: run 2 genomes, then 1
- #genomes = 4: 2 + 2
- #genomes = 5: 2 + 2 + 1
Optimal Results vs. Heuristic 1
Case A: 14 cores, 2 storage spaces. We know #genomes = 1 and 2 and sequentially compose them; e.g., #genomes = 3 costs the runtime of #genomes = 2 plus the runtime of #genomes = 1.

#Genomes | Optimal | Heuristic 1 | Error
1 | 77080  | 77080  | 0.000%
2 | 145091 | 145091 | 0.000%
3 | 221392 | 222171 | 0.552%
4 | 290182 | 290182 | 0.000%
5 | 365340 | 367262 | 0.526%
6 | 435273 | 435273 | 0.000%
7 | 510431 | 512353 | 0.377%
8 | 580364 | 580364 | 0.000%
Optimal Results vs. Heuristic 1
Case B: 28 cores, 4 storage spaces; same construction.

#Genomes | Optimal | Heuristic 1 | Error
1 | 43074  | 43074  | 0.000%
2 | 77080  | 77080  | 0.000%
3 | 113586 | 120154 | 5.782%
4 | 145091 | 154160 | 6.251%
5 | (prohibitive solve time) | 197234 | n/a
6 | (prohibitive) | 231240 | n/a
7 | (prohibitive) | 274314 | n/a
8 | (prohibitive) | 308320 | n/a
Optimal Results vs. Heuristic 1
Case C: 56 cores, 8 storage spaces.

#Genomes | Optimal | Heuristic 1 | Error
1 | 26071 | 26071 | 0.000%
2 | 43074 | 52142 |
3 | 60269 | 69145 | 14.727%
4 | 77080 | 86148 | 11.764%
5 | (prohibitive) | 112219 | n/a
6 | (prohibitive) | 129222 | n/a
7 | (prohibitive) | 155293 | n/a
8 | (prohibitive) | 172296 | n/a
Heuristic 2: Distribute Cores Among Genomes as Evenly as Possible
When #genomes <= #storage spaces, we can always distribute the cores evenly among the genomes. For example, in case C (56 cores, 8 storage spaces) with #genomes = 3, distribute the cores as 19, 19, 18 (56 in total) in both stages: 19 cores give 35,607 s + 23,575 s per genome, 18 cores give 37,304 s + 24,662 s, so the heuristic 2 makespan is 61,966 s (sketched below).
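A sketch of heuristic 2's even split under the c + t/p stage model from the formulation; c0, t0, c1, t1 below are illustrative placeholders rather than the measured constants:

    def even_split(cores, n):
        """Split cores among n genomes as evenly as possible."""
        base, extra = divmod(cores, n)
        return [base + (i < extra) for i in range(n)]

    def heuristic2(cores, n, c0, t0, c1, t1):
        """Makespan when every genome keeps its core share in both stages."""
        return max((c0 + t0 / p) + (c1 + t1 / p)
                   for p in even_split(cores, n))

    print(even_split(56, 3))  # [19, 19, 18]
    print(heuristic2(56, 3, 5000.0, 580000.0, 3000.0, 400000.0))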
Optimal Results vs. Heuristic 2
Case B: 28 cores, 4 storage spaces. When #genomes <= #storage spaces we distribute cores evenly; e.g., #genomes = 3 gets 9, 9, 10 cores.

#Genomes | Optimal | Heuristic 1 | Error | Heuristic 2 | Error
1 | 43074  | 43074  | 0.000% |        |
2 | 77080  | 77080  | 0.000% |        |
3 | 113586 | 120154 | 5.782% | 114864 | 1.125%
4 | 145091 | 154160 | 6.251% |        |
5 | (prohibitive) | 197234 | n/a | 188165 |
6 | (prohibitive) | 231240 |     | 222171 |
7 | (prohibitive) | 274314 |     | 259955 |
8 | (prohibitive) | 308320 |     | 290182 |
Optimal Results vs. Heuristic 2
Case C: 56 cores, 8 storage spaces; e.g., #genomes = 7 distributes the cores as 8, 8, 8, 8, 8, 8, 8. Heuristic 2 further reduces the gap to the optimal results and achieves a smaller makespan than heuristic 1: the error is 0% when mod(56, #genomes) = 0 and small otherwise.

#Genomes | Optimal | Heuristic 1 | Error | Heuristic 2
1 | 26071 | 26071 | 0.000% |
2 | 43074 | 52142 |        |
3 | 60269 | 69145 | 14.727% | 61966 (2.816% error)
4 | 77080 | 86148 | 11.764% |
5 | (prohibitive) | 103151 | n/a | 95628
6 | (prohibitive) | 129222 |     | 114864
7 | (prohibitive) | 146225 |     | 128089
8 | (prohibitive) | 172296 |     | 145091
Runtime Experiments
Platform: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30 GHz, R = 56 CPU cores, Y = 8 storage spaces; input: NA12878Garvan. Stage 0: bwa; stage 1: bqsr + htc. Average error < 4.2%.
Heuristic 2 vs. Optimal Results
With 56 cores and #genomes = 3, heuristic 2 can be refined toward the optimal configuration: if a genome is given fewer cores than the others in the first stage, it is given more cores in the second stage, which may reduce the makespan. Heuristic 2 keeps 19/19/18 in both stages (makespan 61,966 s: 35,607 + 23,575 for the 19-core genomes, 37,304 + 24,662 for the 18-core genome); the optimal schedule instead gives the 18-core genome 20 cores in stage 2 (37,304 + 22,596), for a makespan of 35,607 + 24,662 = 60,269 s.
Can This Apply to Other #seq? The "Balance-Aware" Heuristic 2
Checking #seq = 5 shows it does not always work: heuristic 2's even split gives 95,628 s, and redistributing cores in stage 2 only increases the runtime (with 56 cores over 5 genomes, the splits are 12 or 11 cores, e.g., 12 cores -> 53,418 s and 34,997 s, 11 cores -> 57,813 s and 37,815 s per stage).
Summary: Case C, 56 Cores, 8 Storage Spaces
A composed variant (*heuristic 2) achieves a smaller makespan than plain heuristic 2 at some points: for #seq = 6, we can reuse the optimal runtime of #seq = 3 in case B (28 cores, 4 storage spaces) on each half of the server.

#Genomes | Optimal | Heuristic 1 | Error | Heuristic 2 | *Heuristic 2
1 | 26071 | 26071 | 0.000% |        |
2 | 43074 | 52142 |        |        |
3 | 60269 | 69145 | 14.727% | 61966 (2.816%) |
4 | 77080 | 86148 | 11.764% |        |
5 | (prohibitive) | 103151 | n/a | 95628  |
6 | (prohibitive) | 129222 |     | 114864 | 113586
7 | (prohibitive) | 146225 |     | 128089 |
8 | (prohibitive) | 172296 |     | 145091 |
For #genomes > #storage spaces (= 8)
Heuristic: construct a schedule for any number of genomes from the base cases with #genomes <= #storage spaces. For case C (56 cores, 8 storage spaces) and #genomes > 8, two methods:
- Method 1: treat every 8 genomes as a batch, and pick the last batch's configuration from the base cases
- Method 2: an ILP that introduces variables base_case_1, base_case_2, ..., base_case_8 counting how often each base case is used (continued on the next slide)
Method 2: ILP
Input parameters, variables, objective function, and constraints (formulas given on the slide).
Results of Method 1 vs. Method 2 for N = 9..100
Method 2 (composing base cases via ILP) achieves a smaller makespan than method 1 (picking the last batch from the base cases) for some #genomes:
- Difference of 191 s at #genomes = 11: method 1 uses 8 + 3, method 2 uses 7 + 4
- Difference of 2,499 s at #genomes = 14: method 1 uses 8 + 6, method 2 uses 7 + 7
- Difference of 1,543 s at #genomes = 21: method 1 uses 8 + 8 + 5, method 2 uses 7 + 7 + 7
A dynamic-programming sketch of this composition follows.
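A sketch of the composition with dynamic programming in place of method 2's ILP (a DP over batch sizes searches the same compositions); the base-case makespans are the best values from the case C tables above, and the printout reproduces the 191 s, 2,499 s, and 1,543 s differences:

    import math

    # best known makespans T[k] (seconds) for k genomes on the
    # 56-core / 8-storage-space server, from the earlier tables
    T = {1: 26071, 2: 43074, 3: 60269, 4: 77080,
         5: 95628, 6: 113586, 7: 128089, 8: 145091}

    def method1(n):
        """Greedy batching: full batches of 8, then one leftover batch."""
        full, rest = divmod(n, 8)
        return full * T[8] + (T[rest] if rest else 0)

    def method2(n):
        """Best sequential composition of base cases (DP over batch sizes)."""
        best = [0] + [math.inf] * n
        for i in range(1, n + 1):
            for k in range(1, min(8, i) + 1):
                best[i] = min(best[i], best[i - k] + T[k])
        return best[n]

    for n in (11, 14, 21):
        print(n, method1(n) - method2(n))  # 191, 2499, 1543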
Conclusions
- Public cloud cost optimization: composable instances optimize the cost compared to prior work
- Private cloud throughput optimization: a "balance-aware" splitting heuristic
References and Acknowledgement
Related publications before the thesis proposal:
- [FCCM '16] P. Zhou, H. Park, Z. Fang, J. Cong, and A. DeHon, “Energy Efficiency of Full Pipelining: A Case Study for Matrix Multiplication”, IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2016.
- [ISPASS '18] P. Zhou, Z. Ruan, Z. Fang, M. Shand, D. Roazen, and J. Cong, “Doppio: I/O-Aware Performance Analysis, Modeling and Optimization for In-Memory Computing Framework”, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2018. Best Paper Candidate (4 out of 21 accepted).
After the thesis proposal:
- [FCCM '18] J. Cong, P. Wei, C.H. Yu, and P. Zhou, “Latte: Locality Aware Transformation for High-Level Synthesis”, IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2018.
Acknowledgement:
- Energy Pipeline: Zhenman Fang, Hyunseok Park, et al.
- Doppio: Zhenyuan Ruan, Zhenman Fang, et al.
- Latte: Cody Hao Yu, Peng Wei, et al.
- Mocha: Jiayi Sheng, Di Wu, Cody Hao Yu, Muhuan Huang, Jie Wang, et al.
Thank you! Q&A
Backup Slides
ILP Input Parameters
(Parameter definitions given on the slide.)
ILP: Problem Size
N: number of whole genomes; R: number of cores in a server; Y: set of storage spaces. Variable counts: $2N$, $2NR$, $NY$, $1$, $\binom{2N}{2} - N$, and $\binom{N}{2}$; in total $6N + 5NR + NY + 5\binom{N}{2}$.
ILP: Objective and Constraints
Objective: minimize the makespan $X$. Recoverable constraint fragments: $E_{i,j} \le X$ (one per task, $2N$ in total) and core-allocation equalities over $p_{i,j}$ ($2N$ in total); the remaining per-group counts on the slide are $N$, $2N$, and $N$.
ILP: Constraint Counts
- If two tasks are scheduled on the same core, they cannot overlap: $(\binom{2N}{2} - N) \cdot 4$ ordering constraints plus $(\binom{2N}{2} - N) \cdot R$ core-sharing constraints
- If two genomes are scheduled onto the same storage space, they cannot overlap: $\binom{N}{2} \cdot 4$ plus $\binom{N}{2} \cdot Y$ constraints
- Total: $9N + (\binom{2N}{2} - N)(4 + R) + \binom{N}{2}(4 + Y) = 9N + 20\binom{N}{2} + 4\binom{N}{2}R + \binom{N}{2}Y$
Outline
- Introduction
- Chip-Level Performance and Energy Modeling
  - Energy Efficiency of Full Pipelining (Chapter 2)
  - Latte: Locality Aware Transformation for High-Level Synthesis (Chapter 3)
- Node-Level Performance and Cost Modeling
  - Computation resources (Chapter 4)
  - Doppio: I/O-aware performance analysis, modeling and optimization for in-memory computing frameworks; storage resources (Chapter 5)
- Cluster-Level Performance and Cost Modeling
  - Mocha: Multinode Optimization of Cost in Heterogeneous Cloud with Accelerators (Chapter 6)
  - Applications of modeling and optimization: genome pipeline, public cloud cost optimization (Chapter 7)
  - Private cloud performance optimization (Chapter 8)
- Conclusion