The Pitfalls and Guidelines for Mobile ML Benchmarking


1 The Pitfalls and Guidelines for Mobile ML Benchmarking
In this talk, I'm going to describe how we use benchmark data to drive the design of neural network models and software implementations, as well as to influence hardware vendors' implementation decisions. A lot of this process is very research oriented, and we welcome the research community to participate in the flow and advance the state of the art. Fei Sun, Software Engineer, AI Platform, Facebook

2 Mobile ML Deployment at Facebook

3 ML Deployment at Facebook
Machine learning workloads used at Facebook: neural style transfer, person segmentation, object detection, classification, etc. These need refinement.

4 Neural Style Transfer Neural style transfer is the effect that applies the style of a style image to the content of a content image. With our engineers' work, we managed to render neural style transfer on real-time video.

5 Person Segmentation Mermaid effect in Instagram
The mermaid effect in Instagram uses person segmentation to replace the background with an artificially rendered underwater scene.

6 Object Detection Object detection is a building block for augmented reality. In the image on the left, the cup with BFF on it is not a real cup on the table but a phone-rendered one. In the image on the right, the flowers on the plant are also not real. We can only add these effects after we detect the objects.

7 Pitfalls of Mobile ML Benchmarking

8 Hardware Variations CPU, GPU, DSP, NPU
Where server-side hardware variation is in the single digits or low tens, hardware variation on mobile is ten to a hundred times larger. How useful is it if the hardware can only run the models in the benchmark suite efficiently? All of these units may sit on one SoC; how do we specify which are used and which are not?
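To make that last question concrete, here is a minimal Python sketch of how a benchmark run could declare exactly which compute units it exercises; the field names and values (benchmark_spec, compute_units_used, and so on) are hypothetical, not the platform's actual format.

    # A minimal sketch (not the platform's actual format) of how a benchmark run
    # could declare which compute units on the SoC are exercised, so results are
    # not silently attributed to the whole chip.
    benchmark_spec = {
        "model": "mobilenet_v2",            # hypothetical model name
        "soc": "example-soc-845",           # hypothetical SoC identifier
        "compute_units_used": ["cpu_big_cluster", "gpu"],
        "compute_units_idle": ["dsp", "npu"],
        "threads": 4,
        "precision": "fp16",
    }

    def describe(spec):
        """Render the declared benchmark condition as a one-line summary."""
        used = "+".join(spec["compute_units_used"])
        return f'{spec["model"]} on {spec["soc"]} using {used} ({spec["precision"]}, {spec["threads"]} threads)'

    print(describe(benchmark_spec))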

9 Software Variations ML frameworks (e.g. PyTorch/Caffe2, TFLite)
Kernel libraries (e.g. NNPACK, Eigen), compute libraries (e.g. OpenCL, OpenGL ES), proprietary libraries, different build flags. This space is far more fragmented than its server counterpart, and mobile software support is far less mature; many implementations have bugs. How useful is it if the software can only run the models in the benchmark suite bug-free?

10 OEM Influence Some performance-related knobs are not tunable in user space; they require a rooted or specially built phone, yet OEM/chipset vendors can change them for different scenarios. Some algorithms (e.g. scheduling) are hard-coded by OEM/chipset vendors but influence performance a lot.

11 Run-to-Run Variation Run-to-run variation on mobile is very pronounced:
differences between iterations of the same run, and differences between runs. How do we correlate the metrics collected from benchmark runs with metrics from production runs? How do we specify the benchmark condition?
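As an illustration of these questions, the Python sketch below summarizes within-run and between-run variation with a median, a p90, and a coefficient of variation; the latency numbers are made up, and this is not the platform's reporting format.

    # A minimal sketch, assuming latencies are collected in milliseconds, of how
    # within-run and between-run variation could be summarized so a benchmark
    # condition can be reported alongside a single headline number.
    import statistics

    def summarize(latencies_ms):
        lat = sorted(latencies_ms)
        p50 = statistics.median(lat)
        p90 = lat[int(0.9 * (len(lat) - 1))]
        cv = statistics.pstdev(lat) / statistics.mean(lat)  # coefficient of variation
        return {"p50_ms": p50, "p90_ms": p90, "cv": round(cv, 3)}

    # Two hypothetical runs of the same model on the same phone.
    run_a = [31.2, 30.8, 33.5, 31.0, 45.1, 31.4]   # one thermally throttled outlier
    run_b = [30.9, 31.1, 31.3, 30.7, 31.0, 31.2]

    print("run A:", summarize(run_a))   # within-run variation
    print("run B:", summarize(run_b))
    print("between-run p50 gap (ms):",
          abs(summarize(run_a)["p50_ms"] - summarize(run_b)["p50_ms"]))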

12 How to Avoid Benchmark Engineering
Is it useful if the hardware only runs models in the benchmark suite efficiently? If the software is only bug-free when running models in the benchmark suite? If a different binary is used to benchmark each model? If the binary becomes bloated because it includes different algorithms for different models? I don't have solutions here, but I'd like to ask a few questions.

13 Guidelines of Mobile ML Benchmarking

14 Key Components of Mobile ML Benchmarking
The key components are: relevant mobile ML models; representative mobile ML workloads; and a unified, standardized methodology for evaluating the metrics in different scenarios. I want to emphasize the third point: given the large variation across hardware platforms and software frameworks, kernels, and libraries, it is very important to design a methodology that is benchmark-engineering proof. Once we have such a methodology, we can use the same methodology to evaluate our proprietary models and compare the results with published benchmark results.

15 Make Mobile ML Repeatable
Too many variables and pitfalls. Have another person judge the benchmarking result. Easily reproduce the result, easily separate out the steps, and easily visualize results from different benchmark runs. It is difficult to give guidelines on ML benchmarking because conditions vary so widely; the best judge is another person. As long as another person is satisfied with the benchmarking methodology, it is usually fine.

16 Facebook AI Performance Evaluation Platform

17 Challenges of ML Benchmarking
What ML models matter? How to enhance ML model performance? How to evaluate system performance? How to guide future HW design? Neural network algorithms are computation intensive: if you launch a camera effect on your phone, you may notice the phone getting hot after a while, because it drains too much power. The Facebook app is deployed to billions of phones with different CPU micro-architectures, GPUs, DSPs, and accelerators, yet we ship one app that has to cover the entire spectrum of device capabilities. We want to help all hardware vendors improve their efficiency running neural network workloads; as the hardware becomes more efficient, the Facebook app uses less power. Specifically, we'd like hardware vendors to optimize the important models, which raises the question: which neural network models matter? We also want to help them improve neural network model performance on their hardware and iterate faster, assist them in evaluating against the market in a fair, honest, and representative way, and, lastly, help them with future hardware designs.

18 Goals of AI Performance Evaluation Platform
Provide a model zoo of important models. Normalize the benchmarking metrics and conditions. Automate the benchmarking process. Measure performance honestly.

19 Backend & Framework Agnostic Benchmarking Platform
Same code base deployed to different backends. Vendors plug their backend into the platform. Each backend's build/run environment is unique. The benchmark metrics can be displayed in different formats. Good-faith approach. Results are easy to reproduce.
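As a rough illustration of the "plug in your backend" idea, the sketch below shows one way a backend could sit behind a small driver interface seen by the harness; the class and method names are hypothetical, not the platform's actual API.

    # A minimal sketch of a backend-agnostic driver interface: the harness talks
    # to every backend through the same two methods, and each vendor supplies its
    # own implementation. Names are illustrative only.
    from abc import ABC, abstractmethod

    class BackendDriver(ABC):
        @abstractmethod
        def build(self, model_path: str) -> str:
            """Compile/convert the model for this backend; return an artifact path."""

        @abstractmethod
        def run(self, artifact: str, inputs: str) -> dict:
            """Execute the benchmark and return raw metrics (latency, accuracy, ...)."""

    class CPUReferenceDriver(BackendDriver):
        """Toy driver standing in for a vendor-provided implementation."""
        def build(self, model_path: str) -> str:
            return model_path + ".compiled"

        def run(self, artifact: str, inputs: str) -> dict:
            return {"latency_ms": 42.0, "backend": "cpu-reference", "artifact": artifact}

    driver = CPUReferenceDriver()
    print(driver.run(driver.build("mobilenet_v2.onnx"), "inputs.npz"))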

20 Benchmarking Flow To help the vendors, we propose a benchmarking procedure composed of three steps. First, there is a centralized model zoo containing representative models that we care about. The models can be in the ONNX format (Open Neural Network Exchange) or any other model format; the key is that the models are comparable, so the benchmark has standardized inputs. Second, because there are many different kinds of hardware platforms, such as GPU, CPU, phone, or embedded, and each platform has its own optimal build and run environment, the benchmarking is done in a distributed manner. A single benchmark harness is deployed to all platforms so that benchmarking on each platform follows the same procedure and collects the same metrics, although part of the harness is customized for each backend platform to get the best out of it. Third, the collected metrics are sent to a centralized place for analysis, where people can filter, plot, and compare the data with ease.
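A minimal sketch of these three steps, with hypothetical names and endpoints; it is meant only to show the shape of the flow, not the platform's real implementation.

    # Step 1: pull a model from a centralized zoo with standardized inputs.
    def fetch_model(zoo_url: str, name: str) -> str:
        return f"{zoo_url}/{name}.onnx"   # pretend this downloads and returns a local path

    # Step 2: distributed benchmarking; the same harness procedure on every device.
    def run_on_device(device: str, model: str) -> dict:
        return {"device": device, "model": model, "latency_ms": 30.0}  # placeholder result

    # Step 3: centralized collection so results can be filtered, plotted, compared.
    def publish(results: list) -> None:
        for r in results:
            print("uploading:", r)

    model = fetch_model("https://example.com/model-zoo", "mobilenet_v2")
    publish([run_on_device(d, model) for d in ["pixel-3", "galaxy-s9", "iphone-x"]])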

21 Benchmark Harness Diagram
[Diagram: benchmark harness architecture. A model repository (models plus metadata) feeds a repo driver with regression detection, which creates benchmark tasks. A benchmark driver runs data preprocess and postprocess pipelines on the benchmark inputs and sends results to reporters (files, screen, online). Framework-specific backend drivers (Caffe2 driver, custom framework driver) sit behind a standard API with data converters in front of each backend. The harness is provided by Facebook; the backends are provided by the vendors.]

22 Current Features Supports Android, iOS, Windows, and Linux devices.
Supports PyTorch/Caffe2, TFLite, and other frameworks. Supports latency, accuracy, FLOPs, power, and energy metrics. Saves results locally or remotely, or displays them on screen. Handles git and hg version control. Caches models and binaries locally. Continuously runs benchmarks on new commits. A/B testing for regression detection.
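As a sketch of what A/B regression detection can look like, the Python below compares latencies from two builds and flags a regression only when the slowdown exceeds the baseline's run-to-run noise; the threshold logic is illustrative, not the platform's actual algorithm.

    # A minimal sketch of A/B regression detection between two commits: benchmark
    # both builds and flag a regression only if the slowdown exceeds the noise floor.
    import statistics

    def is_regression(base_ms, candidate_ms, min_slowdown=0.05):
        base_p50 = statistics.median(base_ms)
        cand_p50 = statistics.median(candidate_ms)
        noise = statistics.pstdev(base_ms) / base_p50          # relative run-to-run noise
        slowdown = (cand_p50 - base_p50) / base_p50
        return slowdown > max(min_slowdown, 2 * noise)

    base      = [30.1, 30.4, 29.9, 30.2, 30.0]   # commit A latencies (ms)
    candidate = [33.8, 34.1, 33.9, 34.0, 34.2]   # commit B latencies (ms)
    print("regression detected:", is_regression(base, candidate))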

23 Open Source Facebook AI Performance Evaluation Platform
We encourage the community to contribute.

24 Questions

