Open Source on .NET
A real-world use case
Where it all began
- Analysing huge datasets
- Apache Spark (HDInsight)
- Various formats (CSV, JSON, XML)
- Row-based formats are generally slow
- We needed a columnar format
  - Apache Parquet
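Why row-based formats can be difficult
[Diagram: a table stored row by row (Row 1, Row 2, Row 3). Getting Column 2 means reading all data, when only a subset is needed.]
Columnar formats
[Diagram: the same table stored column by column (Column 1, Column 2, Column 3). Getting Column 2 means reading only the needed subset.]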
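The difference between the two layouts can be sketched in a few lines of plain Python. This is only an illustration, not how Parquet or Parquet.Net stores data: the table, the column names (`id`, `city`, `amount`) and the helper functions are all made up for the example, and "values touched" is a stand-in for bytes read from disk.

```python
# Toy table: three rows, three columns (all names are illustrative assumptions).
rows = [
    {"id": 1, "city": "London", "amount": 10.0},
    {"id": 2, "city": "Leeds", "amount": 12.5},
    {"id": 3, "city": "London", "amount": 7.5},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_based(rows, name):
    """Row layout: every field of every row is read to reach one column."""
    values_touched = 0
    total = 0.0
    for row in rows:
        values_touched += len(row)  # the whole row has to be loaded
        total += row[name]
    return total, values_touched


def sum_columnar(columns, name):
    """Columnar layout: only the requested column is read."""
    column = columns[name]
    return sum(column), len(column)


columns = to_columns(rows)
row_total, row_touched = sum_row_based(rows, "amount")
col_total, col_touched = sum_columnar(columns, "amount")

print("row-based:", row_total, "values touched:", row_touched)
print("columnar: ", col_total, "values touched:", col_touched)
```

Both layouts give the same total, but the row-based pass touches every value in the table while the columnar pass touches only the one column it was asked for.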
Parquet Format
[Diagram: a file divided into row groups (Row Group 1, Row Group 2), each holding column chunks.]
Column Chunk
- Fixed data type (int, string, etc.)
- Logical compression
  - Run-length encoding
  - Dictionary compression
  - Bit packing
  - etc.
- Bold compression (None, GZIP, Snappy)
- Statistics!
  - Min value
  - Max value
  - Number of unique values
  - Number of nulls
  - Used to skip unwanted data
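A minimal sketch of the ideas on this slide, in plain Python with no Parquet library. It is heavily simplified: real Parquet uses a hybrid run-length/bit-packing encoding and stores statistics in file metadata, whereas the function names, the sample data and the dict-based statistics here are assumptions made for illustration only.

```python
def dictionary_encode(values):
    """Replace each value with a small integer index into a list of distinct values."""
    dictionary = []
    index_of = {}
    indices = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def run_length_encode(values):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def chunk_statistics(values):
    """Per-chunk statistics: min, max, distinct count and null count."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "distinct": len(set(present)),
        "nulls": len(values) - len(present),
    }


def may_contain(stats, target):
    """False means the chunk can be skipped without reading its data."""
    if stats["min"] is None:
        return False
    return stats["min"] <= target <= stats["max"]


# Dictionary encoding, then run-length encoding of the indices.
cities = ["London", "London", "London", "Leeds", "Leeds", "London"]
dictionary, indices = dictionary_encode(cities)
runs = run_length_encode(indices)
# Bit packing idea: each index needs only this many bits, not a full integer.
bits_per_index = max(1, max(indices).bit_length())

# Statistics let a reader skip chunks that cannot hold the wanted value.
chunks = [[1, 4, 7, None], [20, 25, 31, 22]]
stats = [chunk_statistics(chunk) for chunk in chunks]
to_read = [i for i, s in enumerate(stats) if may_contain(s, 25)]

print(dictionary, indices, runs, bits_per_index)
print(stats, to_read)
```

Looking for the value 25, only the second chunk needs to be read: the first chunk's min and max rule it out before any of its data is touched.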
How we used to do it
- Expensive
- Slow
- Unsuitable
- Too much development effort
- Requires understanding Parquet internals
- Slow
- Deployment effort (even with Miniconda)
- + fastparquet
The Dream Came True
- Wouldn’t it be nice to run it on .NET?
- Developed, expressive language
- Great tooling
- Works everywhere!
- No heavy third-party dependencies (Apache Thrift.Core)
- No native dependencies (Google Snappy)
It’s on GitHub!
- Took 3 months and 3 people (evenings and weekends)
- More than 10 contributors now, and growing
- Used by our big-name clients
- Used by other companies
- Iterations take from hours to 1-2 days
- Completely open!
- In dialogue to include it in the main Apache repo
Use Cases
Demo: Parquet.Net Core, Spark + Scala
Azure Data Lake Analytics
- Custom Outputter
- Custom Extractor
- Parquet Files
Demo: Create a Parquet file with ADLA
Parquet Viewer for Windows 10
- Uses Parquet.Net for .NET Standard 1.4
- UWP is extremely fast compared to “modern” UI frameworks
- UWP is a perfect fit for CPU-heavy workloads
- Easy distribution model via the Store
- Works on any Windows device
- Showcase
Demo: Parquet Viewer
Works on Xbox One
Future Plans
- DataFrames
  - Open data science library built on top of Parquet.Net, with pandas-like structures and distributed computing
- Data Science Studio
  - Open platform for data preparation, data analysis, etc.
  - Runs on desktop (UWP), Azure Service Fabric, Kubernetes
Why OSS is Important
- Quality: a handful of devs vs thousands of devs
- Customisability: businesses can tweak it to their needs
- Freedom: no vendor (creator) lock-in
- Flexibility: you have a say in how resource-intensive the app should be
- Interoperability: OSS is much better at adhering to open standards than proprietary software
- Support options: generally free, with excellent documentation, forums, etc.
- Cost: get it for a fraction of the price
- Try before you buy: nothing to pay, see if you can adjust it
Why there is not much OSS in .NET
- .NET was traditionally closed source
- .NET was Windows-only
- Visual Studio was the only true IDE
- Other tech was more attractive to the academic community
- Licensing was a blocker to use in data centers
Config.Net
The easiest configuration framework for .NET developers
Storage.Net
Storage abstractions with implementations for .NET/.NET Standard
Thank you