Modular Abstraction of Complex Real Time Analysis. Benjamin P. Campbell. Faculty Advisor: Dr. Joel Henry, Ph.D. Department of Computer Science, University of Montana
Introduction – Big Data? Quantities of data that extend beyond traditional mechanisms of processing. Data growth is on par with Moore's law and growing at an exponential rate.
Introduction – Properties of Big Data. Volume: The expansion of data in one dimension dramatically expands the volume of data. Velocity: Rate of growth in participation. Variety: Introduction of new data types beyond textual or tabular data.
Introduction: Traditional Big Data mechanisms use batch processing to iterate over large quantities of data. The ATLAS experiment at CERN immediately discards 199,999 of every 200,000 data points and still produces 19 gigabytes of data per second [1]. Real-time streaming is the analysis of data while it is in motion.
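The difference between batch processing and streaming analysis can be sketched in a few lines of Python. This is only an illustration: the function names `batch_mean` and `streaming_mean` are hypothetical and are not part of IBM InfoSphere Streams or any library mentioned in this talk.

```python
from typing import Iterable, Iterator, List


def batch_mean(values: List[float]) -> float:
    """Batch style: the whole data set must be available before computing."""
    return sum(values) / len(values)


def streaming_mean(values: Iterable[float]) -> Iterator[float]:
    """Streaming style: emit an updated mean as each value arrives."""
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
        yield total / count


if __name__ == "__main__":
    readings = [2.0, 4.0, 6.0]
    print(batch_mean(readings))            # one result after all data is seen
    print(list(streaming_mean(readings)))  # one result per incoming value
```

The streaming version never needs the full data set in memory, which is the property that matters when the data cannot be stored before it is analyzed.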
Background
IBM InfoSphere Streams. Data paradigms: Real Time Analysis, Distributed. Advantages: Comprehensive C/C++ API [2], Academic Licensing.
Background – Disadvantages: With greater functionality comes greater complexity. Unix runtime environment. C-like domain-specific language.
Research Question: How would a modular language integration into the IBM InfoSphere Streams runtime ease implementation of streaming analysis for domain experts? Or: How can we make it easier for the people who understand the data to use the power of IBM InfoSphere?
Methods: What general tools are domain experts using now to process and compute with data? R – Statistical Analysis. Python: NumPy – Numerical tools; SciPy – Regression/Clustering/Computations; Pandas; Rpy – Python Binding for R.
Methods – Python C API: The Python interpreter can be embedded within any C/C++ application using the Python C API [3]. Python objects within the interpreter can be accessed and manipulated through the API. The API is designed to facilitate multithreaded interaction with a single interpreter through Python's built-in threading mechanisms.
Findings – Python Abstraction Model
Findings – Streams Python C Integration: Using the primitive operator mechanism within IBM InfoSphere, I was able to embed a Python interpreter into the executable. Using the Python C API, the Streams operator creates an instance of the StreamsFilter class or a child class. The main methods of the Operator class are wrappers around the Python object instance.
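The wrapper relationship described above can be sketched in pure Python. This is a stand-in under stated assumptions, not the actual operator: only the class name `StreamsFilter` comes from the slides, while the `process` method, the `DropNegative` child class, and the `OperatorWrapper` class are hypothetical names chosen for illustration. In the real system the wrapper role is played by the C++ primitive operator calling into the interpreter through the Python C API.

```python
from typing import Optional


class StreamsFilter:
    """Base class named in the slides; its method names here are assumed."""

    def process(self, tup: dict) -> Optional[dict]:
        """Return the tuple to pass downstream, or None to drop it."""
        return tup


class DropNegative(StreamsFilter):
    """Example child class: drops tuples whose 'value' field is negative."""

    def process(self, tup: dict) -> Optional[dict]:
        return tup if tup["value"] >= 0 else None


class OperatorWrapper:
    """Stand-in for the C++ primitive operator (hypothetical name)."""

    def __init__(self, namespace: dict, class_name: str) -> None:
        # Look the class up by name, much as an embedding host would fetch
        # an attribute from an imported module before instantiating it.
        cls = namespace[class_name]
        if not issubclass(cls, StreamsFilter):
            raise TypeError(f"{class_name} must derive from StreamsFilter")
        self._instance = cls()

    def process(self, tup: dict) -> Optional[dict]:
        # The operator method is only a thin wrapper around the instance.
        return self._instance.process(tup)


if __name__ == "__main__":
    op = OperatorWrapper({"DropNegative": DropNegative}, "DropNegative")
    print(op.process({"value": 3}))   # forwarded unchanged
    print(op.process({"value": -1}))  # dropped
```

Keeping the wrapper thin means all analysis logic lives in the Python class, so a domain expert would only need to write the child class.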
Further Research: Exploring the possibilities of type dynamics: Python is able to handle any type given through the interface, but the primitive operator types are set at compile time. Abstracting further elements of the Streams runtime. Allowing users to define stream graphs within Python. Allowing users to submit scripts through a web interface.
Conclusion: Through the integration of a Python interpreter into an IBM InfoSphere Streams primitive operator, it is possible to begin writing filters and operators in a more general and simplified language. This gives domain experts the capability to quickly design and prototype computational elements in an environment they already know.
Conclusion – Why does this matter? Logistics, Healthcare
Conclusion – Why does this matter? Big Science, Finance
Acknowledgments: UM Faculty: Eric Tangedahl; Brian Steele, Ph.D. UM Graduate/Alumni: Evin Ozer; Kegan Kabil.
References:
1. Brumfiel, Geoff. High-energy physics: Down the petabyte highway. Nature 469 (7330):
2. IBM InfoSphere Streams 3.0 Information Center, last modified 2012,
3. Python/C API Reference Manual, Python v2.7.6 documentation, last modified 2014,
Methods – Streams Flow Model
Methods – Operator Model
Challenges: Abstracting the Streams data model to a Python class. Handling dataflow in a concurrent and multithreaded environment. Implementing the Python C interface within the IBM InfoSphere Streams primitive operator.
Findings – Concurrency Challenges: Because method calls were made concurrently, the embedded Python threading support was used to give each thread access to a single Python class instance.
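The pattern of several threads sharing one Python class instance can be sketched as follows. This is a pure-Python illustration, not the operator's actual code: `CountingFilter` and `run_workers` are hypothetical names, and the explicit lock stands in for whatever synchronization the embedding host performs. In an embedded C/C++ host, each calling thread would additionally need to hold the interpreter's global lock, for example through `PyGILState_Ensure` and `PyGILState_Release` from the Python C API; whether the operator used exactly those calls is an assumption.

```python
import threading


class CountingFilter:
    """Hypothetical single shared instance that counts the tuples it sees."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.seen = 0

    def process(self, tup: dict) -> dict:
        # Guard the shared counter so concurrent calls cannot lose updates.
        with self._lock:
            self.seen += 1
        return tup


def run_workers(shared: CountingFilter, workers: int, per_worker: int) -> None:
    """Call process() on the one shared instance from several threads."""

    def work() -> None:
        for i in range(per_worker):
            shared.process({"value": i})

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    instance = CountingFilter()
    run_workers(instance, workers=4, per_worker=1000)
    print(instance.seen)
```

Any state kept on the shared instance needs this kind of protection, since every operator thread reaches the same object.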
Findings – Python Implementation
Findings – Python Example User Class
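A user class of the kind a domain expert might write could look like the sketch below. It is hypothetical throughout: only the base class name `StreamsFilter` comes from the slides, and the `process` method, the `reading` field, and the `OutlierFilter` class are assumptions made for illustration. The filter keeps running statistics with Welford's online algorithm and forwards only readings that lie far from the mean seen so far.

```python
import math
from typing import Optional


class StreamsFilter:
    """Minimal stand-in for the base class named in the slides."""

    def process(self, tup: dict) -> Optional[dict]:
        return tup


class OutlierFilter(StreamsFilter):
    """Forward only readings more than z_limit standard deviations away
    from the running mean of all earlier readings."""

    def __init__(self, z_limit: float = 3.0) -> None:
        self.z_limit = z_limit
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def process(self, tup: dict) -> Optional[dict]:
        x = tup["reading"]

        # Judge the new reading against statistics of the earlier ones.
        is_outlier = False
        if self.count >= 2:
            std = math.sqrt(self.m2 / (self.count - 1))
            is_outlier = std > 0 and abs(x - self.mean) > self.z_limit * std

        # Welford's online update; the new reading (outlier or not)
        # becomes part of the running statistics.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

        return tup if is_outlier else None


if __name__ == "__main__":
    flt = OutlierFilter(z_limit=3.0)
    for value in [10.0, 11.0, 9.0, 10.0, 50.0]:
        print(value, flt.process({"reading": value}))
```

The class holds only three numbers of state, so it can run indefinitely on an unbounded stream without storing past readings.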