DataLines: a framework for building streaming data applications. Mike Haberman, Senior Software/Network Engineer
The Problem Data deluge: routers, switches, IDS, servers (web, mail, logs, etc.), software (tcpdump, web100, SNMP, tarpit, etc.), sensors, taps, … (help me)
The Problem (continued) –Disparate data formats –Software (sometimes) to manage each –Tweaking to get what you want (custom software) –Correlating data (more custom software)
DataLines Can we build a framework that can remove all (most) of the tedium of working with all these disparate data formats?
DataLines Framework designed to manage and build streaming data processing applications
DataLines Framework Manage: we would like one tool to handle all these different data sources.
DataLines Framework Build: a uniform way of creating a data processing application.
DataLines Framework Streaming data: –Never-ending stream of ‘manageable’ chunks of data –No random access, no blocking operators –One look, linear or sub-linear algorithms/data ops –Each data item (a tuple in DataLines) is an independent entity –Many tools were not designed for streaming data
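The "one look" constraint can be illustrated with a one-pass statistic. The sketch below is plain Python, not DataLines code; `RunningStats` and `stream` are hypothetical names, and Welford's method is used here only as a familiar example of a linear-time, constant-memory operation on a stream.

```python
class RunningStats:
    """One-pass mean and variance (Welford's method); constant memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Each value is folded in once and never revisited.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance; 0.0 until at least one value has been seen.
        return self._m2 / self.count if self.count else 0.0


def stream():
    # Stand-in for an unbounded source; items arrive one at a time.
    for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        yield value


stats = RunningStats()
for value in stream():
    stats.update(value)
```

Because `update` keeps only three numbers, the same object could in principle sit on a stream that never ends.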
DataLines Framework Processing: something you want to do to the data (e.g. reading, writing, parsing, event generation, filtering, statistics, reports, data synopsis, …)
DataLines Creating a DataLines application: XML file → “compile” → DataLines application
DataLines XML file defines 3 major components: –Data Processors: what one does with the data –Processing Order: the order in which the processors will operate on the data –Event Management: what to do when a processor generates an event
DataLines Processors Data Processors are the heart of DataLines. –I/O: socket, file –Filters: inline, dispatch –Collectors: binning, windowing (w/operators) –GUI: charts, picture taking –Converters: binary to tuple –Misc: printers, counters, iterators, timers, data generators, gates, delays Processors can generate events. Processors can drop, mutate, or mutilate the tuple being processed, or generate new tuples.
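As a rough illustration of two processor kinds named above, here is a sketch of an inline filter (drops tuples) and a binning collector with an operator (generates new tuples). It is plain Python with tuples modeled as dicts; the function names, the field names, and the generator-based design are all assumptions, not the DataLines library.

```python
def inline_filter(tuples, predicate):
    """Filter processor: drop every tuple that fails the predicate."""
    for t in tuples:
        if predicate(t):
            yield t


def binning_collector(tuples, field, bin_size, op=sum):
    """Collector processor: gather `bin_size` tuples, then emit one new
    tuple holding op(values). A partial bin at end of input is not emitted."""
    bucket = []
    for t in tuples:
        bucket.append(t[field])
        if len(bucket) == bin_size:
            yield {"count": bin_size, field: op(bucket)}
            bucket = []


# Hypothetical input: (port, bytes) records from some traffic source.
source = (
    {"port": p, "bytes": b}
    for p, b in [(80, 100), (22, 50), (80, 300), (80, 20), (80, 80)]
)
web_only = inline_filter(source, lambda t: t["port"] == 80)
result = list(binning_collector(web_only, "bytes", bin_size=2))
```

Each processor looks at a tuple once and passes it on, so neither needs random access to the stream.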
DataLines Pipelines –Control tuple movement among processors –Can connect either processors or other pipelines –Two paths within a pipeline: binary and tuple
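One way to picture a pipeline that connects both processors and other pipelines is to make the pipeline itself callable, so it can be used as a stage. The sketch below is hypothetical Python, not DataLines code: the `Pipeline` class, the stage functions, and the assumed binary record layout (two unsigned 16-bit big-endian integers) are all invented for illustration.

```python
import struct


class Pipeline:
    """Runs an item through its stages in order. A stage is any callable,
    including another Pipeline. A stage returning None drops the item."""

    def __init__(self, *stages):
        self.stages = stages

    def __call__(self, item):
        for stage in self.stages:
            item = stage(item)
            if item is None:
                return None
        return item


def binary_to_tuple(raw):
    # Converter: where the binary path turns into the tuple path.
    port, length = struct.unpack("!HH", raw)
    return {"port": port, "length": length}


def only_web(t):
    return t if t["port"] == 80 else None


def add_kilobytes(t):
    return {**t, "kb": t["length"] / 1024}


tuple_path = Pipeline(only_web, add_kilobytes)   # pipeline of processors
app = Pipeline(binary_to_tuple, tuple_path)      # pipeline containing a pipeline

kept = app(struct.pack("!HH", 80, 2048))
dropped = app(struct.pack("!HH", 22, 512))
```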
Event Management Allow processors to signal an event –timers, open/close, client connects, etc. Allow the user to tie in domain logic. Allow the user to call a processor-specific API.
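The three points above can be sketched together: a processor signals an event, a user handler supplies domain logic, and the handler calls an API specific to that processor. This is a hypothetical Python sketch; `EventManager`, `Counter`, the event name `limit_reached`, and `reset()` are assumptions and do not come from DataLines.

```python
from collections import defaultdict


class EventManager:
    """Maps event names to user-supplied handlers."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def on(self, event_name, handler):
        self._handlers[event_name].append(handler)

    def fire(self, event_name, source):
        for handler in self._handlers[event_name]:
            handler(source)


class Counter:
    """Counts tuples and signals an event when `limit` is reached."""

    def __init__(self, events, limit):
        self.events = events
        self.limit = limit
        self.count = 0

    def process(self, t):
        self.count += 1
        if self.count == self.limit:
            self.events.fire("limit_reached", self)
        return t

    def reset(self):
        # Processor-specific API that a handler may call.
        self.count = 0


alerts = []


def on_limit(counter):
    alerts.append(f"saw {counter.count} tuples")  # user's domain logic
    counter.reset()                               # processor-specific call


events = EventManager()
events.on("limit_reached", on_limit)
counter = Counter(events, limit=3)
for i in range(7):
    counter.process({"seq": i})
```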
DataLines Data The generalization of data is a DlTuple. A tuple is just a set of values. DlTuple is the interface processors use: –String[] <-- getFieldNames() –DlValue <-- getValue(fieldname)
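The two-method interface can be mirrored in a few lines. The sketch below is Python rather than the original implementation language, so `get_field_names` and `get_value` stand in for `getFieldNames()` and `getValue(fieldname)`; values are returned as plain Python objects because the slides do not describe `DlValue`. `DictTuple` and `describe` are hypothetical.

```python
from abc import ABC, abstractmethod


class DlTuple(ABC):
    """Sketch of the interface processors program against."""

    @abstractmethod
    def get_field_names(self):
        """Return the field names as a list of strings."""

    @abstractmethod
    def get_value(self, field_name):
        """Return the value stored under `field_name`."""


class DictTuple(DlTuple):
    """Hypothetical concrete tuple backed by a dict."""

    def __init__(self, fields):
        self._fields = dict(fields)

    def get_field_names(self):
        return list(self._fields)

    def get_value(self, field_name):
        return self._fields[field_name]


def describe(t):
    # A processor needs only the interface, not the concrete tuple type.
    return ", ".join(
        f"{name}={t.get_value(name)}" for name in t.get_field_names()
    )


line = describe(DictTuple({"src": "10.0.0.1", "bytes": 1500}))
```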
DataLines Data Tuples can have virtual fields –calculated values, static values. Tuples can have composite fields. The creation of the tuple is left to the processor in charge of conversion.
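A minimal sketch of virtual and composite fields, assuming a calculated field is a function of the tuple and a composite field is simply a value with its own sub-fields. `VirtualTuple` and all field names are hypothetical, and the way DataLines actually declares such fields in XML is not reproduced here.

```python
class VirtualTuple:
    """Tuple with stored fields plus virtual (calculated and static) fields."""

    def __init__(self, stored, calculated=None, static=None):
        self._stored = dict(stored)
        self._calculated = dict(calculated or {})  # name -> function(tuple)
        self._static = dict(static or {})          # name -> constant

    def get_field_names(self):
        return (
            list(self._stored) + list(self._calculated) + list(self._static)
        )

    def get_value(self, name):
        if name in self._stored:
            return self._stored[name]
        if name in self._calculated:
            # Calculated fields are evaluated on every read.
            return self._calculated[name](self)
        return self._static[name]


record = VirtualTuple(
    stored={
        "A": 3,
        "B": 4,
        "endpoint": {"host": "10.0.0.1", "port": 80},  # composite field
    },
    calculated={"total": lambda tup: tup.get_value("A") + tup.get_value("B")},
    static={"site": "lab"},
)
```

To a downstream processor, `total` and `site` look no different from stored fields, which is the point of making them virtual.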
XML Syntax … run away!
Data Example
Data Example ${A} + ${B}
DataLines Tutorial Fast forward past a painful 3-hour tutorial covering each of those sections in detail (tuples, processors, pipelines, event management, configurations). You have seen all the XML though!
DataLines Distilled A library of data processors that operate on “Tuples” –one of the processors takes the raw data and creates the tuple. An XML compiler that takes the XML file and the library, and creates an application.
DataLines Example
DataLines in use DataLines does make it easier to hit the ground running. Much of the tedious work you need to do is taken care of. For highly specific needs, you still need to write code, but that code then becomes part of the DataLines library that others can build on.
Balance Sheet Positive: –Flexible (vendor neutral, data, debugging) –Reusable (pipelines, processors) –Fast development time –“Easy” to change the client (CLI, desktop, web page) Negative: –May need to write domain-specific code –Learning curve: processor config, data expectations, events
DataLines in Action Network Engineering group –Monitor router, tar pit, IDS, packet sampling, L2/L3 mappings Security Group –Network forensics Intergroup Wiring –Use DataLines to share data between groups/projects
DataLines in Action Network Research group –Monitor cluster network activity from MPI layer –Data Mining –Misc. NSF data oriented projects
Future Open Source More Info: (a work in progress)