Metadata Management of Terabyte Datasets from an IP Backbone Network: Experience and Challenges
Sue B. Moon and Timothy Roscoe
5/25/2001 NRDM
Overview
  Sprint IP Monitoring Project
  Types of Data
  Types of Analysis
  Experience and Challenges
  Metadata Abstractions and Model
  Design and Implementation
Sprint IP Monitoring Project
  Design goal: acquire data without sampling and without loss of accuracy
  System components:
    –Linux PC with 3 PCI buses and 100 GB
    –DAG card with OC-3 to OC-48 support and GPS
    –SAN-based analysis platform
    –Data repository
Configuration at Monitored PoP
  [Figure: configuration at the monitored PoP (label: customer)]
Analysis Platform and Data Repository at Sprint ATL
Types of Collected Data
  Packet trace of 50 to 100 GB
    –44-byte packet header + 12-byte framing info per packet
  BGP routing tables
  IS-IS tables
  PoP configuration (topology)
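The per-packet record size above gives a rough sense of how many packets one trace holds. A minimal sketch, assuming that "GB" means 10**9 bytes and that a trace consists only of fixed-size 56-byte records (neither is stated on the slide); the function name is illustrative:

```python
# Record layout taken from the slide: 44-byte packet header + 12-byte framing.
HEADER_BYTES = 44
FRAMING_BYTES = 12
RECORD_BYTES = HEADER_BYTES + FRAMING_BYTES  # 56 bytes per captured packet

# Assumption: "GB" means 10**9 bytes; the slides do not say.
GB = 10**9


def packets_in_trace(trace_bytes: int) -> int:
    """Number of whole fixed-size records that fit in a trace of this size."""
    return trace_bytes // RECORD_BYTES


for size_gb in (50, 100):
    count = packets_in_trace(size_gb * GB)
    print(f"{size_gb} GB trace: about {count:,} packets")
```

Under these assumptions a single trace holds on the order of 0.9 to 1.8 billion packet records.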
Types of Analysis
  Simple statistics gathering
  Isolation of TCP flows
  Trace correlation
  Generation of traffic matrices
Challenges
  Total amount of data > 10 TB
    –What to keep on-line and off-line
  Sharing data and results
    –What has been computed/generated
  Correlating different types of data
    –E.g. packet traces with routing tables
  Determining software dependencies
  Reproducibility of results
Task Abstraction
  Storage of data
    –Ad-hoc solution: disk arrays, SAN, tape library
  Source code maintenance
    –CVS
  Metadata management
    –Our focus in this work
Metadata Abstraction
  Raw input data sets
  Result data sets
  Analysis programs
    –Versions of software
  Analysis operations
    –Relate data sets to programs
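The four abstractions above could be written down as plain record types. This is only a sketch of one possible shape: every class name, field name, and sample identifier below is hypothetical and not taken from the authors' system.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class RawDataSet:
    dataset_id: str  # hypothetical identifier
    kind: str        # e.g. "packet trace", "BGP routing table"


@dataclass(frozen=True)
class ResultDataSet:
    dataset_id: str
    description: str


@dataclass(frozen=True)
class AnalysisProgram:
    name: str
    version: str     # the software version that produced a result


DataSet = Union[RawDataSet, ResultDataSet]


@dataclass
class AnalysisOperation:
    """One run of a program: which data sets went in, which came out."""
    program: AnalysisProgram
    inputs: List[DataSet]
    outputs: List[ResultDataSet] = field(default_factory=list)


# Illustrative data only; these identifiers are made up.
trace = RawDataSet("trace-001", "packet trace")
bgp = RawDataSet("bgp-001", "BGP routing table")
flows = ResultDataSet("flows-001", "TCP flows isolated from trace-001")
matrix = ResultDataSet("matrix-001", "traffic matrix")

op1 = AnalysisOperation(AnalysisProgram("tcp_flow_isolation", "1.0"), [trace], [flows])
# A result data set can itself be the input of a later operation.
op2 = AnalysisOperation(AnalysisProgram("traffic_matrix", "2.1"), [flows, bgp], [matrix])
```

Keeping the operation as its own record, separate from data sets and programs, is what lets a result be traced back to the exact inputs and software version that produced it.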
Design and Implementation
  Dependency graph stored as a relational database schema (RDBMS)
  Interaction with version control
    –Software major release
  Linkage to data storage system
    –Make raw data set self-describing
    –Metadata independent of data location
  User interface
    –Browsing the DB through a GUI
    –Capturing analysis operations with simple command scripts
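A dependency graph of this kind can be held in a handful of relational tables. The sketch below uses SQLite as a stand-in RDBMS; the table names, column names, helper functions, and sample identifiers are all assumptions made for illustration, not the schema the authors built.

```python
import sqlite3

# Hypothetical schema: data sets and programs are nodes, operations link them.
SCHEMA = """
CREATE TABLE dataset (
    id   TEXT PRIMARY KEY,
    kind TEXT NOT NULL CHECK (kind IN ('raw', 'result'))
);
CREATE TABLE program (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    release TEXT NOT NULL,          -- major release from version control
    UNIQUE (name, release)
);
CREATE TABLE operation (
    id         INTEGER PRIMARY KEY,
    program_id INTEGER NOT NULL REFERENCES program(id)
);
CREATE TABLE operation_input (
    operation_id INTEGER NOT NULL REFERENCES operation(id),
    dataset_id   TEXT NOT NULL REFERENCES dataset(id)
);
CREATE TABLE operation_output (
    operation_id INTEGER NOT NULL REFERENCES operation(id),
    dataset_id   TEXT NOT NULL REFERENCES dataset(id)
);
"""

# All data sets that a given data set was derived from, directly or indirectly.
LINEAGE = """
WITH RECURSIVE ancestors(dataset_id) AS (
    SELECT i.dataset_id
    FROM operation_output o
    JOIN operation_input i ON i.operation_id = o.operation_id
    WHERE o.dataset_id = ?
    UNION
    SELECT i.dataset_id
    FROM ancestors a
    JOIN operation_output o ON o.dataset_id = a.dataset_id
    JOIN operation_input i ON i.operation_id = o.operation_id
)
SELECT dataset_id FROM ancestors ORDER BY dataset_id
"""


def record_operation(db, name, release, inputs, outputs):
    """Record one analysis run, as a command-script wrapper might."""
    db.execute("INSERT OR IGNORE INTO program (name, release) VALUES (?, ?)",
               (name, release))
    (program_id,) = db.execute(
        "SELECT id FROM program WHERE name = ? AND release = ?",
        (name, release)).fetchone()
    op_id = db.execute("INSERT INTO operation (program_id) VALUES (?)",
                       (program_id,)).lastrowid
    for ds in inputs:
        # Inputs not seen before are assumed to be raw data sets.
        db.execute("INSERT OR IGNORE INTO dataset (id, kind) VALUES (?, 'raw')", (ds,))
        db.execute("INSERT INTO operation_input VALUES (?, ?)", (op_id, ds))
    for ds in outputs:
        db.execute("INSERT OR IGNORE INTO dataset (id, kind) VALUES (?, 'result')", (ds,))
        db.execute("INSERT INTO operation_output VALUES (?, ?)", (op_id, ds))
    return op_id


def lineage(db, dataset_id):
    return [row[0] for row in db.execute(LINEAGE, (dataset_id,))]


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
record_operation(db, "tcp_flow_isolation", "1.0", ["trace-001"], ["flows-001"])
record_operation(db, "traffic_matrix", "2.1", ["flows-001", "bgp-001"], ["matrix-001"])
print(lineage(db, "matrix-001"))  # ['bgp-001', 'flows-001', 'trace-001']
```

Data sets are referenced by identifier only, never by path, which is one way to keep the metadata independent of where the data currently lives.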
Conclusion and Future Work
  Flexible and minimally intrusive
  Extensions:
    –Automatic storage management
    –Result caching
    –Job scheduling
    –Automation of analysis
  Will results be easily reproducible?
  Will users adapt to the new discipline?
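Of the extensions listed, result caching follows most directly from recorded metadata: if the same program release has already been run on the same inputs, the earlier result can be reused. The slides give no design for this, so the sketch below is one hypothetical approach; the class, the function names, and the assumption that input order does not matter are all illustrative.

```python
import hashlib
from typing import Dict, Iterable, Optional


def cache_key(program: str, release: str, input_ids: Iterable[str]) -> str:
    """Derive a stable key from the program, its release, and its inputs."""
    # Assumption: the order of input data sets does not affect the result.
    parts = [program, release, *sorted(input_ids)]
    return hashlib.sha256("\0".join(parts).encode("utf-8")).hexdigest()


class ResultCache:
    """In-memory map from (program, release, inputs) to a result data set id."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}

    def lookup(self, program: str, release: str,
               input_ids: Iterable[str]) -> Optional[str]:
        return self._results.get(cache_key(program, release, input_ids))

    def store(self, program: str, release: str,
              input_ids: Iterable[str], result_id: str) -> None:
        self._results[cache_key(program, release, input_ids)] = result_id


cache = ResultCache()
cache.store("traffic_matrix", "2.1", ["flows-001", "bgp-001"], "matrix-001")

# Same program release and same inputs: the earlier result is found.
hit = cache.lookup("traffic_matrix", "2.1", ["bgp-001", "flows-001"])
# A different release is treated as a different computation.
miss = cache.lookup("traffic_matrix", "2.2", ["flows-001", "bgp-001"])
```

Including the release in the key means a new software version never silently reuses results computed by an older one.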