Published by Anika Cranmer. Modified over 9 years ago.
Integrating Cybersecurity Log Data Analysis in Hadoop
Bryan Stearns, Susan Urban, Sindhuri Juturu
Texas Tech University
2014 NSF Research Experience for Undergraduates Site Program

Abstract
In cybersecurity, the growth of Advanced Persistent Threats (APTs) and other sophisticated attacks makes it increasingly important to analyze network and system activities from all event sources. If logs are recorded through different software packages, the resulting big data can be in a form known as dirty data that is difficult to merge and cross-analyze. Additionally, storing logs in a big data architecture like Apache Hadoop can make data joins time-consuming. Merging data as it is stored rather than at access can greatly simplify unified analysis, yet doing so requires knowledge of what kinds of merges will be required. Services wishing to holistically analyze dirty data are required to merge information externally each time data is pulled. This research is developing a file system called HVID (Hadoop Value-oriented Integration of Data) that will represent dirty log data as a single table while maintaining its raw form using a novel variant of column-oriented storage. This system utilizes the open-source big data system Hadoop HBase to enable fast access to unified views without the need for predetermined joins. This design will allow more natural and efficient holistic analysis of stored cybersecurity data both with external mining applications and with local MapReduce tools.

Introduction
o The amount of unstructured or semi-structured data recorded in cybersecurity endeavors is growing every day [1].
o Increasingly sophisticated cybersecurity threats require large-scale holistic pattern analysis for proper detection [1][2].
o Heterogeneous “dirty” data generated from multiple network sources needs customized unification to be useful for such purposes [3].
o MapReduce is desirable to analyze unstructured data, but existing unification methods structure data externally from unstructured storage platforms, requiring additional I/O to feed merged data back to storage [3].
o A better method is needed to unify and manipulate disparate information from across services in a network!

Methods
Equipment
o 64-bit single-node virtual Linux machine
o IBM BigInsights V3.0.0.0
o Station with 8-thread 1.87 GHz processor and 8 GB RAM
Process
o Design – The design proceeded on two fronts: abstract and physical. Abstract design focused on the data to be merged, while physical design focused on the selection, implementation, and optimization of features available in Hadoop.
o Implement – The designed data architecture was created along with basic access features. Java was used for system creation and interfacing.
o Test – Basic speed tests were performed on working features. Generic system time properties at the start and completion of operations were used for tests.

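The Test step above (reading the system time at the start and completion of an operation) can be sketched in Java roughly as follows. This is only an illustration, not the authors' test harness: the class `OperationTimer`, its nested `Timed` type, and the choice of `System.nanoTime()` are assumptions introduced here, since the poster does not say which time source was used.

```java
import java.util.function.Supplier;

// Hypothetical helper (not from the poster): times a single operation
// by reading the system clock immediately before and after it runs.
final class OperationTimer {

    /** Result of one timed run: the operation's return value and the elapsed time. */
    static final class Timed<T> {
        final T value;
        final long elapsedNanos;

        Timed(T value, long elapsedNanos) {
            this.value = value;
            this.elapsedNanos = elapsedNanos;
        }
    }

    /** Runs the operation once and records the elapsed time in nanoseconds. */
    static <T> Timed<T> time(Supplier<T> operation) {
        // Assumption: System.nanoTime() is used because it is monotonic;
        // the original tests may have used a different clock.
        long start = System.nanoTime();
        T value = operation.get();
        long end = System.nanoTime();
        return new Timed<>(value, end - start);
    }

    public static void main(String[] args) {
        // Stand-in workload; a real test would wrap a load or retrieval call.
        Timed<Integer> run = time(() -> {
            int sum = 0;
            for (int i = 1; i <= 1000; i++) {
                sum += i;
            }
            return sum;
        });
        System.out.println("result=" + run.value
                + " elapsedMs=" + (run.elapsedNanos / 1_000_000.0));
    }
}
```

A single start/stop measurement like this is coarse: it includes JVM warm-up and, on a single-node virtual machine, noise from the host, which is consistent with the poster's caution that speed tests need a full cluster to be meaningful.
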
Objectives
Design and prototype a file system that:
o Runs in Hadoop
o Supports structured, semi-structured, and unstructured data
o Provides quick access to merged tables
o Does not restrict what columns are used for merging
o Supports MapReduce operations upon merged data

Implications
o Faster value-based retrieval of data
o Eliminate the need to individually merge tables containing shared features
o Reduce the I/O needed for holistic unstructured MapReduce analysis

Conclusion:
The HVID design:
o Unifies data into a common format via a unique value-based structure
o Allows heterogeneous datasets to be merged by any field
o Supports external or internal data mining and manipulation
o Supports internal MapReduce on merged data
o Should allow value-based merge and join queries to be run in comparable time to plain select queries
o Requires less space than comparable row-oriented solutions*
o Can utilize backup copies for improved data interconnectivity
o Requires further research and development
o Provides a means to unite dirty cybersecurity data in storage without the need to explicitly outline how information should be merged until it is needed.

References:
[1] A. A. Cárdenas, P. K. Manadhata, and S. P. Rajan, "Big Data Analytics for Security," IEEE Security & Privacy, vol. 11, pp. 74-76, 2013.
[2] A. K. Sood and R. J. Enbody, "Targeted Cyberattacks: A Superset of Advanced Persistent Threats," IEEE Security & Privacy, vol. 11, p. 7, 2013.
[3] T.-F. Yen, A. Oprea, K. Onarlioglu, T. Leetham, W. Robertson, A. Juels, et al., "Beehive: large-scale log analysis for detecting suspicious activity in enterprise networks," presented at the Proceedings of the 29th Annual Computer Security Applications Conference, New Orleans, Louisiana, 2013.

Design:
o An inverted variant of column-oriented storage was developed in which values are the primary key and row IDs are the dependent value.
o This physical association of rows with shared values allows dynamic views on relations without lengthy scans and comparisons.
o HBase was chosen as the HVID platform for its flexibility, support for read-intensive applications, and its ability to integrate with Hadoop MapReduce.
o Value-oriented data reside in row key byte arrays for quick scanning.
o Rows are sorted for fast collection of values from contiguous ranges.
o Large unstructured data is stored in a separate row-oriented table.
o This row-oriented form can be used to store backups of value-oriented data, while enhancing row ID resolution of value-based queries.

Future Work:
o Complete working implementation for data lookup
o Compare functionality with and without backup row-oriented records
o Modify BulkLoad to support multi-table output from a single MapReduce
o Analyze row key structure for region balancing
o Load implementation onto cluster
o Benchmark various operations in various cluster configurations
o Create Hive interface for increased functionality and ease-of-use
o Create automatic upload system and interface
o Create web-interface for database access
o Explore support for varying data-types within a field via qualifiers
o Explore inverted clustering techniques based on inverted data

DISCLAIMER: This material is based upon work supported by the National Science Foundation and the Department of Defense under Grant No. CNS-1263183. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or the Department of Defense.

Results:
o Custom database generation and access tools were created for HBase using Java. Full functionality is not yet complete, but load and basic data retrieval have been implemented for the row-oriented segment of HVID.
o Whether row-oriented versions of data should be kept alongside value-oriented tables remains unknown until a full prototype is built and merge speeds are tested under various configurations.
o While results are inconclusive, selection of rows through value-oriented storage shows promise.

Preliminary Testing:
o Collective storage consumption by value-based HVID tables was found to be 86% of that required for a classic HBase table.
o Further compression can be realized when employing numeric data.
o Value-based access is not yet fully implemented for testing.
o Results show little difference between HVID and classic HBase for basic row-oriented retrieval (when using rows as a backup form).
o The system must be tested on a full Hadoop cluster before any speed tests can contain significant meaning.

Figure 1: Value-Based storage

Source Data

T1      A    B    C
row1    a1   b1   c1
row2    a2   b2   c2
row3    a3   b3   c3

T2      B    C    D
row1    b1   c2   d1
row2    b2   c4   d2
row3    b4   c5   d3

Classic HBase Row-Oriented storage

T1
row1    {(“A”, “a1”), (“B”, “b1”), (“C”, “c1”)}
row2    {(“A”, “a2”), (“B”, “b2”), (“C”, “c2”)}
row3    {(“A”, “a3”), (“B”, “b3”), (“C”, “c3”)}

T2
row1    {(“B”, “b1”), (“C”, “c2”), (“D”, “d1”)}
row2    {(“B”, “b2”), (“C”, “c4”), (“D”, “d2”)}
row3    {(“B”, “b4”), (“C”, “c5”), (“D”, “d3”)}

vs

HVID Storage

A       a1:T1:row1   a2:T1:row2   a3:T1:row3
B       b1:T1:row1   b1:T2:row1   b2:T1:row2   b2:T2:row2   b3:T1:row3   b4:T2:row3
C       c1:T1:row1   c2:T1:row2   c2:T2:row1   c3:T1:row3   c4:T2:row2   c5:T2:row3
D       d1:T2:row1   d2:T2:row2   d3:T2:row3

rows
T1:row1    {(A, a1), (B, b1), (C, c1)}
T1:row2    {(A, a2), (B, b2), (C, c2)}
T1:row3    {(A, a3), (B, b3), (C, c3)}
T2:row1    {(B, b1), (C, c2), (D, d1)}
T2:row2    {(B, b2), (C, c4), (D, d2)}
T2:row3    {(B, b4), (C, c5), (D, d3)}

Figure 2: Preliminary Space and Time Comparisons
[Chart not recoverable from the extracted text. Values shown without their labels: 116, 11500, 5500, 108, 128.]
* Tests used a 25 MB .tsv text file as source data.
** Only Row IDs were selected. Pulling content remains to be implemented.

* When using non-duplicated value-oriented usage
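The value-oriented layout in Figure 1 can be mimicked in memory to show why sorted "value:table:row" keys make value lookups and merges cheap. The sketch below is not the HVID implementation and does not use the HBase API: the class `ValueIndexSketch` and its methods are hypothetical, a `TreeSet` stands in for HBase's sorted row keys, and it assumes values never contain the ':' separator.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for the value-oriented tables of Figure 1.
final class ValueIndexSketch {
    // Field name -> sorted keys of the form "value:table:row".
    private final Map<String, TreeSet<String>> keysByField = new TreeMap<>();

    /** Indexes one source row: each (field, value) pair becomes a key "value:table:row". */
    void put(String table, String row, Map<String, String> fields) {
        for (Map.Entry<String, String> e : fields.entrySet()) {
            keysByField.computeIfAbsent(e.getKey(), f -> new TreeSet<>())
                    .add(e.getValue() + ":" + table + ":" + row);
        }
    }

    /** Row IDs ("table:row") holding the value in the field, found by one contiguous range scan. */
    List<String> rowsWithValue(String field, String value) {
        List<String> rowIds = new ArrayList<>();
        TreeSet<String> keys = keysByField.get(field);
        if (keys == null) {
            return rowIds;
        }
        // ';' is the character right after ':', so this half-open range
        // covers exactly the keys that start with "value:".
        for (String key : keys.subSet(value + ":", value + ";")) {
            rowIds.add(key.substring(value.length() + 1));
        }
        return rowIds;
    }

    /** Values of a field that occur in more than one row, with the rows sharing them. */
    Map<String, List<String>> sharedValues(String field) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String key : keysByField.getOrDefault(field, new TreeSet<>())) {
            String value = key.substring(0, key.indexOf(':'));
            groups.computeIfAbsent(value, v -> new ArrayList<>())
                    .add(key.substring(value.length() + 1));
        }
        groups.values().removeIf(rows -> rows.size() < 2);
        return groups;
    }

    /** Builds an ordered field map from alternating name/value arguments. */
    static Map<String, String> row(String... pairs) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < pairs.length; i += 2) {
            fields.put(pairs[i], pairs[i + 1]);
        }
        return fields;
    }

    /** Loads the T1 and T2 source data shown in Figure 1. */
    static ValueIndexSketch figureOneExample() {
        ValueIndexSketch idx = new ValueIndexSketch();
        idx.put("T1", "row1", row("A", "a1", "B", "b1", "C", "c1"));
        idx.put("T1", "row2", row("A", "a2", "B", "b2", "C", "c2"));
        idx.put("T1", "row3", row("A", "a3", "B", "b3", "C", "c3"));
        idx.put("T2", "row1", row("B", "b1", "C", "c2", "D", "d1"));
        idx.put("T2", "row2", row("B", "b2", "C", "c4", "D", "d2"));
        idx.put("T2", "row3", row("B", "b4", "C", "c5", "D", "d3"));
        return idx;
    }

    public static void main(String[] args) {
        ValueIndexSketch idx = figureOneExample();
        System.out.println(idx.rowsWithValue("C", "c2")); // [T1:row2, T2:row1]
        System.out.println(idx.sharedValues("B"));        // {b1=[T1:row1, T2:row1], b2=[T1:row2, T2:row2]}
    }
}
```

Because keys that share a value sit next to each other in sorted order, the rows that would match in a join on field B or C are read from one contiguous range rather than found by comparing every row of T1 against every row of T2. Resolving the returned row IDs to full records would then go through the separate row-oriented "rows" table.
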