Parallel Autonomous Cyber Systems Monitoring and Protection
December 8, 2009
Revision 1
Chris Archer
Cyber Challenges
* Zero-day exploits continually give adversaries new tactics.
* Current heuristic methods respond only to known tactics.
* The lag while heuristic methods catch up to a new exploit leaves a large window of opportunity for damage.
* This challenge overlaps with “traditional IT”: it is difficult to see a problem arising until it is too late.
* New tools are needed that can react at the speed of the network to previously unknown threats and problems.
Application and Extension of Unsupervised Clustering to Cyber Applications
* Unsupervised clustering has been developed to organize large amounts of data autonomously into hierarchical groups of similar data.
* Adding new information-based distance metrics (based on IG, et al.) to existing unsupervised clustering will allow us to find the needles in the haystack.
* Because of the scalability improvements (developed under NG IRAD) and the parallel nature of the approach, we can process extremely large volumes of data quickly.
* The approach adapts to information content; it does not rely on heuristics that become outdated.
* Potential for orders-of-magnitude improvement in the time required to react to new and changing threats and failures.
Application 1: Parallel Unsupervised Clustering Tree for Autonomous Firewall Packet Inspection
[Diagram: Shallow and Deep Packet Distance Metrics feed Multilayer Data-Driven Clustering. One Coarse Packet Clustering stage fans out to Fine Packet Clustering stages and then to Finer Packet Clustering stages, with parallelism increasing at each level. Cluster Recognition and Uniqueness Metric Thresholding leads to Alerts and Packet Management.]
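The clustering tree in the diagram can be sketched roughly as follows. This is a minimal illustration, not the clustering code referred to in the slides: it assumes a greedy "leader" clustering step, a two-level tree, and a uniqueness test based on a cluster's share of all items. The names `leader_cluster`, `two_level_tree`, `unique_leaves`, and every threshold are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def leader_cluster(items, distance, radius):
    """Greedy one-pass clustering (an assumed stand-in technique).

    Each item joins the first cluster whose leader (its first member)
    lies within `radius`; otherwise it starts a new cluster.
    """
    clusters = []
    for item in items:
        for cluster in clusters:
            if distance(cluster[0], item) <= radius:
                cluster.append(item)
                break
        else:
            clusters.append([item])
    return clusters


def two_level_tree(items, coarse, fine, coarse_radius, fine_radius, workers=4):
    """Coarse clustering first, then fine clustering of each coarse cluster.

    The coarse clusters are independent, so the fine pass is mapped over a
    pool. Threads only illustrate the structure; CPU-bound metrics in
    CPython would need processes or a GPU to gain real speed.
    """
    coarse_clusters = leader_cluster(items, coarse, coarse_radius)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(
            pool.map(lambda c: leader_cluster(c, fine, fine_radius), coarse_clusters)
        )


def unique_leaves(tree, total, max_share):
    """Return leaf clusters holding at most `max_share` of all items."""
    return [
        leaf
        for branch in tree
        for leaf in branch
        if len(leaf) / total <= max_share
    ]
```

With toy packets written as `(port, length)` tuples, a port-equality coarse metric, and a length-difference fine metric, a single oversized packet on a busy port ends up alone in its own leaf and is the only leaf returned by `unique_leaves`.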
Packets and Data
[Figure: Continuous Data Flow from Time t to Time t+D. The flow starts as Normal, Known Information; then something new develops:]
* Problem
* Attack
* New Pattern of Usage
Finding Needles in Large Haystacks
* Information metrics separate out data of different characteristics.
* Auto-summarization is based on clusters:
  * Large clusters get summarized often.
  * Small clusters do not get summarized.
* No more hiding in the data:
  * Use a protocol with a lot of traffic to hide, or use a distributed approach? Won’t work: the information content of the packets will be different and will separate out.
  * Overwhelm the processing so it falters? A highly parallel, efficient implementation can process large amounts of data, and failure modes can even be designed to fail ‘gracefully’.
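The auto-summarization rule above can be illustrated with a short sketch. It assumes clusters arrive as plain lists and that "large" means more members than a chosen cutoff; the function name `summarize`, the `summarize_above` parameter, and the report layout are hypothetical.

```python
def summarize(clusters, summarize_above):
    """Collapse large clusters to a count plus one example; keep small ones whole.

    `summarize_above` is an assumed size cutoff, not a value from the slides.
    """
    report = []
    for members in clusters:
        if len(members) > summarize_above:
            # Common traffic: report only how much of it there is.
            report.append(
                {"kind": "summary", "count": len(members), "example": members[0]}
            )
        else:
            # Rare traffic: keep every item so nothing is hidden.
            report.append(
                {"kind": "detail", "count": len(members), "items": list(members)}
            )
    return report
```

The effect is that high-volume traffic shrinks to a line per cluster, while the small clusters, where an unusual packet would land, stay fully visible.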
Deep and Shallow Packet Distance Metrics
* Shallow distance metric based on header information:
  * Ports, IP addresses, length, flags
* Deep distance metric based on packet content:
  * The more expensive metric can be used in the lower levels, which are processed in parallel.
  * Can include content information metrics.
* Separating the two leads to a highly parallel implementation.
* Possible fast, cheap implementation using CUDA on Nvidia graphics cards.
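One way the two metrics might look is sketched below. The slides do not define either metric, so both are assumptions: the shallow metric counts mismatched header fields, and the deep metric uses normalized compression distance over the payload as a stand-in for a content information metric. The `Packet` class and its field names are hypothetical.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical minimal packet record for illustration."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    length: int
    flags: int
    payload: bytes = b""


def shallow_distance(a: Packet, b: Packet) -> float:
    """Cheap header-only distance in [0, 1]."""
    mismatches = [
        a.src_ip != b.src_ip,
        a.dst_ip != b.dst_ip,
        a.src_port != b.src_port,
        a.dst_port != b.dst_port,
        a.flags != b.flags,
    ]
    # Relative length gap contributes one more term in [0, 1].
    length_gap = abs(a.length - b.length) / max(a.length, b.length, 1)
    return (sum(mismatches) + length_gap) / (len(mismatches) + 1)


def _compressed_size(data: bytes) -> int:
    return len(zlib.compress(data, 9))


def deep_distance(a: Packet, b: Packet) -> float:
    """Normalized compression distance on payloads (roughly 0 to 1).

    Payloads that share content compress well together, giving a small value.
    """
    if a.payload == b.payload:
        return 0.0
    size_a = _compressed_size(a.payload)
    size_b = _compressed_size(b.payload)
    size_ab = _compressed_size(a.payload + b.payload)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)
```

Because `shallow_distance` touches only a handful of header fields, it suits the top of the tree where every packet is seen; `deep_distance` compresses payloads and is better reserved for the lower levels, where the work is already split across parallel branches.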
Application 2: Autonomous Log Monitoring
Northrop Grumman Proprietary Level 1
[Diagram: Computers send line-by-line logs to a Log Server. The logs pass through Multilayer Data-Driven Clustering, then Cluster Recognition and Uniqueness Metric Thresholding, which produces Automated System Management Alerts.]
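A single-level version of this log pipeline might be sketched as follows. The text-based distance metrics mentioned in the deck are not specified, so this assumes Jaccard distance over word tokens with digit runs masked; `LogMonitor`, `tokens`, `jaccard_distance`, and the `radius` value are all hypothetical.

```python
import re


def tokens(line):
    """Lowercased tokens with digit runs masked as '#'.

    Masking keeps timestamps, PIDs, and addresses from splitting clusters.
    """
    return frozenset(re.sub(r"\d+", "#", line.lower()).split())


def jaccard_distance(a, b):
    """1 minus the overlap ratio of two token sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


class LogMonitor:
    """Streams log lines into clusters and flags lines that start a new one."""

    def __init__(self, radius=0.4):
        self.radius = radius  # assumed threshold, to be tuned on real logs
        self.leaders = []     # token set of the first line in each cluster
        self.counts = []      # number of lines seen per cluster

    def observe(self, line):
        """Return True when the line matches no existing cluster."""
        line_tokens = tokens(line)
        for index, leader in enumerate(self.leaders):
            if jaccard_distance(leader, line_tokens) <= self.radius:
                self.counts[index] += 1
                return False
        self.leaders.append(line_tokens)
        self.counts.append(1)
        return True
```

Every line is new at startup, so a real deployment would presumably need a warm-up period on known-good logs before the `True` results are treated as alerts.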
Experiment Plan: Application 1 – Firewall Segment
* Need a source of data:
  * Time-tagged for artificial streaming
  * Packet captures would be ideal
  * Real data is better; simulated data may be possible to generate
* Develop and test detailed shallow and deep packet distance metrics.
* Set up unsupervised clustering code for autonomous hierarchical clustering.
* Run data to generate clusters and uniqueness metrics.
* Conclusion: we should be able to find the interesting features automatically.
* The proof of concept should allow easy transition into a system prototype.
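Artificial streaming from time-tagged data could be approximated with a small replay generator such as the one below. It assumes each record is a `(timestamp_seconds, item)` pair already loaded into memory; the function name `replay` and the `speedup` and `sleep` parameters are hypothetical.

```python
import time


def replay(records, speedup=1.0, sleep=time.sleep):
    """Yield (timestamp, item) pairs in time order at a scaled pace.

    The pause before each record is the gap to the previous timestamp
    divided by `speedup`. `sleep` is injectable so tests can run instantly.
    """
    previous = None
    for timestamp, item in sorted(records, key=lambda record: record[0]):
        if previous is not None:
            sleep((timestamp - previous) / speedup)
        previous = timestamp
        yield timestamp, item
```

Feeding the clustering code from `replay(...)` instead of a plain list lets the same capture exercise the streaming path, and raising `speedup` shortens long captures.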
Experiment Plan: Application 2 – Log Segment
* Need a source of data, or use the MiSTICKE Lab to generate a real data stream:
  * Time-tagged for artificial streaming
* Tune existing text-based distance metrics to log content, as needed.
* Set up unsupervised clustering code for autonomous hierarchical clustering.
* Run data to generate clusters and uniqueness metrics.
* Conclusion: we should be able to quickly and easily find unique system events.
* Easily transitioned into a system prototype.