NIMD 1 Exploring Massive Structured Data with ARGUS PI Meeting November 29, 2005 Main contacts : Prof. Jaime Carbonell – Dr. Santosh.

Presentation transcript:

NIMD 1 Exploring Massive Structured Data with ARGUS PI Meeting November 29, 2005 Main contacts : Prof. Jaime Carbonell – Dr. Santosh Ananthraman –

NIMD 2 Project ARGUS Objectives
1. Novelty detection in structured databases or data streams
   – Detect and track situation-specific “alert-watch” patterns
   – Cluster analysis to establish background (normal) models
   – Cluster density and locus analysis for early detection of new pattern onset, or meaningful change to an established pattern
2. Data Explorer - analyst interface
   – Framework for intensive, analyst-directed data exploration
   – Applications
     MED: Massachusetts hospital admission database to detect attacks by biological agents
     NED: Network anomaly/attack detection with CERT®, the federally funded computer security incident response center at CMU
3. Fast multi-dimensional structured-data matching
   – Exact and approximate matching
   – Scalable: O(10^6) to O(10^12) records
   – Profile matching for streaming data: O(1) to O(10^6) profiles

NIMD 3 Role of ARGUS in Hypothetical End-to-End Multifunctional Architecture
[Architecture diagram; box labels as extracted]
Raw data: Other, Net Traffic, Annotations, RSS, Financial, Reports, Materials, Customs, News, Biometric
Structured data: Banking Transactions, Network Traffic, Hospital Admissions, Extracted News Archives, Extracted Agency Reports
Structured Data Search: Exact, Approximate, Massive data, Streaming data
Exploration Analyst Workstation: Query Generation, Massive Search Control, Data Source Prioritization, Active Context Control, Hypothesis Management
Analysis Subsystem – Text & Data: Situation Assessment, Validation, Analyst Collaboration, Hypothesis Evaluation, Events and Alerts
Other labels: Analysts, Data Normalization & Modeling, Distributed Structured Search Engines, Profile Queries, Matched Events, Analyst Interface, Novelty Detection, Text Extraction, Mobile Agents

NIMD 4 Novelty Detection
Objective:
   – Detect the onset of novel events in incoming data streams
   – Generate alert for analyst (with justification)
   – If judged significant -> track developments, else discard
Properties
   – Need a model of “business as usual” to detect divergences therefrom (done by clustering recent history)
   – Control points (tradeoff in precision-recall)
     Degree of deviation from normalcy required
     Amount of data support (e.g. # of observations) before alerting
     Statistical model of normal “noise” in data streams
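The two control points named on this slide (required degree of deviation, and amount of data support before alerting) can be illustrated with a minimal sketch. It is not the ARGUS implementation: the background model here is just the mean and standard deviation of a one-dimensional history rather than a clustering, and the function name, the parameter names, and the rule that support must come from consecutive observations are all assumptions made for illustration.

```python
from statistics import mean, stdev

def novelty_alerts(history, stream, min_deviation=3.0, min_support=3):
    """Return stream indices at which an alert would be raised.

    history:       recent "business as usual" values (the background model)
    min_deviation: number of standard deviations that counts as divergent
    min_support:   consecutive divergent observations required before alerting
    (all names and the consecutive-support rule are illustrative assumptions)
    """
    mu, sigma = mean(history), stdev(history)
    support = 0
    alerts = []
    for i, x in enumerate(stream):
        if abs(x - mu) > min_deviation * sigma:
            support += 1
            if support == min_support:
                alerts.append(i)   # enough evidence has accumulated
        else:
            support = 0            # an isolated outlier is treated as noise
    return alerts

history = [10, 11, 9, 10, 12, 10, 9, 11]
stream = [10, 30, 11, 29, 31, 30, 10]
print(novelty_alerts(history, stream))                 # [5]
print(novelty_alerts(history, stream, min_support=1))  # [1, 3]
```

Raising `min_support` suppresses the isolated outlier at index 1 (higher precision) at the cost of alerting two observations later (lower recall early on), which is the tradeoff the slide refers to.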

NIMD 5 Cluster Evolution and Density Change Detection
[Figure panels: Constant Event; New Unobfuscated Event; New Obfuscated Event; Growing Event]

NIMD 6 Visualizations in Display Area

NIMD 7 Sample Application: Monitoring for Bioterrorism
Database of all Massachusetts hospital stays discharged between 10/2000 and 9/2001 (835,895 records)
18 fields per record, including:
   – provider (hospital)
   – patient (gender, age, birthdate, race, ZIP)
   – timing (admit date, length of stay)
   – diagnoses (up to 8 with one primary)
   – payment source
Cluster to form background models
Inject new streaming data that may include potential threats (e.g. SARS, Anthrax, toxin-based attack, …)
New Mini-Cluster Analysis reveals outbreaks of: Tularemia, Dengue Fever, Myiasis, Chagas Disease, SARS
Outbreak simulation
   – Added new records for patients from a small geographical region diagnosed with influenza in 9/2001
   – Graph shows resulting secondary peak in the pulmonary disease density function

NIMD 8 CERT Collaboration Working with CERT on NetFlow data for scalable detection of network attack patterns (viruses, denial of service, unauthorized entry attempts, etc.)

NIMD 9 CERT: Preliminary Data Analysis
Principal component analysis is used for data reduction: the 11 input features are reduced to 3 principal components (PC1, PC2 and PC3 below), which capture 54%, 25% and 13%, respectively, of the variance in the original 11 features.
For example, PC2 consists mainly of DST FLOWS, PKTS, and BYTES, and PC3 consists mainly of UNIQ_PORTS, SUBNETS and DSTPORT.
Clustering in the principal-component space is used to explore automatically generated aggregations and abstractions of the data for meaningful matching and pattern detection.
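Taken together, the three components on this slide would account for 54% + 25% + 13% = 92% of the variance. The reduction itself can be sketched as below, assuming NumPy is available. The data is synthetic (random columns with decreasing spread), not the CERT data, and the function name `pca` is a placeholder; the slide does not say which tool or preprocessing was used.

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top principal components.

    Returns (scores, explained_variance_ratio, loadings).
    """
    Xc = X - X.mean(axis=0)                   # center each feature
    cov = np.cov(Xc, rowvar=False)            # feature-by-feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()       # fraction of variance per component
    loadings = eigvecs[:, :n_components]      # weight of each feature in each PC
    scores = Xc @ loadings                    # reduced representation
    return scores, explained[:n_components], loadings

# Synthetic stand-in: 500 records, 11 features with decreasing spread.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 11)) * np.linspace(5.0, 0.5, 11)
scores, explained, loadings = pca(X, 3)
print(scores.shape)   # (500, 3)
```

The columns of `loadings` play the role of statements such as "PC3 consists mainly of UNIQ_PORTS, SUBNETS and DSTPORT": the features with the largest absolute weights dominate that component. Counts such as FLOWS and BYTES live on very different scales, so in practice the features would usually be standardized before this step.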

NIMD 10 Scalable Matcher
In-memory matchers faster than 1/100th of a second
Disk matcher faster than 1/10th of a second until the disk access barrier: 1 second per match above 10^8 records (on a 2-year-old processor)
Matcher Versions   Record Volume   Time complexity         Status
In-memory          10^6 to 10^8    Logarithmic             Mature
Disk-based         10^7 to         Low power-law           Algorithmically stable
Distributed        10^9 to         As underlying matcher   Initial prototype only
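The slide does not describe the index structure behind the in-memory matcher. As one illustration of how logarithmic-time exact and approximate lookups are possible, the sketch below keeps a single numeric field sorted and uses binary search. The class name `SortedMatcher` and its methods are hypothetical, and the real matcher handles multi-dimensional records.

```python
from bisect import bisect_left, bisect_right

class SortedMatcher:
    """Exact and tolerance-based lookup over one numeric field (illustrative only)."""

    def __init__(self, values):
        self._values = sorted(values)          # one-time O(n log n) build

    def exact(self, x):
        """True if x is present; O(log n) binary search."""
        i = bisect_left(self._values, x)
        return i < len(self._values) and self._values[i] == x

    def approximate(self, x, tol):
        """All stored values within +/- tol of x; O(log n) plus output size."""
        lo = bisect_left(self._values, x - tol)
        hi = bisect_right(self._values, x + tol)
        return self._values[lo:hi]

m = SortedMatcher([500, 120, 980, 505, 40])
print(m.exact(505))            # True
print(m.approximate(500, 10))  # [500, 505]
```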

NIMD 11 Matching Data Streams to Profiles
[Diagram labels: Data Streams, Novelty Detection, Analyst, Matcher, Profiles, Novel Events, Alerts, New Profiles]
Profile = “alert-watch” pattern
   – Generated by analyst
   – Novelty detection & vetted
Need rapid matching for simultaneously active profiles
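One reason matching can stay fast with many simultaneously active profiles is that profiles often share conditions, so each distinct condition needs to be evaluated only once per record. The sketch below shows that idea in its simplest form. The profile names, field names, and the 500000 threshold are invented for illustration, and this is not the ARGUS matcher itself.

```python
def match_record(record, predicates, profiles):
    """Return the names of profiles whose predicates all hold for this record.

    predicates: dict predicate_id -> function(record) -> bool
    profiles:   dict profile_name -> set of predicate_ids (a conjunction)
    """
    # Each distinct predicate is evaluated once, however many profiles use it.
    results = {pid: test(record) for pid, test in predicates.items()}
    return sorted(name for name, pids in profiles.items()
                  if all(results[pid] for pid in pids))

# Hypothetical predicates and profiles.
predicates = {
    "wire": lambda r: r["type_code"] == 1000,
    "large": lambda r: r["amount"] > 500000,
    "foreign": lambda r: r["country"] != "US",
}
profiles = {
    "large_wire": {"wire", "large"},
    "foreign_wire": {"wire", "foreign"},
}
record = {"type_code": 1000, "amount": 750000, "country": "US"}
print(match_record(record, predicates, profiles))  # ['large_wire']
```

Here the `wire` test is shared by both profiles but runs once per record.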

NIMD 12 Profile Sharing Framework ARGUS Query Network Manager Query Data Tables Analyst Identified Threats Data Streams Query Network System Catalog Dynamix Matcher

NIMD 13 Evaluation
MED: Bio-surveillance
FED: Fedwire suspicious transaction pattern tracking
Average time per query, in seconds:
Queries   NonJoinS   MatchPlan+NCanon   AllSharing
565       0.20       0.12               0.11
768       0.25       0.12               0.04

NIMD 14 ARGUS Achievements: Summary
Solid scientific underpinnings
   – Efficient algorithms for approximate search and exploration
   – Efficient matching of complex patterns on streaming data
   – Novelty detection via radial cluster-density function analysis
Prototype development
   – User validation of utility of techniques (at NIST)
   – Analyst GUI - Data Explorer (under development)
   – Applications
     MED: Massachusetts hospital admission database for detection of attacks by biological agents
     FED: Fedwire Money Transfer database for suspicious transaction pattern tracking
     NED: NetFlow database from CERT® for scalable detection of network attack patterns
Sufficient progress to interest operational IC
   – Exploring collaboration with GDAIS (their client has >10^8 transactions daily, >10^10 records total)
   – Getting ready for stage 1 RDEC insertion

NIMD 15 Additional Slides for Q & A Session

NIMD 16 Cluster Evolution
[Figure panels: Constant Event; New Unobfuscated Event; New Obfuscated Event; Growing Event]

NIMD 17 Novelty Detection
Functionality
Build background model
   – Expected Events (clusters)
Find divergences
   – Individual outliers (but many false positives)
   – New Mini-clusters (more reliable, unobfuscated new-event detection)
   – Detect when a novel event is masked by ordinary happenings or intentionally obfuscated
Trigger Alerts
   – Route & Prioritize
   – Formulate hypotheses for Analyst
Technology
Modeling methods
   – (Hierarchical) k-means
Divergence metrics
   – Radial density gradients from cluster centroid
   – Temporally-adaptive distance measures
   – Secondary peaks in density function
Create analyst profiles
   – RETE-based SAMs methods (last PI-meeting ARGUS paper)
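The "secondary peaks in density function" metric can be illustrated with a small sketch: count points by distance from an established cluster centroid and look for a bump away from the main mass. The bin width, the `min_count` threshold, and both function names are assumptions. The counts are raw per-shell counts, not normalized by shell volume, and the slides do not give the exact density function used.

```python
from math import dist

def radial_density(points, centroid, bin_width, n_bins):
    """Count points per distance bin around the centroid (unnormalized)."""
    counts = [0] * n_bins
    for p in points:
        b = int(dist(p, centroid) // bin_width)
        if b < n_bins:
            counts[b] += 1
    return counts

def secondary_peaks(counts, min_count=2):
    """Bins that are local maxima, excluding the bin holding the main peak."""
    main = counts.index(max(counts))
    return [i for i in range(1, len(counts) - 1)
            if i != main and counts[i] >= min_count
            and counts[i] > counts[i - 1] and counts[i] >= counts[i + 1]]

centroid = (0.0, 0.0)   # centroid of the established (background) cluster
background = [(0.1, 0.0), (0.0, 0.2), (-0.3, 0.1), (0.2, -0.2),
              (0.5, 0.5), (-0.4, -0.3), (1.2, 0.3)]
new_records = [(3.0, 0.2), (3.1, -0.1), (3.2, 0.1)]   # a new mini-cluster

before = radial_density(background, centroid, 1.0, 5)
after = radial_density(background + new_records, centroid, 1.0, 5)
print(before, secondary_peaks(before))  # [6, 1, 0, 0, 0] []
print(after, secondary_peaks(after))    # [6, 1, 0, 3, 0] [3]
```

A single stray point would not pass `min_count`, which reflects the slide's remark that individual outliers produce many false positives while mini-clusters are more reliable.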

NIMD 18 ARGUS Query Network Manager
[Diagram labels: Query, Query Network, ARGUS Query Network Manager]
Components shown: Coordinator, System Catalog, Common Computation Identifier, Sharing Optimizer, Projection Manager, Network Topology & Operation Manager, Query Rewriter, Query Optimizer, Code Assembler

NIMD 19 Recording & Identifying Common Comps
[Diagram; predicate lists as extracted, with amount thresholds missing in the source]
r2.type_code = 1000
r3.type_code = 1000
r1.type_code = 1000
r1.amount >
r1.rbank_aba = r2.sbank_aba
r1.benef_account = r2.orig_account
r2.amount * 2 > r1.amount
r1.tran_date <= r2.tran_date
r2.tran_date <= r1.tran_date + 10
r2.rbank_aba = r3.sbank_aba
r2.benef_account = r3.orig_account
r2.amount = r3.amount
r2.tran_date <= r3.tran_date
r3.tran_date <= r2.tran_date + 10

r1.type_code = 1000
r1.amount >
r2.type_code = 1000
r2.amount >
r3.type_code = 1000
r3.amount >
r1.rbank_aba = r2.sbank_aba
r1.benef_account = r2.orig_account
r2.amount * 2 > r1.amount
r1.tran_date <= r2.tran_date
r2.tran_date <= r1.tran_date + 10
r2.rbank_aba = r3.sbank_aba
r2.benef_account = r3.orig_account
r2.amount = r3.amount
r2.tran_date <= r3.tran_date
r3.tran_date <= r2.tran_date + 10

r1.amount - r2.amount * 2 < 0
r3.tran_date - r2.tran_date <= 10

System Catalog
   PredicateIndex: PredID, CanonicalForm, …
   PredicateSetIndex: PredSetID, PredID, …
   TopologyIndex: NodeID, PredSetID, …
Steps: Canonicalization; Inference & Classification; Common Computation Identification
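The idea of recording predicates in canonical form so that equivalent ones map to the same PredID can be sketched as follows. This is a deliberately simplified canonicalization (flip `>` and `>=` into `<` and `<=`, and order the operands of `=`). The slide's own canonical forms move all terms to one side, as in `r1.amount - r2.amount * 2 < 0`, which this sketch does not attempt. The `PredicateIndex` class here is hypothetical and only borrows the name from the slide.

```python
FLIP = {">": "<", ">=": "<="}

def canonicalize(lhs, op, rhs):
    """Rewrite a comparison so equivalent spellings produce the same tuple."""
    if op in FLIP:                       # a > b  becomes  b < a
        lhs, rhs, op = rhs, lhs, FLIP[op]
    elif op == "=" and rhs < lhs:        # a = b and b = a are the same predicate
        lhs, rhs = rhs, lhs
    return (lhs, op, rhs)

class PredicateIndex:
    """Maps each canonical form to a PredID, reusing IDs for repeats."""

    def __init__(self):
        self._ids = {}

    def register(self, lhs, op, rhs):
        key = canonicalize(lhs, op, rhs)
        return self._ids.setdefault(key, len(self._ids) + 1)

index = PredicateIndex()
a = index.register("r2.amount * 2", ">", "r1.amount")
b = index.register("r1.amount", "<", "r2.amount * 2")      # same predicate, other spelling
c = index.register("r1.rbank_aba", "=", "r2.sbank_aba")
d = index.register("r2.sbank_aba", "=", "r1.rbank_aba")    # operands swapped
print(a, b, c, d)  # 1 1 2 2
```

Two queries that register the same PredID can then share the node that computes it, which is the purpose of the common computation identification step.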

NIMD 20 Preliminary Data Analysis CERT: The Data
Exploratory data for this exercise comprised a matrix of 65k rows and 24 columns, which was aggregated as follows.
For every SCAN_HOUR, for every unique SCAN_ID:
record the {independent, input features - time element}
   TIME DATETIME - FIRST TIME THIS (SCAN, PORT, HOST) WAS SEEN THIS HOUR
   STIME DATETIME - START TIME OF THE FIRST FLOW IN THE SCAN
   ETIME DATETIME - START TIME OF THE LAST FLOW IN THE SCAN
record the {independent, input features - Source details}
   SRCADDR ADDRESS - SOURCE IP ADDRESS
   COUNTRY CHAR - TWO-LETTER COUNTRY CODE OF THE SRC (FROM GEOIP)
   UNIQ_DSTS INTEGER - NUMBER OF UNIQUE (PORT, ADDR) PAIRS SCANNED
   FLOWS INTEGER - TOTAL NUMBER OF FLOWS IN THE SCAN
   PKTS INTEGER - TOTAL NUMBER OF PACKETS IN THE SCAN
   BYTES INTEGER - TOTAL NUMBER OF BYTES IN THE SCAN
   UNIQ_PORTS INTEGER - NUMBER OF UNIQUE PORTS SCANNED
   UNIQ_HOSTS INTEGER - NUMBER OF UNIQUE HOSTS SCANNED
   SUBNETS INTEGER - NUMBER OF UNIQUE CLASS /24 PREFIXES SCANNED
   HAS_EXPLOIT INTEGER - 1 IF ANY OF THE TARGETS "TALKED BACK"
record the {independent, input features - Destination details}
   DSTPORT INTEGER - DESTINATION PORT
   FLOWS INTEGER - NUMBER OF FLOWS FOR THIS (SCAN, HOUR, PORT)
   PKTS INTEGER - NUMBER OF PACKETS FOR THIS (SCAN, HOUR, PORT)
   BYTES INTEGER - NUMBER OF BYTES FOR THIS (SCAN, HOUR, PORT)
   DSTADDR ADDRESS - DESTINATION IP ADDRESS
   EXPLOIT INTEGER - 1 IF THE DESTINATION HOST "TALKED BACK" TO THE SOURCE
record the {dependent, output features - SCAN classification labels based on CERT expert heuristics}
   SCAN_PROB FLOAT - PROBABILITY THAT THIS EVENT REPRESENTS A SCAN
   SCAN_FP INTEGER - 0: UNKNOWN, 1: HORIZONTAL, 2: VERTICAL
   SCAN_TYPE INTEGER - 0: NOT A SCAN, 1: SYN SCAN, 2: SYN-FIN SCAN, 3: NULL SCAN, 4: XMAS SCAN, 5: FIN SCAN, 6: UNIDENTIFIED SCAN
   HAS_TROJAN_PORT INTEGER - 1 IF ANY DSTPORT IS USED BY A KNOWN TROJAN
   IS_WORM INTEGER - 1 IF THE SCAN APPEARS TO BE A WORM
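The per-scan source features above (FLOWS, PKTS, BYTES, UNIQ_PORTS, UNIQ_HOSTS, UNIQ_DSTS, SUBNETS) are aggregates over individual flow records. A rough sketch of such an aggregation follows. The input record layout (lowercase keys such as `scan_hour` and `dstaddr`) is an assumption, addresses are assumed to be dotted IPv4 strings, and this is not CERT's actual pipeline.

```python
from collections import defaultdict

def aggregate_scans(flows):
    """Group flow records by (scan_hour, scan_id) and compute per-scan totals."""
    groups = defaultdict(list)
    for f in flows:
        groups[(f["scan_hour"], f["scan_id"])].append(f)

    rows = {}
    for key, fs in groups.items():
        rows[key] = {
            "FLOWS": len(fs),
            "PKTS": sum(f["pkts"] for f in fs),
            "BYTES": sum(f["bytes"] for f in fs),
            "UNIQ_PORTS": len({f["dstport"] for f in fs}),
            "UNIQ_HOSTS": len({f["dstaddr"] for f in fs}),
            "UNIQ_DSTS": len({(f["dstport"], f["dstaddr"]) for f in fs}),
            # /24 prefix = first three octets of a dotted IPv4 address
            "SUBNETS": len({f["dstaddr"].rsplit(".", 1)[0] for f in fs}),
        }
    return rows

# Hypothetical flow records for one scan in one hour.
flows = [
    {"scan_hour": 13, "scan_id": 7, "dstaddr": "10.0.1.5", "dstport": 22, "pkts": 3, "bytes": 180},
    {"scan_hour": 13, "scan_id": 7, "dstaddr": "10.0.1.6", "dstport": 22, "pkts": 2, "bytes": 120},
    {"scan_hour": 13, "scan_id": 7, "dstaddr": "10.0.2.9", "dstport": 80, "pkts": 1, "bytes": 60},
]
row = aggregate_scans(flows)[(13, 7)]
print(row["FLOWS"], row["PKTS"], row["UNIQ_PORTS"], row["SUBNETS"])  # 3 6 2 2
```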