NIMD 1 Exploring Massive Structured Data with ARGUS
PI Meeting, November 29, 2005
Main contacts:
Prof. Jaime Carbonell – carbon8918@yahoo.com
Dr. Santosh Ananthraman – santosh@dynamixtechnologies.com
NIMD 2 Project ARGUS Objectives
1. Novelty detection in structured databases or data streams
   Detect and track situation-specific “alert-watch” patterns
   Cluster analysis to establish background (normal) models
   Cluster density and locus analysis for early detection of new pattern onset, or meaningful change to an established pattern
2. Data Explorer - analyst interface
   Framework for intensive, analyst-directed data exploration
   Applications
     MED: Massachusetts hospital admission database to detect attacks by biological agents
     NED: Network anomaly/attack detection with CERT®, the federally funded computer security incident response center at CMU
3. Fast multi-dimensional structured-data matching
   Exact and approximate matching
   Scalable: O(10^6) to O(10^12) records
   Profile matching for streaming data: O(1) to O(10^6) profiles
NIMD 3 Role of ARGUS in Hypothetical End-to-End Multifunctional Architecture
[Architecture diagram. Labels recovered from the figure:]
Raw Data: Net Traffic, Annotations, RSS, Financial, Reports, Materials, Customs, News, Biometric, Other
Data Normalization & Modeling
Structured Data: Banking Transactions, Network Traffic, Hospital Admissions, Extracted News Archives, Extracted Agency Reports
Distributed Structured Search Engines
Structured Data Search: Exact, Approximate, Massive data, Streaming data
Novelty Detection, Text Extraction, Mobile Agents
Exploration Analyst Workstation: Query Generation, Massive Search Control, Data Source Prioritization, Active Context Control, Hypothesis Management
Profile Queries, Matched Events, Analyst Interface
Analysis Subsystem – Text & Data: Situation Assessment, Validation, Analyst Collaboration, Hypothesis Evaluation, Events and Alerts
Analysts
NIMD 4 Novelty Detection
Objective:
  – Detect the onset of novel events in incoming data streams
  – Generate alert for analyst (with justification)
  – If judged significant, track developments; else discard
Properties
  – Need a model of “business as usual” to detect divergences therefrom (done by clustering recent history)
  – Control points (tradeoff in precision-recall)
      Degree of deviation from normalcy required
      Amount of data support (e.g. # of observations) before alerting
      Statistical model of normal “noise” in data streams
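The control points listed on this slide can be illustrated with a small sketch. It is not the ARGUS alerting rule, which the slides do not specify. It assumes a simple Gaussian noise model and a hypothetical function `should_alert`, whose parameters `min_deviation` and `min_support` stand in for "degree of deviation" and "amount of data support".

```python
import math
from statistics import mean, stdev

def should_alert(background, recent, min_deviation=3.0, min_support=5):
    """Decide whether `recent` observations diverge from `background`.

    min_deviation: required shift of the recent mean, in standard errors
                   (hypothetical stand-in for "degree of deviation").
    min_support:   minimum number of recent observations before alerting
                   (hypothetical stand-in for "amount of data support").
    """
    if len(recent) < min_support:
        return False  # not enough data support yet
    mu, sigma = mean(background), stdev(background)
    if sigma == 0:
        # Background has no noise at all: any shift counts as a divergence.
        return mean(recent) != mu
    # Standard error of the recent mean under the background noise model.
    std_err = sigma / math.sqrt(len(recent))
    z = abs(mean(recent) - mu) / std_err
    return z >= min_deviation
```

Raising `min_deviation` or `min_support` trades recall for precision: fewer alerts fire, but each one is better supported.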
NIMD 5 Cluster Evolution and Density Change Detection
[Figure panels: Constant Event; New Unobfuscated Event; New Obfuscated Event; Growing Event]
NIMD 6 Visualizations in Display Area
NIMD 7 Sample Application: Monitoring for Bioterrorism
Database of all Massachusetts hospital stays discharged between 10/2000 and 9/2001 (835,895 records)
18 fields per record, including:
  – provider (hospital)
  – patient (gender, age, birthdate, race, ZIP)
  – timing (admit date, length of stay)
  – diagnoses (up to 8 with one primary)
  – payment source
Cluster to form background models
Inject new streaming data that may include potential threats (e.g. SARS, Anthrax, toxin-based attack, …)
New Mini-Cluster Analysis reveals outbreaks of:
  Tularemia
  Dengue Fever
  Myiasis
  Chagas Disease
  SARS
Outbreak simulation
  – Added new records for patients from a small geographical region diagnosed with influenza in 9/2001
  – Graph shows resulting secondary peak in the pulmonary disease density function
NIMD 8 CERT Collaboration Working with CERT on NetFlow data for scalable detection of network attack patterns (viruses, denial of service, unauthorized entry attempts, etc.)
NIMD 9 CERT: Preliminary Data Analysis
Principal component analysis is used for data reduction: the 11 input features are reduced to 3 principal components (PC1, PC2 and PC3), which capture 54%, 25% and 13%, respectively, of the variance in the original 11 features
For example, PC2 is composed mainly of DST FLOWS, PKTS, and BYTES, and PC3 mainly of UNIQ_PORTS, SUBNETS and DSTPORT
Clustering in the principal-component space is used to explore automatically generated aggregations and abstractions of the data for meaningful matching and pattern detection
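The three reported shares add up to 92% of the variance. The sketch below shows how such explained-variance shares can be computed. It uses only the Python standard library and plain power iteration with deflation, which is an assumption made for illustration; the slides do not say which PCA implementation was used, and the function names are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length feature rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_components(cov, k, iters=500):
    """Leading k (eigenvalue, eigenvector) pairs of a symmetric PSD matrix."""
    d = len(cov)
    mat = [row[:] for row in cov]  # working copy, deflated as we go
    pairs = []
    for _component in range(k):
        v = [1.0 / math.sqrt(d)] * d  # fixed start vector (an assumption)
        for _step in range(iters):
            w = [sum(mat[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break  # nothing left to extract
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue estimate for unit vector v.
        lam = sum(v[i] * sum(mat[i][j] * v[j] for j in range(d))
                  for i in range(d))
        pairs.append((lam, v))
        # Deflation: subtract the component just found.
        for i in range(d):
            for j in range(d):
                mat[i][j] -= lam * v[i] * v[j]
    return pairs

def explained_variance_ratio(cov, pairs):
    """Share of total variance (trace of cov) captured by each component."""
    total = sum(cov[i][i] for i in range(len(cov)))
    return [lam / total for lam, _ in pairs]
```

Applied to an 11-feature matrix with `k=3`, `explained_variance_ratio` would return three shares comparable in kind to the 54%, 25% and 13% quoted on the slide.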
NIMD 10 Scalable Matcher
In-memory matchers faster than 1/100th of a second
Disk matcher faster than 1/10th of a second until disk access barrier
1 second per match above 10^8 records (on a 2-year-old processor)

Matcher Version   Record Volume    Time Complexity         Status
In-memory         10^6 to 10^8     Logarithmic             Mature
Disk-based        10^7 to 10^10    Low power-law           Algorithmically stable
Distributed       10^9 to 10^12    As underlying matcher   Initial prototype only
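Logarithmic lookup time of the kind listed for the in-memory matcher is what a sorted index with binary search provides. The sketch below illustrates that idea only. The slides do not describe the matcher's actual data structure, and the class `SortedIndex` and its methods are hypothetical.

```python
import bisect

class SortedIndex:
    """In-memory index over one numeric field, searched by bisection."""

    def __init__(self, records, key):
        self._records = list(records)
        # (key value, record position) pairs sorted by key value.
        self._pairs = sorted((key(r), i) for i, r in enumerate(self._records))
        self._keys = [k for k, _ in self._pairs]

    def _slice(self, low, high):
        # Two binary searches locate the run of keys in [low, high].
        lo = bisect.bisect_left(self._keys, low)
        hi = bisect.bisect_right(self._keys, high)
        return [self._records[i] for _, i in self._pairs[lo:hi]]

    def exact(self, value):
        """Records whose key equals `value`."""
        return self._slice(value, value)

    def within(self, value, tolerance):
        """Approximate match: records whose key is within `tolerance`."""
        return self._slice(value - tolerance, value + tolerance)
```

Each lookup costs O(log n) to locate the range plus the number of records returned; building the index costs O(n log n) once.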
NIMD 11 Matching Data Streams to Profiles
[Diagram labels: Data Streams, Novelty Detection, Analyst, Matcher, Profiles, Novel Events, Alerts, New Profiles]
Profile = “alert-watch” pattern
  Generated by analyst
  Novelty detection & vetted
Need rapid matching for 10^5+ simultaneously active profiles
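With 10^5 or more active profiles, testing every profile against every incoming record is the cost to avoid. One common approach, assumed here for illustration and not taken from the slides, is to index profiles by an equality condition so that each record is checked only against candidate profiles. The class `ProfileMatcher` and the field names are hypothetical.

```python
from collections import defaultdict

class ProfileMatcher:
    """Index profiles by one (field, value) equality test each."""

    def __init__(self):
        self._by_key = defaultdict(list)

    def add_profile(self, name, key_field, key_value, predicate):
        # `predicate` holds the remaining conditions of the profile.
        self._by_key[(key_field, key_value)].append((name, predicate))

    def match(self, event):
        """Names of all profiles satisfied by one streaming event."""
        hits = []
        for field, value in event.items():
            # Only profiles keyed on this exact field value are candidates.
            for name, predicate in self._by_key.get((field, value), ()):
                if predicate(event):
                    hits.append(name)
        return hits
```

The work per event then depends on how many profiles share the event's field values, not on the total number of profiles.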
NIMD 12 Profile Sharing Framework
[Diagram labels: ARGUS Query Network Manager; Query; Data Tables; Analyst Identified Threats; Data Streams; Query Network; System Catalog; Dynamix Matcher]
NIMD 13 Evaluation
MED: Bio-surveillance
FED: Fedwire suspicious transaction pattern tracking

AvgTime/Query (seconds)
Queries   NonJoinS   MatchPlan+NCanon   AllSharing
565       0.20       0.12               0.11
768       0.25       0.12               0.04
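The timings above can be restated as speedup factors. The short calculation below treats NonJoinS as the baseline, which is an assumption (it is the slowest configuration in both runs); the numbers themselves are copied from the slide.

```python
# Average seconds per query, copied from the slide.
timings = {
    565: {"NonJoinS": 0.20, "MatchPlan+NCanon": 0.12, "AllSharing": 0.11},
    768: {"NonJoinS": 0.25, "MatchPlan+NCanon": 0.12, "AllSharing": 0.04},
}

def speedups(row, baseline="NonJoinS"):
    """Speedup of every configuration relative to the baseline one."""
    base = row[baseline]
    return {name: base / t for name, t in row.items() if name != baseline}

for n_queries, row in timings.items():
    for name, factor in speedups(row).items():
        print(f"{n_queries} queries, {name}: {factor:.2f}x vs NonJoinS")
```

AllSharing goes from roughly 1.8x at 565 queries to 6.25x at 768 queries, so the benefit of sharing grows with the query count in these two runs.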
NIMD 14 ARGUS Achievements: Summary
Solid scientific underpinnings
  – Efficient algorithms for approximate search and exploration
  – Efficient matching of complex patterns on streaming data
  – Novelty detection via radial cluster-density function analysis
Prototype development
  – User validation of utility of techniques (at NIST)
  – Analyst GUI - Data Explorer (under development)
  – Applications
      MED: Massachusetts hospital admission database for detection of attacks by biological agents
      FED: Fedwire Money Transfer database for suspicious transaction pattern tracking
      NED: NetFlow database from CERT® for scalable detection of network attack patterns
Sufficient progress to interest operational IC
  – Exploring collaboration with GDAIS (their client has >10^8 transactions daily, >10^10 records total)
  – Getting ready for stage 1 RDEC insertion
NIMD 15 Additional Slides for Q & A Session
NIMD 16 Cluster Evolution
[Figure panels: Constant Event; New Unobfuscated Event; New Obfuscated Event; Growing Event]
NIMD 17 Novelty Detection Functionality
Build background model
  – Expected Events (clusters)
Find divergences
  – Individual outliers (but many false positives)
  – New Mini-clusters (more reliable, unobfuscated new-event detection)
  – Detect when a novel event is masked by ordinary happenings or intentionally obfuscated
Trigger Alerts
  – Route & Prioritize
  – Formulate hypotheses for Analyst
Technology
Modeling methods
  – (Hierarchical) k-means
Divergence metrics
  – Radial density gradients from cluster centroid
  – Temporally-adaptive distance measures
  – Secondary peaks in density function
Create analyst profiles
  – RETE-based SAMs methods (last PI-meeting ARGUS paper)
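Two of the divergence metrics named above, radial density from a cluster centroid and secondary peaks in the density function, can be sketched as follows. This is a simplified illustration under stated assumptions, not the ARGUS method. It uses raw counts per radial bin (a true density would also divide by shell volume), and the function names and the `min_count` threshold are hypothetical.

```python
import math

def centroid(points):
    """Mean position of a list of equal-length coordinate tuples."""
    n, d = len(points), len(points[0])
    return tuple(sum(p[j] for p in points) / n for j in range(d))

def radial_histogram(points, center, bin_width, n_bins):
    """Count points per distance bin around `center`."""
    counts = [0] * n_bins
    for p in points:
        b = int(math.dist(p, center) // bin_width)
        if b < n_bins:  # points beyond the last bin are ignored
            counts[b] += 1
    return counts

def secondary_peaks(hist, min_count=2):
    """Bins that are local maxima, excluding the global peak."""
    main = max(range(len(hist)), key=hist.__getitem__)
    peaks = []
    for i, count in enumerate(hist):
        left = hist[i - 1] if i > 0 else 0
        right = hist[i + 1] if i + 1 < len(hist) else 0
        if i != main and count >= min_count and count > left and count > right:
            peaks.append(i)
    return peaks
```

A mini-cluster forming away from the centroid shows up as a bump in the outer bins, which `secondary_peaks` reports; `min_count` plays the role of the data-support threshold.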
NIMD 18 ARGUS Query Network Manager
[Diagram labels: Query Network; Query; ARGUS Query Network Manager; Coordinator; System Catalog; Common Computation Identifier; Sharing Optimizer; Projection Manager; Network Topology & Operation Manager; Query Rewriter; Query Optimizer; Code Assembler]
NIMD 19 Recording & Identifying Common Comps

r2.type_code = 1000
r3.type_code = 1000
r1.type_code = 1000
r1.amount > 1000000
r1.rbank_aba = r2.sbank_aba
r1.benef_account = r2.orig_account
r2.amount * 2 > r1.amount
r1.tran_date <= r2.tran_date
r2.tran_date <= r1.tran_date + 10
r2.rbank_aba = r3.sbank_aba
r2.benef_account = r3.orig_account
r2.amount = r3.amount
r2.tran_date <= r3.tran_date
r3.tran_date <= r2.tran_date + 10

r1.type_code = 1000
r1.amount > 1000000
r2.type_code = 1000
r2.amount > 500000
r3.type_code = 1000
r3.amount > 500000
r1.rbank_aba = r2.sbank_aba
r1.benef_account = r2.orig_account
r2.amount * 2 > r1.amount
r1.tran_date <= r2.tran_date
r2.tran_date <= r1.tran_date + 10
r2.rbank_aba = r3.sbank_aba
r2.benef_account = r3.orig_account
r2.amount = r3.amount
r2.tran_date <= r3.tran_date
r3.tran_date <= r2.tran_date + 10

Canonical forms:
r1.amount - r2.amount * 2 < 0
r3.tran_date - r2.tran_date <= 10

System Catalog
  PredicateIndex: PredID, CanonicalForm, …
  PredicateSetIndex: PredSetID, PredID, …
  TopologyIndex: NodeID, PredSetID, …

Canonicalization
Inference & Classification
Common Computation Identification
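The slide pairs predicates such as `r2.amount * 2 > r1.amount` with the form `r1.amount - r2.amount * 2 < 0`. A minimal sketch of how canonicalization lets equivalent predicates share one PredID follows. It handles only linear comparisons, and the normalization rules, the dictionary representation of terms, and the names `canonicalize` and `PredicateIndex.pred_id` are assumptions for illustration; the slides do not give the actual ARGUS procedure.

```python
from fractions import Fraction

FLIP = {">": "<", ">=": "<="}

def canonicalize(lhs, op, rhs):
    """Normalize a linear comparison to (sorted coefficients, op, constant).

    `lhs` and `rhs` map variable names to coefficients; the key "" holds
    a constant term. This representation is hypothetical.
    """
    # Move every term to the left-hand side: lhs - rhs (op) 0.
    diff = {}
    for terms, sign in ((lhs, 1), (rhs, -1)):
        for var, coef in terms.items():
            diff[var] = diff.get(var, Fraction(0)) + sign * Fraction(coef)
    # Express > and >= as < and <= by negating both sides.
    if op in FLIP:
        diff = {v: -c for v, c in diff.items()}
        op = FLIP[op]
    const = -diff.pop("", Fraction(0))  # constant returns to the right side
    coefs = sorted((v, c) for v, c in diff.items() if c != 0)
    if coefs:
        # Positive scaling keeps the inequality direction intact.
        scale = abs(coefs[0][1])
        if op == "=" and coefs[0][1] < 0:
            scale = -scale  # equalities may also be negated
        coefs = [(v, c / scale) for v, c in coefs]
        const = const / scale
    return (tuple(coefs), op, const)

class PredicateIndex:
    """Assigns one PredID per canonical form, as in the System Catalog."""

    def __init__(self):
        self._ids = {}

    def pred_id(self, lhs, op, rhs):
        key = canonicalize(lhs, op, rhs)
        return self._ids.setdefault(key, len(self._ids) + 1)
```

Two queries that state the same condition in different ways then resolve to the same PredID, which is the precondition for sharing the computation between them.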
NIMD 20 Preliminary Data Analysis CERT: The Data
Exploratory data for this exercise comprised a matrix of 65k rows and 24 columns, which was aggregated as follows
For every SCAN_HOUR, for every unique SCAN_ID

record the {independent, input features - time element}
  TIME DATETIME - FIRST TIME THIS (SCAN, PORT, HOST) WAS SEEN THIS HOUR
  STIME DATETIME - START TIME OF THE FIRST FLOW IN THE SCAN
  ETIME DATETIME - START TIME OF THE LAST FLOW IN THE SCAN

record the {independent, input features - Source details}
  SRCADDR ADDRESS - SOURCE IP ADDRESS
  COUNTRY CHAR - TWO-LETTER COUNTRY CODE OF THE SRC (FROM GEOIP)
  UNIQ_DSTS INTEGER - NUMBER OF UNIQUE (PORT, ADDR) PAIRS SCANNED
  FLOWS INTEGER - TOTAL NUMBER OF FLOWS IN THE SCAN
  PKTS INTEGER - TOTAL NUMBER OF PACKETS IN THE SCAN
  BYTES INTEGER - TOTAL NUMBER OF BYTES IN THE SCAN
  UNIQ_PORTS INTEGER - NUMBER OF UNIQUE PORTS SCANNED
  UNIQ_HOSTS INTEGER - NUMBER OF UNIQUE HOSTS SCANNED
  SUBNETS INTEGER - NUMBER OF UNIQUE CLASS /24 PREFIXES SCANNED
  HAS_EXPLOIT INTEGER - 1 IF ANY OF THE TARGETS "TALKED BACK"

record the {independent, input features - Destination details}
  DSTPORT INTEGER - DESTINATION PORT
  FLOWS INTEGER - NUMBER OF FLOWS FOR THIS (SCAN, HOUR, PORT)
  PKTS INTEGER - NUMBER OF PACKETS FOR THIS (SCAN, HOUR, PORT)
  BYTES INTEGER - NUMBER OF BYTES FOR THIS (SCAN, HOUR, PORT)
  DSTADDR ADDRESS - DESTINATION IP ADDRESS
  EXPLOIT INTEGER - 1 IF THE DESTINATION HOST "TALKED BACK" TO THE SOURCE

record the {dependent, output features - SCAN classification labels based on CERT expert heuristics}
  SCAN_PROB FLOAT - PROBABILITY THAT THIS EVENT REPRESENTS A SCAN
  SCAN_FP INTEGER - 0: UNKNOWN, 1: HORIZONTAL, 2: VERTICAL
  SCAN_TYPE INTEGER - 0: NOT A SCAN, 1: SYN SCAN, 2: SYN-FIN SCAN, 3: NULL SCAN, 4: XMAS SCAN, 5: FIN SCAN, 6: UNIDENTIFIED SCAN
  HAS_TROJAN_PORT INTEGER - 1 IF ANY DSTPORT IS USED BY A KNOWN TROJAN
  IS_WORM INTEGER - 1 IF THE SCAN APPEARS TO BE A WORM