Presentation is loading. Please wait.

Presentation is loading. Please wait.

A Data-Centric Approach to Insider Attack Detection in Database Systems Sunu Mathew Joint Work with: Michalis Petropoulos, Hung Q. Ngo, Shambhu Upadhyaya.

Similar presentations


Presentation on theme: "A Data-Centric Approach to Insider Attack Detection in Database Systems Sunu Mathew Joint Work with: Michalis Petropoulos, Hung Q. Ngo, Shambhu Upadhyaya."— Presentation transcript:

1 A Data-Centric Approach to Insider Attack Detection in Database Systems Sunu Mathew Joint Work with: Michalis Petropoulos, Hung Q. Ngo, Shambhu Upadhyaya Computer Science and Engineering, Univ. at Buffalo 13 th International Symposium on Recent Advances in Intrusion Detection (RAID 2010) Ottawa, Canada (September 2010)

2 Insider Threats against Database Systems  Insider Threats – Hard problem with vast scope  Operate within existing rules and policies  High potential for damage  Impersonation  Information Leak  Data Harvesting  Most critical threat  insider access to sensitive information  Hence, focus on effective solution to insider threats against Database Systems 2

3 Contributions  Modeling user queries from a data-centric viewpoint (as opposed to prior syntax centric)  Extracting feature vectors based on the results of query execution  Sampling strategies to address performance issues  Taxonomy for analyzing query anomalies from a data-centric viewpoint 3

4 Outline  Limitations of Syntax Centric Approaches  Data-Centric User Profiling  Taxonomy of Query Anomalies (Data-Centric)  Experimental Evaluation  Conclusions and Future work 4

5 Syntax Centric Modeling  Models queries using a characteristic or frequency vector based on parsing syntax elements ( e.g., [Kamra et al., VLDB J. 2008] ) SELECT p.product_name, p.product_id FROM PRODUCT p WHERE p.cost = 100 and p.weight > 80; SQL command – SELECT, Select Clause Relations – PRODUCT, Select Clause Attributes – product_name, product_id, Where Clause Relations – PRODUCT, Where Clause Attributes – cost, weight SQL command – SELECT, Select Clause Relations – PRODUCT, Select Clause Attributes – product_name, product_id, Where Clause Attributes - cost 5

6 Limitations of Syntax Centric Approach  Two similar looking queries can produce completely different results SELECT p.product_name, p.product_id FROM PRODUCT p WHERE p.cost > 100 and p.weight = 80;  False negative – similar syntax, different results  Two different looking queries can produce similar results SELECT p.product_name, p.product_id FROM PRODUCT p WHERE p.cost = 100 and p.weight > 80 AND p.product_name is not null;  False positive – different syntax, same results  Syntax Analysis is hard!!( [Calvanese et al. PODS 98] ) 6

7 Outline  Limitations of Syntax Centric Approaches  Data-Centric User Profiling  Taxonomy of Query Anomalies (Data-Centric)  Experimental Evaluation  Conclusions and Future work 7

8 Alternate viewpoint – Data Centric  When are two queries similar/different?  Model based on what data is accessed (i.e., results of query execution)  Visualize a database as a single relation (Universal Relation, [Ullman et al., TODS 84] )  Compute a statistical summary (S-Vector) of a query’s result tuples  Fixed size feature vector (depends on DB schema, independent of DB size and number of result tuples)  S-Vectors for past queries used to profile user 8

9 Statistics Vector For Sample Database 9 Database SchemaS-Vector Features RelationAttribute Product.type.ncount Product.type.ndistinct Product.cost.Min Product.cost.Max Product.cost.Mean Product.cost.StdDev Product.cost.Median Product Product.type (varchar) Product.cost (numeric)

10 Outline  Limitations of Syntax Centric Approaches  Data-Centric User Profiling  Data-Centric Analysis of Query Anomalies  Experimental Evaluation  Conclusions and Future work 10

11 Analyzing query anomalies -- 1  Based on schema, statistics of execution result  Type 1: Different Result Schema/Different Result Tuples (Different Rows and Columns) Query 1: SELECT p.cost FROM PRODUCT p WHERE p.type = ’abc’; Query 2: SELECT p.type FROM PRODUCT p WHERE p.cost < 1000;  Both syntax-centric/data-centric should be effective  Typically studied in the context of Role Based Access Control (RBAC) and Masquerade detection 11

12 Analyzing query anomalies -- 2  Type 2: Similar Result Schema/Different Result Tuples (Similar Columns, Different Rows) Base Query: SELECT * FROM PRODUCT p WHERE p.cost == 1000;  Type 2(a): Distinct Syntax SELECT * FROM PRODUCT p WHERE p.cost < 1000 AND p.type == `abc’;  Type 2(b): Similar Syntax SELECT * FROM PRODUCT p WHERE p.cost 1000;  Syntax-centric schemes “blind” to 2(b) 12

13 Analyzing query anomalies --3  Type 3: Similar Result Schema/Similar Result Tuples (Similar rows and columns ) Base Query: SELECT p.type FROM PRODUCT p WHERE p.cost < 1000;  Type 3(a): Different Syntax/Similar Semantics SELECT p.type FROM PRODUCT p WHERE p.cost < 1000 AND p.type IN (SELECT q.type FROM PRODUCT q);  Syntax-centric generates false positives  Type 3(b): Different Syntax/Different Semantics SELECT p.type FROM PRODUCT p WHERE true;  Attacker gains additional knowledge (detected by syntax-centric, undetected by data-centric)  3(b) is rare (attacker has to control result statistics) 13

14 Data-Centric Classification of Query Anomalies 14 Anomaly CasesTypesDetected by Syntax Centric? Detected by Data- Centric? Type 1 (Different Schema/Different Results Yes Type 2 (Similar Schema/Different Results) a) Distinct SyntaxYes b) Similar SyntaxNoYes Type 3 (Similar Schema/Similar Results) a) Different Syntax/ Similar Semantics False positiveYes (allowed) b) Different Syntax/ Different Semantics YesNo (Rare)

15 Outline  Limitations of Syntax Centric Approaches  Data-Centric User Profiling  Data-Centric Analysis of Query Anomalies  Experimental Evaluation  Conclusions and Future work 15

16 Experimental Evaluation  Goals  Type 1 and 2 (a) anomalies (Detected by both syntax-centric, data-centric):  Show that data-centric performs at least as well as syntax-centric  Type 2(b) (Undetected by syntax-centric, Detected by data-centric):  Show good performance for the data-centric approach  Type 3 is outside scope (addressed in other research)  Compare with syntax centric ( [Kamra et al. 2008] ) 16

17 Test Setup 17

18 Query Result Sampling -- Strategies  A potential drawback of data-centric – Requires query execution  Database server load may not be significant  However, some examination of query results is needed for online detection  Compromise – use only k tuples of the execution results to approximate the S-Vector  Top (Initial) – k  Random – k  Suitable for the pipelined query execution model for commercial RDBMS 18

19 Evaluating Type 1, 2(a) (Different Query Result Schema/Different Syntax)  Role based anomaly (masquerade) detection  Algorithms: Naïve Bayes, Decision Tree, SVM and Euclidean clustering  Syntax Centric formats – C (crude) quiplet, M (medium) quiplet, F (fine) quiplet ( [Kamra et al. VLDB J. 2008] )  Data-Centric formats (S-Vector, S-V Initial and S-V Random with 20, 10 and 5 result tuples) 19

20 Results – Type 1, 2(a) (Different Query Result Schema/Different Syntax) 20 RolesAlgorithm C. quip % M. quip F.qui p S-V (all) S-V I-20 S-V R-20 S-V I-10 S-V R-10 S-V I-5 S-V R-5 Chair Vs Faculty N-Bayes81.6785.337585 82.6778.337781.6790 Dec-Tree8887.67 96.3388.3 8988.67 SVM83.38187.6782.3374.677771.3375.676874.33 Clustering73.37265.69292.692.3394 92.693.33  Similar performance as expected, with high detection rates (close to 100%)  Clustering performs better with S Vector  Good results with SV Initial and SV Random

21 Evaluating Type 2(b) anomalies (Different Query Result Stats/ Similar Syntax)  Volume anomalies, Sensitive tuples  Supervised two-class training is not feasible  Manually created `attack set’ based on dataset  Base query: SELECT * FROM vApplicants WHERE reviewStatusID = `a’ AND reviewStatusID = `b’;  Variation: SELECT * FROM vApplicants WHERE reviewStatusID = `b’ OR reviewStatusID = `b’;  One-class Anomaly Detection  Cluster-based Outlier Detection  Attrib-Deviation 21

22 Evaluating Type 2(b) (Different Query Result Stats/ Similar Syntax) -- Schemes  Cluster based outlier detection  Queries can be considered as points in some space, construct role (e.g., Faculty) clusters  Approximate a cluster by a single point (e.g., centroid of cluster), use Euclidean distance & threshold for anomaly detection  Attrib-Deviation  Data anomalies  Anomalies in S-V dimensions  Perform anomaly detection on a per-dimension basis (e.g., 3*StdDev threshold), number of anomaly dimensions used to rank test queries 22

23 Results – Type 2(b) anomalies (Different Query Result Stats/ Similar Syntax) 23 AlgorithmSyntax-CentricData-Centric C qlet M qlet F qlet S-V all S-V I20 S-V R20 S-V I-10 S-V R10 S-V I-5 S-V R5 Cluster- Outlier Detection23.526.417.683.81267.76.445.16.435.4 False Positive 14.411.815.810.53.96.53.95.22.66.6 Attrib- Deviation Detection017.62.987 12.964.5 False Positive 04.8 22.6261523.815.823.820.4  Cluster based -- Data-centric better, Initial-k poor  Attrib-Deviation – Data-centric better, Initial-k effective!

24 Initial-k Sampling -- Performance  But why does Initial-k perform well with Attrib- Deviation?  Single anomalous attribute  many anomaly dimensions  Per dimension basis -- selective ordering affects anomaly attributes differently  Different statistical measurements may be affected differently (e.g., Max, Median)  Performance drops only when result sample size is too low (e.g., k = 5)  Positive practical implications of Initial-k performance 24

25 Outline  Limitations of Syntax Centric Approaches  Data-Centric User Profiling  Data-Centric Analysis of Query Anomalies  Experimental Evaluation  Conclusions and Future Work 25

26 Conclusions and Future Work  Queries other than Select-Project-Join  Union Queries  SELECT g.name, g.gpa FROM GRADS g  UNION  SELECT u.name, u.gpa FROM UGRADS u;  Computed Attributes (Avg, etc.)  GROUP-BY attributes  ORDER-BY may lead to false positives  Dynamic databases, correlated queries  Feasibility of randomized result set  Combining Syntax and Data Centric schemes 26

27 Questions? Thank You! 27


Download ppt "A Data-Centric Approach to Insider Attack Detection in Database Systems Sunu Mathew Joint Work with: Michalis Petropoulos, Hung Q. Ngo, Shambhu Upadhyaya."

Similar presentations


Ads by Google