Detection of Masqueraders Based on Graph Partitioning of File System Access Events
Flavio Toffalini, Ivan Homoliak, Athul Harilal, Alexander Binder, and Martín Ochoa
ST Electronics-SUTD Cyber Security Laboratory
Workshop on Research for Insider Threats, 39th IEEE Symposium on Security and Privacy. San Francisco, USA, May 24th, 2018.
SUTD - CorpLab
A laboratory co-founded by:
- Singapore University of Technology and Design (aka SUTD)
- ST Electronics
- National Research Foundation (aka NRF)
One of its projects deals with insider threats
SUTD - CorpLab
- TWOS: A Dataset of Malicious Insider Threat Behavior Based on a Gamified Competition (MIST-CSS@2017 + JoWUA@2018)
- Insight into Insiders: A Survey of Insider Threat Taxonomies, Analysis, Modeling, and Countermeasures (preprint@arxiv.org)
- Others about host detection and continuous authentication
Agenda
- Our goal
- Attacker model
- Previous approaches
- Intuition
- Markov Cluster Graph
- Overview
- Implementation
- Qualitative evaluation
- Efficiency Analysis
Our Goal We aim at catching masqueraders by using an anomaly detection system Masquerader
Attacker Model What is a masquerader? Masquerader Legitimate
Attacker Model
What is a masquerader?
- Any (malicious) user who acts on behalf of a legitimate user
- He has already bypassed previous system access controls
- He can exfiltrate data
- He can sabotage the machine
Previous approaches
- Scenarios / logs: file systems, network, login/logout, email, …
- Detection: features extraction + machine learning
- Alternative detection: “raw data” + deep learning
Intuition
- Legitimate tasks follow patterns
- Masquerader tasks are different
- A task generates events (e.g., file system access)
- Modeling events as graphs
Intuition
Users’ file system access:
Timestamp                Action    File
05/05/2018 10:10:05..    Open      C:\File1
05/05/2018 10:10:06..    Read      C:\File2
05/05/2018 10:10:07..
05/05/2018 10:11:13..    Close
05/05/2018 10:12:05..    Delete
05/05/2018 10:12:10..
05/05/2018 10:13:11..    List Dir  C:\Temp\
05/05/2018 10:13:21..    Create    C:\Temp\ToExfiltrate
Markov Cluster Graph
- C:\User\Alice\Documents\Project1.doc
- C:\User\Alice\Documents\Project3.doc
- C:\User\Alice\Documents\Project2.doc
- C:\User\Alice\Documents\Administration\..
Vertex Cluster
Markov Cluster Graph What does a Vertex Cluster mean? A set of resources (i.e., files) used to achieve a task Vertex Cluster
Overview
similarityFunction(H, ES) -> [0, 1]
- History H: built from file access logs previously generated
- Event Sequence ES: new list of file access logs
- -> 0: ES not so similar to the history H
- -> 1: ES very similar to the history H
Overview
similarityFunction(H, ES) > t, with threshold t
- True: ES is legitimate
- False: ES is malicious
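The decision rule on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function name `is_legitimate` and the example threshold of 0.5 are assumptions made here.

```python
def is_legitimate(similarity: float, threshold: float) -> bool:
    """Apply the rule similarityFunction(H, ES) > t.

    `similarity` is the value produced by some similarity function in [0, 1]
    and `threshold` is t. Both parameter names are illustrative assumptions.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    return similarity > threshold


# Hypothetical scores and threshold, for illustration only.
print(is_legitimate(0.82, threshold=0.5))  # True: ES is treated as legitimate
print(is_legitimate(0.10, threshold=0.5))  # False: ES is treated as malicious
```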
Implementation
similarityFunction(…)
Markov Cluster Graph
Implementation History Similarity functions (we tried different approaches)
History History of a user U File system logs
History History of a user U Time windows
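A rough sketch of the time-window step, under assumptions: the slides do not say whether windows overlap or where they start, so the code below uses non-overlapping windows anchored at the first event. The `Event` type and `split_into_windows` are hypothetical names, and the sample events reuse the illustrative paths from the Intuition slide.

```python
from datetime import datetime, timedelta
from typing import Dict, List, NamedTuple


class Event(NamedTuple):
    """One file system access log entry (field names are assumptions)."""
    timestamp: datetime
    action: str
    path: str


def split_into_windows(events: List[Event], window: timedelta) -> List[List[Event]]:
    """Group events into consecutive, non-overlapping windows of fixed length.

    Windows are anchored at the earliest event; empty windows are skipped.
    """
    if window <= timedelta(0):
        raise ValueError("window must be positive")
    if not events:
        return []
    ordered = sorted(events, key=lambda e: e.timestamp)
    start = ordered[0].timestamp
    buckets: Dict[int, List[Event]] = {}
    for event in ordered:
        index = (event.timestamp - start) // window  # timedelta // timedelta -> int
        buckets.setdefault(index, []).append(event)
    return [buckets[i] for i in sorted(buckets)]


events = [
    Event(datetime(2018, 5, 5, 10, 10, 5), "Open", "C:\\File1"),
    Event(datetime(2018, 5, 5, 10, 10, 6), "Read", "C:\\File2"),
    Event(datetime(2018, 5, 5, 10, 13, 11), "List Dir", "C:\\Temp\\"),
    Event(datetime(2018, 5, 5, 10, 13, 21), "Create", "C:\\Temp\\ToExfiltrate"),
]
windows = split_into_windows(events, timedelta(minutes=2))
print([len(w) for w in windows])  # [2, 2]
```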
History History of a user U Graphs
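The slides do not state how the events of a time window become a graph, so the following is only one plausible construction, written as an assumption: vertices are file paths, and two files accessed one right after the other are linked by an undirected edge whose weight counts how often that happens. `build_access_graph` is a hypothetical helper.

```python
from collections import Counter
from typing import Counter as CounterType, FrozenSet, List, Tuple


def build_access_graph(paths: List[str]) -> Tuple[List[str], CounterType[FrozenSet[str]]]:
    """Build a weighted undirected graph from the paths accessed in one window.

    Assumption (not from the slides): consecutive accesses to two different
    files add 1 to the weight of the edge between them.
    """
    vertices = sorted(set(paths))
    edges: CounterType[FrozenSet[str]] = Counter()
    for first, second in zip(paths, paths[1:]):
        if first != second:  # ignore repeated access to the same file
            edges[frozenset((first, second))] += 1
    return vertices, edges


vertices, edges = build_access_graph(["a.doc", "b.doc", "a.doc", "c.doc", "c.doc"])
print(vertices)                               # ['a.doc', 'b.doc', 'c.doc']
print(edges[frozenset(("a.doc", "b.doc"))])   # 2
print(edges[frozenset(("a.doc", "c.doc"))])   # 1
```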
History
History of a user U: Markov Cluster Graph
History: a set of Vertex Clusters (VC1, VC2)
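To make the clustering step concrete, here is a small pure-Python sketch of Markov clustering, assuming that "Markov Cluster" refers to the Markov Cluster (MCL) algorithm, which alternates expansion (matrix squaring) and inflation (element-wise power followed by column normalization). The slides give no parameters, so the inflation value, the iteration limit, and the function names are assumptions.

```python
from typing import List


def _normalize_columns(matrix: List[List[float]]) -> List[List[float]]:
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def markov_cluster(adjacency: List[List[float]], inflation: float = 2.0,
                   max_iterations: int = 100, eps: float = 1e-6) -> List[List[int]]:
    """Return vertex clusters (lists of vertex indices) of a small graph."""
    n = len(adjacency)
    # Add self-loops, then turn the adjacency matrix into a stochastic matrix.
    m = _normalize_columns([[float(adjacency[i][j]) + (1.0 if i == j else 0.0)
                             for j in range(n)] for i in range(n)])
    for _ in range(max_iterations):
        # Expansion: take two random-walk steps at once (matrix squaring).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: strengthen strong connections and weaken weak ones.
        inflated = _normalize_columns([[v ** inflation for v in row]
                                       for row in expanded])
        delta = max((abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n)), default=0.0)
        m = inflated
        if delta < eps:
            break
    # Each row with non-zero entries lists the members of one cluster.
    clusters = {frozenset(j for j, v in enumerate(row) if v > eps) for row in m}
    return sorted(sorted(c) for c in clusters if c)


# Two disconnected pairs of files: 0-1 and 2-3.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
print(markov_cluster(adjacency))  # [[0, 1], [2, 3]]
```

The matrix operations are written with plain lists to keep the sketch dependency-free; for graphs of realistic size a numerical library would be the more natural choice.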
Implementation
similarityFunction(HU, ES) -> [0, 1]
History HU
Event Sequence ES: still a list of events
Implementation
Split ES into time windows (as for the history), then extract Vertex Clusters
Event Sequence ESW for time window W
Similarity Function Inside similarityFunction(HU, ESW) -> [0, 1] History HU Event Sequence ESW How to compare H and ES?
Similarity Function Inside
History HU, Event Sequence ESW
- History: a set of Vertex Clusters
- Event Sequence: a set of Vertex Clusters
Similarity Function Inside History HU Event Sequence ESW We built 7 similarity functions based on set comparison operators e.g., equal, subset, superset
Similarity Function Inside
Just an example, the simplest:
SimilarityByEqual(HU, ESW) {
    m <- 0
    for all s in ESW do
        for all h in HU do
            if s == h then
                m <- m + 1
                break
    return m / |ESW|
}
Idea: count how many elements of ESW are contained in HU
Other more complex versions in the paper
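The pseudocode above translates almost directly into Python when each Vertex Cluster is represented as a `frozenset` of file paths. This is a sketch under that representation assumption; returning 0.0 for an empty event sequence is also an assumption, since the slide does not cover that case.

```python
from typing import FrozenSet, Iterable, List


def similarity_by_equal(history: Iterable[FrozenSet[str]],
                        event_sequence: List[FrozenSet[str]]) -> float:
    """Fraction of vertex clusters in the event sequence found in the history."""
    if not event_sequence:
        return 0.0  # assumption: nothing to compare means no similarity
    known = set(history)  # set lookup replaces the inner loop with its break
    matches = sum(1 for cluster in event_sequence if cluster in known)
    return matches / len(event_sequence)


# Hypothetical vertex clusters, for illustration only.
history = [frozenset({"a.doc", "b.doc"}), frozenset({"c.doc"})]
event_sequence = [frozenset({"a.doc", "b.doc"}), frozenset({"d.doc"})]
print(similarity_by_equal(history, event_sequence))  # 0.5
```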
Similarity Function Inside
Something more complex:
SimilarityBySubsetWeight(HU, ESW) {
    m <- 0
    n <- 0
    for all s in ESW do
        for all h in HU do
            if s isSubset h then
                m <- m + |s|/|h|
                n <- n + 1
    m’ <- m/n
    return m’ / |ESW|
}
Idea 2.0: we propose a weighted sum between HU and ESW
Slide annotations: sum of elements ratio; normalization
Other more complex versions in the paper
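A Python rendering of the weighted variant, following the pseudocode on the slide step by step, with Vertex Clusters again represented as `frozenset` objects. Two guards are assumptions added here because the slide does not address them: the function returns 0.0 when no subset match is found (otherwise `m/n` would divide by zero), and empty history clusters are skipped.

```python
from typing import FrozenSet, List


def similarity_by_subset_weight(history: List[FrozenSet[str]],
                                event_sequence: List[FrozenSet[str]]) -> float:
    """Weighted subset similarity between an event sequence and a history."""
    if not event_sequence:
        return 0.0  # assumption: empty input means no similarity
    weight_sum = 0.0  # m in the pseudocode
    match_count = 0   # n in the pseudocode
    for s in event_sequence:
        for h in history:
            if h and s <= h:  # s isSubset h, skipping empty history clusters
                weight_sum += len(s) / len(h)
                match_count += 1
    if match_count == 0:
        return 0.0  # assumption: no match at all means no similarity
    mean_weight = weight_sum / match_count  # m' in the pseudocode
    return mean_weight / len(event_sequence)


# Hypothetical vertex clusters, for illustration only.
history = [frozenset("abcd"), frozenset("e")]
event_sequence = [frozenset("ab"), frozenset("x")]
print(similarity_by_subset_weight(history, event_sequence))  # 0.25
```

In the example only `{a, b}` is a subset of a history cluster, with ratio 2/4, and the event sequence has two clusters, which gives 0.5 / 2.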
Evaluation
It is hard to find a good dataset
- WUIL: The Windows-Users and -Intruder simulations Logs dataset
- TWOS: A Dataset of Malicious Insider Threat Behavior Based on a Gamified Competition
Both contain file system access logs
WUIL Dataset
- Legitimate user activities: from real users
- Masquerader user activities: synthetic
- For each user: 3 sessions of malicious logs, 5 min long each (around 15 min in total)
- Around 70 users in total
TWOS Dataset
- Legitimate user activities: from real users
- Masquerader user activities: from real users too
- For each user: 1 session of malicious logs, 1 hour long
- Around 20 users in total
- User behaviors from a gamified experiment
Setting
- WUIL: time windows of 30s, 1m, 2m
- TWOS: time windows of 10m, 20m, 30m
- 5-fold cross validation
Setting WUIL and TWOS have labeled data User U Legitimate Masquerader
Setting WUIL and TWOS History (training set): only legitimate Test (test set): masquerade and legitimate User U
Setting WUIL and TWOS 5-fold cross validation User U 1 2 Used for making the History 3 4 5 Used for making the Test M
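One way to read this setup is sketched below, as an assumption rather than the authors' exact protocol: the legitimate data of a user is split into five folds, four folds build the History, and the held-out fold is tested together with the masquerade data M. The interleaved fold assignment and the name `five_fold_splits` are choices made for this illustration.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

W = TypeVar("W")  # one time window of events, in any representation


def five_fold_splits(legitimate: Sequence[W], masquerade: Sequence[W],
                     k: int = 5) -> Iterator[Tuple[List[W], List[Tuple[W, bool]]]]:
    """Yield (history, test) pairs for k-fold cross validation of one user.

    `history` holds only legitimate windows. `test` holds (window, is_legitimate)
    pairs: the held-out legitimate fold plus all masquerade windows.
    """
    folds = [list(legitimate[i::k]) for i in range(k)]  # interleaved folds
    for held_out in range(k):
        history = [w for i, fold in enumerate(folds) if i != held_out for w in fold]
        test = [(w, True) for w in folds[held_out]]
        test += [(w, False) for w in masquerade]
        yield history, test


# Hypothetical windows: integers stand for legitimate windows, strings for M.
splits = list(five_fold_splits(list(range(10)), ["m1", "m2"]))
print(len(splits))                            # 5
print(len(splits[0][0]), len(splits[0][1]))   # 8 4
```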
Results Area Under the Curve (AUC) Receiver Operating Characteristic Curve (ROC) Best Configuration (threshold) Efficiency Analysis (time)
Area Under the Curve
On average per user. Sim. Function used: Subset OR Superset w/ weight
- WUIL (2m): 0.944
- TWOS (30m): 0.851
Receiver Operating Characteristic Curve On average per user Sim. Function used: Subset OR Superset w/ weight
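For readers who want to see how an ROC curve and its AUC follow from similarity scores, here is a dependency-free sketch. It assumes that the positive class is the masquerader and that a window is flagged as malicious when its similarity is at most the threshold t, which matches the rule "legitimate if similarity > t". The function names and sample scores are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_points(legit_scores: Sequence[float],
               masq_scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (false positive ratio, true positive ratio) pairs, one per threshold."""
    if not legit_scores or not masq_scores:
        raise ValueError("both score lists must be non-empty")
    points = [(0.0, 0.0)]  # a threshold below every score flags nothing
    for t in sorted(set(legit_scores) | set(masq_scores)):
        tpr = sum(s <= t for s in masq_scores) / len(masq_scores)
        fpr = sum(s <= t for s in legit_scores) / len(legit_scores)
        points.append((fpr, tpr))
    return points  # the largest threshold flags everything, giving (1.0, 1.0)


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical similarity scores for one user, for illustration only.
legit = [0.9, 0.8, 0.7]
masq = [0.1, 0.2, 0.3]
print(round(auc(roc_points(legit, masq)), 6))  # 1.0, the classes are fully separated
```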
Best Configuration
WUIL dataset (real legitimate user activities + synthetic attacks):
                    True Positive Ratio    False Positive Ratio
Our results         95%                    9%
Previous results    91.5%                  11.81%
TWOS dataset (real legitimate user activities + real attacks):
                    True Positive Ratio    False Positive Ratio
Our results         91%                    11%
Previous results    No previous results
Efficiency Analysis
How expensive is the Markov Cluster Graph algorithm?
Mean time to analyze an ES (from a list of logs to a vertex cluster):
- WUIL: 0.015s (2 min TW)
- TWOS: 0.016s (30 min TW)
Future work
- Reducing the False Positive Ratio
- Developing auto-tune techniques
- Trying different types of logs (network, SQL queries, HTTP logs)
THANK YOU