Intelligent Detection of Malicious Script Code
CS194
Benson Luk, Eyal Reuveni, Kamron Farrokh
Advisor: Adnan Darwiche
Sponsored by Symantec
Outline for Project
Phase I: Setup
- Set up machine for testing environment
- Ensure that the “whitelist” is clean
Phase II: Crawling
- Modify crawler to output only necessary data. This means:
  - Grab only necessary information from web crawling results
  - Listen in on Internet Explorer’s JavaScript interpreter and output relevant behavior
Phase III: Database
- Research and develop an effective structure for storing data and link it to the web crawler
Phase IV: Analysis
- Research and develop an effective algorithm for learning from massive amounts of data
Completed Tasks – First Quarter
Phase I
- Configured machine with Norton Antivirus and the Heritrix web crawler
  - The web crawler is used to gather additional URLs, and Norton Antivirus is used to verify that a URL has not launched an attack
- Created a Python script to ensure that visited sites are clean
  - It captures Norton’s web attack logs before and after loading a site in Internet Explorer, compares the logs for new entries, and signals whether the site’s data should be discarded
Phase II
- Configured Heritrix to run specific crawls that target a set of domains and output minimal information
  - The purpose is to gather as many URLs with scripts as possible for a large sample base
- Created a parser for Heritrix logs to filter out irrelevant websites
  - For example, we omit URLs that point to images, since they will not contain scripts
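The two checks described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function names, the list of image extensions, and the assumption that the Norton log can be read as a list of text lines are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical list of extensions treated as "cannot contain scripts".
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".bmp", ".ico")


def new_log_entries(before_lines, after_lines):
    """Return log lines present after the page load but not before it.

    Assumes each attack is logged as a distinct line (for example with
    a timestamp); identical repeated lines would not be detected.
    """
    seen = set(before_lines)
    return [line for line in after_lines if line not in seen]


def site_is_clean(before_lines, after_lines):
    """A site counts as clean if loading it added no new log entries."""
    return not new_log_entries(before_lines, after_lines)


def is_script_candidate(url):
    """Keep a crawled URL unless its path ends in an image extension."""
    path = urlparse(url).path.lower()
    return not path.endswith(IMAGE_EXTENSIONS)
```

In use, the log would be read once before and once after loading the page, and the site's data discarded whenever `site_is_clean` returns `False`.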
Completed Tasks – Second Quarter
Phase I
- Whitelist: integrated a Symantec component that checks whether a visited site is malicious, so all of the data we gather is from clean sources
- Hard drive: installed a 750 GB hard drive
Completed Tasks – Second Quarter
Phase II
- Crawling: We ran a shallow crawl with 200 domains as seeds, which is the current base of our data. The result was 18,500 URLs, which we run through our Script Listening component
Completed Tasks – Second Quarter
Phase II
- Script Listening: received a customizable tool from Symantec that listens to the JavaScript interpreter in Internet Explorer
- We modified it to output the information we need:
  GUID -> DISPID -> ArgType -> ArgVal
Completed Tasks – Second Quarter
Example of data:
DISPID (function) | GUID (object) | # of Args | Arg Type | Arg Value
 | f55f-98b5-11cf-bb82-00aa00bdce0b | 1 | BSTR | 130
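One way to hold the GUID -> DISPID -> ArgType -> ArgVal hierarchy in memory is a set of nested dictionaries. The sketch below is only an illustration under assumed names: `build_call_index`, the tuple layout, and the sample records are made up and are not output from the Symantec tool.

```python
from collections import defaultdict


def build_call_index(records):
    """Group calls as GUID -> DISPID -> ArgType -> list of ArgVal."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for guid, dispid, arg_type, arg_val in records:
        index[guid][dispid][arg_type].append(arg_val)
    return index


# Made-up sample records, for illustration only.
sample = [
    ("guid-a", 10, "BSTR", "hello"),
    ("guid-a", 10, "BSTR", "world"),
    ("guid-a", 11, "INT", 42),
    ("guid-b", 10, "INT", 7),
]
index = build_call_index(sample)
```

Looking up `index["guid-a"][10]["BSTR"]` then returns every string argument seen for that object and function.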
Completed Tasks – Second Quarter
Phase III
- The amount of data we have gathered is too large to use in a database. The plain text file is 4 GB (~50 million function calls), and querying a database of that size is too slow on the computer we have.
- Instead, we are storing the data as a text file and operating on it with Python scripts.
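A text file of this size can be processed one line at a time so that it never has to fit in memory. The following is a minimal sketch, assuming a hypothetical tab-separated layout of GUID, DISPID, ArgType, ArgVal per line; the real file layout is not given here, and the function names are illustrative.

```python
import io
from collections import Counter


def iter_calls(lines):
    """Yield (guid, dispid, arg_type, arg_val) one record at a time."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip lines that do not match the assumed layout
        guid, dispid, arg_type, arg_val = fields
        yield guid, int(dispid), arg_type, arg_val


def count_functions(lines):
    """Count calls per (GUID, DISPID) pair without loading the whole file."""
    counts = Counter()
    for guid, dispid, _arg_type, _arg_val in iter_calls(lines):
        counts[(guid, dispid)] += 1
    return counts


# In-memory stand-in for the large text file, with made-up records.
demo = io.StringIO(
    "guid-a\t10\tBSTR\thello\n"
    "guid-a\t10\tINT\t3\n"
    "guid-b\t7\tINT\t0\n"
    "broken line\n"
)
function_counts = count_functions(demo)
```

With a real file, an open file object can be passed in place of `demo`, since file objects also yield one line at a time.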
Results and Findings – Second Quarter
Phase IV
- We have analyzed data from our first two result sets
  - Crawl with 5 initial seeds
    - 3,476,348 function calls
    - 109 distinct GUIDs, 7,364 GUID-DispID pairs
  - Crawl with 15 initial seeds
    - 3,706,454 function calls
    - 95 distinct GUIDs, 5,575 GUID-DispID pairs
- Looked at the most common functions, the most common int-argument functions, and the distribution of the argument values for these functions
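The statistics listed above (distinct GUIDs, distinct GUID-DispID pairs, most common int-argument functions, and argument value distributions) could be computed in a single pass along these lines. This is a sketch only: the record layout, the `"INT"` type label, and the sample data are assumptions made for illustration.

```python
from collections import Counter


def summarize(calls, top_n=3):
    """Compute summary statistics over (guid, dispid, arg_type, arg_val) records."""
    guids = set()
    pair_counts = Counter()
    int_pair_counts = Counter()
    int_values = {}  # (guid, dispid) -> Counter of integer argument values
    for guid, dispid, arg_type, arg_val in calls:
        guids.add(guid)
        pair_counts[(guid, dispid)] += 1
        if arg_type == "INT":  # hypothetical label for int arguments
            int_pair_counts[(guid, dispid)] += 1
            histogram = int_values.setdefault((guid, dispid), Counter())
            histogram[int(arg_val)] += 1
    return {
        "distinct_guids": len(guids),
        "distinct_pairs": len(pair_counts),
        "top_functions": pair_counts.most_common(top_n),
        "top_int_functions": int_pair_counts.most_common(top_n),
        "int_value_histograms": int_values,
    }


# Made-up records, for illustration only.
calls = [
    ("guid-a", 1, "INT", "5"),
    ("guid-a", 1, "INT", "5"),
    ("guid-a", 1, "INT", "9"),
    ("guid-a", 2, "BSTR", "x"),
    ("guid-b", 1, "INT", "0"),
]
summary = summarize(calls)
```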
Results and Findings – Second Quarter
Function 1:
- GUID: 3050f55d-98b5-11cf-bb82-00aa00bdce0b
- GUID object name: DispHTMLWindow2
- DispID: 1103
- Most popular int-argument function in both result sets
- Mostly random distribution, but signs of regularity
- Results from two sets show significant differences
Results and Findings – Second Quarter
Function 2:
- GUID: 3050f55f-98b5-11cf-bb82-00aa00bdce0b
- GUID object name: DispHTMLDocument
- DispID: 1013
- Second most popular int-argument function in both result sets
- Shows a regular distribution with distinct characteristics
- Results from two sets show significant differences
Results and Findings – Second Quarter
Function 3:
- GUID: 3050f51b-98b5-11cf-bb82-00aa00bdce0b
- GUID object name: DispHTMLIFrame
- DispID:
- Third most popular int-argument function in the 1st result set, 95th most popular in the 2nd result set
- Shows a random distribution with distinct characteristics
- Results are dramatically different between data sets
  - All arguments in the 2nd result set are 0
Results and Findings – Second Quarter
- Found significant differences between the data sets in both the frequencies of specific functions and the arguments of specific functions
- Suspect that the differences result from biases due to the small number of original seeds (5 and 15)
- Ran a much broader crawl (200 seeds) in hopes of getting more general, unbiased results
  - Just from partial results of this crawl (roughly 8,000 websites), we have so far found:
    - A much larger average number of calls to our listener per website
    - A large percentage of function calls that take 0 arguments
  - Will post complete results once the crawl is finished
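Differences in function frequencies between two result sets can be quantified in several ways. As one possible sketch, the code below uses total variation distance and popularity rank; both are choices made here for illustration rather than the method used in the project, and the function names and sample counts are made up.

```python
from collections import Counter


def relative_frequencies(counts):
    """Convert raw call counts into fractions of all calls."""
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()} if total else {}


def total_variation_distance(counts_a, counts_b):
    """0.0 means identical frequency profiles, 1.0 means no overlap."""
    freq_a = relative_frequencies(counts_a)
    freq_b = relative_frequencies(counts_b)
    keys = set(freq_a) | set(freq_b)
    return 0.5 * sum(abs(freq_a.get(k, 0.0) - freq_b.get(k, 0.0)) for k in keys)


def rank_of(counts, key):
    """1-based popularity rank of key, or None if it never occurs."""
    ordered = [k for k, _ in Counter(counts).most_common()]
    return ordered.index(key) + 1 if key in ordered else None


# Made-up call counts for two result sets, for illustration only.
set_a = Counter({"f1": 50, "f2": 30, "f3": 20})
set_b = Counter({"f1": 50, "f2": 50})
distance = total_variation_distance(set_a, set_b)
```

A function that ranks high in one set and low in the other, like Function 3 above, would show up as a large gap between its two `rank_of` values.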
Direction for Next Quarter
- Further analyze the gathered data for patterns
- Compare trends in “normal” data to what occurs in malicious scripts