A Lustrum of Malware Network Communication: Evolution and Insights

CAMPBELL FOSKIN

Background
- Malware analysis is at the forefront of the fight against Internet threats.
- Over the last decade, many systems have been proposed to statically and dynamically analyze malicious software and produce detailed behavioral reports.
- The vast amounts of data collected provide important reputation information about both IP and domain name system (DNS) infrastructure, which plays an important role in the state-of-the-art detection engines used by the security industry.
- Prior work has focused on narrow topics such as the role of cloud providers, the infrastructure behind drive-by downloads, or the domains used by a few malware families.
- Despite the volume of data, little is known about how the infrastructure and methods used by Internet miscreants have evolved over time.

Overview of study
- Conducted a five-year longitudinal study of dynamic analysis traces.
- Analyzed more than 26.8 million unique malware samples and five billion DNS queries, collected from three malware feeds (two commercial, one academic), which provided a unique look at malware networks.
- Largest ever malware classification effort: classify malware samples into families, differentiate PUPs from malware, and correlate domains with malware families.
- First study comparing the network properties of PUP and malware domains.

Data collection
- Malware samples from three datasets, each executed for no more than 5 minutes. Samples with no valid DNS resolution during execution are excluded.
- Passive DNS from a large ISP.
- Timestamped blacklists.
- VirusTotal: analyzes files and URLs submitted by users, scans them with multiple AV engines, and offers an API to query metadata on a malware sample using the sample's hash.
- DGArchive: a dataset of domains from reverse-engineered malware that uses a domain generation algorithm (DGA).

Domain filtering
- Two problems: not everything submitted is malicious, and malware does not interact exclusively with malicious infrastructure. Most domains queried by malware are benign (95%).
- Non-malicious samples: removed samples not flagged as malicious by AV vendors in the VirusTotal dataset.
- Invalid domains: non-existent domains generated by DGAs (domain generation algorithms).
- Benign domains: the hardest category, caused by malware using legitimate services such as Dropbox, testing the Internet connection, downloading content, or sending spam. Removed domains in the Alexa top 10,000 (except dynamic DNS domains), then manually sifted through the most popular domains in the dataset (mostly removing CDNs). To remove spam bots, filtered out any MX lookups and domains with mail keywords (mail, imap, etc.).
- Reverse delegation zones (DNS pointer records): these appear when a program connects directly to an IP address without performing a DNS resolution of a service's domain name. Removed benign PTRs, excluding zones used by large ISPs and hosting providers.
- Implication: because most queried domains are benign, a blacklist approach based on dynamic analysis takes a lot of manual work.
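The filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the study's code: the function name, the keyword tuple, and the "last two labels" shortcut for the e2LD are assumptions (a real pipeline would use the Public Suffix List), and the manual CDN review and reverse delegation handling are left out.

```python
# Assumed keyword list; the slides name only "mail, imap etc".
MAIL_KEYWORDS = ("mail", "imap", "smtp", "pop3")


def keep_domain(domain, qtype, resolved, popular, dyndns_zones):
    """Return True if a queried domain survives the automated filters.

    domain:       queried name, e.g. "c2.example.net"
    qtype:        DNS query type as a string, e.g. "A" or "MX"
    resolved:     True if the query resolved successfully during execution
    popular:      set of popular zones (stand-in for the Alexa top 10,000)
    dyndns_zones: set of dynamic DNS zones that are exempt from the
                  popularity filter
    """
    domain = domain.lower().rstrip(".")
    labels = domain.split(".")

    # Invalid domains: failed resolutions, e.g. unregistered DGA output.
    if not resolved:
        return False

    # Spam bots: MX lookups, or names containing a mail keyword.
    if qtype == "MX":
        return False
    if any(keyword in label for label in labels for keyword in MAIL_KEYWORDS):
        return False

    # Benign popular domains, except dynamic DNS zones.
    zone = ".".join(labels[-2:])  # naive e2LD, an assumption of this sketch
    if zone in popular and zone not in dyndns_zones:
        return False

    return True
```

Keeping dynamic DNS zones out of the popularity filter matters because the zone itself is popular while individual child labels may be malicious.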

Classification of domains
- Sample classification with AVClass, an open-source tool for massive malware labeling. It removes noise from AV labels by addressing label normalization, generic token detection, and alias detection.
- PUP/malware classification: AVClass differentiates between PUPs (potentially unwanted programs) and malware by examining keywords. This classification is conservative.
- e2LD classification: map each e2LD (effective second-level domain) to the family it most likely belongs to. Build a mapping from each e2LD to the number of samples of each family that resolved it, then assign the e2LD to the family with the most samples resolving it. e2LDs resolved by fewer than 10 samples are left unclassified.
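The e2LD-to-family assignment described above amounts to a majority vote with a minimum-support threshold. A minimal sketch, assuming the input is a list of (sample hash, family, e2LD) tuples and that each sample carries exactly one family label. The function name and input shape are hypothetical, and ties are broken arbitrarily here because the slides do not say how they were handled.

```python
from collections import Counter, defaultdict

# From the slides: e2LDs resolved by fewer than 10 samples stay unclassified.
MIN_SAMPLES = 10


def classify_e2lds(resolutions):
    """Map each e2LD to the family with the most distinct samples resolving it.

    resolutions: iterable of (sample_hash, family, e2ld) tuples.
    Returns {e2ld: family, or None when the e2LD is left unclassified}.
    """
    # e2ld -> family -> set of distinct samples that resolved the e2LD
    samples = defaultdict(lambda: defaultdict(set))
    for sample_hash, family, e2ld in resolutions:
        samples[e2ld][family].add(sample_hash)

    labels = {}
    for e2ld, by_family in samples.items():
        counts = Counter({fam: len(s) for fam, s in by_family.items()})
        if sum(counts.values()) < MIN_SAMPLES:
            labels[e2ld] = None  # too few samples to trust the vote
        else:
            labels[e2ld] = counts.most_common(1)[0][0]
    return labels
```

Counting distinct sample hashes rather than raw queries keeps one chatty sample from outvoting many quiet ones.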

MALWARE DOMAIN ANALYSIS
- Malware uses network communication to exfiltrate data, communicate with C&C servers, or download payloads, and often uses DNS to avoid IP blacklisting.
- Studied domain queries to determine the temporal DNS properties of malware samples.
- Dynamic malware analysis: find trends and identify the DNS queries used by malware.
- Passive DNS and blacklists analysis: determine whether the queried domains are malicious by comparing them with other datasets.

Dynamic Malware Analysis: Domain polymorphism
- Domain polymorphism: most malware uses different domains over time to avoid DNS blacklisting.
- Most domains are used only once, by a single malware sample.
- Blacklisting the malware domains observed during dynamic analysis therefore does little to prevent future communication from newly discovered samples, so relying on DNS queries from analyzed malware is not a viable threat mitigation.
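One way to quantify domain polymorphism is the share of domains that were queried by exactly one sample. A minimal sketch, assuming the traces are available as (sample hash, domain) pairs. The function name and input shape are hypothetical, not taken from the study.

```python
from collections import defaultdict


def single_use_fraction(queries):
    """Return the share of domains queried by exactly one distinct sample.

    queries: iterable of (sample_hash, domain) pairs from dynamic analysis.
    """
    samples_per_domain = defaultdict(set)
    for sample_hash, domain in queries:
        samples_per_domain[domain].add(sample_hash)

    if not samples_per_domain:
        return 0.0

    single = sum(1 for s in samples_per_domain.values() if len(s) == 1)
    return single / len(samples_per_domain)
```

A value close to 1.0 would mean a blacklist built from past executions rarely matches what a new sample queries.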

Dynamic Malware Analysis: Dynamic DNS
- Dynamic DNS allows nameservers to be automatically updated with frequently changing information. Many publicly available services provide this functionality.
- Used by 32% of all malware samples with DNS queries.
- Dynamic DNS domains are commonly used across many malware samples, and evasion is performed on the child label of the domain.
- Zone-level blocking also blocks legitimate users.

Dynamic Malware Analysis: CDNs
- CDNs serve content from multiple, geographically distributed data centers to provide increased performance and availability. They are widely used on the Internet, for example to reduce the effect of outages.
- Massive discrepancy between the most and least queried CDNs. The top ones are some of the most common and popular CDNs.
- The large number of child labels, combined with potentially benign usage, allows malicious content hosted in a CDN to effectively hide in plain sight.

Passive DNS and Blacklists Analysis
- Correlated the domains gathered from dynamic analysis with three datasets: a passive DNS dataset from a US ISP, public DNS-based blacklists, and a set of domain expiration events.
- Determined the lag between when a domain is discovered and when it is first resolved in passive DNS.
- Determined the effectiveness of public blacklists.

First appearance
- Aimed to determine the effectiveness of public blacklists, whose entries are added through manual inspection or dedicated servers.
- Only 30% of the entries were already in blacklists before our dynamic analysis.
- 20% were reported with a delay of over days.
- This suggests delays could be reduced by relying on malware analysis like this to populate domain blacklists.
- Even then some domains would be missed: many were active for weeks before being analyzed. With 95% of queried domains benign, this still takes a lot of manual work.
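The blacklist comparison above reduces to a per-domain lag between the date a domain first showed up in dynamic analysis and the date it was blacklisted. A minimal sketch under assumed inputs: the function name and record layout are hypothetical, and the delay threshold is a required parameter because the slide does not preserve the number of days.

```python
from datetime import date


def blacklist_lag_summary(records, threshold_days):
    """Summarize how blacklisting dates compare with analysis dates.

    records: iterable of (analysis_date, blacklist_date) pairs, one per
             blacklisted domain.
    Returns the share of domains already listed when first analyzed and
    the share listed more than threshold_days after analysis.
    """
    lags = [(listed - analyzed).days for analyzed, listed in records]
    if not lags:
        return {"already_listed": 0.0, "late": 0.0}

    n = len(lags)
    return {
        # Lag <= 0: the blacklist had the domain on or before analysis day.
        "already_listed": sum(lag <= 0 for lag in lags) / n,
        "late": sum(lag > threshold_days for lag in lags) / n,
    }
```

Both inputs need timestamps, which is why the study relied on timestamped blacklists rather than a current snapshot.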

Domain lifetime
- Lifetime: the difference between the first and last seen dates of a domain in passive DNS.
- All three types of domains (malware, PUP, and unclassified) frequently have long lifetimes.
- Three hotspots correspond to the most prevalent resolution behaviors for domains in malware and unclassified malicious software. Bottom left: a large number of domains that are short lived and rarely resolved. Top right: long lived and frequently resolved. Bottom right: long lifetime but infrequent resolution. The similarity suggests the unclassified domains are likely all malware.
- PUPs have been rising in prevalence over the last 2-3 years, so their domains sit at the lower end of the lifetime axis. The diagonal shows intense and continuing resolution of PUP domains, a result of organizations failing to block them.
- Since we showed that most DNS resolutions occur only once, long lifetimes mean many samples remain active on the Internet for extended periods of time.
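The lifetime definition above, together with the resolution count that forms the second axis of the hotspot plot, can be computed from passive DNS observations as sketched below. The input shape (one (domain, date) pair per observed resolution) and the function name are assumptions for illustration.

```python
from datetime import date


def domain_lifetimes(observations):
    """Compute lifetime in days and resolution count for each domain.

    observations: iterable of (domain, date_seen) pairs from passive DNS,
                  one pair per observed resolution.
    Returns {domain: (lifetime_days, resolution_count)}.
    """
    first, last, count = {}, {}, {}
    for domain, day in observations:
        first[domain] = min(first.get(domain, day), day)
        last[domain] = max(last.get(domain, day), day)
        count[domain] = count.get(domain, 0) + 1

    # Lifetime is the gap between the first and last sighting.
    return {d: ((last[d] - first[d]).days, count[d]) for d in first}
```

A domain seen only once gets a lifetime of zero days, which places it in the bottom-left hotspot of short-lived, rarely resolved domains.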

INFRASTRUCTURE ANALYSIS
- Analyzed the hosting infrastructure of domains used by malware, focusing on IP ranges over time.
- Looked at the IP subnets resolved by samples over the five years, counting the samples with domains resolving on each subnet in each year.
- Used the e2LD mappings created during classification to assign spikes to malware/PUP families.

- PUP families used stable infrastructure over time, which indicates that popular cloud hosters are more lenient with them and do not ban them the way they ban malware.
- Sinkholes: for example, the Microsoft-led initiative on domains using no-ip DNS.
- A third group of spikes is caused by multiple malware families rather than a single one. These likely correspond to hosting providers that had a more open policy on acceptable behavior during that timeframe, so the families move around different subnets each year.
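The per-subnet, per-year view behind these spikes can be approximated by bucketing resolved addresses into subnets and counting distinct samples. A minimal sketch using Python's standard `ipaddress` module. The /24 prefix length, the function name, and the input layout are assumptions, since the slides only say "IP subnets".

```python
import ipaddress
from collections import defaultdict


def samples_per_subnet_year(resolutions, prefix=24):
    """Count distinct samples resolving into each subnet in each year.

    resolutions: iterable of (sample_hash, year, ipv4_address_string).
    prefix:      subnet prefix length (assumed /24 in this sketch).
    Returns {(subnet, year): number_of_distinct_samples}.
    """
    seen = defaultdict(set)
    for sample_hash, year, ip in resolutions:
        # strict=False masks the host bits, e.g. 192.0.2.77 -> 192.0.2.0/24
        subnet = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        seen[(str(subnet), year)].add(sample_hash)

    return {key: len(samples) for key, samples in seen.items()}
```

A subnet that keeps a high count across consecutive years matches the stable PUP pattern, while counts that jump between subnets from year to year match the third group of spikes.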

Domain generation algorithms
- Reintroduced failed resolutions into the dataset and checked them against DGArchive.
- 44% (3 M) of e2LDs in malware executions were generated by DGAs.
- This is a lower bound, since DGArchive will miss some families.
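The DGArchive check is essentially a set-membership test. A minimal sketch, assuming both the observed e2LDs and the DGArchive domains are available as plain collections of strings. The function name is hypothetical and this does not use any real DGArchive API.

```python
def dga_share(observed_e2lds, dga_domains):
    """Return the share of distinct observed e2LDs present in a DGA set.

    observed_e2lds: iterable of e2LDs seen in malware executions,
                    including failed resolutions.
    dga_domains:    collection of known DGA-generated domains
                    (stand-in for DGArchive).
    """
    observed = set(observed_e2lds)
    if not observed:
        return 0.0
    return len(observed & set(dga_domains)) / len(observed)
```

Any DGA family missing from the reference set pushes the result down, which is why the reported figure is a lower bound.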

Criticisms and improvements
- US-centric: the study used US ISP data. This is hard to avoid, but it would be good to see whether trends differ in other parts of the world.
- Long-running or dormant malware is not analyzed, and it might behave differently.
- Static analysis could help where a DGA is not used, and could help catch long-running or dormant samples.
- Prototype a general-use tool that uses this technique.
- Benign network communications may have patterns of behavior: what is downloaded, how often, at what times, and whether at regular intervals.
- ML classification could assist in the analysis and in classifying families and behaviors by training a model on known malware behavior. For example, many unclassified samples behave like malware rather than PUPs, and ML could confirm this with a probability.

Conclusions
- Collected, filtered, and analyzed 26.8 million records from dynamic malware executions.
- Made several observations about the behavior and temporal properties of the domains used by these samples.
- PUPs are becoming more common and use stable infrastructure: several hundred thousand PUP samples use the same network infrastructure over an entire year, and they are not treated the same as most malware.
- Malware detection based on network communication is only marginally effective: domains are not detected until weeks, or even months, after they become active.
- Most queried domains are benign, so filtering takes a lot of manual work.
