Design of an Open-Relay-Based DNS Blacklist System

Problem
- Current DNSBLs offer only a binary blacklist: an IP is either listed or not listed
- They cannot handle grey IPs (IPs that send both spam and normal email)
- They are not real-time (there is a delay before an IP is blacklisted)
- They rely on people's reports and other ad hoc methods

Goal
- Improve efficiency based on open relay data
- Offer a more flexible blacklisting method that allows classification of previously unseen IPs
- Provide a systematic way to take advantage of IP locality, content similarity, and other features of open relay data
- Remain backward compatible with existing systems

Advantages of using open relay data
- All emails received at the open relay are guaranteed to be spam
  - This avoids inaccuracy in learning and classification
  - Traditional learning methods used in spam detection (e.g., SpamAssassin) can suffer from high false-positive rates
- Emails similar to those received at the open relay are likely to be spam
- An open relay can observe higher IP locality and content similarity within each spam campaign
  - Spammers use an open relay to target different domains within the same spam campaign
  - Multiple bots involved in the same spam campaign may also use the same open relay

Design of the system
Current DNSBL system
1. Take the client's IP address, reverse the order of its bytes, and append the DNSBL's domain name: 192.168.42.23 becomes 23.42.168.192.dnsbl.example.net.
2. Look up this name in the DNS as a domain name ("A" record). This returns either an address, indicating that the client is listed, or an "NXDOMAIN" ("No such domain") code, indicating that the client is not.
3. Optionally, if the client is listed, look up the name as a text record ("TXT" record). Most DNSBLs publish information about why a client is listed as TXT records.
Backward compatibility
- Make use of the TXT record to return our "signatures"
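
The query-name construction in step 1 can be sketched as follows. This is a minimal illustration for IPv4 only; the function name `dnsbl_query_name` is hypothetical, and the zone `dnsbl.example.net` is just the example domain from the slide. The actual A and TXT lookups are left to whichever resolver library the client already uses.

```python
import ipaddress


def dnsbl_query_name(client_ip: str, zone: str) -> str:
    """Build the DNSBL lookup name for an IPv4 client address.

    The octets are reversed and the DNSBL zone is appended,
    as in the conventional DNSBL query format.
    """
    # Validate the input; raises ValueError for a malformed address.
    addr = ipaddress.IPv4Address(client_ip)
    reversed_octets = reversed(str(addr).split("."))
    return ".".join(reversed_octets) + "." + zone.strip(".")


print(dnsbl_query_name("192.168.42.23", "dnsbl.example.net"))
```

A client would look this name up first as an A record and then, under the proposed design, as a TXT record to fetch the signatures.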

Return signatures for a DNSBL query
- Two cases
  - The query IP is listed in the blacklist
  - The query IP is not listed in the blacklist
- Two types of signatures
  - IP signature
    - The set of IPs known to be bad
    - Use a Bloom filter for efficiency, i.e., to eliminate the need for repeated queries (false positives?)
  - Content signatures
    - URL and/or mail content
    - Used to match emails within the same spam campaign, which often share great similarity in content
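
To make the IP signature concrete, here is a small Bloom filter sketch over IPv4 addresses. The class name, the default sizes (2048 bits, 4 hash functions), and the use of seeded SHA-256 as the hash family are all assumptions made for illustration; the slides do not specify any of them.

```python
import hashlib
import ipaddress


class BloomFilter:
    """Fixed-size Bloom filter for IPv4 addresses (illustrative sizes)."""

    def __init__(self, num_bits: int = 2048, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, ip: str):
        # Derive num_hashes bit positions by hashing the packed
        # address with a different one-byte seed each time.
        packed = ipaddress.IPv4Address(ip).packed
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + packed).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, ip: str) -> None:
        for pos in self._positions(ip):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, ip: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(ip)
        )


bad_ips = BloomFilter()
for ip in ("203.0.113.7", "198.51.100.24"):
    bad_ips.add(ip)
print("203.0.113.7" in bad_ips)
```

A client holding this filter can test further sender IPs locally without another DNS query, accepting that a positive answer may occasionally be a false positive.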

Details: the query IP is listed in the blacklist
- Return an A record indicating that the IP is blacklisted
  - The A record could use a special IP address that tells the client more information can be retrieved from the TXT record
- Return a TXT record that includes
  - A Bloom filter for the IPs observed at the open relay (IPs in the same campaign as the query IP, or all IPs that connected to the open relay?)
    - The false-positive rate increases as the Bloom filter holds more elements
  - The TTL of the Bloom filter (some IPs may be removed from the list, so the filter must be updated or discarded once it is too out of date)
  - Content signatures for the spam emails of the same spam campaign as the query IP
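
The trade-off between filter size and false positives can be estimated with the standard approximation p ≈ (1 − e^(−kn/m))^k, where n is the number of inserted IPs, m the number of bits, and k the number of hash functions. The helper below is a sketch; its name and the sample sizes are illustrative only and are not taken from the slides.

```python
import math


def bloom_false_positive_rate(num_items: int, num_bits: int, num_hashes: int) -> float:
    """Approximate false-positive probability of a Bloom filter."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


# Same filter size, growing number of stored IPs (sample values).
for n in (50, 100, 200, 400):
    print(n, bloom_false_positive_rate(n, num_bits=2048, num_hashes=4))
```

This is one way to weigh the open question on the slide: a filter covering only the campaign of the query IP holds fewer elements, and so has a lower false-positive rate, than one covering every IP seen at the relay.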

Details: the query IP is NOT listed in the blacklist
- Do not return an A record, for compatibility
- Allow the client to query the TXT record, which returns:
  - A Bloom filter for all IPs observed at the open relay
  - The TTL of the Bloom filter
  - Content signatures
    - For the most popular spam campaigns
    - For the spam campaigns observed from IPs that are close to the query IP (IP locality)
  - The score for the IP (as proposed in Pathak's email)

Content Signatures
- Goal
  - Allow the mail server to stop spam from the same campaign, which shares similarity in content
- URL?
  - How to handle URL redirection
  - Obfuscation
- Mail body
  - Contains noise and obfuscation
  - Latent semantic analysis for noise reduction
  - A signature takes the form of a document vector in concept space
  - Similarity is computed as the cosine between document vectors in the concept space

Latent Semantic Analysis
- A method to summarize the semantics of a text corpus conceptually; it maps documents into a concept space so that they can be correlated based on conceptual meaning (better accuracy than literal matching)
- Applications:
  - Compare documents in the concept space
  - Document classification
  - Find similar documents across languages
  - Find relations between terms
  - Given a query of terms, translate it into the concept space and find matching documents
- Capable of simulating a variety of human cognitive phenomena
- Widely used in search engines for finding documents with similar concepts (Latent Semantic Indexing)

Latent Semantic Analysis
- Term-document matrix
  - m terms and n documents give an m × n matrix
  - The matrix is sparse
  - The elements of the matrix are weighted by tf-idf (term frequency–inverse document frequency)
- LSA transforms the occurrence matrix into concept space by SVD (singular value decomposition): $X = U \Sigma V^T$
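
As a rough illustration of this step, the sketch below builds a small tf-idf term-document matrix and factors it with NumPy. The whitespace tokenizer, the particular tf-idf variant (relative term frequency times log(n/df)), and the four sample documents are assumptions chosen for brevity; a real deployment would likely use a more careful tokenizer and weighting.

```python
import numpy as np


def tfidf_matrix(docs):
    """Return (vocab, X) where X is the m-terms by n-documents tf-idf matrix.

    Assumes every document contains at least one token.
    """
    vocab = sorted({term for doc in docs for term in doc.split()})
    index = {term: i for i, term in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for term in doc.split():
            counts[index[term], j] += 1
    tf = counts / counts.sum(axis=0, keepdims=True)   # relative term frequency
    df = (counts > 0).sum(axis=1, keepdims=True)      # document frequency
    idf = np.log(len(docs) / df)
    return vocab, tf * idf


docs = [
    "cheap pills buy now",
    "buy cheap pills online now",
    "meeting agenda for monday",
    "monday meeting notes",
]
vocab, X = tfidf_matrix(docs)

# Thin SVD: X = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(X.shape, s)
```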

LSA cont.
- U and V are orthonormal matrices
- Σ is a diagonal matrix whose diagonal entries are called the singular values
- We then choose the k largest singular values and their corresponding singular vectors
  - This gives the rank-k approximation to X with the smallest error
  - It translates the term and document vectors into a concept space

Rank lowering (dimension reduction)
- Find a low-rank approximation to the term-document matrix
  - Reduces computing cost
  - Reduces noise
    - The original matrix could be noisy
    - The approximated matrix is interpreted as a de-noised matrix
  - Merges dimensions associated with terms that have similar meanings
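
The rank-lowering step can be sketched with a truncated SVD. The random 6 × 5 matrix and the choice k = 2 below are placeholders rather than values from the slides. The check at the end uses the standard property that the Frobenius-norm error of the best rank-k approximation equals the root of the sum of squares of the discarded singular values.

```python
import numpy as np


def rank_k_approximation(X, k):
    """Return the rank-k truncated-SVD approximation of X and all singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    return X_k, s


rng = np.random.default_rng(0)
X = rng.random((6, 5))          # stand-in for a term-document matrix
k = 2
X_k, s = rank_k_approximation(X, k)

error = np.linalg.norm(X - X_k)  # Frobenius norm of the residual
print(error, np.sqrt(np.sum(s[k:] ** 2)))
```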

Document comparison
- Problem: given a new document, compare it to your documents in the concept space and find the similarity
1. Translate the new document $d_j$ into concept space using the same transformation, $\hat{d}_j = \Sigma_k^{-1} U_k^T d_j$; the vector $\hat{d}_j$ gives the relation between document j and each concept.
2. Compute the cosine similarity $\dfrac{\hat{d}_i \cdot \hat{d}_j}{\lVert \hat{d}_i \rVert \, \lVert \hat{d}_j \rVert}$
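
The two steps can be sketched as below, assuming the usual LSA fold-in formula (multiply by the transposed term vectors, then divide by the retained singular values). The function names, the random matrix, and k = 3 are illustrative. As a sanity check, folding in a column that was already part of X reproduces that document's row of V_k.

```python
import numpy as np


def fold_in(d, U_k, s_k):
    """Map a term-space document vector d into the k-dimensional concept space."""
    return (U_k.T @ d) / s_k


def cosine(a, b):
    """Cosine similarity between two concept-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(1)
X = rng.random((8, 5))                      # stand-in term-document matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 3
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

d_hat = fold_in(X[:, 0], U_k, s_k)          # fold in the first document
print(cosine(d_hat, V_k[0]))
```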

Use of LSA in our system
- Each spam campaign
  - Shows similarity in content, but with small variations (i.e., added noise to evade content-based filtering)
  - Use LSA for noise reduction
  - Each campaign corresponds to one document vector in the concept space ($v_{sc}$)
- When the mail server receives a new email, it translates the email content into another vector ($v_{email}$) in the concept space
  - Compute the similarity between the two vectors (cosine)
  - If they are similar enough, the email is likely to come from the same campaign
- Since machine learning is already used for spam filtering and the transformation matrix can be precomputed, the overhead is acceptable

Put it together
- The open relay server constructs the document matrix from all the spam emails it has received
- Apply SVD to get U, Σ, V and select an appropriate k
- Separate the different campaigns and find the document vectors, in the concept space, of the spam emails within each campaign (they should be quite similar to each other, with small variations)
- Generate a vector ($v_{sc}$) for each specific campaign (a simple way is to take the average of all its vectors)
- The signature for the spam content of a campaign consists of $v_{sc}$, $\Sigma_k$, and the term vectors
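
The "simple way" of generating a campaign vector, averaging the concept-space vectors of its emails, might look like the sketch below. The function name and the two-dimensional sample vectors are hypothetical; how emails are grouped into campaigns in the first place is not covered here.

```python
def campaign_vector(doc_vectors):
    """Component-wise mean of the concept-space vectors of one campaign."""
    n = len(doc_vectors)
    dim = len(doc_vectors[0])
    return [sum(v[i] for v in doc_vectors) / n for i in range(dim)]


# Two sample emails from one campaign, already mapped into concept space.
v_sc = campaign_vector([[1.0, 2.0], [3.0, 4.0]])
print(v_sc)
```

Averaging is only one option; a median or a medoid would be less sensitive to a few outlying emails, at the cost of a slightly more involved computation.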

Put it together
- Generate one signature for each campaign
- The signatures are put in the TXT record for the client to query
- On receiving a new email, the client computes the similarity using the algorithm above and decides whether to mark it as spam

Advantages of this approach
- A flexible DNSBL that accounts for grey IPs
- A centralized server allows information aggregation, so learning performs better than the traditional method in which each host learns by itself
- Data received at the open relay exhibits more IP and content locality, which facilitates clustering
- All data received by the open relay is spam, bypassing the need for learning and classification
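
On the client side, the decision could be sketched as a nearest-signature check with a similarity cutoff. The threshold of 0.9, the function names, and the campaign identifiers are assumptions for illustration; the slides leave "similar enough" unspecified. The email vector is assumed to have been mapped into concept space already.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_campaign(v_email, signatures, threshold=0.9):
    """Return the id of the most similar campaign at or above threshold, else None."""
    best_id, best_sim = None, threshold
    for campaign_id, v_sc in signatures.items():
        sim = cosine(v_email, v_sc)
        if sim >= best_sim:
            best_id, best_sim = campaign_id, sim
    return best_id


signatures = {"campaign-a": [1.0, 0.0], "campaign-b": [0.0, 1.0]}
print(best_campaign([0.99, 0.05], signatures))
print(best_campaign([1.0, 1.0], signatures))
```

An email that matches no campaign is not necessarily legitimate, so a `None` result would normally fall through to the server's existing filters.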

Problems
- The SVD needs to be recomputed when new spam arrives
  - Incremental computation of the SVD?
  - How to push the updated parameters to the client
- How to avoid sending repeated signatures to the client?
  - Let the client indicate whether it runs the new system, as well as the version number of its current signatures?
- One open relay is only one vantage point
  - We probably need a distributed system in which multiple open relays cooperate

Problems (cont.)
- IPs that use an open relay to send spam are likely to use other open relays rather than connect directly to mail servers; in that case, blacklisting these IPs might not be very useful
  - Combine with traditional methods?
- The Chinese language has no clear separation between words