1
Design of an open relay based DNS blacklist system
2
Open relay based DNS blacklist system
Problem
- Current DNSBLs offer only a binary blacklist
  - An IP is either listed or not listed
  - Unable to handle grey IPs (IPs that have sent both spam and normal emails)
- Not real-time (delay in blacklisting IPs)
- Rely on people's reports and other ad hoc methods
Goal
- Improve efficiency based on open relay data
- Offer a more flexible blacklist method that allows classification of previously unseen IPs
- A systematic way to take advantage of IP locality, content similarity, and other features of open relay data
- Backward compatibility with existing systems
3
Advantage of using open relay data
- All emails received at an open relay are guaranteed to be spam
  - Avoids inaccuracy in learning and classification
  - Traditional learning methods used in spam detection (e.g., SpamAssassin) can suffer from high false positives
- Emails similar to those received at the open relay are likely to be spam
- An open relay can observe higher IP locality and content similarity within each spam campaign
  - Spammers use an open relay to target different domains within the same spam campaign
  - Multiple bots involved in the same spam campaign may also use the same open relay
4
Design of the system
Current DNSBL system
1. Take the client's IP address, reverse the bytes, and append the DNSBL's domain name (e.g. dnsbl.example.net).
2. Look up this name in the DNS as a domain name ("A" record). This returns either an address, indicating that the client is listed, or an "NXDOMAIN" ("No such domain") code, indicating that the client is not.
3. Optionally, if the client is listed, look up the name as a text record ("TXT" record). Most DNSBLs publish information about why a client is listed as TXT records.
Backward compatibility
- Make use of the TXT record to return our "signatures"
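The lookup steps above can be sketched in Python. This is a minimal illustration, not part of the proposed system: the function names are hypothetical, the zone `dnsbl.example.net` is the placeholder from the slide, and only IPv4 is handled. The TXT lookup is omitted because the standard library has no TXT resolver.

```python
import ipaddress
import socket


def dnsbl_query_name(client_ip: str, zone: str = "dnsbl.example.net") -> str:
    """Reverse the IPv4 octets and append the DNSBL zone name."""
    # IPv4Address validates the input and raises ValueError on bad addresses.
    octets = str(ipaddress.IPv4Address(client_ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(client_ip: str, zone: str = "dnsbl.example.net") -> bool:
    """Return True if the A lookup resolves, False if resolution fails.

    Note: socket.gaierror covers NXDOMAIN but also other resolver failures,
    so a production client would need to tell these cases apart.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(client_ip, zone))
        return True
    except socket.gaierror:
        return False
```

For example, `dnsbl_query_name("192.0.2.99")` builds the name `99.2.0.192.dnsbl.example.net`, which is then looked up as an A record.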
5
Return Signatures for DNSBL query
Two cases
- Query IP is listed in the blacklist
- Query IP is not listed in the blacklist
Two types of signatures
- IP signature
  - Set of IPs known to be bad
  - Use a Bloom filter for efficiency, i.e. eliminate the need for repeated queries (false positives?)
- Content signatures
  - URL and/or mail content
  - For matching emails within the same spam campaign, which often share great similarity in content
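To make the IP signature concrete, here is a small Bloom filter sketch. The slides do not specify the filter size, the number of hash functions, or the hashing scheme, so the defaults and the salted SHA-256 construction below are assumptions chosen only for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over strings (e.g. dotted-quad IP addresses)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bad_ips = BloomFilter()
bad_ips.add("192.0.2.1")
```

A client holding `bad_ips` can test `"192.0.2.1" in bad_ips` locally instead of issuing another DNS query. A negative answer is always correct, while a positive answer may be a false positive.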
6
Details: if the query IP is listed in the blacklist
- Return an A record to indicate that the IP is blacklisted
  - The A record could use a special IP address telling the client that more information can be retrieved from the TXT record
- Return the TXT record, including:
  - Bloom filter for all IPs observed at the open relay (IPs in the same campaign as the query IP, or all IPs that connected to the open relay?)
    - False positives will increase if the Bloom filter contains more elements
  - TTL of the Bloom filter (some IPs may be removed from the list, so the Bloom filter must be updated or discarded if it is too out of date)
  - Content signatures for the spam emails of the same spam campaign as the query IP
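The concern about false positives growing with the number of elements can be quantified with the standard approximation p ≈ (1 - e^(-kn/m))^k, where m is the number of bits, k the number of hash functions, and n the number of inserted items. The sketch below uses an assumed 8192-bit filter with 4 hashes purely as an example. These sizes do not come from the slides.

```python
import math


def bloom_false_positive_rate(num_bits: int, num_hashes: int, num_items: int) -> float:
    """Approximate false-positive probability (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


# Compare a per-campaign filter (few IPs) with an all-IPs filter (many IPs).
for n in (100, 1000, 10000):
    rate = bloom_false_positive_rate(num_bits=8192, num_hashes=4, num_items=n)
    print(f"{n:>6} IPs -> estimated false-positive rate {rate:.4f}")
```

This is one way to weigh the open question on the slide: a filter restricted to the query IP's campaign stays accurate at a small size, whereas a filter over every IP seen at the relay needs more bits to keep the same error rate.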
7
Details: if the query IP is NOT listed in the blacklist
- Don't return an A record, for compatibility
- Allow the client to query the TXT record, which returns:
  - Bloom filter for all IPs observed at the open relay
  - TTL of the Bloom filter
  - Content signatures
    - For the most popular spam campaigns
    - For the spam campaigns observed from IPs that are close to the query IP (IP locality)
  - The score for the IP (as proposed by Pathak)
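One simple reading of "IPs that are close to the query IP" is sharing a network prefix. The sketch below assumes a /24 prefix and a hypothetical mapping from campaign names to observed source IPs. The slides define neither, so both are illustrative choices.

```python
import ipaddress


def same_prefix(ip_a: str, ip_b: str, prefix_len: int = 24) -> bool:
    """True if both addresses fall inside the same prefix_len-bit network."""
    network = ipaddress.ip_network(f"{ip_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(ip_b) in network


def nearby_campaigns(query_ip: str, campaign_ips: dict, prefix_len: int = 24) -> list:
    """Names of campaigns that have at least one source IP near query_ip."""
    return sorted(
        name
        for name, ips in campaign_ips.items()
        if any(same_prefix(query_ip, ip, prefix_len) for ip in ips)
    )


# Hypothetical observations at the open relay (documentation address ranges).
observed = {
    "campaign-a": ["192.0.2.10", "192.0.2.77"],
    "campaign-b": ["198.51.100.5"],
}
selected = nearby_campaigns("192.0.2.200", observed)
```

Here `selected` contains only `campaign-a`, so its content signatures would be the ones returned for the unlisted query IP.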
8
Content signatures
Goal
- Allow the mail server to stop spam from the same campaign, which shares similarity in content
URL?
- How to handle URL redirection
- Obfuscation
Mail body
- Contains noise and obfuscation
- Latent semantic analysis for noise reduction
- A signature takes the form of a document vector in concept space
- Similarity is computed as the cosine between document vectors in the concept space
9
Latent Semantic Analysis
- A method to summarize the semantics of a corpus of text conceptually, mapping documents into a concept space so that they can be correlated by conceptual meaning (better accuracy than literal matching)
- Applications:
  - Compare documents in the concept space
  - Document classification
  - Find similar documents across languages
  - Find relations between terms
  - Given a query of terms, translate it into the concept space and find matching documents
- Capable of simulating a variety of human cognitive phenomena
- Widely used in search engines for finding documents with similar concepts (Latent Semantic Indexing)
10
Latent Semantic Analysis
- Term-document matrix
  - m terms and n documents -> m × n matrix
  - Sparse matrix
  - The elements of the matrix are weighted by tf-idf (term frequency–inverse document frequency)
- LSA transforms the occurrence matrix into concept space by SVD (singular value decomposition)
  - X = U Σ V^T
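As an illustration of the term-document matrix, the sketch below builds an m × n tf-idf matrix from whitespace-tokenized text. The weighting used here (raw term count times log(n / document frequency)) is one common tf-idf variant chosen as an assumption. The slides do not say which variant is intended, and the sample messages are made up.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (terms, X) where X[i][j] is the tf-idf weight of term i in document j."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = {t: sum(1 for c in counts if t in c) for t in terms}
    X = [[c[t] * math.log(n / df[t]) for c in counts] for t in terms]
    return terms, X


# Made-up messages used only to exercise the function.
terms, X = tfidf_matrix([
    "buy pills now",
    "buy watches now",
    "meeting at noon",
])
```

Each row of `X` corresponds to one term and each column to one message. Most entries are zero, which is why the matrix is described as sparse.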
11
LSA cont.
- U and V are orthonormal matrices
- Σ is a diagonal matrix whose diagonal entries are called the singular values
- Then we choose the k largest singular values and their corresponding singular vectors
  - This gives the rank-k approximation to X with the smallest error
  - It translates the term and document vectors into a concept space
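The rank-k truncation can be sketched with NumPy. The toy matrix and the choice k = 2 are assumptions for illustration. `numpy.linalg.svd` returns the singular values already sorted in descending order, so keeping the first k columns and rows selects the k largest.

```python
import numpy as np


def truncated_svd(X, k):
    """Keep the k largest singular values and their singular vectors."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


# Toy term-document matrix: 4 terms (rows) x 3 documents (columns).
X = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.9, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.1, 1.0],
])

U_k, s_k, Vt_k = truncated_svd(X, k=2)
X_k = U_k @ np.diag(s_k) @ Vt_k          # rank-2 approximation of X
error = np.linalg.norm(X - X_k)          # Frobenius norm of what was discarded
```

The Frobenius error of the truncation equals the square root of the sum of the squared discarded singular values, which is the sense in which the approximation has the smallest error among all rank-k matrices.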
12
Rank lowering (dimension reduction)
- Find a low-rank approximation to the term-document matrix
- Reduce computing cost
- Reduce noise
  - The original matrix could be noisy
  - The approximated matrix is interpreted as a denoised matrix
- Merge dimensions associated with terms that have similar meanings
13
Document comparison
Problem: given a new document, compare it to your documents in the concept space and find the similarity
1. Translate the new document into the concept space
   - Use the same transformation
   - The resulting vector gives the relation between document j and each concept
2. Compute the cosine similarity d_i · d_j / (|d_i| |d_j|)
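The two steps can be sketched as follows. The transformation used here is the standard LSA fold-in, d̂ = Σ_k^{-1} U_k^T d, which is an assumption about the formula the slide refers to. The toy matrix, k = 2, and the function names are likewise illustrative.

```python
import numpy as np


def fold_in(doc_term_vector, U_k, s_k):
    """Map a term-space document vector into the k-dimensional concept space."""
    return (U_k.T @ doc_term_vector) / s_k


def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# Toy term-document matrix: 4 terms (rows) x 3 documents (columns).
X = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.9, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.1, 1.0],
])
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k, s_k = U[:, :k], s[:k]
docs_k = Vt[:k, :].T                      # one concept-space row per document

new_doc = np.array([1.0, 1.0, 0.0, 0.0])  # uses the same terms as document 0
v_new = fold_in(new_doc, U_k, s_k)
similarities = [cosine(v_new, d) for d in docs_k]
```

Folding in a column of `X` reproduces that document's existing concept-space vector, which is a useful sanity check that the client and the server apply the same transformation.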
14
Use of LSA in our system
- Each spam campaign
  - Similarity in the content (but with small variations, i.e. noise added to avoid content-based filtering)
  - Use LSA to do noise reduction
- Each campaign corresponds to one document vector in the concept space (v_sc)
- When the mail server receives a new email, it translates the content into another vector (v) in the concept space
- Compute the similarity between the two vectors (cosine)
- If they are similar enough, the email is likely to come from the same campaign
- Since machine learning is already used for spam filtering and the transformation matrix can be pre-computed, the overhead is acceptable
15
Put it together
- The open relay server constructs the document matrix based on all spam emails it has received
- Apply SVD to get U, Σ, V and select an appropriate k
- Separate the different campaigns and find the document vectors of the spam emails within each campaign in the concept space (they should be quite similar to each other, with small variations)
- Generate a vector (v_sc) for this specific campaign (a simple way is to take the average of all its vectors)
- The signature for the spam content of this campaign consists of v_sc, Σ_k, and the term vectors
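A minimal sketch of the server-side steps, using the simple averaging approach mentioned above. How campaigns are separated is not specified in the slides, so the campaign label of each email is assumed to be given. The function name, toy matrix, and labels are hypothetical.

```python
import numpy as np


def campaign_signatures(X, labels, k):
    """Return (U_k, s_k, centroids) for a term-document matrix X.

    labels[j] names the campaign of document j (assumed known in advance).
    centroids maps each campaign name to the mean concept-space vector v_sc.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k, s_k = U[:, :k], s[:k]
    docs_k = Vt[:k, :].T                  # one concept-space row per document
    labels = np.asarray(labels)
    centroids = {
        str(name): docs_k[labels == name].mean(axis=0)
        for name in np.unique(labels)
    }
    return U_k, s_k, centroids


# Toy term-document matrix: 4 terms (rows) x 3 spam emails (columns).
X = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.9, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.1, 1.0],
])
U_k, s_k, centroids = campaign_signatures(X, labels=["a", "a", "b"], k=2)
```

The term vectors `U_k` and singular values `s_k` are shared by all campaigns, so only the small per-campaign vector differs from one signature to the next.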
16
Put it together
- Generate one signature for each campaign
- The signatures are put in the TXT record for the client to query
- When it receives a new email, the client computes the similarity using the algorithm described above and decides whether to mark the email as spam
Advantages of this approach
- A flexible DNSBL that accounts for grey IPs
- A centralized server allows information aggregation, so the learning performs better than the traditional method in which each host learns by itself
- Data received at the open relay exhibits more IP and content locality, which facilitates clustering
- All data received by the open relay is spam, bypassing the need for learning and classification
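One practical constraint when placing signatures in a TXT record is that each character-string inside a TXT record is limited to 255 bytes, so a long vector has to be split. The encoding below (little-endian float32 values, base64 text, 255-character chunks) is an assumption made for illustration. The slides do not define a wire format.

```python
import base64
import struct

MAX_TXT_STRING = 255  # size limit of a single TXT character-string


def encode_signature(vector):
    """Pack floats as float32, base64-encode, and split into TXT-sized strings."""
    raw = struct.pack(f"<{len(vector)}f", *vector)
    text = base64.b64encode(raw).decode("ascii")
    return [text[i:i + MAX_TXT_STRING] for i in range(0, len(text), MAX_TXT_STRING)]


def decode_signature(chunks):
    """Inverse of encode_signature: rejoin the strings and unpack the floats."""
    raw = base64.b64decode("".join(chunks))
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


# Hypothetical campaign vector; values are exactly representable in float32.
chunks = encode_signature([0.5, -0.25, 1.0])
restored = decode_signature(chunks)
```

Float32 halves the size compared with float64 at the cost of precision, which seems acceptable for a cosine-similarity threshold but would need to be confirmed against real signatures.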
17
Problems
- The SVD needs to be re-computed when new spam arrives
  - Incremental computation of the SVD?
- How to update the parameters on the client
  - How to avoid sending repeated signatures to the client?
  - Let the client indicate whether it runs the new system, as well as the version number of its current signatures?
- One open relay is only one vantage point
  - Probably need a distributed system in which multiple open relays cooperate
18
Problem (cont.)
- IPs that use an open relay to send spam are likely to use other open relays rather than connect directly to mail servers. In that case, blacklisting these IPs might not be very useful.
  - Combine with traditional methods?
- The Chinese language doesn't have a clear separation between words