Design open relay based DNS blacklist system


Problem
- Current DNSBLs offer only a binary blacklist: an IP is either listed or not listed
- Unable to handle grey IPs (IPs that send both spam and legitimate email)
- Not real-time (delay in blacklisting IPs)
- Rely on user reports and other ad hoc methods

Goal
- Improve efficiency based on open relay data
- Offer a more flexible blacklisting method that allows classification of previously unseen IPs
- Provide a systematic way to exploit IP locality, content similarity, and other features of open relay data
- Maintain backward compatibility with existing systems

Advantages of using open relay data
- All emails received at an open relay are guaranteed to be spam
  - Avoids the inaccuracy inherent in learning and classification
  - Traditional learning methods used in spam detection (e.g., SpamAssassin) can suffer high false-positive rates
- Emails similar to those received at an open relay are likely to be spam
- An open relay can observe higher IP locality and content similarity within each spam campaign
  - Spammers use open relays to target different domains within the same campaign
  - Multiple bots involved in the same campaign may also use the same open relay

Design of the system
Current DNSBL protocol (kept for backward compatibility):
1. Take the client's IP address, reverse the bytes, and append the DNSBL's domain name: 192.168.42.23 becomes 23.42.168.192.dnsbl.example.net.
2. Look up this name in the DNS as a domain name ("A" record). This returns either an address, indicating that the client is listed, or an "NXDOMAIN" ("No such domain") code, indicating that it is not.
3. Optionally, if the client is listed, look up the name as a text record ("TXT" record). Most DNSBLs publish information about why a client is listed in TXT records.
Backward compatibility: we reuse the TXT record to return our "signatures".
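The three-step lookup above can be sketched in Python. The zone name is the slide's example; `is_listed` performs a real DNS lookup and so needs network access, which makes it illustrative only:

```python
import socket

def dnsbl_name(ip, zone="dnsbl.example.net"):
    """Reverse the IPv4 octets and append the blacklist zone,
    e.g. 192.168.42.23 -> 23.42.168.192.dnsbl.example.net."""
    octets = ip.split(".")
    return ".".join(reversed(octets)) + "." + zone

def is_listed(ip, zone="dnsbl.example.net"):
    """A-record lookup: any answer means listed; NXDOMAIN means not listed."""
    try:
        socket.gethostbyname(dnsbl_name(ip, zone))
        return True
    except socket.gaierror:  # NXDOMAIN (or other resolution failure)
        return False
```

A client that gets a positive answer would then fetch the TXT record for the same name to retrieve the signatures.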

Returning signatures for a DNSBL query
Two cases:
- The queried IP is listed in the blacklist
- The queried IP is not listed in the blacklist
Two types of signature:
- IP signature: the set of IPs known to be bad
  - Use a Bloom filter for efficiency, i.e., to eliminate the need for repeated queries (at the cost of false positives?)
- Content signature: URLs and/or mail content
  - For matching emails within the same spam campaign, which often share great similarity in content
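A minimal sketch of the Bloom filter used for the IP signature. The sizes (`m` bits, `k` hash functions) are made-up illustration values; a deployed filter would derive them from the expected number of IPs and the target false-positive rate:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for the IP signature (illustrative sketch)."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits[p] for p in self._positions(item))
```

This is why the slide's parenthetical matters: membership answers can be wrong in the "listed" direction, and the rate grows with the number of inserted IPs.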

Details: the queried IP is listed in the blacklist
- Return an A record indicating that the IP is blacklisted
  - The A record can carry a special IP address telling the client that more information is available in the TXT record
- Return a TXT record containing:
  - A Bloom filter of all IPs observed at the open relay (IPs in the same campaign as the queried IP, or all IPs that connected to the open relay?)
    - The false-positive rate increases as the Bloom filter holds more elements
  - A TTL for the Bloom filter (some IPs may be removed from the list, so the filter must be updated or discarded when it is too out of date)
  - Content signatures for the spam emails in the same campaign as the queried IP

Details: the queried IP is NOT listed in the blacklist
- Do not return an A record, for compatibility
- Allow the client to query the TXT record, which returns:
  - A Bloom filter of all IPs observed at the open relay, with its TTL
  - Content signatures:
    - for the most popular spam campaigns
    - for campaigns observed from IPs close to the queried IP (IP locality)
  - A score for the IP (as proposed in Pathak's email)

Content signatures
Goal: allow the mail server to stop spam from the same campaign, whose messages share content similarity.
URLs?
- How to handle URL redirection and obfuscation
Mail body
- Contains noise and obfuscation
- Use latent semantic analysis (LSA) for noise reduction
- A signature takes the form of a document vector in the concept space
- Similarity is computed as the cosine between document vectors in the concept space

Latent Semantic Analysis
A method to summarize the semantics of a text corpus conceptually: documents are mapped into a concept space so that they can be correlated by conceptual meaning, which gives better accuracy than literal term matching.
Applications:
- Compare documents in the concept space
- Document classification
- Find similar documents across languages
- Find relations between terms
- Given a query of terms, translate it into the concept space and find matching documents
- Capable of simulating a variety of human cognitive phenomena
- Widely used in search engines to find documents with similar concepts (latent semantic indexing)

Latent Semantic Analysis
- Term-document matrix: m terms and n documents give an m x n sparse matrix
- The elements of the matrix are weighted by tf-idf (term frequency-inverse document frequency)
- LSA transforms the occurrence matrix into the concept space by SVD (singular value decomposition): X = U Σ Vᵀ

LSA (cont.)
- U and V are orthonormal matrices; Σ is a diagonal matrix whose entries are the singular values
- Choose the k largest singular values and their corresponding singular vectors
- This yields the rank-k approximation to X with the smallest error
- It translates the term and document vectors into a k-dimensional concept space
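The truncation step can be sketched with numpy's SVD. The 4 x 3 matrix below is a toy stand-in for a real tf-idf term-document matrix:

```python
import numpy as np

def lsa_concept_space(X, k):
    """Truncated SVD X ~ Uk Sk Vk^T: keep the k largest singular values.
    The rows of (Sk Vk^T)^T are the document vectors in the
    k-dimensional concept space."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
    doc_vectors = (np.diag(sk) @ Vtk).T  # one row per document
    return Uk, sk, doc_vectors

# Toy 4-term x 3-document occurrence matrix (tf-idf weights in practice).
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
Uk, sk, docs = lsa_concept_space(X, k=2)
```

`Uk @ docs.T` reconstructs the rank-k approximation, which is the closest rank-k matrix to X in Frobenius norm.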

Rank lowering (dimensionality reduction)
Find a low-rank approximation to the term-document matrix:
- Reduces computing cost
- Reduces noise: the original matrix may be noisy, and the approximation can be interpreted as a de-noised matrix
- Merges dimensions associated with terms that have similar meanings

Document comparison
Problem: given a new document, compare it to the existing documents in the concept space and find the similarity.
1. Translate the new document into the concept space using the same transformation; the resulting vector gives the relation between the new document and each concept.
2. Compute the cosine similarity: sim(di, dj) = di . dj / (|di| |dj|)
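The two steps above can be sketched as follows; the matrix and the new document are toy data, and the fold-in formula q_hat = Sk⁻¹ Ukᵀ q is the standard LSA way to apply "the same transformation" to unseen documents:

```python
import numpy as np

# Toy 4-term x 3-document matrix; build a rank-2 concept space.
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Uk, sk = U[:, :2], s[:2]

def fold_in(term_vector):
    # Step 1: translate a document into the concept space,
    # q_hat = Sk^-1 Uk^T q.
    return np.diag(1.0 / sk) @ Uk.T @ term_vector

def cosine(a, b):
    # Step 2: cosine similarity di.dj / (|di| |dj|).
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

new_doc = np.array([1.0, 1.0, 0.0, 0.0])  # same terms as document 0
q_hat = fold_in(new_doc)
sims = [cosine(q_hat, fold_in(X[:, j])) for j in range(3)]
```

The new document uses exactly the terms of document 0, so its concept-space vector is closest to document 0's.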

Use of LSA in our system
- Emails in a spam campaign share content similarity but contain small variations (noise added to evade content-based filtering)
- Use LSA for noise reduction
- Each campaign corresponds to one document vector in the concept space (v_sc)
- When the mail server receives a new email, it translates the email content into another vector (v_email) in the concept space
- Compute the similarity between the two vectors (cosine); if they are similar enough, the email is likely to come from the same campaign
- Since machine learning is already used for spam filtering, and the transformation matrix can be precomputed, the overhead is acceptable

Putting it together (server side)
- The open relay server constructs the term-document matrix from all spam emails it receives
- Apply SVD to get U, Σ, V and select an appropriate k
- Separate the different campaigns and find the concept-space document vectors of the spam emails within each campaign (they should be quite similar to each other, with small variations)
- Generate a vector (v_sc) for each specific campaign (a simple way is to take the average of all its vectors)
- The signature for the spam content of this campaign consists of: v_sc, , and the term vectors
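The per-campaign averaging step is a one-liner; the 2-d concept-space vectors below are made-up stand-ins for the document vectors of one campaign's emails:

```python
import numpy as np

def campaign_signature(doc_vectors):
    """Average the concept-space vectors of one campaign's emails to
    form the campaign signature v_sc (the 'simple way' on the slide)."""
    return np.mean(doc_vectors, axis=0)

# Three noisy emails from the same campaign, already folded into a
# 2-dimensional concept space.
campaign = np.array([[0.9, 0.1],
                     [1.0, 0.0],
                     [1.1, -0.1]])
v_sc = campaign_signature(campaign)
```

Averaging cancels the per-email noise, which is exactly the variation spammers inject to evade content-based filtering.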

Putting it together (client side)
- Generate one signature per campaign; the signatures are placed in the TXT record for clients to query
- When a client receives a new email, it computes the similarity using the algorithm above and decides whether to mark the email as spam
Advantages of this approach:
- A flexible DNSBL that accounts for grey IPs
- A centralized server allows information aggregation, so learning performs better than in the traditional setting where each host learns by itself
- Data received at an open relay exhibits more IP and content locality, which facilitates clustering
- All data received by an open relay is spam, bypassing the need for learning and classification

Problems
- The SVD must be recomputed when new spam arrives
  - Incremental computation of the SVD?
- How to push updated parameters to clients, and how to avoid sending repeated signatures?
  - Let the client indicate whether it runs the new system, along with the version number of its current signatures?
- One open relay is only one vantage point
  - We probably need a distributed system in which multiple open relays cooperate

Problems (cont.)
- IPs that send spam through an open relay are likely to use other open relays rather than connect directly to mail servers, so blacklisting these IPs might not be very useful. Combine with traditional methods?
- Chinese text has no clear separation between words, which complicates building the term-document matrix