Internal Presentation by: Lei Wang, Pervasive and Artificial Intelligence research group. On: An Artificial Immune System for E-mail Classification.

Internal Presentation by: Lei Wang, Pervasive and Artificial Intelligence research group
On: An Artificial Immune System for E-mail Classification
Andy Secker, Alex Freitas, Jon Timmis
Computing Laboratory, University of Kent, Canterbury, Kent, UK
19/02/2004


Significance
• With the increase in information on the Internet, the drive to find more effective tools for distinguishing between interesting and non-interesting material is increasing.
• This paper provides an immune-inspired algorithm called AISEC that is capable of continuously classifying electronic mail as interesting or non-interesting without the need for re-training.
• Compared with a naïve Bayesian classifier, the proposed system performs as well and has great potential for augmentation.

AISEC, immunity-inspired system
• Immune system
  - The human body is constantly under attack; the immune system must adapt and respond
  - The (natural) immune system is:
    1. Dynamic
    2. Adaptive
    3. Robust
    4. Etc.
• Artificial Immune Systems (AIS) use principles and processes from observed and theoretical immunology to solve problems

Artificial Immune Systems
• Engineering framework
  - Representation: of individual immune cells
  - Affinity measures: evaluate the interaction of individuals with the environment and/or each other
  - Algorithms: procedures of adaptation that manipulate populations of immune cells
• AIS as a classifier
  - AIRS: a successful supervised AIS algorithm for classification

AIS for Web Mining
• Web mining is an umbrella term used to describe three quite different types of data mining:
  - Content mining: a process of extracting useful information from the text, images and other forms of content that make up the pages. The mining of textual data is a common task, often for the purposes of information retrieval.
  - Usage mining
  - Structure mining
• AISEC research goal
  - To develop a highly adaptive system capable of retrieving interesting information from the Internet based on the user's current interests
  - The authors believe AIS may offer a number of advantages

What is AISEC?
• AISEC isn't a spam filter
  - It has no methods to penalize false positives (loss of important e-mail)
  - Without a very low false positive rate, a spam filter would not be trusted

What is AISEC? (cont.)
• AISEC is:
  - A first step towards an AIS for web mining
  - A study of the performance and characteristics of an AIS applied to text mining in a dynamic domain
  - A text classification algorithm capable of continuous adaptation, which may yield a classification accuracy comparable to a Bayesian approach
  - User behaviour and interaction with e-mail can be similar to that with web pages
  - A supervised classification algorithm: e-mail is classified as interesting or uninteresting, using constant(ish) feedback from the user
  - Capable of continuous adaptation: this tracks concept drift and can also handle concept shift
  - A specialised AIS algorithm based in part on the immune principle of clonal selection; no previously documented algorithm was suited for use in this situation without extensive changes

Representation
• Each cell contains 3 sets of words (+ state)
  - Punctuation is removed from fields
  - Research literature has suggested header information is enough to accurately classify e-mail *
• A = [subject, sender, return address]
  - Subject field: title of the e-mail
  - Sender field: sender's name
  - Return field: sender's address
* Diao, Lu & Wu (2000). A Comparative Study of Classification Based Personal E-mail Filtering, PAKDD 2000
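The cell representation described above could be sketched roughly as follows. This is only an illustration: the names `Cell`, `tokenise` and `cell_from_headers`, the lower-casing of words, and the choice to turn punctuation into word breaks (so that an address splits into separate words) are assumptions, not details taken from the slides.

```python
import string
from dataclasses import dataclass

# Assumption: punctuation is replaced by spaces so that an address such as
# "sales@dvd.com" yields the separate words "sales", "dvd" and "com".
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def tokenise(header_field: str) -> list[str]:
    """Lower-case a header field, remove punctuation and split into words."""
    return header_field.lower().translate(_PUNCT_TO_SPACE).split()


@dataclass
class Cell:
    """One immune cell: three word collections plus a state flag."""
    subject: list[str]
    sender: list[str]
    return_address: list[str]
    state: str = "naive"  # hypothetical labels: "naive" or "memory"


def cell_from_headers(subject: str, sender: str, return_address: str) -> Cell:
    """Build a cell from the three header fields of one e-mail."""
    return Cell(tokenise(subject), tokenise(sender), tokenise(return_address))
```

For instance, `cell_from_headers("Free DVD!", "DVD Sales", "sales@dvd.com")` would produce a naive cell whose three fields hold the cleaned words of each header.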

Affinity
• Affinity value is the proportion of words in one cell found in another
  - More features would require a less naïve distance measure; cosine distance is an obvious choice
  - Resultant value is always between 0 and 1
• Example: affinity(A, B) = 4/9, i.e. 4 of the 9 words in the shorter cell also appear in the longer one

PROCEDURE affinity(bc1, bc2)
    IF (bc1 has a shorter feature vector than bc2)
        bshort ← bc1, blong ← bc2
    ELSE
        bshort ← bc2, blong ← bc1
    count ← the number of words in bshort present in blong
    bs_len ← the length of bshort's feature vector
    RETURN count / bs_len
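The affinity procedure can be translated into Python fairly directly. The sketch below assumes a cell's feature vector is a flat list of words; returning 0.0 for an empty vector is an added guard against division by zero and is not part of the slide's pseudocode.

```python
def affinity(bc1: list[str], bc2: list[str]) -> float:
    """Proportion of words in the shorter feature vector found in the longer one."""
    # Pick the shorter and longer vector, as in the pseudocode.
    if len(bc1) < len(bc2):
        bshort, blong = bc1, bc2
    else:
        bshort, blong = bc2, bc1
    if not bshort:
        return 0.0  # assumption: an empty cell matches nothing
    long_words = set(blong)
    count = sum(1 for word in bshort if word in long_words)
    return count / len(bshort)


# Hypothetical word lists: 9 words in the shorter cell, 4 of them shared.
cell_a = ["w1", "w2", "w3", "w4", "w5", "w6", "w7", "w8", "w9"]
cell_b = ["w1", "w2", "w3", "w4", "x1", "x2", "x3", "x4", "x5", "x6"]
example = affinity(cell_a, cell_b)  # 4 shared words out of 9
```

Because the denominator is always the length of the shorter vector, the measure is symmetric in its arguments and stays within [0, 1].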

Clone-Mutation
• One mutation takes a word previously used in a subject or address and uses it to replace the word at a single location
  - Subject, sender and return address libraries are kept separately
  - Usually more than one mutation per cell takes place
• Example gene libraries:
  SubjectLib = free, DVD
  SenderLib = sales, DVD, com
  ReturnLib = sales, DVD, com

PROCEDURE clone_mutate(bc1, bc2)
    aff ← affinity(bc1, bc2)
    clones ← ∅
    num_clones ← ⌊aff * Kl⌋
    num_mutate ← ⌊(1 - aff) * bc1's feature vector length * Km⌋
    DO (num_clones) TIMES
        bcx ← a copy of bc1
        DO (num_mutate) TIMES
            p ← a random point in bcx's feature vector
            w ← a random word from the appropriate gene library
            replace word in bcx's feature vector at location p with w
        bcx's stimulation level ← Ksb
        clones ← clones ∪ {bcx}
    RETURN clones
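A minimal Python sketch of the clone-and-mutate step follows. Several details are assumptions made for illustration: the bars around the clone and mutation counts are read as rounding down to an integer, a cell is stored as a dictionary from field name to word list, the gene libraries are non-empty, and the constants passed in the example (`k_l`, `k_m`, `k_sb`) are arbitrary values rather than the authors' settings.

```python
import math
import random
from dataclasses import dataclass

FIELDS = ("subject", "sender", "return_address")


@dataclass
class Cell:
    words: dict  # field name -> list of words
    stimulation: float = 0.0

    def vector(self) -> list:
        """Flatten the three fields into one feature vector."""
        return [w for f in FIELDS for w in self.words[f]]


def affinity(c1: Cell, c2: Cell) -> float:
    v1, v2 = c1.vector(), c2.vector()
    short, long_ = (v1, v2) if len(v1) < len(v2) else (v2, v1)
    if not short:
        return 0.0
    present = set(long_)
    return sum(w in present for w in short) / len(short)


def clone_mutate(bc1, bc2, libraries, k_l, k_m, k_sb, rng):
    """Clone bc1 in proportion to its affinity with bc2; mutate inversely."""
    aff = affinity(bc1, bc2)
    num_clones = math.floor(aff * k_l)
    num_mutate = math.floor((1 - aff) * len(bc1.vector()) * k_m)
    clones = []
    for _ in range(num_clones):
        bcx = Cell({f: list(bc1.words[f]) for f in FIELDS}, stimulation=k_sb)
        # Every (field, index) pair is one point in the feature vector.
        points = [(f, i) for f in FIELDS for i in range(len(bcx.words[f]))]
        for _ in range(num_mutate):
            f, i = rng.choice(points)
            # Replace the word with one drawn from that field's own library.
            bcx.words[f][i] = rng.choice(libraries[f])
        clones.append(bcx)
    return clones


libraries = {
    "subject": ["free", "dvd"],
    "sender": ["sales", "dvd", "com"],
    "return_address": ["sales", "dvd", "com"],
}
bc1 = Cell({"subject": ["free", "dvd"], "sender": ["sales"],
            "return_address": ["dvd", "com"]})
bc2 = Cell({"subject": ["free", "offer", "now"], "sender": ["sales", "team"],
            "return_address": ["x"]})
clones = clone_mutate(bc1, bc2, libraries, k_l=10, k_m=0.5, k_sb=1.0,
                      rng=random.Random(0))
```

Passing the random generator in explicitly keeps the sketch reproducible; a high affinity yields many near-identical clones, while a low affinity yields few clones that differ more from the parent.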

The algorithm – classification
1. System is initialised with known uninteresting e-mail
2. E-mail is presented for classification. It is classified as uninteresting because it stimulates nearby cells
(Figure legend: memory cells, naive cells)

The algorithm – correct classification
3. Highly stimulated cell reproduces 7 times. Less stimulated cell produces only 2 clones but with a higher mutation rate
4. The cell with the highest affinity is known to be useful and is therefore rewarded by becoming a memory cell
(Figure labels: classification region, stimulation region)

The algorithm cont…
• Cell removal
  - Aged naïve cells are deleted
  - Memory cells placed in already covered areas are also deleted
• Incorrect classification
  5. Any cell responsible for an incorrect classification is removed (memory or otherwise)
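The classification step and the removal of cells after a wrong classification can be sketched as below. This is a simplified illustration rather than the authors' implementation: cells are plain word lists, the classification threshold `k_c` and its strict comparison are assumptions, and cloning, ageing and memory-cell promotion are left out.

```python
def affinity(v1, v2):
    """Proportion of words in the shorter vector found in the longer one."""
    short, long_ = (v1, v2) if len(v1) < len(v2) else (v2, v1)
    if not short:
        return 0.0
    present = set(long_)
    return sum(w in present for w in short) / len(short)


def classify(email_words, cells, k_c):
    """Label an e-mail and report which cells were responsible.

    The cells all describe uninteresting e-mail, so an e-mail that falls
    inside the classification region of any cell is labelled uninteresting.
    """
    responsible = [c for c in cells if affinity(c, email_words) > k_c]
    label = "uninteresting" if responsible else "interesting"
    return label, responsible


def remove_responsible(cells, responsible):
    """Drop every cell that took part in an incorrect classification."""
    return [c for c in cells if all(c is not r for r in responsible)]


# Hypothetical population of two cells and one incoming e-mail.
cells = [["free", "dvd", "sales"], ["meeting", "agenda"]]
label, responsible = classify(["free", "dvd", "offer", "today"], cells, k_c=0.5)

# If the user reports that this e-mail was in fact interesting,
# the cells that caused the wrong label are removed.
cells_after_feedback = remove_responsible(cells, responsible)
```

Comparing cells by identity rather than equality means that two cells with identical words are still treated as separate individuals when one of them is removed.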

Results – Classification accuracy
• E-mails (742 uninteresting) received over 6 months
• E-mails presented in the order of date received
• Feedback given after EVERY classification
• AISEC run 10 times, results show mean
• C5.0, neural network and C&R tree all run in the "Clementine" data mining package
• Bayesian algorithm used feedback to update, like AISEC

  Algorithm          Accuracy
  Traditional learning
    C5.0             [value missing]%
    Naïve Bayesian   85.0%
    Neural network   85.6%
    AISEC            86.0% ± 1.29
    C&R tree         87.7%
  Continuous learning
    Naïve Bayesian   88.05%
    AISEC            89.09% ± 0.97

Results – variation of population size (figure)

User point of view
• AISEC runs as a proxy on the local machine
• Advantages
  - No need to switch e-mail client
  - Can collect mail from multiple locations
• AISEC's user interface would require minimal interaction

User point of view (diagram)
Diagram labels: local machine, server(s), collect mail, AISEC, classifier, store, interesting, uninteresting, user interaction, positive user response, negative user response, return mail, e-mail client

Results cont…
• Standard measures of quality
  - Precision is the proportion of documents classified as positive that are actually positive
  - Recall is the proportion of actually positive documents that are classified as positive

                    Precision             Recall
  Naïve Bayesian    93.93%                67.76%
  AISEC             82.20% ± [missing]    [missing]% ± 4.71
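Using the standard information-retrieval definitions, both measures can be computed from three counts. The function name and the counts in the example are hypothetical and are not taken from the experiments; returning 0.0 when a denominator is zero is an added convention.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int):
    """Return (precision, recall) from confusion-matrix counts.

    precision = TP / (TP + FP): how many documents labelled positive really are
    recall    = TP / (TP + FN): how many truly positive documents were found
    """
    labelled_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / labelled_pos if labelled_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Hypothetical counts: 8 correct positives, 2 false alarms, 4 misses.
p, r = precision_recall(true_pos=8, false_pos=2, false_neg=4)
```

A classifier that rarely labels anything positive can reach high precision with low recall, which is the pattern the naïve Bayesian row of the table shows.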

Results – variation of time between user feedback (figure)

Conclusion
• AISEC has produced promising results and appears robust
  - Interesting note: typical accuracy is similar to published results from other AIS for text classification (both traditional and continuous learning)
  - Use a larger training set and optimise (the many) parameters
  - Detect when the optimum number of cells has been reached
• AISEC has been useful in providing some evidence that applying AIS to this domain is feasible
• Research on adaptive systems for retrieval of interesting information, not necessarily purely accurate information

Questions & Discussions