Artificial Immune based Approach to Association Rule Mining By: B. Hoda Helmi Supervisor: Adel T. Rahmani January 2008 A Thesis Submitted in Partial Fulfillment of the Requirement for the Degree of Master of Science in Artificial Intelligence-Computer Engineering 1
Outline: The Immune System (Natural and Artificial); Association Rules; Web Usage Mining; Proposed Algorithm (AISWUM); Results and Conclusion 2
Natural Immune System Immune System: a system that protects the body from foreign substances and pathogenic organisms. Antibody: the immune system creates antibodies, which match the antigens and cause the pathogens to be destroyed. Antigen: substances capable of starting a specific immune response (viruses, bacteria, fungi) are referred to as antigens. 3
A High Level Overview 4
Natural Immune System Immunity: Innate, Danger Theory, Adaptive, Clonal Selection, Network Theory, Affinity Maturation, Hypermutation 5
Innate versus Adaptive IS Innate: immediately available for combat. Adaptive: antibody production specific to a particular infectious agent. 6
Adaptive Immunity epitope Low affinity receptor structurally similar – high affinity 7
Clonal Selection & Affinity Maturation 8
Network Theory Ag Stimulation (Positive Response) Suppression (Negative Response) Idiotypic network (Jerne, 1974): B cells stimulate each other. Creates an immunological memory 9
Danger Theory 10
Artificial Immune System A Framework for AIS: Representation, Affinity, Algorithms 11
Association Rules Set of items: $I=\{I_1, I_2, \dots, I_m\}$. Transactions: $D=\{t_1, t_2, \dots, t_n\}$, $t_j \subseteq I$. Itemset: $\{I_{i1}, I_{i2}, \dots, I_{ik}\} \subseteq I$. Large (frequent) itemset: an itemset whose number of occurrences is above a threshold. Support of an itemset: percentage of transactions that contain that itemset. 12
Given: a set of items $I=\{I_1, I_2, \dots, I_m\}$ and a database of transactions $D=\{t_1, t_2, \dots, t_n\}$, where $t_i=\{I_{i1}, I_{i2}, \dots, I_{ik}\}$ and $I_{ij} \in I$, the Association Rule Problem is to identify all association rules $X \Rightarrow Y$ with a minimum support and confidence. 13
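The support and confidence thresholds in the problem statement can be made concrete with a small sketch. The function names and the toy transactions below are illustrative assumptions, not part of the thesis; confidence is computed in the standard way as support(X ∪ Y) / support(X).

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """Confidence of the rule X => Y: support(X u Y) / support(X)."""
    support_x = support(x, transactions)
    if support_x == 0:
        return 0.0
    return support(frozenset(x) | frozenset(y), transactions) / support_x


# Hypothetical toy database: each transaction is the set of URLs in one visit.
D = [
    frozenset({"/", "/categories"}),
    frozenset({"/", "/features/"}),
    frozenset({"/", "/categories", "/features/"}),
    frozenset({"/categories"}),
]

sup = support({"/", "/categories"}, D)
conf = confidence({"/"}, {"/categories"}, D)
```

Here two of the four transactions contain both `/` and `/categories`, and three contain `/`, so the rule `{/} => {/categories}` has support 2/4 and confidence 2/3.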
Association Rule Mining Steps (1) Find frequent itemsets (the challenging step in association rule mining). (2) Generate rules from frequent itemsets. 14
Goal In this project our goal is to find all the frequent itemsets in Web usage data using an artificial immune system. 15
Web Usage Mining Web usage mining, also known as Web log mining, applies mining techniques to discover interesting usage patterns from the secondary data derived from the interactions of users while surfing the Web. 16
Web Usage Mining Applications Target potential customers for electronic commerce; enhance the quality and delivery of Internet information services to the end user; improve Web server system performance; identify potential prime advertisement locations; facilitate personalization/adaptive sites; improve site design; fraud/intrusion detection; predict user's actions (allows prefetching). 17
Motivations (of choosing this application) 18
WUM-Definitions Web logs: the set of all accessed URLs of a Web site, stored on the Web server. Session (itemset): a sequence of URLs accessed by a user in one visit to the Web site. Strong trend (frequent itemset): crowded paths that are frequently traversed by users. 19
Web Log
O: || T:1997/09/12-22:43:00 || U:/ || R:
O: || T:1997/09/12-22:50:27 || U:/categories/software/ || R:
O: || T:1997/09/12-22:50:38 || U:/categories/software/Windows/ || R:
O: || T:1997/09/12-22:50:47 || U:/categories/software/Windows/V909V03.TXT || R: ows/
O: || T:1997/09/12-22:51:06 || U:/categories/software/Windows/ || R:
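A minimal sketch of how one record of this log could be split into fields. It assumes that records are `||`-separated `KEY:value` pairs, and that `T` is a timestamp, `U` the requested URL, `R` the referrer and `O` the (blanked) origin; the slide does not define these letters, so the meanings and the helper name `parse_record` are assumptions.

```python
from datetime import datetime


def parse_record(line):
    """Split one 'O: || T:... || U:... || R:...' record into a dict."""
    fields = {}
    for part in line.split("||"):
        # Split on the first colon only, so URLs and times keep their colons.
        key, _, value = part.strip().partition(":")
        fields[key.strip()] = value.strip()
    # Convert the timestamp using the layout seen in the sample records.
    fields["T"] = datetime.strptime(fields["T"], "%Y/%m/%d-%H:%M:%S")
    return fields


record = parse_record(
    "O: || T:1997/09/12-22:50:27 || U:/categories/software/ || R:"
)
```

Sorting the parsed records by origin and timestamp would be a natural next step toward grouping requests into sessions.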
Session Construction
URL                                         ID
/                                           0
/categories/software/                       1
/categories/software/Windows/               2
/categories/software/Windows/V909V03.TXT    3
/categories                                 4
/manufacturers                              5
/samples.html/                              6
/gearlists/                                 7
/features/                                  8
/ecards/
Session columns: Duration, Frequency
Representation Antibody (strong trend): [URL1 (0/1), URL2 (0/1), …, URLm (0/1)]. Antigen (incoming session): [URL1 (0/1), URL2 (0/1), …, URLm (0/1)]. 22
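The binary representation can be sketched as follows. The URL-to-ID mapping reuses a few entries from the session construction slide; the function name `encode_session` is a hypothetical helper, and ignoring URLs that are missing from the index is an assumption of this sketch.

```python
def encode_session(session_urls, url_index):
    """Return a 0/1 vector with one position per known URL."""
    vector = [0] * len(url_index)
    for url in session_urls:
        if url in url_index:  # unknown URLs are skipped (assumption)
            vector[url_index[url]] = 1
    return vector


url_index = {
    "/": 0,
    "/categories/software/": 1,
    "/categories/software/Windows/": 2,
    "/categories/software/Windows/V909V03.TXT": 3,
}

# A visit that touches the root and the software category (root twice).
antigen = encode_session(["/", "/categories/software/", "/"], url_index)
```

Antibodies and antigens share this layout, so any vector distance can be used to compare a learned trend with an incoming session.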
Scenario (1) An antigen enters the body. (2) Determine whether the first signal is produced (two signals are needed for an antigen to trigger the AIS; the first signal is produced if the antigen is harmful to the body). (3) If the first signal is produced, present the antigen to the antibodies and compute distance, weight and influence zone. (4) Determine the antibody with maximum weight. (5) If the maximum weight > threshold, compute SL and IZ for that antibody; otherwise create a new antibody by duplication. (6) Clone and mutate. 23
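The control flow of this scenario can be sketched as below. This is an illustration under stated assumptions, not the thesis implementation: the danger test, the weight function and the threshold are passed in as parameters because their formulas are not given here, the stimulation update is a placeholder for the SL/IZ computation, and the clone-and-mutate step is omitted. The helpers `hamming`, `toy_weight` and `toy_danger` are hypothetical stand-ins.

```python
def present_antigen(antigen, antibodies, is_dangerous, weight, threshold):
    """Process one antigen; return which branch of the scenario was taken."""
    # First signal: only antigens judged harmful (valid) trigger a response.
    if not is_dangerous(antigen):
        return "discarded"
    # Present the antigen to every antibody and keep the strongest match.
    best, best_w = None, 0.0
    for ab in antibodies:
        w = weight(ab["profile"], antigen)
        if w > best_w:
            best, best_w = ab, w
    if best is not None and best_w > threshold:
        # Placeholder for the SL / IZ update of the winning antibody.
        best["stimulation"] += best_w
        return "stimulated"
    # No antibody responds strongly enough: duplicate the antigen.
    antibodies.append({"profile": list(antigen), "stimulation": 0.0})
    return "duplicated"


def hamming(a, b):
    """Number of positions where two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def toy_weight(profile, antigen):
    """Stand-in weight: 1 at distance 0, decreasing with distance."""
    return 1.0 / (1.0 + hamming(profile, antigen))


def toy_danger(antigen):
    """Stand-in danger signal: sessions with at least two URLs are valid."""
    return sum(antigen) >= 2
```

Starting from an empty network, the first valid session is duplicated into an antibody, and a later identical session stimulates that antibody instead of creating a new one.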
Danger Signal Danger Theory (two-signal approach): if the antigen is harmful, trigger an IS response; otherwise discard the antigen. In the data mining context, harmful corresponds to interesting (valid). What is the danger signal in our system? ◦ We need a measure that determines the validity of sessions. 24
Validity Measure 25
Validity Measure 26
Affinity Measure What affinity measure is used in our proposed algorithm? 27
Affinity Measure The weight function decreases with distance from the antigen/data location. A scale parameter controls the decay rate of the weights along the spatial dimensions. 28
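One function with the stated properties is a Gaussian-style kernel. The exact formula used by the algorithm is not reproduced in this text, so the form below, the name `weight` and the parameter `sigma` are assumptions chosen only to show how a scale parameter controls the decay.

```python
import math


def weight(distance, sigma):
    """Weight that equals 1 at distance 0 and decays as distance grows.

    `sigma` is the scale parameter: a larger value gives a slower decay.
    """
    return math.exp(-(distance ** 2) / (2.0 * sigma ** 2))


near = weight(1.0, sigma=1.0)
far = weight(2.0, sigma=1.0)
far_wide = weight(2.0, sigma=2.0)
```

With the same distance, the wider scale (`sigma=2.0`) assigns a larger weight than the narrow one, which is the sense in which the scale defines an antibody's zone of influence.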
Stimulation Level 29
Weighted Stimulation 30
Network Stimulation & Suppression 31
Cloning Antibodies are cloned in proportion to their stimulation levels relative to the average network stimulation. To avoid premature proliferation of antibodies and to encourage a diverse repertoire, new antibodies do not clone before they are mature (that is, before their age exceeds a threshold). 32
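The cloning rule can be sketched as follows. The proportionality constant `k_clone`, the maturity threshold `min_age`, the dictionary layout and the truncation to an integer are assumptions made for illustration; only the proportionality to relative stimulation and the age condition come from the slide.

```python
def clone_counts(antibodies, k_clone=1.0, min_age=3):
    """Number of clones per antibody, proportional to relative stimulation."""
    if not antibodies:
        return []
    average = sum(ab["stimulation"] for ab in antibodies) / len(antibodies)
    counts = []
    for ab in antibodies:
        # Immature antibodies (age not above the threshold) do not clone.
        if ab["age"] <= min_age or average <= 0:
            counts.append(0)
        else:
            counts.append(int(k_clone * ab["stimulation"] / average))
    return counts


network = [
    {"stimulation": 4.0, "age": 10},
    {"stimulation": 2.0, "age": 10},
    {"stimulation": 6.0, "age": 1},  # highly stimulated but still immature
]
counts = clone_counts(network, k_clone=2.0, min_age=3)
```

The third antibody has the highest stimulation but produces no clones because it has not yet matured.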
Hypermutation Somatic hypermutation is a powerful natural exploration mechanism in the IS that allows it to learn how to respond to new antigens that have never been seen before. It is, however, a very costly and inefficient operation, since its complexity is exponential in the number of features. We model this operation in the AIS by an instant antigen duplication whenever an antigen is encountered that fails to activate the entire immune network. 33
Directed Mutation Antibodies that are added to the population via mutation are always superior individuals. In this mutation mechanism, whenever the system realizes that there are not enough good antibodies to confront the antigens, new antibodies are added to the population. It is a new form of Danger Theory. The directed mutation mechanism is as follows: 34
Directed Mutation Web log In to the system 35
Decide to Mutate (after some time)
Mutation Occurs (after some time)
Directed Mutation Directed mutation is not computationally complex. It does not cause antibodies to be destroyed before they have to leave the population. It makes the system intelligent: the system can decide when to create new individuals. Directed mutation happens each time T antigens have entered the system. 45
Compression Compression: cluster the antibody population into k clusters. External interactions: those occurring between an antigen (external agent) and the antibodies in the immune network. Internal interactions: those occurring between one antibody and all other antibodies in the immune network. The most expensive computation and storage overhead stems from calculating and storing all the internal network interactions (quadratic complexity with respect to the network size). After compression: ◦ internal interactions: ◦ external interactions: k ◦ choosing an appropriate number of clusters 46
Algorithm Visualization
Pseudocode 71
Data Data set 1: one week of HTTP requests to the Music Machine Web site (Requests, Sessions, URLs). Data set 2: one week of HTTP requests to the University of Saskatchewan's WWW server (Requests, Sessions, URLs). 72
Ground Profiles To evaluate the learned profiles, it must be shown that they are good representatives of the input data (the summarization ability of AISWUM). Showing this requires comparing the distribution of the learned profiles with that of the input data, so ground profiles are needed. Ground profiles are extracted using Scalable K-Means. 73
Evaluation Metrics 74
Results (Music Machine) Distribution of the learned antibodies that are simultaneously precise and complete per input category at time t. 75
Precision Distribution of precise antibodies per input category at time t. 76
Coverage Distribution of complete antibodies per input category at time t. 77
Results (Saskatchewan University) Distribution of the learned antibodies that are simultaneously precise and complete per input category at time t. 78
Precision Distribution of precise antibodies per input category at time t. 79
Coverage Distribution of complete antibodies per input category at time t. 80
Evaluation Metrics Overall precision of the learned antibodies with respect to the input data: the ratio of learned antibodies that accurately represent the past input data to all learned antibodies. 81
Evaluation Metrics Overall coverage of the learned antibodies with respect to the input data: the ratio of past input data that are accurately summarized by antibodies to all input data. 82
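The two ratios can be sketched as below. What counts as "accurately represents" is not specified in this text, so the sketch takes a `matches` predicate as a parameter; the Jaccard test with a 0.5 cut-off and the toy sets are purely illustrative assumptions.

```python
def precision(antibodies, data, matches):
    """Share of antibodies that accurately represent some input item."""
    if not antibodies:
        return 0.0
    good = sum(1 for ab in antibodies if any(matches(ab, x) for x in data))
    return good / len(antibodies)


def coverage(antibodies, data, matches):
    """Share of input items accurately summarized by some antibody."""
    if not data:
        return 0.0
    covered = sum(1 for x in data if any(matches(ab, x) for ab in antibodies))
    return covered / len(data)


def jaccard_match(a, b, cutoff=0.5):
    """Hypothetical accuracy test: Jaccard similarity of two URL sets."""
    union = a | b
    return bool(union) and len(a & b) / len(union) >= cutoff


learned = [{"a", "b"}, {"x", "y"}]
sessions = [{"a", "b"}, {"a", "b", "c"}, {"d"}]
p = precision(learned, sessions, jaccard_match)
c = coverage(learned, sessions, jaccard_match)
```

In this toy case one of the two antibodies matches the data (precision 1/2) and two of the three sessions are summarized (coverage 2/3), which shows how the two measures can differ on the same network.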
Results (Music Machine) Precision: ratio of learned antibodies that accurately represent past input data to all learned antibodies. Coverage: ratio of past input data that are accurately summarized by antibodies to all input data. 83
Results (Saskatchewan) Precision: ratio of learned antibodies that accurately represent past input data to all learned antibodies. Coverage: ratio of past input data that are accurately summarized by antibodies to all input data. 84
Results
State      Maximum Contentment   Minimum Contentment   Average Contentment of 50 users
State 1    41%                   15%                   28%
State 2    60%                   40%                   51%
State 3    67%                   45%                   56%
Danger Theory, Weighted Items, Weighted Sessions:
State 1: No
State 2: Yes, No
State 3: Yes
Run Time Run time for one scan of the data with non-optimized C++ code on a Pentium 4 PC: ◦ First dataset: less than 6 min. ◦ Second dataset: less than 3 min. 86
Comparison with other methods
Methods (column order): AIS-WUM, SKM, DBSCAN, BIRCH, aiNet, Fuzzy AIS, SOSDM
Reliability/insensitivity to initial condition: Yes, No, Yes, No, Yes
Noise tolerance: Yes, No, Yes, No, Yes, Moderately
Need to scan before learning: No, Yes, No
Time complexity: O(N), O(N log(N)), O(N), O(N²), O(N)
Buffer data: No, Yes
Number of clusters specified: No, Yes, No, Yes, No, Yes
Handle evolving clusters: Yes, No, Yes
Automatic scale estimation: Yes, No, Yes, No
Clustering model: Network, Centroids, Medoids, Centroids, Network
Handle different similarity measures: Yes, No, Yes, No, Yes
Density/partition based: Density, Partition/Distance, Density, Partition, Partition/Distance, Density, Partition/Distance
Novelties of the proposed algorithm Low computational complexity. Danger Theory in two forms: directed mutation and weighted stimulation. Learning the data in a single pass: a natural mechanism, applicable to stream data. Bi-functionality: frequent itemset mining plus finding centroids of clusters in large datasets. Clear and fast identification of outliers. 88
Conclusion A robust and scalable algorithm for frequent itemset mining has been designed that is well suited to noisy, sparse data such as Web usage data. 89
Conclusion The ability of the proposed algorithm to learn in a single pass rests mainly on two things: the richness of the immune network structure, which forms a dynamic synopsis of the data, and danger theory, which decides which antigens are dangerous and when new antibodies are needed to combat antigens. 90
Publications
B. Hoda Helmi, Adel T. Rahmani, Nona Helmi, "An Evolutionary Control Model for a Generic Multiagent System Using Artificial Immune Systems", in Proceedings of the First Joint Congress on Fuzzy and Intelligent Systems, Ferdowsi University, 2007.
B. Hoda Helmi, Adel T. Rahmani, "Image Segmentation with a New Texture Feature Based on AIS", in Proceedings of the First Conference on Data Mining, AmirKabir University, Tehran, Iran, 2007. (in Farsi)
B. Hoda Helmi, Adel T. Rahmani, "An AIS Algorithm for Web Usage Mining with Directed Mutation", accepted in IEEE World Congress on Computational Intelligence, CEC division, Hong Kong, 2008.
B. Hoda Helmi, Adel T. Rahmani, "An Enhanced AIS for WUM, inspired by Danger Theory", submitted to ICEE 2008, Tarbiat Modarres University, Tehran, Iran, 2008. (in Farsi)
Adel T. Rahmani, B. Hoda Helmi, "EIN-WUM an AIS-based Algorithm for Web Usage Mining", submitted to Genetic and Evolutionary Computation Conference, Atlanta, Georgia, 2008.
B. Hoda Helmi, Adel T. Rahmani, "A New Web Usage Mining Method based on An Artificial Immune System Solution with Enhanced Network and Danger Theory", submitted to International Journal of Control, Automation, and Systems.
B. Hoda Helmi, Adel T. Rahmani, "Evolutionary based Combining of Evolved Neural Network Classifiers", accepted in IASTED International Conference on Signal Processing, Pattern Recognition and Applications, Austria, 2006. (unrelated)
The End. Thanks 93
Somatic Hypermutation 94
Cross Reaction 95
Artificial Immune System A Framework for AIS, Representation (Shape-Space): Binary, Integer, Real-valued, Symbolic 96
Artificial Immune System A Framework for AIS, Affinity: Euclidean, Manhattan, Hamming, … 97
Artificial Immune System A Framework for AIS, Algorithms: Bone Marrow, Clonal Selection, Negative Selection, Positive Selection, Immune Network 98
Sessions Session 1, Session 2, Session 3, Session 4, Session 5 (per-URL values not recoverable) 99
Affinity Measure 100
Precision versus Coverage 101