Oak Ridge National Laboratory, U.S. Department of Energy: A Distributed Agent Implementation of Multiple Species Flocking Model for Document Partitioning.

Presentation transcript:

Oak Ridge National Laboratory, U.S. Department of Energy
A Distributed Agent Implementation of Multiple Species Flocking Model for Document Partitioning Clustering
Xiaohui Cui, Ph.D. and Thomas E. Potok, Ph.D.
Applied Software Engineering Research Group
Oak Ridge National Laboratory

Outline
Introduction to dynamic information streams and their issues
Bio-inspired clustering
MSF clustering model based on bird flock collective behavior
TFIDF is not practical for dynamic data
MSF document clustering algorithm
Multi-agent document clustering implementation
Future work and conclusion

Text Challenge
Problem: How to effectively reduce the size of a large, streaming set of documents. "Give me the 10 documents that I need to read, out of the 1000 I received today."
Characteristics:
A steady flow of simple documents
Need to rapidly organize the documents into subsets
Select representative documents from the subsets

Approach
Use standard IR techniques to convert text to vectors
Use unsupervised learning/text clustering to organize the documents
Look for improvements in term weighting approaches

Standard Information Retrieval: Vector Space Model
Document 1: The Army needs sensor technology to help find improvised explosive devices.
Document 2: ORNL has developed sensor technology for homeland defense.
Document 3: Mitre has won a contract to develop homeland defense sensors for explosive devices.
Document 1 terms: Army, Sensor, Technology, Help, Find, Improvise, Explosive, Device
Document 2 terms: ORNL, develop, sensor, technology, homeland, defense
Document 3 terms: Mitre, won, contract, develop, homeland, defense, sensor, explosive, devices
Term list: Army, Sensor, Technology, Help, Find, Improvise, Explosive, Device, ORNL, develop, homeland, Defense, Mitre, won, contract
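The conversion from documents to vectors on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the lowercased, singular term spellings and the helper names `build_term_list` and `to_vector` are assumptions made for the example.

```python
from collections import Counter

# Stemmed term lists for the three example documents (terms taken from the slide,
# lowercased and singularized here for illustration).
docs = {
    "Document 1": ["army", "sensor", "technology", "help", "find",
                   "improvise", "explosive", "device"],
    "Document 2": ["ornl", "develop", "sensor", "technology",
                   "homeland", "defense"],
    "Document 3": ["mitre", "won", "contract", "develop", "homeland",
                   "defense", "sensor", "explosive", "device"],
}

def build_term_list(docs):
    """Collect every distinct term across all documents, in first-seen order."""
    terms = []
    for words in docs.values():
        for word in words:
            if word not in terms:
                terms.append(word)
    return terms

def to_vector(words, term_list):
    """Map a document to a vector of raw term counts over the shared term list."""
    counts = Counter(words)
    return [counts[term] for term in term_list]

term_list = build_term_list(docs)
vectors = {name: to_vector(words, term_list) for name, words in docs.items()}

for name, vec in vectors.items():
    print(name, vec)
```

Each vector has one component per entry in the term list, so all three documents live in the same 15-dimensional space and can be compared directly.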

Standard Textual Clustering
Vector Space Model (TFIDF)
Dissimilarity Matrix: documents to documents (D1, D2, D3), Euclidean distance
Cluster Analysis: most similar documents
Time complexity: O(n^2 log n)
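The pipeline on this slide (TFIDF weighting followed by a document-to-document Euclidean dissimilarity matrix) might look like the sketch below. The slide does not give the exact TFIDF variant, so the common form tf × log(N/df) is assumed here, and the count vectors are made-up illustrative data.

```python
import math

def tfidf(count_vectors):
    """Weight raw counts by log(N / df); assumes the common tf * idf variant."""
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0])
    # Document frequency: number of documents containing each term.
    df = [sum(1 for v in count_vectors if v[j] > 0) for j in range(n_terms)]
    return [
        [v[j] * math.log(n_docs / df[j]) if df[j] else 0.0 for j in range(n_terms)]
        for v in count_vectors
    ]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dissimilarity_matrix(vectors):
    """Pairwise distances between all documents (n x n)."""
    return [[euclidean(a, b) for b in vectors] for a in vectors]

# Illustrative term counts for three documents over three terms.
counts = [
    [2, 1, 0],
    [0, 1, 1],
    [0, 1, 3],
]
weights = tfidf(counts)
D = dissimilarity_matrix(weights)
```

Note that `df` depends on every document in the set, so adding or removing one document changes the weights of all the others; the matrix `D` also grows quadratically with the number of documents.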

Issues (1)
Analysts are currently overwhelmed by the volume of information streams generated every day.
Research in clustering analysis mainly focuses on how to quickly and accurately cluster static data collections.
Research on clustering dynamic information streams is limited.

Solution: Bio-inspired Clustering
New computational algorithms inspired by biological models, such as ant colonies, bird flocks, and bee swarms, can solve problems in dynamic environments.
These algorithms are characterized by the interaction of a large number of agents that follow the same rules.
Bio-inspired clustering algorithms apply the self-organizing and collective behaviors of social insects to organize dynamically changing data.

Data Clustering by Ant Clustering Algorithm
Deneubourg proposed the first clustering solutions inspired by ant colonies.
Agent (ant) action rules: agents move randomly in the grid. Agents only recognize objects immediately in front of them. Agents pick up or drop an item based on a pickup probability and a drop probability.
The movement of data objects has to be implemented through the movements of a small number of ant agents, which slows down clustering.
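The slide does not state the pickup and drop probabilities. A minimal sketch, assuming the form commonly attributed to Deneubourg's ant model, is shown below; `f` is the fraction of similar items the ant perceives nearby, and the constants `k1` and `k2` are illustrative values, not values from this presentation.

```python
def pick_probability(f, k1=0.1):
    """Probability that an unladen ant picks up an item.

    f is the perceived fraction of similar items in the neighborhood (0..1).
    An isolated item (f near 0) is picked up almost surely.
    """
    return (k1 / (k1 + f)) ** 2

def drop_probability(f, k2=0.3):
    """Probability that a laden ant drops its item.

    A dense neighborhood of similar items (f near 1) makes dropping likely.
    """
    return (f / (k2 + f)) ** 2

for f in (0.0, 0.1, 0.5, 1.0):
    print(f, round(pick_probability(f), 3), round(drop_probability(f), 3))
```

Because an item only moves while an ant is carrying it, the clustering rate is bounded by the number of ants, which is the slowdown the slide points out.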

A New Clustering Algorithm Based on Bird Flock Collective Behavior

Flocking Model
The flocking model, one of the first bio-inspired computational collective behavior models, was first proposed by Craig Reynolds.
Alignment: steer towards the average heading of the local flock mates.
Separation: steer to avoid crowding flock mates.
Cohesion: steer towards the average position of local flock mates.
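The three rules can be sketched as steering vectors in 2D. This is an illustrative sketch only: the `Boid` class, the neighborhood `radius`, and the equal rule weights are assumptions, and practical implementations usually also scale separation by distance and cap speed.

```python
from dataclasses import dataclass

@dataclass
class Boid:
    x: float
    y: float
    vx: float
    vy: float

def neighbors(boid, flock, radius):
    """Local flock mates: every other boid within the given radius."""
    return [o for o in flock
            if o is not boid
            and (o.x - boid.x) ** 2 + (o.y - boid.y) ** 2 <= radius ** 2]

def alignment(boid, mates):
    """Steer towards the average heading (velocity) of the mates."""
    n = len(mates)
    return (sum(m.vx for m in mates) / n - boid.vx,
            sum(m.vy for m in mates) / n - boid.vy)

def separation(boid, mates):
    """Steer away from the mates to avoid crowding."""
    return (sum(boid.x - m.x for m in mates),
            sum(boid.y - m.y for m in mates))

def cohesion(boid, mates):
    """Steer towards the average position of the mates."""
    n = len(mates)
    return (sum(m.x for m in mates) / n - boid.x,
            sum(m.y for m in mates) / n - boid.y)

def steer(boid, flock, radius=5.0, w_align=1.0, w_sep=1.0, w_coh=1.0):
    """Weighted sum of the three rule vectors; zero if the boid is alone."""
    mates = neighbors(boid, flock, radius)
    if not mates:
        return (0.0, 0.0)
    a, s, c = alignment(boid, mates), separation(boid, mates), cohesion(boid, mates)
    return (w_align * a[0] + w_sep * s[0] + w_coh * c[0],
            w_align * a[1] + w_sep * s[1] + w_coh * c[1])

flock = [Boid(0.0, 0.0, 1.0, 0.0), Boid(2.0, 0.0, 0.0, 1.0)]
print(steer(flock[0], flock))
```

Each boid uses only its local neighborhood, so every agent can be updated independently of the rest of the flock.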

Flocking Demo

Multiple Species Flocking (MSF) Model
Feature similarity rule: steer away from other birds that have dissimilar features and stay close to those birds that have similar features.
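One way to picture the feature similarity rule is as a fourth steering vector whose sign depends on how similar two birds' feature vectors are. The sketch below is an assumption for illustration, not the formula used by the authors: it uses cosine similarity and a hypothetical `threshold` that separates "similar" from "dissimilar".

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def feature_similarity_steer(pos, feat, others, threshold=0.5):
    """Steering vector from the feature similarity rule.

    others is a list of (position, feature_vector) pairs for nearby birds.
    Birds more similar than the threshold attract; less similar birds repel.
    """
    sx = sy = 0.0
    for (ox, oy), other_feat in others:
        weight = cosine_similarity(feat, other_feat) - threshold
        sx += weight * (ox - pos[0])
        sy += weight * (oy - pos[1])
    return (sx, sy)

# A bird at the origin with one neighbor one unit to its right.
toward = feature_similarity_steer((0.0, 0.0), [1.0, 0.0], [((1.0, 0.0), [1.0, 0.0])])
away = feature_similarity_steer((0.0, 0.0), [1.0, 0.0], [((1.0, 0.0), [0.0, 1.0])])
print(toward, away)
```

Added to the alignment, separation, and cohesion vectors, a term like this lets birds of one "species" form their own flock while keeping clear of the others.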

Issues (2)
TFIDF is not practical for dynamic data:
Every document added to or removed from the set requires recalculation of the entire VSM.
It requires sequential processing, which does not suit a distributed agent approach.
The document set must be known before the VSM can be calculated.

Inverse Corpus Frequency
Look at the forest, not the trees.
We analyzed nearly 1 million documents from 6 major research corpora.
We found 229,023 unique terms (a large dictionary contains around 70,000 terms).
We use this term frequency distribution as our global term frequency.
Reed, Jiao, et al., TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams, The Fifth International Conference on Machine Learning and Applications (2006), to appear.
Reed et al., Multi-Agent System for Distributed Cluster Analysis, Third International Workshop on Software Engineering for Large-Scale Multi-Agent Systems (SELMAS'04), May 24-25, 2004, Edinburgh, Scotland.
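The idea of weighting terms against a fixed reference corpus, rather than against the document set being clustered, can be sketched as follows. The weighting formula log(1 + tf) × log((N + 1)/(n + 1)) is an assumption for illustration, and the corpus size and per-term document frequencies below are hypothetical numbers, not the statistics from the corpora mentioned on the slide.

```python
import math
from collections import Counter

# Hypothetical reference-corpus statistics (illustrative numbers only).
CORPUS_SIZE = 1_000_000
CORPUS_DOC_FREQ = {"sensor": 12_000, "explosive": 3_000, "the": 990_000}

def tf_icf(words, corpus_df=CORPUS_DOC_FREQ, corpus_size=CORPUS_SIZE):
    """Weight each term of one document using only fixed corpus statistics.

    Assumed form: log(1 + tf) * log((N + 1) / (n + 1)), where N is the size of
    the reference corpus and n is the number of corpus documents with the term.
    """
    counts = Counter(words)
    return {
        term: math.log(1 + tf) * math.log((corpus_size + 1) / (corpus_df.get(term, 0) + 1))
        for term, tf in counts.items()
    }

weights = tf_icf(["the", "sensor", "explosive", "the"])
print(weights)
```

Because the function takes a single document and a static table, any agent holding a copy of the table can compute the vector locally, with no knowledge of the other documents in the stream.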

Why this matters
We can now generate an accurate vector directly from a text document.
That vector can be generated wherever the document resides.
We can now use agents to create vectors from documents across a broad range of computers.

Multiple Species Flocking (MSF) Document Clustering
Each document is projected as a bird in a 2D virtual space.
Birds that have similar document vector features (like birds of the same species and colony in nature) automatically group together and become a flock.
Birds that have different document vector features stay away from this flock.

MSF Document Clustering Demo
The document collection dataset:
No.   Category/Topic                      Number of articles
1     Airline Safety                      10
2     China and Spy Plane and Captives    4
3     Hoof and Mouth Disease              9
4     Amphetamine                         10
5     Iran Nuclear                        16
6     N. Korea and Nuclear Capability     5
7     Mortgage Rates                      8
8     Ocean and Pollution                 10
9     Saddam Hussein and WMD              10
10    Storm Irene                         22
11    Volcano                             8

Performance Results of MSF, K-means and Ant Clustering Algorithm
The clustering results of the K-means, Ant clustering and MSF clustering algorithms on synthetic* and document** datasets after 300 iterations.
Table columns: Algorithms, Average cluster number, Average F-measure value
Synthetic Dataset rows: MSF, K-means(4)***, Ant
Real Document Collection rows: MSF, K-means(11)***, Ant
* Four data types, each including 200 two-dimensional (x, y) data objects. x and y are distributed according to a Normal distribution.
** 112 news article dataset, 11 categories.
*** The K-means algorithm has pre-knowledge of the cluster number.
Ref: X. Cui, J. Gao and T. E. Potok, A Flocking Based Algorithm for Document Clustering Analysis, Journal of Systems Architecture, Volume 52, Issues 8-9, pp , August 2006, ISSN:
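The slide reports an average F-measure but does not define it. A minimal sketch of the F-measure commonly used to score a clustering against known categories is given below; whether the authors used exactly this variant is an assumption.

```python
from collections import Counter

def clustering_f_measure(true_labels, cluster_ids):
    """Class-size-weighted best-match F-measure of a clustering.

    For each true class, find the cluster with the highest F value
    (harmonic mean of precision and recall), then average those best
    values weighted by class size.
    """
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    joint = Counter(zip(true_labels, cluster_ids))

    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = joint[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j
            recall = n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total

labels = ["a", "a", "b", "b"]
print(clustering_f_measure(labels, [0, 0, 1, 1]))  # clusters match the classes
print(clustering_f_measure(labels, [0, 0, 0, 0]))  # everything in one cluster
```

The score is computed from the final cluster assignments alone, so it applies equally to K-means, ant clustering, and flocking results, even when the algorithms find different numbers of clusters.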

MSF Clustering Algorithm for Information Stream
The MSF clustering algorithm can achieve better performance in document clustering than the K-means and Ant clustering algorithms.
The algorithm can continually refine the clustering result and quickly react to changes in individual data items.
This characteristic makes the algorithm suitable for clustering dynamically changing document information, such as a text information stream.

Multi-Agent Document Clustering Implementation
JADE platform.
Linux cluster machine: one main node and three client nodes, connected by a Gigabit Ethernet switch. Each node contains a single 2.4G Intel Pentium IV processor and 512M memory.
Document datasets are derived from TREC collections (TREC: Text REtrieval Conference).

Current and Future Work
Switched the agent platform from JADE to our light agent platform (ORMAC).
Built a control agent that automatically generates and deploys flock agents on all available nodes of a 135-node cluster.
Built agents that monitor news updates on several popular Internet news websites, collect the news, and feed it into the system in real time.
Building a better GUI.

Conclusion
The heuristic search mechanism of the flocking model helps document agents quickly form flocks and react to changes in any individual document.
The TFICF vector space model, an enhancement of TFIDF, allows parallel or distributed algorithms for information stream clustering.
The agent architecture provides an analysis approach that can run on cluster computers.

Thank you!

Figure: The architectures of the central model and the distributed model.
Labels in the distributed model: Head Node, JADE main Container, JADE system agents, Node1, Node2, Node3, JADE Container, Location proxy agents, Boid agents.
Labels in the Single Processor model: Head Node, JADE main Container, JADE system agents, Node1 …, JADE Container, Location proxy agent, Boid agents.