OAK RIDGE NATIONAL LABORATORY
U.S. DEPARTMENT OF ENERGY

A Distributed Agent Implementation of Multiple Species Flocking Model for Document Partitioning Clustering
Xiaohui Cui, Ph.D. and Thomas E. Potok, Ph.D.
Applied Software Engineering Research Group, Oak Ridge National Laboratory

Outline
- Introduction to dynamic information streams and their issues
- Bio-inspired clustering
- MSF clustering model based on bird flock collective behavior
- TFIDF is not practical for dynamic data
- MSF document clustering algorithm
- Multi-agent document clustering implementation
- Future work and conclusion

Text Challenge
Problem
- How to effectively reduce the size of a large, streaming set of documents
- "Give me the 10 documents that I need to read, out of the 1000 I received today."
Characteristics
- A steady flow of simple documents
- Need to rapidly organize the documents into subsets
- Need to select representative documents from the subsets

Approach
- Use standard IR techniques to convert text to vectors
- Use unsupervised learning (text clustering) to organize the documents
- Look for improvements in term weighting approaches

Standard Information Retrieval: Vector Space Model
Documents
- Document 1: The Army needs sensor technology to help find improvised explosive devices
- Document 2: ORNL has developed sensor technology for homeland defense
- Document 3: Mitre has won a contract to develop homeland defense sensors for explosive devices
Terms
- Document 1: Army, Sensor, Technology, Help, Find, Improvise, Explosive, Device
- Document 2: ORNL, develop, sensor, technology, homeland, defense
- Document 3: Mitre, won, contract, develop, homeland, defense, sensor, explosive, devices
Term List
- Army, Sensor, Technology, Help, Find, Improvise, Explosive, Device, ORNL, develop, homeland, Defense, Mitre, won, contract
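As a minimal sketch of the vector space model on this slide, the three example documents can be turned into term-count vectors over a shared term list. The per-document terms are copied from the slide (lowercased, with "devices" normalized to "device"). Stop-word removal and stemming are assumed to have happened already, and the helper name `to_vector` is hypothetical.

```python
from collections import Counter

# Terms per document, as listed on the slide (stop words removed, stemming assumed done).
docs = {
    "D1": ["army", "sensor", "technology", "help", "find",
           "improvise", "explosive", "device"],
    "D2": ["ornl", "develop", "sensor", "technology", "homeland", "defense"],
    "D3": ["mitre", "won", "contract", "develop", "homeland", "defense",
           "sensor", "explosive", "device"],
}

# Shared term list: every distinct term, in first-seen order.
term_list = []
for terms in docs.values():
    for term in terms:
        if term not in term_list:
            term_list.append(term)


def to_vector(terms, term_list):
    """Map a document's terms onto the shared term list as raw counts."""
    counts = Counter(terms)
    return [counts[term] for term in term_list]


vectors = {name: to_vector(terms, term_list) for name, terms in docs.items()}
for name, vec in vectors.items():
    print(name, vec)
```

Each document becomes one row of the term-document matrix, with one column per entry in the term list.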

Standard Textual Clustering
- Vector Space Model: TFIDF term weights
- Dissimilarity Matrix: documents to documents (D1, D2, D3), Euclidean distance
- Cluster Analysis: group the most similar documents
- Time complexity: O(n^2 log n)
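The pipeline on this slide can be sketched as follows, assuming the common `tf * log(N / df)` form of TFIDF (the slide does not say which variant is used). The count matrix is a toy example, and the function names are hypothetical.

```python
import math


def tfidf_vectors(count_vectors):
    """Weight raw term counts by log(N / df); the exact TFIDF variant is assumed."""
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0])
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for vec in count_vectors if vec[j] > 0) for j in range(n_terms)]
    return [
        [vec[j] * math.log(n_docs / df[j]) if df[j] else 0.0 for j in range(n_terms)]
        for vec in count_vectors
    ]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dissimilarity_matrix(vectors):
    """Document-to-document distance matrix (n x n)."""
    return [[euclidean(a, b) for b in vectors] for a in vectors]


# Toy term counts for three documents over three terms (illustrative only).
counts = [[2, 1, 0], [0, 1, 1], [0, 1, 3]]
weights = tfidf_vectors(counts)
matrix = dissimilarity_matrix(weights)
for row in matrix:
    print([round(d, 3) for d in row])
```

Note that `df` is computed over the whole collection, so adding or removing one document changes the weights of every vector.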

Issues (1)
- Analysts are currently overwhelmed by the volume of information streams generated every day.
- Research in cluster analysis mainly focuses on how to quickly and accurately cluster static data collections.
- Research on clustering dynamic information streams is limited.

Solution: Bio-inspired Clustering
- New computational algorithms inspired by biological models, such as ant colonies, bird flocks, and bee swarms, can solve problems in dynamic environments.
- These algorithms are characterized by the interaction of a large number of agents that follow the same rules.
- Bio-inspired clustering algorithms apply the self-organizing and collective behaviors of social insects to organize dynamically changing data.

Data Clustering by Ant Clustering Algorithm
- Deneubourg proposed the first clustering solutions inspired by ant colonies.
- Agent (ant) action rule: agents move randomly in the grid and only recognize objects immediately in front of them. An agent picks up or drops an item based on a pick-up probability and a drop probability.
- Data objects can only move when one of a small number of ant agents carries them, which slows down clustering.
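The pick-up and drop rule above can be sketched with the probability functions usually attributed to Deneubourg's model, where `f` is the fraction of similar items an ant perceives nearby. The constants `k1` and `k2` are tuning parameters, and the default values here are assumptions for illustration.

```python
def pick_probability(f, k1=0.1):
    """Probability that an unloaded ant picks up an item.

    f is the perceived fraction of similar items nearby (0.0 to 1.0).
    An isolated item (f near 0) is picked up almost certainly.
    """
    return (k1 / (k1 + f)) ** 2


def drop_probability(f, k2=0.3):
    """Probability that a loaded ant drops its item.

    A dense neighborhood of similar items (f near 1) makes a drop likely.
    """
    return (f / (k2 + f)) ** 2


for f in (0.0, 0.1, 0.5, 1.0):
    print(f, round(pick_probability(f), 3), round(drop_probability(f), 3))
```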

A New Clustering Algorithm Based on Bird Flock Collective Behavior

Flocking Model
The flocking model, one of the first bio-inspired computational models of collective behavior, was first proposed by Craig Reynolds. Each bird follows three rules:
- Alignment: steer towards the average heading of the local flock mates
- Separation: steer to avoid crowding flock mates
- Cohesion: steer towards the average position of local flock mates
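The three rules can be sketched as one velocity update per bird in a 2D plane. This is a simplified illustration, not the presenters' implementation: the rule weights, the inverse-square separation term, and the dictionary layout for a bird are all assumptions.

```python
def flocking_velocity(boid, neighbors, w_align=1.0, w_sep=1.0, w_coh=1.0):
    """Return a new (vx, vy) for `boid` given its local flock mates.

    `boid` and each neighbor are dicts with 'pos' and 'vel' as (x, y) tuples.
    """
    if not neighbors:
        return boid["vel"]
    n = len(neighbors)
    (px, py), (vx, vy) = boid["pos"], boid["vel"]

    # Alignment: steer towards the average heading of the neighbors.
    ax = sum(m["vel"][0] for m in neighbors) / n - vx
    ay = sum(m["vel"][1] for m in neighbors) / n - vy

    # Cohesion: steer towards the average position of the neighbors.
    cx = sum(m["pos"][0] for m in neighbors) / n - px
    cy = sum(m["pos"][1] for m in neighbors) / n - py

    # Separation: push away from each neighbor, more strongly when close.
    sx = sy = 0.0
    for m in neighbors:
        dx, dy = px - m["pos"][0], py - m["pos"][1]
        d2 = dx * dx + dy * dy
        if d2 > 0:
            sx += dx / d2
            sy += dy / d2

    return (vx + w_align * ax + w_sep * sx + w_coh * cx,
            vy + w_align * ay + w_sep * sy + w_coh * cy)


me = {"pos": (0.0, 0.0), "vel": (0.0, 0.0)}
mate = {"pos": (2.0, 0.0), "vel": (0.0, 1.0)}
new_vel = flocking_velocity(me, [mate])
print(new_vel)
```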

Flocking Demo

Multiple Species Flocking (MSF) Model
Feature similarity rule: steer away from birds that have dissimilar features and stay close to birds that have similar features.
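One way to picture the feature similarity rule is as an extra steering term that attracts a bird to similar neighbors and repels it from dissimilar ones. The sketch below is an assumption-heavy illustration: the use of cosine similarity, the `threshold` that separates "similar" from "dissimilar", and the function names are not taken from the slides.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two feature vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_steering(pos, features, neighbors, threshold=0.5):
    """Steering vector from the feature similarity rule.

    `neighbors` is a list of (position, feature_vector) pairs.
    Neighbors above the (assumed) threshold attract, those below it repel.
    """
    sx = sy = 0.0
    for n_pos, n_features in neighbors:
        dx, dy = n_pos[0] - pos[0], n_pos[1] - pos[1]
        dist = math.hypot(dx, dy)
        if dist == 0:
            continue
        gain = cosine_similarity(features, n_features) - threshold
        sx += gain * dx / dist
        sy += gain * dy / dist
    return sx, sy


same_species = [((1.0, 0.0), [1.0, 0.0])]
other_species = [((1.0, 0.0), [0.0, 1.0])]
toward = similarity_steering((0.0, 0.0), [1.0, 0.0], same_species)
away = similarity_steering((0.0, 0.0), [1.0, 0.0], other_species)
print(toward, away)
```

In a full model this term would be added to the alignment, separation, and cohesion terms of the basic flocking update.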

Issues (2)
TFIDF is not practical for dynamic data:
- Every document added to or removed from the set requires recalculation of the entire VSM
- The document set must be known before the VSM can be calculated
- Requires sequential processing
- Not good for a distributed agent approach

Inverse Corpus Frequency
Look at the forest, not the trees
- We analyzed nearly 1 million documents from 6 major research corpora
- We found 229,023 unique terms (a large dictionary contains around 70,000 terms)
- We use this term frequency distribution as our global term frequency
References
- Reed, Jiao, et al., TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams, The Fifth International Conference on Machine Learning and Applications (2006), to appear
- Reed et al., Multi-Agent System for Distributed Cluster Analysis, Third International Workshop on Software Engineering for Large-Scale Multi-Agent Systems (SELMAS'04), May 24-25, 2004, Edinburgh, Scotland
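The idea of weighting against a fixed reference corpus can be sketched as below. The weighting form `log(1 + tf) * log((N + 1) / (n + 1))` is an assumption modeled on common log-scaled schemes and may differ from the cited paper. The corpus size and per-term document counts are invented placeholders, not the statistics from the study.

```python
import math
from collections import Counter

# Hypothetical, precomputed reference-corpus statistics (placeholder numbers).
CORPUS_SIZE = 1_000_000
CORPUS_DOC_FREQ = {"sensor": 12_000, "homeland": 8_000, "explosive": 3_000}


def tficf_vector(terms, corpus_doc_freq=CORPUS_DOC_FREQ, corpus_size=CORPUS_SIZE):
    """Weight one document's terms using only fixed corpus statistics.

    No other document in the stream is consulted, so the vector can be
    computed wherever the document resides.
    """
    counts = Counter(terms)
    return {
        term: math.log(1 + tf)
        * math.log((corpus_size + 1) / (corpus_doc_freq.get(term, 0) + 1))
        for term, tf in counts.items()
    }


doc = ["sensor", "explosive", "homeland", "sensor"]
vector = tficf_vector(doc)
for term, weight in sorted(vector.items()):
    print(term, round(weight, 3))
```

Because the corpus statistics never change when a document arrives or leaves, existing vectors do not need to be recalculated.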

Why this matters
- We can now generate an accurate vector directly from a text document
- That vector can be generated wherever the document resides
- We can now use agents to create vectors from documents over a broad range of computers

Multiple Species Flocking (MSF) Document Clustering
- Each document is projected as a bird in a 2D virtual space.
- Birds with similar document vector features (analogous to birds of the same species and colony in nature) automatically group together and become a flock.
- Birds with different document vector features stay away from this flock.

MSF Document Clustering Demo
The document collection dataset:
No.  Category/Topic                      Number of articles
1    Airline Safety                      10
2    China and Spy Plane and Captives    4
3    Hoof and Mouth Disease              9
4    Amphetamine                         10
5    Iran Nuclear                        16
6    N. Korea and Nuclear Capability     5
7    Mortgage Rates                      8
8    Ocean and Pollution                 10
9    Saddam Hussein and WMD              10
10   Storm Irene                         22
11   Volcano                             8

Performance Results of MSF, K-means and Ant Clustering Algorithm
Clustering results of the K-means, Ant clustering and MSF clustering algorithms on synthetic* and document** datasets after 300 iterations.
[Table: columns are Algorithm, Average cluster number, Average F-measure value; rows are MSF, K-means(4)*** and Ant for the synthetic dataset, and MSF, K-means(11)*** and Ant for the real document collection.]
* Four data types, each including 200 two-dimensional (x, y) data objects; x and y follow a Normal distribution.
** 112 news articles in 11 categories.
*** The K-means algorithm has prior knowledge of the cluster number.
Ref: X. Cui, J. Gao and T. E. Potok, A Flocking Based Algorithm for Document Clustering Analysis, Journal of Systems Architecture, Volume 52, Issues 8-9, August 2006
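The slide reports an average F-measure without defining it. A common definition for clustering, assumed here, pairs each true category with its best-matching cluster and weights the result by category size. The function name and the toy labels are illustrative only.

```python
from collections import Counter


def clustering_f_measure(labels, clusters):
    """Overall F-measure of a clustering against known category labels.

    For each category, take the best F value over all clusters, then
    average those values weighted by category size.
    """
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))

    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = joint[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j
            recall = n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total


truth = ["a", "a", "b", "b"]
perfect = clustering_f_measure(truth, [0, 0, 1, 1])
merged = clustering_f_measure(truth, [0, 0, 0, 0])
print(perfect, merged)
```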

MSF Clustering Algorithm for Information Stream
- The MSF clustering algorithm can achieve better performance in document clustering than the K-means and Ant clustering algorithms.
- The algorithm continually refines the clustering result and quickly reacts to changes in individual data items.
- This characteristic makes the algorithm suitable for clustering dynamically changing document information, such as a text information stream.

Multi-Agent Document Clustering Implementation
- JADE platform.
- Linux cluster machine: one main node and three client nodes, connected by a Gigabit Ethernet switch. Each node contains a single 2.4 GHz Intel Pentium IV processor and 512 MB of memory.
- Document datasets are derived from TREC (Text REtrieval Conference) collections.

Current and Future Work
- Switched the agent platform from JADE to our lightweight agent platform (ORMAC).
- Built a control agent that automatically generates and deploys flock agents on all available nodes of a 135-node cluster.
- Built agents that monitor news updates on several popular Internet news websites, collect the news, and feed it into the system in real time.
- Building a better GUI.

Conclusion
- The heuristic search mechanism of the flocking model helps document agents quickly form flocks and react to changes in any individual document.
- The TFICF vector space model, an enhancement of TFIDF, allows parallel or distributed algorithms for information stream clustering.
- The agent architecture provides an analysis approach that can run on cluster computers.

Thank you!

The architectures: the single processor model and the distributed model
[Diagram, distributed model: Head Node with JADE main container and JADE system agents; Node1, Node2 and Node3, each with a JADE container holding boid agents; location proxy agents.]
[Diagram, single processor model: Head Node with JADE main container and JADE system agents; Node1 with a JADE container holding boid agents and a location proxy agent.]