Published by Malcolm Perry. Modified over 8 years ago.
1
Document Clustering on Supercomputers
Presented by Yu (Cathy) Jiao, Ph.D.
Applied Software Engineering Research Group
Computational Sciences and Engineering Division
2
Yu_Potok_Clustering_0611
National challenge
Data are everywhere
Sources are unreliable
Data are difficult to merge
Merging cannot be done manually
[Figure: data types (binary, text, image, multimedia, sensors), with sample content ("One small step for man", 11010010), along a timeline from 1970 to 2010]
3
Key technologies
Intelligent agents: peer-to-peer communication; encapsulated messages; computation distribution; adaptive and collaborative behavior; fault tolerance
High-performance computing: Red/White Oak clusters; 135 Dell computers; largest cluster computer at ORNL; 1.7 Tflops; 270 GB memory; 11.3 TB disk
[Figure: intelligent agents grouped around a shared blackboard]
4
What to do with this?
Sources: paper records, old disks, legacy databases
What is in there?
Are there any threats?
What am I missing?
5
How ORNL can help
Stages shown in the diagram: raw documents; organize the information; connect the information; find the threats; take action
Questions shown in the diagram: What do we have? What are the connections? What are they planning? How credible is the threat?
[Figure: example topic labels: Iraq, Nuclear Materials, Chemical Weapons, Threats, Potential Targets, Money Laundering, Training Camps]
6
The technical problem
Powerful but expensive

Vector Space Model (words to documents): 10,000 documents, 100 terms, 1 second

Term        Doc 1  Doc 2  Doc 3
Army          1      0      0
Sensor        1      1      1
Technology    1      1      0
Help          1      0      0
Find          1      0      0
Improvise     1      0      0
Explosive     1      0      1
Device        1      0      1
ORNL          0      1      0
Develop       0      1      1
Homeland      0      1      1
Defense       0      1      1
Mitre         0      0      1
Won           0      0      1
Contract      0      0      1

$$W_{ij} = \log_2(f_{ij} + 1) \cdot \log_2\left(\frac{N}{n}\right)$$

Similarity Matrix (documents to documents): 10,000 documents, 1.6 minutes

        Doc 1  Doc 2  Doc 3
Doc 1   100%    17%    21%
Doc 2          100%    36%
Doc 3                 100%

$$d_2(x_i, x_j) = \left( \sum_{k=1}^{d} (x_{i,k} - x_{j,k})^2 \right)^{1/2}$$

Cluster Analysis (most similar documents): O(n^2 log n)
[Figure: documents D1, D2, D3 grouped by similarity]
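The three-document example on this slide is small enough to work through in code. The sketch below is an illustration, not the presenters' implementation: it applies the slide's weighting formula W_ij = log2(f_ij + 1) * log2(N / n) and the Euclidean distance d_2 to the slide's term table. The slide does not name the measure behind its similarity percentages; the Jaccard overlap of each pair's term sets is an assumption here, chosen because it reproduces the slide's 17%, 21%, and 36% after rounding. All function names are hypothetical.

```python
import math

# Term counts copied from the slide: (Doc 1, Doc 2, Doc 3).
COUNTS = {
    "army": (1, 0, 0), "sensor": (1, 1, 1), "technology": (1, 1, 0),
    "help": (1, 0, 0), "find": (1, 0, 0), "improvise": (1, 0, 0),
    "explosive": (1, 0, 1), "device": (1, 0, 1), "ornl": (0, 1, 0),
    "develop": (0, 1, 1), "homeland": (0, 1, 1), "defense": (0, 1, 1),
    "mitre": (0, 0, 1), "won": (0, 0, 1), "contract": (0, 0, 1),
}


def tfidf_weight(f_ij, n_docs, n_with_term):
    """Slide formula: W_ij = log2(f_ij + 1) * log2(N / n)."""
    return math.log2(f_ij + 1) * math.log2(n_docs / n_with_term)


def weighted_vectors(counts, n_docs):
    """Build one weighted vector per document (terms in sorted order).

    Assumes every term occurs in at least one document, as in the slide's table.
    """
    vectors = [[] for _ in range(n_docs)]
    for term in sorted(counts):
        row = counts[term]
        n_with_term = sum(1 for f in row if f > 0)
        for j, f in enumerate(row):
            vectors[j].append(tfidf_weight(f, n_docs, n_with_term))
    return vectors


def euclidean(x, y):
    """Slide formula: d_2(x_i, x_j) = sqrt(sum_k (x_ik - x_jk)^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def jaccard(counts, a, b):
    """Shared terms divided by all terms used by either document (assumed measure)."""
    terms_a = {t for t, row in counts.items() if row[a] > 0}
    terms_b = {t for t, row in counts.items() if row[b] > 0}
    return len(terms_a & terms_b) / len(terms_a | terms_b)


if __name__ == "__main__":
    vectors = weighted_vectors(COUNTS, 3)
    for a, b in [(0, 1), (0, 2), (1, 2)]:
        print(f"Doc {a + 1} vs Doc {b + 1}: "
              f"overlap {jaccard(COUNTS, a, b):.0%}, "
              f"weighted distance {euclidean(vectors[a], vectors[b]):.2f}")
```

Comparing every pair of n documents takes n(n - 1)/2 comparisons, which is why the similarity matrix step dominates the timings quoted on the slide as the collection grows.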
7
Breakthrough: inverse corpus frequency
Look at the forest, not the trees
We analyzed nearly 1 million documents from six major research corpora
We found 229,023 unique terms (a large dictionary contains around 70,000 terms)
We use this term frequency distribution as our “global” term frequency

$$W_{ij} = \log_2(f_{ij} + 1) \cdot \log_2\left(\frac{C + 1}{c + 1}\right)$$

[Figure: unique term count (0 to 250,000) plotted against number of documents (K)]
Reed et al., “Multi-Agent System for Distributed Cluster Analysis,” Third International Workshop on Software Engineering for Large-Scale Multi-Agent Systems (SELMAS'04), May 24-25, 2004, Edinburgh, Scotland.
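A minimal sketch of the inverse corpus frequency weighting, W_ij = log2(f_ij + 1) * log2((C + 1) / (c + 1)), follows. The slide does not define C and c; the code assumes C is the number of documents in the reference corpus and c is the number of those documents that contain the term. The corpus statistics below are made-up illustrative numbers, and the function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical reference-corpus statistics (illustrative values, not from the talk).
CORPUS_SIZE = 1000                      # C: documents in the reference corpus (assumed meaning)
CORPUS_DOC_FREQ = {                     # c: reference documents containing each term (assumed meaning)
    "the": 990,
    "sensor": 120,
    "explosive": 4,
}


def tficf_weight(f_ij, corpus_size, corpus_doc_freq):
    """W_ij = log2(f_ij + 1) * log2((C + 1) / (c + 1))."""
    return math.log2(f_ij + 1) * math.log2((corpus_size + 1) / (corpus_doc_freq + 1))


def weigh_document(text, corpus_size=CORPUS_SIZE, corpus_doc_freq=CORPUS_DOC_FREQ):
    """Weight one document using only its own term counts and the fixed corpus table.

    Terms missing from the table are treated as c = 0, which gives them the
    largest possible corpus factor.
    """
    counts = Counter(text.lower().split())
    return {
        term: tficf_weight(f, corpus_size, corpus_doc_freq.get(term, 0))
        for term, f in counts.items()
    }


if __name__ == "__main__":
    for term, weight in sorted(weigh_document("the sensor found the explosive").items()):
        print(f"{term:10s} {weight:.3f}")
```

Because the corpus table is fixed ahead of time, each document can be weighted without looking at any other document in the collection being clustered, so the work can be split across machines.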
8
Distributed clustering
Reed et al., “An Agent-based Method for Distributed Clustering of Textual Information,” patent pending, licensed to industry
[Figure: a head node and Computers 1 through 11, each hosting intelligent agents that share a blackboard]
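The slide shows only the layout: a head node, worker computers, agents, and a blackboard. The sketch below illustrates that division of labor on a single machine and is not the patented method. Threads stand in for computers, each agent turns its share of the documents into term-count vectors, and a lock-protected dictionary plays the role of the blackboard. It covers the vectorization step only, not the clustering itself. Every class and function name here is hypothetical.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class Blackboard:
    """Shared store that agents post results to (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}

    def post(self, doc_id, vector):
        with self._lock:
            self._entries[doc_id] = vector

    def snapshot(self):
        with self._lock:
            return dict(self._entries)


def agent(shard, blackboard):
    """One agent: vectorize each document in its shard and post the result."""
    for doc_id, text in shard:
        blackboard.post(doc_id, Counter(text.lower().split()))
    return len(shard)


def run_head_node(documents, n_agents=3):
    """Head node: split the documents into shards and run one agent per shard."""
    blackboard = Blackboard()
    items = sorted(documents.items())
    shards = [items[i::n_agents] for i in range(n_agents)]
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        processed = list(pool.map(lambda shard: agent(shard, blackboard), shards))
    return blackboard.snapshot(), processed


if __name__ == "__main__":
    docs = {"d1": "sensor technology", "d2": "sensor sensor", "d3": "ornl develop"}
    board, processed = run_head_node(docs)
    print(processed, sorted(board))
```

In a real cluster the agents would run on separate machines and exchange messages rather than share memory; the thread pool is only a stand-in that keeps the example self-contained.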
9
Breakthrough: bio-inspired distributed solution
Ant colony optimization
Bird flocking model: alignment, separation, cohesion
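The three flocking rules named on the slide can be sketched as a single update step. This is the generic bird flocking model in two dimensions, not the presenters' clustering system; the slide does not say how documents are mapped onto flock members. The function name, the neighborhood radius, and the rule weights are all illustrative assumptions.

```python
import math


def flock_step(positions, velocities, radius=5.0, min_dist=1.0,
               w_align=0.05, w_cohere=0.01, w_separate=0.1):
    """Advance every flock member one step using alignment, cohesion, separation."""
    new_velocities = []
    for i, (p, v) in enumerate(zip(positions, velocities)):
        neighbors = [j for j, q in enumerate(positions)
                     if j != i and math.dist(p, q) <= radius]
        ax = ay = cx = cy = sx = sy = 0.0
        if neighbors:
            k = len(neighbors)
            # Alignment: steer toward the neighbors' average velocity.
            ax = sum(velocities[j][0] for j in neighbors) / k - v[0]
            ay = sum(velocities[j][1] for j in neighbors) / k - v[1]
            # Cohesion: steer toward the neighbors' center of mass.
            cx = sum(positions[j][0] for j in neighbors) / k - p[0]
            cy = sum(positions[j][1] for j in neighbors) / k - p[1]
            # Separation: steer away from neighbors closer than min_dist.
            for j in neighbors:
                if math.dist(p, positions[j]) < min_dist:
                    sx += p[0] - positions[j][0]
                    sy += p[1] - positions[j][1]
        new_velocities.append((
            v[0] + w_align * ax + w_cohere * cx + w_separate * sx,
            v[1] + w_align * ay + w_cohere * cy + w_separate * sy,
        ))
    new_positions = [(p[0] + v[0], p[1] + v[1])
                     for p, v in zip(positions, new_velocities)]
    return new_positions, new_velocities


if __name__ == "__main__":
    pos = [(0.0, 0.0), (4.0, 0.0), (0.5, 0.2)]
    vel = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
    for _ in range(3):
        pos, vel = flock_step(pos, vel)
    print(pos)
```

Each member looks only at its nearby neighbors, so the update needs no central coordinator, which is what makes the model a natural fit for distributed computation.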
10
Summary
Current technology cannot solve emerging national challenges.
Intelligent software agents are a significant breakthrough technology.
Results indicate high potential to help solve these national challenges.
We have a record of successfully deployed agent systems and research to our credit.
11
Contact
Yu (Cathy) Jiao, Ph.D.
Applied Software Engineering Research Group
Computational Sciences and Engineering Division
(865) 574-0647
jiaoy@ornl.gov