1 Parallel Mining of Closed Sequential Patterns Shengnan Cong, Jiawei Han, David Padua Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, Chicago, Illinois, USA, 2005 Advisor: Jia-Ling Koh Speaker: Chun-Wei Hsieh

2 Introduction Sequential pattern mining has numerous applications: DNA sequence analysis, web log analysis, customer shopping sequences, XML query access patterns, and more. Closed sequential patterns retain all the information of the complete set of sequential patterns while being more compact. Many of these applications are time-critical and involve huge volumes of data, which motivates a parallel algorithm.

3 Sequential Algorithm: BIDE Step 1: identify the frequent 1-sequences. Step 2: project the dataset along each frequent 1-sequence. Step 3: recursively mine each resulting projected dataset.

4 Sequential Algorithm: BIDE The projected dataset for the sequence AB is {C, CB, C, BCA}.
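The figure with the example database did not survive this transcript. As a concrete illustration, here is a minimal Python sketch of projecting a sequence database along a prefix, assuming the example database is {CAABC, ABCB, CABC, ABBCA} (which reproduces the projections listed above). The function name project is mine, and the sketch builds physical suffixes, whereas BIDE and Par-CSP use pseudo-projections (pointers into the original sequences).

```python
def project(database, prefix):
    """Project each sequence onto the suffix that follows the first
    occurrence of `prefix` (items matched in order, gaps allowed)."""
    projected = []
    for seq in database:
        pos = 0
        for item in prefix:
            idx = seq.find(item, pos)
            if idx == -1:
                pos = -1
                break
            pos = idx + 1
        if pos != -1:
            projected.append(seq[pos:])
    return projected

# Assumed example database; projecting along "AB" gives ['C', 'CB', 'C', 'BCA'].
db = ["CAABC", "ABCB", "CABC", "ABBCA"]
print(project(db, "AB"))
```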

5 Task Decomposition 1. Each processor counts the occurrences of 1-sequences in a different part of the dataset; a global add reduction is then executed to obtain the overall counts (see the counting sketch below). 2. Build the pseudo-projections. This is done in parallel by assigning a different part of the dataset to each processor; the pseudo-projections are then communicated to all processors via an all-to-all broadcast. 3. Use dynamic scheduling to distribute the mining of the projections across processors.
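Below is a minimal mpi4py sketch of step 1: local counting of 1-sequences followed by a global merge of the counts. The paper's implementation uses MPI but not necessarily Python, so mpi4py, the round-robin partitioning, and the toy database here are illustrative assumptions only.

```python
from collections import Counter
from mpi4py import MPI  # assumption: MPI runtime with mpi4py installed

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

# Each processor would load only its own chunk of the sequence database;
# here a toy database is partitioned round-robin for illustration.
database = ["CAABC", "ABCB", "CABC", "ABBCA"]
my_chunk = database[rank::nprocs]

# Local counting: each distinct item is counted once per sequence.
local = Counter()
for seq in my_chunk:
    local.update(set(seq))

# allgather + merge plays the role of the global add reduction:
# afterwards every processor holds the overall counts.
merged = Counter()
for part in comm.allgather(local):
    merged.update(part)

min_support = 2
frequent_1seq = sorted(i for i, c in merged.items() if c >= min_support)
if rank == 0:
    print("frequent 1-sequences:", frequent_1seq)
```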

6 Task Decomposition In the second step, it is more efficient to implement the broadcast using a virtual ring structure. Assume there are N processors; processor K only receives a package from processor ((K-1) mod N) and only sends a package to processor ((K+1) mod N). The broadcast needs (N-1) send-receive steps and consumes no more than 0.5% of the mining time.
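The following mpi4py sketch shows one way the virtual-ring exchange could be realized: in each of the N-1 steps a processor forwards the package it received last to its right neighbour while receiving a new one from its left neighbour. This is my own sketch under those assumptions, not the paper's code; the placeholder dictionary stands in for the locally built pseudo-projections.

```python
from mpi4py import MPI  # illustrative sketch, not the paper's MPI/C++ code

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % nprocs, (rank + 1) % nprocs

# Placeholder for the pseudo-projections built locally in step 2.
local_projections = {f"proj_from_{rank}": ["..."]}

gathered = dict(local_projections)
outgoing = local_projections
for _ in range(nprocs - 1):
    # Send the most recently received package to the right neighbour
    # while receiving a new package from the left neighbour.
    incoming = comm.sendrecv(outgoing, dest=right, source=left)
    gathered.update(incoming)
    outgoing = incoming

print(f"processor {rank} now holds {len(gathered)} projection packages")
```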

7 Task Scheduling 1. A master processor maintains a queue of pseudo-projection identifiers; each of the other processors is initially assigned one projection. 2. After mining a projection, a processor sends a request to the master processor for another projection. 3. This process continues until the queue of projections is empty.
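A minimal mpi4py sketch of this master/worker loop is given below. The message tags and the mine_projection stub are hypothetical, and for simplicity workers request their first projection instead of being pre-assigned one; otherwise it follows the request-reply scheme described above.

```python
from mpi4py import MPI  # illustrative master/worker sketch

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()
REQUEST, ASSIGN, DONE = 1, 2, 3  # hypothetical message tags

def mine_projection(proj_id):
    # Placeholder for running BIDE on one pseudo-projection.
    return []

if rank == 0:  # master: hand out projection identifiers on request
    queue = list(range(100))            # pseudo-projection identifiers
    active = nprocs - 1
    status = MPI.Status()
    while active > 0:
        comm.recv(source=MPI.ANY_SOURCE, tag=REQUEST, status=status)
        worker = status.Get_source()
        if queue:
            comm.send(queue.pop(0), dest=worker, tag=ASSIGN)
        else:
            comm.send(None, dest=worker, tag=DONE)
            active -= 1
else:          # worker: request, mine, repeat until the queue is empty
    while True:
        comm.send(None, dest=0, tag=REQUEST)
        proj_id = comm.recv(source=0)
        if proj_id is None:
            break
        mine_projection(proj_id)
```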

8 Task Scheduling If the largest subtask takes 25% of the total mining time, the best possible speedup is only 4 (= 1/0.25) regardless of the number of processors available, because that one subtask alone bounds the parallel running time. To improve the dynamic scheduling, the approach is to identify which projections require a long mining time and to decompose them.

9 Relative Mining Time Estimation Random sampling selects a random subset of the projections, but it is not accurate when the sample (and hence the overhead) is kept small. Selective sampling instead keeps every sequence of the projections but shortens each one: it discards the infrequent 1-sequences and the last L remaining frequent items of each sequence, where L = a given fraction t times the average length of the sequences in the dataset.

10 Selective Sampling For example, assume the 1-sequences and their supports are (A:4), (B:4), (C:4), (D:3), (E:3), (F:3), (G:1), the support threshold is 4, and the average length of the sequences in the dataset is 4. With t = 75%, L = 4 * 0.75 = 3. Given the sequence AABCACDCFDB, removing the infrequent items D and F leaves AABCACCB, and dropping the last L = 3 items reduces it to AABCA.
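A small Python sketch of this shortening rule follows; the function name selective_sample is mine, not the paper's. Running it on the slide's example reproduces AABCA.

```python
def selective_sample(sequence, frequent_items, trim_len):
    """Drop infrequent items, then drop the last `trim_len` remaining items."""
    kept = [item for item in sequence if item in frequent_items]
    return "".join(kept[:-trim_len] if trim_len > 0 else kept)

# Values from the slide: frequent 1-sequences are A, B, C (support >= 4),
# and L = 3 (75% of the average sequence length 4).
print(selective_sample("AABCACDCFDB", {"A", "B", "C"}, 3))  # -> "AABCA"
```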

11 Relative Mining Time Estimation

12 Par-CSP Algorithm

13 Experiments Cluster of 64 nodes. OS: Red Hat Linux 7.2. CPU: 1 GHz Intel Pentium 3. RAM: 1 GB. Compiler: GNU g.

14 Experiments Synthetic datasets: generated with the IBM dataset generator. Real dataset: Gazelle, a web click-stream dataset.

15 Experiments

16 Experiments

17 Experiments

18 Experiments