
1 Distributed Linear Programming and Resource Management for Data Mining in Distributed Environments. Haimonti Dutta 1 and Hillol Kargupta 2. 1 Center for Computational Learning Systems (CCLS), Columbia University, NY, USA. 2 University of Maryland, Baltimore County, Baltimore, MD; also affiliated with Agnik, LLC, Columbia, MD.

2 Motivation: Support Vector (Kernel) Regression – an illustration. Find a function f(x) = y to fit a set of example data points. The problem can be phrased as a constrained optimization task and solved using a standard LP solver.
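
A minimal sketch of that idea, assuming a plain 1-D linear model f(x) = w·x + b with an ε-insensitive loss handed to SciPy's linprog; the toy data, ε, and the unregularized, non-kernel formulation are illustrative assumptions rather than the exact formulation in the cited work:

```python
# Fit f(x) = w*x + b by minimizing the total epsilon-insensitive slack, as an LP.
import numpy as np
from scipy.optimize import linprog

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])          # toy sample points (assumed)
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])
eps, N = 0.1, len(x)

# Decision variables: [w, b, xi_1, ..., xi_N]; objective: minimize the sum of slacks.
c = np.concatenate(([0.0, 0.0], np.ones(N)))

# |y_i - (w*x_i + b)| <= eps + xi_i, written as two inequality rows per point.
A_ub = np.zeros((2 * N, 2 + N))
b_ub = np.zeros(2 * N)
for i in range(N):
    A_ub[2 * i, 0], A_ub[2 * i, 1], A_ub[2 * i, 2 + i] = -x[i], -1.0, -1.0
    b_ub[2 * i] = eps - y[i]
    A_ub[2 * i + 1, 0], A_ub[2 * i + 1, 1], A_ub[2 * i + 1, 2 + i] = x[i], 1.0, -1.0
    b_ub[2 * i + 1] = eps + y[i]

bounds = [(None, None), (None, None)] + [(0, None)] * N   # w, b free; slacks >= 0
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
w, b = res.x[0], res.x[1]
print(f"f(x) = {w:.3f}*x + {b:.3f}")
```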

3 Motivation contd.: Knowledge-Based Kernel Regression. In addition to sample points, give advice, e.g. If (x ≥ 3) and (x ≤ 5) Then (y ≥ 5). Such rules add constraints about regions of the input space; the constraints are added to the LP and a new solution (with advice constraints) can be constructed, as sketched below. Fung, Mangasarian and Shavlik, "Knowledge-Based Support Vector Machine Classifiers", NIPS, 2002. Mangasarian, Shavlik and Wild, "Knowledge-Based Kernel Approximation", JMLR, 5, 1127–1141, 2005. Figure adapted from Maclin, Shavlik, Walker and Torrey, "Knowledge-based Support Vector Regression for Reinforcement Learning", IJCAI, 2005.
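
A small illustration (not the exact knowledge-based formulation of the cited papers) of how such advice can enter the LP: for the 1-D linear model f(x) = w·x + b of the previous sketch, the rule "if 3 ≤ x ≤ 5 then y ≥ 5" holds on the whole interval exactly when it holds at the two endpoints, so it contributes two extra linear constraint rows on (w, b):

```python
import numpy as np

def advice_rows(x_lo, x_hi, y_min, n_slack):
    """Rows enforcing w*x + b >= y_min at both endpoints of [x_lo, x_hi]."""
    rows = np.zeros((2, 2 + n_slack))
    rows[0, 0], rows[0, 1] = -x_lo, -1.0      # -(w*x_lo + b) <= -y_min
    rows[1, 0], rows[1, 1] = -x_hi, -1.0      # -(w*x_hi + b) <= -y_min
    return rows, np.array([-y_min, -y_min])

# Append to the A_ub / b_ub of the previous sketch (n_slack = number of sample points):
extra_A, extra_b = advice_rows(3.0, 5.0, 5.0, n_slack=5)
# A_ub = np.vstack([A_ub, extra_A]); b_ub = np.concatenate([b_ub, extra_b])
```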

4 Distributed Data Mining Applications – an example of scientific data mining in astronomy. Distributed data and computing resources on the National Virtual Observatory; P2P data mining on a homogeneously partitioned sky survey (H. Dutta, "Empowering Scientific Discovery by Distributed Data Mining on the Grid Infrastructure", Ph.D. Thesis, UMBC, Maryland, 2007). Hence the need for distributed optimization strategies.

5 Road Map: Motivation; Related Work; Framing a Linear Programming problem; The simplex algorithm; The distributed simplex algorithm; Experimental Results; Conclusions and Directions of Future Work.

6 Related Work. Resource discovery in distributed environments: Iamnitchi, "Resource Discovery in Large Resource-Sharing Environments", Ph.D. Thesis, University of Chicago, 2003; Raman, Livny and Solomon, "Matchmaking: Distributed Resource Management for High Throughput Computing", HPDC, 1998. Optimization techniques: Yarmish, "Distributed Implementation of the Simplex Method", Ph.D. Thesis, Polytechnic University, 2001; Hall and McKinnon, "Update Procedures for Parallel Revised Simplex Methods", Tech. Report, University of Edinburgh, UK, 1992; Stunkel and Reed, "Hypercube Implementation of the Simplex Algorithm", ACM, pages 1473–1482, 1988.

7 The Optimization Problem. Assumptions: n nodes in the network; the network is static; dataset D_i at node i; processing cost at the i-th node: ν_i per record; transportation cost between nodes i and j: μ_ij; amount of data transferred between nodes i and j: x_ij. Cost function: Z = Σ_ij (μ_ij + ν_j) x_ij = Σ_ij c_ij x_ij, where c_ij = μ_ij + ν_j.
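
A small sketch of this cost function in code, with made-up 3-node values for ν, μ and a transfer plan x (records moved from node i to node j):

```python
import numpy as np

nu = np.array([1.2, 2.2, 2.9])               # processing cost per record at each node
mu = np.array([[0.0, 2.5, 8.3],              # transfer cost per record, i -> j
               [2.5, 0.0, 3.8],
               [8.3, 3.8, 0.0]])
x = np.array([[0.0, 100.0, 0.0],             # records shipped from i to j
              [0.0, 0.0, 50.0],
              [0.0, 0.0, 0.0]])

c = mu + nu[None, :]                         # c_ij = mu_ij + nu_j (processing at the receiver)
Z = np.sum(c * x)                            # Z = sum_ij c_ij * x_ij
print(Z)
```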

8 Framing the Linear Programming Problem: an illustration.
Objective function: z = 6.03x12 + 9.04x23 + 6.52x15 + 8.28x14 + 14.42x25 + 9.58x34 + 12.32x45, where C(X) = Σ_ij (μ_ij + ν_j) x_ij = Σ_ij c_ij x_ij and c_ij = μ_ij + ν_j.
Constraints: x12 + x14 + x15 ≤ 300; x12 + x25 + x23 ≤ 600; x15 + x25 + x45 ≤ 300; x23 + x34 ≤ 300;
0 ≤ x12 ≤ D1; 0 ≤ x23 ≤ D2; 0 ≤ x15 ≤ D1; 0 ≤ x14 ≤ D1; 0 ≤ x25 ≤ D2; 0 ≤ x34 ≤ D3; 0 ≤ x45 ≤ D4.
[Figure: a five-node network with node storage capacities (node 2: 600 GB, the others: 300 GB) and per-record transfer costs μ_ij on the edges (2.5, 3.8, 6.1, 6.5, 7.8, 8.3, 10.4). Per-record processing costs: Node 1: ν = 1.23, Node 2: ν = 2.23, Node 3: ν = 2.94, Node 4: ν = 1.78, Node 5: ν = 4.02.]
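
A minimal sketch of handing this LP to an off-the-shelf solver (SciPy's linprog); the capacity bounds D1–D4 are placeholders since the slide leaves them symbolic. With only capacity constraints and non-negative costs the minimizer is trivially x = 0, so this only illustrates the mechanics; the equality (demand) constraints introduced in the distributed example later are what make the transfer plan non-trivial.

```python
import numpy as np
from scipy.optimize import linprog

# Variable order: x12, x23, x15, x14, x25, x34, x45.
c = np.array([6.03, 9.04, 6.52, 8.28, 14.42, 9.58, 12.32])

# Node capacity constraints from the slide (A_ub @ x <= b_ub).
A_ub = np.array([
    [1, 0, 1, 1, 0, 0, 0],   # x12 + x15 + x14 <= 300
    [1, 1, 0, 0, 1, 0, 0],   # x12 + x23 + x25 <= 600
    [0, 0, 1, 0, 1, 0, 1],   # x15 + x25 + x45 <= 300
    [0, 1, 0, 0, 0, 1, 0],   # x23 + x34       <= 300
])
b_ub = np.array([300, 600, 300, 300])

# 0 <= x_ij <= D_i; the D_i values below are illustrative placeholders.
D1, D2, D3, D4 = 300.0, 600.0, 300.0, 300.0
bounds = [(0, D1), (0, D2), (0, D1), (0, D1), (0, D2), (0, D3), (0, D4)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(res.x, res.fun)
```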

9 The Simplex Algorithm. Find x1 ≥ 0, x2 ≥ 0, …, xn ≥ 0 minimizing z = c1x1 + c2x2 + … + cnxn subject to the constraints A1x1 + A2x2 + … + Anxn = B. The simplex tableau:
a11 a12 … a1n | b1
a21 a22 … a2n | b2
…
am1 am2 … amn | bm
c1  c2  … cn  | z

10 The Simplex Algorithm – contd. The problem: maximize z = x1 + 2x2 - x3 subject to 2x1 + x2 + x3 ≤ 14; 4x1 + 2x2 + 3x3 ≤ 28; 2x1 + 5x2 + 5x3 ≤ 30. The steps of the simplex algorithm (Dantzig): obtain a canonical representation (introduce slack variables); find a column pivot; find a row pivot; perform Gauss-Jordan elimination. A minimal sketch of these steps follows.
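
The sketch below applies the four steps to the slide's example, assuming Dantzig's most-negative-coefficient rule for the column pivot; ties in the ratio test may be broken differently than on the following slides, but the optimum agrees.

```python
import numpy as np

def simplex_max(A, b, c):
    """Maximize c.x subject to A @ x <= b, x >= 0, assuming all b >= 0."""
    m, n = A.shape
    # Step 1: canonical representation -- append slack variables and the objective row.
    T = np.zeros((m + 1, n + m + 1))
    T[:m, :n] = A
    T[:m, n:n + m] = np.eye(m)
    T[:m, -1] = b
    T[-1, :n] = -c                               # objective row encodes z - c.x = 0
    while True:
        col = int(np.argmin(T[-1, :-1]))         # Step 2: column pivot
        if T[-1, col] >= 0:
            return T                             # no negative entry left: optimal
        pos = T[:m, col] > 1e-12
        if not pos.any():
            raise ValueError("LP is unbounded")
        ratios = np.full(m, np.inf)
        ratios[pos] = T[:m, -1][pos] / T[:m, col][pos]
        row = int(np.argmin(ratios))             # Step 3: row pivot (ratio test)
        T[row] /= T[row, col]                    # Step 4: Gauss-Jordan elimination
        for r in range(m + 1):
            if r != row:
                T[r] -= T[r, col] * T[row]

A = np.array([[2.0, 1, 1], [4, 2, 3], [2, 5, 5]])
b = np.array([14.0, 28, 30])
c = np.array([1.0, 2, -1])                       # maximize x1 + 2*x2 - x3
T = simplex_max(A, b, c)
print("optimal z =", T[-1, -1])                  # 13.0, matching the final tableau
```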

11 The simplex tableau and iterations. Canonical representation (columns x1, x2, x3, s1, s2, s3 | B):
 2   1   1   1   0   0 | 14
 4   2   3   0   1   0 | 28
 2   5   5   0   0   1 | 30
-1  -2   1   0   0   0 |  0
Pivot column: x2 (most negative entry in the objective row). Ratio test: 14/1 = 14, 28/2 = 14, 30/5 = 6, so the third row is the pivot row.

12 Simplex iterations contd. Performing Gauss-Jordan elimination on that pivot gives:
 8/5   0   0   1   0  -1/5 |  8
16/5   0   1   0   1  -2/5 | 16
 2/5   1   1   0   0   1/5 |  6
-1/5   0   3   0   0   2/5 | 12
The final tableau, after one more pivot, is:
 0   0   -1/2   1  -1/2    0   |  0
 1   0   5/16   0   5/16  -1/8 |  5
 0   1   7/8    0  -1/8    1/4 |  4
 0   0  49/16   0   1/16   3/8 | 13
No negative entries remain in the objective row, so the optimum is z = 13.

13 Road Map: Motivation; Related Work; Framing a Linear Programming problem; The simplex algorithm; The distributed simplex algorithm; Experimental Results; Conclusions and Future Work.

14 The Distributed Problem – An Example. Each site observes different constraints, but all want to solve the same objective function z = 6.03x12 + 9.04x23 + 6.52x15 + 8.28x14 + 14.42x25 + 9.58x34 + 12.32x45.
Node 1 (300 GB): x12 + x15 + x14 + 2x25 ≤ 300; x12 + 2x15 - x25 = 2
Node 2 (600 GB): x12 + x23 + x25 ≤ 600; 2x25 - x12 - x23 = 4
Node 5 (300 GB): x15 + x25 + x45 ≤ 300; x25 - 2x15 - x45 = 5
Node 4 (300 GB): x34 + 8x25 ≤ 300
Node 3 (300 GB): x23 + x34 ≤ 300

15 Distributed Canonical Representation – an initialization step. The number of basic (slack) variables to add equals the total number of constraints in the system. Build a spanning tree in the network and run a distributed sum-estimation algorithm to obtain that total, as sketched below. This builds a canonical representation exactly identical to the one that would be obtained if the data were centralized.
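
A minimal sketch of this initialization, assuming each node knows only its own number of local constraints: summing the counts up a spanning tree (a converge-cast) gives the total number of slack variables the global canonical form needs, which the root can then broadcast back down the tree. The 5-node tree and counts are made up for illustration.

```python
children = {1: [2, 4], 2: [3, 5], 3: [], 4: [], 5: []}   # spanning tree rooted at node 1
local_constraints = {1: 2, 2: 2, 3: 1, 4: 1, 5: 1}       # known only locally

def subtree_sum(node: int) -> int:
    # Each node adds its own count to the sums reported by its children.
    return local_constraints[node] + sum(subtree_sum(c) for c in children[node])

total = subtree_sum(1)                                   # root learns the global count
print("slack variables to add:", total)                  # 7
```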

16 The Distributed Algorithm for solving the LP problem. Steps involved: estimate the column pivot; estimate the row pivot (requires communication with neighbors); perform Gauss-Jordan elimination. A schematic sketch of one iteration follows.
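
The sketch below shows one such iteration on a single machine, with the messaging (spanning tree / Push-Min) abstracted away: every node sees the same objective row, proposes its best local ratio for the row pivot, the network-wide minimum decides the pivot row, and each node then updates its own rows by Gauss-Jordan elimination. The data layout and helper names are illustrative, not the exact protocol.

```python
import numpy as np

def distributed_iteration(local_rows, obj_row):
    """local_rows: {node_id: 2-D array of that node's tableau rows, last column = B}."""
    col = int(np.argmin(obj_row[:-1]))            # column pivot, same at every node
    if obj_row[col] >= 0:
        return None                               # no entering variable: optimal
    best = {}                                     # node -> (best local ratio, row index)
    for node, rows in local_rows.items():
        idx = np.flatnonzero(rows[:, col] > 1e-12)
        if idx.size:
            ratios = rows[idx, -1] / rows[idx, col]
            k = idx[int(np.argmin(ratios))]
            best[node] = (rows[k, -1] / rows[k, col], k)
    if not best:
        raise ValueError("LP is unbounded")
    owner = min(best, key=lambda n: best[n][0])   # resolved by Push-Min in practice
    k = best[owner][1]
    pivot = local_rows[owner][k] / local_rows[owner][k, col]
    local_rows[owner][k] = pivot                  # normalized pivot row, shared with all
    for node, rows in local_rows.items():         # Gauss-Jordan at every node
        for i in range(rows.shape[0]):
            if not (node == owner and i == k):
                rows[i] -= rows[i, col] * pivot
    obj_row -= obj_row[col] * pivot
    return owner, col

# Two-node demo reusing the centralized example (optimal z = 13):
obj = np.array([-1.0, -2, 1, 0, 0, 0, 0])
nodes = {"A": np.array([[2.0, 1, 1, 1, 0, 0, 14]]),
         "B": np.array([[4.0, 2, 3, 0, 1, 0, 28], [2.0, 5, 5, 0, 0, 1, 30]])}
while distributed_iteration(nodes, obj) is not None:
    pass
print("optimal z =", obj[-1])                     # 13.0
```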

17 Illustration of the Distributed Algorithm. Each node holds the rows of the global tableau corresponding to its local constraints, over the columns x12, x23, x15, x14, x25, x34, x45, s1–s8 and B, together with the shared objective row (-6.03, -9.04, -6.52, -8.28, -14.42, -9.58, -12.32, 0, …, 0 | 0). For example, Node 1's first row encodes x12 + x15 + x14 + 2x25 + s1 = 300. Column pivot selection is done at each node.

18 Distributed Row Pivot Selection: Protocol Push-Min (gossip based), a minimum-estimation problem. At iteration t, node i receives the values {m_r} sent at iteration t-1 and sets m_t,i = min({m_r}, current row-pivot candidate). Termination: all nodes hold exactly the same minimum value. A small sketch follows.
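
A minimal sketch of the Push-Min idea: every node repeatedly pushes its current minimum (its local row-pivot candidate) to a randomly chosen neighbor until all nodes agree. Synchronous rounds, the small graph, and the candidate values below are simplifying assumptions for illustration.

```python
import random

neighbors = {1: [2, 4], 2: [1, 3, 5], 3: [2, 4], 4: [1, 3], 5: [2]}
local_ratio = {1: 14.0, 2: 14.0, 3: 6.0, 4: 9.5, 5: 11.0}   # local pivot-ratio candidates
current_min = dict(local_ratio)

rounds = 0
while len(set(current_min.values())) > 1:          # terminate: all nodes hold the same value
    rounds += 1
    inbox = {n: [] for n in neighbors}
    for n in neighbors:                             # each node pushes its current minimum
        inbox[random.choice(neighbors[n])].append(current_min[n])
    for n, msgs in inbox.items():                   # m_t,i = min({m_r}, current value)
        current_min[n] = min([current_min[n]] + msgs)

print(rounds, current_min)                          # every node ends up with 6.0
```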

19 Analysis of Protocol Push-Min. The analysis is based on the spread of an epidemic in a large population, with susceptible, infected and dead nodes; the "epidemic" (the current minimum value) spreads exponentially fast.

20 Comments and Discussion. Assume η nodes in the network. The communication complexity is O(number of simplex iterations × η). In the worst case the simplex method may require an exponential number of iterations, but for most practical purposes it takes about λm iterations (λ < 4), where m is the number of constraints.

21 Road Map: Motivation; Related Work; Framing a Linear Programming problem; The simplex algorithm; The distributed simplex algorithm; Experimental Results; Conclusions and Directions of Future Work.

22 Experimental Results. Artificial data set: simulated constraint matrices at each node. The Distributed Data Mining Toolkit (DDMT), developed at the University of Maryland, Baltimore County (UMBC), was used to simulate the network structure. Two metrics for evaluation: Total Communication Cost in the network (TCC) and Average Communication Cost per Node (ACCN).

23 Communication Cost. [Figure: Average Communication Cost per Node versus number of nodes in the network.]

24 More Experimental Results. [Figures: TCC versus number of variables at each node; TCC versus number of constraints at each node.]

25 Conclusions and Future Work. Resource management and pattern recognition present formidable challenges in distributed systems. We presented a distributed algorithm for resource management based on the simplex algorithm and tested it on simulated data. Future work: incorporate the dynamics of the network; test the algorithm on a real distributed network; study the effect of the size and structure of the network on the mining results; examine the trade-off between accuracy and communication cost incurred before and after using distributed simplex on a mining task such as classification or clustering.

26 Selected Bibliography
G. B. Dantzig, "Linear Programming and Extensions", Princeton University Press, Princeton, NJ, 1963.
Kargupta and Chan (eds.), "Advances in Distributed and Parallel Knowledge Discovery", AAAI Press, Menlo Park, CA, 2000.
A. L. Turinsky, "Balancing Cost and Accuracy in Distributed Data Mining", Ph.D. Thesis, University of Illinois at Chicago, 2002.
Haimonti Dutta, "Empowering Scientific Discovery by Distributed Data Mining on the Grid Infrastructure", Ph.D. Thesis, UMBC, 2007.
Mangasarian, "Mathematical Programming in Data Mining", Data Mining and Knowledge Discovery (DMKD), Vol. 42, pp. 183–201, 1997.

27 Questions?

