Learning Support Vector Machine Classifiers from Distributed Data Sources

Cornelia Caragea, Doina Caragea and Vasant Honavar
Artificial Intelligence Research Laboratory, Bioinformatics and Computational Biology Program, Computational Intelligence, Learning, and Discovery Program, Department of Computer Science
AAAI 2005

Acknowledgements: This work is supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM 066387) to Vasant Honavar.

Overview

Our approach relies on identifying sufficient statistics for learning SVMs. We present an algorithm that learns SVMs from distributed data by iteratively computing a set of refinement sufficient statistics. The algorithm is exact with respect to its centralized counterpart and efficient in terms of time complexity.

Learning from Data

Given a data set D, a hypothesis class H, and a performance criterion P, the learning algorithm L outputs a hypothesis h in H that optimizes P.

Learning from Distributed Data

Given the fragments D_1, ..., D_N of a data set D distributed across N sites, a set of constraints Z, a hypothesis class H, and a performance criterion P, the task of the distributed learner L_d is to output a hypothesis h in H that optimizes P, using only operations allowed by Z.

Sufficient Statistics

A statistic s_L(D) is a sufficient statistic for learning a hypothesis h using a learning algorithm L applied to a data set D if there exists a procedure that takes s_L(D) as input and outputs h. Usually we cannot compute all the sufficient statistics at once; instead, we can only compute the sufficient statistics for the refinement of a hypothesis h_i into a hypothesis h_{i+1}.

Exactness

An algorithm L_d for learning from distributed data sets D_1, ..., D_N is exact relative to its centralized counterpart L if the hypothesis produced by L_d is identical to the one obtained by L from the complete data set D formed by appropriately combining D_1, ..., D_N. The exactness condition is

    q(D) = C(q_1(D_1), ..., q_N(D_N)),

that is, the answer to a statistical query q posed against the complete data set D can be obtained by composing the answers to queries q_1, ..., q_N posed against the fragments.

[Figure: learning from distributed data decomposes into information extraction from distributed data plus hypothesis generation. The learner turns each refinement step into a statistical query s(D, h_i -> h_{i+1}); a query answering engine decomposes it into sub-queries q_1, ..., q_K against the sources D_1, ..., D_N and composes their answers, and the refinement R(h_i, s(D, h_i -> h_{i+1})) maps the partial hypothesis h_i to h_{i+1}.]

Support Vector Machines

An SVM finds a separating hyperplane w·x + b = 0 that maximizes the margin of separation between the classes when the data are linearly separable; kernels can be used to make data sets separable in high-dimensional feature spaces. SVMs are among the most effective machine learning algorithms for classification problems. The optimal solution is

    $w^* = \arg\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i(w \cdot x_i + b) \ge 1$ for all $i = 1, \dots, N$,

and the training instances that satisfy the constraints with equality are the support vectors. The support vectors (x_i, y_i) and their corresponding coefficients λ_i can be seen as sufficient statistics for learning SVMs.

Learning SVMs from Distributed Data: the Naïve Approach

Pose a statistical query to each data source D_i asking for its local support vectors SV(D_i), take the union SV = SV(D_1) ∪ ... ∪ SV(D_N), and apply SVM to the set SV. The resulting algorithm is not exact: there are counterexamples in which a support vector of the complete data set is not a support vector of any individual fragment, so the union misses boundary information needed to recover the global margin. A code sketch illustrating the comparison follows.
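To make the comparison concrete, here is a minimal sketch, assuming scikit-learn is available; the synthetic data set, the two-way split, and all variable names are illustrative choices of ours, not from the poster. It trains a linear SVM on each fragment, pools the local support vectors, retrains on the pooled set, and compares the resulting hyperplane with the centralized one.

    # Naive union of per-fragment support vectors vs. the centralized SVM.
    # Illustrative sketch; scikit-learn and the synthetic data are assumptions.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                               n_redundant=0, random_state=0)

    # Horizontally distribute D into two fragments D1, D2.
    idx = np.random.RandomState(0).permutation(len(X))
    parts = np.array_split(idx, 2)

    # Naive approach: collect each fragment's local support vectors,
    # then retrain on their union.
    sv_X, sv_y = [], []
    for part in parts:
        clf = SVC(kernel="linear", C=1.0).fit(X[part], y[part])
        sv_X.append(X[part][clf.support_])
        sv_y.append(y[part][clf.support_])
    naive = SVC(kernel="linear", C=1.0).fit(np.vstack(sv_X), np.concatenate(sv_y))

    # Centralized counterpart: train on the complete data set D.
    central = SVC(kernel="linear", C=1.0).fit(X, y)

    print("naive:      ", naive.coef_[0], naive.intercept_[0])
    print("centralized:", central.coef_[0], central.intercept_[0])
    # Whenever the two hyperplanes disagree, the naive algorithm is not exact.

The two hyperplanes typically differ, which is the counterexample behavior described above.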
Exact learning would require all the boundary information, VConv(D+) ∪ VConv(D-), where VConv(D) denotes the set of vertices that define the convex hull of D; however, such an algorithm is exponential in the number of dimensions.

Exact and Efficient Learning of SVMs from Distributed Data

SVM Algorithm

Learning phase: SVM(D: data, K: kernel). Solve the dual optimization problem

    $\max_{\lambda}\ \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \lambda_i \lambda_j y_i y_j K(x_i, x_j)$

subject to

    $\sum_{i=1}^{N} \lambda_i y_i = 0, \quad 0 \le \lambda_i \le C, \quad i = 1, \dots, N$.

Let $\lambda^* = (\lambda_1^*, \dots, \lambda_N^*)$ be the solution of this optimization problem; the instances with $\lambda_i^* > 0$ are the support vectors.

Classification phase: for a new instance x, assign x to the class

    $\mathrm{sign}\left( \sum_{i=1}^{N} \lambda_i^* y_i K(x_i, x) + b \right)$.

SVM from Horizontally Distributed Data

Learning phase:
    Initialize SV <- ∅ (the global set of support vectors).
    repeat {
        Send SV to all data sources.
        for each data source D_i {
            Apply SVM to D_i ∪ SV and find the local support vectors SV(D_i ∪ SV).
            Send the support vectors to the central location.
        }
        At the central location: compute the union U = SV(D_1 ∪ SV) ∪ ... ∪ SV(D_N ∪ SV)
        and apply SVM to U to find the new SV.
    } until SV does not change between iterations.
    Let SV be the set of final support vectors and λ_i their corresponding weights.

Classification phase: for a new instance x, assign x to the class

    $\mathrm{sign}\left( \sum_{(x_i, y_i) \in SV} \lambda_i y_i K(x_i, x) + b \right)$.

Experimental Results

We ran experiments on artificially generated data and on protein function classification (Human-Yeast) data.

    Data source             Naïve Tr. Acc.  Naïve Ts. Acc.  Iter. Tr. Acc.  Iter. Ts. Acc.  Centr. Tr. Acc.  Centr. Ts. Acc.  No. of iter.
    Artificially generated  0.75            0.61            1               1               1                1                3
    Human-Yeast protein     0.60            0.57            0.69            0.63            0.69             0.63             3

The results show that our algorithm converges to the exact centralized solution in a relatively small number of iterations, which makes it preferable to previous algorithms for learning from distributed data. Numerical sketches of the classification and learning phases follow.
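As a numerical check on the classification phase, the sketch below (assuming scikit-learn; the toy data set, kernel parameters, and names are illustrative) recomputes sign(Σ_i λ_i y_i K(x_i, x) + b) directly from a fitted model's support vectors and compares it with the library's own decision function. In scikit-learn, dual_coef_ stores the signed products λ_i y_i for the support vectors.

    # Recompute the SVM decision value from its sufficient statistics:
    # the support vectors (x_i, y_i) and their coefficients lambda_i.
    # Illustrative sketch; scikit-learn and the toy data are assumptions.
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=100, centers=2, random_state=0)
    clf = SVC(kernel="rbf", gamma=0.1, C=1.0).fit(X, y)

    def rbf(A, x, gamma=0.1):
        """K(a, x) = exp(-gamma * ||a - x||^2) for each row a of A."""
        return np.exp(-gamma * np.sum((A - x) ** 2, axis=1))

    # dual_coef_[0, i] holds lambda_i * y_i for the i-th support vector.
    manual = np.array([np.dot(clf.dual_coef_[0], rbf(clf.support_vectors_, x))
                       + clf.intercept_[0] for x in X[:5]])
    assert np.allclose(manual, clf.decision_function(X[:5]))
    print(np.sign(manual))  # classes assigned to the first five instances, as +/-1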
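Finally, a minimal end-to-end sketch of the learning phase. Scikit-learn is assumed, and the fragment construction, the iteration cap (standing in for "repeat ... until SV unchanged"), and the set-equality convergence test are our illustrative choices rather than details fixed by the poster. Each round, every site trains on its own fragment augmented with the current global support vector set, the central site retrains on the union of the local support vectors, and the loop stops when the global set no longer changes.

    # Iterative SVM from horizontally distributed data: illustrative sketch.
    # scikit-learn, the data, the split, and the convergence test are assumptions.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    def local_svs(Xf, yf):
        """Train an SVM at one site; return its support vectors and their labels."""
        clf = SVC(kernel="linear", C=1.0).fit(Xf, yf)
        return Xf[clf.support_], yf[clf.support_]

    X, y = make_classification(n_samples=300, n_features=4, random_state=1)
    fragments = [(X[i::3], y[i::3]) for i in range(3)]  # three horizontal fragments

    sv_X = np.empty((0, X.shape[1]))
    sv_y = np.empty(0, dtype=y.dtype)
    for it in range(50):  # bounded loop in lieu of "repeat ... until SV unchanged"
        # Each site trains on its own fragment plus the current global SV set.
        pieces = [local_svs(np.vstack([Xf, sv_X]), np.concatenate([yf, sv_y]))
                  for Xf, yf in fragments]
        # Central site: retrain on the union of the local support vectors.
        U_X = np.vstack([p[0] for p in pieces])
        U_y = np.concatenate([p[1] for p in pieces])
        new_X, new_y = local_svs(U_X, U_y)
        if {tuple(r) for r in new_X} == {tuple(r) for r in sv_X}:
            break  # global support vector set unchanged -> converged
        sv_X, sv_y = new_X, new_y

    # The final SVs and their weights define the classifier used in the
    # classification phase.
    final = SVC(kernel="linear", C=1.0).fit(sv_X, sv_y)
    print("final support vectors:", len(sv_X), "| iterations:", it + 1)

Because the global support vectors are fed back to every site each round, a boundary point missed by the naïve one-shot union can surface in a later iteration; this feedback loop is the intuition behind the exactness claim above.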