Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris T. Lin, Sanghack Lee, Ngot Bui.


1 Learning Classifiers from Distributional Data
Harris T. Lin, Sanghack Lee, Ngot Bui and Vasant Honavar
Artificial Intelligence Research Laboratory, Department of Computer Science, Iowa State University
Center for Computational Intelligence, Learning, and Discovery
htlin@iastate.edu
BigData2013. Research supported in part by NSF grant IIS 0711356.

2 Introduction
Traditional classification
–Instance representation: tuple of feature values
But, due to
–Variability in sample measurements
–Differences in the sampling rate of each feature
–Advances in tools and storage
one may want to repeat the measurement of each feature for each individual, for reliability.
Example domains
–Electronic Health Records
–Sensor readings
–Extracting text features
–…
How to represent?
[Figure: dataset of Patient 1 (Healthy? = Y), Patient 2 (N) and Patient 3 (N), each with repeated samples of Temperature, Heart Rate, Cholesterol and White blood cell]

3 Introduction
How to represent?
1. Align samples
[Figure: dataset of Patient 1 (Healthy? = Y), Patient 2 (N) and Patient 3 (N), with features Temperature, Heart Rate, Cholesterol and White blood cell]

4 Introduction
How to represent?
1. Align samples
[Figure: Patient 1 (Healthy? = Y), with features Temperature, Heart Rate, Cholesterol and White blood cell]

5 Introduction
How to represent?
1. Align samples
–Measurements may not be synchronous
–Missing data
–Unnecessarily big and sparse dataset
–Need to adjust for weights
[Figure: Patient 1's samples aligned into rows over Temperature, Heart Rate, Cholesterol and White blood cell; many cells are marked "?" and the label Y is repeated on every row]
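The drawbacks listed for alignment can be seen in a minimal Python sketch. It assumes samples are aligned purely by position (a real system would more likely align by timestamp), and the helper name `align` and all readings are hypothetical values chosen for illustration.

```python
from itertools import zip_longest

def align(bags):
    """Pad K bags of unequal length into rows, one value per feature per row.

    Assumption: samples are aligned by position; missing cells become None.
    """
    return list(zip_longest(*bags, fillvalue=None))

# Hypothetical readings for one patient, in feature order:
# temperature, heart rate, cholesterol, white blood cell.
bags = [[37.0, 37.4, 38.1, 37.2], [72, 75], [190], [6.1, 5.9]]

rows = align(bags)
# Count the cells that had to be filled in because a feature was sampled less often.
missing = sum(value is None for row in rows for value in row)
```

Here one patient becomes four rows, and nearly half of the cells are placeholders, which is the "big and sparse" effect noted on the slide. Each row would also carry the same class label, so a learner would need to reweight rows to avoid over-counting frequently sampled patients.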

6 Introduction
How to represent?
2. Aggregation
[Figure: dataset of Patient 1 (Healthy? = Y), Patient 2 (N) and Patient 3 (N), with features Temperature, Heart Rate, Cholesterol and White blood cell]

7 Introduction
How to represent?
2. Aggregation
–May lose valuable information
–Which aggregation function?
–The distribution of each sample set may contain information
[Figure: Patient 1 (Healthy? = Y), with the samples of Temperature, Heart Rate, Cholesterol and White blood cell each collapsed by an aggregation function such as max or avg]
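A small Python sketch of why the choice of aggregation function matters. The two bags of readings below are hypothetical, and the helper `aggregate` is introduced only for illustration.

```python
from statistics import mean

def aggregate(bag, fn):
    """Collapse a bag of repeated measurements into a single value."""
    return fn(bag)

# Hypothetical heart-rate readings for two patients (illustrative values only).
steady = [70, 70, 70, 70]
erratic = [40, 100, 40, 100]

avg_steady = aggregate(steady, mean)
avg_erratic = aggregate(erratic, mean)
max_steady = aggregate(steady, max)
max_erratic = aggregate(erratic, max)
```

Under `avg` the two patients become indistinguishable even though their readings are distributed very differently, while `max` separates them. Either way, a single number discards the shape of the sample set.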

8 Introduction
How to represent?
3. Proposed approach
–Just as drawn
–Bag of feature values
–"Distributional" representation
–Adapt learning models to this new representation
Contribution
–Introduce the problem of learning from Distributional data
–Offer 3 basic solution approaches
[Figure: dataset of Patient 1 (Healthy? = Y), Patient 2 (N) and Patient 3 (N), with features Temperature, Heart Rate, Cholesterol and White blood cell]

9 Problem Formulation
Distributional Instance: $x = (B_1, \dots, B_K)$, where $B_k$ represents a bag of values of the $k$-th feature
Distributional Dataset: $D = \{(x_1, c_1), \dots, (x_n, c_n)\}$
Distributional Classifier Learning Problem:
[Figure: the labeled instances $(x_1, c_1), (x_2, c_2), \dots, (x_n, c_n)$ are fed to a Learner, which produces a Classifier; the Classifier maps a new instance to a predicted class]
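One possible way to hold a distributional instance in memory is sketched below in Python. The names `DistributionalInstance` and `make_instance` are hypothetical, and storing each bag as a `Counter` is an assumption that fits the discrete-valued case; the feature values are made up.

```python
from collections import Counter
from typing import NamedTuple, Tuple

class DistributionalInstance(NamedTuple):
    """x = (B_1, ..., B_K) together with its class label c."""
    bags: Tuple[Counter, ...]  # one bag (multiset of discrete values) per feature
    label: str

def make_instance(feature_values, label):
    """Build an instance from K lists of observed values, one list per feature."""
    return DistributionalInstance(tuple(Counter(v) for v in feature_values), label)

# Hypothetical patient: three temperature readings, one heart-rate reading.
patient1 = make_instance([["normal", "normal", "high"], ["low"]], "Y")

# A distributional dataset D is then just a list of such instances.
dataset = [patient1]
```

Bags may differ in size across features and across instances, so no padding or alignment is needed.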

10 Distributional Learning Algorithms
Considers discrete domains for simplicity
3 basic approaches
–Aggregation
  Simple aggregation (max, min, avg, etc.)
  Vector distance aggregation (Perlich and Provost [2006])
–Generative Models
  Naïve Bayes (with 4 different distributions)
  –Bernoulli
  –Multinomial
  –Dirichlet
  –Polya (Dirichlet-Multinomial)
–Discriminative Models
  Using standard techniques to transform the above generative models into their discriminative counterparts
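To make the generative option concrete, here is a minimal Python sketch of a Naïve Bayes classifier with a multinomial distribution per feature bag. It is an illustration under stated assumptions, not the authors' implementation: the function names `train` and `predict`, the add-`alpha` (Laplace) smoothing, the known per-feature vocabulary sizes, and the toy data are all assumptions. The multinomial coefficient is omitted because it does not depend on the class.

```python
import math
from collections import Counter, defaultdict

def train(dataset, vocab_sizes, alpha=1.0):
    """dataset: list of (bags, label); bags is a list of K lists of discrete values.

    vocab_sizes[k] is the number of distinct values feature k can take
    (assumed known); alpha is an assumed smoothing constant.
    """
    class_counts = Counter(label for _, label in dataset)
    # counts[c][k][v]: how often value v of feature k occurs in class c's bags
    counts = defaultdict(lambda: defaultdict(Counter))
    for bags, label in dataset:
        for k, bag in enumerate(bags):
            counts[label][k].update(bag)
    return {"class_counts": class_counts, "counts": counts,
            "vocab_sizes": vocab_sizes, "alpha": alpha, "n": len(dataset)}

def predict(model, bags):
    """Return the class with the highest log posterior (up to a constant)."""
    best_label, best_score = None, -math.inf
    for c, n_c in model["class_counts"].items():
        score = math.log(n_c / model["n"])  # log prior
        for k, bag in enumerate(bags):
            feature_counts = model["counts"][c][k]
            total = (sum(feature_counts.values())
                     + model["alpha"] * model["vocab_sizes"][k])
            for v in bag:  # every value in the bag contributes one factor
                score += math.log((feature_counts[v] + model["alpha"]) / total)
        if score > best_score:
            best_label, best_score = c, score
    return best_label

# Toy data (hypothetical): feature 0 = temperature level, feature 1 = heart-rate level.
train_data = [
    ([["high", "high", "normal"], ["fast", "fast"]], "N"),
    ([["normal", "normal"], ["slow", "normal", "slow"]], "Y"),
    ([["high", "normal", "high"], ["fast"]], "N"),
]
model = train(train_data, vocab_sizes=[2, 3])
pred_a = predict(model, [["high", "high"], ["fast"]])
pred_b = predict(model, [["normal"], ["slow", "slow"]])
```

Because every value in a bag contributes to the likelihood, larger bags carry more evidence; this is one way the distribution within a bag can influence the prediction, which simple aggregation would discard.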

11 Result Summary
Dataset:
–2 real-world datasets and 1 synthetic dataset
–Dataset sizes: [table not captured in transcript]
Results: DIL algorithms that take advantage of the information available in the distributional instance representation outperform or match the performance of their counterparts that fail to fully exploit such information
Main criticism: Results from the discrete domain may not carry over to numerical features

12 Related Work
[Diagram: the Distributional representation (tuple of bags of features) shown in relation to Tabular data (size of bag = 1), Multiple Instance Learning (bag of tuples of features), Topic Models (document; # features = 1) and Supervised Multi-modal Topic Models, across numerical domains and discrete domains]

13 Future Work
–Consider ordinal and numerical features
–Consider dependencies between features
–Adapt other existing Machine Learning methods (e.g. kernel methods, SVMs, decision trees, nearest neighbors, etc.)
–Unsupervised setting: clustering distributional data

14 Conclusion
Opportunities
–Variability in sample measurements
–Differences in the sampling rate of each feature
One may want to repeat the measurement of each feature for each individual, for reliability
Contributions
–Introduce the problem of learning from Distributional data
–Offer 3 basic solution approaches
–Suggest that the distribution embedded in the Distributional representation may improve performance

