Center for Big Data Analysis

Center for Big Data Analysis, Bergen, Norway
Dr. Alla Sapronova, alla.sapronova@uni.no

Prune the inputs, increase the data volume, or select a different classification method – a strategy to improve the accuracy of classification.

Problem of missing values
● Data analysis means fitting data to a mathematical model (e.g. a probability distribution)
● Data with inaccurate, corrupted or missing entries (especially high-dimensional data) is often impossible to fit
● Simple deletion of incomplete data leads to information loss
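The information loss from simple deletion can be made concrete with a small sketch (illustrative data, not from the slides): complete-case analysis drops every row that has even one missing entry, so the surviving fraction shrinks quickly as dimensionality grows.

```python
# Complete-case analysis: keep only rows with no missing (None) entries.
rows = [
    [1.0, 2.0, 3.0],
    [4.0, None, 6.0],   # one missing value -> whole row discarded
    [7.0, 8.0, None],   # one missing value -> whole row discarded
    [1.5, 2.5, 3.5],
]

complete = [r for r in rows if None not in r]
print(len(rows), len(complete))  # 4 rows in, only 2 survive
```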

Case study: Build a predictive model for fish school presence at a given location and time.
● 12 fish types to predict, with data from 750 historical catches recorded in 2010–2017
● Classification shall be used for the predictive modeling (learn the relation between the desired feature vector and labeled classes)

Addressing missing values
● single complete case analysis
● replacing missing values with means
● replacing missing values with sensible estimates of these values (imputation)
● complete case analysis followed by nearest-neighbor assignment (assign observations to the closest cluster based on the available data)
● partial data analysis based on the common data
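Two of the strategies above can be sketched in a few lines of Python; the function names and the toy data are illustrative, not from the slides. Mean imputation fills each gap with its column mean, and nearest-neighbor assignment compares an incomplete observation to complete ones using only the coordinates that are present.

```python
def impute_means(rows):
    """Replace None entries with the column mean of the observed values."""
    cols = list(zip(*rows))
    means = [sum(v for v in c if v is not None) / sum(v is not None for v in c)
             for c in cols]
    return [[means[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]

def nearest_complete(row, complete_rows):
    """Assign an incomplete row to the closest complete row, comparing
    only the coordinates that are present in the incomplete row."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b) if x is not None)
    return min(complete_rows, key=lambda c: dist(row, c))

data = [[1.0, 2.0], [None, 4.0], [3.0, None]]
print(impute_means(data))  # -> [[1.0, 2.0], [2.0, 4.0], [3.0, 3.0]]
print(nearest_complete([None, 4.0], [[1.0, 2.0], [3.0, 4.5]]))  # -> [3.0, 4.5]
```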

Adding information
● Build time series
● Use the variability of a parameter (over averaged data)
● Add new, correlated data from a different source
● Find a low-dimensional subspace in which the data reside
● Use procedures that adapt the standard PCA algorithm by considering the missing values in the computation of the PCA outputs
A good summary can be found in Plant Ecol (2015) 216:657–667, "Principal component analysis with missing values: a comparative survey of methods" by Stéphane Dray and Julie Josse.
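The simplest PCA-with-missing-values baseline discussed in the Dray & Josse survey can be sketched as follows: impute missing entries with column means, then run standard PCA via the SVD. This is a minimal baseline under made-up data, not the survey's recommended iterative methods.

```python
import numpy as np

# Toy data matrix with missing entries (NaN); values are illustrative.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, np.nan, 4.0],
              [3.0, 4.0, np.nan],
              [4.0, 5.0, 6.0]])

col_means = np.nanmean(X, axis=0)                 # per-column mean of observed values
X_filled = np.where(np.isnan(X), col_means, X)    # mean imputation
X_centered = X_filled - X_filled.mean(axis=0)     # center before PCA
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
scores = U * S                                    # principal component scores
print(scores.shape)  # (4, 3)
```

Mean imputation shrinks the imputed points toward the center and so underestimates the variance along the components, which is exactly why the survey compares it against iterative alternatives.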

Case study approach during data pre-processing
● Variables: position, date, fish type, additional and environmental parameters
● Added values:
  – environmental data from position (at lower frequency)
  – time series from date (e.g. sea-surface temperature)
● First approach: imputation – construct the missed environmental parameters from the lower-frequency data
● Second approach: replace the missed environmental parameters by the lower-frequency data
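The first approach above, constructing missing values from a lower-frequency source, can be sketched with linear interpolation: a coarse (e.g. weekly) sea-surface temperature series is resampled onto the dates where the high-frequency parameter is missing. The function name and all values are hypothetical.

```python
def interpolate_from_coarse(coarse_times, coarse_values, target_times):
    """Linearly interpolate a coarse series onto the target timestamps.
    Targets outside the coarse range are skipped."""
    out = []
    for t in target_times:
        # Find the pair of coarse samples bracketing t.
        for (t0, v0), (t1, v1) in zip(zip(coarse_times, coarse_values),
                                      zip(coarse_times[1:], coarse_values[1:])):
            if t0 <= t <= t1:
                w = (t - t0) / (t1 - t0)
                out.append(v0 + w * (v1 - v0))
                break
    return out

weekly_t = [0, 7, 14]               # days
weekly_sst = [10.0, 12.0, 11.0]     # coarse sea-surface temperature
daily_t = [3, 7, 10]                # dates with missing fine-grained values
print(interpolate_from_coarse(weekly_t, weekly_sst, daily_t))
```

The second approach would simply assign each target date the nearest coarse value instead of interpolating, which is cheaper but steps the series.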

Data re-construction vs data replacement

Time series vs single-point data

Predictive model built on new data

Automatic search for the best model: regression vs decision tree

Summary
● All-cases model: true positive rate (sensitivity, TPR) 0.58, accuracy 0.65
● Increased data volume (25% more data): TPR 0.67, accuracy 0.7
● Added data from different sources: TPR 0.67, accuracy 0.75
● Reconstructed missed values: TPR 0.72, accuracy 0.74
● Replaced data: TPR 0.72, accuracy 0.82
● Time series used: TPR 0.72, accuracy 0.82
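The two metrics reported above come straight from confusion-matrix counts; a minimal sketch (the counts below are made up, since the slide's raw confusion matrices are not available here):

```python
def tpr(tp, fn):
    """True positive rate (sensitivity): TP / (TP + FN)."""
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Illustrative counts only:
print(round(tpr(72, 28), 2))               # 0.72
print(round(accuracy(72, 92, 18, 28), 2))  # 0.78
```

Reporting both matters here: with 12 fish types and rare positives, accuracy alone can look good while the sensitivity to actual fish-school presence stays low.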

Results: probability vs location