The Ensemble Method and Model Comparison for Predictive Modeling with Big Data
Siyan Gan, Pepperdine University
Hyun Seo Lee, Azusa Pacific University
Emily Brown, Azusa Pacific University
Chong Ho Yu, Ph.D., D. Phil., Azusa Pacific University
Presented at the Southern California Big Data Discovery Summit, November 2017, Azusa, USA

Movement toward Big Data
The size of digital data doubles roughly every two years.
High volume: thousands of rows or columns, which can cause problems with data storage, data management, and data analysis.
High velocity: a non-stop data feed that can overwhelm a conventional database server.
High variety: many different types of data (e.g., numbers, text, images, audio files, video clips).
In short, big data means large samples and many variables, arriving fast and in large sizes (e.g., YouTube, Google).

Types of Data
Unstructured data: webpages and digital footprints on social media. Extracting data from websites is challenging to work with, but such data could be used by social science researchers for nationwide or cross-cultural studies.
Structured data: usually survey data, stored in a conventional row-by-column matrix.

Data Mining
Most social science researchers are trained in traditional Fisherian statistics (hypothesis testing). Most of the time it should not be used for big data analysis:
It shows inaccurate "significant" results; with a huge sample it is easy to obtain p < 0.05 as a false alarm.
It imposes strong assumptions on the data structure and the distribution; the data must be in a certain form to run the procedure, whereas real data are messy.
It lacks replicability; results can be verified only by other researchers in separate studies.
Data mining, by contrast, is data-driven, not hypothesis-driven.

Ensemble Methods in Big Data Analytics / Data Mining
A big data set is separated into many subsets and multiple analyses are run. In each run the model is refined by previous "training," so the results come from replicated studies.
Machine learning (based on artificial intelligence): the algorithm learns from previous analyses, repeating the analysis on each subset to increase accuracy; the machine finds new errors and adjusts to the new environment.
The ensemble method: merging multiple analyses. It compares, complements, and combines multiple methods in the analysis, yielding a better predictive outcome than any single analysis, as the sketch below suggests.
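A minimal sketch of "comparing and combining multiple analyses," on simulated data with assumed hyperparameters (this is an illustration, not the authors' actual setup): three different learners are scored separately and then merged, and the combined model typically predicts at least as well as any single one.

```python
# Hypothetical illustration: compare three analyses, then combine them.
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=800, n_features=8, noise=15.0, random_state=0)
models = [("ols", LinearRegression()),
          ("tree", DecisionTreeRegressor(max_depth=4, random_state=0)),
          ("knn", KNeighborsRegressor())]

# Score each analysis alone, then the merged ensemble (mean cross-validated R^2).
for name, model in models + [("ensemble", VotingRegressor(models))]:
    print(name, cross_val_score(model, X, y).mean().round(3))
```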

Boosting vs. Bagging
Boosting increases predictive accuracy by adjusting weak models so that they combine into a strong model. It is a gradual process: improvements are made on the errors found throughout the process until the model becomes strong.
Bagging (bootstrap aggregation) creates a working model from subsets of the original data set. Repeated multisets of additional training data are drawn from the original sample, which increases the size of the generated data and minimizes the variance of prediction by decreasing the influence of extreme scores. The subsets are run in parallel without informing each other, and at the end the algorithm puts them together.
Both are good; we ran both in our research. A sketch of the two approaches follows.
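A hedged side-by-side sketch with scikit-learn on synthetic data (the data set, features, and hyperparameters are made up for illustration): bagging averages trees grown in parallel on bootstrap resamples, while gradient boosting grows trees sequentially, each correcting the previous errors.

```python
# Illustrative comparison of bagging and boosting on simulated data.
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Bagging: trees fit in parallel on bootstrap resamples, then averaged.
bagging = BaggingRegressor(n_estimators=100, random_state=0)
# Boosting: trees fit one after another, each on the remaining errors.
boosting = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                     random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    model.fit(X_train, y_train)
    print(name, "validation R^2:", round(model.score(X_valid, y_valid), 3))
```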

Bagging as voting
Imagine that there are 1,000 analysts. Each one randomly draws a sub-sample from the big sample and then runs an analysis; the results are bound to be diverse. Now these 1,000 analysts meet together and vote: "How many of us found variable A to be a crucial predictor? Please raise your hand." Then they move on to B, C, and so on. At the end we have a list of the top 10 or top 20 predictors. The sketch below mimics this voting process.
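A minimal sketch of the voting metaphor, assuming simulated data and an arbitrary "top 3 per analyst" rule: each bootstrap resample gets its own analysis, and we count how often each variable surfaces as a top predictor.

```python
# Hypothetical "1,000 analysts" vote on which variables matter.
from collections import Counter

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
votes = Counter()

for _ in range(1000):                                  # 1,000 "analysts"
    idx = rng.integers(0, len(X), size=len(X))         # draw a bootstrap sub-sample
    tree = DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx])
    top = np.argsort(tree.feature_importances_)[-3:]   # this analyst's top 3
    votes.update(int(f) for f in top)                  # raise a hand for each

print("Votes per variable:", votes.most_common())      # the final top-variable list
```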

Boosting as gradual tuning
Imagine that you are a cook. You put spices into the dish and taste it. If it is not salty enough, you add more salt and pepper next time; but if next time it is too spicy, you use less hot sauce. You make a gradual improvement every time until you have the final recipe. Similarly, boosting is a gradual tuning process, as the minimal loop below illustrates.
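A from-scratch boosting loop, assuming squared-error loss and simulated data (a teaching sketch, not the software used in the study): each new tree "tastes the dish" by fitting the current residuals, and a small learning rate keeps every adjustment gradual.

```python
# Minimal residual-fitting boosting loop on simulated data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

prediction = np.zeros(len(y))
learning_rate = 0.1
for _ in range(100):
    residual = y - prediction                       # what is still "not salty enough"
    stump = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * stump.predict(X)  # a small correction each round

print("training R^2 after boosting:",
      round(float(1 - np.var(y - prediction) / np.var(y)), 3))
```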

Data Visualization
Presenting the results and unveiling previously undetected patterns.

PIAAC Study
The Programme for the International Assessment of Adult Competencies (PIAAC) was developed by the Organisation for Economic Co-operation and Development (OECD). In 2014, PIAAC collected data from 33 participating nations (OECD, 2016). U.S. adults were behind in all three test categories: literacy, numeracy, and problem solving in technology-rich environments. Survey items included factors related to learning: readiness to learn, cultural engagement, political efficacy, and social trust.

Variables
The scores of literacy, numeracy, and technology-based problem solving are strongly correlated (see the correlation matrix of literacy, numeracy, and problem solving). Principal component analysis (PCA), a data-reduction technique, shows that the three variables can be combined into one component (see the scree plot of the PCA of literacy, numeracy, and problem solving). This composite score was treated as the dependent variable; the survey items served as independent variables. There are debates on the two ensemble methods, so we ran both. A sketch of the PCA step follows.
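A hedged sketch of the data-reduction step described above, using simulated stand-ins for the three correlated skill scores (not the PIAAC data): PCA extracts one dominant component to serve as the composite dependent variable.

```python
# Combine three strongly correlated scores into one PCA component.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=1000)  # shared skill driving all three scores
scores = np.column_stack([base + rng.normal(scale=0.4, size=1000)
                          for _ in range(3)])
# columns stand in for literacy, numeracy, problem solving

pca = PCA(n_components=1)
composite = pca.fit_transform(StandardScaler().fit_transform(scores)).ravel()
print("variance explained by first component:",
      round(float(pca.explained_variance_ratio_[0]), 3))
# A scree plot of the eigenvalues would show one dominant component.
```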

Bagging vs. Boosting
Process: Bagging is a two-step procedure (resample, then aggregate); boosting is sequential.
Partitioning data into subsets: Bagging partitions at random; boosting gives misclassified cases a heavier weight.
Sampling method: Bagging samples with replacement; boosting samples without replacement.
Relations between models: Bagging is a parallel ensemble in which each model is independent of the others; in boosting, previous models inform subsequent models.
Goal: Bagging minimizes variance; boosting minimizes bias and improves predictive power.
Method of combining models: Bagging uses a majority vote (or simple average); boosting uses a weighted average.
Computing resources: Bagging is highly computing-intensive; boosting is less so.
The major difference: in bagging each analysis is independent of the others, whereas in boosting each analysis builds on the previous one.

Model Comparison

Subset type    Method           R2       RASE     AAE
No subset      OLS regression   0.1647   43.692   34.603
Training       Boosting         0.2058   42.708   34.031
Training       Bagging          0.4813   34.515   26.979
Validation     Boosting         0.1791   43.488   34.597
Validation     Bagging          0.1685   43.768   34.689

Bagging and boosting outperformed OLS regression in variance explained and error rate. In training, the bootstrap (bagging) method yielded an overfitted model: its R2 is unreasonably high. In validation, the boosted tree model outperformed the bagging approach (higher variance explained and lower error). Models are compared on R-square, RASE (root average squared error), and AAE (average absolute error); the training set is used to propose a model, and the validation set to confirm it.
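A small sketch of the three comparison metrics, computed here as their usual definitions (root average squared error and average absolute error) on made-up actual and predicted values:

```python
# Compute R^2, RASE, and AAE for a set of predictions.
import numpy as np

def fit_metrics(actual, predicted):
    resid = actual - predicted
    r2 = 1 - np.sum(resid**2) / np.sum((actual - actual.mean())**2)
    rase = np.sqrt(np.mean(resid**2))   # root average squared error
    aae = np.mean(np.abs(resid))        # average absolute error
    return round(float(r2), 4), round(float(rase), 3), round(float(aae), 3)

actual = np.array([250.0, 270.0, 300.0, 310.0, 330.0])     # illustrative scores
predicted = np.array([260.0, 265.0, 290.0, 320.0, 325.0])  # illustrative fits
print(fit_metrics(actual, predicted))
```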

OLS Regression Result

Predictor                                      Estimate   Std. Error   t Ratio   p
Relate new ideas into real life                 13.07       0.85        15.32    <.0001*
Like learning new things                         1.93       1.02         1.89    0.0595
Attribute something new                          1.54       0.98         1.56    0.1180
Get to the bottom of difficult things            1.80       0.91         1.96    0.0497*
Figure out how different ideas fit together     -3.46       0.96        -3.61    0.0003*
Looking for additional info                      0.56       0.95         0.59    0.5576
Voluntary work for non-profit organizations      4.50       7.97
No influence on the government                  -3.08       0.53        -5.85
Trust only few people                           -3.57       0.61        -5.84
Other people take advantage of you              -3.28       0.73        -4.50

The problem with regression on a sample this large: nearly everything comes out significant.
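An illustrative OLS run showing how a table of estimates, standard errors, t ratios, and p values like the one above is produced; the data are simulated, and with a big sample even tiny effects can look "significant."

```python
# Simulated OLS regression with statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))   # four hypothetical survey predictors
y = 13 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=40, size=5000)

result = sm.OLS(y, sm.add_constant(X)).fit()
print(result.summary())          # estimates, SEs, t ratios, p values
```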

Final Boosted Tree Model for the USA sample
PIAAC also includes many survey items that are believed to be related to those learning outcomes; eight of them were chosen as independent variables.
Top predictors: cultural engagement (voluntary work for non-profit organizations), social trust (other people take advantage of you), and readiness to learn (like learning new things). A sketch of extracting such a ranking follows.
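A hypothetical sketch of how a boosted tree model ranks predictors by importance; the variable names are placeholders echoing the survey items, and the data are simulated.

```python
# Rank predictors by importance in a boosted tree model.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

names = ["voluntary work", "take advantage of you", "like learning new things",
         "political efficacy", "trust only few people", "relate new ideas",
         "difficult things", "additional info"]
X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=1)

model = GradientBoostingRegressor(random_state=1).fit(X, y)
for name, imp in sorted(zip(names, model.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")   # top predictors first
```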

Median smoothing plots
[Figures] Learning outcomes and social trust in the US sample; learning outcomes and cultural engagement in the US sample; learning outcomes and readiness to learn in the US sample.
Visualization shows that some of these relationships are non-linear, while others are close to linear.
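A sketch of one way to draw a median smoothing plot with pandas and matplotlib: a rolling median of the outcome along a sorted predictor can reveal non-linear patterns that a straight regression line would hide. The variables here are simulated stand-ins for the PIAAC items.

```python
# Rolling-median smoother over a scatterplot of simulated data.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
trust = rng.uniform(1, 5, size=2000)
outcome = 280 + 15 * np.sin(trust) + rng.normal(scale=20, size=2000)

df = pd.DataFrame({"trust": trust, "outcome": outcome}).sort_values("trust")
df["smoothed"] = df["outcome"].rolling(window=101, center=True).median()

plt.scatter(df["trust"], df["outcome"], s=4, alpha=0.2)
plt.plot(df["trust"], df["smoothed"], linewidth=2)
plt.xlabel("social trust")
plt.ylabel("learning outcome (composite)")
plt.show()
```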

Discussion
Method choice and model goodness should be assessed on a case-by-case basis: run both bagging and boosting, then choose the best result according to the model comparison.
Big data analytics addresses the problems of hypothesis testing by relying on model building and data visualization.
When the ensemble method, model comparison, and data visualization are employed side by side, interesting patterns and meaningful conclusions can be drawn from a big data set.