Copyright © 2015 KDDI R&D Labs. Inc. All Rights Reserved


Approach to Generate a Vast Variety of Features for Predicting Dropout in MOOC
Takuya Akiyama, Kei Yonekawa, Aakansh Gupta, Nuo Zhang, Shigeki Muramatsu, Rui Kimura, Nobuyuki Maita, Yujin Tang, Takafumi Watanabe, Akihiro Kobayashi, Kazunori Matsumoto, and Keiichi Kuroyanagi

The Final Results

"KDDILABS&Keiku" Members

Account Name  | Full Name           | Affiliation
t.MF          | Akiyama, Takuya     | KDDI R&D Laboratories, Inc.
Aakansh       | Gupta, Aakansh      | Uhuru Corporation
NoahZh        | Zhang, Nuo          |
kyone         | Yonekawa, Kei       |
mz-matsumoto  | Matsumoto, Kazunori |
mura          | Muramatsu, Shigeki  |
ruik          | Kimura, Rui         |
no6est        | Maita, Nobuyuki     |
Yujin         | Tang, Yujin         |
Keiku         | Kuroyanagi, Keiichi | Financial Engineering Group, Inc.
TakWat        | Watanabe, Takafumi  |
apf-koba      | Kobayashi, Akihiro  |
(Members without a listed affiliation were working at the KDDILABS office.)

What is "KDDI"?
Knowledge Discovery and Data mining Institute...? NO! KDDI is a Japanese telecommunications company; the acronym stands for Japanese words. There is no relation between KDD 2015 and KDDI, so we did NOT cheat.

System Overview
About 2,000 features are built from the original data and fed to three models: XGBoost†, Regularized Greedy Forest, and a Deep Neural Network (bagging of 200 models). Their predictions are blended into the submitted data. Our special twist is "strategic" feature engineering.
† http://dmlc.github.io/
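The blending step above can be sketched as a weighted average of per-model dropout probabilities. This is a minimal illustration, not the team's actual code; the weights and scores are assumed values.

```python
# Sketch of the final blending step: predictions from XGBoost, Regularized
# Greedy Forest, and a bagged DNN are combined into one submission score.
# The blend weights here are illustrative assumptions, not the team's values.
import numpy as np

def blend(pred_xgb, pred_rgf, pred_dnn, weights=(0.5, 0.3, 0.2)):
    """Weighted average of per-model dropout probabilities."""
    preds = np.stack([pred_xgb, pred_rgf, pred_dnn])
    w = np.asarray(weights).reshape(-1, 1)
    return (preds * w).sum(axis=0) / w.sum()

# Example: three models scoring the same three enrollment IDs (eIDs).
blended = blend(np.array([0.9, 0.1, 0.5]),
                np.array([0.8, 0.2, 0.4]),
                np.array([0.7, 0.3, 0.6]))
```

In practice the weights themselves would be tuned on a validation fold rather than fixed by hand.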

At First...
Each member started the KDD Cup separately and created features separately. In late June, we merged into one team. The number of features: over 1,500 "Basic Features".

Examples of the Basic 1,500 Features
[Figure: a timeline of one eID's access logs, illustrating features such as the number of logs in a window, the lag between logs, and the time to the target prediction interval.]
All features have variations with respect to labels such as time window, category, event, source, or their combinations.
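One family of such basic features can be sketched as counting an eID's access logs inside several time windows before the target prediction interval. The log timestamps, window sizes, and feature names below are assumptions for illustration only.

```python
# Minimal sketch of window-based "basic features" for one enrollment ID:
# log counts over several look-back windows, plus a lag-style feature.
from datetime import datetime, timedelta

logs = [datetime(2015, 6, d, h) for d, h in
        [(1, 9), (1, 14), (3, 10), (8, 21), (9, 7)]]  # one eID's access times
interval_start = datetime(2015, 6, 10)  # start of target prediction interval

features = {}
for days in (1, 3, 7, 30):  # assumed window sizes
    window_start = interval_start - timedelta(days=days)
    features[f"logs_last_{days}d"] = sum(window_start <= t < interval_start
                                         for t in logs)
# Lag-style feature: hours from the last log to the prediction interval.
features["hours_since_last_log"] = (
    (interval_start - max(logs)).total_seconds() / 3600)
```

Varying the window, event type, and source in the same way multiplies a handful of counting rules into hundreds of features.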

ROC Curve & Predicted Value Distribution
[Figure: ROC curve and predicted-value densities for dropout vs. non-dropout eIDs, from 10-fold cross-validated XGBoost predictions on the training data.]
Why can't we predict the "lower right" eIDs accurately? These eIDs do not have enough logs (in some cases only one), yet they did not drop out of the course. With so few logs, it is hard to predict their dropout probability from the basic features.
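The AUC metric behind this plot has a simple probabilistic reading: it is the chance that a randomly chosen dropout eID scores higher than a randomly chosen non-dropout eID. A self-contained sketch (the scores below are made-up examples, not competition data):

```python
# AUC as a rank statistic: fraction of (dropout, non-dropout) pairs where
# the dropout eID receives the higher predicted probability (ties count 0.5).
def auc(scores_pos, scores_neg):
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

dropout_scores = [0.9, 0.8, 0.35]      # predicted dropout probabilities
non_dropout_scores = [0.4, 0.3, 0.1]   # hard "lower right" eIDs drag AUC down
score = auc(dropout_scores, non_dropout_scores)
```

Misranked low-log eIDs contribute losing pairs, which is exactly why the team targeted them with new features.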

Our Strategy & Features
Create features that do NOT depend on the number of logs. We created such features by three kinds of methods:
① Aggregating "cross-course" logs
② Using ideas from recommender systems
③ Using time-series prediction

① Aggregating "Cross-Course" Logs
Idea: about 1/3 of users attend multiple courses (38,939 of 112,448 users).

Active course count:  1       2       3      4      5     ...  39
Number of users:      73,509  20,251  8,237  4,118  2,277

It is therefore effective to create features from the logs of not only the target course but also the user's other active courses.

① Aggregating "Cross-Course" Logs: Method A
Some users may have enrolled in multiple courses at once and attended them one by one. Suppose user a has only one log in Course A, but many logs in Course B and Course C, and few logs in any course during Course A's target prediction interval: there is still a high probability that the user is attending Course A in that period. Features: count the number of logs, unique days, and unique courses with logs across all of the user's courses, over a moving time window (window size: 5 days).
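Method A can be sketched as a single pass over a user's (course, day) log pairs with a moving window. The log data below is an assumed example; only the three counting rules come from the slide.

```python
# Sketch of method A: in a 5-day window, count a user's logs, unique active
# days, and unique courses across ALL enrolled courses, so the feature does
# not depend on logs of the target course alone.
from datetime import date, timedelta

# (course, day) pairs for one user; values are illustrative assumptions.
user_logs = [("B", date(2015, 6, 1)), ("B", date(2015, 6, 2)),
             ("C", date(2015, 6, 2)), ("A", date(2015, 6, 4)),
             ("C", date(2015, 6, 9))]

def window_features(logs, start, days=5):
    end = start + timedelta(days=days)
    in_win = [(c, d) for c, d in logs if start <= d < end]
    return {"n_logs": len(in_win),
            "n_days": len({d for _, d in in_win}),
            "n_courses": len({c for c, _ in in_win})}

feats = window_features(user_logs, date(2015, 6, 1))
```

Sliding the window start day by day yields one feature vector per position.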

① Aggregating "Cross-Course" Logs: Method B
If there is a relationship between the target course and another course, logs of the target course may appear near logs of the other course: when user a has logs of a related Course B near Course A's target prediction interval, there is a high probability that logs of Course A exist nearby. Two steps:
1. Build a matrix of interrelationships between all courses: the transition probability from a log of one course to a log of another.
2. For the prediction interval, compute the sum of products of each course's log counts and its interrelationship with the target course.
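The two steps above can be sketched as follows. The tiny log sequences are assumed examples; the slide specifies only "transition probability" and "sum of products", so the exact estimator here is one plausible reading.

```python
# Sketch of method B: (1) estimate a course-to-course transition matrix from
# consecutive logs in user sequences, (2) score a user's interval logs by
# their transition probability into the target course.
from collections import Counter

def transition_matrix(log_sequences):
    """P(next log is course j | current log is course i)."""
    counts = Counter()
    for seq in log_sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    totals = Counter()
    for (a, _), n in counts.items():
        totals[a] += n
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}

P = transition_matrix([["A", "B", "A", "B"], ["B", "A", "C"]])

def relatedness_feature(interval_logs, target, P):
    # Sum of (log count in interval) x (transition prob. into target course).
    counts = Counter(interval_logs)
    return sum(n * P.get((c, target), 0.0) for c, n in counts.items())

score = relatedness_feature(["B", "B", "C"], "A", P)
```

A high score means the user's other-course activity in the interval often leads into the target course, even if the target course itself has no logs there.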

② Using Ideas from Recommender Systems
We also want features that do NOT use the user's own logs. Other users who enrolled in a course pattern similar to the user's are useful here. How to create features: use collaborative filtering, the technique commonly used for recommendations on e-commerce sites and search engines.

② Using Ideas from Recommender Systems
Collaborative filtering:
1. Calculate similarities between the user and other users by comparing their active-course patterns (○ = enrolled, × = not enrolled).
2. Estimate the user's unknown feature value (here, the number of logs) as a weighted average over other users whose similarity exceeds a threshold.

Example: for user A, with similarities B = 0.7, C = 0.2, D = 0.8 and log counts B = 130, D = 50, the estimate is (130 × 0.7 + 50 × 0.8) / (0.7 + 0.8) ≈ 87, suggesting the user may continue to attend this course.
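The weighted average from the slide's example can be sketched directly; the similarity threshold of 0.5 is an assumption chosen so that only B and D contribute, as in the slide.

```python
# Sketch of the collaborative-filtering feature: a user's missing value for
# a course is the similarity-weighted average over sufficiently similar
# neighbors. Threshold 0.5 is assumed to match the slide's example.
def cf_estimate(neighbors, threshold=0.5):
    """neighbors: list of (similarity_to_user, feature_value) pairs."""
    kept = [(s, v) for s, v in neighbors if s > threshold and v is not None]
    return sum(s * v for s, v in kept) / sum(s for s, _ in kept)

# From the slide: B (sim 0.7, 130 logs), C (sim 0.2, below threshold),
# D (sim 0.8, 50 logs).
estimate = cf_estimate([(0.7, 130), (0.2, 30), (0.8, 50)])  # ~87
```

In the full pipeline the similarities themselves would come from comparing binary enrollment vectors, e.g. by cosine similarity.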

③ Using Time-Series Prediction
Idea: is there a consistent trend in the number of unique users who attend the courses on specific days? If we know (a) the number of unique users who will attend the course during the dropout-judgment period and (b) an ordering of users by how likely they are to attend in that period, we can see the boundary between dropout and non-dropout users. How to create features:
1. Use ARIMA, which is often used in financial prediction and telecommunication-traffic prediction.
2. Predict the unique users in the judgment period from the transition of unique users in each time window (10 days).
3. Rank users according to the most useful feature values from our earlier dropout-prediction system.

③ Using Time-Series Prediction
[Figure: transitions of unique users per 10-day window, showing actual values and an ARIMA prediction of the number of unique users attending the course in days 31–40.]
Users are ranked by a specific feature value (for example, the number of logs), and the rank is normalized by the predicted number of unique users (here, 1,000):

Username  Rank by feature value  Normalized value
a         1                      0.001
b         1000                   1
c         2000                   2
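The forecast-then-normalize step can be sketched as follows. The team used ARIMA; to keep this snippet dependency-free, a naive linear-trend forecast stands in for it, and the 10-day window counts are assumed values chosen to reproduce the slide's table.

```python
# Sketch of method C with a linear-trend stand-in for ARIMA: predict unique
# users in the judgment period (days 31-40), then normalize user ranks by
# that prediction. Ranks <= 1 after normalization fall inside the predicted
# attendee set; ranks > 1 fall outside (likely dropouts).
def linear_trend_forecast(series):
    """Extrapolate one step ahead from the last two observations."""
    return series[-1] + (series[-1] - series[-2])

# Unique users per 10-day window: days 1-10, 11-20, 21-30 (assumed values).
history = [5000, 3000, 2000]
predicted_unique = linear_trend_forecast(history)  # forecast for days 31-40

def normalized_rank(rank, predicted_unique):
    return rank / predicted_unique

values = {name: normalized_rank(r, predicted_unique)
          for name, r in [("a", 1), ("b", 1000), ("c", 2000)]}
```

With a forecast of 1,000 unique users, user a (rank 1) maps to 0.001, b (rank 1000) to 1, and c (rank 2000) to 2, matching the slide's table.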

Results
[Figure: predicted-value distributions for dropout and non-dropout eIDs, before and after adding the new features.]
Predictions for the "lower right" eIDs are improved.

Results
The final AUC becomes 0.90756; the final private score is 0.90597.

Miscellaneous
In this competition, we did not use features created from the ground-truth labels of the training set because we were afraid of over-fitting to it. This may have restricted more flexible ideas and be why we ranked no higher than 6th. Creating a wide variety of useful features was important. Of course, the choice of the three models (XGBoost, Regularized Greedy Forest, and bagged deep learning) was also important, so we really appreciate the authors of the models and libraries we used.

Thanks for your attention.