Video Ad Mining for Predicting Revenue using Random Forest

Yuan-Hsin Huang Purdue University

Executive Summary
Data Overview
YouTube API Limitation
Variables of Predictive Model
Model Summary
Predictor: Customer Segmentation
Predictors: Sentiment Analysis
Future Work

Data Summary
The model uses four quarters of data, 39 videos in total, to predict The Coca-Cola Company’s net operating revenue in North America. It adopts a two-layer data framework: the first layer is each individual video and the second is all videos combined. Sentiment data totals 34,378 records, keyed by individual user account (several accounts overlap).

YouTube API (version 3) Limitation
The general quota limit is 50,000 requests per project per day. In practice, the data for each video can be requested only 500 times before the server disconnects. If a video has more than 500 records, overlapping records are repeated in the dataset.
(Figure: General Quota Limits)
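The repeated records described above can be filtered out when the per-request results are merged. The following is a minimal sketch, assuming each record carries a unique identifier; the field name `id`, the helper `merge_pages`, and the sample records are hypothetical and are not part of the original pipeline.

```python
def merge_pages(pages, key="id"):
    """Merge per-request result lists, keeping the first copy of each record."""
    seen = set()
    merged = []
    for page in pages:
        for record in page:
            record_id = record[key]
            if record_id in seen:
                continue  # overlapping record repeated by a later request
            seen.add(record_id)
            merged.append(record)
    return merged


# Two hypothetical requests whose results overlap on record "c2".
pages = [
    [{"id": "c1", "text": "love it"}, {"id": "c2", "text": "nice"}],
    [{"id": "c2", "text": "nice"}, {"id": "c3", "text": "meh"}],
]
comments = merge_pages(pages)
```

Keeping the first occurrence preserves the order in which records were retrieved, which makes it easy to compare the merged set against the raw downloads.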

Elements of Predictive Model
This forecasting mechanism uses a random forest to test predictive power with 7 numerical predictors and 1 categorical predictor. Using the channels each user subscribes to, the model applies the two-step clustering method to find the optimal number of viewer segments. In addition, English-language text comments are used to generate a polarity and its degree.


Summary of First Prototype (1/5)
Data time period: 2012 Q1, 2013 Q1, 2014 Q1, 2015 Q1
Number of videos: 17
Predictive method: Random Forest. Only 20% to 30% of revenue can be explained by this model because of the small number of videos in the pretest.
Pseudo R-squared = 27.02%
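The slide does not say how the pseudo R-squared was computed. A common definition for regression forests is 1 - MSE / Var(y), and the sketch below assumes that definition. The function name and the sample numbers are illustrative only and are not the project's data.

```python
def pseudo_r_squared(actual, predicted):
    """Share of variance explained: 1 - MSE / Var(y), population variance."""
    n = len(actual)
    mean_y = sum(actual) / n
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    variance = sum((a - mean_y) ** 2 for a in actual) / n
    return 1.0 - mse / variance


# Illustrative values, not revenue figures from the study.
actual = [10, 12, 14, 16]
predicted = [11, 12, 13, 17]
r2 = pseudo_r_squared(actual, predicted)
```

Under this definition a value of 27.02% means the predictions remove about a quarter of the variance in revenue relative to always predicting the mean.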

Each tree uses a different bootstrap sample, with 25% of the samples held out for testing. The out-of-bag error of 80.23% implies roughly 20% out-of-sample accuracy on the training set.
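As an illustration of how a bootstrap sample leaves some rows out-of-bag, here is a minimal sketch of standard bagging, which draws as many rows as the dataset contains, with replacement. It does not reproduce the 25% figure above, which presumably reflects a different sampling setting. The function name and the seed are assumptions.

```python
import random


def bootstrap_split(n_rows, rng):
    """Draw n_rows indices with replacement; rows never drawn are out-of-bag."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_rows) if i not in drawn]
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so the split is repeatable
in_bag, out_of_bag = bootstrap_split(17, rng)  # 17 videos in the pretest
```

Each tree is fitted on its in-bag rows and evaluated on its out-of-bag rows, so the out-of-bag error acts as a built-in validation estimate without a separate test set.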

Video length is apparently of no significant importance. Traffic-related variables are significant for the model, including the predictors Like, Dislike, viewer counts, and all comments. The number of viewer groups (variable: Seg) is also important for prediction, which implies that viewer segmentation affects video performance. Although the lexicon for automatic polarity judgment is still immature, the current pretest already shows that polarity (variable: polarity) has an important predictive effect on revenue.

Forest Structure:

This mechanism applies Root Mean Square Error (RMSE) and a residual plot to verify the model assumptions. The Q-Q plot shows that the errors are normally distributed, with no major issues.
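For reference, RMSE and the residuals that feed a residual or Q-Q plot can be computed as in the sketch below. The function names and sample values are illustrative and are not taken from the study.

```python
import math


def residuals(actual, predicted):
    """Per-observation error: actual minus predicted."""
    return [a - p for a, p in zip(actual, predicted)]


def rmse(actual, predicted):
    """Root mean square error of the predictions."""
    errors = residuals(actual, predicted)
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Illustrative values only.
actual = [3, 5, 7, 9]
predicted = [2, 5, 8, 11]
errors = residuals(actual, predicted)
score = rmse(actual, predicted)
```

Sorting the residuals and plotting them against normal quantiles gives the Q-Q plot mentioned above.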

Customer Segmentation
Channel categories follow YouTube’s own scheme (17 categories). Games, Music, and Entertainment are the top 3 viewer preferences for each video. Viewers are segmented by their preference and by the size of their social network influence.

Data Processing (for one film)
1. Keep only viewers who subscribe to channels (2012: 71%, 2013: 53%, 2014: 64%, 2015: 69%).
2. Take the most frequent category among a user’s subscribed channels as that user’s preference.
3. Run two-step clustering on subscription category, total upload views, and subscription counts. The method has two steps: 1) pre-cluster the cases (or records) into many small sub-clusters; 2) cluster those sub-clusters into the desired number of clusters. It can also select the number of clusters automatically.
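Steps 1 and 2 above can be sketched as follows. This is a minimal illustration, assuming each viewer is represented by the list of categories of the channels they subscribe to. The function name and sample data are hypothetical, and ties are resolved in favor of the category seen first, which the slides do not specify.

```python
from collections import Counter


def preferred_category(subscribed_categories):
    """Return the most frequent category, or None if there are no subscriptions."""
    if not subscribed_categories:
        return None  # viewers without subscriptions are dropped in step 1
    return Counter(subscribed_categories).most_common(1)[0][0]


# Hypothetical viewers and the categories of their subscribed channels.
viewers = {
    "user_a": ["Music", "Games", "Music", "Entertainment"],
    "user_b": [],
    "user_c": ["Games"],
}
preferences = {
    name: preferred_category(cats)
    for name, cats in viewers.items()
    if cats  # step 1: keep only viewers who subscribe to channels
}
```

The resulting preference becomes the categorical input to the clustering in step 3, alongside the two numerical inputs.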

Cluster Quality
Cluster quality is tested with the likelihood distance measure and Schwarz’s Bayesian criterion. The first quarter of every year shows good cluster quality.
(Figure panels: 2012 Q1, 2015 Q1)

Sentiment Analysis
Sentiment analysis generates two variables for the predictive analysis: Feature Scoring and Polarity. In the current experimental results, feature scoring had no impact but polarity did. Building on the ROLEX-SP framework, this framework adds logical and neutral words to the judgment system rather than using only positive and negative lexical syntactic patterns (LSPs). For example, “not” is a logical word that reverses the entire sentiment tendency, and “feel like” is a neutral term that takes on the polarity of the words that follow it. In the pretest, the lexicon was coded from the words in the two 2012 Q1 films (651 comment records), and it appears to work well on the other videos. Brand is the most frequent feature across all pretest videos.
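The role of logical and neutral terms can be illustrated with a toy scorer. This is a sketch under stated assumptions and not the project's lexicon or scoring rule. The word lists are invented for the example, a logical word flips the sign of the whole sentence, and a neutral phrase is dropped so that only the words around it decide the polarity.

```python
# Hypothetical mini-lexicon; the real lexicon was coded from 651 comments.
POSITIVE = {"love", "like", "great", "good"}
NEGATIVE = {"hate", "bad", "boring"}
LOGICAL = {"not"}                # reverses the sentiment tendency
NEUTRAL_PHRASES = ("feel like",)  # carries no polarity of its own


def sentence_polarity(sentence):
    """Score one sentence: >0 positive, <0 negative, 0 neutral."""
    text = sentence.lower()
    for phrase in NEUTRAL_PHRASES:
        text = text.replace(phrase, " ")  # drop neutral terms before scoring
    words = [w.strip(".,!?") for w in text.split()]
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    if any(w in LOGICAL for w in words):
        score = -score  # a logical word flips the whole sentence
    return score


scores = [
    sentence_polarity("I like this ad"),
    sentence_polarity("This is not good"),
    sentence_polarity("I feel like this is boring"),
]
```

Without the neutral-phrase step, the third sentence would score zero, because "like" would be counted as positive and cancel out "boring".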

LSP Generator (1/3)
Goal: extract positive, negative, logical, and neutral lexical syntactic patterns (LSPs).
Input: lexicon, 2 videos
Lexicon: Feature Index and Tendency LSP
Feature Index: Brand, Icon, Image, Music, Story_Festival, Story_Theme
Tendency LSP: Positive, Negative, Logical, Neutral LSP
Two videos: the 2 videos from 2012 Q1 (651 comments in total)
Output: Positive, Negative, Neutral

LSP Generator (2/3)
Method: apply the following instructions.
1. Use the lexicon to search for words.
(1) The system first catches the words in the feature index and then searches for polarity words.
(2) If the system cannot catch any feature words, it looks for polarity words only.
2. Build a frequency table for further judgment.
(1) Judge each viewer’s comment by analyzing every sentence in it.
(2) The system uses the frequency of each LSP to set its degree in expressing the opinion.
(3) Use Chi-Square scores to normalize each film’s feature and polarity.
Formula:

locations, etc.).

Next
Add film data from the other quarters of 2012-2014 to the model
Enlarge the lexicon to make the judgment more accurate

Thank You
