Published by Lynette Chapman. Modified over 9 years ago.
1
TTI's Gender Prediction System using Bootstrapping and Identical-Hierarchy
Mohammad Golam Sohrab
2015.05.20
Computational Intelligence Laboratory, Toyota Technological Institute
2
Outline
Introduction
  Original dataset
Session Augmentation
  Unique IDs Decomposing
  Identical-Hierarchy
  Context window
Text to vector representation
  Binary weighting
Bootstrapping approach
3
Introduction
Training and Test Dataset
A single product viewing log is composed of four columns:
  u10001,2014-11-14 00:02:14,2014-11-14 00:02:20,A00001/B00001/C00001/D00001/
  u10001: Session ID
  2014-11-14 00:02:14: session.startTime -> features
  2014-11-14 00:02:20: session.endTime -> features
  A00001/B00001/C00001/D00001/: Unique ID features
Training and Test Dataset: 15,000 (labeled), 15,000 (unlabeled)
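As a rough illustration of the four-column layout, the following Python sketch parses one log line. The `Session` class, the `parse_log_line` name, and the timestamp format string are assumptions made for this example (the format is inferred from the sample line), and the sketch assumes a single category path per line, as in the sample.

```python
from dataclasses import dataclass
from datetime import datetime

# Assumed from the sample timestamps on the slide.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


@dataclass
class Session:
    """Hypothetical container for one product viewing log line."""
    session_id: str
    start_time: datetime
    end_time: datetime
    category_path: list  # e.g. ["A00001", "B00001", "C00001", "D00001"]


def parse_log_line(line: str) -> Session:
    # Split into exactly four columns; the last one holds the ID path.
    session_id, start, end, path = line.strip().split(",", 3)
    # The path ends with a trailing "/", so drop empty pieces.
    ids = [part for part in path.split("/") if part]
    return Session(
        session_id,
        datetime.strptime(start, TIME_FORMAT),
        datetime.strptime(end, TIME_FORMAT),
        ids,
    )
```

The two timestamps are kept as `datetime` objects so that later steps can compare session start and end times directly.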
4
Session Augmentation Process
Step 1: Session augmentation using unique IDs decomposition
Step 2: Session augmentation using Identical-Hierarchy
Step 3: Session augmentation by generating history based on a context window
  Session [i-2], Session [i-1], Session [i], Session [i+1], Session [i+2]
5
Session Augmentation: Unique IDs Decomposing
Recall: Training data
  u10001,2014-11-14 00:02:14,2014-11-14 00:02:20,A00001/B00001/C00001/D00001/
To generate the text-to-vector representation, each Unique ID can be decomposed into features using different combinations:
  A00001/B00001/C00001/D00001
  Uni-gram, Bi-gram, Tri-gram, Unique
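One possible reading of the uni-gram, bi-gram, and tri-gram decomposition is sketched here in Python. The slide does not say whether the combinations are contiguous, so this example assumes contiguous n-grams over the category path; the function name `id_ngrams` and the `_` joiner are illustrative choices, not taken from the slides.

```python
def id_ngrams(path, max_n=3, sep="/", joiner="_"):
    """Return contiguous 1..max_n-grams over the IDs in a category path.

    Assumption: "combinations" means contiguous n-grams along the path.
    """
    ids = [part for part in path.split(sep) if part]
    features = []
    for n in range(1, max_n + 1):
        for start in range(len(ids) - n + 1):
            features.append(joiner.join(ids[start:start + n]))
    return features
```

For a four-level path this yields four uni-grams, three bi-grams, and two tri-grams.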
6
Unique IDs Decomposing (cont.)
Text to vector representation: Uni-gram
A distribution of unique product IDs in the data is decomposed into eight different features
For each Unique ID A00001/B00001/C00001/D00001:
  A00001, B00001, C00001, D00001, A00001-label, B00001-label, C00001-label, and D00001-label
Adding more features
7
Session Augmentation: Identical-Hierarchy
First: Generate hierarchy
A category hierarchy of A00001/B00001/C00001/D00001:
  A00001 -> B00001
  B00001 -> C00001
  C00001 -> D00001
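The parent-child pairs on this slide could be collected from a set of category paths roughly as follows. This is a minimal sketch: `build_hierarchy` is a hypothetical helper, and the `parent` map assumes each ID has a single parent, which the slides do not state explicitly.

```python
from collections import defaultdict


def build_hierarchy(paths):
    """Collect parent -> children and child -> parent maps from ID paths."""
    children = defaultdict(set)
    parent = {}
    for path in paths:
        ids = [part for part in path.split("/") if part]
        # Each adjacent pair in the path is one parent-child edge.
        for upper, lower in zip(ids, ids[1:]):
            children[upper].add(lower)
            parent[lower] = upper  # assumption: one parent per node
    return children, parent
```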
8
Second: Determining the Identical-Hierarchy
Identical categories
  Product IDs that appear in only one category (female or male)
  Compute the class space density in the female category
  Compute the class space density in the male category
Identical-Hierarchy
  The complete parent- and child-list of a given identical category
  Identical-hierarchies are extracted from the training data
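The slide does not define the class space density computation, so the following Python sketch substitutes a simpler presence test: an ID counts as an identical category if every labeled training path containing it carries the same gender label. The function name and the label strings are assumptions for illustration only.

```python
from collections import defaultdict


def identical_categories(labeled_paths):
    """Map each ID seen under exactly one label to that label.

    labeled_paths: iterable of (path, label) pairs from the training data.
    Stand-in for the density-based test, which the slides do not spell out.
    """
    labels_by_id = defaultdict(set)
    for path, label in labeled_paths:
        for category_id in path.split("/"):
            if category_id:
                labels_by_id[category_id].add(label)
    return {
        category_id: next(iter(labels))
        for category_id, labels in labels_by_id.items()
        if len(labels) == 1
    }
```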
9
Example Hierarchy
Top nodes: A00001, A00002, ..., A00011
Intermediate nodes: B00001, B00002, B00003, ..., B00091
Intermediate nodes: C00001, C00002, C00003, ..., C00091, ..., C00441
Leaf nodes: D00001, D00002, D00003, ..., D00091, ..., D36121, D36122
Training: 22,440 hierarchies
Test: 22,304 hierarchies
Training + Test: 36,731 hierarchies
10
Session Augmentation: Identical-Hierarchy
Motivation
  Augment the training and test data with more features
  Why???
  Exchange information between training and test using the identical-hierarchy
  How???
11
Analyze: Training Data based on hierarchy
A00001/B00001/C00001/D00001
A: Most General Categories, A00001 – A00011 (Appear: All, Missing: 0)
B: Sub-categories, B00001 – B00091 (Appear: 86, Missing: 5)
C: Sub-subcategories, C00001 – C00441 (Appear: 383, Missing: 58)
D: Individual Products, D00001 – D36122 (Appear: 21880, Missing: 14242)
12
Analyze: Test Data based on hierarchy
A00001/B00001/C00001/D00001
A: Most General Categories, A00001 – A00011 (Appear: All, Missing: 0)
B: Sub-categories, B00001 – B00091 (Appear: 84, Missing: 7)
C: Sub-subcategories, C00001 – C00441 (Appear: 392, Missing: 49)
D: Individual Products, D00001 – D36122 (Appear: 21739, Missing: 14383)
13
Building Combined Hierarchy: Training + Test
A00001/B00001/C00001/D00001
A: Most General Categories, A00001 – A00011 (Appear: All, Missing: 0)
B: Sub-categories, B00001 – B00091 (Appear: 91, Missing: 0)
C: Sub-subcategories, C00001 – C00441 (Appear: 440, Missing: 1)
D: Individual Products, D00001 – D36122 (Appear: 36092, Missing: 30)
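Appear/Missing counts of this kind can be computed by comparing the observed IDs at one level against the full ID range for that level. The Python sketch below assumes five-digit, 1-based IDs, as in the examples on the slides; `coverage` is a hypothetical helper name.

```python
def coverage(observed_ids, prefix, max_index, width=5):
    """Return (appear, missing) for one hierarchy level.

    observed_ids: IDs seen in the data (any level mixed together).
    prefix: level letter, e.g. "B"; max_index: last index in the range.
    """
    expected = {f"{prefix}{k:0{width}d}" for k in range(1, max_index + 1)}
    appear = expected & set(observed_ids)
    return len(appear), len(expected) - len(appear)
```

For the combined hierarchy, the observed set would be the union of the training and test IDs, for example `coverage(train_ids | test_ids, "C", 441)`.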
14
Identical-Hierarchy based on Combined Hierarchy
Parent- and child-list of identical categories starting with the letter ‘B’
  Parent: A00003; identical category: B00008; children: C00026, C00288, C00305
Parent- and child-list of identical categories starting with the letter ‘C’
  Parent: B00007; identical category: C00025; children: D00889, D00892, D01583, D30012, D33674
15
Why???
Parent-child pairs:
  B00007 -> C00025
  C00025 -> D00089
  C00025 -> D00892
  C00025 -> D01583
  C00025 -> D30012
  C00025 -> D33674
Hierarchy: B00007 -> C00025 -> D00889, D00892, D01583, D30012, D33674
Legend: Appears in Training, Appears in Test
16
Adding Identical Categories from ‘B’
Original session: A00003/B00008/C00026/D00070
Extract the parent- and child-list from the hierarchy based on the Identical-Hierarchy:
  A00003 -> B00008
  B00008 -> C00026
  B00008 -> C00288
  B00008 -> C00305
Augmented session: A00003/B00008/C00026/D00070;C00288/C00305
Hierarchy: A00003 -> B00008 -> C00026, C00288, C00305
17
Adding Identical Categories from ‘C’
Original session: A00002/B00007/C00025/D00089
Extract the parent- and child-list from the hierarchy based on the Identical-Hierarchy:
  B00007 -> C00025
  C00025 -> D00089
  C00025 -> D00892
  C00025 -> D01583
  C00025 -> D30012
  C00025 -> D33674
Augmented session: A00002/B00007/C00025/D00089;D00892/D01583/D30012/D33674
Hierarchy: B00007 -> C00025 -> D00889, D00892, D01583, D30012, D33674
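The augmentation shown on the two "Adding Identical Categories" slides could be sketched as follows: for every ID in the session that is an identical category, append its children that are not already in the path, after a `;` separator. The function name and the dictionary layout are assumptions for this example.

```python
def augment_with_identical(path, identical_children):
    """Append missing children of identical categories to a session path.

    identical_children: hypothetical map from an identical category ID
    to the ordered list of its children in the combined hierarchy.
    """
    ids = [part for part in path.split("/") if part]
    extra = []
    for category_id in ids:
        for child in identical_children.get(category_id, []):
            # Skip children already present in the session or already added.
            if child not in ids and child not in extra:
                extra.append(child)
    base = "/".join(ids)
    return base + ";" + "/".join(extra) if extra else base
```

With `{"B00008": ["C00026", "C00288", "C00305"]}` as the child map, the session `A00003/B00008/C00026/D00070` becomes `A00003/B00008/C00026/D00070;C00288/C00305`, matching the ‘B’ example.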
18
Session augmentation: Generating History based on window size
19
Generating History: Set window size = 3
Current Session:
  If curSession.prevSession.endTime < curSession.startTime: Build History
  If curSession.endTime < curSession.nextSession.startTime: Build History
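A minimal Python sketch of the history step, under stated assumptions: sessions are already ordered in time, `window` counts neighbors on each side of the current session (the slide does not say how the size is measured), and neighbor features are tagged with hypothetical `prevK:` and `nextK:` prefixes. Only the two time comparisons come from the slide.

```python
def build_history(sessions, i, window=3):
    """Collect features from neighboring sessions around sessions[i].

    sessions: list of dicts with "start", "end", and "features" keys,
    ordered in time. Times only need to support "<" (datetime or numbers).
    """
    current = sessions[i]
    history = []
    lo = max(0, i - window)
    hi = min(len(sessions), i + window + 1)
    for j in range(lo, hi):
        if j == i:
            continue
        other = sessions[j]
        if j < i and other["end"] < current["start"]:
            # Previous session ended before the current one started.
            history.extend(f"prev{i - j}:{f}" for f in other["features"])
        elif j > i and current["end"] < other["start"]:
            # Current session ended before the next one started.
            history.extend(f"next{j - i}:{f}" for f in other["features"])
    return history
```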
20
Session Augmentation: Pros and Cons
Pros:
  Generates the text-to-vector representation for a given session uniformly
  Increases the feature size
  Increases the system performance
Cons:
  Increases the system's computational time
21
Term Weighting
Different weighting approaches
  Term frequency (TF)
  TF.IDF
    IDF: Inverse Document Frequency
  TF.IDF.ICSdF
    ICSdF: Inverse Category Space Density Frequency
22
Term Weighting: Applied
Binary Weighting Approach
Normalize the session
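Binary weighting followed by normalization might look like the following Python sketch. The slide does not name the norm, so L2 (unit length) normalization is assumed here; `binary_vector` and the explicit `vocabulary` list are illustrative.

```python
import math


def binary_vector(features, vocabulary):
    """Binary-weight a session's features, then scale to unit L2 length.

    Assumption: "normalize the session" means L2 normalization.
    """
    present = set(features)  # repeated features still count once
    vector = [1.0 if term in present else 0.0 for term in vocabulary]
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector
```

Because the weights are binary, a session with k distinct known features ends up with k entries equal to 1/sqrt(k).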
23
Bootstrapping: The Basic Idea
Bootstrapping is a re-sampling method for estimating the precision of sample statistics by using subsets of the available data.
The re-sampling process also covers exchanging labels on data points when performing significance tests.
24
Bootstrapping process
Perform 4 iterations of re-sampling the data
If first_iteration
  Input: Training data (15000)
  10-fold cross validation
    9 folds for training data
    1 fold for development data
  Build training model
  Provide test data (15000)
  Predict labels
25
Bootstrapping process (cont.)
If !first_iteration
  Input: Training + Test (30000)
  Assign labels
    Training: Gold labels
    Test: Predicted labels
  10-fold cross validation
    9 folds for training data
    1 fold for development data
  Build training model
  Provide test data (15000)
  New predicted labels
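The two-phase loop on these slides can be sketched in Python as follows. This is a simplified illustration: the 10-fold cross validation step is omitted, the `fit` and `predict` callables are placeholders for whatever classifier is used, and the toy nearest-centroid classifier exists only to make the example runnable.

```python
def bootstrap_labels(train_X, train_y, test_X, fit, predict, iterations=4):
    """Iteratively relabel the test data using gold + predicted labels."""
    # First iteration: train on the gold-labeled training data only.
    model = fit(train_X, train_y)
    test_y = [predict(model, x) for x in test_X]
    for _ in range(iterations):
        # Later iterations: training (gold) + test (predicted) together.
        model = fit(train_X + test_X, train_y + test_y)
        new_y = [predict(model, x) for x in test_X]
        if new_y == test_y:  # labels stopped changing
            break
        test_y = new_y
    return test_y


def fit_centroids(X, y):
    """Toy 1-D nearest-centroid model (illustration only)."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {label: sum(xs) / len(xs) for label, xs in groups.items()}


def predict_centroid(model, x):
    return min(model, key=lambda label: abs(model[label] - x))
```

A call such as `bootstrap_labels([0.0, 1.0, 9.0, 10.0], ["f", "f", "m", "m"], [2.0, 8.0, 4.0], fit_centroids, predict_centroid)` stops as soon as the predicted test labels no longer change between iterations.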
26
Classification: LIBLINEAR
LIBLINEAR is a simple package for solving large-scale regularized linear classification
Option parameters:
  -s 1: L2-regularized L2-loss support vector classification
  -c 1
  -B 1
  -wi weight: set the parameter C of class i to weight*C
    weight: nfemale/nmale
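As an illustration of how these options fit together, the Python sketch below assembles an argument list for LIBLINEAR's `train` program with the class weight set to nfemale/nmale. Which class label receives the weight is not stated on the slide, so `weighted_label` is an assumption, as are the helper name and the file names in the usage note.

```python
def liblinear_train_args(n_female, n_male, weighted_label,
                         training_file, model_file):
    """Build a `train` command line using the options from the slide.

    weighted_label: numeric label of the class whose C is scaled
    (assumed; the slide only gives the ratio nfemale/nmale).
    """
    weight = n_female / n_male
    return [
        "train",
        "-s", "1",   # L2-regularized L2-loss SVC
        "-c", "1",   # cost parameter C
        "-B", "1",   # bias term
        f"-w{weighted_label}", str(weight),
        training_file,
        model_file,
    ]
```

The resulting list could be passed to `subprocess.run`, for example `liblinear_train_args(300, 100, 2, "train.txt", "gender.model")`, assuming the `train` binary is on the path.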
27
Results: Bootstrapping Approach with LIBLINEAR
Iteration 0: Mean Accuracy: 0.960156; Accuracy for (female, male) = 0.966761, 0.953551
Iteration 1: Mean Accuracy: 0.966785; Accuracy for (female, male) = 0.967530, 0.966040
Iteration 2: Mean Accuracy: 0.966834; Accuracy for (female, male) = 0.967188, 0.966480
Iteration 3: Mean Accuracy: 0.967122; Accuracy for (female, male) = 0.967444, 0.966800
Iteration 4: Mean Accuracy: 0.967122; Accuracy for (female, male) = 0.967444, 0.966800 (remains unchanged)
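The reported mean accuracies match the unweighted average of the female and male accuracies, for example (0.966761 + 0.953551) / 2 = 0.960156 for iteration 0. A small Python sketch of that computation, with hypothetical function names and label strings:

```python
def per_class_accuracy(gold, predicted, label):
    """Fraction of items with the given gold label that were predicted correctly."""
    pairs = [(g, p) for g, p in zip(gold, predicted) if g == label]
    return sum(g == p for g, p in pairs) / len(pairs)


def mean_class_accuracy(gold, predicted, labels=("female", "male")):
    """Unweighted mean of the per-class accuracies."""
    scores = [per_class_accuracy(gold, predicted, label) for label in labels]
    return sum(scores) / len(scores)
```

Averaging per class rather than per session keeps the score from being dominated by the larger class.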
28
Final Results: Bootstrapping Approach with LIBLINEAR
Predicted labels using bootstrapping
  Using submission system: 85.47%
  Final result: 85.103191%
29
Summary
In this work
  Session augmentation
    Identical-Hierarchy
    Generating conditional history using context window
  Term weighting
    Binary weighting
  Re-sampling process
    Bootstrapping
  Classification problem
    SVM classifier
30
!!! Thank you !!!