Missing or Inapplicable: Treatment of Incomplete Continuous-Valued Features in Supervised Learning
Prakash Mandayam Comar, Lei Liu, Sabyasachi Saha, Pang-Ning Tan, Antonio Nucci


1 Missing or Inapplicable: Treatment of Incomplete Continuous-Valued Features in Supervised Learning
Prakash Mandayam Comar+, Lei Liu+, Sabyasachi Saha§, Pang-Ning Tan+, Antonio Nucci§
+ Department of Computer Science & Engineering, Michigan State University
§ Narus Incorporation

2 Data Quality
- Data quality issues such as noise, outliers, and incomplete data can degrade the performance of supervised learning algorithms.
- Garbage In, Garbage Out
[Image source: http://www.lovemytool.com/.a/6a00e008d957708834013484842699970c-pi]

3 Incomplete Data
- Attribute values are not available when collecting the data.
- Encoded as null values in the database.

4 Missing Values as Null
- Missing value: the attribute value exists but is not recorded, for various reasons.
- Examples:
  - Null values in a user profile because users do not want to reveal personal information (date of birth, weight, etc.)
  - Null values in survey data due to no response from the respondent
  - Null values due to faulty sensors that fail to record a measurement

5 Inapplicable Values as Null
- Inapplicable value: the attribute is not applicable to the given data instance.
- Examples:
  - Null in Last_surgery_date for patients who never had any surgery
  - Null in NumBytes_in_Packet#5 for network traffic flows that transmit fewer than 5 packets
  - Null in SalesCommission for non-sales executives

6 Missing vs. Inapplicable
- Missing value:
  - Can be replaced by a value in the domain of the attribute
  - Typically, the mean or median is used to impute the missing value
- Inapplicable value:
  - Cannot take any of the domain values for the attribute
  - For categorical attributes, introduce a new categorical value 'N/A'
  - It is unclear how to deal with inapplicable continuous attributes
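The two standard treatments above can be sketched in a few lines of pandas. This is a minimal illustration, not from the paper; the column names and toy values are made up: 'weight' carries a missing value (exists but unrecorded), while 'commission' is inapplicable for non-sales roles.

```python
import pandas as pd
import numpy as np

# Toy records (hypothetical): NaN in 'weight' is a MISSING value.
df = pd.DataFrame({
    "weight": [70.0, np.nan, 82.0],
    "role": ["sales", "engineer", "sales"],
})

# Missing continuous value: impute with the column mean (a domain value).
df["weight"] = df["weight"].fillna(df["weight"].mean())

# Inapplicable categorical value: introduce an explicit 'N/A' category
# instead of imputing a fake domain value.
df["role_group"] = df["role"].where(df["role"] == "sales", "N/A")
```

For a continuous attribute like sales commission there is no analogue of the 'N/A' category, which is exactly the gap this work addresses.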

7 Contributions
- Examine the ill effects of inapplicable continuous-valued features on supervised learning algorithms
- Propose extensions to two supervised learning algorithms (tree-based and kernel-based classifiers) to handle inapplicable features

8 Example: Social Recommendation
- Recommendation based on social networks
[Figure: bipartite graph linking a network of users (nodes 1-9, A, B, C) to products P-1, P-2, P-3; +1 edges mark liked recommendations]

9 Treatment of Inapplicable Values
- Categorical attributes:
  - Easy to deal with: just add the inapplicable value as another category.
- Continuous attributes:
  - Impute the inapplicable values like missing values, but the imputed value has no interpretation.
  - Impute with a token out-of-range value; for example, sales commission could be set to -1.
  - Is this correct?

10 Decision Trees
- Consider one attribute at a time to partition the input space.
- Missing values: routed along all branches with a certain probability.

11 Decision Trees
- Inapplicable values: should be routed through the single best branch, not all branches.
- Evaluate each candidate split twice: once routing the inapplicable values with the "true" branch (test "X < τ or X = N/A", gain G_R) and once routing them with the "false" branch (test "X < τ", gain G_L), then take Gain = max(G_L, G_R).
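The max-of-two-routings rule above can be sketched as follows. This is an illustrative implementation of the idea rather than the paper's exact code; np.nan stands in for the '#' inapplicable value, and information gain is measured with Shannon entropy.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gain_for_split(x, y, tau):
    """Information gain of the test x < tau, with inapplicable values
    (np.nan) routed entirely to ONE branch. Returns the pair
    (gain_with_na_left, gain_with_na_right); the modified tree keeps
    the larger of the two, per the slide."""
    na = np.isnan(x)
    left = (~na) & (x < tau)
    right = (~na) & ~(x < tau)
    h, n = entropy(y), len(y)

    def gain(l_mask, r_mask):
        return h - (l_mask.sum() / n) * entropy(y[l_mask]) \
                 - (r_mask.sum() / n) * entropy(y[r_mask])

    return gain(left | na, right), gain(left, right | na)

# Example: the N/A instance is positive, so routing it with the
# high-valued (positive) branch yields the purer, higher-gain split.
x = np.array([0.1, 0.2, 0.8, 0.9, np.nan])
y = np.array([0, 0, 1, 1, 1])
g_left, g_right = gain_for_split(x, y, 0.5)
```

Here the split would route '#' to the right branch, since g_right exceeds g_left.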

12 Kernels on Inapplicable Values
- Continuous-valued attribute:
  - Inapplicable values (denoted by #) cannot be replaced by any numeric value.
  - Define the domain R# = R ∪ {#}.
  - Define mathematical operations on R# such that p.s.d. kernels can be defined.
  - In this work, we define a multiplication operation on R# and derive a linear kernel for vectors in R#^d.

13 Multiplication Operation on R#
For any two scalars x, y in R#, multiplication is defined as:

| x       | y       | x ⊗ y |
| #       | #       | c     |
| #       | numeric | 0     |
| numeric | #       | 0     |
| numeric | numeric | xy    |

c is the maximum similarity score, assigned when both data instances have an inapplicable value for the attribute.

14 Dot Product on R#
Given two vectors X = (x_1, x_2, ..., x_d) and Y = (y_1, y_2, ..., y_d) in R#^d, the dot product is defined component-wise using the multiplication table above: ⟨X, Y⟩# = Σ_i x_i ⊗ y_i.
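The multiplication table and the resulting dot product can be written directly. A minimal sketch, assuming NaN encodes '#' and c = 1 (the paper leaves c as a parameter):

```python
import math

NA = float("nan")   # sentinel for the inapplicable value '#'
C = 1.0             # similarity credited when BOTH values are '#'

def mult(x, y, c=C):
    """Multiplication on R#, following the table on slide 13:
       #  *  #       -> c  (both inapplicable: maximal agreement)
       #  *  numeric -> 0  (mixed pair: no evidence of similarity)
       numeric * numeric -> ordinary product x*y"""
    x_na, y_na = math.isnan(x), math.isnan(y)
    if x_na and y_na:
        return c
    if x_na or y_na:
        return 0.0
    return x * y

def dot(xs, ys, c=C):
    """Linear kernel on R#^d: component-wise modified multiplication."""
    return sum(mult(x, y, c) for x, y in zip(xs, ys))
```

For example, dot([1.0, NA, 2.0], [3.0, NA, NA]) contributes 3 from the numeric pair, c = 1 from the matched '#' pair, and 0 from the mixed pair.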

15 Modified Dot Product Kernel
- Theorem: the proposed multiplication operation yields a dot product kernel that is positive semidefinite.
  - A useful kernel property in many ML applications.
- The multiplication operation can also be used to construct higher-order polynomial kernels, e.g. K(X, Y) = (1 + ⟨X, Y⟩#)².
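The p.s.d. claim can be checked empirically: the Gram matrix of any sample under a p.s.d. kernel has no (significantly) negative eigenvalues. A vectorized sketch, again using np.nan for '#' and c = 1; the degree-2 polynomial form matches the (1 + ⟨x, y⟩#)² kernel compared in the results slides:

```python
import numpy as np

def k_dot(x, y, c=1.0):
    """Modified dot product on R#^d (np.nan encodes '#')."""
    x_na, y_na = np.isnan(x), np.isnan(y)
    both_na = float(np.sum(np.where(x_na & y_na, c, 0.0)))  # '# * #' -> c
    ok = ~x_na & ~y_na                                      # both numeric
    return both_na + float(np.sum(x[ok] * y[ok]))           # mixed -> 0

def k_poly(x, y, degree=2, c=1.0):
    """Polynomial kernel built on the modified dot product."""
    return (1.0 + k_dot(x, y, c)) ** degree

# Empirical PSD check on random data sprinkled with inapplicable values.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
X[rng.random(X.shape) < 0.2] = np.nan
G = np.array([[k_poly(a, b) for b in X] for a in X])
min_eig = float(np.linalg.eigvalsh(G).min())  # should be >= 0 (up to fp error)
```

This is a numerical sanity check, not a proof; the theorem itself establishes positive semidefiniteness analytically.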

16 Kernels Using Decision Trees
- The decision trees built to handle inapplicable values can also be used to build kernels:
  - Construct a ground-truth kernel from the data labels.
  - Use a boosting framework, with decision trees as weak learners, to learn the ground-truth kernel.

17 Kernels Using Decision Trees
- Given the label information y, construct the ground-truth kernel K = yy^T; e.g. for y = (1, 1, 1)^T, K is the 3×3 all-ones matrix.
- Use the decision trees' output labels d to construct base kernels dd^T, and combine them using a boosting algorithm.
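The combination step can be sketched schematically. This is a simplified alignment-weighted sum, not the paper's exact boosting objective; each base kernel d d^T from a weak tree is weighted by how well it aligns with the ground-truth kernel yy^T:

```python
import numpy as np

def learn_kernel(base_preds, y):
    """Schematic sketch (simplified stand-in for the boosting framework):
    combine base kernels K_t = d_t d_t^T built from weak-tree output
    labels d_t, each weighted by its alignment with K* = y y^T."""
    K_star = np.outer(y, y).astype(float)    # ground-truth kernel
    K = np.zeros_like(K_star)
    for d in base_preds:                     # d: vector of +/-1 tree outputs
        K_t = np.outer(d, d).astype(float)   # base kernel from one tree
        w = float((K_star * K_t).sum() / (K_t * K_t).sum())
        if w > 0:                            # keep only positively aligned trees
            K += w * K_t
    return K
```

A tree that reproduces the labels exactly gets weight 1 and recovers yy^T on its own; poorly aligned trees contribute little.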

18 KDD CUP Data
- Bipartite links are labeled (+1/-1); no link means the item was not recommended to the user.
- Features: for each user, compute the proportion of neighbors who have liked the product P-i.
  - Takes values in [0, 1].
  - Inapplicable (#) if the user has no neighbor to whom the product was recommended.
[Figure: network of users (1-9, A, B, C) linked to products P-1, P-2, P-3 with +1 edges]
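The feature above can be sketched as a small function. The data-structure names here (follows, liked, recommended) are illustrative, not from the paper; the point is that "no neighbor was shown the product" yields '#', not 0:

```python
import math

def like_feature(user, product, follows, liked, recommended):
    """Proportion of `user`'s neighbors who liked `product`, among the
    neighbors to whom the product was actually recommended. Returns
    math.nan (standing in for '#') when no neighbor was ever shown
    the product - the feature is inapplicable, not zero."""
    shown = [v for v in follows.get(user, set())
             if (v, product) in recommended]
    if not shown:
        return math.nan
    return sum((v, product) in liked for v in shown) / len(shown)

# Toy network: user A follows 1, 2, 3; product P was recommended
# to 1 and 2; only 1 liked it -> feature value 0.5 for A.
follows = {"A": {"1", "2", "3"}}
recommended = {("1", "P"), ("2", "P")}
liked = {("1", "P")}
```

Returning 0 instead of '#' would wrongly assert that the user's neighborhood disliked the product.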

19 Experimental Evaluation: Data Sets
- KDD CUP 2012 data:
  - Twitter-like user network (directed graph)
  - User-item recommendation network, with links labeled like (+1), dislike (-1), or no recommendation (0)
  - For a given item, we want a binary classifier predicting whether a user will like or dislike it on recommendation.
  - We use an SVM (one-vs-all) as the classifier for testing the kernels.

20 Experimental Evaluation: Data Sets
- Network traffic data:
  - The task is to classify network flows into malicious and non-malicious.
  - Further categorize the non-malicious flows into different types.
  - Features are extracted from the flows/packets: number of packets, inter-arrival times, flow type, etc. The size of packet k contains inapplicable values (for flows with fewer than k packets).

21 Experimental Evaluation: Metrics
- KDD Cup data:
  - For each product, we sampled 5000 data points each for training and testing, with 500 positive samples.
  - Precision (accepting the recommended product) among the top 500 ranked users.
- Network traffic data:
  - F-measure on each individual class.
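The KDD Cup metric is precision among the top-k ranked users. A minimal sketch of that computation (function name and toy values are illustrative):

```python
def precision_at_k(scores, labels, k=500):
    """Rank users by classifier score (descending) and return the
    fraction of true positives (label +1) among the top k."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = order[:k]
    return sum(labels[i] == 1 for i in top) / len(top)

# Toy example: 4 users, take the top 2 by score; one of the two
# highest-scored users is a true positive -> precision 0.5.
p = precision_at_k([0.9, 0.8, 0.1, 0.7], [1, -1, 1, 1], k=2)
```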

22 Results: Trees vs. Kernels
- Binary classification of network flows into malicious and non-malicious.
- Trees and modified trees are better than kernels.

| Class      | Train | Test  | J48   | Mod. J48 | Kernel | # Kernel |
| Malicious  | 500   | 3874  | 0.952 | 0.976    | 0.5    | 0.55     |
| Legitimate | 10000 | 40000 | 0.995 | 0.997    | 0.9    | 0.93     |

- The tree kernel did not improve the results over the decision tree.

23 Results: Imputed vs. Proposed Kernels
- KDD CUP data
- Imputed kernel vs. proposed kernel:
  - Linear (first-order polynomial) kernel
  - Even a small increase in precision is useful in practice.

24 Results: Imputed vs. Proposed Kernels
- KDD CUP data
- Imputed kernel vs. proposed kernel:
  - Second-order polynomial kernel
  - Consistently better than (or as good as) the imputed kernel across different products.

25 Results: Tree Kernel
- Multiclass classification of malicious network flows.

| Class | Train | Test | Tree          | Kernel | # Kernel |
| A     | 300   | —    | 0.958 ± 0.008 | 0.3467 | 0.8247   |
| B     | 91    | 211  | 0.982 ± 0.032 | 0.6709 | 0.9952   |
| C     | 41    | 94   | 0.871 ± 0.131 | 0.6083 | 0.9305   |
| D     | 56    | 132  | 0.964 ± 0.026 | 0.39   | 0.6977   |
| E     | 25    | 33   | 0.846 ± 0.119 | 0.8529 | 0.9688   |
| F     | 25    | 31   | 0.818 ± 0.103 | 0.1667 | 0.7576   |
| G     | 43    | 100  | 0.966 ± 0.027 | 0.6753 | 0.7857   |
| H     | 300   | 1200 | 0.985 ± 0.002 | 0.8374 | 0.9593   |
| I     | 300   | 1070 | 0.994 ± 0.001 | 0.6062 | 0.9854   |
| J     | 59    | 137  | 0.885 ± 0.039 | 0      | 0.9073   |
| K     | 33    | 78   | 0.915 ± 0.046 | 0.4503 | 0.7368   |
| L     | 25    | 29   | 0.744 ± 0.107 | 0      | 0.4127   |
| M     | 27    | 62   | 0.951 ± 0.033 | 0.1151 | 0.7154   |

- J48 and Modified J48 gave similar results.

26 Results: Summary
- The proposed #-linear kernel was always better than the mean-imputed linear kernel.
- The proposed modified trees were better than, or as good as, regular J48 trees.
- The trees performed better on the network traffic data, whereas the kernels performed better on the KDD CUP data.
- Additional work is required to improve the performance.

27 Additional slides

28 [Figure: network of users linked to products P-1, P-2, P-3 with +1 edges]

29 Additional Results: KDD CUP Data
[Scatter plots: AUC of the proposed kernel vs. AUC of the kernel on imputed data, for the linear and degree-2 polynomial kernels]

30 Results: KDD CUP Data
- Per-product comparison of the imputed kernel (1 + ⟨x, y⟩)² and the proposed kernel (1 + ⟨x, y⟩#)²; two column headers did not survive extraction:

| Product | (1+⟨x,y⟩)² | (1+⟨x,y⟩#)² | —      | —      |
| 1       | 7.20%      | 8.40%       | 4.60%  | 7.80%  |
| 2       | 11.60%     | 12%         | 9.20%  | 10.60% |
| 3       | 8.80%      | 9.60%       | 9.20%  | 10.60% |
| 4       | 10.20%     | 13.40%      | 11.40% | 11.20% |
| 5       | 14.20%     | 15.20%      | 13.80% | 14.20% |
| 6       | 15.20%     | 16.80%      | 12.60% | 12.00% |
| 7       | 4.20%      | 3.60%       | 4.80%  | 6.10%  |
| 8       | 13.20%     | 12.80%      | 14%    | 14.80% |
| 9       | 12.40%     | 11.40%      | 11.80% | —      |
| 10      | 8.80%      | 8.40%       | —      | —      |

- J48 and Modified J48 were very bad on this set.

