Predicting Long-Term Impact of CQA Posts: A Comprehensive Viewpoint
Yuan Yao, joint work with Hanghang Tong, Feng Xu, and Jian Lu
KDD 2014, Aug 24-27
Roadmap
- Background and Motivations
- Modeling Multi-aspect
- Computation Speedup
- Empirical Evaluations
- Conclusions
CQA (Community Question Answering)
What is the Long-Term Impact of a Q/A post?
Q: How many users will find it beneficial?
Survey: we conducted a survey asking users which metric best captures long-term impact; 79.4% of users chose the voting score.
Analogy: to some extent, the voting score resembles the number of citations a research paper receives in the scientific publication domain. It reflects the net number of users who have a positive attitude toward the post.
Challenges
Q: Why not off-the-shelf data mining algorithms?
Challenge 1: Multi-aspect
- C1.1 Coupling between questions and answers
- C1.2 Feature non-linearity
- C1.3 Posts dynamically arrive
Challenge 2: Efficiency
C1.1 Coupling
The predicted question impact Yq and the predicted answer impact Ya exhibit a strong positive correlation (voting consistency), linking the question features Fq and the answer features Fa [Yao+@ASONAM'14].
[Yao+@ASONAM'14] Y Yao, H Tong, T Xie, L Akoglu, F Xu, J Lu. Joint Voting Prediction for Questions and Answers in CQA. ASONAM 2014.
C1.1 Coupling
LIP-M [Yao+@ASONAM'14] combines three terms: (1) question prediction, (2) answer prediction, and (3) voting-consistency regularization.
C1.2 Non-linearity
Solution: the kernel trick (as in SVMs): apply a Mercer kernel and use the kernel matrix as the new feature matrix.
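The kernel trick above can be sketched in a few lines. This is a minimal illustration (not the paper's feature pipeline), assuming an RBF kernel and toy data; the kernel matrix takes the place of the raw feature matrix in a ridge regression:

```python
import numpy as np

def rbf_kernel(X, Z, gamma=0.5):
    """Mercer (RBF) kernel: K[i, j] = exp(-gamma * ||x_i - z_j||^2)."""
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                     # 50 posts, 3 features (toy data)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)  # a non-linear target

# Use the kernel matrix as the "new feature matrix":
# solve (K + lam*I) alpha = y in the dual instead of the primal ridge problem.
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + 1.0 * np.eye(50), y)
y_hat = K @ alpha                                # in-sample predictions
```

The model stays linear in the kernel space while capturing non-linear structure in the original features.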
C1.3 Dynamics
Solution: recursive least squares regression [Haykin2005]: the current model (Model_t), built from the existing examples (x1, y1), ..., (xt, yt), is updated with the new example (x_{t+1}, y_{t+1}) to obtain Model_{t+1}, without retraining from scratch.
[Haykin2005] S Haykin. Adaptive Filter Theory. 2005.
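A minimal recursive least squares step, in the spirit of [Haykin2005] (variable names are illustrative, not the paper's): each new example refines the model in O(d^2) time, and streaming all examples reproduces the batch ridge solution exactly:

```python
import numpy as np

def rls_update(P, w, x, y):
    """One recursive-least-squares step on a new example (x, y).

    P is the current inverse covariance (lam*I + X^T X)^{-1};
    w is the current weight vector. Both are updated in O(d^2).
    """
    Px = P @ x
    k = Px / (1.0 + x @ Px)      # gain vector
    w = w + k * (y - x @ w)      # correct the model with the new example
    P = P - np.outer(k, Px)      # Sherman-Morrison rank-one update of the inverse
    return P, w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.01 * rng.normal(size=100)

lam = 0.1
P, w = np.eye(4) / lam, np.zeros(4)   # start from the prior (lam*I)^{-1}, w = 0
for xi, yi in zip(X, y):
    P, w = rls_update(P, w, xi, yi)

# Batch ridge solution for comparison: identical up to round-off.
w_batch = np.linalg.solve(lam * np.eye(4) + X.T @ X, X.T @ y)
```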
This Paper
Q1: How can we comprehensively capture all three aspects (coupling, non-linearity, and dynamics) in one algorithm?
Q2: How can we make the long-term impact prediction algorithm efficient?
Roadmap → Modeling Multi-aspect
Modeling Non-linearity
Basic idea: kernelize LIP-M.
Details (LIP-KM): the three terms (question prediction, answer prediction, voting-consistency regularization) are kernelized; a closed-form solution exists, with complexity O(n^3).
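Where the O(n^3) comes from: the closed-form solution requires solving a dense n x n linear system. A sketch with plain kernel ridge regression, omitting LIP-KM's voting-consistency coupling between questions and answers:

```python
import numpy as np

def kernel_ridge_fit(K, y, lam=1.0):
    """Closed-form dual solution: alpha = (K + lam*I)^{-1} y.

    The dense n x n solve is the O(n^3) bottleneck removed later.
    """
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), y)

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 5))
y = rng.normal(size=80)

K = X @ X.T                      # linear kernel, for simplicity
alpha = kernel_ridge_fit(K, y)

# Prediction for a new post x: k(x)^T alpha, where k(x)[i] = <x, x_i>
x_new = rng.normal(size=5)
y_new = (X @ x_new) @ alpha
```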
Modeling Dynamics
Basic idea: recursively update LIP-KM.
Details (LIP-KIM): the matrix inversion lemma reduces the per-update complexity from O(n^3) to O(n^2).
Roadmap → Computation Speedup
Approximation Method (1)
Basic idea: compress the kernel matrix.
Details (LIP-KIMA): 1) decompose the kernel matrix separately (Nyström approximation, with SVD on X1 and eigen-decompositions, compressing n*n to n*r); 2) make the decomposition reusable across updates; 3) apply the decomposition in LIP-KIM.
Complexity: O(n^2) → O(n).
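A minimal Nyström sketch (illustrative only; the paper additionally makes the decomposition reusable across updates): sample r landmark columns C and the r x r landmark block W, and approximate K ≈ C W^+ C^T, replacing the n x n matrix with a rank-r factorization:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
K = X @ X.T                              # full n x n kernel (rank <= 5 here)

r = 20                                   # number of landmark columns
idx = rng.choice(200, size=r, replace=False)
C = K[:, idx]                            # n x r sampled columns
W = K[np.ix_(idx, idx)]                  # r x r landmark block
K_approx = C @ np.linalg.pinv(W) @ C.T   # rank-r Nystrom reconstruction

rel_err = np.linalg.norm(K - K_approx) / np.linalg.norm(K)
```

Because the toy kernel has rank 5 and r = 20 landmarks, the reconstruction here is essentially exact; in general the error depends on the kernel's effective rank relative to r.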
Approximation Method (2)
Basic idea: filter out less informative examples.
Details (LIP-KIMAA): built upon LIP-KIMA, each arriving example (x_{t+1}, y_{t+1}) passes through a filtering step; only the examples that survive filtering are used to update Model_t into Model_{t+1}.
Complexity: O(n) → sub-linear.
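One simple instantiation of the filtering idea (an assumption on our part; the paper's actual criterion may differ): skip an incoming example when the current model already predicts it within a tolerance, so only informative examples trigger the costlier model update:

```python
import numpy as np

def filter_stream(examples, predict, update, tol=0.25):
    """Process a stream, updating the model only on informative examples."""
    n_updates = 0
    for x, y in examples:
        if abs(predict(x) - y) > tol:    # current model is wrong enough
            update(x, y)                 # pay the update cost
            n_updates += 1
    return n_updates                     # examples that were NOT filtered out

# Toy usage with a running-mean "model" (purely illustrative).
state = {"mean": 0.0, "n": 0}
def predict(x):
    return state["mean"]
def update(x, y):
    state["n"] += 1
    state["mean"] += (y - state["mean"]) / state["n"]

rng = np.random.default_rng(5)
data = [(None, 1.0 + 0.1 * rng.normal()) for _ in range(100)]
n_updates = filter_stream(data, predict, update)
```

Once the model stabilizes, most arriving examples are filtered out, which is where the sub-linear behavior comes from.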
Summary
Naming convention: K = non-linearity (kernel), I = dynamics (incremental), M = coupling, A = approximation.
Existing work: Ridge Regression; LIP-K (Kernel Ridge Regression); LIP-I (Recursive Ridge Regression); LIP-M (CoPs); LIP-KI (Recursive Kernel Ridge Regression); LIP-IM.
Proposed in this paper: LIP-KM, O(n^3); LIP-KIM, O(n^2); LIP-KIMA, O(n); LIP-KIMAA, sub-linear.
Each added aspect (K, I, M) makes the model more comprehensive; each approximation step (A) makes it more scalable.
Roadmap → Empirical Evaluations
Experiment Setup
Datasets: Stack Overflow and Mathematics Stack Exchange (http://blog.stackoverflow.com/category/cc-wiki-dump/).
Features: content (bag-of-words) and contextual features.
Data split over time: an initial set plus an incremental set form the training set, followed by the test set.
Evaluation Objectives
O1: Effectiveness: how accurate are the proposed algorithms for long-term impact prediction?
O2: Efficiency: how scalable are the proposed algorithms?
Effectiveness Results
Comparisons with existing models (root mean squared error): our methods achieve lower error.
Efficiency Results
Speed comparisons: our methods are faster, and LIP-KIMAA scales sub-linearly.
Quality-Speed Trade-off
Our methods achieve a better balance between prediction quality and speed.
Roadmap → Conclusions
Conclusions
A family of algorithms for long-term impact prediction of CQA posts.
Q1: How to capture coupling, non-linearity, and dynamics? A1: voting consistency + the kernel trick + recursive updating.
Q2: How to make the algorithms scalable? A2: approximation methods.
Empirical evaluations: effectiveness, up to 35.8% improvement; efficiency, up to 390x speedup with sub-linear scalability.
Thanks! Q&A
Yuan Yao, yyao@smail.nju.edu.cn
Authors: Yuan Yao, Hanghang Tong, Feng Xu, and Jian Lu