ID Identification in Online Communities Yufei Pan Rutgers University.

Presentation transcript:

ID Identification in Online Communities Yufei Pan Rutgers University

ID Identification
Online communities are a large part of people's lives.
People interact with each other via different IDs.
Who is that ID?
–Text Identification: given a piece of text, can we identify which of the known IDs produced it?
–ID Identification: given an ID with all its text evidence, can we match it to any other known ID?

Text Identification vs Text Categorization
Text Categorization
–Well-known area
–Categorizes the type of a text based on its content
–Treats the text as a bag of words
Text Identification
–Identifies the ID that produced the text
–Similarity of content does not help
–Looks for constant features that are independent of text content
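The contrast between content and style can be illustrated with a small Python sketch. The two sample posts are contrived, and the tiny `FUNCTION_WORDS` set and the helper names are assumptions made for illustration; none of them come from the slides.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (illustrative assumption).
FUNCTION_WORDS = {"the", "a", "of", "and", "to", "in", "is", "it", "but", "i"}

def tokens(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(text, keep=None):
    """Word counts; if `keep` is given, count only the words in that set."""
    return Counter(t for t in tokens(text) if keep is None or t in keep)

def content_words(text):
    """Bag of words with the function words removed."""
    return bag_of_words(text) - bag_of_words(text, FUNCTION_WORDS)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two made-up posts by the same hypothetical author on different topics.
post_a = "I think the surf is great, but the tide is a mess."
post_b = "I think the movie is long, but the ending is a gem."

# Content view: what a topic categorizer would compare.
content_sim = cosine(content_words(post_a), content_words(post_b))
# Style view: function-word usage only, largely independent of topic.
style_sim = cosine(bag_of_words(post_a, FUNCTION_WORDS),
                   bag_of_words(post_b, FUNCTION_WORDS))
```

In this contrived pair the content words barely overlap while the function-word profiles coincide, which is the kind of content-independent signal that text identification looks for.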

Approach for Text Identification
Stylometric Features
–The style features of an author, derived from his/her known text.
–Rudman (1997)
Steps:
–First, extract a set of stylometric features.
–Second, choose a machine learning algorithm.
–Finally, conduct experiments and assess the results.
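The first step can be sketched in Python as follows. This is a minimal illustration, not the feature extractor used in the experiments: it computes only three of the features named on the Experiment Setup slide (average sentence length, function-word ratio, function-word frequency distribution), and the short `FUNCTION_WORDS` list and the function name are assumptions.

```python
import re

# Hypothetical short function-word list; real stylometric studies use longer ones.
FUNCTION_WORDS = ("the", "a", "of", "and", "to", "in", "is", "it", "but", "that")

def stylometric_features(text):
    """Return a fixed-length vector describing style rather than topic."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty input
    # Average sentence length, measured in words.
    avg_sentence_len = len(words) / (len(sentences) or 1)
    # Count of each function word, in the fixed order of FUNCTION_WORDS.
    fw_counts = [words.count(fw) for fw in FUNCTION_WORDS]
    fw_ratio = sum(fw_counts) / total
    fw_distribution = [c / total for c in fw_counts]
    return [avg_sentence_len, fw_ratio] + fw_distribution

features = stylometric_features("The tide is low. It is a good day to surf!")
```

Each text then becomes one numeric vector of the same length, which is the input format a learning algorithm in the second step would expect.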

Text Identification vs ID Identification
Are they the same?
–No. It depends on how consistent the stylometric features are across different IDs.
–What if the entity intentionally controls the text style for each ID?
–Or what if he/she unconsciously changes the text behavior to match the expected behavior of the ID?

Style Variation Pattern
Observation
–An entity demonstrates a certain style variation when its environment changes.
–The variation may contain a pattern that is invariant across the IDs of this entity.
This means:
–Find the constant variation pattern for an entity, which is independent of the ID it uses.
–Use this pattern to identify IDs.

Experiment Setup
Input Data
–2nd light forum (
Stylometric Features (56) (de Vel, 2001)
–Average sentence length (number of words)
–Total number of function words/W
–Function word frequency distribution
–……
Machine learning algorithm
–Support Vector Machine
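The slides name only "Support Vector Machine" and do not say which implementation, kernel, or parameters were used. As a rough sketch under those unknowns, the Python below trains a binary linear SVM by subgradient descent on the regularized hinge loss. The two-dimensional feature vectors, the hyperparameters, and the function names are all made up for illustration.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Binary linear SVM via subgradient descent on the regularized hinge loss.

    X is a list of feature vectors and y holds labels in {-1, +1}.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The L2 regularizer shrinks the weights on every step.
            w = [wj - lr * lam * wj for wj in w]
            if margin < 1:  # inside the margin or misclassified
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Made-up, already scaled style vectors for two hypothetical IDs.
X_train = [[0.9, 0.2], [0.8, 0.3], [1.0, 0.1],
           [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
y_train = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X_train, y_train)
predictions = [predict(w, b, x) for x in X_train]
```

With more than two candidate IDs, one common option is to train one such classifier per ID against all the others and pick the ID with the highest score; whether the experiments did this is not stated in the slides.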

Experiment Result
Text Identification
                                 Training    Testing
–Correctly Classified            %           %
–Incorrectly Classified          %           %
–Kappa statistic
–Mean absolute error
–Root mean squared error
–Relative absolute error         %           %
–Root relative squared error     %           %
–Total Number of Instances       88          53
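For readers unfamiliar with the kappa statistic listed above, the sketch below computes classification accuracy and Cohen's kappa, defined as (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The label lists are invented purely to exercise the formula and are unrelated to the experiment's data.

```python
from collections import Counter

def accuracy_and_kappa(actual, predicted):
    """Return (accuracy, Cohen's kappa) for two equal-length label lists."""
    n = len(actual)
    # Observed agreement: fraction of instances classified correctly.
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    # Agreement expected by chance, from the marginal label frequencies.
    expected = sum(actual_counts[c] * predicted_counts[c]
                   for c in actual_counts) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 1.0
    return observed, kappa

# Hypothetical labels for two made-up IDs.
actual = ["id_a", "id_a", "id_a", "id_b", "id_b", "id_b", "id_b", "id_a"]
predicted = ["id_a", "id_a", "id_b", "id_b", "id_b", "id_b", "id_a", "id_a"]

acc, kappa = accuracy_and_kappa(actual, predicted)
```

Here six of eight labels agree (accuracy 0.75) while chance agreement is 0.5, so kappa is 0.5. Kappa therefore discounts the part of the accuracy that a classifier guessing from label frequencies would reach anyway.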

Experiment Result (cont'd)
Variation Matrices
–VM[Floridave]
–VM[paddleout]
–Eigenvalues: 109.2, 67.1, , i, i

The End
Thank you!