ID Identification in Online Communities Yufei Pan Rutgers University
ID Identification Online communities are a large part of people's lives. People interact with each other via different IDs. Who is behind an ID? –Text Identification: given a piece of text, can we identify which of the known IDs produced it? –ID Identification: given an ID and all of its text evidence, can we identify it with any other known ID?
Text Identification vs Text Categorization Text categorization –Well-known area –Categorizes text by type based on its content –Treats the text as a bag of words Text Identification –Identifies the ID that produced the text –Content similarity does not help –Looks for constant features that are independent of text content
Approach for Text Identification Stylometric Features –The style features of an author, derived from his/her known text –Rudman (1997) Steps: –First, extract a set of stylometric features. –Second, choose a machine learning algorithm. –Finally, run experiments to evaluate identification accuracy.
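The second and third steps above can be sketched in a few lines. This is a minimal illustration, not the method used in the talk: it assumes each text has already been turned into a numeric feature vector, and it swaps in a nearest-centroid classifier as a simple stand-in for the learning algorithm. The names `train`, `identify`, `id_a`, and `id_b` are hypothetical.

```python
import math

def centroid(vectors):
    """Average a list of equal-length feature vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def train(samples_by_id):
    """Build one centroid per known ID from its training feature vectors."""
    return {id_: centroid(vectors) for id_, vectors in samples_by_id.items()}

def identify(model, vector):
    """Return the known ID whose centroid is closest (Euclidean) to the vector."""
    return min(model, key=lambda id_: math.dist(model[id_], vector))

# Hypothetical two-feature training data for two known IDs.
model = train({
    "id_a": [[1.0, 0.0], [3.0, 0.0]],
    "id_b": [[0.0, 10.0], [0.0, 12.0]],
})
predicted = identify(model, [2.5, 1.0])
```

Any classifier that maps a feature vector to one of the known IDs could take the place of `identify` here.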
Text Identification vs ID Identification Same? –No. It depends on how consistent the stylometric features are across different IDs. –What if the entity intentionally controls the text style for each ID? –Or what if he/she unconsciously changes text behavior to match the expected behavior of the ID?
Style Variation Pattern Observation –An entity demonstrates a certain style variation across changing environments –The variation may contain a pattern that is invariant across the IDs of this entity It means: –Find the constant variation pattern for an entity, which is independent of the ID it uses. –Use this pattern to identify IDs.
Experiment Setup Input Data –2nd light forum Stylometric Features (56) (de Vel, 2001) –Average sentence length (number of words) –Total number of function words/W –Function word frequency distribution –… Machine learning algorithm –Support Vector Machine
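The three features named on this slide can be computed roughly as follows. This is a sketch under stated assumptions: `W` is taken to mean the total number of words in the text, sentences are split on `.`, `!` and `?`, and `FUNCTION_WORDS` is a hypothetical, deliberately tiny list (the 56-feature set used in the experiment is not reproduced here).

```python
import re
from collections import Counter

# Hypothetical short function-word list, for illustration only.
FUNCTION_WORDS = ("the", "a", "an", "of", "and", "to", "in", "is", "it", "that")

def stylometric_features(text):
    """Return a dict holding three content-independent style features."""
    # Naive sentence split on terminal punctuation; empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    total_words = len(words)  # assumed meaning of W
    if total_words == 0 or not sentences:
        raise ValueError("text must contain at least one word")

    counts = Counter(w for w in words if w in FUNCTION_WORDS)
    features = {
        # Average sentence length, measured in words.
        "avg_sentence_length": total_words / len(sentences),
        # Total number of function words divided by W.
        "function_word_ratio": sum(counts.values()) / total_words,
    }
    # Function word frequency distribution: one feature per function word.
    for word in FUNCTION_WORDS:
        features["freq_" + word] = counts[word] / total_words
    return features

features = stylometric_features("The cat sat. It is in the hat!")
```

None of these features depends on what the text is about, which is the property text identification relies on.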
Experiment Result Text Identification –Training Testing –Correctly Classified % % –Incorrectly Classified % % –Kappa statistic –Mean absolute error –Root mean squared error –Relative absolute error % % –Root relative squared error % % –Total Number of Instances 88 53
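For readers unfamiliar with the kappa statistic listed above, the sketch below computes Cohen's kappa from a confusion matrix: observed agreement corrected for the agreement expected by chance. The matrix values are made up for illustration and are not results from this experiment.

```python
def kappa_statistic(confusion):
    """Cohen's kappa for a square confusion matrix (rows: actual, columns: predicted)."""
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(confusion[i][i] for i in range(size)) / total
    # Chance agreement: sum over classes of (row total * column total) / total^2.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(size)
    ) / (total * total)
    return (observed - expected) / (1 - expected)

# Hypothetical two-ID confusion matrix.
kappa = kappa_statistic([[20, 5], [10, 15]])
```

A kappa of 1 means perfect agreement with the true IDs, while a value near 0 means the classifier does no better than chance.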
Experiment Result (cont’d) Variation Matrices –VM[Floridave] –VM[paddleout] –Eigenvalues: 109.2, 67.1, , i, i
The End Thank you!