Detecting Conversing Groups of Chatters: A Model, Algorithm and Tests S. A. Çamtepe, M. Goldberg, M. Magdon-Ismail, M. S. Krishnamoorthy {camtes, goldberg,


Detecting Conversing Groups of Chatters: A Model, Algorithm and Tests S. A. Çamtepe, M. Goldberg, M. Magdon-Ismail, M. S. Krishnamoorthy {camtes, goldberg, magdon,

Motivation and Problem
- Internet chatrooms are open to exploitation by malicious users.
  - Chatrooms are open forums that offer anonymity.
  - The real identities of participants are decoupled from their chatroom nicknames.
  - Multiple threads of communication can coexist concurrently.
- Our goal is to provide automated tools for studying chatrooms and discovering who is chatting with whom.
  - Human monitoring is possible but not scalable.

Motivation and Problem (cont.)
Not a trivial task, even for a well-trained human eye:
  [20:19:40] what u been up to or down to?
  [20:19:42] Hi!! Anyone from Boston around?
  [20:20:00] amenazas d eataque
  [20:20:04] sorry
  [20:20:29] laying around sick all weekend...
  [20:20:30] can anyone in here speak english??
  [20:20:37] what about you ?
  [20:20:40] whats wrong?
  [20:20:46] not me
  [20:20:48] sinus infection
  [20:20:57] i had that last week
  [20:20:58] Hmm, those seem rather infectious down there
  [20:21:07] its this stupid weather
  [20:21:32] yeah...darn weather
  [20:21:43] i wont get good and over them until we get a good rain or a really hard freeze

Outline
- Related work
- Contributions
- The Model
- Algorithms
  - Cluster
  - Connect
  - Color & Merge
- Results
- Conclusion

IRC - Internet Relay Chat
- RFC 2810 … 2813
- An interactive, public forum of communication for participants with diverse objectives.
- IRC is a multi-user, multi-channel, multi-server chat system that runs on a network.
- It is a protocol for text-based conferencing:
  - It lets people all over the world talk to one another in real time.
  - A conversation or chat takes place either in private or on a public channel.

Related Work
- Multi-users in open forums
  - H.-C. Chen, M. Goldberg, M. Magdon-Ismail, “Identifying multi-users in open forums”, ISI 04.
- An automated surveillance system
  - S. A. Camtepe, M. S. Krishnamoorthy, B. Yener, “A tool for Internet chatroom surveillance”, ISI 04.
- PieSpy
  - P. Mutton and J. Golbeck, “Visualization of Semantic Metadata and Ontologies”, IV03.
  - P. Mutton, “PieSpy Social Network Bot”,
- Chat Circles
  - F. B. Viegas and J. S. Donath, “Chat Circles”, CHI
- Social Network Analysis (SNA)
  - V. Krebs, “An Introduction to Social Network Analysis”,

Contributions
- A model that does not use semantic information:
  - chatters are nodes in a graph,
  - a collection of chatters is a hyperedge.
- Two efficient algorithms that:
  - use statistical information on the posts to create candidate hyperedges,
  - “clean” the hyperedges using transitivity and graph coloring.
- The algorithms are rigorously tested using simulation on the model.

The Model - Assumptions
- We model a single chatroom, which corresponds to a topic.
- Members form small groups and talk on one or more subtopics:
  - Subtopics are created at the beginning and never halt.
  - A user participates in one subtopic only.
- A user:
  - arrives,
  - selects a subtopic to talk on,
  - stays in the same subtopic during his/her lifetime.
- At any time, only one user is selected to post in a subtopic.
- Message interarrival times are random, drawn from a given distribution.

The Model – Assumptions (cont.)
- Users' arrival and departure times are selected uniformly at random.
- To ensure that each user posts enough messages for analysis:
  - Simulation time is divided into “n” regions.
  - Arrival times are selected uniformly at random from the first region.
  - Departure times are selected uniformly at random from the last region.
- At any time, the messages coming from all subtopics are shuffled uniformly at random and output.

The Model - Parameters
- Simulation time and number of regions
- Number of users
- Number of subtopics
- Probability distribution and parameters (mean, variance, …) for:
  - User to subtopic assignment
  - Message interarrival time
- Step size K (defined on the next slide)

The Model - Algorithm
- Single event queue
  - Message post events (post, user, subtopic, time)
  - User join events (join, user, subtopic, time)
  - User leave events (leave, user, subtopic, time)
- K-step posting probability for each subtopic
  - A list of size K called the “Probability History List”
  - A user who has posted recently is pushed to the front.
  - The user at the front has the smallest probability of posting next.
  - The user at the end and users not in the list have the highest probability of posting next.

The Model - Algorithm (cont.)
  Load parameters
  For each user
    select an arrival time, generate an arrival event for the user
    select a departure time, generate a departure event for the user
    select a subtopic, generate a join event
  For each timestep
    For each event at the current time
      If post event
        insert the message into the message buffer
        create a new post event
          select the next user to send according to the K-step probability
          select the time for the next post (message interarrival time)
          update the K-step probability (probability history list)
      If join event
        add the user to the subtopic
        If first user in the subtopic
          create a post event
          update the K-step probability (probability history list)
      If leave event
        remove the user from the subtopic
    Shuffle the message buffer
    Log the messages to a file
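The simulation loop above can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation. The slides do not give the exact K-step probabilities or the distributions, so the sketch assumes a linear weighting (front of the history list gets weight 1, the end of the list and absent users get weight K), a uniform user-to-subtopic assignment, and exponential interarrival times rounded to whole time steps. The names `simulate`, `pick_poster`, and `mean_gap` are hypothetical.

```python
import heapq
import random
from itertools import count


def pick_poster(members, history, k, rng):
    """Pick the next poster; recent posters (front of history) are less likely."""
    weights = []
    for user in members:
        # Assumed weighting: front of the list -> 1, end of list or absent -> k.
        position = history.index(user) if user in history else k - 1
        weights.append(position + 1)
    return rng.choices(members, weights=weights)[0]


def simulate(num_users=6, num_subtopics=2, sim_time=300, regions=10,
             k=3, mean_gap=5.0, seed=0):
    rng = random.Random(seed)
    region = sim_time // regions
    seq = count()          # tie-breaker so heap entries never compare by kind
    events = []            # single event queue: (time, seq, kind, user, subtopic)
    assignment = {}
    for user in range(num_users):
        topic = rng.randrange(num_subtopics)   # assumed uniform assignment
        assignment[user] = topic
        join_time = rng.randrange(region)                       # first region
        leave_time = rng.randrange(sim_time - region, sim_time)  # last region
        heapq.heappush(events, (join_time, next(seq), "join", user, topic))
        heapq.heappush(events, (leave_time, next(seq), "leave", user, topic))

    members = {t: [] for t in range(num_subtopics)}
    history = {t: [] for t in range(num_subtopics)}
    log = []

    def schedule_post(now, topic):
        user = pick_poster(members[topic], history[topic], k, rng)
        gap = 1 + int(rng.expovariate(1.0 / mean_gap))  # assumed distribution
        heapq.heappush(events, (now + gap, next(seq), "post", user, topic))

    while events:
        time, _, kind, user, topic = heapq.heappop(events)
        if kind == "join":
            members[topic].append(user)
            if len(members[topic]) == 1:       # first user starts the posting
                schedule_post(time, topic)
        elif kind == "leave":
            members[topic].remove(user)
        else:  # post event
            if user in members[topic]:         # skip users who already left
                log.append((time, user))
                if user in history[topic]:
                    history[topic].remove(user)
                history[topic].insert(0, user)  # recent poster goes to front
                del history[topic][k:]          # keep at most k entries
            if members[topic]:
                schedule_post(time, topic)

    # Shuffle messages that share a time step; the stable sort keeps that order.
    rng.shuffle(log)
    log.sort(key=lambda message: message[0])
    return log, assignment


log, assignment = simulate()
```

The returned `log` is a list of `(time, user)` tuples in the same spirit as the sample chat log, and `assignment` records the ground-truth subtopic of each user for later scoring.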

The Model - Output
Sample chat log:
  TIME 6 USER 20
  TIME 7 USER 15
  TIME 9 USER 61
  TIME 12 USER 41
  TIME 12 USER 24
  ……
User to subtopic assignments (table columns: Subtopic, Members)

Algorithms
Initial processing of message logs:
- Consider every pair of consecutive messages.
- Generate a list of node pairs and the corresponding interarrival times.

Sample log:
  TIME 6 USER 20
  TIME 7 USER 15
  TIME 9 USER 61
  TIME 12 USER 41
  TIME 12 USER 24

Node-pair, interarrival list:
  users (20,15) int. time 1
  users (15,61) int. time 2
  users (61,41) int. time 3
  users (41,24) int. time 0
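The initial processing step can be illustrated with a few lines of Python. This is a minimal sketch that assumes the log has already been parsed into `(time, user)` tuples in log order; the function name `consecutive_pairs` is hypothetical.

```python
def consecutive_pairs(log):
    """Return ((user_a, user_b), interarrival_time) for each pair of
    consecutive messages in a log of (time, user) tuples."""
    pairs = []
    for (t1, u1), (t2, u2) in zip(log, log[1:]):
        pairs.append(((u1, u2), t2 - t1))
    return pairs


# The sample log from the slide.
sample_log = [(6, 20), (7, 15), (9, 61), (12, 41), (12, 24)]
pairs = consecutive_pairs(sample_log)
# pairs == [((20, 15), 1), ((15, 61), 2), ((61, 41), 3), ((41, 24), 0)]
```

Applied to the sample log, this reproduces the node-pair and interarrival list shown on the slide.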

Algorithms – Kmeans
Simple Clustering (K-Means)
- The K-means clustering algorithm is applied to the interarrival list.
- It generates two clusters:
  - Red: pairs that have small interarrival times.
  - Blue: pairs that have large interarrival times.

Algorithms – Kmeans (cont.)
Simple Clustering (K-Means)
- K-means clustering on the interarrival list
- Generates two clusters:
  - Red: pairs that have small interarrival times
  - Blue: pairs that have large interarrival times
- Declares:
  - Red pairs as not engaged in conversation
  - Blue pairs as engaged in conversation
- Idea: the interarrival time between messages of two users who exchange messages over a subtopic cannot be smaller than a threshold:
  - It takes time for a user to read, interpret, and prepare an answer.
  - The network and servers introduce additional latency.
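The two-cluster split can be sketched with a small one-dimensional K-means (K = 2) in plain Python. This is an illustrative sketch only: the slides do not say how the centers are initialized or how repeated observations of the same user pair are combined, so the sketch assumes the centers start at the smallest and largest interarrival times and classifies each observation separately. The function names and the example user ids are hypothetical.

```python
def two_means_1d(values, iterations=100):
    """1-D K-means with K = 2; returns (small_center, large_center)."""
    lo, hi = min(values), max(values)   # assumed initialization
    if lo == hi:
        return lo, hi
    for _ in range(iterations):
        small = [v for v in values if abs(v - lo) <= abs(v - hi)]
        large = [v for v in values if abs(v - lo) > abs(v - hi)]
        new_lo = sum(small) / len(small)
        new_hi = sum(large) / len(large)
        if (new_lo, new_hi) == (lo, hi):   # assignments no longer change
            break
        lo, hi = new_lo, new_hi
    return lo, hi


def split_red_blue(observations):
    """Split ((user_a, user_b), interarrival) observations into
    red (small interarrival) and blue (large interarrival) pairs."""
    times = [t for _, t in observations]
    lo, hi = two_means_1d(times)
    red = [p for p, t in observations if abs(t - lo) <= abs(t - hi)]
    blue = [p for p, t in observations if abs(t - lo) > abs(t - hi)]
    return red, blue


# Made-up observations for illustration.
observations = [
    (("u1", "u2"), 0), (("u2", "u3"), 1), (("u3", "u1"), 1), (("u1", "u4"), 2),
    (("u4", "u5"), 9), (("u5", "u4"), 10), (("u4", "u6"), 12),
]
red, blue = split_red_blue(observations)
```

Here the four observations with interarrival times 0 to 2 end up red and the three with times 9 to 12 end up blue.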

Algorithms – Kmeans (cont.)
- Issues:
  - Incomplete: it does not identify the members of subtopics (conversing groups).
  - May include contradictory information:
    - For a group of three users a, b, c,
    - (a,b) and (a,c) are blue, (b,c) is red.
    - Are (a,b,c) in the same subgroup?
- Algorithms Connect and Color_and_Merge:
  - Reconcile possible contradictions
  - Produce complete output

Algorithms – Connect
- Takes the blue and red clusters
- Trusts the blue cluster
- Considers the blue cluster as the edge set of a graph B
- Finds the connected components of B
  - breadth-first search on B
- Consider the previous example:
  - For a group of three users a, b, c,
  - (a,b) and (a,c) are blue, (b,c) is red.
  - Connect concludes that (a,b,c) are in the same subgroup.
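A minimal Python sketch of Connect follows: it treats the blue pairs as undirected edges and finds connected components with breadth-first search. The function name `connect` and the extra user `d` are assumptions added for illustration.

```python
from collections import defaultdict, deque


def connect(blue_pairs, users):
    """Return the connected components of the graph whose edges are blue pairs."""
    adjacency = defaultdict(set)
    for a, b in blue_pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    groups = []
    for start in users:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:                      # breadth-first search from start
            user = queue.popleft()
            component.append(user)
            for neighbor in adjacency[user]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        groups.append(sorted(component))
    return groups


# The example from the slide, plus an isolated user d.
groups = connect([("a", "b"), ("a", "c")], ["a", "b", "c", "d"])
# groups == [['a', 'b', 'c'], ['d']]
```

The red pair (b,c) is ignored entirely, which is what "trusts the blue cluster" amounts to in this sketch.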

Algorithms – Color
- Takes the blue and red clusters
- Trusts the red cluster more than the blue cluster
- Considers the red cluster as the edge set of a graph R
- Applies vertex coloring
  - Uses the Greedy heuristic to find an approximate solution
- Generates color classes
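The coloring step can be sketched as a standard greedy vertex coloring of the red graph R. The slides do not state the vertex order used by the greedy heuristic, so this sketch assumes vertices are visited in order of decreasing degree, with ties broken by name. The function name `greedy_color` is hypothetical.

```python
from collections import defaultdict


def greedy_color(red_pairs, users):
    """Greedy vertex coloring of the red graph; returns the color classes."""
    adjacency = defaultdict(set)
    for a, b in red_pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)

    # Assumed visiting order: highest degree first, ties broken by name.
    order = sorted(users, key=lambda u: (-len(adjacency[u]), u))
    color = {}
    for user in order:
        used = {color[v] for v in adjacency[user] if v in color}
        c = 0
        while c in used:          # smallest color not used by a red neighbor
            c += 1
        color[user] = c

    classes = defaultdict(list)
    for user in users:
        classes[color[user]].append(user)
    return [classes[c] for c in sorted(classes)]


# Red pairs say a and b are each not conversing with c or d.
classes = greedy_color(
    [("a", "c"), ("a", "d"), ("b", "c"), ("b", "d")],
    ["a", "b", "c", "d"],
)
# classes == [['a', 'b'], ['c', 'd']]
```

No two users joined by a red pair share a color class, so each class is a candidate conversing group that respects the red cluster.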

Algorithms – Merge
- Takes the color classes generated by Color
- For each pair of color classes C1 and C2:
  - e_b = number of user pairs (x, y) where
    - (x, y) is in the blue cluster, and
    - (x in C1 and y in C2) or (y in C1 and x in C2)
  - If e_b / (|C1| · |C2|) ≥ threshold, merge C1 and C2
- For our model, we found that a threshold of 0.7 gives good results.
- Announce the final color classes as subtopics.
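The merge rule can be sketched in Python as below. The slides describe the test applied to a pair of color classes but not the order in which pairs are visited, so this sketch assumes the scan restarts after every merge and stops when no pair reaches the threshold. The function name `merge_classes` is hypothetical, and blue pairs are treated as unordered and counted once.

```python
def merge_classes(classes, blue_pairs, threshold=0.7):
    """Merge color classes whose blue-edge density reaches the threshold."""
    blue = {frozenset(pair) for pair in blue_pairs}
    groups = [set(c) for c in classes]

    merged = True
    while merged:                 # assumed: repeat until no pair qualifies
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                # e_b: blue pairs with one user in each class
                e_b = sum(
                    1
                    for x in groups[i]
                    for y in groups[j]
                    if frozenset((x, y)) in blue
                )
                if e_b / (len(groups[i]) * len(groups[j])) >= threshold:
                    groups[i] |= groups[j]
                    del groups[j]
                    merged = True
                    break
            if merged:
                break
    return [sorted(g) for g in groups]


subtopics = merge_classes([["a"], ["b"], ["c", "d"]], [("a", "b")])
# subtopics == [['a', 'b'], ['c', 'd']]
```

In the example, classes {a} and {b} have density 1/1 and merge, while the merged class and {c, d} share no blue pair and stay apart.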

Tests
- The parameters of the model are tuned according to observations and statistical analysis of real chatroom logs.
- A user pair that is correctly announced as being in the same subtopic is accepted as a correct result.
- Success rate = # correct results / all
- The following slide lists results for:
  - 5 topics, 50 users
  - 5 topics, 75 users
  - 10 topics, 50 users
  - 10 topics, 75 users

Results
- For a sufficiently long log, all algorithms converge to 100% success.
  - The critical factor is the number of messages per user.
  - As the number of users increases, larger logs are required.
- The Color_and_Merge algorithm provides the best result.
  - It converges to 100% success very quickly.
- Connect is the algorithm most sensitive to log size.
  - As the log size decreases, Connect fails faster.
  - Why? A single false edge may connect two components, yielding many false results.

Conclusion
- We presented a model and showed that, for this model, it is possible to accurately determine the conversations.
- The ideas can be generalized to more elaborate models.
- Future work:
  - Enhance the model:
    - Users may belong to multiple conversations.
    - Users may switch between conversations.
  - Apply the algorithms to real chatroom logs.