NAACSOS 2005Scott Christley, Temporal Analysis of Social Positions An Algorithm for Temporal Analysis of Social Positions Scott Christley, Greg Madey Dept.

Slides:



Advertisements
Similar presentations
Unsupervised Learning
Advertisements

1 Machine Learning: Lecture 10 Unsupervised Learning (Based on Chapter 9 of Nilsson, N., Introduction to Machine Learning, 1996)
Probabilistic Clustering-Projection Model for Discrete Data
Regression Part II One-factor ANOVA Another dummy variable coding scheme Contrasts Multiple comparisons Interactions.
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
HICSS 2007Analysis of Activity Analysis of Activity in the Open Source Software Development Community Scott Christley and Greg Madey Dept. of Computer.
Prénom Nom Document Analysis: Data Analysis and Clustering Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008.
Slide 1 EE3J2 Data Mining Lecture 16 Unsupervised Learning Ali Al-Shahib.
Data Mining with Decision Trees Lutz Hamel Dept. of Computer Science and Statistics University of Rhode Island.
Switch to Top-down Top-down or move-to-nearest Partition documents into ‘k’ clusters Two variants “Hard” (0/1) assignment of documents to clusters “soft”
Lecture 2: Basic steps in SPSS and some tests of statistical inference
Supported in part by the National Science Foundation – ISS/Digital Science & Technology Analysis of the Open Source Software development community using.
Ch 15 - Chi-square Nonparametric Methods: Chi-Square Applications
Radial Basis Function Networks
1 Chapter 20 Two Categorical Variables: The Chi-Square Test.
Evaluating Performance for Data Mining Techniques
AM Recitation 2/10/11.
Analysis and Modeling of the Open Source Software Community Yongqin Gao, Greg Madey Computer Science & Engineering University of Notre Dame Vincent Freeh.
Copyright, Gerry Quinn & Mick Keough, 1998 Please do not copy or distribute this file without the authors’ permission Experimental design and analysis.
Chapter 26: Comparing Counts AP Statistics. Comparing Counts In this chapter, we will be performing hypothesis tests on categorical data In previous chapters,
Unsupervised Learning and Clustering k-means clustering Sum-of-Squared Errors Competitive Learning SOM Pre-processing and Post-processing techniques.
Unsupervised Learning Reading: Chapter 8 from Introduction to Data Mining by Tan, Steinbach, and Kumar, pp , , (
Machine Learning1 Machine Learning: Summary Greg Grudic CSCI-4830.
Comparing Two Proportions
COMMON EVALUATION FINAL PROJECT Vira Oleksyuk ECE 8110: Introduction to machine Learning and Pattern Recognition.
A Scalable Self-organizing Map Algorithm for Textual Classification: A Neural Network Approach to Thesaurus Generation Dmitri G. Roussinov Department of.
Feature selection LING 572 Fei Xia Week 4: 1/29/08 1.
Experimental Evaluation of Learning Algorithms Part 1.
Data Reduction. 1.Overview 2.The Curse of Dimensionality 3.Data Sampling 4.Binning and Reduction of Cardinality.
Biostatistics, statistical software VII. Non-parametric tests: Wilcoxon’s signed rank test, Mann-Whitney U-test, Kruskal- Wallis test, Spearman’ rank correlation.
Topology and Evolution of the Open Source Software Community Advisors: Dr. Vincent W. Freeh Dr. Kevin Bowyer Supported in part by the National Science.
Copyright © 2014, 2011 Pearson Education, Inc. 1 Chapter 18 Inference for Counts.
Social Science Research Design and Statistics, 2/e Alfred P. Rovai, Jason D. Baker, and Michael K. Ponton Chi-Square Goodness-of-Fit Test PowerPoint Prepared.
Section Copyright © 2014, 2012, 2010 Pearson Education, Inc. Lecture Slides Elementary Statistics Twelfth Edition and the Triola Statistics Series.
Experimental Design and Analysis Instructor: Mark Hancock February 8, 2008Slides by Mark Hancock.
Analysis of Qualitative Data Dr Azmi Mohd Tamil Dept of Community Health Universiti Kebangsaan Malaysia FK6163.
Agent 2004Scott Christley, Public Goods Theory of Open Source Community Public Goods Theory of the Open Source Development Community using Agent-based.
Nonparametric Statistics
Chapter 22: Comparing Two Proportions. Yet Another Standard Deviation (YASD) Standard deviation of the sampling distribution The variance of the sum or.
Data Analysis for Two-Way Tables. The Basics Two-way table of counts Organizes data about 2 categorical variables Row variables run across the table Column.
Unsupervised Learning. Supervised learning vs. unsupervised learning.
Chapter 15 – Analysis of Variance Math 22 Introductory Statistics.
CHI SQUARE TESTS.
Copyright © 2010 Pearson Education, Inc. Slide
Comparing Counts.  A test of whether the distribution of counts in one categorical variable matches the distribution predicted by a model is called a.
Chapter 11 Statistical Techniques. Data Warehouse and Data Mining Chapter 11 2 Chapter Objectives  Understand when linear regression is an appropriate.
Advanced Analytics on Hadoop Spring 2014 WPI, Mohamed Eltabakh 1.
Section Copyright © 2014, 2012, 2010 Pearson Education, Inc. Lecture Slides Elementary Statistics Twelfth Edition and the Triola Statistics Series.
By Timofey Shulepov Clustering Algorithms. Clustering - main features  Clustering – a data mining technique  Def.: Classification of objects into sets.
Clustering Clustering is a technique for finding similarity groups in data, called clusters. I.e., it groups data instances that are similar to (near)
Analyzing Expression Data: Clustering and Stats Chapter 16.
Inference ConceptsSlide #1 1-sample Z-test H o :  =  o (where  o = specific value) Statistic: Test Statistic: Assume: –  is known – n is “large” (so.
Outline of Today’s Discussion 1.The Chi-Square Test of Independence 2.The Chi-Square Test of Goodness of Fit.
ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition Objectives: Statistical Significance Hypothesis Testing.
1 Pattern Recognition: Statistical and Neural Lonnie C. Ludeman Lecture 28 Nov 9, 2005 Nanjing University of Science & Technology.
Confidence Intervals INTRO. Confidence Intervals Brief review of sampling. Brief review of the Central Limit Theorem. How do CIs work? Why do we use CIs?
+ Section 11.1 Chi-Square Goodness-of-Fit Tests. + Introduction In the previous chapter, we discussed inference procedures for comparing the proportion.
Machine Learning in Practice Lecture 21 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute.
Non-parametric Methods for Clustering Continuous and Categorical Data Steven X. Wang Dept. of Math. and Stat. York University May 13, 2010.
Lake Arrowhead 2005Scott Christley, Understanding Open Source Understanding the Open Source Software Community Presented by Scott Christley Dept. of Computer.
Lesson Runs Test for Randomness. Objectives Perform a runs test for randomness Runs tests are used to test whether it is reasonable to conclude.
Chapter 11: Categorical Data n Chi-square goodness of fit test allows us to examine a single distribution of a categorical variable in a population. n.
Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods.
Chapter 22 Inferential Data Analysis: Part 2 PowerPoint presentation developed by: Jennifer L. Bellamy & Sarah E. Bledsoe.
Goodness-of-Fit A test of whether the distribution of counts in one categorical variable matches the distribution predicted by a model is called a goodness-of-fit.
LECTURE 33: STATISTICAL SIGNIFICANCE AND CONFIDENCE (CONT.)
Simulation-Based Approach for Comparing Two Means
Chapter 11: Inference for Distributions of Categorical Data
Inferences Between Two Variables
Evaluation David Kauchak CS 158 – Fall 2019.
Presentation transcript:

NAACSOS 2005Scott Christley, Temporal Analysis of Social Positions An Algorithm for Temporal Analysis of Social Positions Scott Christley, Greg Madey Dept. of Computer Science and Engineering University of Notre Dame Supported in part by National Science Foundation, CISE/IIS-Digital Society & Technology, under Grant No

NAACSOS 2005Scott Christley, Temporal Analysis of Social Positions Motivation Successful software development requires various positions to be filled; developers, testers, administrators, management, end-users, etc. People in Open Source Software community self- select into a social position on a software project. We don’t know what these positions are; emerged from the self-organization of the community. Do people stay in same social position, or does there position change over time? Positional analysis seeks to group actors into disjoint subsets according to their social position in the network.

NAACSOS 2005Scott Christley, Temporal Analysis of Social Positions Structural Equivalence Actors who are similarly embedded occupy similar social position. C ~ D have same relationships with same other actors. Exact equivalence is too strict so use an approximate measure, like Euclidean distance. Weighted relationships C D E AB

NAACSOS 2005Scott Christley, Temporal Analysis of Social Positions Clustering Standard data mining algorithms –K-means, Expectation-Minimization (EM) What’s wrong with Euclidean distance? –Data mapped to points in an N-dimensional space. –Points “close” in space are in same cluster. –Normalization techniques very important. –Not comparing the underlying distributions. Assume Gaussian (normal) distribution What can we use instead of a distance metric? –Statistical test

NAACSOS 2005Scott Christley, Temporal Analysis of Social Positions Clustering with a Statistical Test Fisher’s contingency-table test (non-parametric) –Chi-square family of goodness-of-fit tests Given two independent samples –First sample, S 1, with n 1 random variables –Second sample, S 2, with n 2 random variables –Where n 1 not necessarily equal to n 2, each r.v. in each samples placed in one of C categories. H 0 : The distributions of S 1 and S 2 do not differ. H A : The distributions S 1 and S 2 differ. Structural In-equivalence

NAACSOS 2005Scott Christley, Temporal Analysis of Social Positions Algorithm (Intersection) While (still unclustered samples) Put all unclustered samples into one cluster. While (some samples not yet pairwise compared) A = Pick sample from cluster For each other sample, B, in cluster Run statistical test on A and B. If significant result Remove B from cluster. Rejection of null hypothesis means A and B must be in different clusters. Confidence level tightens/broadens cluster inclusion. Any statistical test for a two-sided test problem.

NAACSOS 2005Scott Christley, Temporal Analysis of Social Positions OSS Activity User performs an activity for a project. 21 activities; submit bug, submit feature request, assign bug, post forum message, create file release, create project task, etc. Multi-relational, weighted, bipartite network. –Activity = relation, weight = activity count Activity distribution for user/project pair defines a sample for our statistical test. That is, the activity a user performs on a project defines their social position for that project.

NAACSOS 2005Scott Christley, Temporal Analysis of Social Positions Social Positions of OSS Social PositionSize# of clusters Brief Flame Message Posting Task Management27625 Release Management65095 Documentation12664 Job Posting8992 Artifact Management16746 Administrators Not Categorized Total User/Project Pairs209994

NAACSOS 2005Scott Christley, Temporal Analysis of Social Positions Temporal Analysis Previous analysis, activity over 10 years, lose knowledge of evolution of positions. How to deal with time (data)? –Global time; snapshot of the whole network at points in time: node/edge add/remove, attribute change, tends to get aggregate measures. –Local time; user/project’s first activity is time 0, aligns actors in a time-relative way to the network, egocentric viewpoint. Chunk data into monthly activity, run clustering algorithm for data for each time period.

NAACSOS 2005Scott Christley, Temporal Analysis of Social Positions Temporal Social Positions of OSS Social PositionPeriod 1Period 2Period 3Period 4 Brief Flame Message Posting Administrators Release Management Task Management Artifact Management Documentation Job Posting Not Categorized Handyperson Total User/Project Pairs Total Clusters

NAACSOS 2005Scott Christley, Temporal Analysis of Social Positions Summary Clustering algorithm using a statistical test. –Don’t have to specify # of clusters a priori. –No assumption of underlying distribution. –Must be appropriate statistical test. Temporal Analysis –How you organize/view your data is important. –Global metrics --> global time –Egocentric measures --> local time

NAACSOS 2005Scott Christley, Temporal Analysis of Social Positions Iterative Classification Order of comparison matters. Clustering is NP-complete so intractable to check all combinations to find the optimal. Iterative approach –Perform initial clustering –Calculate cluster center