1
GitHub Insights: Understanding Open Source
@jeffmcaffer – Microsoft
Georgios Gousios – Delft University of Technology (TU Delft)
Kevin Lewis – Microsoft
2
Snapshot overview
3
Inspire confidence
4
How open is a project?
  % commits from project community vs core team
  % comments and commenters
  % forks contributing
  PR lifelines
  …
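As an illustration of the first metric, here is a minimal sketch of how the share of commits from the core team versus the community could be computed. The function name `commit_share`, the author names, and the idea of defining the core team as a fixed set of logins are all assumptions made for this example; the slides do not say how the core team is identified.

```python
from collections import Counter


def commit_share(commit_authors, core_team):
    """Return the percentage of commits by core vs. community authors.

    commit_authors: iterable with one author login per commit (hypothetical input).
    core_team: set of logins treated as the core team (an assumption of this sketch).
    """
    counts = Counter(
        "core" if author in core_team else "community"
        for author in commit_authors
    )
    total = sum(counts.values())
    if total == 0:
        # No commits at all: report zero for both groups instead of dividing by zero.
        return {"core": 0.0, "community": 0.0}
    return {group: 100.0 * counts[group] / total for group in ("core", "community")}


# Made-up data: five commits, two of them by the single core member.
authors = ["alice", "bob", "alice", "carol", "dave"]
shares = commit_share(authors, core_team={"alice"})
print(shares)
```

The same counting approach would apply to comments and commenters by swapping the input list.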
5
Commits (core vs community)
6
Commits (origin)
7
Comments (core vs community)
8
PR lifelines
9
Are we using git in a distributed way?
10
How many devs are there per country?
11
Insights
12
Business insights
Project health
  Ours – are we building a good community?
  Yours – is this a good project to use/invest in?
Product adoption
  Trends for products, APIs, technologies, …
  Sample/tutorial effectiveness
API health and evolution
  How are people using our API?
  Are they using it “right”?
  Can it be improved?
  How many people would a change break?
13
Research insights
User characterization
  What is a Python developer? Is there such a thing?
  What else do they use?
Software engineering research
  Project behavior
  Collaborative development approaches (ICSE)
  Code evolution
  Biases – Gender, location, …
  Automated code review
14
Cross-domain insights
Mix this data with
  Social media data
  StackOverflow questions
  Slack conversations
  Sentiment analysis
  Customer satisfaction data
Get a more holistic view
15
Operational insights
Repository management
  Linting – license, readme, contributing, …
  Approvals – controlling public access
  Cataloging – giving structured access to an inherently unstructured world
  Reporting – compliance
  Security – API keys, etc.
User management
  2FA, de-provisioning, settings, …
  Org/team membership, CLAs, …
Multi-org insights
16
Approach Data for the masses
17
GitHub by the numbers (Mid 2016)
14 million users
35 million repos (~half private)
~1 million events per day
18
Approach @GHTorrent (http://ghtorrent.org)
Software engineering research project
Open collaboration with 100s of users and researchers
Archive of ALL public events
Complete capture of entities related to each event
Enable the analytics that people need
Apply “Big Data” techniques
Visualizations
19
How does it work? Events or Webhooks
For each event, walk the JSON recursively
Store results in MongoDB tables by entity type
Remember relationships in MySQL
Periodically revisit entities to update
Handle missed events
Handle absent events
Deal with deletes and updates (GH only emits “create” events)
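The recursive walk can be pictured with a short sketch. This is not GHTorrent's actual code: the `walk` helper, the sample event values, and the `example.test` URLs are all made up, and a plain dict stands in for the MongoDB and MySQL stores. The sketch only shows the idea of visiting every nested object in an event and grouping the entity URLs it finds by the field they appeared under.

```python
def walk(node, path=()):
    """Yield (path, url) for every "url" field found anywhere in a JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "url" and isinstance(value, str):
                yield path, value
            else:
                # Descend into nested objects, remembering which key led here.
                yield from walk(value, path + (key,))
    elif isinstance(node, list):
        for item in node:
            # List items share the path of the list that contains them.
            yield from walk(item, path)


# Hypothetical event with placeholder ids and URLs (not real GitHub data).
event = {
    "type": "PushEvent",
    "actor": {"id": 1, "url": "https://api.example.test/users/octo"},
    "repo": {"id": 2, "url": "https://api.example.test/repos/octo/demo"},
    "payload": {
        "push_id": 3,
        "commits": [{"url": "https://api.example.test/repos/octo/demo/commits/abc"}],
    },
}

# Group URLs by the innermost key they were found under; a stand-in for
# "store results by entity type".
by_entity = {}
for path, url in walk(event):
    entity = path[-1] if path else "event"
    by_entity.setdefault(entity, []).append(url)

print(by_entity)
```

A real collector would then fetch each URL and repeat the walk on the response, which is where the periodic revisits and missed-event handling come in.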
20
Example event (condensed)
{ "id": "…", "type": "PushEvent", "actor": {"id": …, "url": "…"}, "repo": {"id": …, "url": "…"}, "payload": {"push_id": …, "commits": [{"url": "…"}]}, "org": {"id": …, "url": "…"} }
21
Entities
Commits, Commit comments, Events, Followers, Forks, Issues, Issue comments, Issue events, Orgs, Org members, Pull request comments, Pull requests, Repo collaborators, Repo labels, Repos, Users, Watchers
22
GHTorrent architecture
23
GHTorrent by the numbers
Data from Feb 2012 to present
~5B event rows in MySQL
~10TB of entity data in MongoDB
Growing by 10GB per day
24
Using the data You can do it too!
25
Using the data: Hosted
Online – Query live MySQL and MongoDB
Convenient, nothing to get or install
Great for point investigations
100 second query limit
26
Using the data: Download
Get MySQL and MongoDB dumps from ghtorrent.org
Run your own database servers
Full control and power to query as needed
27
Using the data: Self-service
Install and configure your own GHTorrent
Seed with existing data or start fresh
Seeding can take a while
Use webhooks instead of events
Enable tracking of your private repos
Need to get API key sets to avoid throttling
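One way to read "API key sets" is a pool of tokens used in rotation so that no single key absorbs all the requests. The sketch below assumes that reading; `TokenPool`, `next_headers`, and the token strings are hypothetical and are not part of GHTorrent. It only builds request headers and makes no network calls.

```python
from itertools import cycle


class TokenPool:
    """Hand out API tokens in round-robin order (illustrative only)."""

    def __init__(self, tokens):
        if not tokens:
            raise ValueError("at least one token is required")
        self._tokens = cycle(tokens)

    def next_headers(self):
        # The GitHub API accepts a token in the Authorization header.
        return {"Authorization": f"token {next(self._tokens)}"}


# Placeholder tokens; real ones would come from configuration, not source code.
pool = TokenPool(["TOKEN_A", "TOKEN_B"])
headers = [pool.next_headers() for _ in range(3)]
print(headers)
```

A fuller version would also watch the rate-limit information returned with each response and skip tokens that are exhausted, which this sketch leaves out.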
28
Using the data: Azure Data Lake
Scale out in Azure Data Lake
Store
  ghtorrent.org data pumped into Data Lake Storage
  Exposes a WebHDFS access layer
Compute
  Data Lake Analytics and query using U-SQL
  Hadoop
  Spark
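Because the store exposes a WebHDFS layer, files can in principle be addressed with standard WebHDFS REST URLs of the form `/webhdfs/v1/<path>?op=<OPERATION>`. The sketch below only builds such a URL. The account name, the folder path, and the `azuredatalakestore.net` host pattern are assumptions made for illustration; the slides do not give the actual location of the data, and authentication is not shown.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(account, path, op="LISTSTATUS", **params):
    """Build a WebHDFS REST URL for a path in a Data Lake Store account.

    account and path are placeholders supplied by the caller; the host
    pattern is an assumption of this sketch.
    """
    base = f"https://{account}.azuredatalakestore.net/webhdfs/v1/"
    query = urlencode({"op": op, **params})
    # quote() keeps "/" as-is and escapes characters that are unsafe in a URL path.
    return f"{base}{quote(path.lstrip('/'))}?{query}"


# Hypothetical account and folder names.
url = webhdfs_url("myaccount", "/ghtorrent/events")
print(url)
```

`LISTSTATUS` lists a directory and `OPEN` reads a file; both are standard WebHDFS operations.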
29
Resources
http://ghtorrent.org
https://github.com/Microsoft/ghinsights
@gousiosg @jeffmcaffer @kelewis