GitHub Insights Understanding Open – Microsoft


1 GitHub Insights Understanding Open Source
@jeffmcaffer – Microsoft
Georgios Gousios – Delft University of Technology (TU Delft)
Kevin Lewis – Microsoft

2 Snapshot overview

3 Inspire confidence

4 How open is a project?
  % commits from project community vs core team
  % comments and commenters
  % forks contributing
  PR lifelines
  …
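The first metric on this slide can be sketched in a few lines. This is only an illustration, not code from the talk: the function name `community_share`, the author names, and the idea of passing the core team as a plain set are all assumptions made for the example.

```python
def community_share(commit_authors, core_team):
    """Return the percentage of commits authored outside the core team.

    commit_authors: one entry per commit (hypothetical login names).
    core_team: set of logins treated as the core team (an assumption;
    how a project defines "core" is up to the analyst).
    """
    if not commit_authors:
        return 0.0
    community = sum(1 for author in commit_authors if author not in core_team)
    return 100.0 * community / len(commit_authors)


# Hypothetical data: five commits, two of them from outside the core team.
authors = ["alice", "bob", "carol", "alice", "dave"]
core = {"alice", "bob"}
print(community_share(authors, core))  # 40.0
```

The same shape works for comments and commenters: swap the list of commit authors for a list of comment authors.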

5 Commits (core vs community)

6 Commits (origin)

7 Comments (core vs community)

8 PR lifelines

9 Are we using git in a distributed way?

10 How many devs are there per country?

11 Insights

12 Business insights
Project health
  Ours – are we building a good community?
  Yours – is this a good project to use/invest in?
Product adoption
  Trends for products, APIs, technologies, …
  Sample/tutorial effectiveness
API health and evolution
  How are people using our API?
  Are they using it “right”?
  Can it be improved?
  How many people would a change break?

13 Research insights
User characterization
  What is a Python developer? Is there such a thing?
  What else do they use?
Software engineering research
  Project behavior
  Collaborative development approaches (ICSE)
  Code evolution
  Biases – Gender, location, …
  Automated code review

14 Cross-domain insights
Mix this data with
  Social media data
  StackOverflow questions
  Slack conversations
  Sentiment analysis
  Customer satisfaction data
Get a more holistic view

15 Operational insights
Repository management
  Linting – license, readme, contributing, …
  Approvals – controlling public access
  Cataloging – giving structured access to an inherently unstructured world
  Reporting – compliance
  Security – API keys, etc.
User management
  2FA, de-provisioning, settings, …
  Org/team membership, CLAs, …
Multi-org insights

16 Approach Data for the masses

17 GitHub by the numbers (Mid 2016)
  14 Million users
  35 Million repos (~half private)
  ~1 Million events per day

18 Approach @GHTorrent (http://ghtorrent.org)
  Software engineering research project
  Open collaboration with 100s of users and researchers
  Archive of ALL public events
  Complete capture of entities related to each event
  Enable the analytics that people need
  Apply “Big Data” techniques
  Visualizations

19 How does it work?
  Events or Webhooks
  For each event, walk the JSON recursively
  Store results in MongoDB tables by entity type
  Remember relationships in MySQL
  Periodically revisit entities to update
  Handle missed events
  Handle absent events
  Deal with deletes and updates (GH only emits “create” events)
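The "walk the JSON recursively" step can be illustrated with a short sketch. This is not GHTorrent's own code; the function name `collect_urls`, the sample event, and the `example.test` URLs are hypothetical, and the sketch only finds the entity URLs that a crawler would then fetch and store.

```python
def collect_urls(node, path=""):
    """Yield (path, url) pairs for every "url" field in a nested JSON value."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key == "url" and isinstance(value, str):
                yield child, value
            else:
                # Descend into nested objects and arrays; scalars yield nothing.
                yield from collect_urls(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from collect_urls(item, f"{path}[{index}]")


# Hypothetical event with placeholder values (not real GitHub data).
event = {
    "type": "PushEvent",
    "actor": {"id": 1, "url": "https://api.example.test/users/1"},
    "repo": {"id": 2, "url": "https://api.example.test/repos/2"},
    "payload": {"commits": [{"url": "https://api.example.test/commits/abc"}]},
}

for path, url in collect_urls(event):
    print(path, url)
```

Each yielded path names the entity the URL belongs to (actor, repo, commit), which is the information needed to file the fetched result under the right entity type.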

20 Example event (condensed)
{
  "id": "…",
  "type": "PushEvent",
  "actor": {"id": …, "url": "…"},
  "repo": {"id": …, "url": "…"},
  "payload": {
    "push_id": …,
    "commits": [{"url": "…"}]
  },
  "org": {"id": …, "url": "…"}
}

21 Entities
Commits, Commit comments, Events, Followers, Forks, Issues, Issue comments, Issue events, Orgs, Org members, Pull request comments, Pull requests, Repo collaborators, Repo labels, Repos, Users, Watchers

22 GHTorrent architecture

23 GHTorrent by the numbers
  Data from Feb 2012 to present
  ~5B event rows in MySQL
  ~10TB of entity data in MongoDB
  Growing by 10GB per day

24 Using the data You can do it too!

25 Using the data: Hosted
  Online – Query live MySQL and MongoDB
  Convenient, nothing to get or install
  Great for point investigations
  100 second query limit

26 Using the data: Download
  Get MySQL and MongoDB dumps from ghtorrent.org
  Run your own database servers
  Full control and power to query as needed

27 Using the data: Self-service
  Install and configure your own GHTorrent
  Seed with existing data or start fresh
  Seeding can take a while
  Use webhooks instead of events
  Enable tracking of your private repos
  Need to get API key sets to avoid throttling

28 Using the data: Azure Data Lake
Scale out in Azure Data Lake
Store
  ghtorrent.org data pumped into Data Lake Storage
  Exposes a WebHDFS access layer
Compute
  Data Lake Analytics and query using U-SQL
  Hadoop
  Spark
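Because the store exposes a WebHDFS access layer, files can in principle be listed and read over the WebHDFS REST interface, whose URLs have the form `/webhdfs/v1/<path>?op=<OPERATION>`. The sketch below only builds such a URL. The helper name `webhdfs_url`, the host, and the `/ghtorrent/commits` path are hypothetical, and authentication, which a real Data Lake account would require, is left out.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op="LISTSTATUS", scheme="https", **params):
    """Build a WebHDFS REST URL for the given path and operation.

    LISTSTATUS (list a directory) and OPEN (read a file) are standard
    WebHDFS operations; host and path here are placeholders.
    """
    query = urlencode({"op": op, **params})
    return f"{scheme}://{host}/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"


# Hypothetical host and directory layout.
print(webhdfs_url("datalake.example.test", "/ghtorrent/commits"))
```

Building the URL separately from sending the request keeps the sketch independent of any particular HTTP client or credential scheme.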

29 Resources
  http://ghtorrent.org
  https://github.com/Microsoft/ghinsights
  @gousiosg @jeffmcaffer @kelewis

