Skip to content

Latest commit

 

History

History
113 lines (92 loc) · 4.58 KB

gcloud.md

File metadata and controls

113 lines (92 loc) · 4.58 KB
layout title tagline
page
GHTorrent on the Google cloud

GHTorrent can be accessed over Google Cloud services. To access the data requires you to have a Google Cloud account. Reasonable use is free of charge and, in the case of BigQuery, it should no longer require a credit card. (Pub/Sub still requires a credit card). You can check what Google considers reasonable at any given moment here.

Both services can be accessed through the Web, the command line (after installing the Google Cloud command line utils) or though various programming languages.

BigQuery

With BigQuery, you can query GHTorrent's MySQL dataset using an SQL-like language (lately, BigQuery also supports vanilla SQL); more importantly, you can join the dataset with other open datasets (e.g. GitHub's own project data, Reddit, TravisTorrent etc) hosted on BigQuery.

To get the most popular programming languages by number of bytes written, run the following:

{% highlight sql %} select pl3.lang, sum(pl3.size) as total_bytes from ( select pl2.bytes as size, pl2.language as lang from ( select pl.language as lang, max(pl.created_at) as latest, pl.project_id as project_id from [ght.project_languages] pl join [ght.projects] p on p.id = pl.project_id where p.deleted is false and p.forked_from is null group by lang, project_id ) pl1 join [ght.project_languages] pl2 on pl1.project_id = pl2.project_id and pl1.latest = pl2.created_at and pl1.lang = pl2.language ) pl3 group by pl3.lang order by total_bytes desc {% endhighlight %}

To get the user with the most Java commits in the Netherlands in June 2016, do the following:

{% highlight sql %} select u.login as login, u.location as location, count(c.id) as num_commits from [ghtorrent-bq.ght.project_commits] pc join (SELECT id, author_id FROM [ghtorrent-bq.ght.commits] WHERE date(created_at) between date('2016-06-01') and date('2016-07-01') )c on pc.commit_id = c.id join (SELECT id FROM [ghtorrent-bq.ght.projects] WHERE language = 'Java') p on p.id = pc.project_id join (SELECT login, location, id FROM [ghtorrent-bq.ght.users] WHERE country_code = 'nl') u on c.author_id = u.id, group by login, location order by num_commits desc; {% endhighlight %}

See also some queries by Felipe Hoffa.

Pub/Sub

Pub/Sub allows subscribers to get events of what is happening on GitHub (or at least GHTorrent's interpretation of what is happening on GitHub) in almost real time. To do so, one needs to subscribe to one of the available topics with a client in order to start receiving events.

The service is complimentary, even though less fine-grained, to GHTorrent's own streaming interface. As is also the case with GHTorrent streaming, the contents of the streams are generated by following the live MongoDB server replication stream. See the code here.

To subscribe to a topic, e.g. commits, run the following:

gcloud beta pubsub subscriptions create my_commits_subscription --topic projects/ghtorrent-bq/topics/commits

To start receiving events, you can try the command line

gcloud beta pubsub subscriptions pull --auto-ack --max-messages 5 -- my_commits_subscription

The available topics are the following:

{% highlight txt%} projects/ghtorrent-bq/topics/commits projects/ghtorrent-bq/topics/events projects/ghtorrent-bq/topics/followers projects/ghtorrent-bq/topics/forks projects/ghtorrent-bq/topics/issue_comments projects/ghtorrent-bq/topics/issue_events projects/ghtorrent-bq/topics/issues projects/ghtorrent-bq/topics/org_members projects/ghtorrent-bq/topics/pull_request_comments projects/ghtorrent-bq/topics/pull_requests projects/ghtorrent-bq/topics/repo_collaborators projects/ghtorrent-bq/topics/repo_labels projects/ghtorrent-bq/topics/repos projects/ghtorrent-bq/topics/users projects/ghtorrent-bq/topics/watchers {% endhighlight %}