list of sites we will scrap:
- Jenkinsfile
- travis yml
- gitlab yml
- circle ci yml
- go cd yml
Most of them can be done via search query into csv file. However JenkinsFile will be really hard to search for as it doesn't have a file extension to search for :(
pip install -r requirements.txt
create .env file with token inside named GITHUB_TOKEN
scraper.py to create the data
then run main.py with the .env setup to CHECK and then RENDER to create all the things
orca will be required in order to render the plotly sankey graph
npm install -g electron@1.8.4 orca1
how much things differ from standard templates for given languages? for travis and other languages
-
detect snippets as described in docs
-
before and after scripts
-
comment frequency and single line vs multi line
results need to publishable
do they use build matrices etc.?
what other enviroment variables do they use?
how do they use stages / tasks ? and how are they executed?
groups configuration lines of code distribution
distribution per language and the length of the ci script
plan:
- finish scrapping the data [ ]
- need to make this more consistent so we can get a long run for this!!! [X]
- group the data sets into one large one [X]
- process the data removing duplicates [X]
- get the graphs for language and thing type to create sepearte graphs for each thing
- get that to pick up on when there is no configuration in the % otherwise it is squed
- yaml parser project!
Bugs:
- none so far....
Fixed:
-
watchers aren't being picked up properly
-
consistent high speed data gathering is hard atm
- read time out
- 502 error
saving >>> drizzleDumper saving >>> DZNSegmentedControl saving >>> DLCImagePickerController Traceback (most recent call last): File "main.py", line 254, in <module> main() File "main.py", line 226, in main getReposStuff("raptor_webb", i, i + 100) File "main.py", line 185, in getReposStuff saveData = saveRepos(page, ["" for i in range(len(page))], name) File "main.py", line 42, in saveRepos dictionary = dict([(k, fixEncoding(getattr(repo, k))) for k in keys]) File "main.py", line 42, in <listcomp> dictionary = dict([(k, fixEncoding(getattr(repo, k))) for k in keys]) File "/home/eat/jl653/private/scraperGithub/venv2/lib/python3.6/site-packages/github/Repository.py", line 732, in topics self._completeIfNotSet(self._topics) File "/home/eat/jl653/private/scraperGithub/venv2/lib/python3.6/site-packages/github/GithubObject.py", line 262, in _completeIfNotSet self._completeIfNeeded() File "/home/eat/jl653/private/scraperGithub/venv2/lib/python3.6/site-packages/github/GithubObject.py", line 266, in _completeIfNeeded self.__complete() File "/home/eat/jl653/private/scraperGithub/venv2/lib/python3.6/site-packages/github/GithubObject.py", line 273, in __complete self._url.value File "/home/eat/jl653/private/scraperGithub/venv2/lib/python3.6/site-packages/github/Requester.py", line 268, in requestJsonAndCheck return self.__check(*self.requestJson(verb, url, parameters, headers, input, self.__customConnection(url))) File "/home/eat/jl653/private/scraperGithub/venv2/lib/python3.6/site-packages/github/Requester.py", line 279, in __check raise self.__createException(status, responseHeaders, output) github.GithubException.GithubException: 502 {"message": "Server Error"}
so probably could but an exception handler around this to ignore it maybe???
-
jenkins configuration wasn't being picked up as we were only getting .yml or .yaml file types
- this should be good validation but do we want data for weird files that don't match???
csv -> line -> base64 -> text -> yaml
so it seems like it could be possible to create a python set of classes for parsing the yaml or we could just have them as yaml objects
yaml:
- present:
- build matrices
- enviroment varaibles and secrets
- analyses
- how do they use stages and tasks? what are common names etc. how does this differ per langauge
- comments lines and line vs blocks of comments
- adv analysis:
- how much things differ from standard templates for given languages? for travis and other languages
- detect snippets as described in docs
- before and after scripts
- how much things differ from standard templates for given languages? for travis and other languages
- loc [x]
what do people use CI for???? testing or deployment or notifications for deplosy: the tag and maybe string search in noticiations but not 100% reliable as scripts
groups configuration lines of code distribution
distribution per language and the length of the ci script