Refactor regression package and add unit tests and Travis CI configuration #3

Merged 4 commits on Dec 6, 2013


JoshRosen
Contributor

This pull request refactors the regression package and adds unit tests. For now, the tests only verify that the code doesn't crash. To allow the regression code to be tested automatically, I refactored the actual job into a function that accepts a SparkContext and arguments, then added a __main__ block that calls it.
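The shape of that refactoring can be sketched roughly as follows. This is an illustration rather than the code in this pull request: the names build_parser, regress, and main, the three positional arguments, and the placeholder job body are all assumptions.

```python
import argparse


def build_parser():
    """Command-line interface for a hypothetical regression script."""
    parser = argparse.ArgumentParser(description="run a regression job")
    parser.add_argument("master", help="Spark master URL, e.g. local[2]")
    parser.add_argument("datafile", help="path to the input data")
    parser.add_argument("outputdir", help="directory for the results")
    return parser


def regress(sc, datafile, outputdir):
    """The job itself; it only needs an already-built SparkContext.

    Placeholder body (an assumption): count the lines of the input file.
    outputdir is unused here; a real job would write its results there.
    """
    return sc.textFile(datafile).count()


def main(argv=None):
    """Parse arguments, build the context, and run the job."""
    # Imported lazily so the tests can call regress() without Spark installed.
    from pyspark import SparkContext
    args = build_parser().parse_args(argv)
    sc = SparkContext(args.master, "regress")
    try:
        return regress(sc, args.datafile, args.outputdir)
    finally:
        sc.stop()
```

A script built this way would end with the usual `if __name__ == "__main__": main()` guard, while a test can pass its own context (or a stand-in object) straight to regress().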

I used argparse to process command line arguments; it's included in the standard library in Python 2.7+ and available via pip/easy_install argparse for earlier Python versions. Down the line, it might be cool to implement some common optional arguments across all of the scripts; for example, there could be a --serializer option for using a custom PySpark serializer. Argparse has a lot of other cool features; its sub-commands feature could be used to implement a single thunder script with multiple sub-commands (like thunder regress, thunder tuning, etc).
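As a rough sketch of both ideas, argparse's parent parsers can hold the shared options and add_subparsers can provide the sub-commands. The --serializer option, its default value, and the datafile argument below are hypothetical; nothing like this exists in the scripts yet.

```python
import argparse


def build_parser():
    """Build a single 'thunder' command with shared options (a sketch)."""
    # Options common to every sub-command live on a parent parser.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("--serializer", default="pickle",
                        help="hypothetical: which PySpark serializer to use")

    parser = argparse.ArgumentParser(prog="thunder")
    subparsers = parser.add_subparsers(dest="command")

    # Each sub-command inherits the common options via parents=[...].
    regress = subparsers.add_parser("regress", parents=[common])
    regress.add_argument("datafile")

    tuning = subparsers.add_parser("tuning", parents=[common])
    tuning.add_argument("datafile")
    return parser
```

With this layout, `thunder regress --serializer marshal data.txt` would parse to command="regress", serializer="marshal", datafile="data.txt".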

To run the tests locally:

cd python/test
export SPARK_HOME=...
./run-tests.sh

This uses the nosetests test runner.

To convince myself that my refactorings didn't alter the code's behavior, I added a matdiff.py script for comparing the .mat files in output directories.
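For readers who have not opened the script, a comparison along these lines might look like the sketch below. It is not the contents of matdiff.py; the function names, the tolerance, and the report format are assumptions, and the default loader needs SciPy and NumPy.

```python
import os


def load_mat(path):
    """Load a .mat file into a dict of arrays, dropping loadmat's metadata keys."""
    from scipy.io import loadmat  # lazy import: only needed for real .mat files
    return {k: v for k, v in loadmat(path).items() if not k.startswith("__")}


def arrays_close(a, b, tol=1e-8):
    """True when two arrays have the same shape and nearly equal values."""
    import numpy as np
    return a.shape == b.shape and bool(np.allclose(a, b, atol=tol))


def diff_dirs(dir_a, dir_b, load=load_mat, close=arrays_close):
    """Return a list of human-readable differences between two output dirs."""
    problems = []
    names_a = set(f for f in os.listdir(dir_a) if f.endswith(".mat"))
    names_b = set(f for f in os.listdir(dir_b) if f.endswith(".mat"))
    # Files that exist on only one side are reported first.
    for name in sorted(names_a ^ names_b):
        problems.append("%s: present in only one directory" % name)
    # Files present on both sides are compared variable by variable.
    for name in sorted(names_a & names_b):
        vars_a = load(os.path.join(dir_a, name))
        vars_b = load(os.path.join(dir_b, name))
        for key in sorted(set(vars_a) | set(vars_b)):
            if key not in vars_a or key not in vars_b:
                problems.append("%s: variable %s missing on one side" % (name, key))
            elif not close(vars_a[key], vars_b[key]):
                problems.append("%s: variable %s differs" % (name, key))
    return problems
```

An empty list means the two directories agree within the tolerance, which is the property a refactoring check cares about.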

I also added a configuration for automatically running the unit tests on Travis CI. Here's the Travis page for my fork, showing test results from my latest commits: https://travis-ci.org/JoshRosen/thunder

If you want to run the Travis tests on your repository after merging this, log into https://travis-ci.org/ with your GitHub account and follow the instructions to enable the GitHub service integration hook for Travis.

@freeman-lab
Member

This all looks great! I'll refactor the other packages (factorization, etc.) in line with your changes here.

I like the idea of common arguments. I'm routinely running all these on much bigger data sets, so can definitely test out how the alternative serializer scales up (vs. data size and number of nodes) once that's ready.

The matdiff.py is also very useful, probably worth adding to the unit tests. For analyses with answers that vary with the initialization (e.g. kmeans, ICA), we can make sure the tests use a fixed initialization. For others, it won't matter.
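One way to pin the initialization, sketched here with the standard library only: draw the starting centers from a seeded random generator so that every run begins from the same state. The initial_centers helper and its seed parameter are hypothetical and are not part of the existing kmeans or ICA code.

```python
import random


def initial_centers(points, k, seed=None):
    """Pick k starting centers; a fixed seed makes the choice repeatable."""
    rng = random.Random(seed)  # private generator, global state untouched
    return rng.sample(points, k)


def test_initialization_is_reproducible():
    """nose-style test: the same seed must give the same starting centers."""
    points = [(float(i), float(i % 3)) for i in range(20)]
    first = initial_centers(points, 3, seed=42)
    second = initial_centers(points, 3, seed=42)
    assert first == second
    assert len(first) == 3
```

With the starting point fixed, the outputs of two runs can be compared directly, for example with matdiff.py.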

freeman-lab added a commit that referenced this pull request Dec 6, 2013
Refactor regression package and add unit tests and Travis CI configuration
@freeman-lab freeman-lab merged commit 5412b35 into thunder-project:master Dec 6, 2013