{{cookiecutter.project_name}}

This Azure Machine Learning project was generated using Cookiecutter along with @wmeints's azureml-cookiecutter template.

Although not required, we recommend you use virtualenv for your machine learning projects. You can install it using pip install virtualenv.

Create a virtual environment using the following command:

virtualenv -p 3.8.6 --pip 20.1 venv

When the installation is done, make sure you activate the Python environment.

  • PowerShell: .\venv\Scripts\Activate.ps1
  • Bash (Linux/macOS): source venv/bin/activate

To set up your project, follow these steps:

  • pip install -r requirements.txt
  • python tasks/make_workspace.py --name <my_workspace> --resource_group <my_resource_group>

You now have all the requirements for your project installed on your machine and a workspace to train your models in.

Please note, you only need to create the workspace once. Other team members can download the settings.json file from the Azure ML workspace and copy it into the .azureml folder within the project directory.

The following additional steps are useful to set up a dataset and compute environment for your project. We recommend at least setting up the compute environment as you need this to run experiments remotely.

  • python tasks/make_dataset.py --name <dataset_name> --input_file <my_data_file>
  • python tasks/make_environment.py --name <name>

The project contains a number of tasks. Tasks are implemented as Python scripts and allow you to manage your project. For example, there are tasks to manage the machine learning workspace, run an experiment in a remote environment, and deploy your model to production.

Check out the predefined tasks to get an idea of what they do and how to create your own tasks in the project.
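As a rough illustration of how a task script is put together, the sketch below shows a hypothetical workspace task built on the v1 azureml-core SDK that this template uses. It is not the exact contents of the generated tasks/make_workspace.py; argument names and defaults are assumptions.

# Hypothetical task script (sketch): create or fetch the Azure ML workspace.
from argparse import ArgumentParser

from azureml.core import Workspace

def main():
    parser = ArgumentParser(description="Create the Azure ML workspace for this project.")
    parser.add_argument("--name", required=True, help="Name of the workspace.")
    parser.add_argument("--resource_group", required=True, help="Resource group to create it in.")
    args = parser.parse_args()

    # exist_ok=True makes the call idempotent: it returns the existing workspace if it's already there.
    # The subscription is resolved from your current Azure login when not passed explicitly.
    workspace = Workspace.create(
        name=args.name,
        resource_group=args.resource_group,
        exist_ok=True,
    )

    # Persist the connection settings to the .azureml folder so later tasks can call Workspace.from_config().
    workspace.write_config()

if __name__ == "__main__":
    main()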

Alongside the tasks, there's the code that you use to train your model or perform inference with it. These files are stored in the {{cookiecutter.package_name}} folder.

The following files are created by default:

  • train.py - This file is used for training the model
  • score.py - This file is used for inferencing
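For reference, Azure ML scoring scripts follow an init()/run() contract: init() loads the model once when the service starts, and run() handles each request. The snippet below is a minimal sketch of what score.py could look like, assuming a scikit-learn model saved with joblib; the model file name is illustrative.

# Sketch of an Azure ML scoring script (illustrative only).
import json
import os

import joblib

model = None

def init():
    global model
    # AZUREML_MODEL_DIR points at the registered model files inside the deployed container.
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model.pkl")  # file name is an assumption
    model = joblib.load(model_path)

def run(raw_data):
    # Expect a JSON payload such as {"data": [[...feature values...]]}.
    data = json.loads(raw_data)["data"]
    predictions = model.predict(data)
    return predictions.tolist()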

We recommend you split data preprocessing from training as much as possible. We also recommend splitting your data preprocessing into several sub-tasks, each with its own script in the tasks folder. This will allow you to repeat a step when something fails.

To help you split the preprocessing steps from the training code, we added a data folder. This folder is split into three sub-folders: raw, interim, and processed. You can use these folders to store raw, intermediate, and fully pre-processed datasets. We recommend you create a preprocess.py script in the project that turns raw data into intermediate and fully processed datasets.
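As an illustration of that layout, a preprocess.py script could read from data/raw, write an intermediate result to data/interim, and write the final training set to data/processed. The file names and transformations below are hypothetical; adapt them to your own dataset.

# Hypothetical preprocessing script following the raw -> interim -> processed layout.
from pathlib import Path

import pandas as pd

RAW = Path("data/raw")
INTERIM = Path("data/interim")
PROCESSED = Path("data/processed")

def main():
    # Step 1: clean the raw data and store the intermediate result so this step can be repeated in isolation.
    raw = pd.read_csv(RAW / "input.csv")  # file name is an assumption
    cleaned = raw.dropna()
    INTERIM.mkdir(parents=True, exist_ok=True)
    cleaned.to_csv(INTERIM / "cleaned.csv", index=False)

    # Step 2: derive features from the cleaned data and store the fully processed dataset.
    features = cleaned.assign(total=cleaned.select_dtypes("number").sum(axis=1))
    PROCESSED.mkdir(parents=True, exist_ok=True)
    features.to_csv(PROCESSED / "train.csv", index=False)

if __name__ == "__main__":
    main()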

Use the tasks/make_dataset.py script to upload the datasets to the Azure ML workspace.

If you're working with a separate preprocess.py script, we recommend integrating it into the tasks/make_dataset.py script.
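Under the hood, a dataset task typically uploads the file to the workspace's default datastore and registers it as a dataset. A rough sketch using the v1 azureml-core SDK, with illustrative paths and names:

# Sketch: upload a local file to the default datastore and register it as a tabular dataset.
from azureml.core import Dataset, Workspace

workspace = Workspace.from_config()  # reads the settings stored in the .azureml folder

datastore = workspace.get_default_datastore()
datastore.upload_files(
    files=["data/processed/train.csv"],  # local file, name is an assumption
    target_path="datasets/my_dataset",   # path inside the datastore, also illustrative
    overwrite=True,
)

dataset = Dataset.Tabular.from_delimited_files(path=(datastore, "datasets/my_dataset/train.csv"))
dataset.register(workspace=workspace, name="my_dataset", create_new_version=True)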

This repository contains a folder called notebooks. You can add your Python notebooks to this folder. Use them as your scratch space to explore data.

We encourage you to use scripts instead of the notebooks for any production code. Notebooks have several limitations that will hurt you in the long run:

  • The execution order is determined by how you run the cells, which doesn't have to match the order in which they appear in the notebook file.
  • You can't run notebooks from the command line directly.
  • It's hard to merge the contents of a notebook should you run into merge conflicts in your source control environment.

Notebooks are great for exploring data and visualizing things. So we feel that they still have a place in this template.

You can use the reports folder to store any generated reports, such as reports generated by the pandas-profiling package.

We recommend generating reports to visualize the performance of your models, explain outcomes, or explore data.
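For example, generating a pandas-profiling report for a processed dataset and saving it to the reports folder only takes a few lines; the dataset path below is an assumption.

# Sketch: write a pandas-profiling report into the reports folder.
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("data/processed/train.csv")  # path is an assumption
profile = ProfileReport(df, title="Training data profile")
profile.to_file("reports/train_profile.html")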

To make documentation easier, we've included Sphinx docs in the project. You can generate the HTML documentation using the following command:

cd docs
make html

This command works on Windows, Mac, and Linux.

Please refer to the Sphinx documentation to learn more about writing rich documentation based on your code and custom restructured text documents.

Note, we're using NumPy style docstrings to document functions, methods, modules, and classes. Please consult the NumPy docstring style guide for more information.
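A NumPy-style docstring for a hypothetical helper function looks like this:

def split_features(frame, target_column):
    """Split a dataframe into features and target.

    Parameters
    ----------
    frame : pandas.DataFrame
        The dataset to split.
    target_column : str
        Name of the column that contains the target variable.

    Returns
    -------
    tuple of (pandas.DataFrame, pandas.Series)
        The feature columns and the target column.
    """
    return frame.drop(columns=[target_column]), frame[target_column]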

It's highly recommended to write automated tests. You can use pytest to run unit tests. We recommend placing the test code in a folder called tests in the root of the project. This isolates the tests from the rest of the project.
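For example, a small test module such as tests/test_preprocessing.py could exercise a helper from your package. The imported function is hypothetical; import whatever your package actually exposes. Run the tests with pytest from the project root.

# Hypothetical test module: tests/test_preprocessing.py
import pandas as pd

# The helper below is an assumption; replace it with functions from your own package.
from {{cookiecutter.package_name}}.preprocessing import split_features

def test_split_features_separates_target_column():
    frame = pd.DataFrame({"age": [21, 42], "label": [0, 1]})

    features, target = split_features(frame, target_column="label")

    assert "label" not in features.columns
    assert target.tolist() == [0, 1]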

We recommend installing the project using pip in editable mode by running the following command in the root of the project:

pip install -e .

Please note, setup.py contains a list of dependencies required by your project. The same list of dependencies is contained in the conda_dependencies.yml file. This duplication is required because setuptools can't read Anaconda dependency files. Whenever you change your project's dependencies, you'll need to update both setup.py and conda_dependencies.yml.
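As an example of keeping the two in sync, a dependency pinned in setup.py should appear with the same version in conda_dependencies.yml. The package names and versions below are purely illustrative.

# setup.py (sketch): install_requires must mirror the pip section of conda_dependencies.yml.
from setuptools import find_packages, setup

setup(
    name="{{cookiecutter.package_name}}",
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        "scikit-learn==0.24.2",  # illustrative pin; list the same version under pip: in conda_dependencies.yml
        "pandas==1.2.4",
    ],
)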