Initial commit
ryantwolf committed Mar 15, 2024
0 parents commit ea680b0
Showing 179 changed files with 21,765 additions and 0 deletions.
153 changes: 153 additions & 0 deletions .gitignore
@@ -0,0 +1,153 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
*.cpp

# C extensions
*.so
*.c

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# Dask cache
dask-worker-space/

# Data downloaded and generated when running the examples.
data/

# SLURM Files
*.out
*.err

# Text Editor / IDE Files
.vscode
3 changes: 3 additions & 0 deletions .style.yapf
@@ -0,0 +1,3 @@
[style]
based_on_style = google
indent_width = 2
118 changes: 118 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,118 @@
# Checklist

We are glad you are contributing to NeMo Curator! Before you make a PR, be sure to read over this guide in detail.
This checklist ensures that NeMo Curator stays easy to use for both users and developers.
Not every step is necessary for every contribution, so read the linked sections for more information about each item.

1. [Follow the general principles in your design](#general-principles)
1. [Write your code in the proper place](#code-structure)
1. [Write examples and documentation for using your code](#examples-and-documentation)
1. [Format using the style guide](#python-style)
1. [Write unit tests](#unit-tests)
1. [Make a pull request](#pull-requests-pr-guidelines)

## General principles
1. **User-oriented**: make it easy for end users, even at the cost of writing more code in the background.
1. **Robust**: make it hard for users to make mistakes.
1. **Reusable**: for every piece of code, think about how it can be reused in the future and make it easy to be reused.
1. **Readable**: code should be easy to read.
1. **Legal**: if you copy even one line of code from the Internet, make sure that its license is compatible with the license NeMo Curator uses. Give credit and link back to the code.
1. **Sensible**: code should make sense. If you think a piece of code might be confusing, write comments.

## Code Structure
The repository is home to flexible Python modules, sample scripts, tests, and more.
Here is a brief overview of where everything lives:
- [config](config/) - A collection of example configuration files for many of the curator's modules.
- [docs](docs/) - Walkthroughs and motivations for each of the modules.
- [examples](examples/) - Example scripts for how users may want to compose the curator.
- [nemo_curator](nemo_curator/) - The main home for all of NeMo Curator's Python APIs.
  - [modules](nemo_curator/modules) - Classes for the modules.
  - [filters](nemo_curator/filters) - Classes for the filters.
  - [utils](nemo_curator/utils) - Common utilities for file/network operations.
- [tests](tests/) - Unit tests for each module.

## Examples and Documentation
Examples provide an easy way for users to see how the curator works in action.
There should be at least one example per module in the curator.
They should be incredibly lightweight and rely on the core `nemo_curator` modules for their functionality.
Most should be designed so that a user can get up and running on their local machine, but distributed examples are welcome where they make sense.
Python scripts should be the primary way to showcase your module.
However, SLURM scripts or other cluster scripts should be included if special steps are needed to run the module.

The documentation should complement each example by going through the motivation behind why a user would use each module.
It should include both an explanation of the module and how it's used in its corresponding example.
The documentation should also cover potential pitfalls and performance considerations when running the module at scale.
The existing examples and documentation should serve as a good reference for what is expected.

## Python style
We use ``black`` as our style guide. To fix your formatting, run `pip install pre-commit && pre-commit install && pre-commit run --all`.

1. Include docstrings for every class and method exposed to the user.
1. Avoid wildcard imports (``from X import *``) unless ``__all__`` is defined in ``X.py``.
1. Minimize the use of ``**kwargs``.
1. Raising an error is preferred to ``assert``. Write ```if not X: raise Error``` instead of ```assert X```.
1. Classes are preferred to standalone methods.
1. Methods should be atomic. A method shouldn't be longer than 75 lines, i.e. it should fit on the screen without scrolling.
1. If a method has arguments that don't fit into one line, each argument should be in its own line for readability.
1. Add ``__init__.py`` for every folder.
1. F-strings are preferred to other forms of string formatting.
1. Loggers are preferred to print.
1. Private functions (functions whose names start with ``_``) shouldn't be called outside their host file.
1. If a comment lasts multiple lines, use ``'''`` instead of ``#``.
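
Several of the rules above can be seen together in a short sketch. The class below is hypothetical and is not part of `nemo_curator`; it only illustrates docstrings, raising an error instead of using ``assert``, f-strings, a logger instead of ``print``, and a private attribute that stays inside its own class.

```python
import logging

logger = logging.getLogger(__name__)


class WordCounter:
    """Counts whitespace-separated words in a document.

    Hypothetical example class used only to illustrate the style rules.
    """

    def __init__(self, min_words: int = 1):
        """Store the minimum number of words a document needs to be kept."""
        # Raise an explicit error instead of writing `assert min_words >= 0`.
        if min_words < 0:
            raise ValueError(f"min_words must be non-negative, got {min_words}")
        self._min_words = min_words

    def count(self, text: str) -> int:
        """Return the number of whitespace-separated words in `text`."""
        num_words = len(text.split())
        # Use a logger rather than print, and an f-string for the message.
        logger.debug(f"Counted {num_words} words")
        return num_words

    def keep(self, text: str) -> bool:
        """Return True if `text` has at least the minimum number of words."""
        return self.count(text) >= self._min_words
```

A caller would construct `WordCounter(min_words=2)` and call `keep(text)` on each document. An invalid argument fails with a `ValueError` even when Python runs with assertions disabled (`python -O`), which is one reason to prefer raising over ``assert``.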

## Unit tests
Unit tests should be simple and fast.
Developers should be able to run them frequently while developing without any slowdown.
```
pytest
# If you don't have an NVIDIA GPU, run:
# pytest --cpu
```
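
As a rough illustration of what "simple and fast" can look like, here is a minimal sketch of a test file that `pytest` would collect. The helper `normalize_whitespace` is a hypothetical stand-in for the code under test and is not a NeMo Curator function. Each test checks one behavior on a tiny in-memory input, so the whole file runs in milliseconds.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces (hypothetical helper)."""
    return " ".join(text.split())


def test_collapses_repeated_spaces():
    # One behavior per test keeps failures easy to diagnose.
    assert normalize_whitespace("a   b") == "a b"


def test_strips_leading_and_trailing_whitespace():
    assert normalize_whitespace("  a b\n") == "a b"


def test_empty_string_is_unchanged():
    assert normalize_whitespace("") == ""
```

In a real contribution the helper would live under `nemo_curator/` and the tests under `tests/`, with the tests importing the code they exercise.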

## Pull Requests (PR) Guidelines

**Send your PRs to the `main` or `dev` branch**

1) Make sure your PR does one thing. Have a clear answer to "What does this PR do?".
2) Read the General Principles and the style guide above.
3) Make sure you sign your commits. E.g. use ``git commit -sS`` when committing.
4) Make sure all unit tests pass before sending your PR: run ``pytest`` or (if your dev box does not have a GPU) ``pytest --cpu`` from the root folder.
5) Send your PR and request a review.

The `dev` branch is for active development and may be unstable. Unit tests are expected to pass before merging into `dev` or `main`.
With every release, `dev` and `main` are synced to be the same.

Full text of the DCO:

```
Developer Certificate of Origin
Version 1.1
Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129
Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
Developer's Certificate of Origin 1.1
By making a contribution to this project, I certify that:
(a) The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or
(b) The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or
(c) The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it.
(d) I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.
```
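
The `-s` flag of `git commit` appends a `Signed-off-by:` trailer to the commit message, which is how a contributor certifies the DCO, and `-S` adds a cryptographic signature. A minimal sketch of checking a message for that trailer before opening a PR might look like the following. The function name and the regular expression are illustrative assumptions, not part of this repository or of git.

```python
import re

# Matches a trailer line such as "Signed-off-by: Jane Doe <jane@example.com>".
_SIGN_OFF = re.compile(r"^Signed-off-by: .+ <[^<>\s]+@[^<>\s]+>$", re.MULTILINE)


def has_sign_off(commit_message: str) -> bool:
    """Return True if the commit message contains a DCO sign-off trailer."""
    return _SIGN_OFF.search(commit_message) is not None
```

The message text could come from `git log -1 --format=%B`. This only checks for the sign-off trailer and does not verify the cryptographic signature added by `-S`.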

## Whom should you ask for review:

Joseph Jennings (@jjennings) or Ryan Wolf (@rywolf)

They may ask for other reviewers depending on the scope of the change. Your pull requests must pass all checks and peer review before they can be merged.


Thank you for contributing to NeMo Curator!