-
-
Notifications
You must be signed in to change notification settings - Fork 316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor split function with tests #811
Conversation
The split function was (1) selecting files in included directories in the top half of the function, and (2) selecting files with valid extensions and sharding them in the second half. This PR divides the split function in a new `collect_files` function that selects files with valid extensions from non-excluded directories, and then passes the valid filepaths into the `split` function, which calls `collect_files`.
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @srdas. Please find comments / suggestions below.
Please make sure you are using pytest. Currently there is nothing in your tests incompatible with pytest so just implement comments suggesting use of different pytest features and make sure to run testing suite with pytest -vv -r ap --cov jupyter_ai
or just your test with pytest -vv -k test_collect_files
when testing while working on your test. Pytest docs for reference: https://docs.pytest.org/en/8.2.x/getting-started.html#.
packages/jupyter-ai/jupyter_ai/tests/doc_loaders/test_eligible_files.py
Outdated
Show resolved
Hide resolved
packages/jupyter-ai/jupyter_ai/tests/doc_loaders/test_eligible_files.py
Outdated
Show resolved
Hide resolved
packages/jupyter-ai/jupyter_ai/tests/doc_loaders/test_eligible_files.py
Outdated
Show resolved
Hide resolved
packages/jupyter-ai/jupyter_ai/tests/doc_loaders/test_eligible_files.py
Outdated
Show resolved
Hide resolved
The split function was (1) selecting files in included directories in the top half of the function, and (2) selecting files with valid extensions and sharding them in the second half. This PR divides the split function in a new `collect_files` function that selects files with valid extensions from non-excluded directories, and then passes the valid filepaths into the `split` function, which calls `collect_files`.
for more information, see https://pre-commit.ci
Replaced testing using unittests with testing using pytest fixtures.
…pyter-ai into refactor_split_and_test
for more information, see https://pre-commit.ci
replacd unittests with pytests
…pyter-ai into refactor_split_and_test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for implementing changes. Please find a couple of additional comments below
… split Further improvements to the code suggested from the review of PR
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the changes @srdas. Everything looks good except of one small comment, please give it a look: #811 (comment)
Changed function level constant from all caps to lower case to line up with the convention in https://peps.python.org/pep-0008/#constants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @srdas.
[1] The split function was (1) selecting files in included directories in the top half of the function, and (2) selecting files with valid extensions and sharding them in the second half. This PR divides the split function in a new
collect_files
function that selects files with valid extensions from non-excluded directories, and then passes the valid filepaths into thesplit
function, which callscollect_files
. This separates the file collection from the sharding in the RAG pipeline.[2] Added a unit test for the new
collect_files
function indirectory.py
. The refactoring in [1] makes unit testing separately for the file selection component simpler. Added a new test foldertests/doc_loaders
with test filetest_eligible_files.py
.