Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add cached-path file utils #74

Merged
merged 8 commits into from
Sep 12, 2020
Merged

Conversation

epwalsh
Copy link
Contributor

@epwalsh epwalsh commented Sep 9, 2020

closes #72

So far I've just added a common/file_utils.rs module. The idea being that this would replace what's currently in common/resources.rs.

My question for you is, do you want to keep the LocalResource and RemoteResource abstractions?

@guillaume-be
Copy link
Owner

@epwalsh thank you for the suggestions. I believe the cached_path functionality could be easily added without causing breaking changes. My idea was to just replace the body of the existing

pub fn download_resource(resource: &Resource) -> Result<&PathBuf, RustBertError> {
and replace the check for the existing file & download by a call to cached_path. Could you please use the existing file and methods in resources instead of creating a separate file_utils file?

For now I'd like to keep the concept of local and remote resources - and I believe this would still work with the proposed changes.

Cargo.toml Outdated
@@ -37,6 +37,8 @@ serde = { version = "1.0.114", features = ["derive"] }
dirs = "3.0.1"
itertools = "0.9.0"
ordered-float = "2.0.0"
cached-path = "0.4.0-rc1"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please ensure a release is added as a dependency for the final version of this PR?

@@ -0,0 +1,29 @@
use crate::common::error::RustBertError;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use the existing resources file for these utilities

.unwrap();
}

fn _get_cache_directory() -> PathBuf {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This already exists at

fn _get_cache_directory() -> PathBuf {

home
}

pub fn cached_path(resource: &str) -> Result<PathBuf, RustBertError> {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please wrap the cached path functionality directly in

pub fn download_resource(resource: &Resource) -> Result<&PathBuf, RustBertError> {
?

@epwalsh
Copy link
Contributor Author

epwalsh commented Sep 10, 2020

Hey @guillaume-be, no problem.

One issue though is that, like transformers/file_utils.py, cached-path automatically generates a name for the files it downloads. The name is generated from a hash of the resource string and the ETag. But your RemoteResource struct takes a target_path.

I think I could make cached-path work with custom target paths but I'll have to change some things.


Update: Actually I think it's inherently unsafe to allow for custom arbitrary target paths since this would allow for race conditions. cached-path avoids these race conditions by giving each version of a cached resource a unique file name.

Maybe target_path could just be a symbolic link to the cached resource?

@guillaume-be
Copy link
Owner

guillaume-be commented Sep 10, 2020

What if the target_path would point to a directory instead (i.e. remove the file name).

impl BartModelResources {
    /// Shared under MIT license by the Facebook AI Research Fairseq team at https://github.com/pytorch/fairseq. Modified with conversion to C-array format.
    pub const BART: (&'static str, &'static str) = (
        "bart/model.ot",
        "https://cdn.huggingface.co/facebook/bart-large/rust_model.ot",
    );
}

Would for example become:

impl BartModelResources {
    /// Shared under MIT license by the Facebook AI Research Fairseq team at https://github.com/pytorch/fairseq. Modified with conversion to C-array format.
    pub const BART: (&'static str, &'static str) = (
        "bart",
        "https://cdn.huggingface.co/facebook/bart-large/rust_model.ot",
    );
}

and the .ot file name would be dynamically generated by cached_path?
I like the folder structure for cached files, makes it easier to delete unused cache for specific models as opposed to a full hashing of the file in a common folder.

@epwalsh
Copy link
Contributor Author

epwalsh commented Sep 10, 2020

That seems reasonable

@epwalsh
Copy link
Contributor Author

epwalsh commented Sep 10, 2020

Just made some updates. CI will probably fail. Let me know if you're happy with the API and then I'll fix the compile errors. IMHO though it doesn't make sense to have a download_resource function when we get the same functionality from Resource.get_local_path().

Let me know.

/// Gets the local path for a given resource
/// Gets the local path for a given resource.
///
/// If the resource is a remote resource, it is downlaoded and cached. Then the path
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

@guillaume-be
Copy link
Owner

guillaume-be commented Sep 11, 2020

@epwalsh Thank you for the change. The get_local_path indeed is more idiomatic and is a nicer way to get to access local resources or download remote files. I wanted to avoid breaking changes - maybe the download_function could be kept in parallel to the now standard get_local_path method, but deprecated for a version cycle, i.e.

#[deprecated(
    since = "0.9.1",
    note = "Please use `Resource.get_local_path()` instead"
)]

Since get_local_path will return an owned object, can you please update the config creation object to take both owned and borrowed Path objects, i.e.:

fn from_file<P: AsRef<Path>>(path: P) -> T {
	let f = File::open(path).expect("Could not open configuration file.");
	let br = BufReader::new(f);
	let config: T = serde_json::from_reader(br).expect("could not parse configuration");
	config
}

@guillaume-be
Copy link
Owner

@epwalsh Note that I am unable to build the cached-path crate on my machine (Windows). The file-lock dependency seems to be incompatible with my setup. This may be because gcc is listed as a dependency for this crate while the MSVC is used as a standard toolchain for windows machine. I have added a Windows build to the CI, hopefully it catches the issue.

@epwalsh
Copy link
Contributor Author

epwalsh commented Sep 11, 2020

I'll add a Windows build to cached-path as well as see if I can fix the issue.

@epwalsh
Copy link
Contributor Author

epwalsh commented Sep 11, 2020

Okay, should build on Windows now 👌

@epwalsh
Copy link
Contributor Author

epwalsh commented Sep 11, 2020

Still TODO:

@guillaume-be
Copy link
Owner

@epwalsh This looks great thanks.

Actually regarding the first TODO - one could potentially leave it the way it is. As you can see in the CI, one nice effect is that the target file name becomes the name of the folder where the cached version will be stored. This is quite nice as the hashed version makes it difficult to differentiate a config.json from a vocab.txt otherwise. The current path makes it easy to know exactly what each cache file corresponds to.

What do you think?

@epwalsh
Copy link
Contributor Author

epwalsh commented Sep 11, 2020

@epwalsh This looks great thanks.

Actually regarding the first TODO - one could potentially leave it the way it is. As you can see in the CI, one nice effect is that the target file name becomes the name of the folder where the cached version will be stored. This is quite nice as the hashed version makes it difficult to differentiate a config.json from a vocab.txt otherwise. The current path makes it easy to know exactly what each cache file corresponds to.

What do you think?

I'm fine with that if you are 👍

Resource::Local(_) => Ok(resource.get_local_path()),
}
pub fn download_resource(resource: &Resource) -> Result<PathBuf, RustBertError> {
Ok(resource.get_local_path()?)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just nitpicking, but I believe you could return resource.get_local_path() directly as this is a Result<PathBuf, RustBertError>

@epwalsh epwalsh marked this pull request as ready for review September 11, 2020 21:06
@epwalsh
Copy link
Contributor Author

epwalsh commented Sep 11, 2020

@guillaume-be I noticed there were bunch of warnings (unrelated to this PR) when I ran clippy:

cargo clippy --all-targets --all-features -- -D warnings

Would you want to add a clippy check to the CI steps? And maybe documentation tests as well?

And I'm not sure if travis CI supports caching, but I've been able to get a huge speed-up in CI by using caching in GitHub Actions.

@guillaume-be
Copy link
Owner

@epwalsh thank you very much for the PR!

I did notice the clippy warnings, I am working on fixing them and they will be addressed in the next release. I will then add a clippy check to the CI. The documentation tests are already run and check as part of the CI.

Regarding caching, I had it enabled at the beginning for this project, but the time required to compress and decompress the large cache was actually longer than downloading the files again.

@guillaume-be guillaume-be merged commit 34e061c into guillaume-be:master Sep 12, 2020
@epwalsh
Copy link
Contributor Author

epwalsh commented Sep 12, 2020

Oh right. By “large files” are you talking about model weights / checkpoint files?

I just meant caching the cargo build artifacts. That can make a big difference especially for Clippy, which takes a while to run from scratch every time

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider using cached-path crate in resources module
2 participants