Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support @fromfile option values in the Rust options parser. #20634

Merged
merged 6 commits into from
Mar 5, 2024

Conversation

benjyw
Copy link
Sponsor Contributor

@benjyw benjyw commented Mar 2, 2024

As part of this, streamlines away some previous function pointer
clunkiness via a new Parseable trait that implements parsing
of scalars and list of scalars for the scalar types bool/i64/f64/String.

While messing with this code: this change also implements the
"source" field for list- and dict-valued option values.
Previously this was only provided for scalar-valued options. However
the highest-priority source is still a valid and useful concept even
for compound values that might be composed across multiple sources.

@benjyw benjyw added the category:internal CI, fixes for not-yet-released features, etc. label Mar 2, 2024
@benjyw
Copy link
Sponsor Contributor Author

benjyw commented Mar 2, 2024

Reviewers - thanks for your attention and patience. The diff size seems scary, but the vast majority of it is in new tests, and the Cargo.lock changes.

I recommend starting by looking at the changes in parse.rs (and my GH comments there), then lib.rs, then the changes in args.rs/env.rs/config.rs should be straightforward applications of those, and then there are the new tests, which are very uniform across args_tests.rs/env_tests.rs/config_tests.rs.

@@ -274,49 +278,188 @@ fn format_parse_error(
))
}

#[allow(dead_code)]
pub(crate) fn parse_bool(value: &str) -> Result<bool, ParseError> {
Copy link
Sponsor Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously we called these for spot-parsing in various places, or passed these around as function pointers in generic code. Instead, we now implement these via a new trait, Parseable, which allows the straightforward bool::parse() and bool::parse_list() instead.

#[allow(dead_code)]
pub(crate) fn parse_bool(value: &str) -> Result<bool, ParseError> {
option_value_parser::bool(value).map_err(|e| format_parse_error("bool", value, e))
pub(crate) fn parse_dict(value: &str) -> Result<DictEdit, ParseError> {
Copy link
Sponsor Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do still need an explicit parse_dict() though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Sponsor Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because dict values are heterogeneous, so there is no scalar type to attach their conversion to, like we do for lists. We could implement parse() (but not parse_list()) on HashMap<String, Val> but that doesn't really buy us anything. The point of the other cleanups was to get rid of boilerplate via type parametrization, but that doesn't apply here.

}
}

pub(crate) trait ListMember: Parseable + DeserializeOwned {}
Copy link
Sponsor Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a tagging trait for list members, to ensure they are also deserializable into an owned instance by serde, when we read them from json/yaml fromfiles.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know of three goals with this sort of trait:

  1. an alias that's more convenient than writing T: Parseable + DeserializeOwned
  2. wrapping multiple traits up into a single one, for use in a trait object (i.e. &dyn ListMember works, but &dyn (Parseable + DeserializeOwned) doesn't; the + syntax only supports built-in traits like Send)
  3. an "assertion" that particular types implement the parent traits

Is this usage fitting into one of those categories? If the first or second one, an option for the impl might be a blanket impl, something like:

impl<T: Parseable + DeserializeOwned + ?Sized> ListMember for T {}

(With the ?Sized bound only required to support trait objects well, I think. So not necessary.)

If the third one, an another option is something like:

const fn assert_is_list_member<T: Parseable + DeserializeOwned>() {}
const _ASSERT_BOOL: () = assert_is_list_member::<bool>();
...

but now that I write it out, that's not amazingly nicer. 🤷‍♂️

Copy link
Sponsor Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was #1.

But this code evolved while I was working on it, and now that I look at where it ended up, ListMember isn't actually needed anyway. Every Parseable can be a list member (because of its parse_list()), so it's easier to make Parseable a subtrait of DeserializeOwned and be done with it.

Good call!

pub(crate) fn parse_bool_list(value: &str) -> Result<Vec<ListEdit<bool>>, ParseError> {
option_value_parser::bool_list_edits(value)
.map_err(|e| format_parse_error("bool list", value, e))
fn maybe_expand(value: String) -> Result<ExpandedValue, ParseError> {
Copy link
Sponsor Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A helper function for the expand* functions below that are the ones actually used by args/env/config.

#[allow(dead_code)]
pub(crate) fn parse_int_list(value: &str) -> Result<Vec<ListEdit<i64>>, ParseError> {
option_value_parser::int_list_edits(value).map_err(|e| format_parse_error("int list", value, e))
pub(crate) fn expand(value: String) -> Result<Option<String>, ParseError> {
Copy link
Sponsor Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does no type conversion, just expands "path to @fromfile" -> "content of @fromfile". See usage in args/env/config.rs.

if let Some(ext) = path.extension() {
if ext == "json" {
deserialized_items =
Some(serde_json::from_str(&value).map_err(|e| mk_parse_err(e, &path))?)
Copy link
Sponsor Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

serde is freakin' magic! This (plus a minor annotation on the Val class in lib.rs) is all I needed to get Rust to parse json and yaml into our existing data structures!

if let Some(ext) = path.extension() {
if ext == "json" {
deserialized_items =
Some(serde_json::from_str(&value).map_err(|e| mk_parse_err(e, &path))?)
Copy link
Sponsor Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a little bit of repetition here vis the previous function, but the differences are just subtle enough that it became more confusing than it was worth to try and unify them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about at least pulling out the file-type-identification? Maybe something like

enum FileType {
    Json,
    Yalm
}

impl FileType {
    fn guess(path: &Path) -> Option<FileType> {
        match path.extension() {
            "json" => Some(FileType::Json),
            "yaml" | "yml" => ..., 
            _ => None,
        }
    }
}

and then this section of the code in both could could be something like:

let deserialized_items = match path_opt.flat_map(|p| FileType::guess(&p)) {
   Some(FileType::Json) => ...,
   Some(FileType::Yaml) => ...,
   None => None,
};

One could potentially then even move the "call serde_json or serde_yaml" decision to a generic method on FileType, and have that be shared.

Copy link
Sponsor Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I toyed with this some more and was able to create a generic deserialization helper function that works for lists and dicts, so hooray Rust type system!

@@ -38,7 +40,18 @@ fn check_with_arg<T: PartialEq + Debug>(
#[test]
fn test_parse_quoted_string() {
fn check_str(expected: &str, input: &str) {
check!(expected.to_string(), parse_quoted_string(input));
// This is slightly convoluted: quoted strings appear as list items,
Copy link
Sponsor Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so I could delete the otherwise dead code parse_quoted_string(). It makes sense to specifically test that we handle quoted strings correctly, but now we do so where we actually encounter them - inside lists.

@@ -648,3 +661,190 @@ fn test_parse_heterogeneous_dict() {
)
);
}

#[test]
Copy link
Sponsor Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From here we have some very exhaustive tests of the fromfile mechanism. We also test this mechanism's application in args/env/config in the appropriate test files.

Copy link
Contributor

@huonw huonw Mar 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

In addition to this set, could we have a test or two covering some more error cases beyond just "not existing": the major class I can think of is when the file contents is invalid, both as a scalar, and, probably more interestingly, as a list or dict (e.g. where serde json/yaml can't even parse it).

Copy link
Sponsor Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding some list/dict cases as you suggest. There is no such thing as an invalid scalar at this level, since we don't convert to the type but just expand string to string. Can test for this in args/config/env maybe. I'll look into adding something.

///
/// A source of option values.
///
/// This is currently a subset of the types of options the Pants python option system handles.
Copy link
Sponsor Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is no longer a subset! So this whole comment can go.

pub source: Source,
pub value: T,
}

#[derive(Debug)]
pub struct ListOptionValue<T> {
pub derivation: Option<Vec<(Source, Vec<ListEdit<T>>)>>,
// The highest-priority source that provided edits for this value.
Copy link
Sponsor Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not directly needed for fromfile, but will be needed for replacing the Python parser, and it was easy to sneak in here...

id: &OptionId,
parse_list: fn(&str) -> Result<Vec<ListEdit<T>>, ParseError>,
) -> Result<Option<Vec<ListEdit<T>>>, String> {
fn get_list<T: ListMember>(&self, id: &OptionId) -> Result<Option<Vec<ListEdit<T>>>, String> {
Copy link
Sponsor Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No more need for the parse_list function pointer - ListMember has the functionality we need (via its Parseable supertrait).

@benjyw benjyw requested review from huonw and tgolsson March 2, 2024 05:26
@@ -54,7 +56,8 @@ pub use types::OptionType;
// We only use this for parsing values in dicts, as in other cases we know that the type must
// be some scalar or string, or a uniform list of one type of scalar or string, so we can
// parse as such.
#[derive(Clone, Debug, PartialEq)]
#[derive(Clone, Debug, PartialEq, Deserialize)]
#[serde(untagged)]
Copy link
Sponsor Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All I needed to do to get json and yaml deserialization working!

Copy link
Contributor

@huonw huonw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice!

#[allow(dead_code)]
pub(crate) fn parse_bool(value: &str) -> Result<bool, ParseError> {
option_value_parser::bool(value).map_err(|e| format_parse_error("bool", value, e))
pub(crate) fn parse_dict(value: &str) -> Result<DictEdit, ParseError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

}
}

pub(crate) trait ListMember: Parseable + DeserializeOwned {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know of three goals with this sort of trait:

  1. an alias that's more convenient than writing T: Parseable + DeserializeOwned
  2. wrapping multiple traits up into a single one, for use in a trait object (i.e. &dyn ListMember works, but &dyn (Parseable + DeserializeOwned) doesn't; the + syntax only supports built-in traits like Send)
  3. an "assertion" that particular types implement the parent traits

Is this usage fitting into one of those categories? If the first or second one, an option for the impl might be a blanket impl, something like:

impl<T: Parseable + DeserializeOwned + ?Sized> ListMember for T {}

(With the ?Sized bound only required to support trait objects well, I think. So not necessary.)

If the third one, an another option is something like:

const fn assert_is_list_member<T: Parseable + DeserializeOwned>() {}
const _ASSERT_BOOL: () = assert_is_list_member::<bool>();
...

but now that I write it out, that's not amazingly nicer. 🤷‍♂️


fn mk_parse_err(err: impl Display, path: &Path) -> ParseError {
ParseError::new(format!(
"Problem reading {path} for {{name}}: {err_msg}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the {{name}} mean here?

Copy link
Sponsor Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A placeholder that ParseError.render() fills in with the option name, rendered suitably for the source type (e.g., --foo vs PANTS_FOO vs [GLOBAL] foo . This was all preexisting before any of my recent changes.

Some(subsuffix) => {
// @? means the path is allowed to not exist.
let path = PathBuf::from(subsuffix);
match fs::read_to_string(&path) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I note that this is doing synchronous IO, but generally pants does async. Is the sync IO acceptable here?

Copy link
Sponsor Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably? We expect very few fromfile options, and they are read synchronously today in the Python code.

The trickier part is invalidating options when the fromfile changes (which we don't to today either). Making this async though would be a pretty big change that would be best done as its own project. For now we're no worse off than today.

if let Some(ext) = path.extension() {
if ext == "json" {
deserialized_items =
Some(serde_json::from_str(&value).map_err(|e| mk_parse_err(e, &path))?)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about at least pulling out the file-type-identification? Maybe something like

enum FileType {
    Json,
    Yalm
}

impl FileType {
    fn guess(path: &Path) -> Option<FileType> {
        match path.extension() {
            "json" => Some(FileType::Json),
            "yaml" | "yml" => ..., 
            _ => None,
        }
    }
}

and then this section of the code in both could could be something like:

let deserialized_items = match path_opt.flat_map(|p| FileType::guess(&p)) {
   Some(FileType::Json) => ...,
   Some(FileType::Yaml) => ...,
   None => None,
};

One could potentially then even move the "call serde_json or serde_yaml" decision to a generic method on FileType, and have that be shared.

Ok(Some("FOO".to_string())),
expand(format!("@{}", fromfile_path_str))
);
assert_eq!(Ok(None), expand("@?/does/not/exist".to_string()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also having a test for an optional file that does exist, i.e. make sure that we're stripping off the ? correctly? (I could imagine a future refactoring making a typo that causes us to pass in ?/does/not/exist literally as the file path.)

In that vein, is it worth having "edge-case tests" that at least lock down the behaviour for "weird" things like @@path/to/file reading @path/to/file and similarly @??path/to/file (optionally) reading ?path/to/file?

Copy link
Sponsor Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Note that @@ is already covered - that is the escape hatch for a literal value that starts with @, so we never allow such a fromfile.

@@ -648,3 +661,190 @@ fn test_parse_heterogeneous_dict() {
)
);
}

#[test]
Copy link
Contributor

@huonw huonw Mar 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

In addition to this set, could we have a test or two covering some more error cases beyond just "not existing": the major class I can think of is when the file contents is invalid, both as a scalar, and, probably more interestingly, as a list or dict (e.g. where serde json/yaml can't even parse it).

@benjyw
Copy link
Sponsor Contributor Author

benjyw commented Mar 4, 2024

Thanks for the helpful review! I've addressed all your comments. PTAL.

Copy link
Contributor

@huonw huonw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome

src/rust/engine/options/src/args_tests.rs Show resolved Hide resolved
src/rust/engine/options/src/parse.rs Outdated Show resolved Hide resolved
@benjyw benjyw merged commit af1b5c7 into pantsbuild:main Mar 5, 2024
24 checks passed
@benjyw benjyw deleted the rust_options_fromfile2 branch March 5, 2024 01:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category:internal CI, fixes for not-yet-released features, etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants