Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pkg/ottl] Allow users to specify the format of the hashed replacement string in the replace_pattern* editors #27820

Closed
Tracked by #28892
rnishtala-sumo opened this issue Oct 17, 2023 · 24 comments
Labels
enhancement New feature or request pkg/ottl

Comments

@rnishtala-sumo
Copy link
Contributor

rnishtala-sumo commented Oct 17, 2023

Component(s)

pkg/ottl

Is your feature request related to a problem? Please describe.

This function/editor would be used to replace substrings (identified with a regex pattern) with a hash value/digest

Let's assume we have the following log message

  • "username = user, passwd = password"

To replace the password with its hash value, one would do the following

  • replace_pattern(message, passwd\=([^\s]*), "$$1", SHA256)

However the above ottl expression would replace the entire password substring with the hash value of "password" like so

  • "username = user, <hash digest of "password">"

However the end user might need it to be something like below:

  • "username = user, passwd = <hash digest of "password">"

Note the "passwd=" prefix

This isn't a problem with masking because one could do something like

  • replace_pattern(message, passwd\=([^\s]*), "passwd=*****")

This includes the prefix "passwd=" in the replacement string. This isn't viable for hashing since we hash the entire replacement string including any prefix.

Describe the solution you'd like

Proposing that we add another optional argument, which allows the end-user to specify a format string for the replacement

  • replace_pattern(message, passwd\=([^\s]*), "$$1", "passwd=%s", SHA256)

which results in adding a prefix to the hashed password

  • "username = user, passwd = <hash digest of "password">"

Describe alternatives you've considered

No response

Additional context

No response

@rnishtala-sumo rnishtala-sumo added enhancement New feature or request needs triage New item requiring triage labels Oct 17, 2023
@github-actions
Copy link
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@rnishtala-sumo rnishtala-sumo changed the title Add a new optional argument to the replace_pattern editors which allows users to add a prefix to the replacement string Add a new optional argument to the replace_pattern editors which allows users to specify a prefix to the replacement string Oct 17, 2023
@TylerHelmuth
Copy link
Member

TylerHelmuth commented Oct 18, 2023

This isn't the most efficient solution, but I believe it works

 - merge_maps(cache, ExtractPatterns(message, "passwd\=(?P<password>[^\s]*)"), "upsert")
 - set(cache["password"], SHA256(cache["password"]))
 - set(cache["password"], Concat(["passwd=", cache["password"]], ""))
 - replace_pattern(message, passwd\=([^\s]*), cache["password"])

I am a little worried about the complexity we're adding to the replace pattern funcs, but I am willing to discuss further.

@rnishtala-sumo
Copy link
Contributor Author

rnishtala-sumo commented Oct 18, 2023

Outside of a defining an additional optional argument for replace_pattern, the proposed solution would only need to check for a prefix as seen in this code block

With this code change the replace_pattern func would have the following arguments

  • target (required)
  • regex (required)
  • replacement (required)
  • replacementPrefix (optional)
  • function (optional)

A total of 5 arguments including two optional arguments. Doing this allows us to accomplish hashing with a prefix in one statement as below

replace_pattern(message, passwd\=([^\s]*), "$$1", "passwd=", SHA256)

IMO if we're supporting hashing with replace_pattern (to simplify the UX) allowing users to add a prefix completes the feature because its a valid use case. But I agree that this can also be accomplished using your approach

We already merged the changes that allow this today

  • replace_pattern(message, passwd\=([^\s]*), "$$1", SHA256)

where the hash function is optional. But if users decide to use this, they would lose the "passwd=" prefix which seems valuable. Which is where the replacement prefix comes in.

@evan-bradley
Copy link
Contributor

If we're going to add an option for providing a prefix, I think we should also allow specifying a suffix. In my opinion the easiest way to handle this would be to add an option to provide a format string that handles both at once.

That said, I share the concern @TylerHelmuth brought up around complexity. I wonder if there's a pattern or mechanism we could use here so this could be generalized to other functions.

@TylerHelmuth
Copy link
Member

Posting comments shared via the Collector SIG meeting today:

This is a valid usability concern and use case that users will experience. There is concern about the complexity and edge cases it introduces into the already complex landscape of OTTL replace functions.

A couple questions I'd like to think about:

  • Can prefix (or suffix or format or whatever this becomes) be used without the function parameter.
  • If usable without the function parameter, is it providing any value? I think it would be the same as something like prefix$$1.

@andrzej-stencel
Copy link
Member

As a user, I think this kind of solution would be something I'd understand:

replace_pattern(message, (passwd\=)([^\s]*), "$$1SHA256($$2)")

I'm not sure it's possible to implement though. 😅

@rnishtala-sumo
Copy link
Contributor Author

rnishtala-sumo commented Oct 18, 2023

If usable without the function parameter, is it providing any value? I think it would be the same as something like prefix$$1

yes I agree that the prefix or suffix without the hash function isn't adding any value

using the same approach, another way this could look like is

replace_pattern(message, passwd\=([^\s]*), "$$1", SHA256, "passwd=")

so the prefix can only be specified if we provide a hash function

@rnishtala-sumo
Copy link
Contributor Author

replace_pattern(message, (passwd\=)([^\s]*), "$$1SHA256($$2)")

supporting something like this will need changes to the ottl grammar to use Converters inside a string which seems like a complex thing to do.

@rnishtala-sumo rnishtala-sumo changed the title Add a new optional argument to the replace_pattern editors which allows users to specify a prefix to the replacement string Allow users to specify a prefix to a hashed replacement string in the replace_pattern* editors Oct 18, 2023
@rnishtala-sumo
Copy link
Contributor Author

rnishtala-sumo commented Oct 18, 2023

In my opinion the easiest way to handle this would be to add an option to provide a format string that handles both at once.

I went ahead and changed the replacement prefix to be a format string in this commit and added tests for it.

It includes some basic validation that the argument replacementFormat must be a format string (contains %s in it). This approach supports suffixes as well. Please let me know your thoughts.

@rnishtala-sumo rnishtala-sumo changed the title Allow users to specify a prefix to a hashed replacement string in the replace_pattern* editors Allow users to specify the format of the hashed replacement string in the replace_pattern* editors Oct 24, 2023
@TylerHelmuth
Copy link
Member

I think an optional format argument that can only be supplied if the optional function is supplied is probably the best approach. In the function's arg struct in needs to come after the function to be a non-breaking change.

@rnishtala-sumo
Copy link
Contributor Author

rnishtala-sumo commented Oct 26, 2023

optional format argument that can only be supplied if the optional function is supplied

yes I can make this update to the PR

@TylerHelmuth
Copy link
Member

@rnishtala-sumo we should also validate as early as possible that format can only be set if function is set. I am hoping we can do that during startup and not during hot-path execution.

@TylerHelmuth
Copy link
Member

TylerHelmuth commented Oct 31, 2023

From #27686 (review) by @MovieStoreGuy:

From my perspective, after ready through the issue I believe it makes more sense for functions like HASH to only be applied to the extracted groups which avoids the need of providing a format?

ie: replace_pattern(name, "^kube_(...)$", "prefix-$$1--infix-suffix", SHA256)

This is a pretty interesting idea that I'd like to consider because it keeps the function arguments simpler. Are there any use cases/capabilities we would be losing if we forced the optional function to only be applied on a capture group?

@rnishtala-sumo
Copy link
Contributor Author

rnishtala-sumo commented Oct 31, 2023

It seems like a good idea, will need understand how the match group is evaluated in replace_pattern. This is abstracted behind replacement.Get(ctx, tCtx) at the moment. My initial thought is we hash the match group and then re-construct the replacement string. We would only try to extract match groups (represented by $$x) and not any other substring from replacement if I understand what's being suggested.

@rnishtala-sumo
Copy link
Contributor Author

rnishtala-sumo commented Oct 31, 2023

Are there any use cases/capabilities we would be losing if we forced the optional function to only be applied on a capture group?

I think we should be fine with limiting the optional function to a capture group, as one mostly tries to identify sensitive data using a prefix.

@TylerHelmuth TylerHelmuth changed the title Allow users to specify the format of the hashed replacement string in the replace_pattern* editors [pkg/ottl] Allow users to specify the format of the hashed replacement string in the replace_pattern* editors Nov 7, 2023
@evan-bradley
Copy link
Contributor

Are there any use cases/capabilities we would be losing if we forced the optional function to only be applied on a capture group?

@TylerHelmuth I don't have any real-world cases to point to here, but I'm a little worried we would lose too much control over how the hash is created if we do this.

For example if I'm trying to create a hash that is based on a username and client, I may want to combine the values in a certain way before hashing them so its consistent throughout my system. So I might have name=user123, client=abc and will track this combo through hash SHA256($client$username). Only supplying a single capture group to the optional function wouldn't let me create this hash value.

@evan-bradley
Copy link
Contributor

evan-bradley commented Dec 20, 2023

It would require a lot of work in the language, but I had an idea that would allow us to do this in a way that I think would provide a consistent user experience at the cost of being more technically sophisticated.

If we change the optional function parameter of the replace_pattern functions to be a list of functions that are applied in sequence, and add support for currying with arbitrary parameters to OTTL (indicated by a character like _), we can do something like this:

replace_pattern(message, "passwd\=([^\s]*)", "$$1", [SHA256, Format("passwd=%s", [_])])

So the resulting set of calls to get the replacement string would look something like

Format("passwd=%s", [SHA256("$$1")])

This would require adding support for curried functions in OTTL. So any function call with an _ argument (or some other special character) would produce a function that takes arguments to complete the function invocation. This would essentially let us declare lambda functions with less notation. So if Format has a signature of Format(fmt string, args []any), performing a function call of Format("passwd=%s", [_]) will create a function that looks something like AnonFn(arg0 any) that makes a call like Format("passwd=%s", [arg0]) when invoked.

I think an advantage of this approach is that it would let us format the string as we wish, support more sophisticated replacements because of the list of functions, and wouldn't require us to have an optional argument that depends on another. Support for currying would also allow users to massage existing OTTL functions into new ones that have similar signatures.

On the other hand, providing a format argument won't require any additional syntax and could be easily validated at parse time: if a format string is provided but a function is not, throw an error.

@rnishtala-sumo
Copy link
Contributor Author

rnishtala-sumo commented Jan 9, 2024

To summarize here, I think we're considering the following options

  1. force the optional function to only be applied on a capture group
  • replace_pattern(name, "^kube_(...)$", "prefix-$$1--infix-suffix", SHA256)
  1. Add another format argument only when a function is passed
  • replace_pattern(message, passwd\=([^\s]*), "$$1", SHA256, "passwd=%s")
  1. A more sophisticated approach of supporting curried functions in OTTL
  • replace_pattern(message, "passwd\=([^\s]*)", "$$1", [SHA256, Format("passwd=%s", [_])])

At this point, I'm leaning towards the first option for its simplicity/UX. We could also potentially hash a multiple capture group combo if the user enters something like
replace_pattern(name, "name=(user123), client=(abc)", "prefix-$$1$$2--infix-suffix", SHA256)

@rnishtala-sumo
Copy link
Contributor Author

@TylerHelmuth @evan-bradley this is a draft PR that uses option 1. It doesn't cover multiple capture groups yet, but all the existing test cases pass and it supports adding a prefix to a hashed capture group.

@rnishtala-sumo
Copy link
Contributor Author

rnishtala-sumo commented Jan 17, 2024

The following PR now uses Option 1 and is marked ready for review. New tests have been added for this and all the existing test cases pass too. Open to discussing other options. IMO this option offers UX consistent with masking.

@Jsalz2000
Copy link

Jsalz2000 commented Jan 29, 2024

Hi all,

I was asked by @TylerHelmuth to post my use-case after messaging in the otelcol slack.

I am trying to rename all datapoint attributes from CamelCase to snake_case. My two approaches and neither one works.

- context: datapoint
  statements:
    - replace_all_patterns(attributes, "key", "(.*)", ConvertCase("$$1", "snake"))

This doesn't work because I think the $$1 is substituted after the ConvertCase function has already executed.
(shows up as no action)

- context: datapoint
  statements:
    - replace_all_patterns(attributes, "key", "(.*)", "$$1", ConvertCase))

This doesn't work because ConvertCase has multiple parameters, not just one. I believe we would need something like a partial function/currying in order to use this here.

Thanks!

@rnishtala-sumo
Copy link
Contributor Author

Referencing the syntax explained here by @evan-bradley - #27820 (comment)

This is what Function currying could potentially look like in the use cases below -

  • Add a prefix to a Hash
    replace_pattern(name, "^kube_([0-9A-Za-z]+_)", "$$1", [SHA256, Concat("k8s_", [_])]
    the functions would resolve to
    Concat("k8s_", SHA256($$1))

  • Convert case
    replace_all_patterns(attributes, "key", "(.*)", $$1, [NoOp, ConvertCase([_], "snake"] )
    the functions would resolve to
    ConvertCase(NoOp($$1), "snake")
    where the NoOp Converter takes one input value and returns it without modifications (doesn't exist yet).

There is a caveat here in that, the first function used must only accept one input argument (SHA256, NoOp). For example, using the above approach one cannot use Concat and ConvertCase together.

@evan-bradley
Copy link
Contributor

One small correction, after reading up on partial function application and currying, it turns out we're actually talking about partial function application here: taking a function of a given arity and returning a function of lesser arity by "fixing" some of the arguments. Currying is similar in concept but would technically produce a series of functions that still take all the arguments. Apparently this a common mistake, my bad.

There is a caveat here in that, the first function used must only accept one input argument (SHA256, NoOp). For example, using the above approach one cannot use Concat and ConvertCase together.

@rnishtala-sumo This would actually be okay. The optional function passed to replace_pattern has to have a signature like func(string) string, so with this change we would just be allowing a sequence of such functions to be passed to replace_pattern. What partial application solves is allowing us to take functions that look like func(..., string, ...) string and fix all but one of the string arguments to turn them into a function that fills in the last remaining argument: func(string) string. The _ identifier is just to choose which argument you want to be unfixed in the resulting function.

So the following would be possible:

  • replace_pattern(name, "^kube_([0-9A-Za-z]+_)", "$$1", [SHA256, Concat(["k8s", _], "-")])
  • replace_pattern(name, "^kube_([0-9A-Za-z]+_)", "$$1", [ConvertCase(_, "snake")])
  • replace_pattern(name, "^kube_([0-9A-Za-z]+_)", "$$1", [ConvertCase(_, "snake"), Concat(["k8s", _], "_")])

Again, this is mostly to allow us to more concisely create single-use functions without requiring users to do something like this in their Collector config.

ottl_statements:
- func MyConvertToSnake(val string) return ConvertCase(val, "snake")
- replace_pattern(..., MyConvertToSnake)

@TylerHelmuth
Copy link
Member

Closed by #30837. Any future partial function or looping features can be handled in a separate issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request pkg/ottl
Projects
None yet
Development

No branches or pull requests

5 participants