
Key-based item uniqueness #22

Closed
gregsdennis opened this issue Jan 23, 2018 · 46 comments · Fixed by #39

Comments

@gregsdennis
Member

It may be handy to be able to specify key-based uniqueness within an array of items.

Given the schema

{
  "type" : "array",
  "items" : {
    "type" : "object",
    "properties" : {
      "key" : { "type" : "string" },
      "value" : { "type" : "integer" }
    }
  },
  "uniqueItems" : true
}

the instance

[ { "key" : "key1", "value" : 1 }, { "key" : "key1", "value" : 2 } ]

would pass. However, the user may want this to fail as the value of the key property is repeated.

I propose a uniquenessKey (or similar) keyword that would allow the author to specify a pointer to an object property that should be unique among all items within the array. This would update the above schema to

{
  "type" : "array",
  "items" : {
    "type" : "object",
    "properties" : {
      "key" : { "type" : "string" },
      "value" : { "type" : "integer" }
    }
  },
  "uniquenessKey" : "#/key"
}

(The pointer would be resolved using the item in the array as the root.)
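A minimal sketch of this behavior (hypothetical semantics for illustration; `resolve_pointer` and `check_uniqueness_key` are made-up names, and the pointer is resolved against each array item, as proposed):

```python
def resolve_pointer(item, pointer):
    # Resolve a JSON-Pointer-like path (e.g. "#/key") against a single
    # array item. Returns (found, value) so a missing property can be
    # distinguished from a present-but-None value.
    current = item
    for token in pointer.lstrip("#").strip("/").split("/"):
        if not isinstance(current, dict) or token not in current:
            return False, None
        current = current[token]
    return True, current


def check_uniqueness_key(array, pointer):
    # Hypothetical "uniquenessKey" check: the value found at `pointer`
    # must be unique across all items in the array; items where the
    # pointer does not resolve are ignored.
    seen = []
    for item in array:
        found, value = resolve_pointer(item, pointer)
        if found:
            if value in seen:
                return False
            seen.append(value)
    return True
```

With the instance above, `check_uniqueness_key(instance, "#/key")` returns False because the value of the key property repeats.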

@handrews
Contributor

This has some of the same difficulties as #21 ordered which also discusses possibly ordering on a key. We should probably tackle them together. These are in what I consider the grey zone between "yeah let's add that to the standard vocab" and "hmm... definitely see the use, but can easily become either too complicated to implement or too limited to be useful, so maybe extension vocabularies should play with this first."

I'm hoping as we improve our notion of what a vocabulary is and how to organize them, it will become more clear where this and ordered best fit.

@TakingItCasual

TakingItCasual commented Apr 19, 2018

Perhaps a way to do this would be to add an objectArray type? For example converting the following:

{
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "Name": {"type" : "string"},
            "Desc": {"type" : "string"}
        },
        "required": ["Name", "Desc"]
    },
    "minItems": 1,
    "uniqueItems": true
}

To something along the lines of the following:

{
    "type": "objectArray",
    "properties": {
        "Name": {"type" : "string"},
        "Desc": {"type" : "string"}
    },
    "required": ["Name", "Desc"],
    "minItems": 1,
    "uniqueItems": true
}

A keyword such as uniqueKeys (or uniqueKey with just one value, if that makes more sense) could be used with this type to achieve the desired effect.

@gregsdennis
Member Author

@TakingItCasual I'm not sure how your suggestion solves the problem of key-based uniqueness. You seem to be merely adding a new type keyword as a shorthand for functionality that already exists.

@lalloni

lalloni commented Apr 8, 2019

Right now we need almost exactly what @gregsdennis proposed, except that we must have composite keys, so "uniquenessKey" should be an array.

Having said that, I find it very odd that the pointer "#" points to an element in the array somewhere in the instance... which would be very confusing to users, since all other pointers refer to the root of the schema as "#", I believe. Maybe there's no other way, though.

@handrews
Contributor

handrews commented Apr 8, 2019

@lalloni

Having said that, I find it very odd that the pointer "#" points to an element in the array somewhere in the instance... which would be very confusing to users, since all other pointers refer to the root of the schema as "#", I believe. Maybe there's no other way, though.

If we do that, that should be done with a Relative JSON Pointer, not changing the meaning of #.

@gregsdennis
Member Author

gregsdennis commented Apr 9, 2019

We could do something like what JSON path does for array predicates: use @ instead of # to indicate a local root instead of the global root. It's not a proper pointer, but it makes the distinction well enough.

@handrews
Contributor

handrews commented Apr 9, 2019

@gregsdennis Or we could just use Relative JSON Pointer which is designed specifically to solve this problem, is one of the specs we publish, and is used throughout JSON Hyper-Schema already.

@rrodini

rrodini commented Jun 25, 2019

This would be very useful.

@ralfhandl

This would be very useful, and uniquenessKey needs to accept an array of Relative JSON Pointers to also cover the multi-part key case.

@senior-pomidorka

This would be very useful, and uniquenessKey needs to accept an array of Relative JSON Pointers to also cover the multi-part key case.

Totally agree. Would be really useful

@awwright
Member

awwright commented Aug 9, 2019

I'm skeptical about the effectiveness of things like this for a few reasons:

  • The primary role of "uniqueItems" is to signify that an array is functioning as a mathematical set, and so array items should be unique. It's also one of the most difficult keywords to validate.

  • Verifying whether or not some data exists elsewhere (in the document, or another database) is somewhat outside the scope of JSON Schema. "uniqueItems" is an exception, and it has exceptionally poor performance. And I'm concerned that if we start adding keywords like this, it would encourage people to design schemas in a way that isn't ideal, by forcing too much data into the same document, when maybe it should be separated. See json-schema-org/json-schema-spec#549

@handrews
Contributor

handrews commented Aug 9, 2019

@awwright if we were adding uniqueItems now I would lobby hard to make it an annotation and not an assertion, and leave it up to the application layer to decide whether and how to validate it.

@gregsdennis
Member Author

@awwright It's also one of the most difficult keywords to validate.

While uniqueItems is difficult to implement, uniquenessKey would make things easier because you only have to check that the values at the keys are unique rather than the whole item.

To your second point, I'm not proposing that this has any relation to data in schemas. I think that item uniqueness holds a valid place within a schema, and identifying that uniqueness based on a property within the item (for example an ID) is a worthy addition.

@speedplane

speedplane commented Sep 3, 2019

I found this issue after asking a question on SO. The view count (currently 21 views after 7 days) may provide insight into prioritizing this issue.

@Relequestual
Member

@handrews @awwright In terms of

Verifying whether or not some data exists elsewhere (in the document, or another database) is somewhat outside the scope of JSON Schema.

This is often needed, and often such checks need to be pushed up a few levels to enable what's required.

@awwright if we were adding uniqueItems now I would lobby hard to make it an annotation and not an assertion, and leave it up to the application layer to decide whether and how to validate it.

@handrews

(Assuming you do not mean uniqueItems as that's already a keyword, and mean some new uniqueItemsByKey keyword) I would very much prefer NOT adding this as an annotation only. Doing so would create confusion, and I would find it preferable to take a stance which says "no, you must do this in creating a new 'database' sort of vocabulary. probably need to collaborate with mongodb". If it was only an annotation, I'm sure there are PLENTY more database type annotations that people might want, and we're not in the best place to see or collect them right now.

I personally don't feel this is unreasonable; however, I don't think the number of people who've requested it is high enough to divert focus from remaining tasks for draft-8. (This issue is currently not in the draft-8 milestone. Let's also discuss whether we should move it there or not.)

If the value were simply a string representing the property in the object to be checked, it sounds fairly simple to me. I don't think we need to allow for pointers. Values which are ids for objects should be at the topmost layer, and doing so reduces the complexity of the task here.

@Relequestual
Member

@speedplane Thanks for reaching out. Feel free to join our Slack server for further discussion. Priorities are hard, especially when there are existing commitments and limited time. If you want to help us "move faster", keep an eye out for our soon-to-be-announced Open Collective!

@gregsdennis
Member Author

gregsdennis commented Dec 4, 2019

Another SO question that involves this.

The problems are ... It does not check for duplicates...

@chapmanjw

+1 to the uniquenessKey idea. Thinking further on that idea, could that be modeled as an array instead of a single value?

Consider the following example:

{
  "type" : "array",
  "items" : {
    "type" : "object",
    "properties" : {
      "key1" : { "type" : "string" },
      "key2" : { "type" : "string" },
      "value" : { "type" : "integer" }
    }
  },
  "uniquenessKeys" : ["#/key1", "#/key2"]
}

The combination of key1 and key2 are what make each instance in the array unique.

@gregsdennis
Member Author

gregsdennis commented Dec 10, 2019

@chapmanjw, I mentioned the use of multiple keys in the original post, but thanks for adding in an example!

@handrews
Contributor

Moving this to the extension vocabularies repo.

@handrews handrews transferred this issue from json-schema-org/json-schema-spec Feb 28, 2020
@amitchone

Hi all, I'd just like to mention that I'd be interested in this feature too! Not a difficult problem to solve for me with separate code, but nonetheless, if it was eventually included that would be great.

Cheers,
Adam

@karenetheridge
Member

karenetheridge commented Jun 3, 2020

isUniqueByProperty: <propertyname> and isSortedByProperty: <propertyname> would be great vocabularies to help pioneer a new vocabulary development process :) count me in!

@gregsdennis
Member Author

@karenetheridge I prefer having a pointer to a property rather than just a property name. This allows the key to be nested in the object somewhere.

@karenetheridge
Member

karenetheridge commented Jun 4, 2020

@gregsdennis Interesting. So in the basic case (an array of objects with string values), "isUniqueByProperty": "mypropertyname" would just be "isUniqueByInsertBikeshedHere": "/mypropertyname" instead. But we could make it even more generalized by not requiring an array of objects, but an array of anything:

{
  "type": "array",
  "isUniqueByPath": "/0/foo/bar",
  "isSortedByPath": "/0/foo/bar",
  "items": {
    "type": "array",
    "items": {
      "type": "object",
      "required": [ "foo" ],
      "properties": {
        "foo": {
          "type": "object",
          "required": [ "bar" ],
          "properties": {
            "bar": {
              "$comment": "this is used as the uniqueness/sorting key for the top level array",
              "type": "string"
            }
          }
        }
      }
    }
  }
}

..or is that too complicated?

(edit: I'm having difficulty thinking of when a nested array might be useful though. I think you had the better idea.) :)

@gregsdennis
Member Author

That's what I'm thinking. I'm not sure if it will be useful, either, but my experience is that someone will want it.

@joaomcarlos

If it doesn't exist, no one can have it. Adoption is purely based on existing functionality. If it's not available, then devs will implement it at the application layer, which... helps no one.

You can apply the same thinking recursively all the way up and have JSON Schema support only string types with zero extra validations... people who need those will implement them at the application level.

This is not some obscure functionality either; uniqueness of things is a core concept of information-passing objects. If JSON Schema is expressive enough to provide it and its support is widely adopted, then you don't need extra layers to provide that functionality anymore, which simplifies systems and improves robustness :)

It's more of a "want" than an "if": it is needed, it just depends on whether anyone wants to implement it.

@alastair-todd

Stumbled here because I want to specify unique keys. Totally concur with @joaomcarlos

We've had to implement custom code validators for simple things like comparing two dates (even when they are expressed as integers). It can't be done "natively". So this is not a complete validation model that I can send to customers, like I could with an XSD...

Now I'm really stuck between the devil and the deep blue sea.

@pmsreenivas

I can think of two scenarios when it comes to uniqueness of multiple keys -

Scenario 1 - similar to @chapmanjw (Dec 9, 2019), the composite combination of key1 and key2 must be unique. As in, two or more objects in the array can have the same key1 or same key2, but no two objects can have the same key1 + key2 combination.

{
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "key1": { "type": "string" },
                "key2": { "type": "string" },
                "value": { "type": "integer" }
            }
        },
        "compositeUniquenessKeys": [ "#/key1", "#/key2" ]
    }

Example pass case :-

Multiple objects have the same key1/key2 but no two objects have the same key1 + key2 combination

[
        {
            "key1": "abc",
            "key2": "pqr",
            "value": 14
        },
        {
            "key1": "abc2",
            "key2": "pqr",
            "value": 10
        },
        {
            "key1": "abc",
            "key2": "pqr2",
            "value": 10
        }
    ]

Example fail case :-

First two objects have same key1 + key2 combination

[
        {
            "key1": "abc",
            "key2": "pqr",
            "value": 10
        },
        {
            "key1": "abc",
            "key2": "pqr",
            "value": 13
        },
        {
            "key1": "abc",
            "key2": "pqr2",
            "value": 10
        }
    ]

Scenario 2 - Objects within the array should not have the same value for key1 or key2 (this can be extended to as many keys as needed)

{
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "key1": { "type": "string" },
                "key2": { "type": "string" },
                "value": { "type": "integer" }
            }
        },
        "nonCompositeUniquenessKeys": [ "#/key1", "#/key2" ]
    }

Example pass case :-

Both key1 and key2 are unique across all three objects

[
        {
            "key1": "abc",
            "key2": "pqr",
            "value": 10
        },
        {
            "key1": "abc2",
            "key2": "pqr2",
            "value": 9
        },
        {
            "key1": "abc3",
            "key2": "pqr3",
            "value": 10
        }
    ]

Example fail case :-

Although key1 is unique across all three objects, key2 is not. We don't care about the key1 + key2 combination here

[
        {
            "key1": "abc",
            "key2": "pqr",
            "value": 10
        },
        {
            "key1": "abc2",
            "key2": "pqr",
            "value": 10
        },
        {
            "key1": "abc3",
            "key2": "pqr2",
            "value": 10
        }
    ]

These examples can be extended to have more than two keys.

I would like to have both these scenarios included as features.
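The two scenarios differ only in how the per-item key values are combined; a minimal sketch, assuming plain property names rather than pointers and hypothetical function names:

```python
def composite_unique(array, keys):
    # Scenario 1: the tuple of values (key1, key2, ...) must be unique
    # per item; individual key values may repeat.
    tuples = [tuple(item.get(k) for k in keys) for item in array]
    return len(tuples) == len(set(tuples))


def non_composite_unique(array, keys):
    # Scenario 2: each key must be independently unique across items.
    return all(composite_unique(array, [k]) for k in keys)
```

The pass case for scenario 1 above passes `composite_unique` but fails `non_composite_unique`, since key2 repeats even though the key1 + key2 combination does not.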

@pongstylin

pongstylin commented Mar 11, 2021

I was about to make a comment similar to @pmsreenivas. Multiple keys would be quite useful - composite or not. My example involves authoring multiple-choice questions, which is pretty common in the education industry. Once such a question is authored, it is posted to the server. The route on the server validates the posted JSON, and it would be nice if I could declare and validate the unique keys. Here's a simplified version of my schema with a suggestion on how to do so.

{
  type: 'object',
  properties: {
    prompt: { type:'string' },
    choices: {
      type: 'array',
      minItems: 2,
      items: {
        type: 'object',
        properties: {
           id: { type:'string', format:'uuid' },
           text: { type:'string' },
        },
        required: [ 'id', 'text' ],
        additionalProperties: false,
      },
      uniqueItems: [
        { key:'id' },
        { key:'text' },
        { key:[ 'composite', 'example' ] }
      ]
    },
    multiple: { type:'boolean' },
  },
  required: [ 'prompt', 'choices', 'multiple' ],
  additionalProperties: false,
}

Rationale

  • Overloading "uniqueItems" is consistent with how "additionalProperties" is overloaded. Rather than providing "true", you can define how the items must be unique.
  • This does not significantly increase complexity (e.g. by requiring relative pointers). But it also does not reduce complexity for uniqueness checks: the comparison algorithm used to compare unique items today must still be used to compare key values. This makes it acceptable for a key to have an object value - the objects must be deeply equal.
  • Performance should not be a concern. Testing whether two values are deeply equal is a common operation. Whether you perform this operation during schema validation or move that responsibility elsewhere, the cost remains roughly the same after sufficient optimization - especially for languages such as JavaScript/NodeJS, which are not strongly typed.
  • The one downside of this solution is that it avoids pointers in the name of simplicity. This means we can't assess item uniqueness based on nested keys, which may be required by edge cases I haven't thought of. But since "uniqueItems" accepts an object (or array of objects) it is extensible. If you wish to define "pointer" as an alternative to "key" at a later date, that would be easy.
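To illustrate the overloaded form, here is a sketch of how a validator might evaluate such a `uniqueItems` array (semantics inferred from the example schema above; `item_key` and `check_unique_items` are illustrative names, not a published API):

```python
def item_key(item, key):
    # `key` is either one property name or a list of names forming a
    # single composite key (as in the "composite" example above).
    names = [key] if isinstance(key, str) else key
    return tuple(item.get(name) for name in names)


def check_unique_items(array, specs):
    # Each spec ({"key": ...}) must hold independently: a duplicate
    # under any one spec fails the whole check (the specs are OR'd).
    for spec in specs:
        seen = []
        for item in array:
            key = item_key(item, spec["key"])
            if key in seen:
                return False
            seen.append(key)
        # All items were distinct under this spec; check the next one.
    return True
```

For the choices schema above, two choices with distinct ids but identical text would fail the check, because the `{"key": "text"}` spec is violated on its own.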

@gregsdennis
Member Author

I like the array form for multiple keys, but I'm not sure about the objects with a single key property. I think it's simpler just having an array of strings containing pointers. (I'm not sure what you're trying to do with the "composite" example.)

Pointers don't add to complexity because any validator already knows how to resolve them from having to support $ref et al.

It's easier to validate at the meta-schema level, too.

As you mentioned, performance isn't a concern. Rather, it should improve, since the validator is only required to compare a small number of values rather than an entire tree.

@pongstylin

pongstylin commented Mar 11, 2021

@gregsdennis

I like the array form for multiple keys, but I'm not sure about the objects with a single key property. I think it's simpler just having an array of strings containing pointers. (I'm not sure what you're trying to do with the "composite" example.)

Understanding the composite example is essential to understanding my suggestion. By "composite", I mean a key composed of multiple parts. Using my example, this simple line of code should explain it. So, if you wanted to avoid saying the word "key", we would have to have nested arrays, which is not as easy to read.

isDuplicate = a.id === b.id || a.text === b.text || (a.composite === b.composite && a.example === b.example)

An array of objects (or a single object) is really not that bad. Seeing the word "key" helps the reader understand what they are looking at. Also, we may consider adding other parameters to tune object comparisons, such as "deep" or "exclude" (these are just examples, not advocacy for adding such parameters).

Pointers don't add to complexity because any validator already knows how to resolve them from having to support $ref et al.

If there is no roadblock to using pointers, I'm just as happy to use pointers. But I wouldn't want pointers to hold up the feature in general. So I'm pointing out that we could go either way and pivot later.

As you mentioned, performance isn't a concern. Rather it should be improved since the validator is only required to compare a small number of values rather than an entire tree.

Exactly right. Specifying unique keys does not reduce complexity, but does improve performance.

@gregsdennis
Member Author

Oh, you mean that the array items are OR'd together. I'm not sure I agree with that. I'm thinking that each item in the array adds to the composite key (they're AND'd).

I'm not certain there's much of a use case for complex boolean logic in deciding a uniqueness key. But to say that (a.foo, a.bar) must be unique for any a in the array is quite useful. This is actually the more common pattern, e.g. in document databases (hash and sort keys).

@pongstylin

@gregsdennis I provided a use case where I want multiple fields to be unique - the id as well as the text of a choice in a multiple-choice question. When working with databases, I have seen many tables that have both a unique identifier and some textual representation of the item (e.g. title or name) that should be unique as well - otherwise a user would be confused by seeing an apparent duplicate in a list. There are exceptions, of course, for more complex entities where the display text does not need to be unique because other metadata can appear alongside a title or name to distinguish it. Other people in this thread have asked for the same, so I don't think it is that obscure of a need. But I am glad my code example cleared up my intent.

@pmsreenivas

pmsreenivas commented Mar 12, 2021

Oh, you mean that the array items are OR'd together. I'm not sure I agree with that. I'm thinking that each item in the array adds to the composite key (they're AND'd).

I'm not certain there's much of a use case for complex boolean logic in deciding a uniqueness key. But to say that (a.foo, a.bar) must be unique for any a in the array is quite useful. This is actually the more common pattern, e.g. in document databases (hash and sort keys).

@gregsdennis

Based on my comment posted on Feb 1, 2021: the array items must be OR'd if the keys are not composite and are independently unique (scenario 2 in my earlier comment), while they must be AND'd if the keys are compositely unique (scenario 1 in my earlier comment).

@pongstylin

pongstylin commented Mar 12, 2021

I just discovered that multiple unique keys (OR'd) are supported by ajv.
https://github.com/ajv-validator/ajv-keywords#uniqueitemproperties

That basically provides exactly what I need. Unfortunately, it doesn't support composite keys (AND'd). But, if your json-schema validator supports custom keywords (like ajv), then here's some code I've been playing with that is a proof-of-concept. Feel free to use / port it to implement your own uniqueItems keyword.
https://jsbench.me/bckm6l0k0k/1

@gregsdennis
Member Author

I've created a draft vocab for this. It can be reviewed here.

This vocab adds the uniqueKeys keyword which is an array of JSON Pointers. Each pointer is applied to each item in an array to produce a set of values for that item. If one of the pointers can't be resolved it's skipped.

The keyword passes if all of the resulting value sets are unique.
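Roughly, those semantics could be sketched as follows (an approximation for illustration only; edge cases, such as items where no pointer resolves, are defined by the vocab document itself):

```python
def resolve(item, pointer):
    # Resolve a JSON Pointer against one array item; returns (found, value).
    current = item
    for token in pointer.strip("/").split("/"):
        if not isinstance(current, dict) or token not in current:
            return False, None
        current = current[token]
    return True, current


def unique_keys(array, pointers):
    # Each pointer is applied to each item, producing a per-item
    # collection of resolved values; unresolvable pointers are skipped.
    # The keyword passes when no two items produce the same collection.
    value_sets = []
    for item in array:
        values = tuple(
            (p, v)
            for p in pointers
            for found, v in [resolve(item, p)]
            if found
        )
        if values in value_sets:
            return False
        value_sets.append(values)
    return True
```

Here the per-item "value set" is keyed by pointer, so two items match only when every resolvable pointer yields the same value in both.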

@gregsdennis
Member Author

I've released the vocabulary and an implementation for my validator JsonSchema.Net.

@dudicoco

dudicoco commented Nov 2, 2021

@gregsdennis I can't get uniqueKeys to work.
I tried it with draft 7, draft 2019-09, draft 2020-12 and with https://gregsdennis.github.io/json-everything/meta/unique-keys

I'm not sure if I'm missing something or if the issue is with the validator which I'm using: https://github.com/sirosen/check-jsonschema

@gregsdennis
Member Author

@dudicoco because this is an externally defined vocabulary, the implementation you're using needs to support it. If it supports custom keywords, you may be able to add it yourself. As of right now, my validator is the only one I know of that supports this vocabulary.

There's a lot of discussion in various issues around the difference between vocabularies and their meta-schemas (if they have one). The meta-schema can provide syntax checking (is the keyword's usage syntactically correct?), but the meta-schema can't define the logic behind the keyword. I also cover this in a bit more detail in my docs.

What you'll need to do is check to see if the validator you're using supports custom keywords and possibly implement it yourself.

If you want to play with it a bit to see if your schema's doing what you expect, you can use https://json-everything.net/json-schema, which is powered by my library and supports my custom vocabularies.

@dudicoco

Thanks @gregsdennis!

Is it possible to define all keys as unique by default?

See the following schema snippet for example:

            "environment_variables": {
              "type": "object",
              "required": [],
              "patternProperties": {
                "(^([a-z]+[.])+[a-z_]+)$|(^([A-Z0-9_])+([A-Z0-9_]?)+([A-Z0-9])+)$": {
                  "type": [
                    "string"
                  ]
                }
              },
              "additionalProperties": false
            }

In this case we can't specify each key in uniqueKeys as the key names are not fixed.

@gregsdennis
Member Author

Hey @dudicoco. Please raise the issue over in my repo https://github.com/gregsdennis/json-everything. Happy to discuss your needs there.

@mprevdelta

Reading all this just reinforces for me that anyone doing document databases should have it impressed upon them that only very simple, fundamental objects should be nested. If you have a document containing an array of unique objects, break it out into a separate collection and store an array of ids referencing the new collection's objects instead.

The tools will allow you to make nested objects that look like they should work, and even do some pretty advanced operations on them, but when you get to adding indexes and constraints or doing reporting, it all falls apart.

@dudicoco

dudicoco commented Apr 7, 2022

@mprevdelta can you share an example?

@amannm

amannm commented May 4, 2022

This feature is similar to x-kubernetes-patch-merge-key ... it would be useful to make standard JSON Schema arrays patchable with a "strategic merge" similar to what Kubernetes does on fields with those property extensions.

@MT-0

MT-0 commented Sep 19, 2023

@gregsdennis

Oh, you mean that the array items are OR'd together. I'm not sure I agree with that. I'm thinking that each item in the array adds to the composite key (they're AND'd).

I'm not certain there's much of a use case for complex boolean logic in deciding a uniqueness key. But to say that (a.foo, a.bar) must be unique for any a in the array is quite useful. This is actually the more common pattern, e.g. in document databases (hash and sort keys).

Defining an array of keys to automatically describe a composite key is not useful as it prevents defining multiple unique keys.

uniqueKeys: [ "/key1", "/key2" ]

It could be defined as either:

  1. The key1 property is unique and independently the key2 property is unique; or
  2. The tuple key1, key2 is unique.

If you define it as option 1 then the vocabulary can be extended such that uniqueKeys could take an array of arrays:

uniqueKeys: [ ["/key1"], ["/key2"], ["/key3", "/key4"] ]

Such that key1 is unique, independently key2 is unique, and again independently the tuple key3 and key4 is unique.

There may be a case for allowing a simplified version of the syntax: if a nested array contains only a single item, it could be rewritten without the array wrapper, so that the previous example would be functionally identical to:

uniqueKeys: [ "/key1", "/key2", ["/key3", "/key4"] ]

This would be particularly useful for JSON:API resources where, for example, an employee has a unique id, a unique username, and a unique combination of several other fields:

uniqueKeys: [
  ["/id"],
  ["/attributes/username"],
  ["/references/team/data/id", "/attributes/index-within-team"]
]
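Under this reading, the mixed form could be evaluated along these lines (a sketch of the proposed semantics; it uses plain JSON Pointer traversal into nested objects and hypothetical names, and glosses over relative-pointer details):

```python
def value_at(item, pointer):
    # Walk a pointer like "/attributes/username" into nested objects.
    current = item
    for token in pointer.strip("/").split("/"):
        if not isinstance(current, dict) or token not in current:
            return None
        current = current[token]
    return current


def unique_key_groups(array, key_groups):
    # Each entry is either a single pointer or a list of pointers
    # forming one composite key; every group must be independently
    # unique across the array.
    for group in key_groups:
        pointers = [group] if isinstance(group, str) else group
        seen = []
        for item in array:
            key = tuple(value_at(item, p) for p in pointers)
            if key in seen:
                return False
            seen.append(key)
    return True
```

With `["/key1", ["/key3", "/key4"]]`, key1 must be unique on its own, while key3 and key4 only need to be unique as a pair.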

@gregsdennis
Member Author

@MT-0 what you're describing can be done with my proposal (and what's currently described in the vocab I wrote). You just need an allOf.

{
  // ...
  "allOf": [
    { "uniqueKeys": [ "/key1" ] },
    { "uniqueKeys": [ "/key2" ] },
    { "uniqueKeys": [ "/key3", "/key4" ] }
  ]
}

It's more verbose, but it's also more explicit and easier to read.
