Skip to content

Shared Data

Kinan Bab edited this page Apr 27, 2023 · 2 revisions

In the previous tutorial, we saw how a simple K9db schema can be annotated to reflect a simple compliance policy, where every piece of data had a clear and explicit singular owner.

However, it is common in web-applications to have some data that corresponds to several data subjects. In such a case, developers need to make a determination about the rights each of these data subjects have to that shared data.

Let's look at a simplified scenario extracted from our Lobsters experiment. In this application, users can post stories on their social media profile, and they and other users can post comments on these stories. Furthermore, users can exchange direct private chat messages. Thus, we have the following tables: users, chat, stories, and comments.

Schema

The users table clearly represents data subjects and can be annotated with DATA_SUBJECT.

CREATE DATA_SUBJECT TABLE users(
  ID INT,
  name TEXT,
  PRIMARY KEY(ID)
);

The chat table is more interesting: a chat message has a sender and a receiver, both foreign key columns pointing to users. Both of these users have some rights to the chat message, but the exact policy depends on the application. For example, Facebook's privacy policy ensures that a private message remains visible to one of the two users, even if the other user requested deletion of their data. K9db views such a policy as expressing a notion of co-ownership: the data is truly owned by all of the associated users, and thus is only deleted when all the users request deletion. We express this in K9db by annotating both foreign keys with an OWNED_BY annotation.

CREATE TABLE chat (
  ID INT,
  sender_id INT,
  receiver_id INT,
  message TEXT,
  PRIMARY KEY (ID),
  FOREIGN KEY (sender_id) OWNED_BY users(ID),
  FOREIGN KEY (receiver_id) OWNED_BY users(ID)
);

Note that these foreign keys need to be annotated manually, as K9db does not automatically annotate cases with multiple ambiguous foreign keys. We will see other policies that assign asymmetric rights to associated users later.

The stories table is more straightforward. It has only one foreign key column to the users table that represents the author of the story. A reasonable policy considers the author to be the sole owner of this story. Note that K9db can automatically annotate this foreign key with OWNED_BY. However, we explicitly add this annotation below for clarity.

CREATE TABLE stories(
  ID INT,
  author INT,
  context TEXT,
  PRIMARY KEY (ID),
  FOREIGN KEY (author) OWNED_BY users(ID)
);

Finally, the comments table includes two foreign keys. The first links the comment to the story it is posted on, and the second links it to the author of the comment, which may be different than the author of the story. While it may seem that there is only one foreign key linking the comment to a data subject (the direct author FK), in reality, the other foreign key also links the comment to a data subject transitively via the stories table. Thus, K9db errors as expected on any attempt to create this table with no annotations, as the two foreign keys are ambiguous in terms of ownership.

Any reasonable policy must consider the author of the comment to be an owner of that comment. Certain applications, like reddit, consider that author to be the sole owner. If the author of the parent story requests their data be deleted, the contents of the story are removed (and it's shown in the user interface as [deleted story]), but the comments on the story made by other users remain visible. We will use this policy for this example.

CREATE TABLE comments(
  ID INT,
  author INT,
  story_id INT,
  content TEXT,
  PRIMARY KEY(ID),
  FOREIGN KEY (author) OWNED_BY users(ID),
  -- no annotations: only the author has rights to the comment.
  FOREIGN KEY (story_id) REFERENCES stories(ID)
);

SARs

Now that we have created the schema, we can insert some data, and validate that SARs behave as we expect.

INSERT INTO users VALUES (1, 'Alice');
INSERT INTO users VALUES (2, 'Bob');

INSERT INTO chat VALUES (1, 1, 2, 'Msg 1');
INSERT INTO chat VALUES (2, 2, 1, 'Msg 2');
INSERT INTO chat VALUES (3, 1, 1, 'Msg 3');

INSERT INTO stories VALUES (1, 1, 'Story 1');
INSERT INTO comments VALUES (1, 2, 1, 'Comment');
INSERT INTO comments VALUES (2, 1, 1, 'Response');

Say Alice wants to delete her data. We expect that 'Msg 3' will be deleted, since Alice sent it to herself, and thus is the sole owner of it. The other two messages should not be deleted, because they are also owned by Bob, who has not deleted his data yet. Furthermore, we expect that 'Story 1' will be deleted, because Alice authored it as well as the 'Response' comment, but not Bob's comment.

Indeed, we can see this is the case.

GDPR FORGET users 1;
> Query OK, 6 rows affected (0.029 sec)

SELECT * FROM chat;
+------+-----------+-------------+---------+
| ID   | sender_id | receiver_id | message |
+------+-----------+-------------+---------+
|    2 |         2 |           1 | Msg 2   |
|    1 |         1 |           2 | Msg 1   |
+------+-----------+-------------+---------+
> 3 rows in set (0.000 sec)

SELECT * FROM stories;
> Empty set (0.000 sec)

SELECT * FROM comments;
+------+--------+----------+---------+
| ID   | author | story_id | content |
+------+--------+----------+---------+
|    1 |      2 |        1 | Comment |
+------+--------+----------+---------+
> 1 row in set (0.000 sec)

Anonymization

While Alice's rows were correctly deleted in the example above, we can still see references to Alice's ID in some of the surviving rows. In particular, an external observer can still identify that the two messages Bob exchanged were in communication with the same user, who is also the author of the story on which Bob made a comment. The observer can potentially correlate this with additional external information to learn more information about the identity of the deleted user (Alice).

K9db allows developers to specify anonymization rules that can help hide this information in surviving rows. Anonymization rules can be applied on either a GDPR GET or a GDPR FORGET. In the case of GET, anonymization will be applied to the result set in memory, prior to returning it to the client, while also leaving the contents of the database unchanged. In FORGET, the anonymization is applied to data associated with the requesting data subject that survives the deletion, e.g. because it is owned by others as well.

Both types of anonymization rules require developers express the data subject to which these rules apply, by specifying the foreign key column in the table that connects the data to that data subject (directly or via transitive hops). Developers also specify the set of columns to anonymize when the rule is applicable.

Here is an example below for the chat table. Note: you will need to delete the database and restart K9db cleanly in order to re-create the table.

CREATE TABLE chat (
  ID INT,
  sender_id INT,
  receiver_id INT,
  message TEXT,
  PRIMARY KEY (ID),
  FOREIGN KEY (sender_id) OWNED_BY users(ID),
  FOREIGN KEY (receiver_id) OWNED_BY users(ID),
  -- When the data subject pointed to by sender_id issues a GDPR FORGET
  -- Anonymize the sender_id column by setting it to NULL.
  ON DEL sender_id ANON (sender_id),
  -- Ditto for receiver_id.
  ON DEL receiver_id ANON (receiver_id)
);

After re-creating the database with the new chat table schema, and inserting the same data as above. We can issue a GDPR FORGET and validate that anonymization works as expected.

GDPR FORGET users 1;
> Query OK, 8 rows affected (0.033 sec)

SELECT * FROM chat;
+------+-----------+-------------+---------+
| ID   | sender_id | receiver_id | message |
+------+-----------+-------------+---------+
|    2 |         2 |        NULL | Msg 2   |
|    1 |      NULL |           2 | Msg 1   |
+------+-----------+-------------+---------+
> 2 rows in set (0.001 sec)

Anonymization allows us to express even more complex policies. For example, in the above we used a policy similar to Reddit's, where comments made by one data subject are kept even when their parent story are deleted. However, with anonymization, we can express a different policy, such as Facebook's, where a comment is deleted when either:

  1. Its author requests deletion of their data.
  2. Its parent comment is deleted.

The comment thus now has multiple data subjects with deletion rights: the author of the comment, and the author of the story. This is somewhat similar to how chat worked above, except that the deletion condition is an OR rather than an AND. We can express this in K9db by making both foreign keys OWNED_BY, and then specify that the record be deleted (e.g. fully anonymized) when either data subjects request deletion.

CREATE TABLE comments(
  ID INT,
  author INT,
  story_id INT,
  content TEXT,
  PRIMARY KEY(ID),
  -- Both foreign keys are OWNED_BY.
  FOREIGN KEY (author) OWNED_BY users(ID),
  FOREIGN KEY (story_id) OWNED_BY stories(ID),
  -- Either of them deleting deletes the comment.
  ON DEL author DELETE_ROW,
  ON DEL story_id DELETE_ROW
);

We can recreate the database and insert the data again, and validate that the new policy is in effect.

GDPR FORGET users 1;
> Query OK, 10 rows affected (0.033 sec)

SELECT * FROM comments;
> Empty set (0.001 sec)