Current structure might hit limits #16

Closed
rickvdbosch opened this issue Sep 29, 2020 · 7 comments · Fixed by #23

@rickvdbosch
Collaborator

In the current entity setup, the repos a user contributed to and the PRs for each of those repos are serialized into a JSON string and stored in a single Table Storage column. The maximum size of a string column in Table Storage is 64 KiB:

String values may be up to 64 KiB in size. Note that the maximum number of characters supported is about 32 K or less.
Source: Understanding the Table service data model - Property types

Because of this limit, the current structure might be insufficient for (very) active users.
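
To make the limit concrete, here is a minimal Python sketch of a size check. It assumes string values are stored as UTF-16, which is why 64 KiB works out to roughly 32 K characters. The function names and the sample payload shape are made up for illustration and are not taken from the project.

```python
import json

MAX_STRING_PROPERTY_BYTES = 64 * 1024  # 64 KiB limit per string property


def fits_in_string_property(value: str) -> bool:
    """Return True if value fits in one string property, assuming UTF-16 storage."""
    return len(value.encode("utf-16-le")) <= MAX_STRING_PROPERTY_BYTES


def sample_payload(pr_count: int) -> str:
    """Hypothetical payload: one repo with pr_count PRs, serialized to JSON."""
    prs = [
        {"id": i, "url": f"https://github.com/owner/repo/pull/{i}"}
        for i in range(pr_count)
    ]
    return json.dumps([{"owner": "owner", "name": "repo", "prs": prs}])
```

A check like `fits_in_string_property(sample_payload(n))` could be run before saving, so an oversized entity is caught before the storage call fails.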

I propose to implement an alternative structure to make sure we can accommodate even the most active GitHub users. Is that OK?

@Layla-P
Owner

Layla-P commented Sep 30, 2020

@CrypticEngima Are you able to look into this?

@CrypticEnigma00
Collaborator

@rickvdbosch Currently we are not expecting the usage to hit those limits, but if we do start getting that amount of traffic we can look at refactoring this. In preparation for that, could you please describe the changes you propose? (I'm personally interested, as this is the first time I'm using Table Storage.)

@rickvdbosch
Collaborator Author

rickvdbosch commented Sep 30, 2020

@CrypticEngima For the way I usually work with Table Storage, you could take a look at my TableStorageRepository for reference. Might be interesting.

For the entities, I would think about the following:

| Table        | Partition Key       | Row Key  |
|--------------|---------------------|----------|
| Users        | "Users"             | Username |
| Repositories | Username            | Reponame |
| PullRequests | Username + Reponame | PrId     |

There's a downside here, since you need multiple queries to get all the information. But with proper partitioning that shouldn't be a real issue.
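
The key layout for the three tables could be sketched as follows. This is only an illustration in Python: the helper names are hypothetical, and the ':' separator between username and repo name is an assumption, since the proposal only says "Username + Reponame".

```python
from typing import NamedTuple


class EntityKey(NamedTuple):
    table: str
    partition_key: str
    row_key: str


def user_key(username: str) -> EntityKey:
    # All users share the single partition "Users".
    return EntityKey("Users", "Users", username)


def repository_key(username: str, reponame: str) -> EntityKey:
    # One partition per user, one row per repository.
    return EntityKey("Repositories", username, reponame)


def pull_request_key(username: str, reponame: str, pr_id: int) -> EntityKey:
    # Assumed separator ':' (not specified in the proposal).
    return EntityKey("PullRequests", f"{username}:{reponame}", str(pr_id))
```

With keys built this way, each lookup targets exactly one partition, which is what keeps the extra queries cheap.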

@CrypticEnigma00
Collaborator

@rickvdbosch Thank you so much for sharing that information. I can most certainly see the benefits of this structure. I have one overriding question about the format you suggest here, though:

Does this format not turn a key-value store into a basic relational database?

Maybe I'm misunderstanding the usage of 'NoSQL' style storage; I'm so used to using relational databases.

@rickvdbosch
Collaborator Author

Well, the current structure does the same, but only by serializing data instead of having it in separate tables. 😁

Looking at this from an API perspective, there are some clear entry points to be seen.

  1. Get repos per user
  2. Get PRs for a repo (of a user)

This would validate the structure, since you're going to need to call 1 before calling 2. Us using MVC might drive us to think we'd need all the data at once for our model.
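
A rough Python sketch of those two entry points, using in-memory dictionaries as a stand-in for the Repositories and PullRequests tables. The function names, the sample data and the ':' separator in the partition key are hypothetical.

```python
# In-memory stand-ins for two tables: partition key -> {row key -> entity}
repositories = {
    "rickvdbosch": {"HacktoberfestProject": {"owner": "Layla-P"}},
}
pull_requests = {
    "rickvdbosch:HacktoberfestProject": {
        "19": {"title": "Get user from table storage based on GitHub info"},
    },
}


def get_repos_for_user(username: str) -> list[str]:
    """Entry point 1: a single partition lookup on the Repositories stand-in."""
    return sorted(repositories.get(username, {}))


def get_prs_for_repo(username: str, reponame: str) -> list[str]:
    """Entry point 2: a single partition lookup on the PullRequests stand-in."""
    return sorted(pull_requests.get(f"{username}:{reponame}", {}))
```

A caller would first list the repos for a user and then ask for the PRs of one of them, so each call only touches one partition.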

Come to think of it, maybe the user table is not even needed. It doesn't store anything other than the username... right? So having the username as the PK of the repos table eliminates that one. And to be honest, I'm not entirely sure about the repositories table either.

That would solve the issue entirely 🤓

@rickvdbosch
Collaborator Author

So I took the time to play a game of tennis, and CrypticEngima's comment and the relaxation gave me some new insights. Nothing in this comment is meant as criticism, only to get us to the best solution. So here goes:

The current solution

The proposal in my earlier comment in this thread was based on an existing model, which actually seems set up with a relational model in mind. But I think we might need to take a step back in defining the model.

Requirements

What we should do first is define what data we actually need to store. The Users table, for instance, can be removed since the only thing we store in it is the username. That's something we can store elsewhere.
Next we need to take a look at the levels at which we want to retrieve that data. Because if we always get all repositories and the PRs the user has for those repos, the model can be brought back to only one table. That's the cool thing about Table Storage: it's so fast and cheap that it's not bad to store things multiple times. Normalization is not that important anymore.

Proposal (beware, based on assumptions above)

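|           | PartitionKey | RowKey                          | Column                                                 | Column                                           |
|-----------|--------------|---------------------------------|--------------------------------------------------------|--------------------------------------------------|
| Structure | Username     | {owner}:{reponame}:{prId}       | Url                                                    | Title                                            |
| Example   | rickvdbosch  | Layla-P:HacktoberfestProject:19 | https://github.com/Layla-P/HacktoberfestProject/pull/19 | Get user from table storage based on GitHub info |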

This enables us to get all information for a specific user by querying that user's entire partition. The combined RowKey is unique and can be parsed into three separate values: owner, reponame and prId.
As a sidestep: we can generate the URL based on that information. So to be efficient we could remove Url too. But if it's simpler to keep it, then we should.
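
As a minimal Python sketch of the combined RowKey idea: composing it, parsing it back, and deriving the URL from it. The function names are hypothetical, and the parsing assumes that neither the owner nor the repo name contains a ':'.

```python
def make_row_key(owner: str, reponame: str, pr_id: int) -> str:
    """Build the combined RowKey in the form {owner}:{reponame}:{prId}."""
    return f"{owner}:{reponame}:{pr_id}"


def parse_row_key(row_key: str) -> tuple[str, str, int]:
    """Split a RowKey back into owner, reponame and prId."""
    # Assumes owner and reponame never contain ':'.
    owner, reponame, pr_id = row_key.split(":")
    return owner, reponame, int(pr_id)


def pull_request_url(row_key: str) -> str:
    """Derive the PR URL from the RowKey instead of storing it."""
    owner, reponame, pr_id = parse_row_key(row_key)
    return f"https://github.com/{owner}/{reponame}/pull/{pr_id}"
```

For the example row, `pull_request_url("Layla-P:HacktoberfestProject:19")` yields the same address as the Url column, which is why that column could be dropped.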

Input or ideas?

Any ideas @Layla-P and @CrypticEngima?

@CrypticEnigma00
Collaborator

@rickvdbosch Thanks, I see the way you're thinking about this now, and yes, it's a big change from the way you think about data in a relational database. I think I need to investigate 'NoSQL' style storage further to understand it better, but this information has been a real eye-opener.
