Decentralized persistence and resolving #112
What makes sense to me is to distinguish multiple kinds of remote data handling:
@jonassmedegaard I like the idea, we can probably model and implement it as a typed DAG.
@jonassmedegaard good ideas, these different usecases deserve clear distinctions. The process of forking (cloning + editing) external resources should be specified as well. In the opening post, I was mostly thinking about the first usecase:
I disagree: I think only a cloned resource should be renamed and tracked as a separate issue - a resource that is mirrored is still the same, including subject. I.e. what I mean by a mirrored resource is owl:sameAs.
I imagine the Atomic Server would only cache at first, and if at a later refresh of the cache the resource had gone, then flag it as needing action. I guess the default proposed action might differ based on how the resource had disappeared - e.g. host inaccessible or a 5xx response might lead to proposing the action of turning the cached copy into a copy, whereas the host serving different content might lead to proposing turning the cached copy into a forked resource.
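The refresh-and-flag flow described above could be sketched roughly like this. All types and names here are illustrative, not Atomic-Server's actual API:

```rust
// Illustrative sketch of the proposed-action logic above; these types are
// hypothetical, not part of Atomic-Server.
#[derive(Debug, PartialEq)]
enum RefreshOutcome {
    StillThere,
    HostUnreachable,
    ServerError(u16), // e.g. a 5xx response
    ContentChanged,
}

#[derive(Debug, PartialEq)]
enum ProposedAction {
    KeepCache,
    TurnIntoCopy, // source gone: keep the cached values as a read-only copy
    TurnIntoFork, // source diverged: track our version as a forked resource
}

fn propose_action(outcome: RefreshOutcome) -> ProposedAction {
    match outcome {
        RefreshOutcome::StillThere => ProposedAction::KeepCache,
        // Host gone or erroring: the cached values are probably still valid.
        RefreshOutcome::HostUnreachable | RefreshOutcome::ServerError(_) => {
            ProposedAction::TurnIntoCopy
        }
        // Host now serves different content: the subject means something else.
        RefreshOutcome::ContentChanged => ProposedAction::TurnIntoFork,
    }
}

fn main() {
    assert_eq!(
        propose_action(RefreshOutcome::ServerError(503)),
        ProposedAction::TurnIntoCopy
    );
    assert_eq!(
        propose_action(RefreshOutcome::ContentChanged),
        ProposedAction::TurnIntoFork
    );
}
```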
But I think we probably mean the same thing by mirroring: creating a local clone of some external resource, using a different subject (or else the mirrored resource could not be resolved), and considering the values immutable. What do you think?
Right. I think I got confused about the very essential meaning of this issue: If you truly are talking about "same subject" then there cannot be any decentralization, only caching of One True Source - because that one single source is the subject. |
I.e. you cannot "embed" an external resource without its RDF subject changing, and you cannot offer decentral resolving because there is only one authoritative source (every indirect access can only possibly be cached data). So essentially this very issue cannot be about "same subject" (and therefore I simply ignored that sentence in your previous post here). ...or what am I missing? |
I mean, either we are talking about resources that each can only have one true identifier, or we are talking about resources that each can have one or more identifiers (i.e. semantically equivalent identifiers). Which is it?
Decentralized resolving with only one authoritative source is not necessarily impossible:
I'm talking about a single, decentralized, resolvable identifier. Semantically equivalent identifiers are interesting for other reasons, but not for this issue. Still, I definitely believe that you make a valid distinction in your first comment: there are multiple reasons for using remote data, and various types of relationships between source and user.
Ok, so when you say "embed" here, you really only mean "cache". With that constraint it makes sense to me (otherwise not). |
When I mention embed, I'm talking about embedding the actual application / dependency that deals with decentralised resolving. For example, an embeddable IPFS implementation, or a different type of library that can be embedded in the Atomic-Server binary. I want to prevent introducing a runtime dependency. I'll edit the OP to make this a bit clearer.
Makes more sense now. Thanks! Essentially this is related to the "...but what if the web is lost" problem of networked resources. This is the reason I laid out the ways to secure knowledge about external data points - recognizing that they may disappear.

You can replace http identifiers with all-is-on-a-blockchain identifiers, but that does not change the fact that data points may get lost, it just changes how it happens: with blockchains it gets lost by the web growing too large to truly be fully mirrored, and techniques to "omit the less important bits" then occasionally optimizing away the bits that you need.

So sure, you can choose to use IPFS identifiers instead of http identifiers, shifting your choice of underlying "weaving tech" for the web you want your system to rely on. And then embed the code to handle that identifier type. I would be sad if you chose to use only blockchain-based IDs for Atomic Data, because that would massively lose the ability to weave a web of both Atomic Data and Solid nodes.
I fully agree that completely moving to blockchain IDs would be a bad approach. Most (if not all) blockchain solutions are far too slow, anyways.
True, data can always get lost. But there are some characteristics of Atomic Data that could help make it less likely that data becomes lost. If every server that has a dependency on some external resource also caches this resource and advertises its caching to others, we get a degree of redundancy that makes it far less likely that critical information gets lost. Finding a mechanism that enables this, though, seems pretty complicated.
When limited to caching, protocol-specific rules for caching must be obeyed. I.e. CacheControl header for http protocol. It confuses me that you mention IPFS and Hypercore if what you want is to (also) cache http identifiers. |
I mean, what you can do to aid other Atomic Data servers in long-term caching of your data is to add a CacheControl header with a long expiry time (and then treat that identifier as immutable for that same amount of time, obviously!). And what you can do to cache data from external Atomic Data servers is to store a cached copy locally, but only for as long as that external server signaled in their CacheControl header that you are permitted to do so. Other protocols may have other efficient cache management, but those features are irrelevant for caching of http-based identifiers.
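As a concrete sketch of honoring a remote server's signaled expiry, a cache could parse the `max-age` directive from the Cache-Control value. This is plain string parsing for illustration, not a full Cache-Control implementation:

```rust
// Sketch: extract `max-age` (in seconds) from a Cache-Control header value,
// to decide how long a locally cached copy may be considered valid.
fn max_age_seconds(cache_control: &str) -> Option<u64> {
    cache_control
        .split(',')
        .map(|directive| directive.trim())
        .find_map(|directive| directive.strip_prefix("max-age="))
        .and_then(|value| value.parse().ok())
}

fn main() {
    // A long expiry signals that others may cache this resource for a year;
    // the serving side should then treat it as immutable for that long.
    assert_eq!(max_age_seconds("public, max-age=31536000"), Some(31536000));
    // No max-age means no permission for long-term caching was signaled.
    assert_eq!(max_age_seconds("no-store"), None);
}
```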
CacheControl is not for dealing with resources that go offline (i.e. 404 / server timeout), which is the scope of this issue. CacheControl is just for increasing performance, and preventing re-fetching big documents. I haven't implemented that in Atomic Server, as the (often very small) resources themselves take about 0.2 milliseconds to fetch from disc. Just to be clear: maybe CacheControl has its merit here, too, but it definitely does not solve the issue I'm trying to solve here: resources that go offline, and having means to find them if the HTTP URL no longer works.
The crate
...and other more generic cache handling issues like this one. My point being that this sounds specific to cache handling - which for http is tied to the rules for caching defined as part of the http protocol. |
When a resource disappears from the web, then either you violate CacheControl rules by continuing to serve a cached copy beyond its expiry time, claiming that what you serve represents that URI, or you admit that you are serving a copy of a resource that at a certain point in time had a certain identity. Might make sense to track mutable data as a separate issue, but I dare say that this is exactly about the second form in my list: read-only copy of a resource
FYI Rust-Libp2p (the networking stack behind IPFS) supports DHT: https://github.com/libp2p/rust-libp2p/tree/master/protocols/kad

I've been familiar with both Hypercore (formerly known as Dat Protocol, now called Hole Punch) and IPFS (IPNS, IPLD, Libp2p) since 2018/2019, and they have both evolved a lot since then. The IPFS ecosystem is well funded and well supported, seemingly more so than Hypercore/Holepunch, so IPFS adoption and name recognition are more prevalent; plus, IPFS seems to have better browser support overall (not every user wants to download something to get started). The very slow IPNS is due to the DHT, but IPNS can be sped up by using pubsub, and there is also a new initiative called the "Name Name Service" (NNS) which may replace IPNS in the future.

Personally, I am working on a zk Delegated Anonymous Credentials name system which could offer a robust naming solution across mesh nodes. After years of research and development in this area, I am leaning towards nodes that can be run at home with no domain name (TLS) requirement -- which means WebRTC Data Channels over Rust-compiled nodes, with the data being persisted and resolved across those nodes.
As a protocol, Atomic Data is mostly designed with the assumption that HTTP URLs do not change, and will continue to be hosted for as long as needed. In practice, this does not always happen. This basically means that every time you use an externally defined thing (such as a Class or Property), you introduce a dependency. The source may go offline anytime.
We currently deal with this issue by simply caching things server side. In effect, all Properties and Classes that a server encounters are saved locally. This works for this server, but what happens if someone else wants to use this data? If they try to get a Property, for example, that is no longer hosted, they have no alternative means of resolving the URL.
I'm looking for a system / protocol that gives users the option to find resources that have gone offline at their original source.
Some considerations:
Let's discuss various approaches to this problem here:
IPFS
A really interesting technology that allows for content-based addressing. See #42
Great for static stuff, but not that great for things that change over time.
The rust version does not offer DHT support as of now, and development appears to have stagnated.
One way that, to me, seems particularly interesting, is to add the IPFS identifier to the HTTP url. Basically, we get a URL like this:
https://atomicdata.dev/someresource?ipfsid=QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdD
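A hypothetical sketch of resolving such a hybrid identifier: extract the `ipfsid` query parameter, so a client can fall back to IPFS when the HTTP URL no longer resolves. Plain string parsing, no external crates; the parameter name follows the example URL above:

```rust
// Sketch: pull the IPFS fallback identifier out of a hybrid subject URL.
fn ipfs_fallback(subject: &str) -> Option<&str> {
    // Split off the query string, then look for an `ipfsid=` pair.
    let (_, query) = subject.split_once('?')?;
    query.split('&').find_map(|pair| pair.strip_prefix("ipfsid="))
}

fn main() {
    let subject =
        "https://atomicdata.dev/someresource?ipfsid=QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdD";
    assert_eq!(
        ipfs_fallback(subject),
        Some("QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdD")
    );
    // A plain HTTP subject has no IPFS fallback.
    assert_eq!(ipfs_fallback("https://atomicdata.dev/someresource"), None);
}
```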
That way, the subject contains information about who is in charge / where you should fetch the data (atomicdata.dev), which version was used (the ipfs id is a hash of a specific version), and where you can retrieve that data if the HTTP url does not resolve. Pretty cool stuff, right? See #64 about hybrid identifiers.

Hypercore
HyperCore is fundamentally a protocol for a replicated append-only log, although it also supports higher-level K/V and filesystem-based data structures. It is a bit like bittorrent, but more dynamic. Logs have a public key as an address, to which (everyone) can append.
It has a Rust crate, although it is in beta and doesn't seem to be actively developed anymore. Not a problem per se, if its current state is good enough and the code is maintainable. Also [this one](https://github.com/datrs/hypercore-protocol-rs) from the Dat project.
We could store Atomic Commits using Hypercore. One log per resource, which represents all the changes to that specific resource. The secret key for the Resource is maintained by the one creating the Commits. We could share the public key in the Resource. When this `hypercorePubKey` is present, the committer also sends the new signed commit to the Hypercore feed.

I think it's probably best if the server maintains the secret key for the feed (by default), because that way it could also append server-side changes made by other users, through some form of authorization. In other words, you could invite a different user to append to your log, if certain conditions are met. For example, you could invite others to post messages to your chat room this way - without having to share the secret to your log.
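A minimal sketch of what this could look like on the data side, assuming a `hypercorePubKey` property on the resource. The struct, the `hyper://` target format, and all names are illustrative, not Atomic-Server's actual model:

```rust
// Hypothetical sketch: a resource advertising a Hypercore feed as a
// secondary channel for its signed Commits.
struct Resource {
    subject: String,
    // The value of the (hypothetical) `hypercorePubKey` property, if set.
    hypercore_pub_key: Option<String>,
}

/// Returns every target a new signed Commit should be sent to.
fn commit_targets(resource: &Resource) -> Vec<String> {
    // The HTTP subject is always a target, as it is today.
    let mut targets = vec![resource.subject.clone()];
    if let Some(key) = &resource.hypercore_pub_key {
        // Also append the Commit to the replicated Hypercore log.
        targets.push(format!("hyper://{key}"));
    }
    targets
}

fn main() {
    let chatroom = Resource {
        subject: "https://example.com/chatroom".into(),
        hypercore_pub_key: Some("somepubkey".into()),
    };
    assert_eq!(
        commit_targets(&chatroom),
        vec![
            "https://example.com/chatroom".to_string(),
            "hyper://somepubkey".to_string()
        ]
    );
}
```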
What would we gain from this? Well, we'd have an extra way to retrieve Commits, even if the server goes offline. At least, in theory, if others have replicated the data.
Custom Merkle tree / DHT implementation
Perhaps it makes more sense to build something custom for Atomic, something lightweight and designed in conjunction with other parts of Atomic.
It needs to:
Nah, too much effort.
Asking atomic servers to resolve external HTTP resources
We currently have the `/path` endpoint, which accepts any Atomic URL / subject. A client that wants to use some property that appears to be offline could then ask any atomic server for this `http://someproperty` resource.

This approach has a couple of limitations:
Also, if servers would serve external content, it may be worth sharing some cache-related metadata (thanks @jonassmedegaard), such as caching policy, and date fetched / cached.
We could maybe solve this by introducing some form of discoverability. Not sure how that should work, though.
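The fallback idea above could be sketched like this: try the authoritative subject first, then ask other known Atomic servers to resolve it. Here `fetch` stands in for a real HTTP client, and the query-parameter name for `/path` is an assumption:

```rust
// Hedged sketch: resolve a subject, falling back to other Atomic servers'
// /path endpoint when the origin is offline. Names are illustrative.
fn resolve_with_fallback(
    subject: &str,
    fallback_servers: &[&str],
    fetch: impl Fn(&str) -> Option<String>,
) -> Option<String> {
    // Try the authoritative source first.
    if let Some(body) = fetch(subject) {
        return Some(body);
    }
    // Origin is offline: ask each known server to resolve the subject.
    fallback_servers
        .iter()
        .find_map(|server| fetch(&format!("{server}/path?subject={subject}")))
}

fn main() {
    // Simulated network: the origin is down, one mirror has the resource.
    let fetch = |url: &str| {
        if url.starts_with("https://mirror.example/path?subject=") {
            Some("cached resource body".to_string())
        } else {
            None
        }
    };
    let found = resolve_with_fallback(
        "http://someproperty",
        &["https://dead.example", "https://mirror.example"],
        fetch,
    );
    assert_eq!(found.as_deref(), Some("cached resource body"));
}
```

Discoverability would then amount to how the `fallback_servers` list gets populated, which is the open question here.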