Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add as_any() to the ObjectStore to make it able to identify which ObjectStore is using for the related trait object #3839

Closed
wants to merge 2 commits into from

Conversation

yahoNanJing
Copy link

Which issue does this PR close?

Closes #3838.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

…ectStore is using for the related trait object
@github-actions github-actions bot added the object-store Object Store Interface label Mar 10, 2023
@tustvold tustvold added the api-change Changes to the arrow API label Mar 10, 2023
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that rust supports trait upcasting you might be able to just make ObjectStore: Any

@alamb
Copy link
Contributor

alamb commented Mar 11, 2023

There appears to be some CI errors.

I think adding a as_any function is probably fine> it would have wider support from older rust versions, and there are many existing examples of doing so

FWIW this seems like either approach is a breaking change (and would require releasing object_store 0.6.0), which is fine, but just something to keep in mind

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @yahoNanJing

@@ -169,6 +170,10 @@ impl std::fmt::Display for AmazonS3 {

#[async_trait]
impl ObjectStore for AmazonS3 {
fn as_any(&self) -> &dyn Any {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@tustvold
Copy link
Contributor

tustvold commented Mar 11, 2023

I would like to understand the use-case more before merging this, I'm not a massive fan of breaking encapsulation in this way, it undermines the decorator pattern, and would rather avoid making a breaking change if possible

Edit: turns out trait upcasting is still unstable, the RFC was just accepted. Using direct "inheritance" might have avoided this being a breaking change, as all 'static types automatically implement Any, but I guess this is the only way to support this

@alamb
Copy link
Contributor

alamb commented Mar 11, 2023

I would like to understand the use-case more before merging this,

I wonder if it is something like flushing the cache, or being able to take some different action if the item is not in the cache 🤔

I'm not a massive fan of breaking encapsulation in this way, it undermines the decorator pattern, and would rather avoid making a breaking change if possible

I agree finding some way to avoid as_any might be preferable in most circumstances, but I don't see that as a reason to not add it. If a downstream project wants to structure their code to use as_any I think that is a legitimate choice.

In other words I see no harm in adding this API as it allows significant flexibility even if it is not a preferred pattern for some projects

@yahoNanJing
Copy link
Author

yahoNanJing commented Mar 13, 2023

Thanks @alamb and @tustvold for reviewing this. The main reason for adding as_any is to identify different cache path for the corresponding ObjectStore. For example, in the apache/datafusion-ballista#645, we will introduce a cache layer with a root cache directory. All of the cache data will be under that root directory with a cache path based on the original path in its original ObjectStore. The cache layer will be able to cache data from different ObjectStores. Suppose ObjectStore A has a path p, while ObjectStore B has the same path p. Then how to cache them both becomes a problem if we are not able to identify different ObjectStores.

@tustvold
Copy link
Contributor

tustvold commented Mar 13, 2023

Then how to cache them both becomes a problem if we are not able to identify different ObjectStores.

How does as_any fit into this, the underlying stores don't expose or even know that information? I think you need to use the url that is used to register the store with the ObjectStoreRegistry as a cache prefix?

@yahoNanJing
Copy link
Author

I think you need to use the url that is used to register the store with the ObjectStoreRegistry as a cache prefix?

If the ObjectStore exposes the url, then it would be better so that we can even use a different cache prefix for different ObjectStore instances instead of for different ObjectStore types. However, it's better not to expose the url with secrets, especially for S3 or other stores.

At the first stage, I think the cache prefix based on the ObjectStore types is enough since the as_any method is very common for a trait.

@tustvold
Copy link
Contributor

If the object_store exposes the URL

That is my point, the URL is a datafusion concept, the stores don't know the URL

cache prefix based on store types

How would this work if there are say two different S3 buckets registered?

@alamb
Copy link
Contributor

alamb commented Mar 13, 2023

Here is a PR in DataFusion that is perhaps related: apache/datafusion#5543

That is my point, the URL is a datafusion concept, the stores don't know the URL

So perhaps you are saying the caching should be done at a higher level? Like in ObjectStoreRegistry for example? Is this something you considered @yahoNanJing ?

@yahoNanJing
Copy link
Author

How would this work if there are say two different S3 buckets registered?

I don't know whether it's common for a cluster to access several S3 buckets. However, it's a common case for a cluster to access one S3 bucket, one HDFS cluster, or other different kinds of storages.

@yahoNanJing
Copy link
Author

That is my point, the URL is a datafusion concept, the stores don't know the URL

Thanks for pointing this out. Previously I thought the url is a necessary part of the ObjectStore. Although it's still very awkward to me not to include the url for ObjectStore, I respect your original design.

@yahoNanJing
Copy link
Author

yahoNanJing commented Mar 13, 2023

So perhaps you are saying the caching should be done at a higher level? Like in ObjectStoreRegistry for example? Is this something you considered @yahoNanJing ?

Thanks @alamb. I still think it should be at the ObjectStore level. I'll change my current implementation to adjust to it. Maybe we need to add a trait to extend the ObjectStore with providing get_url() method, since it's a datafusion specific concept.

@tustvold tustvold marked this pull request as draft March 13, 2023 22:17
@alamb
Copy link
Contributor

alamb commented Mar 15, 2023

Is there some reason we should hold off on merging this PR ?

@tustvold
Copy link
Contributor

tustvold commented Mar 15, 2023

I'll change my current implementation to adjust to it. Maybe we need to add a trait to extend the ObjectStore with providing get_url() method, since it's a datafusion specific concept.

I was waiting based on this feedback, I would rather not merge something, especially a breaking change, without a motivating use-case. I had understood apache/datafusion#5543 as removing the need for this

@yahoNanJing
Copy link
Author

yahoNanJing commented Mar 23, 2023

Now that rust supports trait upcasting you might be able to just make ObjectStore: Any

Hi @tustvold, I'm afraid we still need to add the as_any() interface to the trait. Based on apache/datafusion#5543, I factored previous code. However, it doesn't work as expected. See the UT fails in https://github.com/yahoNanJing/arrow-ballista/blob/dev-20230322/ballista/core/src/cache_layer/object_store/mod.rs#L169-L173

Or maybe there are other correct way to downcast an Arc<dyn ObjectStore>.
https://github.com/yahoNanJing/arrow-ballista/blob/dev-20230322/ballista/core/src/cache_layer/object_store/mod.rs#L31-L39

@tustvold
Copy link
Contributor

tustvold commented Mar 23, 2023

However, it doesn't work as expected

I'm having a somewhat hard time understanding the code you've linked but please bear with me:

CachedBasedObjectStoreRegistry:

  • Creates a FileCacheObjectStore combining a normal ObjectStore with a FileCacheLayer,
  • Computes a cache key from the URL
  • Creates a ObjectStoreWithKey to extend the FileCacheObjectStore with this key to wrap this construction

Calling get on this object store then:

  • Calls through ObjectStoreWithKey to FileCacheObjectStore
  • Looks up the value in the cache, passing the inner store as the GetExtra argument
  • This propagates through to FileCacheLoader::load
  • Which then calls CacheMedium::get_mapping_location
  • Which finally calls through to get_key which attempts to downcast the provided ObjectStore to ObjectStoreWithKey
  • This will always fail as the ObjectStoreWithKey is decorating the FileCacheObjectStore not the other way round

A simple fix might be to swap the decoration order, but taking a step back, the cache key is known at the time FileCacheObjectStore is constructed by CachedBasedObjectStoreRegistry. I don't see a reason to not just store this key inside the FileCacheObjectStore and pass it along with the inner ObjectStore as the GetExtra argument to the cache. For example, by storing ObjectStoreWithKey inside FileCacheObjectStore instead of Arc<dyn ObjectStore>, and passing this as the GetExtra argument. This would not only be simpler (and work) but would also avoid any need for downcasting?

FWIW this highlights my major reservation about as_any(), it makes for very hard to follow code, there is a reason Rust discourages its usage by not having first-class dynamic casting support, the friction is a feature 😄. I think if downstreams want to use this pattern they can derive a trait object with as_any and as_object_store, but I would prefer not to "bless" this as a first-party pattern without a really compelling use-case

@yahoNanJing
Copy link
Author

yahoNanJing commented Mar 23, 2023

Thanks @tustvold for reading the proto branch for cache layer. The process details you described above are exactly correct.

However, it doesn't work as expected

I'm sorry I didn't explain it very clearly. Actually the process works well. And it's totally transparent to the Datafusion. The only unexpected thing is the function get_key in https://github.com/yahoNanJing/arrow-ballista/blob/dev-20230322/ballista/core/src/cache_layer/object_store/mod.rs#L31-L39. It seems it will always return "none". The unit test https://github.com/yahoNanJing/arrow-ballista/blob/dev-20230322/ballista/core/src/cache_layer/object_store/mod.rs#L169-L173 verifies that.

@yahoNanJing
Copy link
Author

Calling get on this object store then:
Calls through ObjectStoreWithKey to FileCacheObjectStore
Looks up the value in the cache, passing the inner store as the GetExtra argument
This propagates through to FileCacheLoader::load
Which then calls CacheMedium::get_mapping_location
Which finally calls through to get_key which attempts to downcast the provided ObjectStore to ObjectStoreWithKey
This will always fail as the ObjectStoreWithKey is decorating the FileCacheObjectStore not the other way round

Thanks @tustvold for providing this genius idea. It works now without introducing as_any to the ObjectStore. Related commit is here. yahoNanJing/arrow-ballista@84a98d7

@tustvold
Copy link
Contributor

I am going to close this for now, feel free to reopen should the need arise

@tustvold tustvold closed this Mar 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API object-store Object Store Interface
Projects
None yet
4 participants