Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The returned path value of get_by_uri should be self-described with entire path #1779

Merged
merged 3 commits into from
Feb 14, 2022
Merged

The returned path value of get_by_uri should be self-described with entire path #1779

merged 3 commits into from
Feb 14, 2022

Conversation

yahoNanJing
Copy link
Contributor

Which issue does this PR close?

Closes #1778.

What changes are included in this PR?

There are two changes in this PR.

  • The key of object store considers the host and port rather than just scheme.
  • The returned path value of get_by_uri is self-described with information of scheme, host, port and path in the object store.

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Feb 8, 2022
@@ -39,6 +39,11 @@ pub struct LocalFileSystem;
#[async_trait]
impl ObjectStore for LocalFileSystem {
async fn list_file(&self, prefix: &str) -> Result<FileMetaStream> {
let prefix = if let Some((_scheme, path)) = prefix.split_once("://") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if objecstore instances are already sharded by host and port based on changes in get_by_uri, shouldn't we only accept path of the uri in this method? In other words, I think we can safely assume for a given object store instance, all requests should be referencing the same port and host right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For hdfs, there's a HA mechanism to provide a kind of vip host name. The hostname is not a name for a real host, by which ping will not work. However, the dependent object store client is able to recognize that name to direct requests to the real host.

I think this part should be the capability of remote object store, either by the way of HDFS or providing a vip service.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Shouldn't this vip host:port be managed as an attribute in the hdfs object store instance? for example, when we create the object store instance and registry it in the registry, we specify the vip, then we only need to provide the object path in list and get object store calls. Internally when the object store handles the list operation, it will perform the pass in the vip host to the hdfs client. Basically my thinking on this is if we have an one to one mapping between an objectstore instance and vip, then we shouldn't need to pass in the vip as part of the object key in these method calls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For whether we need to include the vip info in the path, I prefer to include it especially when running with Ballista for distributed execution. Then we will be able to do self-registration or self-detection based on the uri without transfer the object store. I also mentioned this in #1702.

…f-described for which object store to be used
@houqp
Copy link
Member

houqp commented Feb 8, 2022

@yahoNanJing looks like there are some test failures that need to be addressed as well. Changes in get_by_uri looks good to me.

@yahoNanJing
Copy link
Contributor Author

@yahoNanJing looks like there are some test failures that need to be addressed as well. Changes in get_by_uri looks good to me.

It's Windows issue. Let me fix it.

/// - URI with scheme `s3://` will return the S3 store if it's registered
/// Returns a tuple with the store and the path of the file in that store
/// (URI=scheme://path).
/// - URI with scheme `s3://host:port` will return the S3 store if it's registered
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

URI schemes also allow for username and password

e.g. s3://user:pass@host:port

https://en.wikipedia.org/wiki/Uniform_Resource_Identifier

))
})?;
Ok((store, path))
// We do not support the remote object store on Windows OS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would personally prefer to keep any uri parsing in the datafusion crate as simple as possible and leave more sophisticated uri interpretation to the actual ObjectStoreInstance.

Among other things this makes it easier to implement arbitrary ObjectStoreInstance

For the usecase mentioned in #1778 I wonder if you could write a wrapping object store like:

struct HostAwareHDFS {

}

impl ObjectStore for HostAwareHDFS {
    async fn list_file(&self, prefix: &str) -> Result<FileMetaStream> {
      /// form valid hdfs url:
      let hdfs_url = format!("hdfs://{}", prefix);
      match Url::parse(&hdfs_url) {
        Ok(url) => { 
           self.get_object_store_for_host(url.host()).list_file(prefix)?
           ...
       Err(..) ...
    }
...
}

In other words, push the interpretation of urls into the ObjectStore

If this won't work, I would suggest passing the entire parsed url into store.get() rather than some synthetic key that is datafusion specific

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb, I totally agree with you, since the uri components may differ for different object store instance. That's why I prefer to keep the whole uri info in the return value. However, here this PR mainly focus on the key of hash map for object stores. The scheme info is not enough. In general, a remote object store can be identified by scheme://host:port

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the distinction I am trying to draw is that the current Object Store API is mapped by scheme and it would be up to the object store implementation to figure out how to handle host/port information

So rather than having one HDFSObjectStore instance for server1:8000 and a second HDFSObjectStore instance for server2:8290, there would be a single HDFSObjectStore that would need to know how to dispatch appropriately to the different server hosts / ports

The same basic pattern holds for file systems (for example, there is a single LocalFileSyetem instance even though the local file system might have different disks mounted to /data and /data2).

I think also it would hold for S3 and other types of object stores (where depending on the region you need to request to a different endpoint)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should point out that my reason for disliking adding host and port to the object store key is that it doesn't make sense for many types of object stores (such as LocalFileSystem or S3 which have no notion of host/port). It seems like this change is too HDFS specific and can be accomplished in a different way

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alamb. Finally I get your point. To use the object store, there are three hierarchies: ObjectStoreRegistry -> ObjectStore -> ObjectStoreInstance. The ObjectStore is just for managing one kind of object store rather than being the instance. Previously I misunderstood it as a real instance.

One more question, for some object store, they support many kinds of schemes. For example, HDFS support:

  • file://
  • hdfs://
  • viewfs://

I still think the returned path should include the original scheme info rather than throw it away. Then for the object store instance, it will be able to deal with the path for different kinds of schemes. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think the returned path should include the original scheme info rather than throw it away. Then for the object store instance, it will be able to deal with the path for different kinds of schemes. What do you think?

I think returning the entire path, rather than stripping away the scheme makes a lot of sense 👍

@yahoNanJing yahoNanJing changed the title Improve object store key with considering host and port in ObjectStoreRegistry The returned path value of get_by_uri should be self-described with entire path Feb 14, 2022
@yahoNanJing
Copy link
Contributor Author

Hi @alamb and @houqp, could you help recheck this PR? Now it's only for changing the returned path value of get_by_uri.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

cc @matthewmturner @seddonm1 @yjshen

@alamb
Copy link
Contributor

alamb commented Feb 14, 2022

Thank you @yahoNanJing -- looks much better to me now

@seddonm1
Copy link
Contributor

Thank you for the heads up @alamb. I had identified this issue last week but didn't have time to dig deeper.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve object store key with considering host and port in ObjectStoreRegistry
5 participants