
[Proposal] Support new and different segment types #2965

Closed

cheddar opened this issue May 13, 2016 · 12 comments

Comments

@cheddar
Contributor

cheddar commented May 13, 2016

Currently, Druid has its own persistence format, which was designed to handle structured data in the form of dimensions and metrics. It would be nice to expand Druid to handle different structures, persistence formats, and even new functionality more straightforwardly, all as extensions to core. This proposal tries to move us in that direction.

There are three basic areas that require attention in order to do this:

  1. Ingestion
  2. Hand-off
  3. Querying

** Ingestion

The ingestion side of this is already handled by the recent Appenderator changes, so I do not believe any noteworthy changes would be required in the immediate future. That said, not all ingestion mechanisms leverage Appenderators yet, so in order to get the capabilities enabled by this proposal, there will be work needed to implement all mechanisms of ingestion in terms of Appenderator objects.

** Hand-off

Hand-off comprises two things: persisting on the ingestion side and deserializing on the historical (read) side. Appenderators already own the persisting process on the ingestion side, so once again, I do not believe any noteworthy changes are required for that half of the story.

On the deserialization end of the spectrum, though, there will be changes required to enable these formats to be added as extensions. Namely, the current SegmentLoader interface is implemented by SegmentLoaderLocalCacheManager in a two-step algorithm

public Segment getSegment(DataSegment segment) throws SegmentLoadingException
{
  File segmentFiles = getSegmentFiles(segment); // step 1: download and unzip the segment files
  return new QueryableIndexSegment(segment.getIdentifier(), factory.factorize(segmentFiles)); // step 2: deserialize them into a Segment
}

We need to change step 2 of the algorithm to be something that can be extended, which means a Guice or Jackson touch-point. I propose that we make it a Jackson touch-point. Specifically, we should add a file to the zip that is a JSON descriptor for the factory that should be used to deserialize the files, essentially resulting in this implementation instead:

public Segment getSegment(DataSegment segment) throws SegmentLoadingException
{
  File segmentFiles = getSegmentFiles(segment); // step 1 is unchanged

  // Step 2: default to the legacy factory so existing segments load unchanged,
  // but defer to a factory.json descriptor when one is present in the zip.
  SegmentFactory factory = legacyFactory;
  File factoryJson = new File(segmentFiles, "factory.json");
  if (factoryJson.exists()) {
    factory = objectMapper.readValue(factoryJson, SegmentFactory.class);
  }

  return factory.factorize(segment, segmentFiles);
}

Where the interface for SegmentFactory is

public interface SegmentFactory {
  Segment factorize(DataSegment segment, File segmentFiles) throws SegmentLoadingException;
}

And legacyFactory is an implementation of SegmentFactory that is just

public Segment factorize(DataSegment segment, File segmentFiles) throws SegmentLoadingException {
  // 'factory' here is the existing QueryableIndex factory used in step 2 today.
  return new QueryableIndexSegment(segment.getIdentifier(), factory.factorize(segmentFiles));
}

** Querying

Different segment persistence types can expose and enable new and different functionality in varying ways. We want to enable queries that can take advantage of the specific benefits of any given persistence type while also providing methods to connect into functionality already implemented. We can enable this with a relatively simple interface change. Currently, the Segment interface is

public interface Segment extends Closeable
{
  String getIdentifier();
  Interval getDataInterval();
  QueryableIndex asQueryableIndex();
  StorageAdapter asStorageAdapter();
}

I.e. it has two methods that allow you to "convert" the Segment into an object with specific semantics that queries know how to deal with:

  QueryableIndex asQueryableIndex();
  StorageAdapter asStorageAdapter();

I propose that we change the interface to be

public interface Segment extends Closeable
{
  String getIdentifier();
  Interval getDataInterval();
  <T> T as(Class<T> clazz);
}

This essentially means that all of the places that currently call asQueryableIndex() would be able to call as(QueryableIndex.class) instead. This interface does potentially have external touch-points: if anybody has implemented their own Query, they might be calling either asQueryableIndex() or asStorageAdapter() already. So, there is a decision to be made on whether we make the change backwards compatible from the beginning and then ultimately remove the two superfluous methods later, or whether we make the backwards-incompatible change now.

I think I'm a fan of making the backwards-incompatible change now, because that should limit how many people are adversely affected by it. If we make it later, then everyone who has implemented their own storage format must update their implementation to remove the methods.

Then again, if all we are doing is removing methods from an interface in the future, I think that would be a compile-time incompatibility, but not a runtime incompatibility. I.e. after we remove the methods, I think it's still possible for an implementation that was compiled assuming those methods exist to continue to work. But, I'm not sure about that. If that's the case, then the backwards-incompatible removal of the methods from the interface is actually a relatively low-risk change and only potentially adversely affects people who have implemented their own queries in terms of the removed methods.
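
For reference, the backwards-compatible variant might look like this (a minimal sketch, assuming Java 8 default methods; the delegation shown here is my own, not agreed text):

public interface Segment extends Closeable
{
  String getIdentifier();
  Interval getDataInterval();
  <T> T as(Class<T> clazz);

  // Transitional defaults, to be removed once all callers migrate to as().
  default QueryableIndex asQueryableIndex() { return as(QueryableIndex.class); }
  default StorageAdapter asStorageAdapter() { return as(StorageAdapter.class); }
}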

I believe that these changes are sufficient to enable extensions to create their own persistence formats and even expose their own interfaces for querying, if required to access some specific properties of the new persistent format. This could be leveraged for many different uses, from building connectors to ORC or Parquet-based data to trying something completely new and different while still leveraging Druid's ingestion, segment-management and query-routing functionality.

In general, this will also improve the staying power of the Druid system, as it will enable us to switch persistence formats when and if we determine that the format implemented in the way-back-when is not keeping up with the current needs and trends of the space.

@fjy
Contributor

fjy commented May 13, 2016

👍

See #2913 for @javasoze's & @gianm's start on this work

@drcrallen
Contributor

Overall I think the approach is headed in a good direction. (Thinking out loud below.)

I'm a bit cautious about the as(Class<T> clazz) usage in Segment. Somewhere there needs to be a wiring-up from the data in storage to the proper objects for extracting that data. DataSegment seems to be the current and proposed descriptor for what data exists, and QueryableIndex and StorageAdapter the means for expressing the data.

QueryableIndex focuses on making Columns, and StorageAdapter focuses on making row cursors.

If I'm reading this correctly, the as method effectively finds a factory for the class requested, and calls the factory to create the data expression object (QueryableIndex or StorageAdapter).

The getSegment call effectively allows ServerManager and others to do some preliminary work on segments by wiring up the mmap stuff, and possibly doing some early validation if needed.

What I'm cautious about is that having as(Class<T> clazz) does not free up expressions of data in the way that it appears to. This change would effectively make the storage mechanism pluggable (as per the issue title), but doesn't allow for a different range of data expressions compared to what exists now. Specifically, if a target for as(Class<T> clazz) does not appear in the same classloader as io.druid.cli.Main, or in the same classloader as the extension with the Segment storage impl, then it cannot be used.

As such, I suggest leaving the two

  QueryableIndex asQueryableIndex();
  StorageAdapter asStorageAdapter();

for now, until use cases for more dynamic data expression are more fully fleshed out.

@cheddar
Contributor Author

cheddar commented May 14, 2016

So, Charles, you are for adding the method and not taking away the other methods, correct?


@drcrallen
Contributor

@cheddar I'm proposing not to add a <T> T as(Class<T> clazz); method as part of the scope of this PR.

If the as method had a javadoc contract stating that nothing other than QueryableIndex and/or StorageAdapter is supported at this time, then I'm OK with the proposal. But if you are intending for as to be extendable beyond the two classes mentioned, then I think we need to hash out some ideas on how to properly make the data expression side of things more pluggable in a world where the data storage is ALSO pluggable.

@gianm
Contributor

gianm commented May 16, 2016

@drcrallen The intent of as(Class<T>) is to allow extensions that include a new storage engine to also offer features that have no direct analog in StorageAdapter, but that take advantage of special features of the engine. The intended consumer of those features would be queries that are also part of that extension, and possibly queries that are in other extensions that depend on the main storage engine extension.

This seems to me like a fair way of letting people experiment with using features provided by non-default engines. If those experiments go really well, and some features end up being things we want to generalize across engines, we could do that by either introducing a new interface in core Druid, or by introducing an extension that only has that interface, which other storage engine / query extensions could depend on.
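
As a concrete illustration of that pattern, a sketch (every name here is hypothetical, and it assumes as() returns null when a segment cannot be viewed as the requested class):

// Hypothetical capability interface, shipped in the same extension as the
// storage engine that knows how to provide it.
public interface GeoIndex
{
  boolean intersects(float lat, float lon, float radius);
}

// In a query engine from that same extension:
GeoIndex geoIndex = segment.as(GeoIndex.class);
if (geoIndex == null) {
  throw new ISE("Segment[%s] does not support GeoIndex", segment.getIdentifier());
}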

Does that sound fair?

@drcrallen
Contributor

OK, if the storage class, data expression class, and query class are all in the same extension, then that should be fine, but it would be good if the resulting javadoc for the as method clarified this restriction.

@drcrallen
Contributor

Correction: in the same extension, OR used by default Druid queries via default Druid classes.
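
For instance, the javadoc contract might read something like this (wording invented here, and it assumes a null return for unsupported classes):

/**
 * Returns a view of this segment as the given class, or null if this
 * segment cannot be expressed as that class.
 *
 * Callers should only request QueryableIndex, StorageAdapter, or classes
 * provided by the same extension as the Segment implementation (or by
 * default Druid classes); other classes may not be visible to the
 * classloader that loaded the Segment.
 */
<T> T as(Class<T> clazz);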

@javasoze
Contributor

I see Druid as two very abstract parts:

  1. A distributed system that handles realtime ingestion into, and management of, realtime segments.
  2. A storage and querying engine for realtime event data.

With the current implementation, the power of 1) cannot be fully leveraged, since it can ONLY understand segments created by 2), which I find limiting.

This proposal separates 1) and 2) into clean, isolated pieces that can be iterated on separately, as a result making Druid more extensible and powerful.

@binlijin
Contributor

👍

@drcrallen
Contributor

To be clear, 👍 on the proposal, but the wording of the javadoc contract for as needs to be very clear about what the acceptable parameters are.

@stale

stale bot commented Jun 21, 2019

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Jun 21, 2019
@stale

stale bot commented Jul 5, 2019

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

@stale stale bot closed this as completed Jul 5, 2019