-
Notifications
You must be signed in to change notification settings - Fork 155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How to execute a substrait plan? #485
Comments
I found that in duckdb's substrait extension, the substrait plan is converted to the parse tree (i.e., the input of the logical planner) of duckdb. |
How to convert the substrait plan to the query plan of a specific query
engine?
Every query engine is going to have their own API for doing this. You will
need to look at the docs for each query engine. Substrait only specifies
the format of the query itself and not the APIs that are used to pass
queries around. There are some other projects that are attempting to
standardize this. For example, if you are starting with Spark you might
consider Gluten[1], if you are starting with SQL you might look into
ADBC[2], and if you are starting with python expressions you could consider
ibis[3]. All of these projects (I believe) support multiple backends and
can use Substrait to communicate with these backends.
For example, should I map a substrait plan to a very early-stage logical
plan that is to be consumed by the planner and optimizer?
Different engines have taken different approaches here. Substrait aims to
be able to represent all of the various stages. For example, DuckDB and
Datafusion both have their own optimizers. So they want to receive very
logical plans. Acero and Velox would rather consume more physical plans.
For example, there are people interested in using Velox with Spark and the
plan is to go from Spark logical plan to a Spark optimized plan and then to
Velox. I think the hope is that there might someday be a general purpose
"substrait to substrait" optimizer but I don't know if anyone is working on
anything like that at the moment.
Currently, in Substrait, the logical operators are the best defined and
farthest along. So most consumers are capable of using these fairly
logical operators and some might just reject plans that are a little too
logical (e.g. that contain unflattened correlated subqueries).
How can we do this if the query engine is not open-sourced or if the
query engine does not have a well-defined form of intermediate query plans?
If a query engine is not open source and does not consume Substrait today
then the best you can do is convert from Substrait to whatever format the
engine accepts (e.g. SQL). No one has created a Substrait -> SQL
conversion tool yet but it seems like something that will eventually need
to exist.
Are there any existing integrations (converters) from substrait to the
query engines such as Spark and Trino?
Take a look at the Gluten project[1]. However, I believe their goal is to
go from Spark -> Substrait. I don't know if they are working on Substrait
-> Spark. You might also be interested in this PR[4] and some of the
discussion around it.
[1] https://github.com/oap-project/gluten
[2] https://arrow.apache.org/docs/format/ADBC.html
[3] https://github.com/ibis-project/ibis-substrait
[4] substrait-io/substrait-java#90
…On Sun, Apr 2, 2023 at 2:54 AM Haoqiong Bian ***@***.***> wrote:
I found that in duckdb's substrait extension, the substrait plan is
converted to the parse tree (i.e., the input of the logical planner) of
duckdb.
—
Reply to this email directly, view it on GitHub
<#485 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAM6CXPP7FBO4KZLJMU23L3W7FEFNANCNFSM6AAAAAAWQDTUME>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
Thank you for the detailed and clear reply! |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Hi,
After reading the documents on the official website, I am still confused about:
How to convert the substrait plan to the query plan of a specific query engine? I understand that substrait does not do any query optimization; it is only a cross-query-engine plan format (correct me if I am wrong). But is there any guideline or official examples for converting the substrait plan to the query plan in a query engine? For example, should I map a substrait plan to a very early-stage logical plan that is to be consumed by the planner and optimizer (so that we can benefit from the query optimization in the query engine) or the physical plan that is executed by the executors (this is mentioned in some talks)? How can we do this if the query engine is not open-sourced or if the query engine does not have a well-defined form of intermediate query plans?
Are there any existing integrations (converters) from substrait to the query engines such as Spark and Trino? The official website uses Spark and Trino as example use cases, so I suppose that Spark and Trino integrations are already implemented and tested. I have found the duckdb integration, but I did not find any official or third-party open-source integrations for Trino or Spark. Should I implement these integrations by myself?
Thanks a lot!
The text was updated successfully, but these errors were encountered: