How to execute a substrait plan? #485

bianhq · 2023-04-02T06:44:13Z

Hi,
After reading the documents on the official website, I am still confused about:

How to convert the substrait plan to the query plan of a specific query engine? I understand that substrait does not do any query optimization; it is only a cross-query-engine plan format (correct me if I am wrong). But is there any guideline or official examples for converting the substrait plan to the query plan in a query engine? For example, should I map a substrait plan to a very early-stage logical plan that is to be consumed by the planner and optimizer (so that we can benefit from the query optimization in the query engine) or the physical plan that is executed by the executors (this is mentioned in some talks)? How can we do this if the query engine is not open-sourced or if the query engine does not have a well-defined form of intermediate query plans?
Are there any existing integrations (converters) from substrait to the query engines such as Spark and Trino? The official website uses Spark and Trino as example use cases, so I suppose that Spark and Trino integrations are already implemented and tested. I have found the duckdb integration, but I did not find any official or third-party open-source integrations for Trino or Spark. Should I implement these integrations by myself?

Thanks a lot!

bianhq · 2023-04-02T09:54:19Z

I found that in duckdb's substrait extension, the substrait plan is converted to the parse tree (i.e., the input of the logical planner) of duckdb.

westonpace · 2023-04-03T14:30:40Z

How to convert the substrait plan to the query plan of a specific query

engine? Every query engine is going to have their own API for doing this. You will need to look at the docs for each query engine. Substrait only specifies the format of the query itself and not the APIs that are used to pass queries around. There are some other projects that are attempting to standardize this. For example, if you are starting with Spark you might consider Gluten[1], if you are starting with SQL you might look into ADBC[2], and if you are starting with python expressions you could consider ibis[3]. All of these projects (I believe) support multiple backends and can use Substrait to communicate with these backends.

For example, should I map a substrait plan to a very early-stage logical

plan that is to be consumed by the planner and optimizer? Different engines have taken different approaches here. Substrait aims to be able to represent all of the various stages. For example, DuckDB and Datafusion both have their own optimizers. So they want to receive very logical plans. Acero and Velox would rather consume more physical plans. For example, there are people interested in using Velox with Spark and the plan is to go from Spark logical plan to a Spark optimized plan and then to Velox. I think the hope is that there might someday be a general purpose "substrait to substrait" optimizer but I don't know if anyone is working on anything like that at the moment. Currently, in Substrait, the logical operators are the best defined and farthest along. So most consumers are capable of using these fairly logical operators and some might just reject plans that are a little too logical (e.g. that contain unflattened correlated subqueries).

How can we do this if the query engine is not open-sourced or if the

query engine does not have a well-defined form of intermediate query plans? If a query engine is not open source and does not consume Substrait today then the best you can do is convert from Substrait to whatever format the engine accepts (e.g. SQL). No one has created a Substrait -> SQL conversion tool yet but it seems like something that will eventually need to exist.

Are there any existing integrations (converters) from substrait to the

query engines such as Spark and Trino? Take a look at the Gluten project[1]. However, I believe their goal is to go from Spark -> Substrait. I don't know if they are working on Substrait -> Spark. You might also be interested in this PR[4] and some of the discussion around it. [1] https://github.com/oap-project/gluten [2] https://arrow.apache.org/docs/format/ADBC.html [3] https://github.com/ibis-project/ibis-substrait [4] substrait-io/substrait-java#90

…

On Sun, Apr 2, 2023 at 2:54 AM Haoqiong Bian ***@***.***> wrote: I found that in duckdb's substrait extension, the substrait plan is converted to the parse tree (i.e., the input of the logical planner) of duckdb. — Reply to this email directly, view it on GitHub <#485 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAM6CXPP7FBO4KZLJMU23L3W7FEFNANCNFSM6AAAAAAWQDTUME> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

bianhq · 2023-04-03T16:08:41Z

Thank you for the detailed and clear reply!
It would be nice if there were more third-party or official integrations for popular query engines such as Spark and Trino.

bianhq closed this as completed Apr 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to execute a substrait plan? #485

How to execute a substrait plan? #485

bianhq commented Apr 2, 2023 •

edited

Loading

bianhq commented Apr 2, 2023

westonpace commented Apr 3, 2023 via email

bianhq commented Apr 3, 2023

How to execute a substrait plan? #485

How to execute a substrait plan? #485

Comments

bianhq commented Apr 2, 2023 • edited Loading

bianhq commented Apr 2, 2023

westonpace commented Apr 3, 2023 via email

bianhq commented Apr 3, 2023

bianhq commented Apr 2, 2023 •

edited

Loading