Re: Migrating Beam SQL to Calcite's code generation


According to the reply in the Calcite JIRA, there might be some other way to implement SESSION_END. I haven't looked into it though.

-Rui  

On Thu, Nov 15, 2018 at 11:56 AM Mingmin Xu <mingmxus@xxxxxxxxx> wrote:
1. Window start/end: Actually this is already provided in other ways and the window in the SQL environment is unused and just waiting to be deleted. So you can still access TUMBLE_START, etc. This is well-defined as a part of the row so there's no semantic problem, but I think it should already work.
MM: The others work; only SESSION_END() does not.

2. Pane information: I don't think access to pane info is enough for correct results for a SQL join that triggers more than once. The pane info is part of a Beam element, but these records just represent a kind of changelog of the aggregation/join. The general solution is retractions. Until we finish that, you need to follow the Join/CoGBK with custom logic, often a stateful DoFn, to get the join results right. For example, if both inputs are append-only relations and it is an equijoin, then you can do this with a dedupe when you unpack the CoGbkResult. I am guessing this is the main use case for BEAM-5204. Is it your use case?
MM: My case is a SQL-only self-join, written as [DISCARD_Pane JOIN ACCU_Pane].
These UDFs are not a blocker; IMO the limitation in BEAM-5204 should simply be removed. With multiple triggers assigned, developers need to handle the output themselves, which is not complex with the Java SDK but very hard in SQL-only cases.


On Thu, Nov 15, 2018 at 10:54 AM Kenneth Knowles <kenn@xxxxxxxxxx> wrote:
From https://issues.apache.org/jira/browse/BEAM-5204 it seems like what you most care about is to have joins that trigger more than once per window. To accomplish it you hope to build an "escape hatch" from SQL/relational semantics to specialized Beam SQL semantics. It could make sense with extreme care.

Separating the two parts:

1. Window start/end: Actually this is already provided in other ways and the window in the SQL environment is unused and just waiting to be deleted. So you can still access TUMBLE_START, etc. This is well-defined as a part of the row so there's no semantic problem, but I think it should already work.
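
To make the "part of the row" point concrete: for tumbling (fixed) windows the bounds follow from the row's event timestamp and the window size alone. Below is a minimal sketch in plain Java, assuming windows aligned to the epoch with zero offset and timestamps in milliseconds. The class and method names are hypothetical and are not Beam or Calcite API.

```java
/** Hypothetical helper: derives tumbling-window bounds from a row timestamp. */
final class TumbleBounds {
  private TumbleBounds() {}

  /** Inclusive window start: the timestamp rounded down to a multiple of the window size. */
  static long start(long tsMillis, long sizeMillis) {
    // floorMod keeps the result correct for timestamps before the epoch.
    return tsMillis - Math.floorMod(tsMillis, sizeMillis);
  }

  /** Exclusive window end: one window size after the start. */
  static long end(long tsMillis, long sizeMillis) {
    return start(tsMillis, sizeMillis) + sizeMillis;
  }

  public static void main(String[] args) {
    long ts = 125_000L;   // 2 min 5 s after the epoch
    long size = 60_000L;  // one-minute windows
    System.out.println(start(ts, size) + " .. " + end(ts, size));
  }
}
```

Session windows do not reduce to this kind of per-row arithmetic, because their bounds depend on which other rows end up in the same session.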

2. Pane information: I don't think access to pane info is enough for correct results for a SQL join that triggers more than once. The pane info is part of a Beam element, but these records just represent a kind of changelog of the aggregation/join. The general solution is retractions. Until we finish that, you need to follow the Join/CoGBK with custom logic, often a stateful DoFn, to get the join results right. For example, if both inputs are append-only relations and it is an equijoin, then you can do this with a dedupe when you unpack the CoGbkResult. I am guessing this is the main use case for BEAM-5204. Is it your use case?
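
The dedupe idea can be sketched without any Beam types. The sketch below assumes accumulating firings, so each firing for a key sees every left and right row so far, and it assumes rows are distinct, since value-based dedupe would otherwise drop legitimate duplicates. `JoinDeduper` and `onFiring` are hypothetical names; in a real pipeline the `emitted` set would presumably live in per-key state of a stateful DoFn.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical per-key dedupe step placed after a join that fires more than once. */
final class JoinDeduper {
  // Pairs already emitted for this key across earlier firings.
  private final Set<List<String>> emitted = new HashSet<>();

  /** Takes the full left/right groups of one firing and returns only the pairs not emitted before. */
  List<List<String>> onFiring(List<String> left, List<String> right) {
    List<List<String>> fresh = new ArrayList<>();
    for (String l : left) {
      for (String r : right) {
        List<String> pair = List.of(l, r);
        // Set.add returns false when the pair was already emitted by an earlier firing.
        if (emitted.add(pair)) {
          fresh.add(pair);
        }
      }
    }
    return fresh;
  }

  public static void main(String[] args) {
    JoinDeduper deduper = new JoinDeduper();
    // First firing: one row on each side.
    System.out.println(deduper.onFiring(List.of("a1"), List.of("b1")));
    // Second firing: the left side has grown; only the new pair is returned.
    System.out.println(deduper.onFiring(List.of("a1", "a2"), List.of("b1")));
  }
}
```

This only covers the append-only equijoin case; anything that needs to withdraw an earlier result still requires retractions.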

Kenn

On Thu, Nov 15, 2018 at 10:08 AM Mingmin Xu <mingmxus@xxxxxxxxx> wrote:
Raising this thread again.
It seems there are more changes to how a FUNCTION is executed in the backend, as noticed in #6996:
1. BeamSqlExpression and BeamSqlExpressionExecutor are removed;
2. BeamSqlExpressionEnvironment is removed;

Then,
1. for Calcite-defined FUNCTIONS, it uses Calcite-generated code (which is great; duplicating that work is pointless);
2. there is no way to access the Beam context now;

For #2, I think we need to find a way to expose it; at least our UDF/UDAF should be able to access it to leverage the advantages of the Beam module.

Any comments?


On Wed, Sep 19, 2018 at 2:55 PM Rui Wang <ruwang@xxxxxxxxxx> wrote:
This is such an exciting change!

Since we cannot mix the current implementation with Calcite code generation, is there any case that our current implementation supports but Calcite code generation does not, so that switching would have some impact on existing usage?

-Rui

On Wed, Sep 19, 2018 at 11:53 AM Andrew Pilloud <apilloud@xxxxxxxxxx> wrote:
To follow up on this, the PR is now in a reviewable state and I've added more tests for FLOOR and CEIL. Both work with a more extensive set of arguments after this change. There are now 4 outstanding Calcite PRs that get all the tests passing.

Unfortunately there is no easy way to mix our current implementation with Calcite's code generator.

Andrew

On Mon, Sep 17, 2018 at 3:22 PM Mingmin Xu <mingmxus@xxxxxxxxx> wrote:
Awesome work, we should call Calcite operator functions if available.

I haven't had time to read the PR yet; for the operators that are impacted, I would keep the existing implementation. One example: I noticed recently that FLOOR/CEIL only supports months/years, which was quite a surprise to me.

Mingmin

On Mon, Sep 17, 2018 at 3:03 PM Anton Kedin <kedin@xxxxxxxxxx> wrote:
This is pretty amazing! Thank you for doing this!

Regards,
Anton

On Mon, Sep 17, 2018 at 2:27 PM Andrew Pilloud <apilloud@xxxxxxxxxx> wrote:
I've adapted Calcite's EnumerableCalc code generation to generate the BeamCalc DoFn. The primary purpose behind this change is so we can take advantage of Calcite's extensive SQL operator implementation. This deletes ~11000 lines of code from Beam (with ~350 added), significantly increases the set of supported SQL operators, and improves performance and correctness of currently supported operators. Here is my work in progress: https://github.com/apache/beam/pull/6417

There are a few bugs in Calcite that this has exposed:

Fixed in Calcite master:
  • CALCITE-2321 - The type of a union of CHAR columns of different lengths should be VARCHAR
  • CALCITE-2447 - Some POWER, ATAN2 functions fail with NoSuchMethodException
Pending PRs:
  • CALCITE-2529 - linq4j should promote integer to floating point when generating function calls
  • CALCITE-2530 - TRIM function does not throw exception when the length of trim character is not 1(one)
More work:
  • CALCITE-2404 - Accessing structured-types is not implemented by the runtime
  • (none yet) - Support multi character TRIM extension in Calcite
I would like to push these changes in with these minor regressions. Do any of these Calcite bugs block this functionality being added to Beam?

Andrew


--
----
Mingmin