osdir.com


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [DISCUSS] Enhancing the functionality and productivity of Table API


Hi, Jiangjie,
Thanks a lot for your feedback. And also thanks for our offline discussion!
Yes, your right! The Row-based APIs which you mentioned are very friendly
to flink user!
In order to follow the concept of the traditional database, perhaps we
named the corresponding function RowValued/TabeValued function will be more
appropriate, then from the perspective of return value in TableAPI we have
three type functions:

   - ColumnValuedFunction - ScalarFunction & AggregateFunction, and the
   result is a column.
   - RowValuedFunction - MapFunction which I'll proposal is
   RowValuedFunciton, and result is a single row.
   - TableValuedFunction - FlatMapFunction which I'll proposal is
   TableValuedFunciton, and result is a table.

The detail will be described in following FLIP/design doc.
About the input type I think we can support both column parameters and row
parameters. but I think the meaning you want to express should be
consistent with me, we are on the same page, right?

And thanks you like the proposal, I hope that we can work together to
advance the work!

Best,
Jincheng

Becket Qin <becket.qin@xxxxxxxxx> 于2018年11月2日周五 上午1:25写道:

> Thanks for the proposal, Jincheng.
>
> This makes a lot of sense. As a programming interface, Table API is
> especially attractive because it supports both batch and stream. However,
> the relational-only API often forces users to shoehorn their logic into a
> bunch of user defined functions. Introducing some more flexible API (e.g.
> row-based APIs) to process records would really help here.
>
> Besides the processing API, another useful improvement would be allowing
> batch tables and stream tables to run in the same job, which is actually a
> quite common scenario.
>
> As you said, there are a lot of work could be done here. I am looking
> forward to the upcoming FLIPs.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Fri, Nov 2, 2018 at 12:10 AM jincheng sun <sunjincheng121@xxxxxxxxx>
> wrote:
>
> > Hi, Timo,
> > I am very grateful for your feedback, and I am very excited when I hear
> > that you also consider adding a process function to the TableAPI.
> >
> > I agree that add support for the Process Function on the Table API, which
> > is actually part of my proposal Enhancing the functionality of Table API.
> > In fact, supporting the ProcessFunction means supporting the user-defined
> > Operator. As you said, A ProcessFunction can implement any logic,
> including
> > the user-defined window, which leaves the user with enough freedom and
> > control. At the same time, Co-PrecessFunction needs to be supported, so
> we
> > can implement the logic of User-Defined JOIN through Co-PrecessFunciton.
> Of
> > course, Co-PrecessFunciton also needs to introduce the concept of
> Connect,
> > and will introduce a new ConnectedTable type on TableAPI.  And I also
> think
> > TableAPI also for more event-driven applications.
> >
> > About processFunction In addition to the timer function, it should be
> > completely equivalent to flatmapFunction, so maybe we can support map and
> > flatmap in Table, support processFunction in GroupedTable, because for
> the
> > reason of State, the Timer of ProcessFunction can only Apply to
> > KeyedStream.
> >
> > You are right, ANSI-SQL is difficult to express complex operator logic
> such
> > as ProcessFunction, so once we decide to make these enhancements on the
> > TableAPI, it means that the Flink SQL only includes ANSI-SQL operations,
> > and the TableAPI' operations is SQL super set. This means that the Flink
> > High-level API includes the A Query language SQL and A powerfu program
> > language Table API. In this way, SQL using for those simple ETL user
> > groups, the TableAPI is for a user group that needs to be customized for
> > complex logic, and these users can enjoy The benefit of the query
> > optimizer. Maybe we need more refinement and hard work to support these
> > functions, but maybe this is a good direction of effort.
> >
> > Thanks,
> > Jincheng
> >
> > Timo Walther <twalthr@xxxxxxxxxx> 于2018年11月1日周四 下午10:08写道:
> >
> > > Hi Jincheng,
> > >
> > > I was also thinking about introducing a process function for the Table
> > > API several times. This would allow to define more complex logic
> (custom
> > > windows, timers, etc.) embedded into a relational API with schema
> > > awareness and optimization around the black box. Of course this would
> > > mean that we diverge with Table API from SQL API, however, it would
> open
> > > the Table API also for more event-driven applications.
> > >
> > > Maybe it would be possible to define timers and firing logic using
> Table
> > > API expressions and UDFs. Within planning this would be treated as a
> > > special Calc node.
> > >
> > > Just some ideas that might be interesting for new use cases.
> > >
> > > Regards,
> > > Timo
> > >
> > >
> > > Am 01.11.18 um 13:12 schrieb Aljoscha Krettek:
> > > > Hi Jincheng,
> > > >
> > > > these points sound very good! Are there any concrete proposals for
> > > changes? For example a FLIP/design document?
> > > >
> > > > See here for FLIPs:
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> > > >
> > > > Best,
> > > > Aljoscha
> > > >
> > > >> On 1. Nov 2018, at 12:51, jincheng sun <sunjincheng121@xxxxxxxxx>
> > > wrote:
> > > >>
> > > >> *--------I am sorry for the formatting of the email content. I
> > reformat
> > > >> the **content** as follows-----------*
> > > >>
> > > >> *Hi ALL,*
> > > >>
> > > >> With the continuous efforts from the community, the Flink system has
> > > been
> > > >> continuously improved, which has attracted more and more users.
> Flink
> > > SQL
> > > >> is a canonical, widely used relational query language. However,
> there
> > > are
> > > >> still some scenarios where Flink SQL failed to meet user needs in
> > terms
> > > of
> > > >> functionality and ease of use, such as:
> > > >>
> > > >> *1. In terms of functionality*
> > > >>     Iteration, user-defined window, user-defined join, user-defined
> > > >> GroupReduce, etc. Users cannot express them with SQL;
> > > >>
> > > >> *2. In terms of ease of use*
> > > >>
> > > >>    - Map - e.g. “dataStream.map(mapFun)”. Although
> > “table.select(udf1(),
> > > >>    udf2(), udf3()....)” can be used to accomplish the same
> function.,
> > > with a
> > > >>    map() function returning 100 columns, one has to define or call
> 100
> > > UDFs
> > > >>    when using SQL, which is quite involved.
> > > >>    - FlatMap -  e.g. “dataStrem.flatmap(flatMapFun)”. Similarly, it
> > can
> > > be
> > > >>    implemented with “table.join(udtf).select()”. However, it is
> > obvious
> > > that
> > > >>    dataStream is easier to use than SQL.
> > > >>
> > > >> Due to the above two reasons, some users have to use the DataStream
> > API
> > > or
> > > >> the DataSet API. But when they do that, they lose the unification of
> > > batch
> > > >> and streaming. They will also lose the sophisticated optimizations
> > such
> > > as
> > > >> codegen, aggregate join transpose and multi-stage agg from Flink
> SQL.
> > > >>
> > > >> We believe that enhancing the functionality and productivity is
> vital
> > > for
> > > >> the successful adoption of Table API. To this end,  Table API still
> > > >> requires more efforts from every contributor in the community. We
> see
> > > great
> > > >> opportunity in improving our user’s experience from this work. Any
> > > feedback
> > > >> is welcome.
> > > >>
> > > >> Regards,
> > > >>
> > > >> Jincheng
> > > >>
> > > >> jincheng sun <sunjincheng121@xxxxxxxxx> 于2018年11月1日周四 下午5:07写道:
> > > >>
> > > >>> Hi all,
> > > >>>
> > > >>> With the continuous efforts from the community, the Flink system
> has
> > > been
> > > >>> continuously improved, which has attracted more and more users.
> Flink
> > > SQL
> > > >>> is a canonical, widely used relational query language. However,
> there
> > > are
> > > >>> still some scenarios where Flink SQL failed to meet user needs in
> > > terms of
> > > >>> functionality and ease of use, such as:
> > > >>>
> > > >>>
> > > >>>    -
> > > >>>
> > > >>>    In terms of functionality
> > > >>>
> > > >>> Iteration, user-defined window, user-defined join, user-defined
> > > >>> GroupReduce, etc. Users cannot express them with SQL;
> > > >>>
> > > >>>    -
> > > >>>
> > > >>>    In terms of ease of use
> > > >>>    -
> > > >>>
> > > >>>       Map - e.g. “dataStream.map(mapFun)”. Although
> > > “table.select(udf1(),
> > > >>>       udf2(), udf3()....)” can be used to accomplish the same
> > > function., with a
> > > >>>       map() function returning 100 columns, one has to define or
> call
> > > 100 UDFs
> > > >>>       when using SQL, which is quite involved.
> > > >>>       -
> > > >>>
> > > >>>       FlatMap -  e.g. “dataStrem.flatmap(flatMapFun)”. Similarly,
> it
> > > can
> > > >>>       be implemented with “table.join(udtf).select()”. However, it
> is
> > > obvious
> > > >>>       that datastream is easier to use than SQL.
> > > >>>
> > > >>>
> > > >>> Due to the above two reasons, some users have to use the DataStream
> > > API or
> > > >>> the DataSet API. But when they do that, they lose the unification
> of
> > > batch
> > > >>> and streaming. They will also lose the sophisticated optimizations
> > > such as
> > > >>> codegen, aggregate join transpose  and multi-stage agg from Flink
> > SQL.
> > > >>>
> > > >>> We believe that enhancing the functionality and productivity is
> vital
> > > for
> > > >>> the successful adoption of Table API. To this end,  Table API still
> > > >>> requires more efforts from every contributor in the community. We
> see
> > > great
> > > >>> opportunity in improving our user’s experience from this work. Any
> > > feedback
> > > >>> is welcome.
> > > >>>
> > > >>> Regards,
> > > >>>
> > > >>> Jincheng
> > > >>>
> > > >>>
> > >
> > >
> >
>