
Re: [DISCUSS] Enhancing the functionality and productivity of Table API


Hi jincheng,

Thanks a lot for your proposal! I find it a good starting point for
internal optimization work, and it helps Flink become more
user-friendly.

AFAIK, DataStream is currently the most popular API, with which Flink
users describe their logic in detail.
From a more internal view, the conversion from DataStream to
JobGraph is quite mechanical and hard to optimize. So when
users program with DataStream, they have to learn more internals
and spend a lot of time tuning for performance.
With your proposal, we provide enhanced functionality in the Table API,
so that users can describe their jobs easily at the Table level. This gives
Flink developers an opportunity to introduce an optimization phase
while transforming a user program (described by the Table API) into its
internal representation.

Given a user who wants to start using Flink for simple ETL, pipelining,
or analytics, he would find it most naturally described by the SQL/Table
API. Further, as mentioned by @hequn,

> SQL is a widely used language. It follows standards, is a
> descriptive language, and is easy to use


thus we can expect that, with the enhancement of the SQL/Table API,
Flink becomes friendlier to users.
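To make the ease-of-use gap discussed in this thread concrete (one map() function returning many columns versus one UDF call per column), here is a toy illustration in plain Python. This is not the Flink API; all function and column names here are made up purely to show the shape of the two styles:

```python
# Toy illustration (plain Python, not Flink API) of the ergonomics argument:
# with a select-style API, producing N derived columns needs N UDF calls,
# while a map-style API needs one function returning all columns at once.

row = {"a": 1, "b": 2}

# Select style: one UDF per output column (imagine 100 of these).
def udf1(r): return r["a"] + 1
def udf2(r): return r["b"] * 2
def udf3(r): return r["a"] + r["b"]

selected = [udf1(row), udf2(row), udf3(row)]

# Map style: a single function returns the whole output row in one go.
def map_fun(r):
    return [r["a"] + 1, r["b"] * 2, r["a"] + r["b"]]

mapped = map_fun(row)

# Both styles compute the same output row; only the API surface differs.
assert selected == mapped == [2, 4, 3]
```

The results are identical; the difference is purely how many separate user-defined functions must be declared and wired up as the output width grows.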

Looking forward to the design doc/FLIP!

Best,
tison.


jincheng sun <sunjincheng121@xxxxxxxxx> wrote on Fri, Nov 2, 2018 at 11:46 AM:

> Hi Hequn,
> Thanks for your feedback! And also thanks for the offline discussion!
> You are right, unification of batch and streaming is very important for
> the Flink API.
> We will provide a more detailed design later. Please let me know if you
> have further thoughts or feedback.
>
> Thanks,
> Jincheng
>
> Hequn Cheng <chenghequn@xxxxxxxxx> wrote on Fri, Nov 2, 2018 at 10:02 AM:
>
> > Hi Jincheng,
> >
> > Thanks a lot for your proposal. It is very encouraging!
> >
> > As we all know, SQL is a widely used language. It follows standards, is a
> > descriptive language, and is easy to use. A powerful feature of SQL is that
> > it supports optimization. Users only need to care about the logic of the
> > program. The underlying optimizer will help users optimize the performance
> > of the program. However, in terms of functionality and ease of use, SQL is
> > limited in some scenarios, as described in Jincheng's proposal.
> >
> > Correspondingly, the DataStream/DataSet API can provide powerful
> > functionalities. Users can write a ProcessFunction/CoProcessFunction and
> > access timers. Compared with SQL, it provides more functionality and
> > flexibility. However, it does not support optimization the way SQL does.
> > Meanwhile, the DataStream and DataSet APIs have not been unified, which
> > means that, for the same logic, users need to write one job for streaming
> > and another for batch.
> >
> > With the Table API, I think we can combine the advantages of both. Users
> > can easily write relational operations and enjoy optimization. At the same
> > time, it supports more functionality and is easier to use. Looking forward
> > to the detailed design/FLIP.
> >
> > Best,
> > Hequn
> >
> > On Fri, Nov 2, 2018 at 9:48 AM Shaoxuan Wang <wshaoxuan@xxxxxxxxx> wrote:
> >
> > > Hi Aljoscha,
> > > Glad that you like the proposal. We have completed the prototype of most
> > > of the newly proposed functionalities. Once we collect feedback from the
> > > community, we will come up with a concrete FLIP/design doc.
> > >
> > > Regards,
> > > Shaoxuan
> > >
> > >
> > > On Thu, Nov 1, 2018 at 8:12 PM Aljoscha Krettek <aljoscha@xxxxxxxxxx>
> > > wrote:
> > >
> > > > Hi Jincheng,
> > > >
> > > > these points sound very good! Are there any concrete proposals for
> > > > changes? For example, a FLIP/design document?
> > > >
> > > > See here for FLIPs:
> > > >
> > > > https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> > > >
> > > > Best,
> > > > Aljoscha
> > > >
> > > > > On 1. Nov 2018, at 12:51, jincheng sun <sunjincheng121@xxxxxxxxx> wrote:
> > > > >
> > > > > *-------- I am sorry for the formatting of the email content. I have
> > > > > reformatted the content as follows --------*
> > > > >
> > > > > *Hi ALL,*
> > > > >
> > > > > With the continuous efforts from the community, the Flink system has
> > > > > been continuously improved, which has attracted more and more users.
> > > > > Flink SQL is a canonical, widely used relational query language.
> > > > > However, there are still some scenarios where Flink SQL fails to meet
> > > > > user needs in terms of functionality and ease of use, such as:
> > > > >
> > > > > *1. In terms of functionality*
> > > > >    Iteration, user-defined window, user-defined join, user-defined
> > > > > GroupReduce, etc. Users cannot express them with SQL;
> > > > >
> > > > > *2. In terms of ease of use*
> > > > >
> > > > >   - Map - e.g. “dataStream.map(mapFun)”. Although “table.select(udf1(),
> > > > >   udf2(), udf3()....)” can be used to accomplish the same function, with
> > > > >   a map() function returning 100 columns, one has to define or call 100
> > > > >   UDFs when using SQL, which is quite involved.
> > > > >   - FlatMap - e.g. “dataStream.flatMap(flatMapFun)”. Similarly, it can
> > > > >   be implemented with “table.join(udtf).select()”. However, it is
> > > > >   obvious that DataStream is easier to use than SQL.
> > > > >
> > > > > Due to the above two reasons, some users have to use the DataStream
> > > > > API or the DataSet API. But when they do that, they lose the
> > > > > unification of batch and streaming. They will also lose the
> > > > > sophisticated optimizations from Flink SQL, such as codegen,
> > > > > aggregate-join transpose, and multi-stage aggregation.
> > > > >
> > > > > We believe that enhancing the functionality and productivity is
> > > > > vital for the successful adoption of the Table API. To this end, the
> > > > > Table API still requires more effort from every contributor in the
> > > > > community. We see great opportunity in improving our users'
> > > > > experience through this work. Any feedback is welcome.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Jincheng
> > > > >
> > > >
> > > >
> > >
> >
>