
Re: [DISCUSS] Embracing Table API in Flink ML

Hi Weihua,

    Thanks for the exciting proposal! 

    I have read through it quickly, and I really appreciate the idea of providing an ML Pipeline API similar to the commonly used library scikit-learn, since it greatly reduces the learning cost for AI engineers moving to the Flink platform.

    We are currently working on a related issue, namely enhancing the stream iteration of Flink to support both SGD and online learning, with batch training supported as a special case. We have a rough design and will start a new discussion in the next few days. I think the enhanced stream iteration will help to implement Estimators directly in Flink, and it may help to simplify the online learning pipeline by eliminating the need to load models from external file systems.

    I will read the design doc more carefully. Thanks again for sharing the design doc!

Yours sincerely
    Yun Gao 

From: Weihua Jiang <weihua.jiang@xxxxxxxxx>
Sent: Tuesday, November 20, 2018 20:53
To: dev <dev@xxxxxxxxxxxxxxxx>
Subject: [DISCUSS] Embracing Table API in Flink ML

ML Pipeline is an idea introduced by scikit-learn
<https://arxiv.org/abs/1309.0238>. Both Spark and Flink have borrowed this
idea and made their own implementations [Spark ML Pipeline
<https://spark.apache.org/docs/latest/ml-pipeline.html>, Flink ML Pipeline

NOTE: though I am using the term "ML", ML Pipeline shall apply to both ML
and DL pipelines.

ML Pipeline is quite helpful for model composition (e.g. using model(s) for
feature engineering). It also enables logic reuse across the training and
inference phases (via pipeline persistence and loading), which is essential
for AI engineering. ML Pipeline can also be a good base for a Flink-based AI
engineering platform if we can give it good tooling support (i.e.
human-readable metadata).
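
To make the composition and train/inference reuse concrete, here is a minimal sketch in plain Python. It is not the proposed Flink API and not scikit-learn's implementation; the class names (MeanCenterer, Scaler, Pipeline), the params() method, and the JSON layout are all hypothetical and exist only for illustration. The fitted pipeline is saved as human-readable JSON and restored for inference.

```python
import json


class MeanCenterer:
    """Hypothetical stage: learns the mean in fit(), subtracts it in transform()."""

    def __init__(self, mean=None):
        self.mean = mean

    def fit(self, values):
        self.mean = sum(values) / len(values)
        return self

    def transform(self, values):
        return [v - self.mean for v in values]

    def params(self):
        return {"mean": self.mean}


class Scaler:
    """Hypothetical stage with nothing to learn: multiplies by a fixed factor."""

    def __init__(self, factor=1.0):
        self.factor = factor

    def fit(self, values):
        return self

    def transform(self, values):
        return [v * self.factor for v in values]

    def params(self):
        return {"factor": self.factor}


# Registry used to rebuild stages from their saved type names.
STAGE_TYPES = {cls.__name__: cls for cls in (MeanCenterer, Scaler)}


class Pipeline:
    """Chains stages; each stage is fitted on the output of the previous one."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        for stage in self.stages:
            values = stage.fit(values).transform(values)
        return self

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values

    def to_json(self):
        # Human-readable metadata: stage type plus its (fitted) parameters.
        return json.dumps(
            [{"type": type(s).__name__, "params": s.params()} for s in self.stages]
        )

    @classmethod
    def from_json(cls, text):
        return cls([STAGE_TYPES[d["type"]](**d["params"]) for d in json.loads(text)])


# Training phase: fit once and persist.
trained = Pipeline([MeanCenterer(), Scaler(factor=2.0)]).fit([1.0, 2.0, 3.0])
saved = trained.to_json()

# Inference phase: reload the same logic and apply it to new data.
restored = Pipeline.from_json(saved)
result = restored.transform([4.0])
```

Because the saved form is plain JSON, external tooling could inspect or edit a pipeline without running it, which is the kind of tooling support mentioned above.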

As the Table API will be the unified high-level API for both stream and
batch processing, I want to initiate the design discussion of a new
Table-based Flink ML Pipeline.

I drafted a design document [1] for this discussion. This design tries to
create a new ML Pipeline implementation so that concrete ML/DL algorithms
can fit into this new API to achieve interoperability.

Any feedback is highly appreciated.