Re: [DISCUSS] Embracing Table API in Flink ML


Hi Yun,

Very excited to see Flink ML moving forward! Your document touches on many
points. I couldn't agree more about the value a (unified) Table API could
bring to the Flink ecosystem for running ML workloads. Most ML pipelines we
have observed start from single-box Python scripts or ad hoc tools that
researchers run to train a model on a powerful machine. When that proves
successful, they need to hook up with the data warehouse and extract
features (this is where SQL kicks in). In the training phase, the landscape
is very fragmented. Small to medium-sized models can be trained on the JVM,
while large/deep models need to optimize operators and the per-iteration
random data shuffle (SGD-based DL), so they often end up in JNI/C++/CUDA
with their own task scheduling (gang scheduling instead of hacking around
map-reduce).

Hope this makes sense. BTW, xgboost (one of the most popular frameworks in
ML competitions) has very primitive Flink support; it might be worth
checking out.
https://github.com/dmlc/xgboost

Chen

On Tue, Nov 20, 2018 at 6:13 PM Weihua Jiang <weihua.jiang@xxxxxxxxx> wrote:

> Hi Yun,
>
> Can't wait to see your design.
>
> Thanks
> Weihua
>
> On Wed, Nov 21, 2018 at 12:43 AM Yun Gao <yungao.gy@xxxxxxxxxx.invalid> wrote:
>
> > Hi Weihua,
> >
> >     Thanks for the exciting proposal!
> >
> >     I have quickly read through it, and I really appreciate the idea of
> > providing an ML Pipeline API similar to the commonly used library
> > scikit-learn, since it greatly reduces the learning cost for AI
> > engineers moving to the Flink platform.
> >
> >     Currently we are also working on a related issue, namely enhancing the
> > stream iteration of Flink to support both SGD and online learning; it
> > also supports batch training as a special case. We have a rough design
> > and will start a new discussion in the next few days. I think the enhanced
> > stream iteration will help to implement Estimators directly in Flink, and
> > it may help to simplify the online learning pipeline by eliminating the
> > requirement to load the models from external file systems.
> >
> >     I will read the design doc more carefully. Thanks again for sharing
> > the design doc!
> >
> > Yours sincerely
> >     Yun Gao
> >
> >
> > ------------------------------------------------------------------
> > From: Weihua Jiang <weihua.jiang@xxxxxxxxx>
> > Sent: Tuesday, Nov 20, 2018, 20:53
> > To: dev <dev@xxxxxxxxxxxxxxxx>
> > Subject: [DISCUSS] Embracing Table API in Flink ML
> >
> > The ML Pipeline is an idea introduced by scikit-learn
> > <https://arxiv.org/abs/1309.0238>. Both Spark and Flink have borrowed this
> > idea and made their own implementations [Spark ML Pipeline
> > <https://spark.apache.org/docs/latest/ml-pipeline.html>, Flink ML Pipeline
> > <https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/libs/ml/pipelines.html>].
> >
> >
> >
> > NOTE: though I am using the term "ML", ML Pipeline shall apply to both ML
> > and DL pipelines.
> >
> >
> > ML Pipeline is quite helpful for model composition (e.g. using model(s) for
> > feature engineering). It also enables logic reuse across the training and
> > inference phases (via pipeline persistence and load), which is essential
> > for AI engineering. ML Pipeline can also be a good base for a Flink-based
> > AI engineering platform if we can give it good tooling support (e.g.
> > human-readable metadata).
> >
> >
> > As the Table API will be the unified high-level API for both stream and
> > batch processing, I want to initiate the design discussion of a new
> > Table-based Flink ML Pipeline.
> >
> >
> > I drafted a design document [1] for this discussion. This design tries to
> > create a new ML Pipeline implementation so that concrete ML/DL algorithms
> > can fit into this new API and achieve interoperability.
> >
> >
> > Any feedback is highly appreciated.
> >
> >
> > Thanks
> >
> > Weihua
> >
> >
> > [1] https://docs.google.com/document/d/1PLddLEMP_wn4xHwi6069f3vZL7LzkaP0MN9nAB63X90/edit?usp=sharing
> >
> >
>
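
For readers less familiar with the pattern, here is a rough sketch of the
pipeline idea described in the quoted proposal: stages are composed, fitted
once in the training phase, persisted as human-readable metadata, then loaded
and reused unchanged in the inference phase. It is plain Python using only the
standard library. All class and function names (MeanStdScaler,
ThresholdClassifier, Pipeline, to_json, from_json) are made up for
illustration and are not the scikit-learn, Spark, or Flink API, and it says
nothing about how the design doc actually structures things.

```python
import json
from statistics import mean, pstdev


class MeanStdScaler:
    """Hypothetical transformer: learns mean/std in fit(), rescales in transform()."""

    def __init__(self, mean_=0.0, std_=1.0):
        self.mean_, self.std_ = mean_, std_

    def fit(self, xs, ys=None):
        self.mean_ = mean(xs)
        self.std_ = pstdev(xs) or 1.0  # guard against constant input
        return self

    def transform(self, xs):
        return [(x - self.mean_) / self.std_ for x in xs]


class ThresholdClassifier:
    """Hypothetical estimator: threshold midway between the two class means."""

    def __init__(self, threshold_=0.0):
        self.threshold_ = threshold_

    def fit(self, xs, ys):
        pos = [x for x, y in zip(xs, ys) if y == 1]
        neg = [x for x, y in zip(xs, ys) if y == 0]
        self.threshold_ = (mean(pos) + mean(neg)) / 2
        return self

    def predict(self, xs):
        return [1 if x > self.threshold_ else 0 for x in xs]


# Registry used to rebuild stages from their persisted type names.
STAGE_TYPES = {cls.__name__: cls for cls in (MeanStdScaler, ThresholdClassifier)}


class Pipeline:
    """Chains transformers and ends with an estimator."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, xs, ys):
        for stage in self.stages[:-1]:
            xs = stage.fit(xs, ys).transform(xs)
        self.stages[-1].fit(xs, ys)
        return self

    def predict(self, xs):
        for stage in self.stages[:-1]:
            xs = stage.transform(xs)
        return self.stages[-1].predict(xs)

    def to_json(self):
        # Persist each stage as its type name plus its learned parameters.
        return json.dumps(
            [{"type": type(s).__name__, "params": vars(s)} for s in self.stages],
            indent=2,
        )

    @classmethod
    def from_json(cls, text):
        return cls([STAGE_TYPES[d["type"]](**d["params"]) for d in json.loads(text)])


# Training phase: fit once, then persist the whole pipeline.
train_x = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
train_y = [0, 0, 0, 1, 1, 1]
model = Pipeline([MeanStdScaler(), ThresholdClassifier()]).fit(train_x, train_y)
saved = model.to_json()

# Inference phase: load the persisted pipeline and reuse the same logic.
restored = Pipeline.from_json(saved)
predictions = restored.predict([2.5, 7.5])
```

Storing the stages as JSON rather than as an opaque binary blob is only one
possible choice; I picked it here because the proposal mentions
human-readable metadata as a tooling concern.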