
Re: [DISCUSS] Embracing Table API in Flink ML

Hi Weihua,

Thanks for the proposal. I have read through it quickly, and it looks great.
A quick question: have you considered also moving the ML lib (the
implementations of Estimator/Predictor/Transformer) onto the Table API? I
would be very happy to see this included in the scope. It is not
easy and needs a lot of new Table API functionality, which is exactly
one of the reasons that motivated us to "enhance the Table API", as
discussed in other threads.
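
To make the question concrete, here is a rough sketch of what an
Estimator/Transformer pair and a pipeline over a table-like input could
look like. It is only an illustration under stated assumptions: it is
written in plain Python, `Table` is a stand-in (a list of row dicts), and
`MeanCenter`, `MeanCenterModel`, `Pipeline` and `PipelineModel` are
hypothetical names, not the Flink Table API and not the design in the doc.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol

# Stand-in for a table: a list of rows, each mapping column name -> value.
# This is an assumption for illustration, not Flink's Table type.
Table = List[Dict[str, float]]


class Transformer(Protocol):
    def transform(self, table: Table) -> Table: ...


class Estimator(Protocol):
    def fit(self, table: Table) -> Transformer: ...


@dataclass
class MeanCenterModel:
    """Fitted model (hypothetical): subtracts a learned mean from one column."""
    column: str
    mean: float

    def transform(self, table: Table) -> Table:
        return [{**row, self.column: row[self.column] - self.mean} for row in table]


@dataclass
class MeanCenter:
    """Estimator (hypothetical): learns the column mean from the training table."""
    column: str

    def fit(self, table: Table) -> MeanCenterModel:
        mean = sum(row[self.column] for row in table) / len(table)
        return MeanCenterModel(self.column, mean)


@dataclass
class PipelineModel:
    """Fitted pipeline: applies each fitted stage in order."""
    stages: list

    def transform(self, table: Table) -> Table:
        for stage in self.stages:
            table = stage.transform(table)
        return table


@dataclass
class Pipeline:
    """Chains stages; estimators are fitted, plain transformers are reused as-is."""
    stages: list

    def fit(self, table: Table) -> PipelineModel:
        fitted = []
        current = table
        for stage in self.stages:
            model = stage.fit(current) if hasattr(stage, "fit") else stage
            current = model.transform(current)  # feed the output to the next stage
            fitted.append(model)
        return PipelineModel(fitted)


train = [{"x": 1.0}, {"x": 3.0}]
model = Pipeline([MeanCenter("x")]).fit(train)
scored = model.transform([{"x": 2.0}])
```

The point of the sketch is only that both the pipeline and the algorithm
implementations consume and produce the same table type, which is what
putting the ML lib on the Table API would give us.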

The scope of your proposal is big enough that I would suggest we
complete it step by step. I think you have mainly proposed 3 things:
1. Redesign the ML pipeline based on the Table API
2. Take streaming ML pipelines into account
3. Enhance the ML pipeline with some new features for a better user experience
Maybe we should first replace the ML pipeline interface with one based on
the Table API, then move on to #2 and #3. In the meantime, we can also
explore the possibility of moving the ML lib onto the Table API. What do
you think?

BTW, we should not break the current ML pipeline interface (which is
based on DataSet) when we introduce the new one. Let us keep it for
a while, until the new interface is completed and well adopted. Then
we can deprecate the old one.

I will take a more thorough look at your proposal and leave comments
directly on the doc.


On 11/20/18, Weihua Jiang <weihua.jiang@xxxxxxxxx> wrote:
> ML Pipeline is the idea brought by Scikit-learn
> <https://arxiv.org/abs/1309.0238>. Both Spark and Flink have borrowed this
> idea and made their own implementations [Spark ML Pipeline
> <https://spark.apache.org/docs/latest/ml-pipeline.html>, Flink ML Pipeline
> <https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/libs/ml/pipelines.html>].
> NOTE: though I am using the term "ML", ML Pipeline shall apply to both ML
> and DL pipelines.
> ML Pipeline is quite helpful for model composition (i.e. using model(s) for
> feature engineering). It also enables logic reuse between the training and
> inference phases (via pipeline persistence and load), which is essential for
> AI engineering. ML Pipeline can also be a good base for a Flink-based AI
> engineering platform if we can give ML Pipeline good tooling support
> (i.e. human-readable metadata).
> As the Table API will be the unified high level API for both stream and
> batch processing, I want to initiate the design discussion of new Table
> based Flink ML Pipeline.
> I drafted a design document [1] for this discussion. This design tries to
> create a new ML Pipeline implementation so that concrete ML/DL algorithms
> can fit to this new API to achieve interoperability.
> Any feedback is highly appreciated.
> Thanks
> Weihua
> [1]
> https://docs.google.com/document/d/1PLddLEMP_wn4xHwi6069f3vZL7LzkaP0MN9nAB63X90/edit?usp=sharing


*Rome was not built in one day*