osdir.com


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[DISCUSS] Embracing Table API in Flink ML


ML Pipeline is the idea brought by Scikit-learn
<https://arxiv.org/abs/1309.0238>. Both Spark and Flink has borrowed this
idea and made their own implementations [Spark ML Pipeline
<https://spark.apache.org/docs/latest/ml-pipeline.html>, Flink ML Pipeline
<https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/libs/ml/pipelines.html>].



NOTE: though I am using the term "ML", ML Pipeline shall apply to both ML
and DL pipelines.


ML Pipeline is quite helpful for model composition (i.e. using model(s) for
feature engineering) . And it enables logic reuse in train and inference
phases (via pipeline persistence and load), which is essential for AI
engineering. ML Pipeline can also be a good base for Flink based AI
engineering platform if we can make ML Pipeline have good tooling support
(i.e. meta data human readable).


As the Table API will be the unified high level API for both stream and
batch processing, I want to initiate the design discussion of new Table
based Flink ML Pipeline.


I drafted a design document [1] for this discussion. This design tries to
create a new ML Pipeline implementation so that concrete ML/DL algorithms
can fit to this new API to achieve interoperability.


Any feedback is highly appreciated.


Thanks

Weihua


[1]
https://docs.google.com/document/d/1PLddLEMP_wn4xHwi6069f3vZL7LzkaP0MN9nAB63X90/edit?usp=sharing