osdir.com


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [DISCUSS] Support Interactive Programming in Flink Table API


Hi Fabian and Piotr,

Thanks for the feedback. I think I now understand you a little better.

1. “Materialize" and “cache" are two different scenarios IMO. "Materialize"
is a complex feature that allows the user to really create a
materializedView/table, and the materialized table will be timely updated
either when sourceTable is varied or timer is triggered. I can image this
feature will need lots of components to be added in Flink, like flinkStore,
meta system, job scheduler etc. This is definitely something that we want
to have but have not been planned yet. "Cache" addresses the performance
issue when consequent jobs needed to be executed and the latter one want to
reuse the result of previous one’s as an input source.

2. In the case of “Cache”. I did not consider that the method (let us first
assume there is such method) could modify the input table. To make sure I
understand you correctly. Is this what you mean by “refresh":
Table t1 = ???
Table t2 = t1.cache()
Table t3 = methodThatAppliesOperators(t1) // t1 is modified to-> t1'
//assume t1 can be modified
Table t4 = methodThatAppliesOperators(t2) // t1is used
t2.refresh() //load t1'
Table t5 = methodThatAppliesOperators(t2) // t1’ is used

I can see the value of having a new return type for cache() in this case.
(Maybe I missed sth.) But do we have such methods or expect to have any of
those that can modify the input table? If not, I do not see the need that
we should add a new return type for cache().

3. I I agree we should keep the “the logic plan of t1” and let optimizer to
decide the optimal plan weather to scan the cache data or not. This is
useful for both materialize and cache cases. When we start to think about
this cache proposal, I am evening thinking to let optimizer smartly add a
cache as needed. But this needs lots of changes on the optimization
framework itself (cross-job optimization), also it does not improve the
problem when user executes the table queries in an interactive way (because
optimizer cannot predict the future queries).


Regards,
Shaoxuan


On Thu, Nov 29, 2018 at 9:16 PM Fabian Hueske <fhueske@xxxxxxxxx> wrote:

> Hi,
>
> Thanks for the clarification Becket!
>
> I have a few thoughts to share / questions:
>
> 1) I'd like to know how you plan to implement the feature on a plan /
> planner level.
>
> I would imaging the following to happen when Table.cache() is called:
>
> 1) immediately optimize the Table and internally convert it into a
> DataSet/DataStream. This is necessary, to avoid that operators of later
> queries on top of the Table are pushed down.
> 2) register the DataSet/DataStream as a DataSet/DataStream-backed Table X
> 3) add a sink to the DataSet/DataStream. This is the materialization of the
> Table X
>
> Based on your proposal the following would happen:
>
> Table t1 = ....
> t1.cache(); // cache() returns void. The logical plan of t1 is replaced by
> a scan of X. There is also a reference to the materialization of X.
>
> t1.count(); // this executes the program, including the DataSet/DataStream
> that backs X and the sink that writes the materialization of X
> t1.count(); // this executes the program, but reads X from the
> materialization.
>
> My question is, how do you determine when whether the scan of t1 should go
> against the DataSet/DataStream program and when against the
> materialization?
> AFAIK, there is no hook that will tell you that a part of the program was
> executed. Flipping a switch during optimization or plan generation is not
> sufficient as there is no guarantee that the plan is also executed.
>
> Overall, this behavior is somewhat similar to what I proposed in
> FLINK-8950, which does not include persisting the table, but just
> optimizing and reregistering it as DataSet/DataStream scan.
>
> 2) I think Piotr has a point about the implicit behavior and side effects
> of the cache() method if it does not return anything.
> Consider the following example:
>
> Table t1 = ???
> Table t2 = methodThatAppliesOperators(t1);
> Table t3 = methodThatAppliesOtherOperators(t1);
>
> In this case, the behavior/performance of the plan that results from the
> second method call depends on whether t1 was modified by the first method
> or not.
> This is the classic issue of mutable vs. immutable objects.
> Also, as Piotr pointed out, it might also be good to have the original plan
> of t1, because in some cases it is possible to push filters down such that
> evaluating the query from scratch might be more efficient than accessing
> the cache.
> Moreover, a CachedTable could extend Table() and offer a method refresh().
> This sounds quite useful in an interactive session mode.
>
> 3) Regarding the name, I can see both arguments. IMO, materialize() seems
> to be more future proof.
>
> Best, Fabian
>
> Am Do., 29. Nov. 2018 um 12:56 Uhr schrieb Shaoxuan Wang <
> wshaoxuan@xxxxxxxxx>:
>
> > Hi Piotr,
> >
> > Thanks for sharing your ideas on the method naming. We will think about
> > your suggestions. But I don't understand why we need to change the return
> > type of cache().
> >
> > Cache() is a physical operation, it does not change the logic of
> > the `Table`. On the tableAPI layer, we should not introduce a new table
> > type unless the logic of table has been changed. If we introduce a new
> > table type `CachedTable`, we need create the same set of methods of
> `Table`
> > for it. I don't think it is worth doing this. Or can you please elaborate
> > more on what could be the "implicit behaviours/side effects" you are
> > thinking about?
> >
> > Regards,
> > Shaoxuan
> >
> >
> >
> > On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <piotr@xxxxxxxxxxxxxxxxx>
> > wrote:
> >
> > > Hi Becket,
> > >
> > > Thanks for the response.
> > >
> > > 1. I wasn’t saying that materialised view must be mutable or not. The
> > same
> > > thing applies to caches as well. To the contrary, I would expect more
> > > consistency and updates from something that is called “cache” vs
> > something
> > > that’s a “materialised view”. In other words, IMO most caches do not
> > serve
> > > you invalid/outdated data and they handle updates on their own.
> > >
> > > 2. I don’t think that having in the future two very similar concepts of
> > > `materialized` view and `cache` is a good idea. It would be confusing
> for
> > > the users. I think it could be handled by variations/overloading of
> > > materialised view concept. We could start with:
> > >
> > > `MaterializedTable materialize()` - immutable, session life scope
> > > (basically the same semantic as you are proposing
> > >
> > > And then in the future (if ever) build on top of that/expand it with:
> > >
> > > `MaterializedTable materialize(refreshTime=…)` or `MaterializedTable
> > > materialize(refreshHook=…)`
> > >
> > > Or with cross session support:
> > >
> > > `MaterializedTable materializeInto(connector=…)` or `MaterializedTable
> > > materializeInto(tableFactory=…)`
> > >
> > > I’m not saying that we should implement cross session/refreshing now or
> > > even in the near future. I’m just arguing that naming current immutable
> > > session life scope method `materialize()` is more future proof and more
> > > consistent with SQL (on which after all table-api is heavily basing
> on).
> > >
> > > 3. Even if we agree on naming it `cache()`, I would still insist on
> > > `cache()` returning `CachedTable` handle to avoid implicit
> > behaviours/side
> > > effects and to give both us & users more flexibility.
> > >
> > > Piotrek
> > >
> > > > On 29 Nov 2018, at 06:20, Becket Qin <becket.qin@xxxxxxxxx> wrote:
> > > >
> > > > Just to add a little bit, the materialized view is probably more
> > similar
> > > to
> > > > the persistent() brought up earlier in the thread. So it is usually
> > cross
> > > > session and could be used in a larger scope. For example, a
> > materialized
> > > > view created by user A may be visible to user B. It is probably
> > something
> > > > we want to have in the future. I'll put it in the future work
> section.
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > > On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <becket.qin@xxxxxxxxx>
> > wrote:
> > > >
> > > >> Hi Piotrek,
> > > >>
> > > >> Thanks for the explanation.
> > > >>
> > > >> Right now we are mostly thinking of the cached table as immutable. I
> > can
> > > >> see the Materialized view would be useful in the future. That said,
> I
> > > think
> > > >> a simple cache mechanism is probably still needed. So to me, cache()
> > and
> > > >> materialize() should be two separate method as they address
> different
> > > >> needs. Materialize() is a higher level concept usually implying
> > > periodical
> > > >> update, while cache() has much simpler semantic. For example, one
> may
> > > >> create a materialized view and use cache() method in the
> materialized
> > > view
> > > >> creation logic. So that during the materialized view update, they do
> > not
> > > >> need to worry about the case that the cached table is also changed.
> > > Maybe
> > > >> under the hood, materialized() and cache() could share some
> mechanism,
> > > but
> > > >> I think a simple cache() method would be handy in a lot of cases.
> > > >>
> > > >> Thanks,
> > > >>
> > > >> Jiangjie (Becket) Qin
> > > >>
> > > >> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski <
> > piotr@xxxxxxxxxxxxxxxxx
> > > >
> > > >> wrote:
> > > >>
> > > >>> Hi Becket,
> > > >>>
> > > >>>> Is there any extra thing user can do on a MaterializedTable that
> > they
> > > >>> cannot do on a Table?
> > > >>>
> > > >>> Maybe not in the initial implementation, but various DBs offer
> > > different
> > > >>> ways to “refresh” the materialised view. Hooks, triggers, timers,
> > > manually
> > > >>> etc. Having `MaterializedTable` would help us to handle that in the
> > > future.
> > > >>>
> > > >>>> After users call *table.cache(), *users can just use that table
> and
> > do
> > > >>> anything that is supported on a Table, including SQL.
> > > >>>
> > > >>> This is some implicit behaviour with side effects. Imagine if user
> > has
> > > a
> > > >>> long and complicated program, that touches table `b` multiple
> times,
> > > maybe
> > > >>> scattered around different methods. If he modifies his program by
> > > inserting
> > > >>> in one place
> > > >>>
> > > >>> b.cache()
> > > >>>
> > > >>> This implicitly alters the semantic and behaviour of his code all
> > over
> > > >>> the place, maybe in a ways that might cause problems. For example
> > what
> > > if
> > > >>> underlying data is changing?
> > > >>>
> > > >>> Having invisible side effects is also not very clean, for example
> > think
> > > >>> about something like this (but more complicated):
> > > >>>
> > > >>> Table b = ...;
> > > >>>
> > > >>> If (some_condition) {
> > > >>>  processTable1(b)
> > > >>> }
> > > >>> else {
> > > >>>  processTable2(b)
> > > >>> }
> > > >>>
> > > >>> // do more stuff with b
> > > >>>
> > > >>> And user adds `b.cache()` call to only one of the `processTable1`
> or
> > > >>> `processTable2` methods.
> > > >>>
> > > >>> On the other hand
> > > >>>
> > > >>> Table materialisedB = b.materialize()
> > > >>>
> > > >>> Avoids (at least some of) the side effect issues and forces user to
> > > >>> explicitly use `materialisedB` where it’s appropriate and forces
> user
> > > to
> > > >>> think what does it actually mean. And if something doesn’t work in
> > the
> > > end
> > > >>> for the user, he will know what has he changed instead of blaming
> > > Flink for
> > > >>> some “magic” underneath. In the above example, after materialising
> b
> > in
> > > >>> only one of the methods, he should/would realise about the issue
> when
> > > >>> handling the return value `MaterializedTable` of that method.
> > > >>>
> > > >>> I guess it comes down to personal preferences if you like things to
> > be
> > > >>> implicit or not. The more power is the user, probably the more
> likely
> > > he is
> > > >>> to like/understand implicit behaviour. And we as Table API
> designers
> > > are
> > > >>> the most power users out there, so I would proceed with caution (so
> > > that we
> > > >>> do not end up in the crazy perl realm with it’s lovely implicit
> > method
> > > >>> arguments ;)  <https://stackoverflow.com/a/14922656/8149051>)
> > > >>>
> > > >>>> Table API to also support non-relational processing cases, cache()
> > > >>> might be slightly better.
> > > >>>
> > > >>> I think even such extended Table API could benefit from sticking
> > > to/being
> > > >>> consistent with SQL where both SQL and Table API are basically the
> > > same.
> > > >>>
> > > >>> One more thing. `MaterializedTable materialize()` could be more
> > > >>> powerful/flexible allowing the user to operate both on materialised
> > > and not
> > > >>> materialised view at the same time for whatever reasons (underlying
> > > data
> > > >>> changing/better optimisation opportunities after pushing down more
> > > filters
> > > >>> etc). For example:
> > > >>>
> > > >>> Table b = …;
> > > >>>
> > > >>> MaterlizedTable mb = b.materialize();
> > > >>>
> > > >>> Val min = mb.min();
> > > >>> Val max = mb.max();
> > > >>>
> > > >>> Val user42 = b.filter(‘userId = 42);
> > > >>>
> > > >>> Could be more efficient compared to `b.cache()` if `filter(‘userId
> =
> > > >>> 42);` allows for much more aggressive optimisations.
> > > >>>
> > > >>> Piotrek
> > > >>>
> > > >>>> On 26 Nov 2018, at 12:14, Fabian Hueske <fhueske@xxxxxxxxx>
> wrote:
> > > >>>>
> > > >>>> I'm not suggesting to add support for Ignite. This was just an
> > > example.
> > > >>>> Plasma and Arrow sound interesting, too.
> > > >>>> For the sake of this proposal, it would be up to the user to
> > > implement a
> > > >>>> TableFactory and corresponding TableSource / TableSink classes to
> > > >>> persist
> > > >>>> and read the data.
> > > >>>>
> > > >>>> Am Mo., 26. Nov. 2018 um 12:06 Uhr schrieb Flavio Pompermaier <
> > > >>>> pompermaier@xxxxxxxx>:
> > > >>>>
> > > >>>>> What about to add also Apache Plasma + Arrow as an alternative to
> > > >>> Apache
> > > >>>>> Ignite?
> > > >>>>> [1]
> > > >>>>>
> > > >>>
> > >
> https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
> > > >>>>>
> > > >>>>> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske <
> fhueske@xxxxxxxxx>
> > > >>> wrote:
> > > >>>>>
> > > >>>>>> Hi,
> > > >>>>>>
> > > >>>>>> Thanks for the proposal!
> > > >>>>>>
> > > >>>>>> To summarize, you propose a new method Table.cache(): Table that
> > > will
> > > >>>>>> trigger a job and write the result into some temporary storage
> as
> > > >>> defined
> > > >>>>>> by a TableFactory.
> > > >>>>>> The cache() call blocks while the job is running and eventually
> > > >>> returns a
> > > >>>>>> Table object that represents a scan of the temporary table.
> > > >>>>>> When the "session" is closed (closing to be defined?), the
> > temporary
> > > >>>>> tables
> > > >>>>>> are all dropped.
> > > >>>>>>
> > > >>>>>> I think this behavior makes sense and is a good first step
> towards
> > > >>> more
> > > >>>>>> interactive workloads.
> > > >>>>>> However, its performance suffers from writing to and reading
> from
> > > >>>>> external
> > > >>>>>> systems.
> > > >>>>>> I think this is OK for now. Changes that would significantly
> > improve
> > > >>> the
> > > >>>>>> situation (i.e., pinning data in-memory across jobs) would have
> > > large
> > > >>>>>> impacts on many components of Flink.
> > > >>>>>> Users could use in-memory filesystems or storage grids (Apache
> > > >>> Ignite) to
> > > >>>>>> mitigate some of the performance effects.
> > > >>>>>>
> > > >>>>>> Best, Fabian
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> Am Mo., 26. Nov. 2018 um 03:38 Uhr schrieb Becket Qin <
> > > >>>>>> becket.qin@xxxxxxxxx
> > > >>>>>>> :
> > > >>>>>>
> > > >>>>>>> Thanks for the explanation, Piotrek.
> > > >>>>>>>
> > > >>>>>>> Is there any extra thing user can do on a MaterializedTable
> that
> > > they
> > > >>>>>>> cannot do on a Table? After users call *table.cache(), *users
> can
> > > >>> just
> > > >>>>>> use
> > > >>>>>>> that table and do anything that is supported on a Table,
> > including
> > > >>> SQL.
> > > >>>>>>>
> > > >>>>>>> Naming wise, either cache() or materialize() sounds fine to me.
> > > >>> cache()
> > > >>>>>> is
> > > >>>>>>> a bit more general than materialize(). Given that we are
> > enhancing
> > > >>> the
> > > >>>>>>> Table API to also support non-relational processing cases,
> > cache()
> > > >>>>> might
> > > >>>>>> be
> > > >>>>>>> slightly better.
> > > >>>>>>>
> > > >>>>>>> Thanks,
> > > >>>>>>>
> > > >>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr Nowojski <
> > > >>>>> piotr@xxxxxxxxxxxxxxxxx
> > > >>>>>>>
> > > >>>>>>> wrote:
> > > >>>>>>>
> > > >>>>>>>> Hi Becket,
> > > >>>>>>>>
> > > >>>>>>>> Ops, sorry I didn’t notice that you intend to reuse existing
> > > >>>>>>>> `TableFactory`. I don’t know why, but I assumed that you want
> to
> > > >>>>>> provide
> > > >>>>>>> an
> > > >>>>>>>> alternate way of writing the data.
> > > >>>>>>>>
> > > >>>>>>>> Now that I hopefully understand the proposal, maybe we could
> > > rename
> > > >>>>>>>> `cache()` to
> > > >>>>>>>>
> > > >>>>>>>> void materialize()
> > > >>>>>>>>
> > > >>>>>>>> or going step further
> > > >>>>>>>>
> > > >>>>>>>> MaterializedTable materialize()
> > > >>>>>>>> MaterializedTable createMaterializedView()
> > > >>>>>>>>
> > > >>>>>>>> ?
> > > >>>>>>>>
> > > >>>>>>>> The second option with returning a handle I think is more
> > flexible
> > > >>>>> and
> > > >>>>>>>> could provide features such as “refresh”/“delete” or generally
> > > >>>>> speaking
> > > >>>>>>>> manage the the view. In the future we could also think about
> > > adding
> > > >>>>>> hooks
> > > >>>>>>>> to automatically refresh view etc. It is also more explicit -
> > > >>>>>>>> materialization returning a new table handle will not have the
> > > same
> > > >>>>>>>> implicit side effects as adding a simple line of code like
> > > >>>>> `b.cache()`
> > > >>>>>>>> would have.
> > > >>>>>>>>
> > > >>>>>>>> It would also be more SQL like, making it more intuitive for
> > users
> > > >>>>>>> already
> > > >>>>>>>> familiar with the SQL.
> > > >>>>>>>>
> > > >>>>>>>> Piotrek
> > > >>>>>>>>
> > > >>>>>>>>> On 23 Nov 2018, at 14:53, Becket Qin <becket.qin@xxxxxxxxx>
> > > wrote:
> > > >>>>>>>>>
> > > >>>>>>>>> Hi Piotrek,
> > > >>>>>>>>>
> > > >>>>>>>>> For the cache() method itself, yes, it is equivalent to
> > creating
> > > a
> > > >>>>>>>> BUILT-IN
> > > >>>>>>>>> materialized view with a lifecycle. That functionality is
> > missing
> > > >>>>>>> today,
> > > >>>>>>>>> though. Not sure if I understand your question. Do you mean
> we
> > > >>>>>> already
> > > >>>>>>>> have
> > > >>>>>>>>> the functionality and just need a syntax sugar?
> > > >>>>>>>>>
> > > >>>>>>>>> What's more interesting in the proposal is do we want to stop
> > at
> > > >>>>>>> creating
> > > >>>>>>>>> the materialized view? Or do we want to extend that in the
> > future
> > > >>>>> to
> > > >>>>>> a
> > > >>>>>>>> more
> > > >>>>>>>>> useful unified data store distributed with Flink? And do we
> > want
> > > to
> > > >>>>>>> have
> > > >>>>>>>> a
> > > >>>>>>>>> mechanism allow more flexible user job pattern with their own
> > > user
> > > >>>>>>>> defined
> > > >>>>>>>>> services. These considerations are much more architectural.
> > > >>>>>>>>>
> > > >>>>>>>>> Thanks,
> > > >>>>>>>>>
> > > >>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>
> > > >>>>>>>>> On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski <
> > > >>>>>>> piotr@xxxxxxxxxxxxxxxxx>
> > > >>>>>>>>> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> Hi,
> > > >>>>>>>>>>
> > > >>>>>>>>>> Interesting idea. I’m trying to understand the problem.
> Isn’t
> > > the
> > > >>>>>>>>>> `cache()` call an equivalent of writing data to a sink and
> > later
> > > >>>>>>> reading
> > > >>>>>>>>>> from it? Where this sink has a limited live scope/live time?
> > And
> > > >>>>> the
> > > >>>>>>>> sink
> > > >>>>>>>>>> could be implemented as in memory or a file sink?
> > > >>>>>>>>>>
> > > >>>>>>>>>> If so, what’s the problem with creating a materialised view
> > > from a
> > > >>>>>>> table
> > > >>>>>>>>>> “b” (from your document’s example) and reusing this
> > materialised
> > > >>>>>> view
> > > >>>>>>>>>> later? Maybe we are lacking mechanisms to clean up
> > materialised
> > > >>>>>> views
> > > >>>>>>>> (for
> > > >>>>>>>>>> example when current session finishes)? Maybe we need some
> > > >>>>> syntactic
> > > >>>>>>>> sugar
> > > >>>>>>>>>> on top of it?
> > > >>>>>>>>>>
> > > >>>>>>>>>> Piotrek
> > > >>>>>>>>>>
> > > >>>>>>>>>>> On 23 Nov 2018, at 07:21, Becket Qin <becket.qin@xxxxxxxxx
> >
> > > >>>>> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thanks for the suggestion, Jincheng.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Yes, I think it makes sense to have a persist() with
> > > >>>>>>> lifecycle/defined
> > > >>>>>>>>>>> scope. I just added a section in the future work for this.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng sun <
> > > >>>>>>> sunjincheng121@xxxxxxxxx
> > > >>>>>>>>>
> > > >>>>>>>>>>> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> Hi Jiangjie,
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Thank you for the explanation about the name of
> `cache()`, I
> > > >>>>>>>> understand
> > > >>>>>>>>>> why
> > > >>>>>>>>>>>> you designed this way!
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Another idea is whether we can specify a lifecycle for
> data
> > > >>>>>>>> persistence?
> > > >>>>>>>>>>>> For example, persist (LifeCycle.SESSION), so that the user
> > is
> > > >>>>> not
> > > >>>>>>>>>> worried
> > > >>>>>>>>>>>> about data loss, and will clearly specify the time range
> for
> > > >>>>>> keeping
> > > >>>>>>>>>> time.
> > > >>>>>>>>>>>> At the same time, if we want to expand, we can also share
> > in a
> > > >>>>>>> certain
> > > >>>>>>>>>>>> group of session, for example:
> > LifeCycle.SESSION_GROUP(...), I
> > > >>>>> am
> > > >>>>>>> not
> > > >>>>>>>>>> sure,
> > > >>>>>>>>>>>> just an immature suggestion, for reference only!
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Bests,
> > > >>>>>>>>>>>> Jincheng
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Becket Qin <becket.qin@xxxxxxxxx> 于2018年11月23日周五
> 下午1:33写道:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> Re: Jincheng,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Thanks for the feedback. Regarding cache() v.s.
> persist(),
> > > >>>>>>>> personally I
> > > >>>>>>>>>>>>> find cache() to be more accurately describing the
> behavior,
> > > >>>>> i.e.
> > > >>>>>>> the
> > > >>>>>>>>>>>> Table
> > > >>>>>>>>>>>>> is cached for the session, but will be deleted after the
> > > >>>>> session
> > > >>>>>> is
> > > >>>>>>>>>>>> closed.
> > > >>>>>>>>>>>>> persist() seems a little misleading as people might think
> > the
> > > >>>>>> table
> > > >>>>>>>>>> will
> > > >>>>>>>>>>>>> still be there even after the session is gone.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Great point about mixing the batch and stream processing
> in
> > > the
> > > >>>>>>> same
> > > >>>>>>>>>> job.
> > > >>>>>>>>>>>>> We should absolutely move towards that goal. I imagine
> that
> > > >>>>> would
> > > >>>>>>> be
> > > >>>>>>>> a
> > > >>>>>>>>>>>> huge
> > > >>>>>>>>>>>>> change across the board, including sources, operators and
> > > >>>>>>>>>> optimizations,
> > > >>>>>>>>>>>> to
> > > >>>>>>>>>>>>> name some. Likely we will need several separate in-depth
> > > >>>>>>> discussions.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan Cui <
> > > >>>>> xingcanc@xxxxxxxxx>
> > > >>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Hi all,
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> @Shaoxuan, I think the lifecycle or access domain are
> both
> > > >>>>>>>> orthogonal
> > > >>>>>>>>>>>> to
> > > >>>>>>>>>>>>>> the cache problem. Essentially, this may be the first
> time
> > > we
> > > >>>>>> plan
> > > >>>>>>>> to
> > > >>>>>>>>>>>>>> introduce another storage mechanism other than the
> state.
> > > >>>>> Maybe
> > > >>>>>>> it’s
> > > >>>>>>>>>>>>> better
> > > >>>>>>>>>>>>>> to first draw a big picture and then concentrate on a
> > > specific
> > > >>>>>>> part?
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> @Becket, yes, actually I am more concerned with the
> > > underlying
> > > >>>>>>>>>> service.
> > > >>>>>>>>>>>>>> This seems to be quite a major change to the existing
> > > >>>>> codebase.
> > > >>>>>> As
> > > >>>>>>>> you
> > > >>>>>>>>>>>>>> claimed, the service should be extendible to support
> other
> > > >>>>>>>> components
> > > >>>>>>>>>>>> and
> > > >>>>>>>>>>>>>> we’d better discussed it in another thread.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> All in all, I also eager to enjoy the more interactive
> > Table
> > > >>>>>> API,
> > > >>>>>>> in
> > > >>>>>>>>>>>> case
> > > >>>>>>>>>>>>>> of a general and flexible enough service mechanism.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Best,
> > > >>>>>>>>>>>>>> Xingcan
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei Jiang <
> > > >>>>>> xiaoweij@xxxxxxxxx>
> > > >>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Relying on a callback for the temp table for clean up
> is
> > > not
> > > >>>>>> very
> > > >>>>>>>>>>>>>> reliable.
> > > >>>>>>>>>>>>>>> There is no guarantee that it will be executed
> > > successfully.
> > > >>>>> We
> > > >>>>>>> may
> > > >>>>>>>>>>>>> risk
> > > >>>>>>>>>>>>>>> leaks when that happens. I think that it's safer to
> have
> > an
> > > >>>>>>>>>>>> association
> > > >>>>>>>>>>>>>>> between temp table and session id. So we can always
> clean
> > > up
> > > >>>>>> temp
> > > >>>>>>>>>>>>> tables
> > > >>>>>>>>>>>>>>> which are no longer associated with any active
> sessions.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>>>>> Xiaowei
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng sun <
> > > >>>>>>>>>>>>> sunjincheng121@xxxxxxxxx>
> > > >>>>>>>>>>>>>>> wrote:
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Hi Jiangjie&Shaoxuan,
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Thanks for initiating this great proposal!
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Interactive Programming is very useful and user
> friendly
> > > in
> > > >>>>>> case
> > > >>>>>>>> of
> > > >>>>>>>>>>>>> your
> > > >>>>>>>>>>>>>>>> examples.
> > > >>>>>>>>>>>>>>>> Moreover, especially when a business has to be
> executed
> > in
> > > >>>>>>> several
> > > >>>>>>>>>>>>>> stages
> > > >>>>>>>>>>>>>>>> with dependencies,such as the pipeline of Flink ML, in
> > > order
> > > >>>>>> to
> > > >>>>>>>>>>>>> utilize
> > > >>>>>>>>>>>>>> the
> > > >>>>>>>>>>>>>>>> intermediate calculation results we have to submit a
> job
> > > by
> > > >>>>>>>>>>>>>> env.execute().
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> About the `cache()`  , I think is better to named
> > > >>>>> `persist()`,
> > > >>>>>>> And
> > > >>>>>>>>>>>> The
> > > >>>>>>>>>>>>>>>> Flink framework determines whether we internally cache
> > in
> > > >>>>>> memory
> > > >>>>>>>> or
> > > >>>>>>>>>>>>>> persist
> > > >>>>>>>>>>>>>>>> to the storage system,Maybe save the data into state
> > > backend
> > > >>>>>>>>>>>>>>>> (MemoryStateBackend or RocksDBStateBackend etc.)
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> BTW, from the points of my view in the future, support
> > for
> > > >>>>>>>> streaming
> > > >>>>>>>>>>>>> and
> > > >>>>>>>>>>>>>>>> batch mode switching in the same job will also benefit
> > in
> > > >>>>>>>>>>>> "Interactive
> > > >>>>>>>>>>>>>>>> Programming",  I am looking forward to your JIRAs and
> > > FLIP!
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Best,
> > > >>>>>>>>>>>>>>>> Jincheng
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Becket Qin <becket.qin@xxxxxxxxx> 于2018年11月20日周二
> > > 下午9:56写道:
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Hi all,
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> As a few recent email threads have pointed out, it
> is a
> > > >>>>>>> promising
> > > >>>>>>>>>>>>>>>>> opportunity to enhance Flink Table API in various
> > > aspects,
> > > >>>>>>>>>>>> including
> > > >>>>>>>>>>>>>>>>> functionality and ease of use among others. One of
> the
> > > >>>>>>> scenarios
> > > >>>>>>>>>>>>> where
> > > >>>>>>>>>>>>>> we
> > > >>>>>>>>>>>>>>>>> feel Flink could improve is interactive programming.
> To
> > > >>>>>> explain
> > > >>>>>>>> the
> > > >>>>>>>>>>>>>>>> issues
> > > >>>>>>>>>>>>>>>>> and facilitate the discussion on the solution, we put
> > > >>>>>> together
> > > >>>>>>>> the
> > > >>>>>>>>>>>>>>>>> following document with our proposal.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>
> > >
> >
> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Feedback and comments are very welcome!
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>>
> > >
> > >
> >
>