Re: [DISCUSS] Support Interactive Programming in Flink Table API


Yes, you are right, Becket: whether a consumer reads from a cached result
or not still depends on the actual execution of the job.

My point was actually about the properties of table a (cached vs. non-cached)
and not about the execution. I would not make cache() trigger the execution
of the job, because one loses some flexibility by eagerly triggering the
execution.

Like Piotr, I tried to argue for an explicit CachedTable which is returned
by the cache() method, in order to make the API more explicit.
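
To make the idea of an explicit handle concrete, here is a minimal toy sketch
in Java. It is not the Flink Table API: the classes Table and CachedTable and
the executions counter below are hypothetical stand-ins, assumed only for
illustration. The sketch shows that cache() stays lazy (nothing is executed
when it is called) and that only reads through the returned handle use the
cached result, while reads through the original table still re-run the plan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Toy model only: these classes are NOT the Flink Table API.
final class CacheHandleSketch {

    /** A lazily evaluated "table": every read re-runs the plan. */
    static class Table {
        private final Supplier<List<Integer>> plan;
        int executions = 0; // how many times this table's plan was run

        Table(Supplier<List<Integer>> plan) {
            this.plan = plan;
        }

        List<Integer> collect() {
            executions++;
            return plan.get();
        }

        int count() {
            return collect().size();
        }

        /** Returns an explicit handle; does not trigger execution. */
        CachedTable cache() {
            return new CachedTable(this);
        }
    }

    /** Explicit handle: only reads through this handle hit the cache. */
    static final class CachedTable extends Table {
        private final Table origin;
        private List<Integer> materialized; // filled on the first read

        CachedTable(Table origin) {
            super(null); // the plan lives in origin; collect() is overridden
            this.origin = origin;
        }

        @Override
        List<Integer> collect() {
            if (materialized == null) {
                materialized = new ArrayList<>(origin.collect());
            }
            return materialized;
        }
    }

    public static void main(String[] args) {
        Table a = new Table(() -> Arrays.asList(1, 2, 3));
        CachedTable cachedA = a.cache(); // lazy: nothing has run yet
        cachedA.count();                 // first read fills the cache
        cachedA.count();                 // served from the cache
        a.count();                       // bypasses the cache on purpose
        System.out.println("plan executions: " + a.executions);
    }
}
```

With a handle like this, a consumer defined on a never reads from the cache
by accident, and a consumer that wants the cache has to say so by using
cachedA.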

Cheers,
Till

On Mon, Dec 3, 2018 at 4:23 PM Becket Qin <becket.qin@xxxxxxxxx> wrote:

> Hi Till,
>
> That is a good example. Just a minor correction, in this case, b, c and d
> will all consume from a non-cached a. This is because cache will only be
> created on the very first job submission that generates the table to be
> cached.
>
> If I understand correctly, this example is about whether the .cache() method
> should be eagerly evaluated or lazily evaluated. In other words, if the
> cache() method actually triggered a job that creates the cache, there would
> be no such confusion. Is that right?
>
> In the example, although d will not consume from the cached Table while it
> looks like it should, from a correctness perspective the code will still
> return the correct result, assuming that tables are immutable.
>
> Personally I feel it is OK because users probably won't really worry about
> whether the table is cached or not. And lazy cache could avoid some
> unnecessary caching if a cached table is never created in the user
> application. But I am not opposed to doing eager evaluation of cache.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
> On Mon, Dec 3, 2018 at 10:01 PM Till Rohrmann <trohrmann@xxxxxxxxxx>
> wrote:
>
> > Another argument for Piotr's point is that lazily changing properties of a
> > node affects all downstream consumers but does not necessarily have to
> > happen before these consumers are defined. From a user's perspective this
> > can be quite confusing:
> >
> > b = a.map(...)
> > c = a.map(...)
> >
> > a.cache()
> > d = a.map(...)
> >
> > now b, c and d will consume from a cached operator. In this case, the user
> > would most likely expect that only d reads from a cached result.
> >
> > Cheers,
> > Till
> >
> > On Mon, Dec 3, 2018 at 11:32 AM Piotr Nowojski <piotr@xxxxxxxxxxxxxxxxx>
> > wrote:
> >
> > > Hey Shaoxuan and Becket,
> > >
> > > > Can you explain a bit more one what are the side effects? So far my
> > > > understanding is that such side effects only exist if a table is
> > mutable.
> > > > Is that the case?
> > >
> > > Not only that. There are also performance implications, and those are
> > > another implicit side effect of using `void cache()`. As I wrote before,
> > > reading from the cache might not always be desirable, so it can cause
> > > performance degradation, and I’m fine with that - it is the user's or
> > > optimiser’s choice. What I do not like is that this implicit side effect
> > > can manifest in a completely different part of the code, one that wasn’t
> > > touched by the user while he was adding a `void cache()` call somewhere
> > > else. And even if caching improves performance, it’s still a side effect
> > > of `void cache()`. Almost by definition, `void` methods have only side
> > > effects. As I wrote before, there are a couple of scenarios where this
> > > might be undesirable and/or unexpected, for example:
> > >
> > > 1.
> > > Table b = …;
> > > b.cache()
> > > x = b.join(…)
> > > y = b.count()
> > > // ...
> > > // 100
> > > // hundred
> > > // lines
> > > // of
> > > // code
> > > // later
> > > z = b.filter(…).groupBy(…) // this might be even hidden in a different
> > > // method/file/package/dependency
> > >
> > > 2.
> > >
> > > Table b = ...
> > > if (some_condition) {
> > >   foo(b)
> > > }
> > > else {
> > >   bar(b)
> > > }
> > > z = b.filter(…).groupBy(…)
> > >
> > >
> > > void foo(Table b) {
> > >   b.cache()
> > >   // do something with b
> > > }
> > >
> > > In both examples above, `b.cache()` will implicitly affect `z =
> > > b.filter(…).groupBy(…)` (its performance and, if the sources are
> > > mutable, the semantics of the program), which might be far from obvious.
> > >
> > > On top of that, there is still my argument that having a
> > > `MaterializedTable` or `CachedTable` handle is more flexible for us in
> > > the future and for the user (as a manual option to bypass cache reads).
> > >
> > > >  But Jiangjie is correct,
> > > > the source table in batching should be immutable. It is the user’s
> > > > responsibility to ensure it, otherwise even a regular failover may
> lead
> > > > to inconsistent results.
> > >
> > > Yes, I agree that’s what a perfect world/good deployment should be. But
> > > it often isn’t, and while I’m not trying to fix this (since the proper
> > > fix is to support transactions), I’m just trying to minimise confusion
> > > for the users that are not fully aware of what’s going on and operate in
> > > a less than perfect setup. And if something bites them after adding a
> > > `b.cache()` call, I want to make sure that they at least know all of the
> > > places that adding this line can affect.
> > >
> > > Thanks, Piotrek
> > >
> > > > On 1 Dec 2018, at 15:39, Becket Qin <becket.qin@xxxxxxxxx> wrote:
> > > >
> > > > Hi Piotrek,
> > > >
> > > > Thanks again for the clarification. Some more replies are following.
> > > >
> > > >> But keep in mind that `.cache()` will/might not only be used in
> > > >> interactive programming and not only in batching.
> > > >
> > > > It is true. Actually in stream processing, cache() has the same
> > > > semantics as in batch processing. The semantics are as follows:
> > > > For a table created via a series of computations, save that table for
> > > > later reference to avoid running the computation logic to regenerate
> > > > the table. Once the application exits, drop all the cache.
> > > > These semantics are the same for both batch and stream processing. The
> > > > difference is that stream applications will only run once as they are
> > > > long running, while batch applications may be run multiple times, hence
> > > > the cache may be created and dropped each time the application runs.
> > > > Admittedly, there will probably be some resource management
> > > > requirements for the streaming cached table, such as time based / size
> > > > based retention, to address the infinite data issue. But such
> > > > requirements do not change the semantics.
> > > > You are right that interactive programming is just one use case of
> > > > cache(). It is not the only use case.
> > > >
> > > > For me the more important issue is of not having the `void cache()`
> > with
> > > >> side effects.
> > > >
> > > > This is indeed the key point. The argument around whether cache()
> > should
> > > > return something already indicates that cache() and materialize()
> > address
> > > > different issues.
> > > > Can you explain a bit more one what are the side effects? So far my
> > > > understanding is that such side effects only exist if a table is
> > mutable.
> > > > Is that the case?
> > > >
> > > > I don’t know, probably initially we should make CachedTable
> read-only.
> > I
> > > >> don’t find it more confusing than the fact that user can not write
> to
> > > views
> > > >> or materialised views in SQL or that user currently can not write
> to a
> > > >> Table.
> > > >
> > > > I don't think anyone should insert something into a cache. By
> > > > definition the cache should only be updated when the corresponding
> > > > original table is updated. What I am wondering about is that, given
> > > > the following two facts:
> > > > 1. If and only if a table is mutable (with something like insert()), a
> > > > CachedTable may have implicit behavior.
> > > > 2. A CachedTable extends a Table.
> > > > we can come to the conclusion that a CachedTable is mutable and users
> > > > can insert into the CachedTable directly. This is what I found
> > > > confusing.
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > > On Sat, Dec 1, 2018 at 2:45 AM Piotr Nowojski <
> piotr@xxxxxxxxxxxxxxxxx
> > >
> > > > wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > >> Regarding naming `cache()` vs `materialize()`. One more explanation of
> > > >> why I think `materialize()` is more natural to me is that I think of
> > > >> all “Table”s in Table-API as views. They behave the same way as SQL
> > > >> views; the only difference for me is that their life scope is short -
> > > >> the current session, which is limited by a different execution model.
> > > >> That’s why “caching” a view for me is just materialising it.
> > > >>
> > > >> However I see and I understand your point of view. Coming from
> > > >> DataSet/DataStream and generally speaking non-SQL world, `cache()`
> is
> > > more
> > > >> natural. But keep in mind that `.cache()` will/might not only be
> used
> > in
> > > >> interactive programming and not only in batching. But naming is one
> > > issue,
> > > >> and not that critical to me. Especially that once we implement
> proper
> > > >> materialised views, we can always deprecate/rename `cache()` if we
> > deem
> > > so.
> > > >>
> > > >>
> > > >> For me the more important issue is of not having the `void cache()`
> > with
> > > >> side effects. Exactly for the reasons that you have mentioned. True:
> > > >> results might be non deterministic if underlying source table are
> > > changing.
> > > >> Problem is that `void cache()` implicitly changes the semantic of
> > > >> subsequent uses of the cached/materialized Table. It can cause “wtf”
> > > moment
> > > >> for a user if he inserts “b.cache()” call in some place in his code
> > and
> > > >> suddenly some other random places are behaving differently. If
> > > >> `materialize()` or `cache()` returns a Table handle, we force user
> to
> > > >> explicitly use the cache which removes the “random” part from the
> > > "suddenly
> > > >> some other random places are behaving differently”.
> > > >>
> > > >> This argument and others that I’ve raised (greater
> > flexibility/allowing
> > > >> user to explicitly bypass the cache) are independent of `cache()` vs
> > > >> `materialize()` discussion.
> > > >>
> > > >>> Does that mean one can also insert into the CachedTable? This
> sounds
> > > >> pretty confusing.
> > > >>
> > > >> I don’t know, probably initially we should make CachedTable
> > read-only. I
> > > >> don’t find it more confusing than the fact that user can not write
> to
> > > views
> > > >> or materialised views in SQL or that user currently can not write
> to a
> > > >> Table.
> > > >>
> > > >> Piotrek
> > > >>
> > > >>> On 30 Nov 2018, at 17:38, Xingcan Cui <xingcanc@xxxxxxxxx> wrote:
> > > >>>
> > > >>> Hi all,
> > > >>>
> > > >>> I agree with @Becket that `cache()` and `materialize()` should be
> > > >>> considered as two different methods where the latter one is more
> > > >>> sophisticated.
> > > >>>
> > > >>> According to my understanding, the initial idea is just to introduce
> > > >>> a simple cache or persist mechanism, but as the TableAPI is a
> > > >>> high-level API, it’s natural for us to think in a SQL way.
> > > >>>
> > > >>> Maybe we can add the `cache()` method to the DataSet API and force
> > > users
> > > >> to translate a Table to a Dataset before caching it. Then the users
> > > should
> > > >> manually register the cached dataset to a table again (we may need
> > some
> > > >> table replacement mechanisms for datasets with an identical schema
> but
> > > >> different contents here). After all, it’s the dataset rather than
> the
> > > >> dynamic table that need to be cached, right?
> > > >>>
> > > >>> Best,
> > > >>> Xingcan
> > > >>>
> > > >>>> On Nov 30, 2018, at 10:57 AM, Becket Qin <becket.qin@xxxxxxxxx>
> > > wrote:
> > > >>>>
> > > >>>> Hi Piotrek and Jark,
> > > >>>>
> > > >>>> Thanks for the feedback and explanation. Those are good arguments.
> > > But I
> > > >>>> think those arguments are mostly about materialized view. Let me
> try
> > > to
> > > >>>> explain the reason I believe cache() and materialize() are
> > different.
> > > >>>>
> > > >>>> I think cache() and materialize() have quite different
> implications.
> > > An
> > > >>>> analogy I can think of is save()/publish(). When users call
> cache(),
> > > it
> > > >> is
> > > >>>> just like they are saving an intermediate result as a draft of
> their
> > > >> work,
> > > >>>> this intermediate result may not have any realistic meaning.
> Calling
> > > >>>> cache() does not mean users want to publish the cached table in
> any
> > > >> manner.
> > > >>>> But when users call materialize(), that means "I have something
> > > >> meaningful
> > > >>>> to be reused by others", now users need to think about the
> > validation,
> > > >>>> update & versioning, lifecycle of the result, etc.
> > > >>>>
> > > >>>> Piotrek's suggestions on variations of the materialize() methods
> are
> > > >> very
> > > >>>> useful. It would be great if Flink have them. The concept of
> > > >> materialized
> > > >>>> view is actually a pretty big feature, not to say the related
> stuff
> > > like
> > > >>>> triggers/hooks you mentioned earlier. I think the materialized
> view
> > > >> itself
> > > >>>> should be discussed in a more thorough and systematic manner. And
> I
> > > >> found
> > > >>>> that discussion is kind of orthogonal and way beyond interactive
> > > >>>> programming experience.
> > > >>>>
> > > >>>> The example you gave was interesting. I still have some questions,
> > > >> though.
> > > >>>>
> > > >>>> Table source = … // some source that scans files from a directory
> > > >>>>> “/foo/bar/“
> > > >>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> > > >>>>> Table t2 = t1.materialize() // (or `cache()`)
> > > >>>>
> > > >>>> t2.count() // initialise cache (if it’s lazily initialised)
> > > >>>>> int a1 = t1.count()
> > > >>>>> int b1 = t2.count()
> > > >>>>> // something in the background (or we trigger it) writes new
> files
> > to
> > > >>>>> /foo/bar
> > > >>>>> int a2 = t1.count()
> > > >>>>> int b2 = t2.count()
> > > >>>>> t2.refresh() // possible future extension, not to be implemented
> in
> > > the
> > > >>>>> initial version
> > > >>>>>
> > > >>>>
> > > >>>> What if someone else added some more files to /foo/bar at this
> > > >>>> point? In that case, a3 won't equal b3, and the result becomes
> > > >>>> non-deterministic, right?
> > > >>>>
> > > >>>> int a3 = t1.count()
> > > >>>>> int b3 = t2.count()
> > > >>>>> t2.drop() // another possible future extension, manual “cache”
> > > dropping
> > > >>>>
> > > >>>>
> > > >>>> When we talk about interactive programming, in most cases, we are
> > > >> talking
> > > >>>> about batch applications. A fundamental assumption of such case is
> > > that
> > > >> the
> > > >>>> source data is complete before the data processing begins, and the
> > > data
> > > >>>> will not change during the data processing. IMO, if additional
> rows
> > > >> needs
> > > >>>> to be added to some source during the processing, it should be
> done
> > in
> > > >> ways
> > > >>>> like union the source with another table containing the rows to be
> > > >> added.
> > > >>>>
> > > >>>> There are a few cases that computations are executed repeatedly on
> > the
> > > >>>> changing data source.
> > > >>>>
> > > >>>> For example, people may run a ML training job every hour with the
> > > >> samples
> > > >>>> newly added in the past hour. In that case, the source data
> between
> > > will
> > > >>>> indeed change. But still, the data remain unchanged within one
> run.
> > > And
> > > >>>> usually in that case, the result will need versioning, i.e. for a
> > > given
> > > >>>> result, it tells that the result is a result from the source data
> > by a
> > > >>>> certain timestamp.
> > > >>>>
> > > >>>> Another example is something like data warehouse. In this case,
> > there
> > > >> are a
> > > >>>> few source of original/raw data. On top of those sources, many
> > > >> materialized
> > > >>>> view / queries / reports / dashboards can be created to generate
> > > derived
> > > >>>> data. Those derived data needs to be updated when the underlying
> > > >> original
> > > >>>> data changes. In that case, the processing logic that derives the
> > > >> original
> > > >>>> data needs to be executed repeatedly to update those
> reports/views.
> > > >> Again,
> > > >>>> all those derived data also need to have version management, such
> as
> > > >>>> timestamp.
> > > >>>>
> > > >>>> In any of the above two cases, during a single run of the
> processing
> > > >> logic,
> > > >>>> the data cannot change. Otherwise the behavior of the processing
> > logic
> > > >> may
> > > >>>> be undefined. In the above two examples, when writing the
> processing
> > > >> logic,
> > > >>>> Users can use .cache() to hint Flink that those results should be
> > > saved
> > > >> to
> > > >>>> avoid repeated computation. And then for the result of my
> > application
> > > >>>> logic, I'll call materialize(), so that these results could be
> > managed
> > > >> by
> > > >>>> the system with versioning, metadata management, lifecycle
> > management,
> > > >>>> ACLs, etc.
> > > >>>>
> > > >>>> It is true we can use materialize() to do the cache() job, but I
> am
> > > >> really
> > > >>>> reluctant to shoehorn cache() into materialize() and force users
> to
> > > >> worry
> > > >>>> about a bunch of implications that they needn't have to. I am
> > > >> absolutely on
> > > >>>> your side that redundant API is bad. But it is equally
> frustrating,
> > if
> > > >> not
> > > >>>> more, that the same API does different things.
> > > >>>>
> > > >>>> Thanks,
> > > >>>>
> > > >>>> Jiangjie (Becket) Qin
> > > >>>>
> > > >>>>
> > > >>>> On Fri, Nov 30, 2018 at 10:34 PM Shaoxuan Wang <
> wshaoxuan@xxxxxxxxx
> > >
> > > >> wrote:
> > > >>>>
> > > >>>>> Thanks Piotrek,
> > > >>>>> You provided a very good example; it explains all the confusion I
> > > >>>>> have. It is clear that there is something we have not considered in
> > > >>>>> the initial proposal. We intended to force the user to reuse the
> > > >>>>> cached/materialized table if its cache() method is executed. We did
> > > >>>>> not expect that the user may want to re-execute the plan from the
> > > >>>>> source table. Let me re-think it and get back to you later.
> > > >>>>>
> > > >>>>> In the meantime, this example/observation also implies that we
> > > >>>>> cannot fully involve the optimizer to decide the plan if a
> > > >>>>> cache/materialize is explicitly used, because whether to reuse the
> > > >>>>> cached data or re-execute the query from source data may lead to
> > > >>>>> different results. (But I guess the optimizer can still help in some
> > > >>>>> cases ---- as long as it does not re-execute from the varied source,
> > > >>>>> we should be safe).
> > > >>>>>
> > > >>>>> Regards,
> > > >>>>> Shaoxuan
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> On Fri, Nov 30, 2018 at 9:13 PM Piotr Nowojski <
> > > >> piotr@xxxxxxxxxxxxxxxxx>
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>>> Hi Shaoxuan,
> > > >>>>>>
> > > >>>>>> Re 2:
> > > >>>>>>
> > > >>>>>>> Table t3 = methodThatAppliesOperators(t1) // t1 is modified
> to->
> > > t1’
> > > >>>>>>
> > > >>>>>> What do you mean that “ t1 is modified to-> t1’ ” ? That
> > > >>>>>> `methodThatAppliesOperators()` method has changed it’s plan?
> > > >>>>>>
> > > >>>>>> I was thinking more about something like this:
> > > >>>>>>
> > > >>>>>> Table source = … // some source that scans files from a
> directory
> > > >>>>>> “/foo/bar/“
> > > >>>>>> Table t1 = source.groupBy(…).select(…).where(…) ….;
> > > >>>>>> Table t2 = t1.materialize() // (or `cache()`)
> > > >>>>>>
> > > >>>>>> t2.count() // initialise cache (if it’s lazily initialised)
> > > >>>>>>
> > > >>>>>> int a1 = t1.count()
> > > >>>>>> int b1 = t2.count()
> > > >>>>>>
> > > >>>>>> // something in the background (or we trigger it) writes new
> files
> > > to
> > > >>>>>> /foo/bar
> > > >>>>>>
> > > >>>>>> int a2 = t1.count()
> > > >>>>>> int b2 = t2.count()
> > > >>>>>>
> > > >>>>>> t2.refresh() // possible future extension, not to be implemented
> > in
> > > >> the
> > > >>>>>> initial version
> > > >>>>>>
> > > >>>>>> int a3 = t1.count()
> > > >>>>>> int b3 = t2.count()
> > > >>>>>>
> > > >>>>>> t2.drop() // another possible future extension, manual “cache”
> > > >> dropping
> > > >>>>>>
> > > >>>>>> assertTrue(a1 == b1) // same results, but b1 comes from the “cache”
> > > >>>>>> assertTrue(b1 == b2) // both values come from the same cache
> > > >>>>>> assertTrue(a2 > b2) // b2 comes from cache, a2 re-executed full
> > > >>>>>> // table scan and has more data
> > > >>>>>> assertTrue(b3 > b2) // b3 comes from refreshed cache
> > > >>>>>> assertTrue(b3 == a2 && a2 == a3)
> > > >>>>>>
> > > >>>>>> Piotrek
> > > >>>>>>
> > > >>>>>>> On 30 Nov 2018, at 10:22, Jark Wu <imjark@xxxxxxxxx> wrote:
> > > >>>>>>>
> > > >>>>>>> Hi,
> > > >>>>>>>
> > > >>>>>>> It is a very interesting and useful design!
> > > >>>>>>>
> > > >>>>>>> Here I want to share some of my thoughts:
> > > >>>>>>>
> > > >>>>>>> 1. Agree with that cache() method should return some Table to
> > avoid
> > > >>>>> some
> > > >>>>>>> unexpected problems because of the mutable object.
> > > >>>>>>> All the existing methods of Table are returning a new Table
> > > instance.
> > > >>>>>>>
> > > >>>>>>> 2. I think materialize() would be more consistent with SQL,
> this
> > > >> makes
> > > >>>>> it
> > > >>>>>>> possible to support the same feature for SQL (materialize view)
> > and
> > > >>>>> keep
> > > >>>>>>> the same API for users in the future.
> > > >>>>>>> But I'm also fine if we choose cache().
> > > >>>>>>>
> > > >>>>>>> 3. In the proposal, a TableService (or FlinkService?) is used
> to
> > > >> cache
> > > >>>>>> the
> > > >>>>>>> result of the (intermediate) table.
> > > >>>>>>> But the name TableService may be a bit too general and is not easy
> > > >>>>>>> to understand correctly at first glance (a metastore for tables?).
> > > >>>>>>> Maybe a more specific name would be better, such as
> > > >>>>>>> TableCacheService or TableMaterializeService or something else.
> > > >>>>>>>
> > > >>>>>>> Best,
> > > >>>>>>> Jark
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> On Thu, 29 Nov 2018 at 21:16, Fabian Hueske <fhueske@xxxxxxxxx
> >
> > > >> wrote:
> > > >>>>>>>
> > > >>>>>>>> Hi,
> > > >>>>>>>>
> > > >>>>>>>> Thanks for the clarification Becket!
> > > >>>>>>>>
> > > >>>>>>>> I have a few thoughts to share / questions:
> > > >>>>>>>>
> > > >>>>>>>> 1) I'd like to know how you plan to implement the feature on a
> > > plan
> > > >> /
> > > >>>>>>>> planner level.
> > > >>>>>>>>
> > > >>>>>>>> I would imagine the following to happen when Table.cache() is
> > > >>>>>>>> called:
> > > >>>>>>>>
> > > >>>>>>>> 1) immediately optimize the Table and internally convert it
> > into a
> > > >>>>>>>> DataSet/DataStream. This is necessary, to avoid that operators
> > of
> > > >>>>> later
> > > >>>>>>>> queries on top of the Table are pushed down.
> > > >>>>>>>> 2) register the DataSet/DataStream as a
> > DataSet/DataStream-backed
> > > >>>>> Table
> > > >>>>>> X
> > > >>>>>>>> 3) add a sink to the DataSet/DataStream. This is the
> > > materialization
> > > >>>>> of
> > > >>>>>> the
> > > >>>>>>>> Table X
> > > >>>>>>>>
> > > >>>>>>>> Based on your proposal the following would happen:
> > > >>>>>>>>
> > > >>>>>>>> Table t1 = ....
> > > >>>>>>>> t1.cache(); // cache() returns void. The logical plan of t1 is
> > > >>>>> replaced
> > > >>>>>> by
> > > >>>>>>>> a scan of X. There is also a reference to the materialization
> of
> > > X.
> > > >>>>>>>>
> > > >>>>>>>> t1.count(); // this executes the program, including the
> > > >>>>>> DataSet/DataStream
> > > >>>>>>>> that backs X and the sink that writes the materialization of X
> > > >>>>>>>> t1.count(); // this executes the program, but reads X from the
> > > >>>>>>>> materialization.
> > > >>>>>>>>
> > > >>>>>>>> My question is, how do you determine whether the scan of t1
> > > >>>>>>>> should go against the DataSet/DataStream program and when against
> > > >>>>>>>> the materialization?
> > > >>>>>>>> AFAIK, there is no hook that will tell you that a part of the
> > > >> program
> > > >>>>>> was
> > > >>>>>>>> executed. Flipping a switch during optimization or plan
> > generation
> > > >> is
> > > >>>>>> not
> > > >>>>>>>> sufficient as there is no guarantee that the plan is also
> > > executed.
> > > >>>>>>>>
> > > >>>>>>>> Overall, this behavior is somewhat similar to what I proposed
> in
> > > >>>>>>>> FLINK-8950, which does not include persisting the table, but
> > just
> > > >>>>>>>> optimizing and reregistering it as DataSet/DataStream scan.
> > > >>>>>>>>
> > > >>>>>>>> 2) I think Piotr has a point about the implicit behavior and
> > side
> > > >>>>>> effects
> > > >>>>>>>> of the cache() method if it does not return anything.
> > > >>>>>>>> Consider the following example:
> > > >>>>>>>>
> > > >>>>>>>> Table t1 = ???
> > > >>>>>>>> Table t2 = methodThatAppliesOperators(t1);
> > > >>>>>>>> Table t3 = methodThatAppliesOtherOperators(t1);
> > > >>>>>>>>
> > > >>>>>>>> In this case, the behavior/performance of the plan that
> results
> > > from
> > > >>>>> the
> > > >>>>>>>> second method call depends on whether t1 was modified by the
> > first
> > > >>>>>> method
> > > >>>>>>>> or not.
> > > >>>>>>>> This is the classic issue of mutable vs. immutable objects.
> > > >>>>>>>> Also, as Piotr pointed out, it might also be good to have the
> > > >> original
> > > >>>>>> plan
> > > >>>>>>>> of t1, because in some cases it is possible to push filters
> down
> > > >> such
> > > >>>>>> that
> > > >>>>>>>> evaluating the query from scratch might be more efficient than
> > > >>>>> accessing
> > > >>>>>>>> the cache.
> > > >>>>>>>> Moreover, a CachedTable could extend Table() and offer a
> method
> > > >>>>>> refresh().
> > > >>>>>>>> This sounds quite useful in an interactive session mode.
> > > >>>>>>>>
> > > >>>>>>>> 3) Regarding the name, I can see both arguments. IMO,
> > > materialize()
> > > >>>>>> seems
> > > >>>>>>>> to be more future proof.
> > > >>>>>>>>
> > > >>>>>>>> Best, Fabian
> > > >>>>>>>>
> > > >>>>>>>> Am Do., 29. Nov. 2018 um 12:56 Uhr schrieb Shaoxuan Wang <
> > > >>>>>>>> wshaoxuan@xxxxxxxxx>:
> > > >>>>>>>>
> > > >>>>>>>>> Hi Piotr,
> > > >>>>>>>>>
> > > >>>>>>>>> Thanks for sharing your ideas on the method naming. We will
> > think
> > > >>>>> about
> > > >>>>>>>>> your suggestions. But I don't understand why we need to
> change
> > > the
> > > >>>>>> return
> > > >>>>>>>>> type of cache().
> > > >>>>>>>>>
> > > >>>>>>>>> Cache() is a physical operation, it does not change the logic
> > of
> > > >>>>>>>>> the `Table`. On the tableAPI layer, we should not introduce a
> > new
> > > >>>>> table
> > > >>>>>>>>> type unless the logic of table has been changed. If we
> > introduce
> > > a
> > > >>>>> new
> > > >>>>>>>>> table type `CachedTable`, we need create the same set of
> > methods
> > > of
> > > >>>>>>>> `Table`
> > > >>>>>>>>> for it. I don't think it is worth doing this. Or can you
> please
> > > >>>>>> elaborate
> > > >>>>>>>>> more on what could be the "implicit behaviours/side effects"
> > you
> > > >> are
> > > >>>>>>>>> thinking about?
> > > >>>>>>>>>
> > > >>>>>>>>> Regards,
> > > >>>>>>>>> Shaoxuan
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <
> > > >>>>>> piotr@xxxxxxxxxxxxxxxxx>
> > > >>>>>>>>> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> Hi Becket,
> > > >>>>>>>>>>
> > > >>>>>>>>>> Thanks for the response.
> > > >>>>>>>>>>
> > > >>>>>>>>>> 1. I wasn’t saying that materialised view must be mutable or
> > > not.
> > > >>>>> The
> > > >>>>>>>>> same
> > > >>>>>>>>>> thing applies to caches as well. To the contrary, I would
> > expect
> > > >>>>> more
> > > >>>>>>>>>> consistency and updates from something that is called
> “cache”
> > vs
> > > >>>>>>>>> something
> > > >>>>>>>>>> that’s a “materialised view”. In other words, IMO most
> caches
> > do
> > > >> not
> > > >>>>>>>>> serve
> > > >>>>>>>>>> you invalid/outdated data and they handle updates on their
> > own.
> > > >>>>>>>>>>
> > > >>>>>>>>>> 2. I don’t think that having in the future two very similar
> > > >> concepts
> > > >>>>>> of
> > > >>>>>>>>>> `materialized` view and `cache` is a good idea. It would be
> > > >>>>> confusing
> > > >>>>>>>> for
> > > >>>>>>>>>> the users. I think it could be handled by
> > variations/overloading
> > > >> of
> > > >>>>>>>>>> materialised view concept. We could start with:
> > > >>>>>>>>>>
> > > >>>>>>>>>> `MaterializedTable materialize()` - immutable, session life scope
> > > >>>>>>>>>> (basically the same semantic as you are proposing)
> > > >>>>>>>>>>
> > > >>>>>>>>>> And then in the future (if ever) build on top of that/expand
> > it
> > > >>>>> with:
> > > >>>>>>>>>>
> > > >>>>>>>>>> `MaterializedTable materialize(refreshTime=…)` or
> > > >> `MaterializedTable
> > > >>>>>>>>>> materialize(refreshHook=…)`
> > > >>>>>>>>>>
> > > >>>>>>>>>> Or with cross session support:
> > > >>>>>>>>>>
> > > >>>>>>>>>> `MaterializedTable materializeInto(connector=…)` or
> > > >>>>> `MaterializedTable
> > > >>>>>>>>>> materializeInto(tableFactory=…)`
> > > >>>>>>>>>>
> > > >>>>>>>>>> I’m not saying that we should implement cross
> > session/refreshing
> > > >> now
> > > >>>>>> or
> > > >>>>>>>>>> even in the near future. I’m just arguing that naming
> current
> > > >>>>>> immutable
> > > >>>>>>>>>> session life scope method `materialize()` is more future
> proof
> > > and
> > > >>>>>> more
> > > >>>>>>>>>> consistent with SQL (on which after all table-api is heavily
> > > >> basing
> > > >>>>>>>> on).
> > > >>>>>>>>>>
> > > >>>>>>>>>> 3. Even if we agree on naming it `cache()`, I would still
> > insist
> > > >> on
> > > >>>>>>>>>> `cache()` returning `CachedTable` handle to avoid implicit
> > > >>>>>>>>> behaviours/side
> > > >>>>>>>>>> effects and to give both us & users more flexibility.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Piotrek
> > > >>>>>>>>>>
> > > >>>>>>>>>>> On 29 Nov 2018, at 06:20, Becket Qin <becket.qin@xxxxxxxxx
> >
> > > >> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Just to add a little bit, the materialized view is probably
> > > more
> > > >>>>>>>>> similar
> > > >>>>>>>>>> to
> > > >>>>>>>>>>> the persistent() brought up earlier in the thread. So it is
> > > >> usually
> > > >>>>>>>>> cross
> > > >>>>>>>>>>> session and could be used in a larger scope. For example, a
> > > >>>>>>>>> materialized
> > > >>>>>>>>>>> view created by user A may be visible to user B. It is
> > probably
> > > >>>>>>>>> something
> > > >>>>>>>>>>> we want to have in the future. I'll put it in the future
> work
> > > >>>>>>>> section.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <
> > > becket.qin@xxxxxxxxx
> > > >>>
> > > >>>>>>>>> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> Hi Piotrek,
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Thanks for the explanation.
> > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Right now we are mostly thinking of the cached table as immutable. I can
> > > > >>>>>>>>>>>> see that a materialized view would be useful in the future. That said, I
> > > > >>>>>>>>>>>> think a simple cache mechanism is probably still needed. So to me, cache()
> > > > >>>>>>>>>>>> and materialize() should be two separate methods, as they address different
> > > > >>>>>>>>>>>> needs. materialize() is a higher-level concept, usually implying periodic
> > > > >>>>>>>>>>>> updates, while cache() has much simpler semantics. For example, one may
> > > > >>>>>>>>>>>> create a materialized view and use the cache() method in the materialized
> > > > >>>>>>>>>>>> view creation logic, so that during the materialized view update they do
> > > > >>>>>>>>>>>> not need to worry about the cached table also changing. Maybe under the
> > > > >>>>>>>>>>>> hood materialize() and cache() could share some mechanism, but I think a
> > > > >>>>>>>>>>>> simple cache() method would be handy in a lot of cases.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski <piotr@xxxxxxxxxxxxxxxxx> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> Hi Becket,
> > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Is there any extra thing user can do on a MaterializedTable that they
> > > > >>>>>>>>>>>>>> cannot do on a Table?
> > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Maybe not in the initial implementation, but various DBs offer different
> > > > >>>>>>>>>>>>> ways to “refresh” the materialised view: hooks, triggers, timers, manually,
> > > > >>>>>>>>>>>>> etc. Having `MaterializedTable` would help us to handle that in the future.
> > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> After users call *table.cache()*, users can just use that table and do
> > > > >>>>>>>>>>>>>> anything that is supported on a Table, including SQL.
> > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> This is some implicit behaviour with side effects. Imagine a user has a
> > > > >>>>>>>>>>>>> long and complicated program that touches table `b` multiple times, maybe
> > > > >>>>>>>>>>>>> scattered around different methods. If he modifies his program by inserting
> > > > >>>>>>>>>>>>> in one place
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> b.cache()
> > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> This implicitly alters the semantics and behaviour of his code all over
> > > > >>>>>>>>>>>>> the place, maybe in ways that might cause problems. For example, what if
> > > > >>>>>>>>>>>>> the underlying data is changing?
> > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Having invisible side effects is also not very clean. For example, think
> > > > >>>>>>>>>>>>> about something like this (but more complicated):
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Table b = ...;
> > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> if (some_condition) {
> > > > >>>>>>>>>>>>>   processTable1(b);
> > > > >>>>>>>>>>>>> } else {
> > > > >>>>>>>>>>>>>   processTable2(b);
> > > > >>>>>>>>>>>>> }
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> // do more stuff with b
> > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> And the user adds a `b.cache()` call to only one of the `processTable1` or
> > > > >>>>>>>>>>>>> `processTable2` methods.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On the other hand
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Table materialisedB = b.materialize()
> > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Avoids (at least some of) the side effect issues and forces the user to
> > > > >>>>>>>>>>>>> explicitly use `materialisedB` where it’s appropriate, and forces the user
> > > > >>>>>>>>>>>>> to think about what it actually means. And if something doesn’t work in the
> > > > >>>>>>>>>>>>> end for the user, he will know what he has changed instead of blaming
> > > > >>>>>>>>>>>>> Flink for some “magic” underneath. In the above example, after
> > > > >>>>>>>>>>>>> materialising b in only one of the methods, he should/would realise the
> > > > >>>>>>>>>>>>> issue when handling the return value `MaterializedTable` of that method.
> > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> I guess it comes down to personal preference whether you like things to be
> > > > >>>>>>>>>>>>> implicit or not. The more of a power user someone is, probably the more
> > > > >>>>>>>>>>>>> likely he is to like/understand implicit behaviour. And we as Table API
> > > > >>>>>>>>>>>>> designers are the biggest power users out there, so I would proceed with
> > > > >>>>>>>>>>>>> caution (so that we do not end up in the crazy Perl realm with its lovely
> > > > >>>>>>>>>>>>> implicit method arguments ;) <https://stackoverflow.com/a/14922656/8149051>)
> > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Table API to also support non-relational processing cases, cache()
> > > > >>>>>>>>>>>>>> might be slightly better.
> > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> I think even such an extended Table API could benefit from sticking
> > > > >>>>>>>>>>>>> to/being consistent with SQL, where both SQL and the Table API are basically
> > > > >>>>>>>>>>>>> the same.
> > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> One more thing. `MaterializedTable materialize()` could be more
> > > > >>>>>>>>>>>>> powerful/flexible, allowing the user to operate on both the materialised
> > > > >>>>>>>>>>>>> and the non-materialised view at the same time for whatever reason
> > > > >>>>>>>>>>>>> (underlying data changing, better optimisation opportunities after pushing
> > > > >>>>>>>>>>>>> down more filters, etc.). For example:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Table b = …;
> > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> MaterializedTable mb = b.materialize();
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Val min = mb.min();
> > > >>>>>>>>>>>>> Val max = mb.max();
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Val user42 = b.filter(‘userId = 42);
> > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Could be more efficient compared to `b.cache()` if `filter(‘userId = 42);`
> > > > >>>>>>>>>>>>> allows for much more aggressive optimisations.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Piotrek
> > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> On 26 Nov 2018, at 12:14, Fabian Hueske <fhueske@xxxxxxxxx> wrote:
> > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> I'm not suggesting to add support for Ignite. This was just an example.
> > > > >>>>>>>>>>>>>> Plasma and Arrow sound interesting, too.
> > > > >>>>>>>>>>>>>> For the sake of this proposal, it would be up to the user to implement a
> > > > >>>>>>>>>>>>>> TableFactory and corresponding TableSource / TableSink classes to persist
> > > > >>>>>>>>>>>>>> and read the data.
> > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 12:06 PM Flavio Pompermaier <pompermaier@xxxxxxxx> wrote:
> > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> What about also adding Apache Plasma + Arrow as an alternative to Apache
> > > > >>>>>>>>>>>>>>> Ignite?
> > > > >>>>>>>>>>>>>>> [1] https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
> > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske <fhueske@xxxxxxxxx> wrote:
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Hi,
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Thanks for the proposal!
> > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> To summarize, you propose a new method Table.cache(): Table that will
> > > > >>>>>>>>>>>>>>>> trigger a job and write the result into some temporary storage as defined
> > > > >>>>>>>>>>>>>>>> by a TableFactory.
> > > > >>>>>>>>>>>>>>>> The cache() call blocks while the job is running and eventually returns a
> > > > >>>>>>>>>>>>>>>> Table object that represents a scan of the temporary table.
> > > > >>>>>>>>>>>>>>>> When the "session" is closed (closing to be defined?), the temporary
> > > > >>>>>>>>>>>>>>>> tables are all dropped.
> > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> I think this behavior makes sense and is a good first step towards more
> > > > >>>>>>>>>>>>>>>> interactive workloads.
> > > > >>>>>>>>>>>>>>>> However, its performance suffers from writing to and reading from external
> > > > >>>>>>>>>>>>>>>> systems.
> > > > >>>>>>>>>>>>>>>> I think this is OK for now. Changes that would significantly improve the
> > > > >>>>>>>>>>>>>>>> situation (i.e., pinning data in-memory across jobs) would have large
> > > > >>>>>>>>>>>>>>>> impacts on many components of Flink.
> > > > >>>>>>>>>>>>>>>> Users could use in-memory filesystems or storage grids (Apache Ignite) to
> > > > >>>>>>>>>>>>>>>> mitigate some of the performance effects.
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> Best, Fabian
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> On Mon, Nov 26, 2018 at 3:38 AM Becket Qin <becket.qin@xxxxxxxxx> wrote:
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Thanks for the explanation, Piotrek.
> > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Is there any extra thing user can do on a MaterializedTable that they
> > > > >>>>>>>>>>>>>>>>> cannot do on a Table? After users call *table.cache()*, users can just
> > > > >>>>>>>>>>>>>>>>> use that table and do anything that is supported on a Table, including SQL.
> > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Naming wise, either cache() or materialize() sounds fine to me. cache()
> > > > >>>>>>>>>>>>>>>>> is a bit more general than materialize(). Given that we are enhancing the
> > > > >>>>>>>>>>>>>>>>> Table API to also support non-relational processing cases, cache() might
> > > > >>>>>>>>>>>>>>>>> be slightly better.
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr Nowojski <piotr@xxxxxxxxxxxxxxxxx> wrote:
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Hi Becket,
> > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> Oops, sorry, I didn’t notice that you intend to reuse the existing
> > > > >>>>>>>>>>>>>>>>>> `TableFactory`. I don’t know why, but I assumed that you want to provide
> > > > >>>>>>>>>>>>>>>>>> an alternate way of writing the data.
> > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> Now that I hopefully understand the proposal, maybe we could rename
> > > > >>>>>>>>>>>>>>>>>> `cache()` to
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> void materialize()
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> or going step further
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> MaterializedTable materialize()
> > > >>>>>>>>>>>>>>>>>> MaterializedTable createMaterializedView()
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> ?
> > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> The second option, returning a handle, is I think more flexible and could
> > > > >>>>>>>>>>>>>>>>>> provide features such as “refresh”/“delete”, or generally speaking manage
> > > > >>>>>>>>>>>>>>>>>> the view. In the future we could also think about adding hooks to
> > > > >>>>>>>>>>>>>>>>>> automatically refresh the view etc. It is also more explicit -
> > > > >>>>>>>>>>>>>>>>>> materialization returning a new table handle will not have the same
> > > > >>>>>>>>>>>>>>>>>> implicit side effects as adding a simple line of code like `b.cache()`
> > > > >>>>>>>>>>>>>>>>>> would have.
> > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> It would also be more SQL-like, making it more intuitive for users
> > > > >>>>>>>>>>>>>>>>>> already familiar with SQL.
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> Piotrek
> > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 14:53, Becket Qin <becket.qin@xxxxxxxxx> wrote:
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Hi Piotrek,
> > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> For the cache() method itself, yes, it is equivalent to creating a
> > > > >>>>>>>>>>>>>>>>>>> BUILT-IN materialized view with a lifecycle. That functionality is missing
> > > > >>>>>>>>>>>>>>>>>>> today, though. Not sure if I understand your question. Do you mean we
> > > > >>>>>>>>>>>>>>>>>>> already have the functionality and just need syntactic sugar?
> > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> What's more interesting in the proposal is: do we want to stop at creating
> > > > >>>>>>>>>>>>>>>>>>> the materialized view? Or do we want to extend that in the future to a
> > > > >>>>>>>>>>>>>>>>>>> more useful unified data store distributed with Flink? And do we want to
> > > > >>>>>>>>>>>>>>>>>>> have a mechanism that allows more flexible user job patterns with their
> > > > >>>>>>>>>>>>>>>>>>> own user-defined services? These considerations are much more
> > > > >>>>>>>>>>>>>>>>>>> architectural.
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski <piotr@xxxxxxxxxxxxxxxxx> wrote:
> > > >>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Hi,
> > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> Interesting idea. I’m trying to understand the problem. Isn’t the
> > > > >>>>>>>>>>>>>>>>>>>> `cache()` call an equivalent of writing data to a sink and later reading
> > > > >>>>>>>>>>>>>>>>>>>> from it, where this sink has a limited life scope/lifetime? And the sink
> > > > >>>>>>>>>>>>>>>>>>>> could be implemented as an in-memory or a file sink?
> > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> If so, what’s the problem with creating a materialised view from a table
> > > > >>>>>>>>>>>>>>>>>>>> “b” (from your document’s example) and reusing this materialised view
> > > > >>>>>>>>>>>>>>>>>>>> later? Maybe we are lacking mechanisms to clean up materialised views (for
> > > > >>>>>>>>>>>>>>>>>>>> example when the current session finishes)? Maybe we need some syntactic
> > > > >>>>>>>>>>>>>>>>>>>> sugar on top of it?
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>> Piotrek
> > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> On 23 Nov 2018, at 07:21, Becket Qin <becket.qin@xxxxxxxxx> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Thanks for the suggestion, Jincheng.
> > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> Yes, I think it makes sense to have a persist() with lifecycle/defined
> > > > >>>>>>>>>>>>>>>>>>>>> scope. I just added a section in the future work for this.
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng sun <sunjincheng121@xxxxxxxxx> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie,
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>> Thank you for the explanation about the name of `cache()`. I understand
> > > > >>>>>>>>>>>>>>>>>>>>>> why you designed it this way!
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>> Another idea is whether we can specify a lifecycle for data persistence.
> > > > >>>>>>>>>>>>>>>>>>>>>> For example, persist(LifeCycle.SESSION), so that the user is not worried
> > > > >>>>>>>>>>>>>>>>>>>>>> about data loss and clearly specifies how long the data is kept. At the
> > > > >>>>>>>>>>>>>>>>>>>>>> same time, if we want to expand, we can also share within a certain group
> > > > >>>>>>>>>>>>>>>>>>>>>> of sessions, for example: LifeCycle.SESSION_GROUP(...). I am not sure,
> > > > >>>>>>>>>>>>>>>>>>>>>> just an immature suggestion, for reference only!
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>> Bests,
> > > >>>>>>>>>>>>>>>>>>>>>> Jincheng
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 1:33 PM Becket Qin <becket.qin@xxxxxxxxx> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> Re: Jincheng,
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>> Thanks for the feedback. Regarding cache() vs. persist(), personally I
> > > > >>>>>>>>>>>>>>>>>>>>>>> find cache() to more accurately describe the behavior, i.e. the Table
> > > > >>>>>>>>>>>>>>>>>>>>>>> is cached for the session, but will be deleted after the session is
> > > > >>>>>>>>>>>>>>>>>>>>>>> closed. persist() seems a little misleading as people might think the
> > > > >>>>>>>>>>>>>>>>>>>>>>> table will still be there even after the session is gone.
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>> Great point about mixing the batch and stream processing in the same job.
> > > > >>>>>>>>>>>>>>>>>>>>>>> We should absolutely move towards that goal. I imagine that would be a
> > > > >>>>>>>>>>>>>>>>>>>>>>> huge change across the board, including sources, operators and
> > > > >>>>>>>>>>>>>>>>>>>>>>> optimizations, to name some. Likely we will need several separate
> > > > >>>>>>>>>>>>>>>>>>>>>>> in-depth discussions.
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan Cui <xingcanc@xxxxxxxxx> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>> @Shaoxuan, I think the lifecycle and access domain are both orthogonal to
> > > > >>>>>>>>>>>>>>>>>>>>>>>> the cache problem. Essentially, this may be the first time we plan to
> > > > >>>>>>>>>>>>>>>>>>>>>>>> introduce another storage mechanism other than the state. Maybe it’s
> > > > >>>>>>>>>>>>>>>>>>>>>>>> better to first draw a big picture and then concentrate on a specific
> > > > >>>>>>>>>>>>>>>>>>>>>>>> part?
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>> @Becket, yes, actually I am more concerned with the underlying service.
> > > > >>>>>>>>>>>>>>>>>>>>>>>> This seems to be quite a major change to the existing codebase. As you
> > > > >>>>>>>>>>>>>>>>>>>>>>>> claimed, the service should be extensible to support other components, and
> > > > >>>>>>>>>>>>>>>>>>>>>>>> we’d better discuss it in another thread.
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>> All in all, I am also eager to enjoy a more interactive Table API,
> > > > >>>>>>>>>>>>>>>>>>>>>>>> provided there is a general and flexible enough service mechanism.
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>> Best,
> > > >>>>>>>>>>>>>>>>>>>>>>>> Xingcan
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei Jiang <xiaoweij@xxxxxxxxx> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>> Relying on a callback for the temp table for clean up is not very
> > > > >>>>>>>>>>>>>>>>>>>>>>>>> reliable. There is no guarantee that it will be executed successfully. We
> > > > >>>>>>>>>>>>>>>>>>>>>>>>> may risk leaks when that happens. I think that it's safer to have an
> > > > >>>>>>>>>>>>>>>>>>>>>>>>> association between temp table and session id. So we can always clean up
> > > > >>>>>>>>>>>>>>>>>>>>>>>>> temp tables which are no longer associated with any active sessions.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>>>>>>>>>>>>>>> Xiaowei
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng sun <sunjincheng121@xxxxxxxxx> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> Hi Jiangjie&Shaoxuan,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for initiating this great proposal!
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Interactive Programming is very useful and user friendly in the case of
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> your examples.
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Moreover, especially when a business has to be executed in several stages
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> with dependencies, such as the pipeline of Flink ML, in order to utilize
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> the intermediate calculation results we have to submit a job by
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> env.execute().
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> About the `cache()`, I think it is better named `persist()`, and the
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Flink framework determines whether we internally cache in memory or
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> persist to the storage system, maybe saving the data into a state backend
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> (MemoryStateBackend or RocksDBStateBackend etc.)
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> BTW, from my point of view, in the future, support for streaming and
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> batch mode switching in the same job will also benefit "Interactive
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> Programming". I am looking forward to your JIRAs and FLIP!
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> Best,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>> Jincheng
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Nov 20, 2018 at 9:56 PM Becket Qin <becket.qin@xxxxxxxxx> wrote:
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> As a few recent email threads have pointed out, it is a promising
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> opportunity to enhance Flink Table API in various aspects, including
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> functionality and ease of use among others. One of the scenarios where we
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> feel Flink could improve is interactive programming. To explain the issues
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> and facilitate the discussion on the solution, we put together the
> > > > >>>>>>>>>>>>>>>>>>>>>>>>>>> following document with our proposal.
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>
> > >
> >
> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Feedback and comments are very welcome!
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>>
> > > >>
> > > >>
> > > >>
> > >
> > >
> >
>