osdir.com


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [DISCUSS] Support Interactive Programming in Flink Table API


Hi,

Interesting idea. I’m trying to understand the problem. Isn’t the `cache()` call an equivalent of writing data to a sink and later reading from it? Where this sink has a limited live scope/live time? And the sink could be implemented as in memory or a file sink?

If so, what’s the problem with creating a materialised view from a table “b” (from your document’s example) and reusing this materialised view later? Maybe we are lacking mechanisms to clean up materialised views (for example when current session finishes)? Maybe we need some syntactic sugar on top of it?  

Piotrek

> On 23 Nov 2018, at 07:21, Becket Qin <becket.qin@xxxxxxxxx> wrote:
> 
> Thanks for the suggestion, Jincheng.
> 
> Yes, I think it makes sense to have a persist() with lifecycle/defined
> scope. I just added a section in the future work for this.
> 
> Thanks,
> 
> Jiangjie (Becket) Qin
> 
> On Fri, Nov 23, 2018 at 1:55 PM jincheng sun <sunjincheng121@xxxxxxxxx>
> wrote:
> 
>> Hi Jiangjie,
>> 
>> Thank you for the explanation about the name of `cache()`, I understand why
>> you designed this way!
>> 
>> Another idea is whether we can specify a lifecycle for data persistence?
>> For example, persist (LifeCycle.SESSION), so that the user is not worried
>> about data loss, and will clearly specify the time range for keeping time.
>> At the same time, if we want to expand, we can also share in a certain
>> group of session, for example: LifeCycle.SESSION_GROUP(...), I am not sure,
>> just an immature suggestion, for reference only!
>> 
>> Bests,
>> Jincheng
>> 
>> Becket Qin <becket.qin@xxxxxxxxx> 于2018年11月23日周五 下午1:33写道:
>> 
>>> Re: Jincheng,
>>> 
>>> Thanks for the feedback. Regarding cache() v.s. persist(), personally I
>>> find cache() to be more accurately describing the behavior, i.e. the
>> Table
>>> is cached for the session, but will be deleted after the session is
>> closed.
>>> persist() seems a little misleading as people might think the table will
>>> still be there even after the session is gone.
>>> 
>>> Great point about mixing the batch and stream processing in the same job.
>>> We should absolutely move towards that goal. I imagine that would be a
>> huge
>>> change across the board, including sources, operators and optimizations,
>> to
>>> name some. Likely we will need several separate in-depth discussions.
>>> 
>>> Thanks,
>>> 
>>> Jiangjie (Becket) Qin
>>> 
>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan Cui <xingcanc@xxxxxxxxx> wrote:
>>> 
>>>> Hi all,
>>>> 
>>>> @Shaoxuan, I think the lifecycle or access domain are both orthogonal
>> to
>>>> the cache problem. Essentially, this may be the first time we plan to
>>>> introduce another storage mechanism other than the state. Maybe it’s
>>> better
>>>> to first draw a big picture and then concentrate on a specific part?
>>>> 
>>>> @Becket, yes, actually I am more concerned with the underlying service.
>>>> This seems to be quite a major change to the existing codebase. As you
>>>> claimed, the service should be extendible to support other components
>> and
>>>> we’d better discussed it in another thread.
>>>> 
>>>> All in all, I also eager to enjoy the more interactive Table API, in
>> case
>>>> of a general and flexible enough service mechanism.
>>>> 
>>>> Best,
>>>> Xingcan
>>>> 
>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei Jiang <xiaoweij@xxxxxxxxx>
>>> wrote:
>>>>> 
>>>>> Relying on a callback for the temp table for clean up is not very
>>>> reliable.
>>>>> There is no guarantee that it will be executed successfully. We may
>>> risk
>>>>> leaks when that happens. I think that it's safer to have an
>> association
>>>>> between temp table and session id. So we can always clean up temp
>>> tables
>>>>> which are no longer associated with any active sessions.
>>>>> 
>>>>> Regards,
>>>>> Xiaowei
>>>>> 
>>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng sun <
>>> sunjincheng121@xxxxxxxxx>
>>>>> wrote:
>>>>> 
>>>>>> Hi Jiangjie&Shaoxuan,
>>>>>> 
>>>>>> Thanks for initiating this great proposal!
>>>>>> 
>>>>>> Interactive Programming is very useful and user friendly in case of
>>> your
>>>>>> examples.
>>>>>> Moreover, especially when a business has to be executed in several
>>>> stages
>>>>>> with dependencies,such as the pipeline of Flink ML, in order to
>>> utilize
>>>> the
>>>>>> intermediate calculation results we have to submit a job by
>>>> env.execute().
>>>>>> 
>>>>>> About the `cache()`  , I think is better to named `persist()`, And
>> The
>>>>>> Flink framework determines whether we internally cache in memory or
>>>> persist
>>>>>> to the storage system,Maybe save the data into state backend
>>>>>> (MemoryStateBackend or RocksDBStateBackend etc.)
>>>>>> 
>>>>>> BTW, from the points of my view in the future, support for streaming
>>> and
>>>>>> batch mode switching in the same job will also benefit in
>> "Interactive
>>>>>> Programming",  I am looking forward to your JIRAs and FLIP!
>>>>>> 
>>>>>> Best,
>>>>>> Jincheng
>>>>>> 
>>>>>> 
>>>>>> Becket Qin <becket.qin@xxxxxxxxx> 于2018年11月20日周二 下午9:56写道:
>>>>>> 
>>>>>>> Hi all,
>>>>>>> 
>>>>>>> As a few recent email threads have pointed out, it is a promising
>>>>>>> opportunity to enhance Flink Table API in various aspects,
>> including
>>>>>>> functionality and ease of use among others. One of the scenarios
>>> where
>>>> we
>>>>>>> feel Flink could improve is interactive programming. To explain the
>>>>>> issues
>>>>>>> and facilitate the discussion on the solution, we put together the
>>>>>>> following document with our proposal.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>> 
>> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
>>>>>>> 
>>>>>>> Feedback and comments are very welcome!
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Jiangjie (Becket) Qin
>>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>> 
>>