Re: [DISCUSS] FLIP-6 Problems


Our main use case is Mesos, so maybe we can start with Mesos support.
On Wed, Jun 6, 2018 at 5:00 PM Stephan Ewen <sewen@xxxxxxxxxx> wrote:

> The FLIP-6 design specifically allows for separating the Dispatcher,
> ResourceManager, and JobManagers, so that could be another extension at
> some point.
>
> It should be conceptually rather simple: for each job, the dispatcher
> creates a new container launch context with the "JobManagerRunner" and
> starts that. In practice, it is still quite a bit of work, with all the
> details of YARN to take care of.
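
To make the shape of that idea concrete, here is a toy Python model of a
dispatcher that starts one JobManager container per job. It is only a sketch
under stated assumptions: `LaunchContext`, `ToyDispatcher` and
`start_container` are hypothetical names, not Flink, YARN or Mesos APIs. The
only detail taken from the message above is the "JobManagerRunner" entry point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchContext:
    """Hypothetical stand-in for a container launch description."""
    job_id: str
    entrypoint: str = "JobManagerRunner"


class ToyDispatcher:
    """Shared gateway that starts one isolated JobManager container per job."""

    def __init__(self, start_container):
        # start_container is any callable taking a LaunchContext, for example
        # a wrapper around a cluster client (not modelled in this sketch).
        self._start_container = start_container
        self._running = {}

    def submit(self, job_id):
        """Create a per-job launch context and hand it to the launcher."""
        if job_id in self._running:
            raise ValueError(f"job {job_id!r} is already running")
        context = LaunchContext(job_id=job_id)
        self._start_container(context)
        self._running[job_id] = context
        return context

    def running_jobs(self):
        """Return the IDs of all jobs submitted so far, sorted."""
        return sorted(self._running)
```

The dispatcher stays the single submission gateway, while user code of each
job would run only inside the container started for that job.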
>
>
>
> On Wed, Jun 6, 2018 at 9:45 AM, Renjie Liu <liurenjie2008@xxxxxxxxx>
> wrote:
>
> > That's really great! I'll help to contribute to the process.
> >
> > On Wed, Jun 6, 2018 at 3:17 PM Till Rohrmann <trohrmann@xxxxxxxxxx>
> > wrote:
> >
> > > Hi Renjie,
> > >
> > > there is already an issue for introducing further scheduling
> > > constraints (e.g. tags) to achieve TM isolation when using the session
> > > mode [1]. What it does not cover is the isolation of the JMs, which need
> > > to be executed in their own processes. At the moment they share the same
> > > process with the Dispatcher because it was simpler to do it like that as
> > > a first iteration. Here is the issue for isolating JobManagers [2].
> > >
> > > Concerning the resource specification, the corresponding issue can be
> > > found here [3].
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-8886
> > > [2] https://issues.apache.org/jira/browse/FLINK-9537
> > > [3] https://issues.apache.org/jira/browse/FLINK-5131
> > >
> > > Cheers,
> > > Till
> > >
> > > On Wed, Jun 6, 2018 at 2:13 AM Renjie Liu <liurenjie2008@xxxxxxxxx>
> > > wrote:
> > >
> > > > Hi, Stephan:
> > > >
> > > > Yes, that's what I mean. In fact, the most important thing is to share
> > > > the dispatcher so that we can have *a centralized gateway for Flink job
> > > > management and submission. The problem with a per-job cluster is that
> > > > we can't have a centralized gateway.*
> > > >
> > > > I didn't realize before that the job manager also needs to run user
> > > > code, and yes, that means the job manager should also be isolated.
> > > >
> > > > Wouldn't it be better to separate the job manager from the dispatcher
> > > > so that user code from different jobs doesn't interfere? In fact, it
> > > > seems that most production environments require job isolation, since
> > > > nobody wants their job to be affected by others.
> > > >
> > > > On Tue, Jun 5, 2018 at 11:34 PM Stephan Ewen <sewen@xxxxxxxxxx>
> > > > wrote:
> > > >
> > > > > Hi Renjie,
> > > > >
> > > > > When you suggest having TaskManager isolation in session mode, do
> > > > > you mean having a shared JobManager / Dispatcher, but job-specific
> > > > > TaskManagers? Is this mainly to reduce the overhead of the per-job
> > > > > JobManager?
> > > > >
> > > > > One assumption so far was that if TaskManager isolation is required,
> > > > > JobManager isolation is also required, because some user code
> > > > > potentially also runs on the JobManager, like CheckpointHooks,
> > > > > Input/Output Formats, ...
> > > > >
> > > > > Best,
> > > > > Stephan
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Jun 5, 2018 at 4:20 PM, Renjie Liu <liurenjie2008@xxxxxxxxx>
> > > > > wrote:
> > > > >
> > > > > > Hi, Till:
> > > > > >
> > > > > >
> > > > > >    1. Does the community have any plan to add task manager
> > > > > >    isolation to the session mode?
> > > > > >    2. Are there any issues tracking this feature? I want to help
> > > > > >    contribute.
> > > > > >    3. Thanks for the information, but it doesn't help if task
> > > > > >    manager isolation is not present.
> > > > > >
> > > > > >
> > > > > > On Tue, Jun 5, 2018 at 7:28 PM Till Rohrmann <trohrmann@xxxxxxxxxx>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Renjie,
> > > > > > >
> > > > > > > 1) You're right that the Flink session mode does not give you
> > > > > > > proper job isolation. It is the same as with the Flink 1.4
> > > > > > > session mode. If this is a strong requirement for you, then I
> > > > > > > recommend using the per-job mode.
> > > > > > >
> > > > > > > 2) At the moment it is also not possible to define per-job
> > > > > > > resource requirements when using the session mode. This is a
> > > > > > > feature which the community has started implementing, but it is
> > > > > > > not yet fully done. I assume that the community will continue
> > > > > > > working on it. At the moment, the solution would be to use the
> > > > > > > per-job mode to avoid wasting resources.
> > > > > > >
> > > > > > > 3) I think the assigned ResourceID for a TaskManager is shown in
> > > > > > > the web UI and when querying the "/taskmanagers" REST endpoint.
> > > > > > > The resource ID is derived from the Mesos task ID. Would that
> > > > > > > help to identify which TM is running on which Mesos task?
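
If the REST route is used, a minimal Python sketch for listing the
ResourceIDs could look like the following. It assumes the response body has
the shape {"taskmanagers": [{"id": ...}, ...]} and that the REST address is
reachable at a base URL you supply. `parse_taskmanager_ids` and
`fetch_taskmanager_ids` are illustrative helper names, not part of Flink.

```python
import json
from urllib.request import urlopen


def parse_taskmanager_ids(payload):
    """Return the "id" of every entry in a /taskmanagers response body."""
    # Assumed response shape: {"taskmanagers": [{"id": "...", ...}, ...]}
    return [tm["id"] for tm in payload.get("taskmanagers", [])]


def fetch_taskmanager_ids(base_url, timeout=5.0):
    """Query <base_url>/taskmanagers and return the reported IDs."""
    url = base_url.rstrip("/") + "/taskmanagers"
    with urlopen(url, timeout=timeout) as response:
        return parse_taskmanager_ids(json.load(response))
```

Calling fetch_taskmanager_ids with the cluster's REST address would return the
IDs, which can then be compared by hand against the task IDs that Mesos
reports.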
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Till
> > > > > > >
> > > > > > > On Tue, Jun 5, 2018 at 5:13 AM Renjie Liu <liurenjie2008@xxxxxxxxx>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > ---------- Forwarded message ---------
> > > > > > > > From: Renjie Liu <liurenjie2008@xxxxxxxxx>
> > > > > > > > Date: Tue, Jun 5, 2018 at 10:43 AM
> > > > > > > > Subject: [DISCUSS] FLIP-6 Problems
> > > > > > > > To: user <user@xxxxxxxxxxxxxxxx>
> > > > > > > >
> > > > > > > >
> > > > > > > > Hi:
> > > > > > > >
> > > > > > > > We've deployed Flink 1.5.0 and tested the new cluster manager;
> > > > > > > > it's really great for Flink to be elastic. However, we've also
> > > > > > > > found some problems that block us from deploying it to our
> > > > > > > > production environment.
> > > > > > > >
> > > > > > > > 1. Task manager isolation. Currently Flink allows different
> > > > > > > > jobs to execute on the same task managers. This is
> > > > > > > > unacceptable in a production environment, since a badly
> > > > > > > > written job may kill task managers and affect other jobs.
> > > > > > > > 2. Per-job resource configuration. Currently a Flink session
> > > > > > > > cluster can only allocate task managers of the same size and
> > > > > > > > configuration. This may waste a lot of resources if we have
> > > > > > > > many jobs with different resource requirements.
> > > > > > > > 3. Task manager names are meaningless. This is a problem since
> > > > > > > > we can't monitor the status of containers in a Mesos
> > > > > > > > environment.
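
As a rough illustration of point 2, the arithmetic below shows how much
memory goes unused when every slot has to be sized for the most demanding
task. The workload numbers and the helper name `wasted_memory_mb` are
hypothetical and chosen only for the example; they are not measurements from
a real cluster.

```python
def wasted_memory_mb(slot_needs_mb):
    """Memory left unused when all slots share one size.

    slot_needs_mb lists what each occupied slot actually needs. The uniform
    slot size is taken to be the largest need, which is the smallest size
    that still fits every task.
    """
    if not slot_needs_mb:
        return 0
    uniform_size = max(slot_needs_mb)
    return sum(uniform_size - need for need in slot_needs_mb)


# Hypothetical workload: three small tasks and one large one.
needs_mb = [512, 512, 1024, 4096]
print(wasted_memory_mb(needs_mb))  # prints 10240
```

In this made-up case the cluster allocates 16384 MB to serve 6144 MB of actual
demand, which is the kind of waste that per-job resource configuration would
avoid.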
> > > > > > > >
> > > > > > > > One solution to the above problems is to use a per-job
> > > > > > > > cluster, but a centralized cluster manager can help to manage
> > > > > > > > Flink deployments and jobs better.
> > > > > > > >
> > > > > > > > What do you think about these? If the community agrees with
> > > > > > > > us, we would like to propose a design and implementation to
> > > > > > > > enhance the Flink cluster manager.
> > > > > > > > --
> > > > > > > > Liu, Renjie
> > > > > > > > Software Engineer, MVAD
-- 
Liu, Renjie
Software Engineer, MVAD