Re: Airflow - YARN as an executor?


Now I wonder whether a PySpark Executor for Airflow would be an easier target.
Spark runs on YARN, Mesos, and now Kubernetes,
so a PySpark Executor would effectively let Airflow run on all of these schedulers.
My understanding is that we currently have only a Spark operator, not an executor.

Thanks!



-- 
Ruslan Dautkhanov

On Tue, Apr 24, 2018 at 3:20 PM, Ace Haidrey <acehaidrey@xxxxxxxxx> wrote:

> Hey, I didn't know this, Bolke. I was under the same impression as
> Ruslan.
> Thanks for sharing.
>
> Sent from my iPhone
>
> > On Apr 24, 2018, at 2:12 PM, Bolke de Bruin <bdbruin@xxxxxxxxx> wrote:
> >
> > It actually can nowadays:
> > https://cdn.oreillystatic.com/en/assets/1/event/269/HDFS%20on%20Kubernetes_%20Tech%20deep%20dive%20on%20locality%20and%20security%20Presentation.pptx
> >
> > We also have an on-premise setup with Ceph (s3a) and HDFS for when we
> > need the speed, and Kubernetes for our workloads. We are kicking out YARN
> > (and Hive etc., for that matter).
> >
> > Bolke
> >
> >
> >
> > Sent from my iPad
> >
> >> On Apr 24, 2018, at 22:50, Ruslan Dautkhanov <dautkhanov@xxxxxxxxx> wrote:
> >>
> >> Kubernetes is a "monolithic" one-level scheduler that can't handle what YARN
> >> can, for example scheduling tasks local to the data.
> >> Hadoop has multiple levels of data locality (node-local, rack-local), so
> >> computation happens close to the data to minimize network
> >> data transfer, which is expensive.
> >> K8s wasn't designed to handle these scheduling scenarios, as far as I know.
> >>
> >> For cloud deployments where we don't have a data locality problem (because
> >> S3 is used instead of storage local
> >> to the servers), k8s might be okay.
> >>
> >> Nice comparison [1] of k8s vs. two-level schedulers like YARN and Mesos,
> >> although I think it's off-topic.
> >>
> >> We're mostly on-prem and we don't see Kubernetes taking over from YARN any time
> >> soon.
> >>
> >> Thanks.
> >>
> >>
> >>
> >> [1] https://aaltodoc.aalto.fi/bitstream/handle/123456789/27061/master_Ravula_Shashi_2017.pdf?sequence=1
> >>
> >> *2.3.2 Monolithic Schedulers*
> >>
> >> Monolithic schedulers use a single, centralized scheduling algorithm for
> >> all jobs. All workload is run through the same scheduler and same
> >> scheduling logic. Swarm,
> >> Fleet, Borg and Kubernetes adopt monolithic schedulers. Kubernetes
> >> improvised on basic monolithic version of Borg and Swarm schedulers. This
> >> type of schedulers are not suitable for running heterogeneous modern
> >> workloads which include Spark jobs, containers, and other long running jobs,
> >> etc.
> >>
> >>
> >>
> >> *2.3.3 Two Level Schedulers*
> >>
> >> Two-level schedulers address the drawbacks of a monolithic scheduler by
> >> separating concerns of resource allocation and task placement. An active
> >> resource manager offers compute resources to multiple parallel, independent
> >> “scheduler frameworks”. The Mesos cluster manager pioneered this approach,
> >> and YARN supports a limited version of it. In Mesos, resources are offered
> >> to application-level schedulers. This allows for custom, workload-specific
> >> scheduling policies. The drawback with this type of scheduling architecture
> >> is that the application level frameworks cannot see all the possible
> >> placement options anymore. Instead, they only see those options that
> >> correspond to resources offered (Mesos) or allocated (YARN) by the resource
> >> manager component. This makes priority preemption (higher priority tasks
> >> kick out lower priority ones) difficult.
> >>
> >>
> >>
> >>
> >>
> >> --
> >> Ruslan Dautkhanov
> >>
> >>> On Tue, Apr 24, 2018 at 2:22 PM, Bolke de Bruin <bdbruin@xxxxxxxxx> wrote:
> >>>
> >>> Happy to have it as a contrib executor. However, I personally think YARN
> >>> is a dead end. It has a lot of catching up to do and all the momentum is
> >>> with Kubernetes.
> >>>
> >>> B.
> >>>
> >>> Sent from my iPad
> >>>
> >>>> On Apr 24, 2018, at 22:13, Ruslan Dautkhanov <dautkhanov@xxxxxxxxx> wrote:
> >>>>
> >>>> With Hadoop 3's Docker-on-YARN support, I think YARN becomes
> >>>> something of a competitor to Kubernetes.
> >>>>
> >>>> Great job on adding k8s support to Airflow.
> >>>>
> >>>> Very similarly, I can see Airflow integrating with YARN and using
> >>>> its infrastructure as an "executor". Has anyone explored the feasibility of
> >>>> this approach?
> >>>>
> >>>>
> >>>> Thanks!
> >>>> Ruslan Dautkhanov
> >>>
>