
Re: Airflow - YARN as an executor?


I used "Executor" as the Airflow term; I did not mean a Spark executor.
That is, Spark would be one of the Airflow executors, like the ones
in here
https://github.com/apache/incubator-airflow/tree/master/airflow/executors
or in here
https://github.com/apache/incubator-airflow/tree/master/airflow/contrib/executors
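
To make the distinction concrete, here is a minimal, self-contained sketch of what an Airflow-style executor does: it accepts a task command, launches it somewhere, and later reports the task's final state. This is only an illustration under assumptions. The class `SketchExecutor` and the `demo` helper are hypothetical and do not import or subclass anything from Airflow; the method names loosely follow my understanding of the executor interface (`execute_async`, `sync`, `end`). A YARN- or Spark-backed executor would presumably submit the command to the cluster at the point where this sketch starts a local subprocess.

```python
import subprocess
import sys


class SketchExecutor:
    """Hypothetical stand-in for an Airflow-style executor (NOT the real BaseExecutor)."""

    def __init__(self):
        self.running = {}   # task key -> subprocess.Popen handle
        self.results = {}   # task key -> "success" or "failed"

    def execute_async(self, key, command):
        # Launch the task command without waiting for it to finish.
        # Assumption: a YARN/Spark-backed executor would submit to the cluster here.
        self.running[key] = subprocess.Popen(command)

    def sync(self):
        # Poll launched tasks and record the final state of the finished ones.
        for key, proc in list(self.running.items()):
            code = proc.poll()
            if code is None:
                continue  # still running
            self.results[key] = "success" if code == 0 else "failed"
            del self.running[key]

    def end(self):
        # Block until every launched task has finished, then collect states.
        for proc in self.running.values():
            proc.wait()
        self.sync()


def demo():
    ex = SketchExecutor()
    ex.execute_async("ok_task", [sys.executable, "-c", "print('hello')"])
    ex.execute_async("bad_task", [sys.executable, "-c", "raise SystemExit(3)"])
    ex.end()
    return ex.results
```

Calling `demo()` runs one command that exits cleanly and one that exits with a non-zero code, and returns the recorded state per task key. The point is that the executor runs arbitrary task commands, which is a different thing from a Spark executor that runs Spark code.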

Thanks.



-- 
Ruslan Dautkhanov

On Wed, Apr 25, 2018 at 9:17 AM, Bolke de Bruin <bdbruin@xxxxxxxxx> wrote:

> I'm a bit lost on the Spark executor, to be honest. To my knowledge the
> Spark driver creates Spark executors, which run Spark code. In other words,
> it can't arbitrarily run generic code. Or can it?
>
> B.
>
> Sent from my iPad
>
> > On 25 Apr 2018, at 17:11, Ruslan Dautkhanov <dautkhanov@xxxxxxxxx> wrote:
> >
> > Now I think Airflow on a PySpark Executor would be an easier target.
> > Spark runs on YARN, Mesos and now Kubernetes.
> > So a PySpark Executor would let Airflow run on all of these schedulers.
> > It's my understanding that we currently have only a Spark Operator, not an Executor.
> >
> > Thanks!
> >
> >
> >
> > --
> > Ruslan Dautkhanov
> >
> >> On Tue, Apr 24, 2018 at 3:20 PM, Ace Haidrey <acehaidrey@xxxxxxxxx> wrote:
> >>
> >> Hey, I didn’t know this, Bolke. I was under the same impression as Ruslan.
> >> Thanks for sharing.
> >>
> >> Sent from my iPhone
> >>
> >>> On Apr 24, 2018, at 2:12 PM, Bolke de Bruin <bdbruin@xxxxxxxxx> wrote:
> >>>
> >>> It actually can nowadays:
> >>> https://cdn.oreillystatic.com/en/assets/1/event/269/HDFS%20on%20Kubernetes_%20Tech%20deep%20dive%20on%20locality%20and%20security%20Presentation.pptx
> >>>
> >>> We also have an on-premise setup with Ceph (s3a) and HDFS for when we
> >>> need the speed, and Kubernetes for our workloads. We are kicking out YARN
> >>> (and Hive etc., for that matter).
> >>>
> >>> Bolke
> >>>
> >>>
> >>>
> >>> Sent from my iPad
> >>>
> >>>> On 24 Apr 2018, at 22:50, Ruslan Dautkhanov <dautkhanov@xxxxxxxxx> wrote:
> >>>>
> >>>> Kubernetes is a "monolithic" one-level scheduler that can't handle what
> >>>> YARN can, for example scheduling tasks local to the data.
> >>>> Hadoop has multiple levels of data locality (node-local, rack-local), so
> >>>> computation happens close to the data to minimize network
> >>>> data transfer, which is expensive.
> >>>> K8s wasn't designed to handle these scheduling scenarios, as far as I
> >>>> know.
> >>>>
> >>>> For cloud deployments where we don't have a data locality problem
> >>>> (because S3 is used instead of storage local
> >>>> to the servers), k8s might be okay.
> >>>>
> >>>> There is a nice comparison [1] of k8s vs. two-level schedulers like YARN
> >>>> and Mesos, although I think it's off-topic.
> >>>>
> >>>> We're mostly on-prem and we don't see Kubernetes taking over from YARN any
> >>>> time soon.
> >>>>
> >>>> Thanks.
> >>>>
> >>>>
> >>>>
> >>>> [1]
> >>>>
> >>>> https://aaltodoc.aalto.fi/bitstream/handle/123456789/27061/master_Ravula_Shashi_2017.pdf?sequence=1
> >>>>
> >>>> *2.3.2 Monolithic Schedulers*
> >>>>
> >>>> Monolithic schedulers use a single, centralized scheduling algorithm for
> >>>> all jobs. All workload is run through the same scheduler and same
> >>>> scheduling logic. Swarm, Fleet, Borg and Kubernetes adopt monolithic
> >>>> schedulers. Kubernetes improvised on basic monolithic version of Borg and
> >>>> Swarm schedulers. This type of schedulers are not suitable for running
> >>>> heterogeneous modern workloads which include Spark jobs, containers, and
> >>>> other long running jobs, etc.
> >>>>
> >>>>
> >>>>
> >>>> *2.3.3 Two Level Schedulers*
> >>>>
> >>>> Two-level schedulers address the drawbacks of a monolithic scheduler by
> >>>> separating concerns of resource allocation and task placement. An active
> >>>> resource manager offers compute resources to multiple parallel,
> >>>> independent “scheduler frameworks”. The Mesos cluster manager pioneered
> >>>> this approach, and YARN supports a limited version of it. In Mesos,
> >>>> resources are offered to application-level schedulers. This allows for
> >>>> custom, workload-specific scheduling policies. The drawback with this type
> >>>> of scheduling architecture is that the application level frameworks cannot
> >>>> see all the possible placement options anymore. Instead, they only see
> >>>> those options that correspond to resources offered (Mesos) or allocated
> >>>> (YARN) by the resource manager component. This makes priority preemption
> >>>> (higher priority tasks kick out lower priority ones) difficult.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Ruslan Dautkhanov
> >>>>
> >>>>> On Tue, Apr 24, 2018 at 2:22 PM, Bolke de Bruin <bdbruin@xxxxxxxxx> wrote:
> >>>>>
> >>>>> Happy to have it as a contrib executor. However, I personally think
> >>>>> YARN is a dead end. It has a lot of catching up to do and all the momentum
> >>>>> is with Kubernetes.
> >>>>>
> >>>>> B.
> >>>>>
> >>>>> Sent from my iPad
> >>>>>
> >>>>>> On 24 Apr 2018, at 22:13, Ruslan Dautkhanov <dautkhanov@xxxxxxxxx> wrote:
> >>>>>>
> >>>>>> With Hadoop 3's Docker on YARN support, I think YARN becomes
> >>>>>> somewhat of a competitor to Kubernetes.
> >>>>>>
> >>>>>> Great job on adding k8s support to Airflow.
> >>>>>>
> >>>>>> Very similarly, I think Airflow could integrate with YARN and use
> >>>>>> its infrastructure as an "executor". Has anyone explored the feasibility
> >>>>>> of this approach?
> >>>>>>
> >>>>>>
> >>>>>> Thanks!
> >>>>>> Ruslan Dautkhanov
> >>>>>
> >>
>