Re: Airflow - High Availability and Scale Up vs Scale Out


We also run one beefy box in AWS ECS, with the scheduler and webserver
running in the same container. However, we have run into issues with this
approach: the scheduler fails at times, and our DAGs stay stuck until I
manually restart the container.
What approaches do you guys use to restart the scheduler automatically when
it is stuck or has failed?
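
One way to frame the automatic restart being asked about is a small
supervisor process that relaunches the scheduler when it exits or stops
heartbeating. The following is only a sketch under assumptions, not
something described in this thread: `get_last_heartbeat` is a hypothetical
callable (for example, a query against the metadata database) that returns
a timezone-aware UTC datetime, and the five-minute threshold and 30-second
poll interval are arbitrary illustrative values.

```python
import subprocess
import time
from datetime import datetime, timedelta, timezone


def needs_restart(exit_code, last_heartbeat, now, max_silence):
    """Return True if the scheduler exited or its heartbeat is too old."""
    if exit_code is not None:  # Popen.poll() returns None while running
        return True
    return now - last_heartbeat > max_silence


def stop(proc, grace_seconds=30):
    """Terminate a hung process, escalating to kill after a grace period."""
    proc.terminate()
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()


def supervise(cmd, get_last_heartbeat,
              max_silence=timedelta(minutes=5), poll_seconds=30):
    """Run cmd forever, relaunching it when it dies or goes silent.

    get_last_heartbeat is a hypothetical callable supplied by the caller.
    """
    proc = subprocess.Popen(cmd)
    started = datetime.now(timezone.utc)
    while True:
        time.sleep(poll_seconds)
        # Ignore heartbeats written before the current process started,
        # so a fresh scheduler gets a full max_silence window.
        last = max(get_last_heartbeat(), started)
        now = datetime.now(timezone.utc)
        if needs_restart(proc.poll(), last, now, max_silence):
            if proc.poll() is None:  # still running but silent: stop it
                stop(proc)
            proc = subprocess.Popen(cmd)
            started = datetime.now(timezone.utc)
```

It could be started with `supervise(["airflow", "scheduler"], reader)`,
where `reader` is whatever hypothetical function reads the last heartbeat.
Keeping the restart decision in the pure function `needs_restart` makes the
policy easy to test without launching any process.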

- Ali

On Sun, Jun 10, 2018 at 8:44 PM Bolke de Bruin <bdbruin@xxxxxxxxx> wrote:

> If you are running on one big box, you most certainly want to put the
> scheduler in its own cgroup and run the tasks with sudo in their own.
> Otherwise your availability might suffer.
>
> B.
>
> Sent from my iPad
>
> > On 10 Jun 2018, at 16:30, Sam Sen <sxs@xxxxxxxxxxxxxxxx> wrote:
> >
> > Wouldn't you want immutable containers, in which case baking the code
> > into the container would be preferable?
> >
> >> On Sun, Jun 10, 2018, 9:53 AM Arash Soheili <tonyarash@xxxxxxxxx> wrote:
> >>
> >> We are just starting out, but our setup is two EC2 instances: one runs
> >> the web server and scheduler, and the other runs multiple workers. Both
> >> connect to an RDS database, and to Redis on AWS ElastiCache for the
> >> Celery connection.
> >>
> >> All 4 services run in containers with systemd, and we use CodeDeploy
> >> and sync the code by mapping volumes from the local filesystem into
> >> the container. We are not yet heavy users of Airflow, so I can't speak
> >> to performance and scaling up just yet.
> >>
> >> In general I think an AMI with baked-in code can be brittle and hard to
> >> maintain and update. Containers are the way to go, as you can bake the
> >> code into the image if you want. We have chosen not to do that and rely
> >> on volume mapping to bring the latest code into the container. This is
> >> easier because you don't need to keep creating new images.
> >>
> >> Arash
> >>
> >>> On Sat, Jun 9, 2018 at 9:47 AM Naik Kaxil <k.naik@xxxxxxxxx> wrote:
> >>>
> >>> Let us know your findings after trying the beefy box approach.
> >>>
> >>> On 08/06/2018, 12:24, "Sam Sen" <sxs@xxxxxxxxxxxxxxxx> wrote:
> >>>
> >>>    We are facing this now. We have tried the CeleryExecutor and it adds
> >>>    more moving parts. While we have not thrown out this idea, we are
> >>>    going to give one big beefy box a try.
> >>>
> >>>    To handle the HA side of things, we are putting the server in an
> >>>    auto-scaling group (we use AWS) with a min and max of 1 server. We
> >>>    deploy from an AMI that has Airflow baked in, and we point the DB
> >>>    config to an RDS using service discovery (Consul).
> >>>
> >>>    As for the DAG code, we can either bake it into the AMI as well or
> >>>    install it on bootup. We haven't decided which to do, but either
> >>>    way, we realize it could take a few minutes to fully recover in the
> >>>    event of a catastrophe.
> >>>
> >>>    The other option is to have a standby server if using Celery isn't
> >>>    ideal. With that, I have tried using HashiCorp Nomad to handle the
> >>>    services. In my limited trial, it did what we wanted, but we need
> >>>    more time to test.
> >>>
> >>>>    On Fri, Jun 8, 2018, 4:23 AM Naik Kaxil <k.naik@xxxxxxxxx> wrote:
> >>>>
> >>>> Hi guys,
> >>>>
> >>>> I have 2 specific questions for those of you using Airflow in
> >>>> production:
> >>>>
> >>>>   1. How have you achieved high availability? What does the
> >>>>   architecture look like? Do you replicate the master node as well?
> >>>>   2. Scale up vs. scale out?
> >>>>      1. What is your preferred approach: 1 beefy Airflow VM with
> >>>>      Worker, Scheduler and Webserver using the LocalExecutor, or a
> >>>>      cluster with multiple workers using the CeleryExecutor?
> >>>>
> >>>> I think this thread should help others with similar questions as well.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> Regards,
> >>>>
> >>>> Kaxil
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> Kaxil Naik
> >>>>
> >>>> Data Reply
> >>>> 2nd Floor, Nova South
> >>>> 160 Victoria Street, Westminster
> >>>> London SW1E 5LB - UK
> >>>> phone: +44 (0)20 7730 6000
> >>>> k.naik@xxxxxxxxx
> >>>> www.reply.com
> >>>>
> >>>>
> >>>
> >>
>