
Re: Airflow - High Availability and Scale Up vs Scale Out


@Andy - Any reasons why you prefer scaling out as opposed to scaling up?

On 08/06/2018, 15:33, "Andy Cooper" <andy.cooper@xxxxxxxxxxxxx> wrote:

    Once you solve DAG deployments and container orchestration, the
    CeleryExecutor becomes more interesting. We solve DAG deployments by
    putting the DAG code into the container at build time and triggering image
    updates on our Kubernetes cluster via webhooks from a private Docker
    registry. We are currently using the CeleryExecutor to scale out rather
    than up, but have begun to explore the KubernetesExecutor to further
    simplify our stack.
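
    For illustration only, a minimal Dockerfile along these lines (registry
    name, base image and paths are placeholders rather than our actual setup)
    is all it takes to bake the DAGs in at build time:

        # hypothetical base image hosted in the private registry
        FROM my-registry.example.com/airflow-base:1.10
        # copy the DAG code in at build time; the target path depends on
        # the base image's AIRFLOW_HOME
        COPY dags/ /usr/local/airflow/dags/

    Pushing the newly built tag to the registry is what fires the webhook that
    rolls the cluster onto the new image.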
    
    It seems like you would want to go the route of separating concerns as well
    if you want to move towards HA. More generally, though, I don't think true
    HA can be achieved with the current scheduler architecture, which requires
    that only one scheduler run at a time. The next best thing is to put the
    scheduler under container orchestration that restarts it immediately on
    failure, at which point it continues scheduling work where it left off.
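
    As a rough sketch (names and the image are made up, not our actual
    manifests), a single-replica Kubernetes Deployment is enough to get that
    restart-on-failure behaviour for the scheduler:

        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: airflow-scheduler
        spec:
          replicas: 1                   # only ever one scheduler
          selector:
            matchLabels:
              app: airflow-scheduler
          template:
            metadata:
              labels:
                app: airflow-scheduler
            spec:
              containers:
                - name: scheduler
                  # hypothetical image from the private registry
                  image: my-registry.example.com/airflow:latest
                  command: ["airflow", "scheduler"]

    If the process dies the kubelet restarts the container, and if the node
    dies the Deployment reschedules the pod on another node.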
    
    - Andy Cooper
    
    On Fri, Jun 8, 2018 at 7:24 AM Sam Sen <sxs@xxxxxxxxxxxxxxxx> wrote:
    
    > We are facing this now. We have tried the CeleryExecutor and it adds more
    > moving parts. While we have not thrown out this idea, we are going to give
    > one big beefy box a try.
    >
    > To handle the HA side of things, we are putting the server in an
    > auto-scaling group (we use AWS) with a min and max of 1 server. We deploy
    > from an AMI that has Airflow baked in, and we point the DB config to an
    > RDS instance using service discovery (Consul).
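    >
    > To make that concrete (the hostname and credentials below are placeholders,
    > not our real ones), the Airflow config on the box just points at a Consul
    > DNS name that resolves to the RDS endpoint, so a replacement instance from
    > the auto-scaling group comes up with the same settings:
    >
    >     [core]
    >     executor = LocalExecutor
    >     # "airflow-db" is the service registered in Consul; its DNS name
    >     # resolves to the current RDS endpoint
    >     sql_alchemy_conn = postgresql+psycopg2://airflow:CHANGEME@airflow-db.service.consul:5432/airflow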
    >
    > As for the DAG code, we can either bake it into the AMI as well or install
    > it at boot. We haven't decided what to do here, but either way we realize
    > it could take a few minutes to fully recover in the event of a
    > catastrophe.
    >
    > The other option is to have a standby server if using Celery isn't ideal.
    > For that, I have tried using HashiCorp Nomad to handle the services. In my
    > limited trial it did what we wanted, but we need more time to test.
    >
    > On Fri, Jun 8, 2018, 4:23 AM Naik Kaxil <k.naik@xxxxxxxxx> wrote:
    >
    > > Hi guys,
    > >
    > >
    > >
    > > I have 2 specific questions for those of you using Airflow in production:
    > >
    > >
    > >
    > >    1. How have you achieved high availability? What does the architecture
    > >    look like? Do you replicate the master node as well?
    > >    2. Scale Up vs Scale Out?
    > >       1. Which approach do you prefer: one beefy Airflow VM running the
    > >       worker, scheduler and webserver with the LocalExecutor, or a
    > >       cluster with multiple workers using the CeleryExecutor?
    > >
    > >
    > >
    > > I think this thread should help others with similar questions as well.
    > >
    > >
    > >
    > >
    > >
    > > Regards,
    > >
    > > Kaxil
    > >
    > >
    > >
    > >
    > > Kaxil Naik
    > >
    > > Data Reply
    > > 2nd Floor, Nova South
    > > 160 Victoria Street, Westminster
    > > London SW1E 5LB - UK
    > > phone: +44 (0)20 7730 6000
    > > k.naik@xxxxxxxxx
    > > www.reply.com
    > >
    > >
    >
    
    
    



Kaxil Naik 

Data Reply
2nd Floor, Nova South
160 Victoria Street, Westminster
London SW1E 5LB - UK 
phone: +44 (0)20 7730 6000
k.naik@xxxxxxxxx
www.reply.com