
Re: Will redeploying webserver and scheduler in Kubernetes cluster kill running tasks


Great, I must give pgbouncer a try. Testing on GKE/Cloud SQL I quickly ran
into that limit. The next possible bottleneck might be etcd, as pod creation
is expensive, so if there were a lot of short-lived pods you might run into
issues (e.g. the k8s API refusing connections), or so a Google SRE tells me.

On Thu, Aug 30, 2018 at 8:21 PM Greg Neiheisel <greg@xxxxxxxxxxxxx> wrote:

> Yep, that should work fine. Pgbouncer is pretty configurable, so you can
> play around with different settings for your environment. You can set
> limits on the number of connections you want to the actual database and
> point your AIRFLOW__CORE__SQL_ALCHEMY_CONN to the pgbouncer service. In my
> experience, you can get away with a pretty low number of actual connections
> to postgres. Pgbouncer has some tools to observe the count of clients
> (airflow processes), the number of actual connections to the database, as
> well as the number of waiting clients. You should be able to tune your
> max_connections to the point where you have little to no clients waiting,
> while using a dramatically lower number of actual connections to postgres.
>
> That chart also deploys a sidecar alongside pgbouncer that exports its
> metrics for Prometheus to scrape. Here's an example Grafana dashboard that
> we use to keep an eye on things:
>
> https://github.com/astronomerio/astronomer/blob/master/docker/vendor/grafana/include/pgbouncer-stats.json
>
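
The observation tools mentioned above are pgbouncer's admin console; a sketch of how one might inspect the pool (run against pgbouncer's virtual "pgbouncer" admin database, e.g. `psql -h pgbouncer -p 6432 pgbouncer` -- service name and port are assumptions):

```sql
-- Per-pool connection counts: cl_active / cl_waiting are clients
-- (airflow processes), sv_active / sv_idle are real postgres connections
SHOW POOLS;

-- One row per connected client, useful for spotting who is waiting
SHOW CLIENTS;
```

Tuning would then mean raising default_pool_size until cl_waiting stays near zero.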
> On Thu, Aug 30, 2018 at 2:26 PM Eamon Keane <eamon.keane1@xxxxxxxxx>
> wrote:
>
> > Interesting, Greg. Do you know if using pg_bouncer would allow you to
> > have more than 100 running k8s executor tasks at one time if, e.g.,
> > there is a 100-connection limit on a GCP instance?
> >
> > On Thu, Aug 30, 2018 at 6:39 PM Greg Neiheisel <greg@xxxxxxxxxxxxx>
> > wrote:
> >
> > > Good point Eamon, maxing out connections is definitely something to
> > > look out for. We recently added pgbouncer to our helm charts to pool
> > > connections to the database for all the different airflow processes.
> > > Here's our chart for reference:
> > >
> > > https://github.com/astronomerio/helm.astronomer.io/tree/master/charts/airflow
> > >
> > > On Thu, Aug 30, 2018 at 1:17 PM Kyle Hamlin <hamlin.kn@xxxxxxxxx>
> > > wrote:
> > >
> > > > Thanks for your responses! Glad to hear that tasks can run
> > > > independently if something happens.
> > > >
> > > > On Thu, Aug 30, 2018 at 1:13 PM Eamon Keane <eamon.keane1@xxxxxxxxx>
> > > > wrote:
> > > >
> > > > > Adding to Greg's point, if you're using the k8s executor and for
> > > > > some reason the k8s executor worker pod fails to launch within 120
> > > > > seconds (e.g. pending due to scaling up a new node), this counts as
> > > > > a task failure. Also, if the k8s executor pod has already launched
> > > > > a pod operator but is killed (e.g. manually or due to a node
> > > > > upgrade), the pod operator it launched is not killed and runs to
> > > > > completion, so if using retries you need to ensure idempotency. Per
> > > > > my understanding, the worker pods update the db, each requiring a
> > > > > separate connection to the db; this can tax your connection budget
> > > > > (100-300 for small postgres instances on gcp or aws).
> > > > >
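
The idempotency requirement above can be sketched as follows; the completion store and all names are hypothetical (in practice it would be a database table or an object-store marker, not an in-memory set, and none of this is Airflow API):

```python
# Idempotency sketch for retried k8s executor tasks: skip work that a
# previous (possibly orphaned) attempt already completed, keyed on a
# deterministic run identifier. The in-memory set stands in for durable
# state; all names here are illustrative.

_completed = set()

def run_idempotent(task_id, execution_date, work):
    key = f"{task_id}:{execution_date}"
    if key in _completed:      # an earlier attempt already finished this run
        return "skipped"
    result = work()            # perform the actual side effect
    _completed.add(key)        # record completion (atomically, in practice)
    return result

# A second attempt for the same logical run becomes a no-op:
first = run_idempotent("load", "2018-08-30", lambda: "loaded")
second = run_idempotent("load", "2018-08-30", lambda: "loaded")
```

With this shape, a pod operator that ran to completion after its parent was killed does not repeat its side effect when the task is retried.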
> > > > > On Thu, Aug 30, 2018 at 6:04 PM Greg Neiheisel <greg@xxxxxxxxxxxxx>
> > > > > wrote:
> > > > >
> > > > > > Hey Kyle, the task pods will continue to run even if you reboot
> > > > > > the scheduler and webserver, and the status does get updated in
> > > > > > the airflow db, which is great.
> > > > > >
> > > > > > I know the scheduler subscribes to the Kubernetes watch API to
> > > > > > get an event stream of pods completing, and it keeps a checkpoint
> > > > > > so it can resubscribe when it comes back up.
> > > > > >
> > > > > > I forget if the worker pods update the db or if the scheduler is
> > > > > > doing that, but it should work out.
> > > > > >
> > > > > > On Thu, Aug 30, 2018, 9:54 AM Kyle Hamlin <hamlin.kn@xxxxxxxxx>
> > > > > > wrote:
> > > > > >
> > > > > > > gentle bump
> > > > > > >
> > > > > > > On Wed, Aug 22, 2018 at 5:12 PM Kyle Hamlin
> > > > > > > <hamlin.kn@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > > I'm about to make the switch to Kubernetes with Airflow, but
> > > > > > > > am wondering what happens when my CI/CD pipeline redeploys
> > > > > > > > the webserver and scheduler and there are still long-running
> > > > > > > > tasks (pods). My intuition is that since the database holds
> > > > > > > > all state, the tasks are in charge of updating their own
> > > > > > > > state, and the UI only renders what it sees in the database,
> > > > > > > > this is not so much of a problem. To be sure, however, here
> > > > > > > > are my questions:
> > > > > > > >
> > > > > > > > Will task pods continue to run?
> > > > > > > > Can task pods continue to poll the external system they are
> > > > > > > > running tasks on while being "headless"?
> > > > > > > > Can the task pods change/update state in the database while
> > > > > > > > being "headless"?
> > > > > > > > Will the UI/scheduler still be aware of the tasks (pods) once
> > > > > > > > they are live again?
> > > > > > > >
> > > > > > > > Is there anything else that might cause issues when deploying
> > > > > > > > while tasks (pods) are running that I'm not thinking of here?
> > > > > > > >
> > > > > > > > Kyle Hamlin
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Kyle Hamlin
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Kyle Hamlin
> > > >
> > >
> > >
> > > --
> > > *Greg Neiheisel* / CTO Astronomer.io
> > >
> >
>
>
> --
> *Greg Neiheisel* / CTO Astronomer.io
>