Re: Using Too Many Airflow Variables in Dag is Good thing ?


On top of that, we can expire the cache after a small multiple of the
scheduler run time (say 5 or 10 times the duration of one scheduler run).
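
To make the expiry idea concrete, here is a minimal sketch of a time-based
cache. It is only an illustration under assumptions: TTLCache, loader and the
clock parameter are hypothetical names, not Airflow APIs, and a plain dict
stands in for Redis or memcached. An in-process dict would not outlive the DAG
parsing sub-process, so a real version would need the external store.

```python
import time


class TTLCache:
    """Tiny time-based cache; a dict stands in for Redis or memcached."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader    # hypothetical: e.g. a function wrapping Variable.get
        self._ttl = ttl_seconds  # how long a cached value stays valid
        self._clock = clock      # injectable so expiry can be tested
        self._store = {}         # key -> (value, expiry time)

    def get(self, key):
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now < hit[1]:
            return hit[0]        # still fresh: no database round trip
        value = self._loader(key)  # missing or expired: load once
        self._store[key] = (value, now + self._ttl)
        return value
```

With an assumed scheduler run of 30 seconds, a timeout of 5 runs would be
TTLCache(loader, ttl_seconds=5 * 30). Whether this actually reduces load
would still need to be benchmarked.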

On Mon 22 Oct, 2018, 16:27 Sai Phanindhra, <phani8996@xxxxxxxxx> wrote:

> That's true, but variables won't change very frequently. We can cache these
> variables somewhere outside the Airflow ecosystem, such as Redis or
> memcached. Since queries to these stores are fast, we can reduce latency and
> decrease the number of connections to the main database. This whole
> assumption needs to be benchmarked to prove the point, but I feel it's worth
> a try.
>
> On Mon 22 Oct, 2018, 15:47 Ash Berlin-Taylor, <ash@xxxxxxxxxx> wrote:
>
>> Cache them where? When would it get invalidated? Given the DAG parsing
>> happens in a sub-process how would the cache live longer than that process?
>>
>> I think the change might be to use a per-process/per-thread SQLA
>> connection when parsing dags, so that if a DAG needs access to the metadata
>> DB it does it with just one connection rather than N.
>>
>> -ash
>>
>> > On 22 Oct 2018, at 11:11, Sai Phanindhra <phani8996@xxxxxxxxx> wrote:
>> >
>> > Why don't we cache variables? We can fairly assume that variables won't
>> > change very frequently (not as frequently as the scheduler's DAG run
>> > time). We can set the default timeout to a few times the scheduler run
>> > time. This will help control the number of connections to the database
>> > and reduce load on both the scheduler and the database.
>> >
>> > On Mon 22 Oct, 2018, 13:34 Marcin Szymański, <ms32035@xxxxxxxxx> wrote:
>> >
>> >> Hi
>> >>
>> >> You are right, it's a sure way to saturate DB connections, as a
>> >> connection is established every few seconds when the DAGs are parsed.
>> >> The same happens when you use variables in __init__ of an operator. An
>> >> OS environment variable would be safer for your needs.
>> >>
>> >> Marcin
>> >>
>> >>
>> >> On Mon, 22 Oct 2018, 08:34 Pramiti Goel, <pramitigoel20@xxxxxxxxx> wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> We want to make the owner and email ID generic, so we don't want to
>> >>> hard-code them in the Airflow DAG. Using variables will let us change
>> >>> the email/owner later if there are a lot of DAGs with the same owner.
>> >>>
>> >>> For example:
>> >>>
>> >>>
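>> >>> from datetime import datetime, timedelta
>> >>>
>> >>> from airflow.models import Variable
>> >>>
>> >>> # Each Variable.get() call queries the metadata DB when this file is parsed.
>> >>> default_args = {
>> >>>     'owner': Variable.get('test_owner_de'),
>> >>>     'depends_on_past': False,
>> >>>     'start_date': datetime(2018, 10, 17),
>> >>>     'email': Variable.get('de_infra_email'),
>> >>>     'email_on_failure': True,
>> >>>     'email_on_retry': True,
>> >>>     'retries': 2,
>> >>>     'retry_delay': timedelta(minutes=1),
>> >>> }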
>> >>>
>> >>>
>> >>> Looking at the Airflow code, it opens a connection session every time
>> >>> a variable is read, and then closes it. (Let me know if I have
>> >>> misunderstood.) If many DAGs with variables in default_args run in
>> >>> parallel, each querying the variable table in MySQL, will we hit any
>> >>> limit on the number of SQLAlchemy sessions? Will it make the DAGs
>> >>> slow, since there will be many queries to MySQL for each DAG? Is the
>> >>> above approach good?
>> >>>
>> >>>> using Airflow 1.9
>> >>>
>> >>> Thanks,
>> >>> Pramiti.
>> >>>
>> >>
>>
>>
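
For reference, the OS environment variable suggestion in the thread could
look roughly like the sketch below. It is an illustration only: the names
DAG_OWNER and DAG_ALERT_EMAIL, the fallback values, and the helper
build_default_args are assumptions made for this example, not anything
defined by Airflow. Reading the process environment needs no metadata DB
connection at parse time.

```python
import os
from datetime import datetime, timedelta


def build_default_args(environ=os.environ):
    """Build default_args from environment variables instead of Variable.get."""
    return {
        # DAG_OWNER and DAG_ALERT_EMAIL are hypothetical variable names;
        # the fallbacks are placeholders used when they are unset.
        'owner': environ.get('DAG_OWNER', 'airflow'),
        'depends_on_past': False,
        'start_date': datetime(2018, 10, 17),
        'email': environ.get('DAG_ALERT_EMAIL', 'alerts@example.com'),
        'email_on_failure': True,
        'email_on_retry': True,
        'retries': 2,
        'retry_delay': timedelta(minutes=1),
    }


default_args = build_default_args()
```

Changing the owner or email then means updating the environment of the
scheduler and workers rather than editing every DAG file.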