Re: Using Too Many Airflow Variables in Dag is Good thing ?


Redis is not a requirement of Airflow currently, nor should it become a hard requirement either.

Benchmarks are definitely needed before we bring in anything as complex as a cache.

Queries to the variables table _should_ be fast too - even if it's got 1000 rows in it that is tiny by RDBMS standards. If the problem is connection set up and tear down times then we should find that out.
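
A minimal sketch of the kind of measurement that would settle this, assuming a throwaway SQLite file as a stand-in for the metadata DB. The table layout and the function names below are hypothetical and are not Airflow's schema or API. It compares opening a connection per lookup against reusing one connection for all lookups:

```python
import os
import sqlite3
import tempfile
import time


def build_db(path, rows=1000):
    """Create a stand-in variables table with `rows` key/value pairs."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE variable (var_key TEXT PRIMARY KEY, var_val TEXT)")
    conn.executemany(
        "INSERT INTO variable (var_key, var_val) VALUES (?, ?)",
        [("key_%d" % i, "value_%d" % i) for i in range(rows)],
    )
    conn.commit()
    conn.close()


def lookup_new_connection_each_time(path, keys):
    """Open and close a connection for every single lookup."""
    values = []
    for key in keys:
        conn = sqlite3.connect(path)
        try:
            row = conn.execute(
                "SELECT var_val FROM variable WHERE var_key = ?", (key,)
            ).fetchone()
            values.append(row[0])
        finally:
            conn.close()
    return values


def lookup_shared_connection(path, keys):
    """Reuse one connection for all lookups."""
    conn = sqlite3.connect(path)
    try:
        return [
            conn.execute(
                "SELECT var_val FROM variable WHERE var_key = ?", (key,)
            ).fetchone()[0]
            for key in keys
        ]
    finally:
        conn.close()


def timed(fn, *args):
    """Return (result, elapsed_seconds) for one call of fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "metadata.db")
        build_db(db_path)
        keys = ["key_%d" % i for i in range(200)]
        _, per_lookup = timed(lookup_new_connection_each_time, db_path, keys)
        _, shared = timed(lookup_shared_connection, db_path, keys)
        print("new connection per lookup: %.4fs" % per_lookup)
        print("one shared connection:     %.4fs" % shared)
```

SQLite connections are local and cheap, so the numbers will understate the cost of connecting to a networked MySQL server. A real benchmark would need to run against the actual metadata DB.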

> On 22 Oct 2018, at 11:59, Sai Phanindhra <phani8996@xxxxxxxxx> wrote:
> 
> On top of that, we can expire the cache after a few scheduler runs
> (5 or 10 times the duration of one scheduler run).
> 
> On Mon 22 Oct, 2018, 16:27 Sai Phanindhra, <phani8996@xxxxxxxxx> wrote:
> 
>> That's true. But variables won't change very frequently. We can cache these
>> variables somewhere outside the Airflow ecosystem, in something like Redis or
>> memcached. As queries to these stores are fast, we can reduce the latency and
>> decrease the number of connections to the main database. This whole assumption
>> needs to be benchmarked to prove the point. I feel like it's worth a try.
>> 
>> On Mon 22 Oct, 2018, 15:47 Ash Berlin-Taylor, <ash@xxxxxxxxxx> wrote:
>> 
>>> Cache them where? When would it get invalidated? Given the DAG parsing
>>> happens in a sub-process how would the cache live longer than that process?
>>> 
>>> I think the change might be to use a per-process/per-thread SQLA
>>> connection when parsing dags, so that if a DAG needs access to the metadata
>>> DB it does it with just one connection rather than N.
>>> 
>>> -ash
>>> 
>>>> On 22 Oct 2018, at 11:11, Sai Phanindhra <phani8996@xxxxxxxxx> wrote:
>>>> 
>>>> Why don't we cache variables? We can fairly assume that variables won't get
>>>> changed very frequently (not as frequently as the scheduler's DAG run time). We can
>>>> keep the default timeout at a few times the scheduler run time. This will help
>>>> control the number of connections to the database and reduce load on both the
>>>> scheduler and the database.
>>>> 
>>>> On Mon 22 Oct, 2018, 13:34 Marcin Szymański, <ms32035@xxxxxxxxx> wrote:
>>>> 
>>>>> Hi
>>>>> 
>>>>> You are right, it's a sure way to saturate DB connections, as a connection
>>>>> is established every few seconds when the DAGs are parsed. The same happens
>>>>> when you use variables in __init__ of an operator. An OS environment variable
>>>>> would be safer for your need.
>>>>> 
>>>>> Marcin
>>>>> 
>>>>> 
>>>>> On Mon, 22 Oct 2018, 08:34 Pramiti Goel, <pramitigoel20@xxxxxxxxx> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> We want to make the owner and email ID general, so we don't want to put them in
>>>>>> the Airflow DAG. Using variables will help us change the email/owner
>>>>>> later, if there are a lot of DAGs with the same owner.
>>>>>> 
>>>>>> For example:
>>>>>> 
>>>>>> 
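>>>>>> from datetime import datetime, timedelta
>>>>>>
>>>>>> from airflow.models import Variable
>>>>>>
>>>>>> # Each Variable.get() call runs when the DAG file is parsed.
>>>>>> default_args = {
>>>>>>   'owner': Variable.get('test_owner_de'),
>>>>>>   'depends_on_past': False,
>>>>>>   'start_date': datetime(2018, 10, 17),
>>>>>>   'email': Variable.get('de_infra_email'),
>>>>>>   'email_on_failure': True,
>>>>>>   'email_on_retry': True,
>>>>>>   'retries': 2,
>>>>>>   'retry_delay': timedelta(minutes=1),
>>>>>> }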
>>>>>> 
>>>>>> 
>>>>>> Looking into the Airflow code, it opens a connection session every time
>>>>>> a variable is fetched, and then closes it. (Let me know if I have
>>>>>> misunderstood.) If there are many DAGs with variables in default args running
>>>>>> in parallel and querying the variable table in MySQL, is there any
>>>>>> limit on the number of SQLAlchemy sessions? Will that make the DAGs
>>>>>> slow, as there will be many queries to MySQL for each DAG? Is the above
>>>>>> approach good?
>>>>>> 
>>>>>> We are using Airflow 1.9.
>>>>>> 
>>>>>> Thanks,
>>>>>> Pramiti.
>>>>>> 
>>>>> 
>>> 
>>>
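
Marcin's suggestion of OS environment variables can be sketched roughly as follows. This is only an illustration: the environment variable names `DAG_OWNER` and `DAG_ALERT_EMAIL`, the fallback values, and the helper `build_default_args` are all hypothetical and not part of Airflow. The point is that reading the process environment at parse time does not open a metadata DB connection:

```python
import os
from datetime import datetime, timedelta


def build_default_args(environ=None):
    """Build default_args from environment variables instead of Variable.get().

    `environ` defaults to os.environ; passing a plain dict makes this easy to test.
    DAG_OWNER and DAG_ALERT_EMAIL are hypothetical names chosen for this sketch.
    """
    if environ is None:
        environ = os.environ
    return {
        "owner": environ.get("DAG_OWNER", "airflow"),
        "depends_on_past": False,
        "start_date": datetime(2018, 10, 17),
        "email": environ.get("DAG_ALERT_EMAIL", "alerts@example.com"),
        "email_on_failure": True,
        "email_on_retry": True,
        "retries": 2,
        "retry_delay": timedelta(minutes=1),
    }


# Evaluated at DAG parse time, but without touching the metadata DB.
default_args = build_default_args()
```

The trade-off is that changing the owner or email then means changing the environment of the scheduler and workers, rather than editing a value in the Airflow UI.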