


Transferring files between S3 and GCS


Hello Airflow community,

I'm interested in transferring data between S3 and Google Cloud Storage, on
the scale of hundreds of gigabytes to a few terabytes.

Airflow already has an operator that could be used for this use case:
the S3ToGoogleCloudStorageOperator.
However, looking over its implementation, it appears that all the data to be
transferred actually passes through the machine running Airflow.  That
seems unnecessary to me: it places a heavy load on the Airflow workers, and
throughput is bottlenecked by the workers' bandwidth.  It could even lead to
out-of-disk errors like this one:
<https://stackoverflow.com/questions/52400144/airflow-s3togooglecloudstorageoperator-no-space-left-on-device>

I would much rather use Google Cloud's 'Transfer Service' for this.  That
way the Airflow operator just needs to make an API call and (optionally)
keep polling the API until the transfer is done (this last bit could be
done in a sensor).  The heavy work of performing the transfer is offloaded
to the Transfer Service.
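
As a rough sketch of what "make an API call, then poll" could look like
(not a finished operator): the function and parameter names below are
hypothetical, and the request body follows my understanding of the Storage
Transfer Service's transferJobs.create fields, so it should be checked
against the API reference.  The polling helper takes a caller-supplied
fetch_statuses callable instead of a real client, which keeps the sketch
runnable without credentials.

```python
import datetime
import time


def build_transfer_job_body(project_id, s3_bucket, gcs_bucket,
                            aws_access_key_id, aws_secret_access_key,
                            run_date=None):
    """Build a request body for a one-off S3 -> GCS transfer job.

    Field names are assumed from the Storage Transfer Service REST API;
    using the same start and end date is meant to make the job run once.
    """
    run_date = run_date or datetime.date.today()
    day = {"year": run_date.year, "month": run_date.month, "day": run_date.day}
    return {
        "description": "s3://%s -> gs://%s" % (s3_bucket, gcs_bucket),
        "status": "ENABLED",
        "projectId": project_id,
        "schedule": {"scheduleStartDate": day, "scheduleEndDate": day},
        "transferSpec": {
            "awsS3DataSource": {
                "bucketName": s3_bucket,
                "awsAccessKey": {
                    "accessKeyId": aws_access_key_id,
                    "secretAccessKey": aws_secret_access_key,
                },
            },
            "gcsDataSink": {"bucketName": gcs_bucket},
        },
    }


def wait_for_transfer(fetch_statuses, poll_interval=10.0, timeout=3600.0,
                      sleep=time.sleep):
    """Poll until every operation of the job reports SUCCESS.

    fetch_statuses is a hypothetical callable returning a list of status
    strings (e.g. "IN_PROGRESS", "SUCCESS", "FAILED", "ABORTED"), one per
    transfer operation. An empty list means no operation has started yet.
    """
    deadline = time.monotonic() + timeout
    while True:
        statuses = fetch_statuses()
        if any(s in ("FAILED", "ABORTED") for s in statuses):
            raise RuntimeError("transfer did not succeed: %r" % (statuses,))
        if statuses and all(s == "SUCCESS" for s in statuses):
            return
        if time.monotonic() >= deadline:
            raise TimeoutError("transfer still running after %s s" % timeout)
        sleep(poll_interval)
```

With the Google API Python client, I would expect the body to be submitted
with something like
discovery.build("storagetransfer", "v1").transferJobs().create(body=body).execute(),
and fetch_statuses to wrap a transferOperations().list(...) call.  Either
way, the worker only holds a small JSON request and a polling loop; none of
the transferred bytes pass through it.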

Was it an intentional design decision to avoid using the Google Transfer
Service?  If I create a PR that adds the ability to perform transfers with
the Google Transfer Service, should it:

   - replace the existing operator
   - be an option on the existing operator (i.e., add an argument that
   toggles between 'local worker transfer' and 'google hosted transfer')
   - make a new operator

Thanks,
Conrad Lee