Re: Transferring files between S3 and GCS


  Hello Conrad,

 I am replying because I was the one who originally contributed this
operator and am therefore responsible for its design.

 The actual reason for developing this operator was precisely to avoid the
Google Transfer Service, which we found to be somewhat unreliable: at the
time we took this approach, the GCP resources used to retrieve data from S3
were shared among GCP customers, so performance was lower and could not
really be predicted.
Our main scenario for this operator has so far been retrieving files from
S3 at a certain time in order to do some processing afterwards. The Google
Transfer Service would not guarantee delivery on time (again, at the time I
wrote this operator).

 Unfortunately the performance of the operator depends on the scheduler,
and therefore on the machine (or pod, for that matter) where it is
executed, so you can end up with a failed task due to lack of memory.
This is not a problem of the operator itself but rather of the GCS hook
implementation involved (if I am not mistaken; I am writing from memory
now). I recall that the task would literally keep a copy of the file in
local memory after retrieving it from S3 until the transfer to GCS was
complete.

Ideally you would want a stream of data from one hook to the other, but I
think the GCS hook doesn't support that yet. There is (or at least I recall
something like it) an idea to convert all hooks to the new API provided to
access GCP resources, and I vaguely remember that this new API would
support something like this.
I never got into refactoring the hook, although I considered it. It is
clearly pending work, but a major piece of work, I believe, given what it
involves.
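
To illustrate what streaming from one hook to the other would mean, here is
a minimal sketch, not the actual hook code. It assumes (hypothetically) that
the S3 side can hand out a readable binary stream and the GCS side a writable
one; in-memory buffers stand in for both, and stream_copy and CHUNK_SIZE are
made-up names.

```python
import io

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB per read; an arbitrary illustrative value


def stream_copy(source, sink, chunk_size=CHUNK_SIZE):
    """Copy a readable binary stream to a writable one in fixed-size chunks.

    Only one chunk is held in memory at a time, so memory use is bounded by
    chunk_size rather than by the file size. Returns the bytes copied.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            break
        sink.write(chunk)
        total += len(chunk)
    return total


# Stand-ins for the hypothetical S3 download and GCS upload streams.
source = io.BytesIO(b"x" * 1000)
sink = io.BytesIO()
copied = stream_copy(source, sink, chunk_size=64)
```

With this shape the worker's memory footprint depends on the chunk size and
not on the size of the object being transferred.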

I think that replacing the current operator is not a good idea; adding an
option to the operator to use the Google Transfer Service is better. Note,
though, that it may leave some DAGs stuck for hours, so some output about
the transfer status would probably be good from the user's perspective.
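
As a rough sketch of the kind of status output meant here: a polling loop
that logs the state on every check and gives up after a timeout. The
get_status callable and the status strings are placeholders for
illustration, not the Transfer Service API.

```python
import time


def wait_for_transfer(get_status, poll_interval=30.0, timeout=3600.0,
                      sleep=time.sleep, log=print):
    """Poll get_status() until it returns a terminal state, logging each check.

    get_status is a hypothetical callable returning "IN_PROGRESS", "SUCCESS"
    or "FAILED". Raises TimeoutError once `timeout` seconds have been waited
    without reaching a terminal state.
    """
    waited = 0.0
    while True:
        status = get_status()
        log(f"transfer status after {waited:.0f}s: {status}")
        if status in ("SUCCESS", "FAILED"):
            return status
        if waited >= timeout:
            raise TimeoutError(f"transfer still {status} after {waited:.0f}s")
        sleep(poll_interval)
        waited += poll_interval


# Demo with a canned sequence of statuses and no real sleeping.
statuses = iter(["IN_PROGRESS", "IN_PROGRESS", "SUCCESS"])
messages = []
result = wait_for_transfer(lambda: next(statuses), poll_interval=1.0,
                           sleep=lambda seconds: None, log=messages.append)
```

In an operator or sensor the log callable would be the task logger, so a
DAG that waits for hours at least shows what it is waiting for.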

/Guillermo


On Fri, Oct 19, 2018 at 2:22 PM Chris Fei <cfei18@xxxxxxxxx> wrote:

> I ran into the same issue and ended up building a separate operator that
> works as you describe, though I haven't submitted it as a PR. Happy to
> share my implementation with you.
> I found that it's useful to have both ways of transferring data.
> Initially, I migrated all of my S3ToGCS tasks to use the transfer
> service, but I found that its performance can be unreliable with some
> combination of 1) transferring smaller datasets and 2) invoking many
> transfers in parallel. The transfer service is a bit of a black box, so
> when it doesn't work as expected you're stuck. Because of this, I ended
> up migrating some of my tasks to the original implementation. I would
> definitely keep both options around--I don't think I have a preference
> between new operator vs a param on the existing operator.
> Chris
>
>
> On Fri, Oct 19, 2018, at 7:09 AM, Conrad Lee wrote:
> > Hello Airflow community,
> >
> > I'm interested in transferring data between S3 and Google Cloud
> > Storage.  I want to transfer data on the scale of hundreds of gigabytes
> > to a few terabytes.
> >
> > Airflow already has an operator that could be used for this use-case:
> > the S3ToGoogleCloudStorageOperator.  However, looking over its
> > implementation it appears that all the data to be transferred actually
> > passes through the machine running airflow.  That seems completely
> > unnecessary to me, and will place a lot of burden on the airflow
> > workers and will be bottlenecked by the bandwidth of the workers.  It
> > could even lead to out of disk errors like this one
> > <https://stackoverflow.com/questions/52400144/airflow-s3togooglecloudstorageoperator-no-space-left-on-device>.
> >
> > I would much rather use Google Cloud's 'Transfer Service' for doing
> > this--that way the airflow operator just needs to make an API call and
> > (optionally) keep polling the API until the transfer is done (this
> > last bit could be done in a sensor).  The heavy work of performing the
> > transfer is offloaded to the Transfer Service.
> >
> > Was it an intentional design decision to avoid using the Google
> > Transfer Service?  If I create a PR that adds the ability to perform
> > transfers with the Google Transfer Service, should it
> >
> >   - replace the existing operator
> >   - be an option on the existing operator (i.e., add an argument that
> >     toggles between 'local worker transfer' and 'google hosted
> >     transfer')
> >   - make a new operator
> >
> > Thanks,
> > Conrad Lee