I am running Beam with the DataflowRunner and want to do 3 tasks:
- Read an image from GCS
- Process the image (data augmentation)
- Serialize the image to a string
I could do all of this in a single DoFn, or I could split it into three DoFns, one per task. I don't know which is better under the Beam model. Here are some thoughts:
- Doing it in a single DoFn might waste concurrency: with separate stages, one stage could be reading an image while another is processing the previous one.
- Doing it in multiple DoFns might mean sending the images through the network, increasing latency.
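To make the two options concrete, here is a minimal sketch of both layouts. It is not my actual pipeline: the function names are made up for illustration, `augment_image` is only a placeholder (real augmentation would decode and transform the pixels), and I am assuming base64 as the string serialization.

```python
import base64


def read_image(path: str) -> bytes:
    """Read raw image bytes from a path such as gs://<bucket>/<object>."""
    # Imported lazily so the pure helpers below work without Beam installed.
    from apache_beam.io.filesystems import FileSystems

    with FileSystems.open(path) as f:
        return f.read()


def augment_image(image_bytes: bytes) -> bytes:
    """Placeholder augmentation (assumption): just reverses the bytes."""
    return image_bytes[::-1]


def serialize_image(image_bytes: bytes) -> str:
    """Serialize bytes to an ASCII string (base64 is an assumption)."""
    return base64.b64encode(image_bytes).decode("ascii")


def process_all(path: str) -> str:
    """Option 1: all three tasks inside a single step."""
    return serialize_image(augment_image(read_image(path)))


def build_pipeline(paths, pipeline_options=None, split=True):
    """Build either the three-step or the single-step pipeline."""
    import apache_beam as beam

    pipeline = beam.Pipeline(options=pipeline_options)
    inputs = pipeline | "Paths" >> beam.Create(paths)
    if split:
        # Option 2: one step per task.
        _ = (
            inputs
            | "Read" >> beam.Map(read_image)
            | "Augment" >> beam.Map(augment_image)
            | "Serialize" >> beam.Map(serialize_image)
        )
    else:
        _ = inputs | "ProcessAll" >> beam.Map(process_all)
    return pipeline
```

Calling `build_pipeline(paths, options).run()` with `split=True` or `split=False` would then run one layout or the other on the same inputs.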
Sorry if these questions are very basic; I am still trying to get my head around this. The pipeline I currently have processes about 15 imgs/sec, which seems really slow. Dataflow suggests that I increase some quotas to enable around 400 workers (is this overkill?).
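To put the 15 imgs/sec figure in perspective, here is a rough back-of-the-envelope calculation. The dataset size is a hypothetical number picked only for illustration, not the real size of my dataset, and it assumes the rate stays constant.

```python
def hours_to_process(num_images: int, images_per_second: float) -> float:
    """Wall-clock hours needed at a constant throughput."""
    return num_images / images_per_second / 3600.0


# Hypothetical dataset size (assumption, for illustration only).
HYPOTHETICAL_NUM_IMAGES = 1_000_000

# Observed throughput from the current pipeline.
CURRENT_RATE = 15.0

print(f"{hours_to_process(HYPOTHETICAL_NUM_IMAGES, CURRENT_RATE):.1f} hours")
```

At the current rate, a million images would take about 18.5 hours.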