Re: Python: Single vs Multiple DoFns for Image Processing

I would write these as three separate DoFns; they will get fused together to minimize IO. 

400 workers may not be overkill, depending on how many images you have. Is Dataflow not scaling up and sharing the work? Where is your list of images coming from?

I am running Beam with the DataflowRunner and want to do 3 tasks:
  1. Read an image from GCS
  2. Process the image (data augmentation)
  3. Serialize the image to a string
I could do all this in a single DoFn, but I could also split it into these 3 stages. I don't know what would be better given the Beam model. Here are some thoughts:
  • Doing it in a single DoFn might waste concurrency, e.g. with separate stages one could be reading an image while another does the processing.
  • Doing it in multiple DoFns might mean sending the images through the network, increasing latency.
Sorry if these questions are very basic; I am trying to get my head around this. The pipeline I currently have processes about 15 imgs/sec, which seems really slow. Dataflow suggests that I increase some quotas to enable around 400 workers (is this overkill?).