Python: Single vs Multiple DoFns for Image Processing


I am running Beam with the DataflowRunner and want to do 3 tasks:
  1. Read an image from GCS
  2. Process the image (data augmentation)
  3. Serialize the image to a string
I could do all of this in a single DoFn, or I could split it into three DoFns, one per task. I don't know which would be better given the Beam model. Here are some thoughts:
  • Doing it in a single DoFn might waste concurrency: with separate stages, one stage could be reading an image while another does the processing.
  • Doing it in multiple DoFns might mean sending the images through the network, increasing latency.
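For concreteness, here is a minimal sketch of the two layouts being compared. All function names (read_image, augment, serialize, process_all, build_pipeline) are hypothetical, the augmentation is a placeholder that returns the bytes unchanged, and base64 is only an assumed way of turning the bytes into a string. It uses beam.Map with plain functions rather than DoFn classes to keep the sketch short.

```python
import base64


def read_image(path):
    """Stage 1: read raw image bytes from a path such as gs://bucket/img.jpg."""
    # Imported lazily so the other helpers can be used without Beam installed.
    from apache_beam.io.filesystems import FileSystems
    f = FileSystems.open(path)
    try:
        return f.read()
    finally:
        f.close()


def augment(data):
    """Stage 2: placeholder for data augmentation (assumption: bytes in, bytes out)."""
    return data


def serialize(data):
    """Stage 3: encode the image bytes as an ASCII string (assumed base64)."""
    return base64.b64encode(data).decode("ascii")


def process_all(path, reader=read_image):
    """All three tasks in one call, i.e. the single-step layout."""
    return serialize(augment(reader(path)))


def build_pipeline(pipeline, paths, single_step=True):
    """Attach either layout to an existing beam.Pipeline and return the output."""
    import apache_beam as beam
    files = pipeline | "Paths" >> beam.Create(paths)
    if single_step:
        return files | "ReadAugmentSerialize" >> beam.Map(process_all)
    return (files
            | "Read" >> beam.Map(read_image)
            | "Augment" >> beam.Map(augment)
            | "Serialize" >> beam.Map(serialize))
```

Both layouts produce the same output. Note that a runner is free to fuse adjacent steps into one stage, so the split layout does not by itself imply that images cross the network between steps.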
Sorry if these questions are very basic. I am trying to get my head around this. The pipeline I currently have processes about 15 images/sec, which seems really slow. Dataflow suggests that I increase some quotas to enable around 400 workers (is this overkill?)