Re: Apache beam DataFlow runner throwing setup error

Hi Rajesh,

Have you looked at the worker-startup logs [1]? The setup error should be visible there. It is possible that something in your requirements file is failing to install on the workers. If that is the case, see Managing Python Pipeline Dependencies [2] for alternative options. You could also reach out to Google Cloud Dataflow support for additional help [3].
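
In case it helps, here is a minimal sketch for checking locally whether everything in requirements.txt can be fetched as a source distribution, which I believe is roughly what the SDK does when it stages packages for --requirements_file. The helper names and the exact pip flags are my assumptions, not something taken from your setup or from the Beam source.

```python
import subprocess
import sys
import tempfile


def build_download_command(requirements_path, dest_dir):
    """Build a pip command that downloads source distributions only."""
    return [
        sys.executable, "-m", "pip", "download",
        "--dest", dest_dir,
        "-r", requirements_path,
        # Assumption: the workers install from source packages, so a
        # requirement that has no source distribution would fail here.
        "--no-binary", ":all:",
    ]


def check_requirements(requirements_path):
    """Return (ok, pip_output) for a trial download of the requirements."""
    with tempfile.TemporaryDirectory() as dest_dir:
        result = subprocess.run(
            build_download_command(requirements_path, dest_dir),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    return result.returncode == 0, result.stdout


if __name__ == "__main__":
    ok, output = check_requirements("requirements.txt")
    print("requirements look downloadable" if ok else output)
```

If the trial download fails for a package, the pip output should name it, and that package would be the first thing to check against the worker-startup log.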

On Thu, Mar 22, 2018 at 10:08 PM, Rajesh Hegde <rhegde@xxxxxxxxxxxxxxx> wrote:
We are building a data pipeline using the Beam Python SDK and trying to run it on Dataflow, but we are getting the error below:

A setup error was detected in beamapp-xxxxyyyy-0322102737-03220329-8a74-harness-lm6v. Please refer to the worker-startup log for detailed information.

However, we could not find the detailed worker-startup logs.

We tried increasing the memory size, the worker count, etc., but we still get the same error.

Here is the command we use,
python run.py \
--project=xyz \
--runner=DataflowRunner \
--staging_location=gs://xyz/staging \
--temp_location=gs://xyz/temp \
--requirements_file=requirements.txt \
--worker_machine_type n1-standard-8 \
--num_workers 2

pipeline snippet

# p is assumed to be the beam.Pipeline object created earlier in run.py
data = p | "load data" >> beam.io.Read(
    beam.io.BigQuerySource(query="SELECT * FROM abc_table LIMIT 100"))

data | "filter data" >> beam.Filter(lambda x: x.get('column_name') == value)

The pipeline above just loads data from BigQuery and filters it on a column value. It works like a charm with DirectRunner but fails on Dataflow.

Are we making an obvious setup mistake? Is anyone else getting the same error? We could use some help resolving the issue.


Rajesh Hegde | Lead Product Developer | Datalicious