
Re: BigqueryIO field clustering

Thanks for the contribution. I can take a look later this week.

On Wed, Nov 28, 2018 at 12:29 AM Wout Scheepers <Wout.Scheepers@xxxxxxxxxxxxxxxxxxx> wrote:

Hey all,


Almost two weeks ago, I created a PR to support BigQuery clustering [1].

Can someone please have a look?





1: https://github.com/apache/beam/pull/7061



From: Lukasz Cwik <lcwik@xxxxxxxxxx>
Reply-To: "user@xxxxxxxxxxxxxxx" <user@xxxxxxxxxxxxxxx>
Date: Wednesday, 29 August 2018 at 18:32
To: dev <dev@xxxxxxxxxxxxxxx>, "user@xxxxxxxxxxxxxxx" <user@xxxxxxxxxxxxxxx>
Cc: Bob De Schutter <Bob.DeSchutter@xxxxxxxxxxxxxxxxxxx>
Subject: Re: BigqueryIO field clustering



Wout, I assigned this task to you since it seems like you're interested in contributing.

The Apache Beam contribution guide[1] is a good place to start for answering questions on how to contribute.


If you need help getting changes reviewed or have questions, feel free to reach out on dev@xxxxxxxxxxxxxxx or on Slack.




On Wed, Aug 29, 2018 at 1:28 AM Wout Scheepers <Wout.Scheepers@xxxxxxxxxxxxxxxxxxx> wrote:

Hey all,


I’m trying to use the field clustering beta feature in BigQuery [1].

However, the BigQuery API service dependency currently used by the Beam/Dataflow worker is ‘com.google.apis:google-api-services-bigquery:v2-rev374-1.23.0’, which does not include the clustering option in the TimePartitioning class.

As a result, I can’t specify the clustering field when loading/streaming into BigQuery. See [2] for the BigQuery API error details.
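
For illustration, here is a rough sketch of the load-job configuration that the error in [2] seems to be asking for. It assumes the BigQuery REST API v2 field names `timePartitioning` and `clustering` on the load configuration. The helper name `build_load_config` and the project, dataset, and table identifiers are hypothetical placeholders; only `publish_time` and `clustering_id` come from the error message. This is not something the Beam version discussed here can produce.

```python
import json


def build_load_config(project, dataset, table, partition_field, clustering_fields):
    """Build a load-job configuration dict with day partitioning and clustering.

    Hypothetical helper for illustration only; it mirrors the JSON shape of a
    BigQuery v2 load configuration and does not call any API.
    """
    return {
        "load": {
            "destinationTable": {
                "projectId": project,
                "datasetId": dataset,
                "tableId": table,
            },
            # Partition by day on the given column.
            "timePartitioning": {"type": "DAY", "field": partition_field},
            # Clustering is a sibling of timePartitioning, not nested inside it.
            "clustering": {"fields": list(clustering_fields)},
        }
    }


# Placeholder identifiers; replace with real ones.
config = build_load_config(
    "my-project", "my_dataset", "my_table", "publish_time", ["clustering_id"]
)
print(json.dumps(config, indent=2))
```

A job submitted without the `clustering` entry against a table created with it would presumably be rejected with the mismatch reported in [2].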


Does anyone know a workaround for this?


I guess that in the worst case I’ll have to wait until Beam supports a newer version of the bigquery api service.

After checking the Beam Jira I’ve found BEAM-5191. Is there any way I can help to push this forward and make this feature possible in the near future?


Thanks in advance,



[1] https://cloud.google.com/bigquery/docs/clustered-tables

[2] "errorResult" : {

      "message" : "Incompatible table partitioning specification. Expects partitioning specification interval(type:day,field:publish_time) clustering(clustering_id), but input partitioning specification is interval(type:day,field:publish_time)",

      "reason" : "invalid"