[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: ElasticsearchIO bulk delete

> we decided to postpone the feature

That makes sense.

I believe the ES6 branch is in-part working (I've looked at the code but not used it) which you can see here [1] and the jira to watch or contribute is [2]. It would be a useful addition to test independently and report any observations or improvement requests on that jira.

The offer to assist in your first PR remains open for the future - please don't hesitate to ask.


On Mon, Jul 30, 2018 at 10:55 AM, Wout Scheepers <Wout.Scheepers@xxxxxxxxxxxxxxxxxxx> wrote:

Hey Tim,


Thanks for your proposal to mentor me through my first PR.

As we’re definitely planning to upgrade to ES6 when Beam supports it, we decided to postpone the feature (we have a fix that works for us, for now).

When Beam supports ES6, I’ll be happy to make a contribution to get bulk deletes working.


For reference, I opened a ticket (https://issues.apache.org/jira/browse/BEAM-5042).






From: Tim Robertson <timrobertson100@xxxxxxxxx>
Reply-To: "user@xxxxxxxxxxxxxxx" <user@xxxxxxxxxxxxxxx>
Date: Friday, 27 July 2018 at 17:43
To: "user@xxxxxxxxxxxxxxx" <user@xxxxxxxxxxxxxxx>
Subject: Re: ElasticsearchIO bulk delete


Hi Wout,


This is great, thank you. I wrote the partial update support you reference and I'll be happy to mentor you through your first PR - welcome aboard. Can you please open a Jira to reference this work and we'll assign it to you?


We discussed having the "_xxx" fields in the document and triggering actions based on that in the partial update jira but opted to avoid it. Based on that discussion the ActionFn would likely be the preferred approach.  Would that be possible?


It will be important to provide unit and integration tests as well.


Please be aware that there is a branch and work underway for ES6 already which is rather different on the write() path so this may become redundant rather quickly.





@timrobertson100 on the Beam slack channel




On Fri, Jul 27, 2018 at 2:53 PM, Wout Scheepers <Wout.Scheepers@vente-exclusive.com> wrote:

Hey all,


A while ago, I patched ElasticsearchIO to be able to do partial updates and deletes.

However, I did not consider my patch pull-request-worthy as the json parsing was done inefficient (parsed it twice per document).


Since Beam 2.5.0 partial updates are supported, so the only thing I’m missing is the ability to send bulk delete requests.

We’re using entity updates for event sourcing in our data lake and need to persist deleted entities in elastic.

We’ve been using my patch in production for the last year, but I would like to contribute to get the functionality we need into one of the next releases.


I’ve created a gist that works for me, but is still inefficient (parsing twice: once to check the ‘_action` field, once to get the metadata).

Each document I want to delete needs an additional ‘_action’ field with the value ‘delete’. It doesn’t matter the document still contains the redundant field, as the delete action only requires the metadata.

I’ve added the method isDelete() and made some changes to the processElement() method.



I would like to make my solution more generic to fit into the current ElasticsearchIO and create a proper pull request.

As this would be my first pull request for beam, can anyone point me in the right direction before I spent too much time creating something that will be rejected?


Some questions on the top of my mind are:

  • Is it a good idea it to make the ‘action’ part for the bulk api generic?
  • Should it be even more generic? (e.g.: set an ‘ActionFn’ on the ElasticsearchIO)
  • If I want to avoid parsing twice, the parsing should be done outside of the getDocumentMetaData() method. Would this be acceptable?
  • Is it possible to avoid passing the action as a field in the document?
  • Is there another or better way to get the delete functionality in general?


All feedback is more than welcome.