osdir.com


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument


+1 for explicit --runner_option=param=val,... 
It's hard to tell otherwise where an option is going to,

On Mon, Oct 15, 2018 at 8:04 AM Robert Bradshaw <robertwb@xxxxxxxxxx> wrote:
On Mon, Oct 15, 2018 at 3:58 PM Maximilian Michels <mxm@xxxxxxxxxx> wrote:
>
> I agree that the current approach breaks the pipeline options contract
> because "unknown" options get parsed in the same way as options which
> have been defined by the user.

FWIW, I think we're already breaking this "contract." Unknown options
are silently ignored; with this change we just change how we record
them. It still feels a bit hacky though.

> I'm not sure the `experiments` flag works for us. AFAIK it only allows
> true/false flags. We want to pass all types of pipeline options to the
> Runner.

Experiments is an arbitrary set of strings, which can be of the form
"param=value" if that's useful. (Dataflow does this.) There is, again,
no namespacing on the param names, but we could user urns or impose
some other structure here.

> How to solve this?
>
> 1) Add all options of all Runners to each SDK
> We added some of the FlinkRunner options to the Python SDK but realized
> syncing is rather cumbersome in the long term. However, we want the most
> important options to be validated on the client side.

I don't think this is sustainable in the long run. However, thinking
about this, in the worse case validation happens after construction
but before execution (as with much of our other validation) so it
isn't that bad.

> 2) Pass "unknown" options via a separate list in the Proto which can
> only be accessed internally by the Runners. This still allows passing
> arbitrary options but we wouldn't leak unknown options and display them
> as top-level options.

I think there needs to be a way for the user to communicate values
directly to the runner regardless of the SDK. My preference would be
to make this explicit, e.g. (repeated) --runner_option=..., rather
than scooping up all unknown flags at command line parsing time.
Perhaps an SDK that is aware of some runners could choose to lift
these as top-level options, but still pass them as runner options.

> On 13.10.18 02:34, Charles Chen wrote:
> > The current release branch
> > (https://github.com/apache/beam/commits/release-2.8.0) was cut after the
> > revert went in.  Sent out https://github.com/apache/beam/pull/6683 as a
> > revert of the revert.  Regarding your comment above, I can help out with
> > the design / PR reviews for common Python code as you suggest.
> >
> > On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise <thw@xxxxxxxxxx
> > <mailto:thw@xxxxxxxxxx>> wrote:
> >
> >     Thanks, will tag you and looking forward to feedback so we can
> >     ensure that changes work for everyone.
> >
> >     Looking at the PR, I see agreement from Max to revert the change on
> >     the release branch, but not in master. Would you mind to restore it
> >     in master?
> >
> >     Thanks
> >
> >     On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay <altay@xxxxxxxxxx
> >     <mailto:altay@xxxxxxxxxx>> wrote:
> >
> >
> >
> >         On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen <ccy@xxxxxxxxxx
> >         <mailto:ccy@xxxxxxxxxx>> wrote:
> >
> >             What I mean is that a user may find that it works for them
> >             to pass "--myarg blah" and access it as "options.myarg"
> >             without explicitly defining a "my_arg" flag due to the added
> >             logic.  This is not the intended behavior and we may want to
> >             change this implementation detail in the future.  However,
> >             having this logic in a released version makes it hard to
> >             change this behavior since users may erroneously depend on
> >             this undocumented behavior.  Instead, we should namespace /
> >             scope this so that it is obvious that this is meant for
> >             runner (and not Beam user) consumption.
> >
> >             On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise
> >             <thw@xxxxxxxxxx <mailto:thw@xxxxxxxxxx>> wrote:
> >
> >                 Can you please elaborate more what practical problems
> >                 this introduces for users?
> >
> >                 I can see that this change allows a user to specify a
> >                 runner specific option, which in the future may change
> >                 because we decide to scope differently. If this only
> >                 affects users of the portable Flink runner (like us),
> >                 then no need to revert, because at this early stage we
> >                 prefer something that works over being blocked.
> >
> >                 It would also be really great if some of the core Python
> >                 SDK developers could help out with the design aspects
> >                 and PR reviews of changes that affect common Python
> >                 code. Anyone who specifically wants to be tagged on
> >                 relevant JIRAs and PRs?
> >
> >
> >         I would be happy to be tagged, and I can also help with
> >         including other relevant folks whenever possible. In general I
> >         think Robert, Charles, myself are good candidates.
> >
> >
> >                 Thanks
> >
> >
> >                 On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay
> >                 <altay@xxxxxxxxxx <mailto:altay@xxxxxxxxxx>> wrote:
> >
> >
> >
> >                     On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen
> >                     <ccy@xxxxxxxxxx <mailto:ccy@xxxxxxxxxx>> wrote:
> >
> >                         For context, I made comments on
> >                         https://github.com/apache/beam/pull/6600 noting
> >                         that the changes being made were not good for
> >                         Beam backwards-compatibility.  The change as is
> >                         allows users to use pipeline options without
> >                         explicitly defining them, which is not the type
> >                         of usage we would like to encourage since we
> >                         prefer to be explicit whenever possible.  If
> >                         users write pipelines with this sort of pattern,
> >                         they will potentially encounter pain when
> >                         upgrading to a later version since this is an
> >                         implementation detail and not an officially
> >                         supported pattern.  I agree with the comments
> >                         above that this is ultimately a scoping issue.
> >                         I would not have a problem with these changes if
> >                         they were explicitly scoped under either a
> >                         runner or unparsed options namespace.
> >
> >                         As a second note, since the 2.8.0 release is
> >                         being cut right now, because of these
> >                         backwards-compatibility concerns, I would
> >                         suggest reverting these changes, at least until
> >                         2.8.0 is cut, so we can have a discussion here
> >                         before committing to and releasing any API-level
> >                         changes.
> >
> >
> >                     +1 I would like to revert the changes in order not
> >                     rush this into the release. Once this discussion
> >                     results in an agreement changes can be brought back.
> >
> >
> >                         On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde
> >                         <herohde@xxxxxxxxxx <mailto:herohde@xxxxxxxxxx>>
> >                         wrote:
> >
> >                             Agree that pipeline options lack some
> >                             mechanism for scoping. It is also not always
> >                             possible distinguish options meant to be
> >                             consumed at pipeline construction time, by
> >                             the runner, by the SDK harness, by the user
> >                             code or any combination -- and this causes
> >                             confusion every now and then.
> >
> >                             For Dataflow, we have been using
> >                             "experiments" for arbitrary runner-specific
> >                             options. It's simply a string list pipeline
> >                             option that all SDKs support and, for Go at
> >                             least, is sent to portable runners. Flink
> >                             can do the same in the short term to move
> >                             forward.
> >
> >                             Henning
> >
> >
> >                             On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise
> >                             <thw@xxxxxxxxxx <mailto:thw@xxxxxxxxxx>> wrote:
> >
> >                                 [moving to the list]
> >
> >                                 The requirement driving this part of the
> >                                 change was to allow a user to specify
> >                                 pipeline options that a runner supports
> >                                 without having to declare those in each
> >                                 language SDK.
> >
> >                                 In the specific scenario, we have
> >                                 options that the Flink runner supports
> >                                 (and can validate), that are not
> >                                 enumerated in the Python SDK.
> >
> >                                 I think we have a bigger problem scoping
> >                                 pipeline options. For example, the
> >                                 runner options are dumped into the SDK
> >                                 worker. There is also a possibility of
> >                                 name collisions. So I think this would
> >                                 benefit from broader feedback.
> >
> >                                 Thanks,
> >                                 Thomas
> >
> >
> >                                 ---------- Forwarded message ---------
> >                                 From: *Charles Chen*
> >                                 <notifications@xxxxxxxxxx
> >                                 <mailto:notifications@xxxxxxxxxx>>
> >                                 Date: Fri, Oct 12, 2018 at 8:36 AM
> >                                 Subject: Re: [apache/beam] [BEAM-5442]
> >                                 Store duplicate unknown options in a
> >                                 list argument (#6600)
> >                                 To: apache/beam <beam@xxxxxxxxxxxxxxxxxx
> >                                 <mailto:beam@xxxxxxxxxxxxxxxxxx>>
> >                                 Cc: Thomas Weise <thomas.weise@xxxxxxxxx
> >                                 <mailto:thomas.weise@xxxxxxxxx>>,
> >                                 Mention <mention@xxxxxxxxxxxxxxxxxx
> >                                 <mailto:mention@xxxxxxxxxxxxxxxxxx>>
> >
> >
> >                                 CC: @tweise <https://github.com/tweise>
> >
> >                                 —
> >                                 You are receiving this because you were
> >                                 mentioned.
> >                                 Reply to this email directly, view it on
> >                                 GitHub
> >                                 <https://github.com/apache/beam/pull/6600#issuecomment-429367754>,
> >                                 or mute the thread
> >                                 <https://github.com/notifications/unsubscribe-auth/AAQGDwwt15R85eq9pySUisyxq2HYz-Vyks5ukLcLgaJpZM4XMo-T>.
> >
> >
> >

Attachment: smime.p7s
Description: S/MIME Cryptographic Signature