[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

I agree that the current approach breaks the pipeline options contract because "unknown" options get parsed in the same way as options which have been defined by the user.

I'm not sure the `experiments` flag works for us. AFAIK it only allows true/false flags. We want to pass all types of pipeline options to the Runner.

How to solve this?

1) Add all options of all Runners to each SDK
We added some of the FlinkRunner options to the Python SDK but realized syncing is rather cumbersome in the long term. However, we want the most important options to be validated on the client side.

2) Pass "unknown" options via a separate list in the Proto which can only be accessed internally by the Runners. This still allows passing arbitrary options but we wouldn't leak unknown options and display them as top-level options.


On 13.10.18 02:34, Charles Chen wrote:
The current release branch (https://github.com/apache/beam/commits/release-2.8.0) was cut after the revert went in.  Sent out https://github.com/apache/beam/pull/6683 as a revert of the revert.  Regarding your comment above, I can help out with the design / PR reviews for common Python code as you suggest.

On Fri, Oct 12, 2018 at 4:48 PM Thomas Weise <thw@xxxxxxxxxx <mailto:thw@xxxxxxxxxx>> wrote:

    Thanks, will tag you and looking forward to feedback so we can
    ensure that changes work for everyone.

    Looking at the PR, I see agreement from Max to revert the change on
    the release branch, but not in master. Would you mind to restore it
    in master?


    On Fri, Oct 12, 2018 at 4:40 PM Ahmet Altay <altay@xxxxxxxxxx
    <mailto:altay@xxxxxxxxxx>> wrote:

        On Fri, Oct 12, 2018 at 11:31 AM, Charles Chen <ccy@xxxxxxxxxx
        <mailto:ccy@xxxxxxxxxx>> wrote:

            What I mean is that a user may find that it works for them
            to pass "--myarg blah" and access it as "options.myarg"
            without explicitly defining a "my_arg" flag due to the added
            logic.  This is not the intended behavior and we may want to
            change this implementation detail in the future.  However,
            having this logic in a released version makes it hard to
            change this behavior since users may erroneously depend on
            this undocumented behavior.  Instead, we should namespace /
            scope this so that it is obvious that this is meant for
            runner (and not Beam user) consumption.

            On Fri, Oct 12, 2018 at 10:48 AM Thomas Weise
            <thw@xxxxxxxxxx <mailto:thw@xxxxxxxxxx>> wrote:

                Can you please elaborate more what practical problems
                this introduces for users?

                I can see that this change allows a user to specify a
                runner specific option, which in the future may change
                because we decide to scope differently. If this only
                affects users of the portable Flink runner (like us),
                then no need to revert, because at this early stage we
                prefer something that works over being blocked.

                It would also be really great if some of the core Python
                SDK developers could help out with the design aspects
                and PR reviews of changes that affect common Python
                code. Anyone who specifically wants to be tagged on
                relevant JIRAs and PRs?

        I would be happy to be tagged, and I can also help with
        including other relevant folks whenever possible. In general I
        think Robert, Charles, myself are good candidates.


                On Fri, Oct 12, 2018 at 10:20 AM Ahmet Altay
                <altay@xxxxxxxxxx <mailto:altay@xxxxxxxxxx>> wrote:

                    On Fri, Oct 12, 2018 at 10:11 AM, Charles Chen
                    <ccy@xxxxxxxxxx <mailto:ccy@xxxxxxxxxx>> wrote:

                        For context, I made comments on
                        https://github.com/apache/beam/pull/6600 noting
                        that the changes being made were not good for
                        Beam backwards-compatibility.  The change as is
                        allows users to use pipeline options without
                        explicitly defining them, which is not the type
                        of usage we would like to encourage since we
                        prefer to be explicit whenever possible.  If
                        users write pipelines with this sort of pattern,
                        they will potentially encounter pain when
                        upgrading to a later version since this is an
                        implementation detail and not an officially
                        supported pattern.  I agree with the comments
above that this is ultimately a scoping issue. I would not have a problem with these changes if
                        they were explicitly scoped under either a
                        runner or unparsed options namespace.

                        As a second note, since the 2.8.0 release is
                        being cut right now, because of these
                        backwards-compatibility concerns, I would
                        suggest reverting these changes, at least until
                        2.8.0 is cut, so we can have a discussion here
                        before committing to and releasing any API-level

                    +1 I would like to revert the changes in order not
                    rush this into the release. Once this discussion
                    results in an agreement changes can be brought back.

                        On Fri, Oct 12, 2018 at 9:26 AM Henning Rohde
                        <herohde@xxxxxxxxxx <mailto:herohde@xxxxxxxxxx>>

                            Agree that pipeline options lack some
                            mechanism for scoping. It is also not always
                            possible distinguish options meant to be
                            consumed at pipeline construction time, by
                            the runner, by the SDK harness, by the user
                            code or any combination -- and this causes
                            confusion every now and then.

                            For Dataflow, we have been using
                            "experiments" for arbitrary runner-specific
                            options. It's simply a string list pipeline
                            option that all SDKs support and, for Go at
                            least, is sent to portable runners. Flink
                            can do the same in the short term to move


                            On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise
                            <thw@xxxxxxxxxx <mailto:thw@xxxxxxxxxx>> wrote:

                                [moving to the list]

                                The requirement driving this part of the
                                change was to allow a user to specify
                                pipeline options that a runner supports
                                without having to declare those in each
                                language SDK.

                                In the specific scenario, we have
                                options that the Flink runner supports
                                (and can validate), that are not
                                enumerated in the Python SDK.

                                I think we have a bigger problem scoping
                                pipeline options. For example, the
                                runner options are dumped into the SDK
                                worker. There is also a possibility of
                                name collisions. So I think this would
                                benefit from broader feedback.


                                ---------- Forwarded message ---------
                                From: *Charles Chen*
                                Date: Fri, Oct 12, 2018 at 8:36 AM
                                Subject: Re: [apache/beam] [BEAM-5442]
                                Store duplicate unknown options in a
                                list argument (#6600)
                                To: apache/beam <beam@xxxxxxxxxxxxxxxxxx
                                Cc: Thomas Weise <thomas.weise@xxxxxxxxx
                                Mention <mention@xxxxxxxxxxxxxxxxxx

                                CC: @tweise <https://github.com/tweise>

                                You are receiving this because you were
                                Reply to this email directly, view it on
                                or mute the thread