osdir.com


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument


Thank you Robert and Lukasz for your points.

Note that I believe that we will want to have multiple URLs to support cross language pipelines since we will want to be able to ask other SDK languages/versions for their "list" of supported PipelineOptions.

Why is that? The Runner itself is the source of truth for its options. Everything else is SDK-related and should be validated there.

I imagined the process to go like this:

  a) Parse options to find JobServer URL
  a) Retrieve options from JobServer
  c) Parse all options
  ...continue as always...

An option is just represented by a name and a type. There is nothing more to it, at least as of now. So it should be possible to parse them in the SDK without much further work.

Nevertheless, I agree with your sentiment, Robert. The "runner_option" flag would prevent additional complexity. I still don't prefer it because it's not nice from an end user perspective. If we were to implement it, I would definitely go for the "option promotion" which you mentioned.

I hadn't thought about delegating runners, although the PortableRunner is basically a delegating Runner. If that was an important feature, I suppose the "runner_option" would be the preferred way.

All in all, since there doesn't seem to be an excitement to implement JobServer option retrieval and we will need the help of all SDK developers, "runner_option" seems to be the more likely path.

-Max

On 08.11.18 21:50, Lukasz Cwik wrote:
The purpose of the spec would be to provide the names, type and descriptions of the options. We don't need anything beyond the JSON types (string, number, bool, object, list) because the only ambiguity we get is how do we parse command line string into the JSON type (and that ambiguity is actually only between string and non-string since all the other JSON types are unambiguous).

Also, I believe the flow would be
1) Parse options
  a) Find the URL from args specified and/or additional methods on PipelineOptions that exposes a programmatic way to set the URL during parsing.
   b) Query URL for option specs
   c) Parse the remainder of the options
2) Construct pipeline
3) Choose runner
4) Submit job to runner

Note that I believe that we will want to have multiple URLs to support cross language pipelines since we will want to be able to ask other SDK languages/versions for their "list" of supported PipelineOptions.

On Thu, Nov 8, 2018 at 11:51 AM Robert Bradshaw <robertwb@xxxxxxxxxx <mailto:robertwb@xxxxxxxxxx>> wrote:

    There's two questions here:

    (A) What do we do in the short term?

    I think adding every runner option to every SDK is not sustainable
    (n*m work, assuming every SDK knows about every runner), and having a
    patchwork of options that were added as one-offs to SDKs is not
    desirable either. Furthermore, it seems difficult to parse unknown
    options as if they were valid options, so my preference here would be
    to just use a special runner_option flag. (One could also pass a set
    of unparsed/unvalidated runner options to the runner, even if they're
    not distinguished for the user, and runners (or any intermediates)
    could run a "promote" operation that promotes any of these unknowns
    that they recognize to real options before further processing. The
    parsing would be done as repeated-string, and not be intermingled with
    the actually validated options. This is essential a variant of
    option 1.)

    (B) What do do in the long term? While the JobServer approach sounds
    nice, I think it introduces a lot of complexity (we have too much of
    that already) and still doesn't completely solve the problem. In
    particular, it changes the flow from

    1. Parse options
    2. Construct pipeline
    3. Choose runner
    4. Submit job to runner

    to

    1. Parse options
    2. Construct pipeline
    3. Choose runner
    4a. Query runner for option specs
    4b. Re-parse options
    4c. Submit job to runner

    In particular, doing 4b in the SDK rather than just let the runner
    itself do the validation as part of (4) doesn't save much and forces
    us to come up with a (probably incomplete) spec as to how to define
    options, their types, and their validations. It also means that a
    delegating runner must choose and interact with its downstream
    runner(s) synchronously, else we haven't actually solved the issue.

    For these reasons, I don't think we even want to go with the JobServer
    approach in the long term, which has bearing on (A).

    - Robert


    On Wed, Nov 7, 2018 at 8:50 PM Maximilian Michels <mxm@xxxxxxxxxx
    <mailto:mxm@xxxxxxxxxx>> wrote:
     >
     > +1
     >
     > If the preferred approach is to eventually have the JobServer
    serve the
     > options, then the best intermediate solution is to replicate common
     > options in the SDKs.
     >
     > If we went down the "--runner_option" path, we would end up with
     > multiple ways of specifying the same options. We would eventually
    have
     > to deprecate "runner options" once we have the JobServer
    approach. I'd
     > like to avoid that.
     >
     > For the upcoming release we can revert the changes again and add the
     > most common missing options to the SDKs. Then hopefully we should
    have
     > fetching implemented for the release after.
     >
     > Do you think that is feasible?
     >
     > Thanks,
     > Max
     >
     > On 30.10.18 23:00, Lukasz Cwik wrote:
     > > I still like #3 the most, just can't devote the time to get it
    done.
     > >
     > > Instead of going with a fully implemented #3, we could hardcode
    the a
     > > subset of options and types within each SDK until the job server is
     > > ready to provide this information and then migrate to the
    "full" list.
     > > This would be an easy path for SDKs to take on. They could
    "know" of a
     > > few well known options, and if they want to support all
    options, they
     > > implement the integration with the job server.
     > >
     > > On Fri, Oct 26, 2018 at 9:19 AM Maximilian Michels
    <mxm@xxxxxxxxxx <mailto:mxm@xxxxxxxxxx>
     > > <mailto:mxm@xxxxxxxxxx <mailto:mxm@xxxxxxxxxx>>> wrote:
     > >
     > >      > I would prefer we don't introduce a (quirky) way of passing
     > >     unknown options that forces users to type JSON into the
    command line
     > >     (or similar acrobatics)
     > >     Same here, the JSON approach seems technically nice but too
    bulky
     > >     for users.
     > >
     > >      > To someone wanting to run a pipeline, all options are
    equally
     > >     important, whether they are application specific, SDK
    specific or
     > >     runner specific.
     > >
     > >     I'm also reluctant to force users to use `--runner_option=`
    because the
     > >     division into "Runner" options and other options seems
    rather arbitrary
     > >     to users. Most built-in options are also Runner-related.
     > >
     > >      > It should be possible to *optionally* qualify/scope (to
    cover
     > >     cases where there is ambiguity), but otherwise I prefer the
    format
     > >     we currently have.
     > >
     > >     Yes, namespacing is a problem. What happens if the user
    defines a
     > >     custom
     > >     PipelineOption which clashes with one of the builtin ones?
    If both are
     > >
     > >     set, which one is actually applied?
     > >
     > >
     > > Note that PipelineOptions so far has been treating name
    equality to mean
     > > option equality and the Java implementation has a bunch of
    strict checks
     > > to make sure that default values aren't used for duplicate
    definitions,
     > > they have the same type, etc...
     > > With 1), you fail the job if the runner can't understand your
    option
     > > because its not represented the same way. User then needs to fix-up
     > > their declaration of the option name.
     > > With 2), there are no name conflicts, the SDK will need to
    validate that
     > > the option isn't set in both formats and error out if it is before
     > > pipeline submission time.
     > > With 3), you can prefetch all the options and error out to the user
     > > during argument parsing time.
     > >
     > >
     > >
     > >     Here is a summary of the possible paths going forward:
     > >
     > >
     > >     1) Validate PipelineOptions at Runner side
     > >     ==========================================
     > >
     > >     The main issue raised here was that we want to move away
    from parsing
     > >     arguments which look like options without validating them.
    An easy fix
     > >     would be to actually validate them on the Runner side. This
    could be
     > >     done by changing the deserialization code of
    PipelineOptions which so
     > >     far ignores unknown JSON options.
     > >
     > >     See: PipelineOptionsTranslation.fromProto(Struct protoOptions)
     > >
     > >     Actually, this wouldn't work for user-defined
    PipelineOptions because
     > >     they might not be known to the Runner (if they are defined
    in Python).
     > >
     > >
     > >     2) Introduce a Runner-Option Flag
     > >     =================================
     > >
     > >     In this approach we would try to add as many pipeline
    options for a
     > >     Runner to the SDK, but allow additional Runner options to
    be passed
     > >     using the `--runner-option=key=val` flag. The Runner, like
    in 1), would
     > >     have to ensure validation. I think this has been the most
    favored
     > >     way so
     > >     far. Going forward, that means that `--parallelism=4` and
     > >     `--runner-option=parallelism=4` will have the same effect
    for the Flink
     > >     Runner.
     > >
     > >
     > >     3) Implement Fetching of Options from JobServer
     > >     ===============================================
     > >
     > >     The options are retrieved from the JobServer before
    submitting the
     > >     pipeline. I think this would be ideal but, as mentioned
    before, it
     > >     increases the complexity for implementing new SDKs and
    might overall
     > >     just not be worth the effort.
     > >
     > >
     > >     What do you think? I'd implement 2) for the next release,
    unless there
     > >     are advocates for a different approach.
     > >
     > >     Cheers,
     > >     Max