Re: Evolving a Coder for an added field
I think we'll want to allow upgrades across SDK versions. A runner
should be able to recognize when a coder (or any other aspect of the
pipeline) has changed and adapt/reject accordingly. (Until we remove
coders from sources/sinks, there's also possibly the expectation that
one should be able to read data from a source written with that same
coder across versions as well.)
I think it really comes down to how coders are named. If we decide to
let coders change arbitrarily between versions, probably the URN for
SerializedJavaCoder should have the SDK version number in it. Coders
that are stable across SDKs can have better, more stable URNs defined.
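
To make the naming idea concrete, here is a rough sketch of what I have in mind. Everything in it is hypothetical: the "example:coder:..." URN strings, the class, and the method names are made up for illustration and are not the URNs or APIs Beam actually uses. The only point is that an opaque serialized Java coder gets a name that changes with the SDK version, a coder with a fixed wire format gets a name that does not, and a runner can compare names to decide whether an update is safe.

```java
import java.util.Objects;

// Hypothetical sketch: none of these URN strings or names come from Beam itself.
final class CoderUrns {
    // Made-up prefix for coders that are just serialized Java objects.
    static final String SERIALIZED_JAVA_CODER_PREFIX = "example:coder:serialized_java:";

    // The URN of an opaque serialized Java coder carries the SDK version,
    // so any SDK upgrade yields a different name.
    static String serializedJavaCoderUrn(String sdkVersion) {
        return SERIALIZED_JAVA_CODER_PREFIX + Objects.requireNonNull(sdkVersion);
    }

    // A coder with a fixed wire format is named by that format's own version,
    // independent of the SDK version.
    static String stableCoderUrn(String name, int formatVersion) {
        return "example:coder:" + Objects.requireNonNull(name) + ":v" + formatVersion;
    }

    // Runner-side check (simplified): treat an update as compatible only
    // when the coder name is unchanged.
    static boolean canUpdate(String oldUrn, String newUrn) {
        return oldUrn.equals(newUrn);
    }
}
```

With this scheme, canUpdate(serializedJavaCoderUrn("2.8.0"), serializedJavaCoderUrn("2.9.0")) is false, so the runner would reject (or adapt) the update, while two pipelines using stableCoderUrn("varint", 1) stay compatible across SDK versions.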
I am more OK with changing the registry to infer different coders as
the SDK evolves (such changes would be detected, and the old coders
could be manually substituted on a case-by-case basis if they still
exist). This should still be done with caution, as it will make
upgrading harder.
Highly composite, experimental coders should possibly be designed in
an intrinsically extensible way.
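
As one possible illustration of "intrinsically extensible", here is a minimal sketch of an encoding that starts with a format version, so a newer decoder can still read bytes written before a field was added. This is not the actual Beam MetadataCoder format; the class, the field names, the byte layout, and the default value for the missing field are all assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a version-prefixed encoding; not Beam's real wire format.
final class ExtensibleMetadataCodec {
    static final int CURRENT_VERSION = 2;
    // Assumed default for data written before the field existed.
    static final long UNKNOWN_LAST_MODIFIED = 0L;

    static final class Metadata {
        final String path;
        final long sizeBytes;
        final long lastModifiedMillis;

        Metadata(String path, long sizeBytes, long lastModifiedMillis) {
            this.path = path;
            this.sizeBytes = sizeBytes;
            this.lastModifiedMillis = lastModifiedMillis;
        }
    }

    // Current writer: version 2 appends lastModifiedMillis after the v1 fields.
    static byte[] encode(Metadata m) {
        byte[] path = m.path.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + 4 + path.length + 8 + 8);
        buf.putInt(CURRENT_VERSION).putInt(path.length).put(path)
           .putLong(m.sizeBytes).putLong(m.lastModifiedMillis);
        return buf.array();
    }

    // Simulates bytes produced by an older SDK that only knew path and size.
    static byte[] encodeV1(String pathString, long sizeBytes) {
        byte[] path = pathString.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + 4 + path.length + 8);
        buf.putInt(1).putInt(path.length).put(path).putLong(sizeBytes);
        return buf.array();
    }

    // Reader: accepts both versions and defaults the field that v1 lacks.
    static Metadata decode(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        int version = buf.getInt();
        if (version < 1 || version > CURRENT_VERSION) {
            throw new IllegalArgumentException("Unsupported format version: " + version);
        }
        byte[] path = new byte[buf.getInt()];
        buf.get(path);
        long sizeBytes = buf.getLong();
        long lastModified = version >= 2 ? buf.getLong() : UNKNOWN_LAST_MODIFIED;
        return new Metadata(new String(path, StandardCharsets.UTF_8), sizeBytes, lastModified);
    }
}
```

The catch is that this only helps if the version prefix is there from the first release; it cannot be retrofitted onto an existing coder without changing the bytes, which is exactly the compatibility problem being discussed in this thread.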
On Mon, Nov 5, 2018 at 4:24 PM Jean-Baptiste Onofré <jb@xxxxxxxxxxxx> wrote:
> That's really a pita. It's an important and impactful change.
> I would go with 1.
> For LTS, as already said, I would create an LTS branch and only cherry
> pick some changes. Using master as LTS release branch won't work IMHO.
> On 05/11/2018 15:47, Ismaël Mejía wrote:
> > For some extra context, this change touches more than FileIO. In
> > reality it will affect updates of any file-based pipeline, because
> > the metadata on each file will now have an extra field for the
> > lastModifiedDate.
> > The PR looks perfect; the only issue is the Coder
> > backwards-compatibility question. Knowing that Dataflow is probably
> > the only one affected, I would like to know what we can do.
> >  Should we merge and tie Coder updatability to SDK versions
> > (which makes sense and is probably more aligned with the LTS
> > discussion)?
> >  Should we have a MetadataCoderV2 (does this imply a duplicated
> > Metadata object)? In that case, where is the right place to identify
> > and decide which coder to use?
> > Other ideas... ?
> > Last thing, the link that Luke shared does not seem to work (it
> > looks like a Google-friendly URL). Here is the full URL for those
> > interested in the drain/update proposal:
> >  https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY/edit#
> > On Fri, Nov 2, 2018 at 10:11 PM Lukasz Cwik <lcwik@xxxxxxxxxx> wrote:
> >> I think the idea is that you would use one coder for paths where you don't need this information and would have FileIO provide a separate path that uses your updated coder.
> >> Existing users would not be impacted, and users of the new FileIO that depend on this information could not have updated their pipeline in the first place.
> >> If the feature in FileIO is experimental, we could choose to break it for existing users instead, since I don't know how feasible my suggestion above is.
> >> On Fri, Nov 2, 2018 at 12:56 PM Jeff Klukas <jklukas@xxxxxxxxxxx> wrote:
> >>> Lukasz - Thanks for those links. That's very helpful context.
> >>> It sounds like there's no explicit user contract about evolving Coder classes in the Java SDK and users might reasonably assume Coders to be stable between SDK versions. Thus, users of the Dataflow or Flink runners might reasonably expect that they can update the Java SDK version used in their pipeline when performing an update.
> >>> Based on that understanding, evolving a class like Metadata might not be possible except in a major version bump, where it's obvious to users that they should expect breaking changes and should not expect an "update" operation to work.
> >>> It's not clear to me what changing the "name" of a coder would look like or whether that's a tenable solution here. Would that change be able to happen within the SDK itself, or is it something users would need to specify?
> Jean-Baptiste Onofré
> Talend - http://www.talend.com