


Re: [DISCUSS] Versioning, Hadoop related dependencies and enterprise users


Thanks, all, for the comments and suggestions. We would like to close this thread and start implementing the new policy based on the discussion:

1. Stop assigning JIRAs to the first person listed in the dependency owners files. Instead, cc people on the owner list.
2. We will be creating JIRAs for upgrading individual dependencies, not for upgrading to specific versions of those dependencies. For example, if a given dependency X is three minor versions or a year behind, we will create a JIRA for upgrading it, but the specific version to upgrade to has to be determined by the Beam community. The Beam community might choose to close a JIRA if there are known issues with the available recent releases. The tool will reopen such a closed JIRA to notify the interested parties once Beam reaches the recorded 'fix version', or once 3 new versions of the dependency have been released since the JIRA was closed. A sketch of this reopen rule is below.
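To make the reopen rule concrete, here is a rough Python sketch; the data model and names are illustrative assumptions, not the tool's actual code:

    # Illustrative sketch of the reopen rule; names/types are assumptions.
    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass
    class ClosedDependencyJira:
        fix_version: Optional[Tuple[int, ...]]  # e.g. (3, 0, 0), set on close
        releases_since_close: int  # dependency releases since the JIRA closed

    def should_reopen(jira: ClosedDependencyJira,
                      current_beam_version: Tuple[int, ...]) -> bool:
        # Reopen once Beam reaches the 'fix version' recorded on the JIRA...
        if jira.fix_version is not None and current_beam_version >= jira.fix_version:
            return True
        # ...or once the dependency has shipped 3 new releases since the close.
        return jira.releases_since_close >= 3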

Thank you.

Regards.
Yifan

On Wed, Sep 5, 2018 at 2:14 PM Yifan Zou <yifanzou@xxxxxxxxxx> wrote:
+1 on the JIRA "fix version".
The release frequency of dependencies varies, so reopening issues based only on new information such as the number of versions released since the JIRA's closing date might not always be appropriate. We could check the fix version first and, if one is specified, reopen the issue in that version's release cycle; if not, follow Cham's proposal (2).

On Wed, Sep 5, 2018 at 1:59 PM Chamikara Jayalath <chamikara@xxxxxxxxxx> wrote:


On Wed, Sep 5, 2018 at 12:50 PM Tim Robertson <timrobertson100@xxxxxxxxx> wrote:
Thank you, Cham, and everyone, for contributing.

Sorry for the slow reply to a thread I started, but I've been swamped with non-Beam projects.

KafkaIO's policy of 'let the user decide exact version at runtime' has been quite useful so far. How feasible is that for other connectors?

I presume shimming might be needed in a few places but it's certainly something we might want to explore more. I'll look into KafkaIO.

On Cham's proposal :

(1) +0.5. We can always then opt to either assign or take ownership of an issue, although I am also happy to stick with the owners model - it prompted me to investigate and resulted in this thread.

(2) I think this makes sense. 
A bot informing us that we're falling behind versions is immensely useful as long as we can link issues to others that might carry a wider discussion (remember that many dependencies need to be treated together, such as "Support Hadoop 3.0.x" or "Support HBase 2.x"). Would it make sense to let owners use the JIRA "fix version" field to record a future release and tell the bot when it should start alerting again?

I think this makes sense. Setting a "fix version" will be especially useful for dependency changes that result in API changes that have to be postponed until the next major version of Beam.

On grouping, I believe we already group JIRAs into tasks and sub-tasks based on the group IDs of dependencies, along the lines of the sketch below. I suppose it will not be too hard to close multiple sub-tasks with the same reasoning.
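To illustrate the grouping, a minimal sketch; the coordinates and output format are examples, not the tool's actual data:

    # Group outdated dependencies by group ID so related upgrades (e.g.
    # "Support Hadoop 3.x") get one parent task with per-artifact sub-tasks.
    from collections import defaultdict

    outdated = [
        "org.apache.hadoop:hadoop-common:2.7.3",  # example coordinates
        "org.apache.hadoop:hadoop-hdfs:2.7.3",
        "org.apache.hbase:hbase-client:1.2.6",
    ]

    by_group = defaultdict(list)
    for coordinate in outdated:
        group_id, artifact_id, _version = coordinate.split(":")
        by_group[group_id].append(artifact_id)

    for group_id, artifacts in by_group.items():
        # One parent JIRA per group ID; each artifact becomes a sub-task,
        # so all sub-tasks can be closed with the same reasoning at once.
        print(f"Parent task: upgrade {group_id} ({', '.join(artifacts)})")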




On Wed, Sep 5, 2018 at 3:18 AM Yifan Zou <yifanzou@xxxxxxxxxx> wrote:
Thanks Cham for putting this together. Also, after modifying the dependency tool based on the policy above, we will close all existing JIRA issues to prevent creating duplicate bugs, and we will stop pushing assignees to upgrade dependencies via the old bugs.

Please let us know if you have any comments on the revised policy in Cham's email.

Thanks all.

Regards.
Yifan Zou

On Tue, Sep 4, 2018 at 5:35 PM Chamikara Jayalath <chamikara@xxxxxxxxxx> wrote:
Based on this email thread and offline feedback from several folks, the current concerns regarding dependency upgrade policy and tooling seem to be the following.

(1) We have to be careful when upgrading dependencies. For example, we should not create JIRAs for upgrading to dependency versions that have known issues.

(2) The dependency owners list can get stale. Somebody who is interested in upgrading a dependency today might not be interested in the same task in six months. The responsibility for upgrading a dependency should lie with the community instead of with pre-identified owner(s).

On the other hand, we do not want Beam to fall significantly behind when it comes to dependencies. We should upgrade dependencies whenever it makes sense. This allows us to offer a more up-to-date system and makes things easy for users who deploy Beam along with other systems.

I discussed these issues with Yifan, and we would like to suggest the following changes to the current policy and tooling, which might help alleviate some of the concerns.

(1) Instead of a dependency "owners" list, we will be maintaining an "interested parties" list. When we create a JIRA for a dependency, we will not assign it to an owner; rather, we will CC all the folks who mentioned they would be interested in receiving updates related to that dependency. The hope is that some of the interested parties will also put in the effort to upgrade the dependencies they are interested in, but the responsibility for upgrading dependencies lies with the community as a whole.

(2) We will be creating JIRAs for upgrading individual dependencies, not for upgrading to specific versions of those dependencies. For example, if a given dependency X is three minor versions or a year behind, we will create a JIRA for upgrading it, but the specific version to upgrade to has to be determined by the Beam community. The Beam community might choose to close a JIRA if there are known issues with the available recent releases. The tool may reopen such a closed JIRA in the future if new information becomes available (for example, 3 new versions have been released since the JIRA was closed).

Thoughts?

Thanks,
Cham 

On Tue, Aug 28, 2018 at 1:51 PM Chamikara Jayalath <chamikara@xxxxxxxxxx> wrote:


On Tue, Aug 28, 2018 at 12:05 PM Thomas Weise <thw@xxxxxxxxxx> wrote:
I think there is an invalid assumption being made in this discussion, which is that most projects comply with semantic versioning. The reality in the open source big data space is unfortunately quite different. Ismaël has characterized the situation well, and HBase isn't an exception. Another indicator of the scale of the problem is the extensive amount of shading used in Beam and other projects; it wouldn't be necessary if semver compliance were something we could rely on.

Our recent Flink upgrade broke user(s). And we noticed a backward-incompatible Flink change that affected the portable Flink runner even between patch releases.

Many projects (including Beam) guarantee compatibility only for a subset of the public API. Sometimes a REST API is not covered, sometimes protocols that are not strictly internal change, and so on, all of which can break users despite the public API remaining "compatible". As much as I would love to rely on the version number to tell me whether an upgrade is safe or not, that's not practically possible.

Furthermore, we need to proceed with caution when forcing upgrades on users that host the target systems. To stay with the Flink example, moving Beam from 1.4 to 1.5 is actually a major change to some, because they now have to upgrade their Flink clusters/deployments to be able to use the new version of Beam.

Upgrades need to be done with caution and may require extensive verification beyond what our automation provides. I think the Spark change from 1.x to 2.x and also the JDK 1.8 change were good examples; they gave the community a window to provide feedback and influence the change.

Thanks for the clarification.

The current policy indeed requests caution and explicit checks when upgrading all dependencies (including minor and patch versions), but the language might have to be updated to emphasize your concerns.

Here's the current text.

"Beam releases adhere to semantic versioning. Hence, community members should take care when updating dependencies. Minor version updates to dependencies should be backwards compatible in most cases. Some updates to dependencies though may result in backwards incompatible API or functionality changes to Beam. PR reviewers and committers should take care to detect any dependency updates that could potentially introduce backwards incompatible changes to Beam before merging and PRs that update dependencies should include a statement regarding this verification in the form of a PR comment. Dependency updates that result in backwards incompatible changes to non-experimental features of Beam should be held till next major version release of Beam. Any exceptions to this policy should only occur in extreme cases (for example, due to a security vulnerability of an existing dependency that is only fixed in a subsequent major version) and should be discussed in the Beam dev list. Note that backwards incompatible changes to experimental features may be introduced in a minor version release."

Also, are there any other steps we can take to make sure that Beam dependencies are not too old while still offering a stable system? Note that having a lot of legacy dependencies that do not get upgraded regularly can also result in user pain, and in Beam being unusable for certain users who run into dependency conflicts when using Beam along with other systems (which will increase the amount of shading/vendoring we have to do).

Please note that the current tooling does not force upgrades or automatically upgrade dependencies. It simply creates JIRAs that can be closed with a reason if needed. For the Python SDK, though, we have version ranges in place for most dependencies [1], so those dependencies get updated automatically within the corresponding ranges; a sketch of the idea is below.
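As a sketch of the version-range approach (the package names and bounds below are examples, not Beam's actual pins):

    # Example setup.py using ranged install_requires; packages and bounds
    # here are illustrative assumptions, not Beam's real dependency list.
    from setuptools import setup

    setup(
        name='example-python-sdk',  # hypothetical package name
        version='0.1.0',
        install_requires=[
            # A range lets compatible patch/minor releases flow in
            # automatically; the upper bound guards against a breaking bump.
            'httplib2>=0.8,<0.12',
            'oauth2client>=2.0.1,<5',
        ],
    )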

Thanks,
Cham
 

Thanks,
Thomas



On Tue, Aug 28, 2018 at 11:29 AM Raghu Angadi <rangadi@xxxxxxxxxx> wrote:
Thanks for the IO versioning summary. 
KafkaIO's policy of 'let the user decide exact version at runtime' has been quite useful so far. How feasible is that for other connectors?

Also, KafkaIO does not limit itself to the minimum set of features available across all the supported versions. Some of the features (e.g. server-side timestamps) are disabled based on the runtime Kafka version. The unit tests currently run with a single recent version. Integration tests could certainly use multiple versions. With some more effort in writing tests, we could run the unit tests against multiple versions as well; a sketch of the version-gating pattern follows.
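KafkaIO itself is Java, but the same "detect the runtime client version and gate features on it" pattern carries over to any SDK. A minimal Python sketch, where the package name and the version cutoff are assumptions for illustration only:

    # Sketch of runtime version detection; the package name and the 0.10
    # cutoff are illustrative assumptions, not KafkaIO's actual logic.
    from importlib.metadata import PackageNotFoundError, version  # Py 3.8+

    def runtime_client_version() -> tuple:
        try:
            # The client is provided by the user; inspect what is installed.
            return tuple(int(p) for p in version("kafka-python").split(".")[:2])
        except PackageNotFoundError:
            raise RuntimeError("Kafka client must be supplied at runtime")

    def use_server_side_timestamps() -> bool:
        # Enable the feature only when the runtime client is new enough.
        return runtime_client_version() >= (0, 10)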
 
Raghu.

IO versioning
* Elasticsearch. We delayed the move to version 6 until we heard of
more active users needing it (more deployments). We support 2.x and
5.x (but 2.x recently went EOL). Support for 6.x is in progress.
* SolrIO. The stable version is 7.x and the LTS is 6.x, but we support
only 5.x because most big data distributions still use 5.x (however,
5.x has reached EOL).
* KafkaIO uses version 1.x, but Kafka recently moved to 2.x; meanwhile,
most Kafka deployments use versions earlier than 1.x. This module uses
a single version, with the Kafka client as a provided dependency, and
so far it works (but we don't have multi-version tests).

 
On Tue, Aug 28, 2018 at 8:38 AM Ismaël Mejía <iemejia@xxxxxxxxx> wrote:
I think we should refine the strategy on dependencies discussed
recently. Sorry to come late with this (I did not follow the previous
discussion closely), but the current approach is clearly not in line
with the industry reality (at least not for IO connectors + Hadoop +
Spark/Flink use).

A really proactive approach to dependency updates is a good practice
for the core dependencies we have, e.g. Guava, Bytebuddy, Avro,
Protobuf, etc., and of course for the cloud-based IOs, e.g. GCS,
BigQuery, AWS S3, etc. However, when we talk about self-hosted data
sources or processing systems this gets more complicated, and I think
we should be more flexible and handle these case by case (and remove
them from the auto-update email reminder).

Some open source projects have at least three maintained versions:
- LTS – maps to what most of the people have installed (or the big
data distributions use) e.g. HBase 1.1.x, Hadoop 2.6.x
- Stable – current recommended version. HBase 1.4.x, Hadoop 2.8.x
- Next – latest release. HBase 2.1.x Hadoop 3.1.x

Following the most recent versions can be good for staying close to
the current development of other projects and some of the fixes, but
these versions are commonly not deployed by most users, and adopting
an LTS-only or stable-only approach won't satisfy all cases either. To
understand why this is complex, let's look at some historical issues:

IO versioning
* Elasticsearch. We delayed the move to version 6 until we heard of
more active users needing it (more deployments). We support 2.x and
5.x (but 2.x recently went EOL). Support for 6.x is in progress.
* SolrIO. The stable version is 7.x and the LTS is 6.x, but we support
only 5.x because most big data distributions still use 5.x (however,
5.x has reached EOL).
* KafkaIO uses version 1.x, but Kafka recently moved to 2.x; meanwhile,
most Kafka deployments use versions earlier than 1.x. This module uses
a single version, with the Kafka client as a provided dependency, and
so far it works (but we don't have multi-version tests).

Runners versioning
* The move from Spark 1 to Spark 2 was decided after evaluating the
tradeoff between maintaining support for multiple versions and
introducing breaking changes. This is a rare case, but one with
consequences. This dependency is provided, but we don't actively test
issues on version migration.
* Flink moved to version 1.5, introducing an incompatibility in
checkpointing (discussed recently, with no consensus yet on how to
handle it).

As you can see, it seems really hard to find a solution that fits all
cases. Probably the only rule I can draw from this list is that we
should upgrade versions for connectors that have been deprecated or
have reached EOL (e.g. Solr 5.x, Elasticsearch 2.x).

For the case of the provided dependencies, I wonder if, as part of the
tests, we should test against multiple versions (note that this is
currently blocked by BEAM-4087). A sketch of the idea follows.
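As a rough sketch of multi-version testing against a provided
dependency (client name, versions, and paths are example assumptions;
assumes a POSIX layout):

    # Run the same test suite once per provided-dependency version by
    # building one virtualenv per version. All names here are examples.
    import subprocess
    import venv

    CLIENT = "kafka-python"        # hypothetical provided dependency
    VERSIONS = ["1.4.7", "2.0.2"]  # example versions to exercise

    for v in VERSIONS:
        env_dir = f".test-env-{CLIENT}-{v}"
        venv.create(env_dir, with_pip=True)
        pip = f"{env_dir}/bin/pip"    # POSIX path; adjust on Windows
        python = f"{env_dir}/bin/python"
        subprocess.check_call([pip, "install", f"{CLIENT}=={v}", "pytest"])
        # Connector tests should pass against every supported version.
        subprocess.check_call([python, "-m", "pytest", "tests/"])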

Any other ideas or opinions on how we can handle this? What do other
people in the community think? (Notice that this can relate to the
ongoing LTS discussion.)


On Tue, Aug 28, 2018 at 10:44 AM Tim Robertson
<timrobertson100@xxxxxxxxx> wrote:
>
> Hi folks,
>
> I'd like to revisit the discussion around our versioning policy, specifically for the Hadoop ecosystem, and make sure we are aware of the implications.
>
> As an example, our policy today would have us on HBase 2.1, and I have reminders to address this.
>
> However, currently the versions of HBase in the major Hadoop distros are:
>
>  - Cloudera 5 on HBase 1.2 (Cloudera 6 is on HBase 2.1 but is only in beta)
>  - Hortonworks HDP3 on HBase 2.0 (only recently released, so we can assume it is not widely adopted)
>  - AWS EMR on HBase 1.4
>
> On the versioning, I think we might need a more nuanced approach to ensure that we target real communities of existing and potential users. Enterprise users need to stick to the versions supported in the distributions to maintain support contracts with the vendors.
>
> Should our versioning policy leave more room to consider things on a case-by-case basis?
>
> For Hadoop, might we benefit from a strategy defining which community of users Beam is targeting?
>
> (OT: I'm collecting some thoughts on what we might consider to target enterprise Hadoop users - Kerberos on all relevant IO, performance, temporary files leaking beyond encryption zones, etc.)
>
> Thanks,
> Tim