Re: [Discuss] Monorepo vs. independent repositories for independent implementations

hi Antoine,

Some small critiques of the listing of implementations:

* The Java library predates the C++ library (it originated in Apache Drill)
* Python and C++ both interact with the Java library in different
ways. There's JNI for Gandiva and Plasma, and Python uses Java via
JPype in unit tests

There are some critical questions to answer here:

1. Is there such a thing as an "independent implementation"?
2. What's the best way to manage changesets / patches?
3. What is the best way to manage the burgeoning complexity of testing
and verification of the entire project?
4. How much longer will public CI services be adequate for our needs?

This may be a bit long-winded, so bear with me.

1. Is there such a thing as an "independent implementation"?

My answer to this is actually "not really". The reasons are as follows:

* The integration tests are one of the most important parts of the
project. While C++, Java, and JavaScript are currently the only
participants, we eventually need Rust, Go, and C# to be in the
matrix. This will include integration testing for RPC / Flight in
addition to the current IPC tests.
* By the nature of Arrow, any implementation may build in-memory or
RPC-based bindings to computational libraries that are in C++ or use
LLVM, such as Gandiva and Plasma. This is already the case in Java,
and may expand beyond Java. I could see Go or Rust or C# using Gandiva
or Plasma. The scope of what kinds of shared infrastructure might be
used in multiple languages will only expand over time
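
As a rough illustration of how the integration matrix grows, the
sketch below counts ordered producer/consumer pairs for the three
current participants and for the six implementations named above. The
pairing model (every implementation writes data that every
implementation, itself included, must read back) is an assumption made
for illustration, not a description of how the Arrow integration tests
are actually organized.

```python
# Implementation names come from the message above; the pairing model is assumed.
current = ["C++", "Java", "JavaScript"]
planned = current + ["Rust", "Go", "C#"]


def integration_pairs(impls):
    """Return every ordered (producer, consumer) pair, self-pairs included."""
    return [(producer, consumer) for producer in impls for consumer in impls]


print(len(integration_pairs(current)))  # 3 * 3 = 9
print(len(integration_pairs(planned)))  # 6 * 6 = 36
```

Under that assumption, doubling the number of participants quadruples
the number of pairs to verify, before RPC / Flight tests are counted.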

2. What's the best way to manage changesets / patches?

* Because no two implementations can be guaranteed to be independent,
in a non-monorepo setup, changes may require multiple patches.
Verifying "joint patches" is likely to require manual / human
intervention in ways that are a non-issue for a monorepo
* Splitting development up into multiple repositories will decrease
visibility into the patch queues in the less active subprojects. I'm
strongly in support not only of a single codebase but a single patch
queue. I admit that seeing ~70 open pull requests on Arrow stresses me
out a bit, but having 70 patches spread across 5 repos would be more
stressful, for me at least
* Broken builds in any part of the project should be a concern to the
entire community -- we should not have broken builds. I'd be concerned
about having any part of the project becoming a "ghetto" if the
plurality of developers are working elsewhere with an "out of sight,
out of mind" mindset

To play devil's advocate, some web applications could be developed to
create the appearance of a unified patch queue across many repos.
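
A minimal sketch of that devil's-advocate idea: given open patches
from several repositories, merge them into one queue ordered by last
update. The repository names, the Patch fields, and the sample entries
are all hypothetical; a real tool would fetch this data from the
hosting service rather than hard-code it.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Patch:
    repo: str      # hypothetical repository name
    number: int    # pull request number within that repository
    title: str
    updated: date  # date of the most recent activity


def unified_queue(per_repo):
    """Flatten per-repo patch lists into one queue, least recently updated first."""
    merged = [patch for patches in per_repo.values() for patch in patches]
    return sorted(merged, key=lambda p: (p.updated, p.repo, p.number))


# Hypothetical sample data standing in for two separate repositories.
per_repo = {
    "example/impl-a": [
        Patch("example/impl-a", 12, "Fix reader", date(2018, 10, 3)),
    ],
    "example/impl-b": [
        Patch("example/impl-b", 7, "Add writer", date(2018, 9, 20)),
        Patch("example/impl-b", 9, "Update docs", date(2018, 10, 10)),
    ],
}

queue = unified_queue(per_repo)
for patch in queue:
    print(patch.updated, patch.repo, patch.number, patch.title)
```

Sorting the oldest activity first puts the patches most at risk of
being forgotten at the top, whichever repository they live in.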

That being said, our patch queue pales in comparison to some larger /
more mature ASF projects:

* Spark has 523 open PRs: https://github.com/apache/spark/pulls
* Airflow has 218 open PRs: https://github.com/apache/incubator-airflow/pulls
* Hadoop has 195 open PRs: https://github.com/apache/hadoop/pulls

3. What is the best way to manage the burgeoning complexity of testing
and verification of the entire project?
4. How much longer will public CI services be adequate for our needs?

I think we are already reaching the limits of what we can reasonably
accomplish with public CI services. Apache Arrow is a project with
sophistication and scope that is destined to outgrow what Travis CI
can provide within the scope of a single implementation, i.e.
C++/Python. For example, we're going to be past the 50 minute time
limit before too long. I think that continuing to constrain ourselves
by the 50 minute time limit will also limit the scope of what kinds of
automated testing we can employ, to our long term detriment. We also
have things (like GPU support) that we cannot test there.

Consider the more mature data projects in the ASF that I'm familiar
with (Kudu, Impala, Spark): none of these projects use Travis CI. Their
testing uses Jenkins build slaves and runs much longer than our CI
jobs. If we used beefier build slaves, our builds would also run much
faster.

So, what should we do? Well, part of why I have recently created an
organization (https://ursalabs.org/) dedicated to Arrow development is
to have the financial means and the engineering resources to actually
do something about problems like these. I would propose to make an
investment of hardware and engineering time to augment our ability to
test the repository to make sure we can manage 5-10x the current test
runtime that we have now. If I have to personally halt feature
development and focus on build and development tooling for a while, so
be it. We've already spent many months this year on packaging
automation but we are still coming up short in development tooling. If
anyone reading has funds to invest in hardware resources, please let
me know.

As Clint Eastwood's character said in "The Good, The Bad, and The
Ugly", "$200,000 is a lot of money. We're gonna have to earn it."

FWIW: I am not sure Parquet is a good example of a better way to be.
Parquet lacks automated integration tests (terrifying to me) and
failed to grow a community outside of the Java world until 2016 when a
few of us started building out the C++ library.

- Wes

On Tue, Oct 16, 2018 at 1:02 PM Antoine Pitrou <antoine@xxxxxxxxxx> wrote:
> Hello,
> We are quickly growing the number of Arrow implementations.  Soon we'll
> have:
> - C++: the most mature, reference, and historical implementation
> - Python: linked with Arrow C++
> - C/GLib: linked with Arrow C++
> - Ruby: linked with Arrow C++ (indirectly through C/GLib)
> - R: linked with Arrow C++
> - Matlab: linked with Arrow C++
> - Java: independent implementation
> - Rust: independent implementation
> - Go: independent implementation
> - Javascript: independent implementation
> - .Net (C#): independent implementation
> This creates various kinds of issues.  Technical issues such as CI
> matrices being more and more large and complex.  Social issues such as
> different implementations having different development speeds and
> maturity, and the fact that development teams are effectively disjoint
> (for example, whoever develops on the C++ codebase usually doesn't
> develop on the Rust codebase, and vice-versa).
> I'm not proposing anything concrete here, but would like to ask what
> people think of moving independent implementations (those that don't
> depend on Arrow C++) into independent repositories.  This would let them
> define their own workflow, permissions, teams, CI configurations and
> whatnot.  This would also allow growing the CI matrix for the main repo
> without reaching humongous sizes.  The implementations would still be
> under the umbrella of the Apache Arrow project; but they would exist as
> independent GitHub projects (this is a bit how Parquet implementations
> are handled, AFAIK).
> To start with, Wes expressed opposition to the idea:
> """
> I am against breaking up the monorepo -- I think that we should scale
> our process using tools that we develop rather than conforming to the
> objectively crude affordances of Travis CI and Appveyor. Implementations
> that are independent now may not be so in the future by the nature of
> the project -- any implementation could integrate with Gandiva, for
> example, and that would become much more difficult to develop if the
> code is fragmented in multiple repositories.
> """
> (https://github.com/apache/arrow/pull/2765#issuecomment-430224701)
> Regards
> Antoine.