Re: [Discuss] Monorepo vs. independent repositories for independent implementations

I would also add that Krisztian's recent work Dockerizing the project is
setting us up to decouple ourselves from Travis CI. To do this, though,
we need build hosts where we can run Docker. Preferably the build hosts
would have NVIDIA GPUs so we can use nvidia-docker to test our GPU
functionality.
On Tue, Oct 16, 2018 at 9:09 PM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
> hi Antoine,
> Some small critiques of the listing of implementations:
> * The Java library predates the C++ library (it originated in Apache Drill)
> * Python and C++ both interact with the Java library, in different
> ways. There's JNI for Gandiva and Plasma, and Python uses Java via
> JPype in unit tests.
> There are some critical questions to answer here:
> 1. Is there such a thing as an "independent implementation"?
> 2. What's the best way to manage changesets / patches?
> 3. What is the best way to manage the burgeoning complexity of testing
> and verification of the entire project?
> 4. How much longer will public CI services be adequate for our needs?
> This may be a bit long-winded, so bear with me.
> 1. Is there such a thing as an "independent implementation"?
> My answer to this is actually "not really". The reasons are as follows:
> * The integration tests are one of the most important parts of the
> project. While C++, Java, and JavaScript are currently the only
> participants, we eventually need Rust, Go, and C# to be in the matrix.
> This will include integration testing for RPC / Flight in addition to
> the current IPC tests.
> * By the nature of Arrow, any implementation may build in-memory or
> RPC-based bindings to computational libraries that are in C++ or use
> LLVM, such as Gandiva and Plasma. This is already the case in Java,
> and may expand beyond Java. I could see Go or Rust or C# using Gandiva
> or Plasma. The range of shared infrastructure used across multiple
> languages will only expand over time.
> 2. What's the best way to manage changesets / patches?
> * Because no two implementations can be guaranteed to be independent,
> in a non-monorepo setup, changes may require multiple patches.
> Verifying "joint patches" is likely to require manual / human
> intervention in ways that are a non-issue for a monorepo.
> * Splitting development up into multiple repositories will decrease
> visibility into the patch queues in the less active subprojects. I'm
> strongly in support not only of a single codebase but also of a single
> patch queue. I admit that seeing ~70 open pull requests on Arrow
> stresses me out a bit, but having 70 patches spread across 5 repos
> would be more stressful, for me at least.
> * Broken builds in any part of the project should be a concern to the
> entire community -- we should not have broken builds. I'd be concerned
> about having any part of the project becoming a "ghetto" if the
> plurality of developers are working elsewhere with an "out of sight,
> out of mind" mindset.
> To play devil's advocate, some web applications could be developed to
> create the appearance of a unified patch queue across many repos.
> That being said, our patch queue pales in comparison to some larger /
> more mature ASF projects:
> * Spark has 523 open PRs: https://github.com/apache/spark/pulls
> * Airflow has 218 open PRs: https://github.com/apache/incubator-airflow/pulls
> * Hadoop has 195 open PRs: https://github.com/apache/hadoop/pulls
> 3. What is the best way to manage the burgeoning complexity of testing
> and verification of the entire project?
> 4. How much longer will public CI services be adequate for our needs?
> I think we are already reaching the limits of what we can reasonably
> accomplish with public CI services. Apache Arrow is a project with
> sophistication and scope that is destined to outgrow what Travis CI
> can provide within the scope of a single implementation, i.e.
> C++/Python. For example, we're going to be past the 50 minute time
> limit before too long. I think that continuing to constrain ourselves
> by the 50 minute time limit will also limit the scope of what kinds of
> automated testing we can employ, to our long term detriment. We also
> have things (like GPU support) that we cannot test there.
> Consider the more mature data projects in the ASF that I'm familiar
> with (Kudu, Impala, Spark): none of them uses Travis CI. Their
> testing runs on Jenkins build slaves and takes much longer than our CI
> jobs. If we used beefier build slaves, our builds would also run much
> faster.
> So, what should we do? Well, part of why I have recently created an
> organization (https://ursalabs.org/) dedicated to Arrow development is
> to have the financial means and the engineering resources to actually
> do something about problems like these. I propose investing
> hardware and engineering time to augment our ability to test the
> repository, so that we can manage 5-10x our current test
> runtime. If I have to personally halt feature
> development and focus on build and development tooling for a while, so
> be it. We've already spent many months this year on packaging
> automation but we are still coming up short in development tooling. If
> anyone reading has funds to invest in hardware resources, please let
> me know.
> As Clint Eastwood's character said in "The Good, The Bad, and The
> Ugly", "$200,000 is a lot of money. We're gonna have to earn it."
> FWIW: I am not sure Parquet is a good example of a better way to be.
> Parquet lacks automated integration tests (terrifying to me) and
> failed to grow a community outside of the Java world until 2016 when a
> few of us started building out the C++ library.
> - Wes
> On Tue, Oct 16, 2018 at 1:02 PM Antoine Pitrou <antoine@xxxxxxxxxx> wrote:
> >
> >
> > Hello,
> >
> > We are quickly growing the number of Arrow implementations.  Soon we'll
> > have:
> > - C++: the most mature, reference, and historical implementation
> > - Python: linked with Arrow C++
> > - C/GLib: linked with Arrow C++
> > - Ruby: linked with Arrow C++ (indirectly through C/GLib)
> > - R: linked with Arrow C++
> > - Matlab: linked with Arrow C++
> > - Java: independent implementation
> > - Rust: independent implementation
> > - Go: independent implementation
> > - Javascript: independent implementation
> > - .Net (C#): independent implementation
> >
> > This creates various kinds of issues.  Technical issues such as CI
> > matrices growing ever larger and more complex.  Social issues such as
> > different implementations having different development speeds and
> > maturity, and the fact that development teams are effectively disjoint
> > (for example, whoever develops on the C++ codebase usually doesn't
> > develop on the Rust codebase, and vice-versa).
> >
> > I'm not proposing anything concrete here, but would like to ask what
> > people think of moving independent implementations (those that don't
> > depend on Arrow C++) into independent repositories.  This would let them
> > define their own workflow, permissions, teams, CI configurations and
> > whatnot.  This would also allow growing the CI matrix for the main repo
> > without reaching humongous sizes.  The implementations would still be
> > under the umbrella of the Apache Arrow project; but they would exist as
> > independent GitHub projects (this is roughly how Parquet
> > implementations are handled, AFAIK).
> >
> > To start with, Wes expressed opposition to the idea:
> > """
> > I am against breaking up the monorepo -- I think that we should scale
> > our process using tools that we develop rather than conforming to the
> > objectively crude affordances of Travis CI and Appveyor. Implementations
> > that are independent now may not be so in the future by the nature of
> > the project -- any implementation could integrate with Gandiva, for
> > example, and that would become much more difficult to develop if the
> > code is fragmented in multiple repositories.
> > """
> >
> > (https://github.com/apache/arrow/pull/2765#issuecomment-430224701)
> >
> > Regards
> >
> > Antoine.