osdir.com



Re: [Discuss] Monorepo vs. independent repositories for independent implementations


One point in favor of separate repositories: vendoring Arrow into a C++
project with git submodules becomes awkward if it's a multi-language monorepo.

On Tue, Oct 16, 2018 at 9:22 PM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:

> I would also add -- Krisztian's recent work Dockerizing the project is
> setting us up to be able to decouple ourselves from Travis CI. We need
> build hosts where we can use Docker to be able to do this, though.
> Preferably the build hosts would have NVIDIA GPUs so we can use
> nvidia-docker to test our GPU functionality.
>
> On Tue, Oct 16, 2018 at 9:09 PM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
> >
> > hi Antoine,
> >
> > Some small critiques to the listing of implementations:
> >
> > * The Java library predates the C++ library (it originated in Apache
> > Drill)
> > * Python and C++ both interact with the Java library in different
> > ways. There's JNI for Gandiva and Plasma, and Python uses Java via
> > JPype in unit tests
> >
> > There are some critical questions to answer here:
> >
> > 1. Is there such a thing as an "independent implementation"?
> > 2. What's the best way to manage changesets / patches?
> > 3. What is the best way to manage the burgeoning complexity of testing
> > and verification of the entire project?
> > 4. How much longer will public CI services be adequate for our needs?
> >
> > This may be a bit long-winded, so bear with me.
> >
> > 1. Is there such a thing as an "independent implementation"?
> >
> > My answer to this is actually "not really". The reasons are as follows:
> >
> > * The integration tests are one of the most important parts of the
> > project. While C++, Java, and JavaScript are the only participants, we
> > eventually need Rust, Go, and C# to be in the matrix. This will
> > include integration testing for RPC / Flight in addition to the
> > current IPC tests.
> > * By the nature of Arrow, any implementation may build in-memory or
> > RPC-based bindings to computational libraries that are in C++ or use
> > LLVM, such as Gandiva and Plasma. This is already the case in Java,
> > and may expand beyond Java. I could see Go or Rust or C# using Gandiva
> > or Plasma. The scope of what kinds of shared infrastructure might be
> > used in multiple languages will only expand over time
> >
> > 2. What's the best way to manage changesets / patches?
> >
> > * Because no two implementations can be guaranteed to be independent,
> > in a non-monorepo setup, changes may require multiple patches.
> > Verifying "joint patches" is likely to require manual / human
> > intervention in ways that are a non-issue for a monorepo
> > * Splitting development up into multiple repositories will decrease
> > visibility into the patch queues in the less active subprojects. I'm
> > strongly in support not only of a single codebase but a single patch
> > queue. I admit that seeing ~70 open pull requests on Arrow stresses me
> > out a bit, but having 70 patches spread across 5 repos would be more
> > stressful for me at least
> > * Broken builds in any part of the project should be a concern to the
> > entire community -- we should not have broken builds. I'd be concerned
> > about having any part of the project becoming a "ghetto" if the
> > plurality of developers are working elsewhere with an "out of sight,
> > out of mind" mindset
> >
> > To play devil's advocate, some web applications could be developed to
> > create the appearance of a unified patch queue across many repos.
> >
> > That being said, our patch queue pales in comparison to some larger /
> > more mature ASF projects:
> >
> > * Spark has 523 open PRs: https://github.com/apache/spark/pulls
> > * Airflow has 218 open PRs: https://github.com/apache/incubator-airflow/pulls
> > * Hadoop has 195 open PRs: https://github.com/apache/hadoop/pulls
> >
> > 3. What is the best way to manage the burgeoning complexity of testing
> > and verification of the entire project?
> > 4. How much longer will public CI services be adequate for our needs?
> >
> > I think we are already reaching the limits of what we can reasonably
> > accomplish with public CI services. Apache Arrow is a project with
> > sophistication and scope that is destined to outgrow what Travis CI
> > can provide within the scope of a single implementation, i.e.
> > C++/Python. For example, we're going to be past the 50 minute time
> > limit before too long. I think that continuing to constrain ourselves
> > by the 50 minute time limit will also limit the scope of what kinds of
> > automated testing we can employ, to our long term detriment. We also
> > have things (like GPU support) that we cannot test there.
> >
> > Considering more mature data projects in the ASF that I'm familiar
> > with: Kudu, Impala, Spark: none of these projects use Travis CI. Their
> > testing uses Jenkins build slaves and runs much longer than our CI
> > jobs. If we used beefier build slaves, our builds would also run much
> > faster.
> >
> > So, what should we do? Well, part of why I have recently created an
> > organization (https://ursalabs.org/) dedicated to Arrow development is
> > to have the financial means and the engineering resources to actually
> > do something about problems like these. I would propose to make an
> > investment of hardware and engineering time to augment our ability to
> > test the repository to make sure we can manage 5-10x the current test
> > runtime that we have now. If I have to personally halt feature
> > development and focus on build and development tooling for a while, so
> > be it. We've already spent many months this year on packaging
> > automation but we are still coming up short in development tooling. If
> > anyone reading has funds to invest in hardware resources, please let
> > me know.
> >
> > As Clint Eastwood's character said in "The Good, The Bad, and The
> > Ugly", "$200,000 is a lot of money. We're gonna have to earn it."
> >
> > FWIW: I am not sure Parquet is a good example of a better way to be.
> > Parquet lacks automated integration tests (terrifying to me) and
> > failed to grow a community outside of the Java world until 2016 when a
> > few of us started building out the C++ library.
> >
> > - Wes
> > On Tue, Oct 16, 2018 at 1:02 PM Antoine Pitrou <antoine@xxxxxxxxxx> wrote:
> > >
> > >
> > > Hello,
> > >
> > > We are quickly growing the number of Arrow implementations.  Soon we'll
> > > have:
> > > - C++: the most mature, reference, and historical implementation
> > > - Python: linked with Arrow C++
> > > - C/GLib: linked with Arrow C++
> > > - Ruby: linked with Arrow C++ (indirectly through C/GLib)
> > > - R: linked with Arrow C++
> > > - Matlab: linked with Arrow C++
> > > - Java: independent implementation
> > > - Rust: independent implementation
> > > - Go: independent implementation
> > > - Javascript: independent implementation
> > > - .Net (C#): independent implementation
> > >
> > > This creates various kinds of issues.  Technical issues such as CI
> > > matrices growing ever larger and more complex.  Social issues such as
> > > different implementations having different development speeds and
> > > maturity, and the fact that development teams are effectively disjoint
> > > (for example, whoever develops on the C++ codebase usually doesn't
> > > develop on the Rust codebase, and vice-versa).
> > >
> > > I'm not proposing anything concrete here, but would like to ask what
> > > people think of moving independent implementations (those that don't
> > > depend on Arrow C++) into independent repositories.  This would let them
> > > define their own workflow, permissions, teams, CI configurations and
> > > whatnot.  This would also allow growing the CI matrix for the main repo
> > > without reaching humongous sizes.  The implementations would still be
> > > under the umbrella of the Apache Arrow project; but they would exist as
> > > independent GitHub projects (this is a bit how Parquet implementations
> > > are handled, AFAIK).
> > >
> > > To start with, Wes expressed opposition to the idea:
> > > """
> > > I am against breaking up the monorepo -- I think that we should scale
> > > our process using tools that we develop rather than conforming to the
> > > objectively crude affordances of Travis CI and Appveyor. Implementations
> > > that are independent now may not be so in the future by the nature of
> > > the project -- any implementation could integrate with Gandiva, for
> > > example, and that would become much more difficult to develop if the
> > > code is fragmented in multiple repositories.
> > > """
> > >
> > > (https://github.com/apache/arrow/pull/2765#issuecomment-430224701)
> > >
> > > Regards
> > >
> > > Antoine.
>


-- 
Sent from my jetpack.