
[Discuss] Monorepo vs. independent repositories for independent implementations


We are quickly growing the number of Arrow implementations.  Soon we'll have:
- C++: the most mature, reference, and historical implementation
- Python: linked with Arrow C++
- C/GLib: linked with Arrow C++
- Ruby: linked with Arrow C++ (indirectly through C/GLib)
- R: linked with Arrow C++
- MATLAB: linked with Arrow C++
- Java: independent implementation
- Rust: independent implementation
- Go: independent implementation
- JavaScript: independent implementation
- .NET (C#): independent implementation

This creates various kinds of issues.  Technical issues, such as CI
matrices growing ever larger and more complex.  Social issues, such as
different implementations having different development speeds and
maturity, and the fact that development teams are effectively disjoint
(for example, whoever develops on the C++ codebase usually doesn't
develop on the Rust codebase, and vice-versa).

I'm not proposing anything concrete here, but would like to ask what
people think of moving independent implementations (those that don't
depend on Arrow C++) into independent repositories.  This would let them
define their own workflow, permissions, teams, CI configurations and
whatnot.  This would also allow growing the CI matrix for the main repo
without reaching humongous sizes.  The implementations would still be
under the umbrella of the Apache Arrow project; but they would exist as
independent GitHub projects (this is a bit how Parquet implementations
are handled, AFAIK).

To start with, Wes expressed opposition to the idea:
> I am against breaking up the monorepo -- I think that we should scale
> our process using tools that we develop rather than conforming to the
> objectively crude affordances of Travis CI and Appveyor. Implementations
> that are independent now may not be so in the future by the nature of
> the project -- any implementation could integrate with Gandiva, for
> example, and that would become much more difficult to develop if the
> code is fragmented in multiple repositories.