
Re: [DISCUSS] Unification of Hadoop related IO modules

I think it makes sense to keep hadoop-file-system separate, as it's common to use HDFS even if one is not using any of the other hadoop (mapreduce) libraries. On the other hand, it makes a lot of sense to me to put the hadoop read and write into the same module, probably going with option (3) where hadoop-input-format would just be a (deprecated) alias for hadoop-mapreduce-format until we can simply remove it. I don't know enough about hadoop-common to judge whether it makes sense to merge it in or just keep it separate. 

On Thu, Sep 6, 2018 at 8:41 PM Lukasz Cwik <lcwik@xxxxxxxxxx> wrote:
I think 4 is best for users since when a user comes from the Hadoop ecosystem, it is likely they are using many parts of Hadoop and would likely get value from having everything together. My concern with 4 is whether a single Hadoop package would be overwhelming from a dependencies point of view.

From my experience with the google-cloud-platform IO package, this problem is not easy to handle with so many different package versions and libraries. If we can't handle it, then the next best thing for me would be 2 or 3.

On Thu, Sep 6, 2018 at 10:22 AM Chamikara Jayalath <chamikara@xxxxxxxxxx> wrote:
I'd vote for (1).

For most of the IO modules, it makes sense to develop and keep the read and write parts together, given that they usually connect to the same datastore. But hadoop-input-format and hadoop-output-format are simply a level of indirection for connecting to the various data stores supported by Hadoop. Also, hadoop-format is probably not a common term in the Hadoop ecosystem?

hadoop-file-system is a FileSystem, not a source/sink, so it makes sense to keep it separate. It also looks like we already have connectors for other products from the Hadoop ecosystem as separate modules.

Regarding breaking changes, I think for IOs it's better to make the old classes proxies and keep them around (deprecated) so that we don't break users if we decide to take that route. For any non-experimental code, we'll have to keep the old classes around until Beam 3.0.
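
As a rough sketch of the proxy approach described above: the old entry point stays in place, is marked deprecated, and simply forwards to the new one. The class names NewFormatIO and OldInputFormatIO are made up for illustration and are not actual Beam classes; the list-based read() is only a stand-in for a real IO transform.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Small demo: the old and the new entry points return the same result.
class ProxyDemo {
  public static void main(String[] args) {
    List<String> records = Arrays.asList("a", "b");
    // Calling the deprecated class still works; the compiler only warns.
    System.out.println(OldInputFormatIO.read(records).equals(NewFormatIO.read(records)));
  }
}

// Hypothetical stand-in for the new, unified IO class (illustrative name).
final class NewFormatIO {
  private NewFormatIO() {}

  // Placeholder for the real read logic: returns a copy of the input.
  static List<String> read(List<String> source) {
    return new ArrayList<>(source);
  }
}

/**
 * Hypothetical stand-in for the old class, kept only so existing user code compiles.
 *
 * @deprecated use {@link NewFormatIO} instead.
 */
@Deprecated
final class OldInputFormatIO {
  private OldInputFormatIO() {}

  // Pure delegation: no logic lives here any more.
  static List<String> read(List<String> source) {
    return NewFormatIO.read(source);
  }
}
```

With this shape, existing callers keep working and get a deprecation warning at compile time, while all behaviour lives in one place.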


On Thu, Sep 6, 2018 at 8:24 AM Alexey Romanenko <aromanenko.dev@xxxxxxxxx> wrote:
Hello everyone,

I’d like to discuss the following topic (see below) with the community, since the optimal solution is not clear to me.

There is a Java IO module called “hadoop-input-format”, which allows using MapReduce InputFormat implementations to read data from different sources (for example, org.apache.hadoop.mapreduce.lib.db.DBInputFormat). As its name suggests, it has only the “Read” part and is missing the “Write” part, so I'm working on “hadoop-output-format” to support MapReduce OutputFormat (PR 6306). For this I created another module with that name. So, in the end, we will have two different modules, “hadoop-input-format” and “hadoop-output-format”, which looks quite strange to me since, afaik, every existing Java IO that we have encapsulates the Read and Write parts in one module. Additionally, we have “hadoop-common” and “hadoop-file-system” as other Hadoop-related modules.

Now I’m thinking about how best to organise all these Hadoop modules. There are several options in my mind:

1) Add a new module “hadoop-output-format” and leave all Hadoop modules as is.
Pros: no breaking changes, no additional work 
Cons: not logical for users to have the same IO in two different modules and with different names.

2) Merge “hadoop-input-format” and “hadoop-output-format” into one module called, say, “hadoop-format” or “hadoop-mapreduce-format”, and keep the other Hadoop modules as is.
Pros: InputFormat/OutputFormat live in one IO module, which is logical for users
Cons: breaking changes for user code because of module/IO renaming 

3) Add a new module “hadoop-format” (or “hadoop-mapreduce-format”) which will include the new “write” functionality and be a proxy for the old “hadoop-input-format”. In turn, “hadoop-input-format” should become deprecated and eventually be moved into the common “hadoop-format” module in a future release. Keep the other Hadoop modules as is.
Pros: in the end there will be only one module for the Hadoop MR format; changes are less painful for users
Cons: hidden difficulties in implementing this strategy; a bit confusing for users

4) Add a new module “hadoop” and move all existing modules there as submodules (like we have for “io/google-cloud-platform”), and merge “hadoop-input-format” and “hadoop-output-format” into one module.
Pros: unification of all hadoop-related modules
Cons: breaking changes for user code, additional complexity with deps and testing

5) Your suggestion?..

My personal preference lies between 2 and 3 (if 3 is possible).

I’m wondering whether there were similar situations in Beam before and how they were finally resolved. If so, we should probably handle this one in a similar way.
Any suggestions/advice/comments would be much appreciated.