
Re: To create a WordCount-SideInput.java example?

On Mon, Nov 26, 2018 at 1:32 PM Ruoyun Huang <ruoyun@xxxxxxxxxx> wrote:
Thanks Kenneth. I didn't look into the subfolders; let me read a bit more. I will look into the tests Luke pointed out as well.

To make sure I understand your comment "Side inputs _are_ different in streaming as you have to ...": are you saying that 1) a user needs to use/treat the SideInput API differently when handling the streaming case, OR 2) Beam developers had to implement the underlying machinery differently?

2) each runner has to do the execution differently


1) users might need to know about this, since their main input will wait for side input data or for the window to expire; in batch the window is always ready, so there is no waiting


On Wed, Nov 21, 2018 at 7:50 PM Kenneth Knowles <kenn@xxxxxxxxxx> wrote:
I like the idea of a/many very simple example(s) of side inputs. There are existing examples that use side inputs:

$ cd examples/java/src/main/java/org/apache/beam/examples
$ grep -r withSideInput .
./complete/TfIdf.java:                  .withSideInputs(totalDocuments));
./complete/game/GameStats.java:                  .withSideInputs(globalMeanScore));
./complete/game/GameStats.java:                .withSideInputs(spammersView))
./cookbook/FilterExamples.java:                  .withSideInputs(globalMeanTemp));

From just this grep it looks like all but one are broadcast scalar values. I have not looked at them to see if they are too complex or too trivial.

Side inputs _are_ different in streaming as you have to pause the main input or push back elements until a side input is ready for a window.
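
To make the "push back" idea concrete, here is a toy model in plain Java. It is only a sketch under stated assumptions: it is not Beam code and not how any particular runner is implemented, and the class and method names (SideInputPushbackSketch, onMainInput, onSideInputReady) are made up for illustration. It buffers main-input elements per window until that window's side input shows up, then releases them. It does not model the other case mentioned in this thread, where the window expires before any side input data arrives.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: hypothetical names, no Beam APIs involved. */
class SideInputPushbackSketch {
  // Side input value per window, recorded once it becomes available.
  private final Map<String, Integer> readySideInputs = new HashMap<>();
  // Main-input elements held back because their window's side input is not ready.
  private final Map<String, List<String>> pushedBack = new HashMap<>();
  // Results of "processing" a main-input element together with its side input.
  final List<String> output = new ArrayList<>();

  /** Process the element now if the side input is ready, otherwise push it back. */
  void onMainInput(String window, String element) {
    Integer side = readySideInputs.get(window);
    if (side == null) {
      pushedBack.computeIfAbsent(window, w -> new ArrayList<>()).add(element);
    } else {
      process(element, side);
    }
  }

  /** Record the side input for a window and release everything that was waiting on it. */
  void onSideInputReady(String window, int value) {
    readySideInputs.put(window, value);
    List<String> waiting = pushedBack.remove(window);
    if (waiting != null) {
      for (String element : waiting) {
        process(element, value);
      }
    }
  }

  /** Number of elements currently held back for the given window. */
  int pendingCount(String window) {
    List<String> waiting = pushedBack.get(window);
    return waiting == null ? 0 : waiting.size();
  }

  private void process(String element, int side) {
    output.add(element + "/" + side);
  }

  public static void main(String[] args) {
    SideInputPushbackSketch sketch = new SideInputPushbackSketch();
    sketch.onMainInput("window-1", "a");     // held back: side input not ready
    sketch.onSideInputReady("window-1", 7);  // releases "a"
    sketch.onMainInput("window-1", "b");     // processed immediately
    System.out.println(sketch.output);
  }
}
```

In batch terms, onSideInputReady has effectively always run before any main input arrives, which is why the waiting never shows up there.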

I would suggest multiple simple examples, each showing one way of using side inputs. A particular thing to demonstrate might be a triggered Combine.perKey(), with an explanation that it requires a View.asMultimap() because triggers result in duplicate entries for a key.
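
A rough plain-Java illustration of the duplicate-entries point, again as a toy model rather than Beam code: the names TriggeredPanesSketch, pane, asUniqueMap and asMultimap are hypothetical, and the list of panes is made-up data standing in for what a triggered per-key combine might emit. When a key fires more than once, a one-value-per-key map has nowhere to put the second entry, while a multimap keeps every firing.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: hypothetical names, no Beam APIs involved. */
class TriggeredPanesSketch {
  /** One (key, combined value) entry, as emitted by a single trigger firing. */
  static Map.Entry<String, Integer> pane(String key, int value) {
    return new AbstractMap.SimpleImmutableEntry<>(key, value);
  }

  /** A one-value-per-key view: rejects a second entry for the same key. */
  static Map<String, Integer> asUniqueMap(List<Map.Entry<String, Integer>> panes) {
    Map<String, Integer> result = new LinkedHashMap<>();
    for (Map.Entry<String, Integer> p : panes) {
      if (result.putIfAbsent(p.getKey(), p.getValue()) != null) {
        throw new IllegalStateException("duplicate entry for key: " + p.getKey());
      }
    }
    return result;
  }

  /** A multimap-style view: keeps every entry seen for a key, in arrival order. */
  static Map<String, List<Integer>> asMultimap(List<Map.Entry<String, Integer>> panes) {
    Map<String, List<Integer>> result = new LinkedHashMap<>();
    for (Map.Entry<String, Integer> p : panes) {
      result.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
    }
    return result;
  }

  public static void main(String[] args) {
    // "cat" fires twice (two panes), "dog" fires once.
    List<Map.Entry<String, Integer>> panes =
        Arrays.asList(pane("cat", 1), pane("cat", 3), pane("dog", 2));
    System.out.println(asMultimap(panes));
    try {
      asUniqueMap(panes);
    } catch (IllegalStateException e) {
      System.out.println("unique map rejected input: " + e.getMessage());
    }
  }
}
```

A real example would still have to decide how the consumer interprets the multiple values per key (for instance, taking the latest firing); the sketch deliberately leaves that open.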


On Wed, Nov 21, 2018 at 4:40 PM Ruoyun Huang <ruoyun@xxxxxxxxxx> wrote:

I am working on SideInput support in the Java reference runner (ULR), JIRA-2928 [1].
Although there is an inline code snippet example [2] and there are unit tests [3], I did not find
a good place showing a working example of SideInput (please correct me if I am wrong).
I am thinking of creating one more WordCount example under the examples folder [4].
In particular, this example would show variants of a) side inputs as a scalar AND as a multimap, b) side inputs built from pipeline data or created within code, and c) [OPTIONAL?] streaming versus batch, if there are differences (this I am not sure about yet).

In the meantime, JIRA-2928 can also easily rely on such an example to validate behaviors between portable and non-portable runners.

I would like to double-check whether this is a reasonable idea.

Even though SideInput is just one of our many features, my justification is that it is commonly used, so having a one-stop example makes it easier for new users. That being said, is there a reason not to have yet another WordCount example? (Another idea is to extend the existing WordCount.java, but that breaks its simplicity.)

If it is a good change to have, are there any suggestions on what else to include?


[2] sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ParDo.java#L160
[3] sdks/java/harness/src/test/java/org/apache/beam/fn/harness/state/MultimapSideInputTest.java
[4] examples/java/src/main/java/org/apache/beam/examples

Ruoyun  Huang
