Seems to me that Only Map<bytes, ?>'s quality check cannot be solved by deepEquals because Keys cannot be looked up correctly in Map<bytes, ?>. If we cannot have a useful use case for Map<bytes, ?>, we could reject it in Schema and still keep byte.
The option3 needs to find a wrapper of byte that is language-independent & encoding-independent for portability, which seems a hard searching (and not possible?) process.-RuiOn Mon, Oct 29, 2018 at 10:15 AM Gleb Kanterov <gleb@xxxxxxxxxxx> wrote:There is an indirect connection to RowCoder because `MapCoder` isn't deterministic, therefore, this doesn't hold:> - also each type (hence Row type) should have portable encoding(s) that respect this equality so shuffling is consistent
I think it's a requirement only for rows we want to shuffle by.> About these specific use cases, how useful is it to support Map<byte, ?> and List<byte>?Not sure about Map, but in BigQuery it's possible to define `ARRAY<BYTES>` type. It can group by BYTES, but not by ARRAYS.On Mon, Oct 29, 2018 at 5:42 PM Anton Kedin <kedin@xxxxxxxxxx> wrote:About these specific use cases, how useful is it to support Map<byte, ?> and List<byte>? These seem pretty exotic (maybe they aren't) and I wonder whether it would make sense to just reject them until we have a solid design.And wouldn't the same problems arise even without RowCoder? Is the path in that case to implement a custom coder?Regards,AntonOn Mon, Oct 29, 2018 at 9:05 AM Kenneth Knowles <kenn@xxxxxxxxxx> wrote:I'll summarize my input to the discussion. It is rather high level. But IMO:- even though schemas are part of Beam Java today, I think they should become part of portability when ready- so each type in a schema needs a language-independent & encoding-independent notion of domain of values and equality (so obviously equal bytes are equal)- embedding in any language (hence Row in Java) must have a schema type-driven equality that matches this spec- also each type (hence Row type) should have portable encoding(s) that respect this equality so shuffling is consistent- Row in Java should be able to decode these encodings to different underlying representations and change its strategy over timeKennOn Mon, Oct 29, 2018 at 8:08 AM Gleb Kanterov <gleb@xxxxxxxxxxx> wrote:With adding BYTES type, we broke equality. `RowCoder#consistentWithEquals` is always true, but this property doesn't hold for exotic types such as `Map<BYTES, ?>`, `List<BYTES>`. The root cause is `byte`, where `equals` is implemented as reference equality instead of structural.Before we jump into solution mode, let's state what we want to have:- API have stable API and be able to evolve efficient and use-case specific implementations without breaking it- Correctness we can't trade off correctness, a trivial implementation should work- Performance comparing equality is a fundamental operation, and we want to make it cheap1. set `consistentWithEquals` if there is BYTES fieldPros: almost no prosCons: It would introduce a significant number of allocations when comparing rows, so we reject this option.2. implement custom deep equals in `Row#equals`Pros: good performance, doesn't change API, `Row#equals` is correctCons: doesn't work for `Map<byte, ?>`, unless we roll own implementationCons: it's possible to obtain `List<byte>` from `getValue()` that has broken equality, contains, etc, unless we roll own implementationCons: non-trivial and requires ~200LOC to implement3. wrapping byte into Java object with fixed equality (e.g., StructuralByteArray)Pros: good performance and flexible to change how Java wrapper is implementedPros: simple, doesn't require any specialized collections, no surprises, `Map<byte, ?>` and `List<byte>` work.Cons: will change the return type of `Row#getValue`I want to suggest going with option #3. However, it isn't completely clear what wrapper we want to use, either it could be StructuralByteArray, or ByteBuffer. ByteBuffer is more standard. However, it comes with 4 additional integer fields. StructuralByteArray doesn't have anything not necessary. One option would be adding `Row#getByteBuffer` that would be `ByteBuffer.wrap(getValue(i).getValues())`, specialized implementation can override it for better performance, but `getValue(i)` must return StructuralByteArray.References:- [BEAM-5866] Fix `Row#equals`, https://github.com/apache/beam/pull/6845- [BEAM-5646] Fix quality and hashcode for bytes in Row, https://github.com/apache/beam/pull/6765Gleb--Cheers,Gleb