
Re: [DISCUSS] More precision supported by DATETIME field in Schema


We *might* have a few bits left in the WindowedValue representation to
make this backwards compatible if we really wanted.

The use of java.time.Instant means that we won't be able to upgrade
(even in v3) our internal timestamps to match without either
internally supporting >64 bits of precision or limiting the date
range. But using the standard Java time API does make a lot of sense.
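For context on the range limitation: a single signed 64-bit count of nanoseconds since the Unix epoch tops out in the year 2262. A minimal sketch using plain java.time (nothing Beam-specific):

```java
import java.time.Instant;

public class NanoTimestampRange {
    // The latest instant representable as a signed 64-bit nanosecond
    // count since the Unix epoch: 2262-04-11T23:47:16.854775807Z.
    public static Instant maxSignedNanoInstant() {
        return Instant.ofEpochSecond(0, Long.MAX_VALUE);
    }
}
```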
On Fri, Nov 9, 2018 at 12:33 AM Rui Wang <ruwang@xxxxxxxxxx> wrote:
>
> https://github.com/apache/beam/pull/6991
>
> I am using java.time.Instant as the internal representation to replace Joda time for the DATETIME field in the PR. java.time.Instant uses a long to store seconds-since-epoch and an int to store the nanosecond-of-second, so a full 64 bits are available for seconds-since-epoch and no precision is lost.
>
> Comments on this PR are very welcome.
>
> -Rui
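A small illustration of the representation Rui describes (a long for seconds plus an int for nanos), using only java.time:

```java
import java.time.Instant;

public class InstantRepresentationDemo {
    // Instant stores a long seconds-since-epoch plus an int
    // nanosecond-of-second; both components read back exactly,
    // unlike a millisecond-only representation.
    public static Instant makeInstant(long epochSecond, int nanoOfSecond) {
        return Instant.ofEpochSecond(epochSecond, nanoOfSecond);
    }
}
```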
>
> On Wed, Nov 7, 2018 at 1:15 AM Reuven Lax <relax@xxxxxxxxxx> wrote:
>>
>> As you said, this would be update incompatible across all streaming pipelines. At the very least this would be a big problem for Dataflow users, and I believe many Flink users as well. I'm not sure the benefit here justifies causing problems for so many users.
>>
>> Reuven
>>
>> On Wed, Nov 7, 2018 at 4:56 PM Robert Bradshaw <robertwb@xxxxxxxxxx> wrote:
>>>
>>> Yes, microseconds is a good compromise for covering a long enough
>>> timespan that there's little reason it could be hit (even for
>>> processing historical data).
>>>
>>> Regarding backwards compatibility, could we just change the internal
>>> representation of Beam's element timestamps, possibly with new APIs to
>>> access the finer granularity? (True, it may not be upgrade
>>> compatible.)
>>> On Tue, Nov 6, 2018 at 8:46 PM Reuven Lax <relax@xxxxxxxxxx> wrote:
>>> >
>>> > The main difference (though possibly theoretical) is when time runs out. With 64 bits and nanosecond precision, we can only represent times about 292 years in the future (or the past).
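Checking the arithmetic: 2^63 nanoseconds is roughly 292 years, so a signed 64-bit nanosecond timestamp covers about ±292 years around the epoch. A quick calculation:

```java
public class NanoRangeYears {
    // Years representable by a signed 64-bit nanosecond count,
    // using the mean Gregorian year of 365.2425 days.
    public static double signedNanoRangeYears() {
        double seconds = Long.MAX_VALUE / 1e9;
        return seconds / (365.2425 * 24 * 3600);  // ~292.3 years
    }
}
```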
>>> >
>>> > On Tue, Nov 6, 2018 at 11:30 AM Kenneth Knowles <kenn@xxxxxxxxxx> wrote:
>>> >>
>>> >> I like nanoseconds as extremely future-proof. What about spec'ing this out in stages: (1) the domain of values, (2) a portable encoding that can represent those values, (3) language-specific types to embed the values in.
>>> >>
>>> >> 1. If it is a nanosecond-precision absolute time, and we eventually want to migrate event time timestamps to match, then we need values for "end of global window" and "end of time". TBH I am not sure we need both of these any more. We can either define a max on the nanosecond range or create distinguished values.
>>> >>
>>> >> 2. For portability, presumably an order-preserving integer encoding of nanoseconds since the epoch, with whatever tweaks are needed to represent the end of time. It might be useful to find a way to allow multiple encodings. Not super useful at any particular version, but it would give us a migration path. It would also allow experiments for performance.
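One possible shape for the order-preserving encoding Kenn mentions, with a distinguished value reserved for "end of time". The names here are hypothetical, not an actual Beam API:

```java
public class OrderPreservingTimestampCoder {
    // Hypothetical distinguished value for "end of time"; it sorts
    // after every real timestamp under this encoding.
    public static final long END_OF_TIME_NANOS = Long.MAX_VALUE;

    // Big-endian bytes with the sign bit flipped, so unsigned
    // lexicographic byte order matches signed numeric order.
    public static byte[] encode(long nanosSinceEpoch) {
        long biased = nanosSinceEpoch ^ Long.MIN_VALUE;
        byte[] out = new byte[8];
        for (int i = 0; i < 8; i++) {
            out[i] = (byte) (biased >>> (56 - 8 * i));
        }
        return out;
    }
}
```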
>>> >>
>>> >> 3. We could probably find a way to keep user-facing API compatibility here while increasing the underlying precision in (1) and (2), but it is probably not worth it. A new Java type IMO addresses the lossiness issue, because a user would have to explicitly request truncation to assign to a millis event-time timestamp.
>>> >>
>>> >> Kenn
>>> >>
>>> >> On Tue, Nov 6, 2018 at 12:55 AM Charles Chen <ccy@xxxxxxxxxx> wrote:
>>> >>>
>>> >>> Is the proposal to do this for both Beam Schema DATETIME fields as well as for Beam timestamps in general?  The latter likely has a bunch of downstream consequences for all runners.
>>> >>>
>>> >>> On Tue, Nov 6, 2018 at 12:38 AM Ismaël Mejía <iemejia@xxxxxxxxx> wrote:
>>> >>>>
>>> >>>> +1 to more precision even to the nano level, probably via Reuven's
>>> >>>> proposal of a different internal representation.
>>> >>>> On Tue, Nov 6, 2018 at 9:19 AM Robert Bradshaw <robertwb@xxxxxxxxxx> wrote:
>>> >>>> >
>>> >>>> > +1 to offering more granular timestamps in general. I think it will be
>>> >>>> > odd if setting the element timestamp from a row DATETIME field is
>>> >>>> > lossy, so we should seriously consider upgrading that as well.
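The lossiness Robert describes can be seen directly: round-tripping a nanosecond-precision Instant through a millisecond representation discards the sub-millisecond digits. A sketch using plain java.time (the DATETIME-to-element-timestamp path itself is Beam-internal):

```java
import java.time.Instant;

public class TruncationDemo {
    // Round-tripping through milliseconds (as a Joda-backed
    // timestamp would) drops the nanosecond remainder.
    public static Instant roundTripThroughMillis(Instant t) {
        return Instant.ofEpochMilli(t.toEpochMilli());
    }
}
```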
>>> >>>> > On Tue, Nov 6, 2018 at 6:42 AM Charles Chen <ccy@xxxxxxxxxx> wrote:
>>> >>>> > >
>>> >>>> > > One related issue that came up before is that we (perhaps unnecessarily) restrict the precision of timestamps in the Python SDK to milliseconds, for legacy reasons related to the Java runner's use of Joda time.  Perhaps Beam portability should natively use a more granular timestamp unit.
>>> >>>> > >
>>> >>>> > > On Mon, Nov 5, 2018 at 9:34 PM Rui Wang <ruwang@xxxxxxxxxx> wrote:
>>> >>>> > >>
>>> >>>> > >> Thanks Reuven!
>>> >>>> > >>
>>> >>>> > >> I think Reuven gives the third option:
>>> >>>> > >>
>>> >>>> > >> Change the internal representation of the DATETIME field in Row, but keep the public ReadableDateTime getDateTime(String fieldName) API for compatibility with existing code. We could also add another accessor such as getDateTimeNanosecond. This differs from option one, because option one actually maintains two implementations of time.
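A rough sketch of that third option: store an Instant internally, keep a millisecond-precision accessor for existing callers, and expose the nanoseconds through a new accessor. The class and method names here are hypothetical, and the real Row returns a Joda ReadableDateTime rather than a long:

```java
import java.time.Instant;

public class DateTimeField {
    private final Instant value;  // internal representation: java.time.Instant

    public DateTimeField(Instant value) {
        this.value = value;
    }

    // Legacy-style accessor: millisecond precision, as existing code expects.
    public long getDateTimeMillis() {
        return value.toEpochMilli();
    }

    // New accessor exposing the full nanosecond-of-second.
    public int getDateTimeNanosecond() {
        return value.getNano();
    }
}
```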
>>> >>>> > >>
>>> >>>> > >> -Rui
>>> >>>> > >>
>>> >>>> > >> On Mon, Nov 5, 2018 at 9:26 PM Reuven Lax <relax@xxxxxxxxxx> wrote:
>>> >>>> > >>>
>>> >>>> > >>> I would vote that we change the internal representation of Row to something other than Joda. Java 8 times would give us at least microseconds, and if we want nanoseconds we could simply store it as a number.
>>> >>>> > >>>
>>> >>>> > >>> We should still keep accessor methods that return and take Joda objects, as the rest of Beam still depends on Joda.
>>> >>>> > >>>
>>> >>>> > >>> Reuven
>>> >>>> > >>>
>>> >>>> > >>> On Mon, Nov 5, 2018 at 9:21 PM Rui Wang <ruwang@xxxxxxxxxx> wrote:
>>> >>>> > >>>>
>>> >>>> > >>>> Hi Community,
>>> >>>> > >>>>
>>> >>>> > >>>> The DATETIME field in Beam Schema/Row is implemented with Joda's DateTime (see Row.java#L611 and Row.java#L169). Joda's DateTime is limited to millisecond precision. That is good enough to represent event-time timestamps, but not for real "time" data, which may need precision up to the nanosecond.
>>> >>>> > >>>>
>>> >>>> > >>>> Unfortunately, Joda has decided to stay at millisecond precision: https://github.com/JodaOrg/joda-time/issues/139.
>>> >>>> > >>>>
>>> >>>> > >>>> If we want to support nanosecond precision, there are two options:
>>> >>>> > >>>>
>>> >>>> > >>>> Option one: utilize the current FieldType's metadata field, so that Row can check the metadata to decide what is stored in the DATETIME field: Joda's DateTime or an implementation that supports nanoseconds.
>>> >>>> > >>>>
>>> >>>> > >>>> Option two: add another field type (maybe called TIMESTAMP?) whose implementation supports higher-precision time.
>>> >>>> > >>>>
>>> >>>> > >>>> What do you think about the need for higher precision in the time type, and which option do you prefer?
>>> >>>> > >>>>
>>> >>>> > >>>> -Rui