
Re: Support for TIMESTAMP_NANOS in parquet-cpp


hi Roman,

For nanosecond Arrow timestamps, the relevant code path is here:

https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L607

You'll also have to modify some code in parquet/types.*,
parquet/schema.*, parquet/arrow/schema.cc to handle the additional
metadata. If you aren't dealing with Arrow at all, then it should be
sufficient just to modify the handling of the logical types metadata
in parquet/types.*.

There is a significant complication that I hadn't considered, though:
we aren't handling the new logical types union in parquet-cpp yet, so
there's quite a lot of work beyond just dealing with the nanosecond
metadata. I am also not sure what the implications are for backwards
compatibility, and I haven't had time to look in detail at what needs
to be done since the new metadata structure was added to the Thrift
definition.
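
On the int96 versus int64 question in the quoted message below, here is a
minimal sketch of how the two representations relate. It is not
parquet-cpp's actual code. It assumes the legacy INT96 convention used by
Impala-style writers (8 bytes of nanoseconds within the day followed by a
4-byte Julian day number, both little-endian) and a little-endian host.
The Int96 struct and both helper names are hypothetical stand-ins for
illustration.

```cpp
#include <cstdint>
#include <cstring>

// Stand-in for a 12-byte INT96 value (assumption: three 32-bit words,
// words 0-1 = nanoseconds within the day, word 2 = Julian day number).
struct Int96 {
  uint32_t value[3];
};

// Julian day number of 1970-01-01, the Unix epoch.
constexpr int64_t kJulianDayOfUnixEpoch = 2440588;
constexpr int64_t kNanosPerDay = 86400LL * 1000 * 1000 * 1000;

// Hypothetical helper: INT96 timestamp -> int64 nanoseconds since the
// Unix epoch. No overflow checking; assumes a little-endian host.
inline int64_t Int96ToNanosSinceEpoch(const Int96& v) {
  uint64_t nanos_of_day = 0;
  std::memcpy(&nanos_of_day, &v.value[0], sizeof(nanos_of_day));
  const int64_t julian_day = static_cast<int64_t>(v.value[2]);
  return (julian_day - kJulianDayOfUnixEpoch) * kNanosPerDay +
         static_cast<int64_t>(nanos_of_day);
}

// Hypothetical helper: int64 nanoseconds since the Unix epoch -> INT96.
// Uses floor division so that pre-1970 values land on the previous day.
inline Int96 NanosSinceEpochToInt96(int64_t nanos) {
  int64_t days = nanos / kNanosPerDay;
  int64_t rem = nanos % kNanosPerDay;
  if (rem < 0) {
    rem += kNanosPerDay;
    --days;
  }
  Int96 out;
  const uint64_t nanos_of_day = static_cast<uint64_t>(rem);
  std::memcpy(&out.value[0], &nanos_of_day, sizeof(nanos_of_day));
  out.value[2] = static_cast<uint32_t>(days + kJulianDayOfUnixEpoch);
  return out;
}
```

An int64 nanosecond count covers roughly the years 1677 to 2262, so a
reader converting INT96 values this way would have to decide what to do
with values outside that range. The sketch does not check for that.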

- Wes
On Mon, Nov 12, 2018 at 4:31 AM Roman Karlstetter
<roman.karlstetter@xxxxxxxxx> wrote:
>
> I've had the chance to look into this.
> One issue came up that I don't know how to handle. Previously, int96 seems to have been used for nanosecond precision, but as far as I understand it, this is now somewhat deprecated.
> So, how should we handle nanoseconds and int96 vs int64 when 1) reading from and 2) writing to parquet?
> There seem to be some writer settings, all related to timestamp precision properties. Is there any advice any of you can give me in that regard?
>
> Thanks,
> Roman
>
> From: Roman Karlstetter
> Sent: Friday, 9 November 2018 08:38
> To: dev@xxxxxxxxxxxxxxxx
> Subject: RE: Support for TIMESTAMP_NANOS in parquet-cpp
>
> I would be willing to implement that. I’ll probably need some advice on my patch though, as I’m fairly new to the parquet code.
>
> Roman
>
> From: Wes McKinney
> Sent: Thursday, 8 November 2018 23:22
> To: dev@xxxxxxxxxxxxxxxx
> Subject: Re: Support for TIMESTAMP_NANOS in parquet-cpp
>
> I opened an issue here:
> https://issues.apache.org/jira/browse/ARROW-3729. Patches would be
> welcome.
> On Sat, Oct 20, 2018 at 12:55 PM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
> >
> > hi Roman,
> >
> > We would welcome adding such a document to the Arrow wiki
> > https://cwiki.apache.org/confluence/display/ARROW. As to your other
> > questions, it really depends on whether there is a member of the
> > Parquet community who will do the work. Patches that implement any
> > released functionality in the Parquet format specification are
> > welcome.
> >
> > Thanks
> > Wes
> > On Thu, Oct 18, 2018 at 10:59 AM Roman Karlstetter
> > <roman.karlstetter@xxxxxxxxx> wrote:
> > >
> > > Hi everyone,
> > > parquet-format now supports TIMESTAMP_NANOS: https://github.com/apache/parquet-format/pull/102
> > > parquet-cpp does not support this yet. I have a few questions:
> > > • Is there an overview of which release of parquet-format is currently fully supported in parquet-cpp (something like a feature support matrix)?
> > > • How quickly are new features in parquet-format adopted?
> > > I think a document describing how completely the spec is currently implemented would be very helpful for users of the parquet-cpp library.
> > > Thanks,
> > > Roman
> > >
> > >
>
>