Re: Support for TIMESTAMP_NANOS in parquet-cpp


hi Roman,

I agree with you that it is not a small change because of the new
union-based logical type representation, and compatibility for old
Parquet files (as well as an option to write "old" metadata for
compatibility with old Parquet readers).

- Wes
On Tue, Nov 13, 2018 at 10:13 AM Roman Karlstetter
<roman.karlstetter@xxxxxxxxx> wrote:
>
> Hi,
>
> That sounds like the task might not be ideally suited for someone new to the implementations of both Arrow and Parquet, especially since all those compatibility issues need to be handled correctly.
> I don't think it makes sense for me to continue with this implementation unless there is some further specification of how it should be implemented.
>
> Roman
>
> From: Wes McKinney
> Sent: Monday, 12 November 2018 16:50
> To: dev@xxxxxxxxxxxxxxxx
> Subject: Re: Support for TIMESTAMP_NANOS in parquet-cpp
>
> hi Roman,
>
> For nanosecond Arrow timestamps, the relevant code path for this is here:
>
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L607
>
> You'll also have to modify some code in parquet/types.*,
> parquet/schema.*, parquet/arrow/schema.cc to handle the additional
> metadata. If you aren't dealing with Arrow at all, then it should be
> sufficient just to modify the handling of the logical types metadata
> in parquet/types.*.
>
> So there is a significant complication that I hadn't thought about yet:
> we aren't handling the new logical types union in parquet-cpp yet, so
> there's quite a lot of work beyond just dealing with the nanosecond
> metadata. I am also not sure what the implications are for backwards
> compatibility, and I haven't had time to look in detail at what needs to
> be done since the new metadata structure was added to the Thrift
> definition.
>
> - Wes
> On Mon, Nov 12, 2018 at 4:31 AM Roman Karlstetter
> <roman.karlstetter@xxxxxxxxx> wrote:
> >
> > I've had the chance to look into this.
> > One issue came up that I don't know how to handle. Previously, int96 seems to have been used for nanosecond precision, but as far as I understand it, this is now somewhat deprecated.
> > So how should we handle nanoseconds and int96 vs. int64 when 1) reading from and 2) writing to Parquet?
> > There seem to be some writer settings, all related to timestamp precision properties. Is there any advice any of you can give me in that regard?
> >
> > Thanks,
> > Roman
> >
> > From: Roman Karlstetter
> > Sent: Friday, 9 November 2018 08:38
> > To: dev@xxxxxxxxxxxxxxxx
> > Subject: Re: Support for TIMESTAMP_NANOS in parquet-cpp
> >
> > I would be willing to implement that. I’ll probably need some advice on my patch though, as I’m fairly new to the parquet code.
> >
> > Roman
> >
> > From: Wes McKinney
> > Sent: Thursday, 8 November 2018 23:22
> > To: dev@xxxxxxxxxxxxxxxx
> > Subject: Re: Support for TIMESTAMP_NANOS in parquet-cpp
> >
> > I opened an issue here
> > https://issues.apache.org/jira/browse/ARROW-3729. Patches would be
> > welcome.
> > On Sat, Oct 20, 2018 at 12:55 PM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
> > >
> > > hi Roman,
> > >
> > > We would welcome adding such a document to the Arrow wiki
> > > https://cwiki.apache.org/confluence/display/ARROW. As to your other
> > > questions, it really depends on whether there is a member of the
> > > Parquet community who will do the work. Patches that implement any
> > > released functionality in the Parquet format specification are
> > > welcome.
> > >
> > > Thanks
> > > Wes
> > > On Thu, Oct 18, 2018 at 10:59 AM Roman Karlstetter
> > > <roman.karlstetter@xxxxxxxxx> wrote:
> > > >
> > > > Hi everyone,
> > > > parquet-format now supports TIMESTAMP_NANOS: https://github.com/apache/parquet-format/pull/102
> > > > This is not yet supported in parquet-cpp, so I have a few questions:
> > > > • Is there an overview of which release of parquet-format is currently fully supported in parquet-cpp (something like a feature support matrix)?
> > > > • How quickly are new features in parquet-format adopted?
> > > > I think a document describing how completely the spec is currently implemented would be very helpful for users of the parquet-cpp library.
> > > > Thanks,
> > > > Roman
> > > >
> > > >
> >
> >
>
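
As background to the int96 versus int64 question raised in the thread: the legacy INT96 timestamp encoding, as written by tools such as Impala and Hive, holds 8 bytes of nanoseconds within the day followed by 4 bytes of Julian day number, both little-endian. TIMESTAMP_NANOS instead annotates an int64 that counts nanoseconds since the Unix epoch. The following is a minimal Python sketch of converting between the two. It is not code from parquet-cpp; the function names are hypothetical, and the byte layout described above is the assumption it rests on.

```python
import struct

# Julian day number of 1970-01-01, the Unix epoch.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
NANOS_PER_DAY = 86_400 * 1_000_000_000
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def int96_to_unix_nanos(raw: bytes) -> int:
    """Decode a 12-byte legacy INT96 timestamp into nanoseconds since the epoch.

    Hypothetical helper; assumes 8 bytes of nanoseconds-of-day followed by
    4 bytes of Julian day, both little-endian.
    """
    if len(raw) != 12:
        raise ValueError("an INT96 value is exactly 12 bytes")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    nanos = (julian_day - JULIAN_DAY_OF_UNIX_EPOCH) * NANOS_PER_DAY + nanos_of_day
    # An int64 nanosecond count only spans a few centuries around 1970.
    if not INT64_MIN <= nanos <= INT64_MAX:
        raise OverflowError("timestamp does not fit in int64 nanoseconds")
    return nanos


def unix_nanos_to_int96(nanos: int) -> bytes:
    """Encode nanoseconds since the epoch as a 12-byte legacy INT96 value."""
    # divmod floors, so pre-1970 values still get a non-negative time of day.
    days, nanos_of_day = divmod(nanos, NANOS_PER_DAY)
    return struct.pack("<qi", nanos_of_day, days + JULIAN_DAY_OF_UNIX_EPOCH)
```

The overflow check reflects one practical difference between the two encodings: an int64 nanosecond count covers only roughly the years 1677 to 2262, while INT96 can represent dates far outside that range, so a reader that converts INT96 to int64 nanoseconds has to decide how to treat such values.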