Re: Charset support proposal


Thanks for the initiative Ted!

The document mentions various interesting features, like enriching DDL
operations with charset and collation, but I think it is also worth
mentioning a few points in the existing API that are related to these
notions.

The RelDataType (column data type) already includes the notions of charset
(RelDataType#getCharset) and collation (RelDataType#getCollation). Moreover,
the RelDataTypeFactory allows creating types with a specific charset and
collation through the method
RelDataTypeFactory#createTypeWithCharsetAndCollation, and it also provides a
method to obtain the default charset
(RelDataTypeFactory#getDefaultCharset).

Basically, with the above it is already possible for users to pass their own
charset and collation as far as tables and columns are concerned. Also, by
providing a custom RelDataTypeFactory they can control which default
charset is going to be used.
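
To make concrete what a column charset changes at the byte level, here is a
minimal sketch using only java.nio (it does not touch the Calcite API; the
class and method names are hypothetical and chosen for illustration). The same
text has different byte representations under different charsets, so values
have to be decoded to a common form before they can be compared.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetSketch {
    // Encodes the text with the given charset, as a column of that charset would store it.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Compares two encoded values by decoding both to Java strings first.
    static boolean sameText(byte[] a, Charset charsetA, byte[] b, Charset charsetB) {
        return new String(a, charsetA).equals(new String(b, charsetB));
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo"; // "hello" with an acute accent on the e
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);     // 6 bytes
        byte[] utf16 = encode(text, StandardCharsets.UTF_16BE); // 10 bytes

        // A byte-wise comparison sees two different values.
        System.out.println(Arrays.equals(utf8, utf16));
        // Decoding with the right charsets shows they hold the same text.
        System.out.println(sameText(utf8, StandardCharsets.UTF_8,
                                    utf16, StandardCharsets.UTF_16BE));
    }
}
```

This is only an illustration of why the charset attached to a type matters; it
says nothing about how Calcite itself stores or converts values.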

Regarding the operations performed on columns with charsets/collations, the
SQL standard has specific rules on what should or should not be done, along
with implicit/explicit conversions [1]. I don't know to what extent these
rules exist in Calcite, but from looking in various places they don't seem
to be present.
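
As a rough illustration of why the collation has to reach the comparison
operators, here is a minimal sketch with java.text.Collator. This is my own
stand-in for a SQL collating sequence, not the rules of the Standard and not
Calcite code, and the class and method names are hypothetical. The same two
strings compare as equal or different depending on the collation in effect.

```java
import java.text.Collator;
import java.util.Locale;

public class CollationSketch {
    // Compares two strings under an English collator with the given strength
    // and returns only the sign of the result (-1, 0 or 1).
    static int compare(String a, String b, int strength) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return Integer.signum(collator.compare(a, b));
    }

    public static void main(String[] args) {
        // PRIMARY strength ignores case, so the strings are considered equal.
        System.out.println(compare("abc", "ABC", Collator.PRIMARY));
        // TERTIARY strength distinguishes case, so the result is non-zero.
        System.out.println(compare("abc", "ABC", Collator.TERTIARY));
        // A plain code-unit comparison gives yet another ordering.
        System.out.println(Integer.signum("abc".compareTo("ABC")));
    }
}
```

An equality predicate or a sort evaluated with the wrong collation would
therefore return different rows or a different order, which is why the runtime
needs to know which collation applies to each operand.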

Apart from the DDL, which can be extended relatively easily, I think the
biggest challenge is to incorporate the rules of the Standard in the
necessary places. From my point of view, it seems more urgent to improve the
runtime to behave correctly in the presence of charsets and collations.
Overall, I find the subject very interesting and at the same time very
important, so thanks a lot for working on it.

Best,
Stamatis

[1] 4.2.3 Rules determining collating sequence usage
<http://web.cecs.pdx.edu/~len/sql1999.pdf>

On Thu, Nov 15, 2018 at 7:06 PM, Julian Hyde <jhyde@xxxxxxxxxx> wrote:

> Looks great - thank you for writing this. I have some questions. If they
> are already answered in your document, forgive me, and just say “That’s
> answered in the document."
>
> I very much like the idea of adding default charset and collation to
> RelDataTypeSystem. This will help to carry them to all points in the code
> where they are needed.
>
> I also like the idea of adding charset and collation as table options. It
> seems that this feature is non-essential, and could be done in phase 2, if
> necessary. Also, it mainly applies to SQL DDL, i.e. the “server” module. I
> don’t think we need to add default charset and collation to the Table or
> RelOptTable interfaces, just SqlCreateTable.
>
> Regarding the column options: could charset and collation not be
> specified as part of the column’s data type?
>
> When we are parsing a SQL character literal, the characters of that
> literal are in the same encoding as the SQL string itself. The parser (see
> the line ‘UNICODE_INPUT = true;’ in the generated Parser.jj file) seems to assume
> that input is unicode. That seems fine to me — do you agree?
>
> Unqualified character literals (e.g. ‘hello’, vs qualified _UTF8’hello’)
> are always UTF16. Is that correct? Should we provide a way to change that
> default? Do any major databases provide a way to change that default?
>
> In a scenario where different columns have different charsets/collations,
> I assume that there will be a lot of implicit conversion going on. (Not to
> mention explicit conversion, using CONVERT.) Are there concerns about this?
> Are the rules well-defined if, say, we compare a UTF8 with a UTF16 string,
> or concatenate a UTF8 with a UTF16 string?
>
> I saw that MySQL has problems with 3-byte utf8 (aka utf8mb3) and 4-byte
> utf8mb4[1]. Are we going to avoid those problems?
>
> Julian
>
> [1] https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html
>
>
>
> > On Nov 15, 2018, at 4:13 AM, Ted Xu <frankxus@xxxxxxxxx> wrote:
> >
> > Hi folks,
> >
> > I created a design doc
> > https://docs.google.com/document/d/1wo5byn_6K_YOKiPdXNav1zgzt9IBC3SbPvpPnIShtXk/edit?usp=sharing
> > for supporting charset in Calcite, per previous discussions on this
> > topic.
> >
> > One thing I'm not sure about is the runtime change (Codegen on Enumerable
> > and RelExecutor etc). Since I/O is decoupled by pluggable points like
> > Schemas#enumerable, that part looks good to me already.
> >
> > I'm sure there are a lot of misunderstandings and missing pieces in that
> > doc above, so please feel free to leave comments.
>
>