Re: Charset support proposal


Thanks for the comments; please find my replies inline.

On Fri, Nov 16, 2018 at 2:06 AM Julian Hyde <jhyde@xxxxxxxxxx> wrote:

> Looks great - thank you for writing this. I have some questions. If they
> are already answered in your document, forgive me, and just say “That’s
> answered in the document.”
>
> I very much like the idea of adding default charset and collation to
> RelDataTypeSystem. This will help to carry them to all points in the code
> where they are needed.
>
> I also like the idea of adding charset and collation as table options. It
> seems that this feature is non-essential, and could be done in phase 2, if
> necessary. Also, it mainly applies to SQL DDL, i.e. the “server” module. I
> don’t think we need to add default charset and collation to the Table or
> RelOptTable interfaces, just SqlCreateTable.
>

Agreed.


>
> Regarding the column options. Could charset and collation not be
> specified as part of the column’s data type?
>

Yes. If not specified, a column's charset falls back to the table default,
then the session default, then the system default charset.
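A minimal sketch of that fallback order, assuming each level is represented
by a nullable java.nio.charset.Charset. The class and method names here are
hypothetical and are not part of Calcite:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Objects;

/** Hypothetical helper for illustration; not a Calcite API. */
final class CharsetResolver {
  private CharsetResolver() {}

  /**
   * Returns the first charset that is set, checking column, then table,
   * then session. A null argument means "not specified at this level".
   * The system default must always be present.
   */
  static Charset resolve(Charset column, Charset table, Charset session,
      Charset systemDefault) {
    if (column != null) {
      return column;
    }
    if (table != null) {
      return table;
    }
    if (session != null) {
      return session;
    }
    return Objects.requireNonNull(systemDefault, "systemDefault");
  }

  public static void main(String[] args) {
    // No column charset, so the table default is used.
    System.out.println(resolve(null, StandardCharsets.ISO_8859_1,
        StandardCharsets.UTF_8, StandardCharsets.UTF_16));
  }
}
```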


>
> When we are parsing a SQL character literal, the characters of that
> literal are in the same encoding as the SQL string itself. The parser (see
> the line 'UNICODE_INPUT = true;' in the generated Parser.jj file) seems to assume
> that input is unicode. That seems fine to me — do you agree?
>

Agreed. Fixing the 'core charset' to UTF-16 gives us better performance and
less coding effort.


>
> Unqualified character literals (e.g. 'hello', vs qualified _UTF8'hello')
> are always UTF16. Is that correct? Should we provide a way to change that
> default? Do any major databases provide a way to change that default?
>
>
IMO unqualified character literals should use a default charset. Instead of
treating 'hello' as _UTF16'hello', it is more convenient to treat it as
_${DEFAULT_CHARSET}'hello', where DEFAULT_CHARSET is defined by the
session/system configuration (connection/startup configuration in MySQL,
https://goo.gl/67hOXK , or SqlSetOption in Calcite) or by the type system.
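
To illustrate the idea only, here is a small sketch that rewrites an
unqualified literal into its qualified form. LiteralQualifier is a
hypothetical name, not a Calcite class, and the charset name is passed in as
a plain string rather than read from any real session or type system setting:

```java
/** Hypothetical helper for illustration; not a Calcite API. */
final class LiteralQualifier {
  private LiteralQualifier() {}

  /**
   * Prefixes an unqualified character literal with the default charset.
   * Literals that do not start with a quote (for example _UTF16'hello')
   * are assumed to be qualified already and are returned unchanged.
   */
  static String qualify(String literal, String defaultCharset) {
    if (!literal.startsWith("'")) {
      return literal;
    }
    return "_" + defaultCharset + literal;
  }

  public static void main(String[] args) {
    System.out.println(qualify("'hello'", "UTF8"));
    System.out.println(qualify("_UTF16'hello'", "UTF8"));
  }
}
```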

>
> In a scenario where different columns have different charsets/collations, I
> assume that there will be a lot of implicit conversion going on. (Not to
> mention explicit conversion, using CONVERT.) Are there concerns about this?
> Are the rules well-defined if, say we compare a UTF8 with a UTF16 string,
> or concatenate a UTF8 with a UTF16 string?
>

There may be concerns. I've already found two affected areas:

1. SQL function return type inference.
2. RelDataTypeFactory#leastRestrictive

These may in turn affect rules such as ReduceExpressionsRule.
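
As a rough sketch of one way mixed-charset operands could be handled,
assuming values are decoded to Java strings (UTF-16 internally) before being
compared or concatenated. MixedCharsetOps is a hypothetical name, and the
sketch ignores collation entirely:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not a Calcite API. */
final class MixedCharsetOps {
  private MixedCharsetOps() {}

  /** Decodes both operands with their own charset, then compares the strings. */
  static boolean equalsAcrossCharsets(byte[] a, Charset charsetA,
      byte[] b, Charset charsetB) {
    return new String(a, charsetA).equals(new String(b, charsetB));
  }

  /** Decodes both operands, concatenates them, and encodes in the result charset. */
  static byte[] concat(byte[] a, Charset charsetA, byte[] b, Charset charsetB,
      Charset resultCharset) {
    String joined = new String(a, charsetA) + new String(b, charsetB);
    return joined.getBytes(resultCharset);
  }

  public static void main(String[] args) {
    String text = "h\u00e9llo";
    byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
    byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE);
    // The raw bytes differ, but the decoded values are equal.
    System.out.println(equalsAcrossCharsets(utf8, StandardCharsets.UTF_8,
        utf16, StandardCharsets.UTF_16BE));
  }
}
```

The result charset of the concatenation is a parameter here because choosing
it is exactly the kind of decision that return type inference and
leastRestrictive would have to make.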


>
> I saw that MySQL has problems with 3-byte utf8 (aka utf8mb3) and 4-byte
> utf8mb4[1]. Are we going to avoid those problems?
>

I'm not sure, but Java's UTF-8 encoder/decoder looks good.
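
A quick check along these lines, using only the JDK. Utf8FourByteCheck is a
hypothetical name; the sample character U+1F600 lies outside the Basic
Multilingual Plane, so it needs four bytes in UTF-8 and a surrogate pair in
a Java string:

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical check for illustration; uses only standard JDK calls. */
final class Utf8FourByteCheck {
  private Utf8FourByteCheck() {}

  /** Number of bytes the string occupies when encoded as UTF-8. */
  static int utf8Length(String s) {
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  /** True if encoding to UTF-8 and decoding again yields the same string. */
  static boolean roundTrips(String s) {
    byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
    return new String(encoded, StandardCharsets.UTF_8).equals(s);
  }

  public static void main(String[] args) {
    String emoji = "\uD83D\uDE00"; // U+1F600 as a surrogate pair
    System.out.println(utf8Length(emoji));                      // 4
    System.out.println(emoji.codePointCount(0, emoji.length())); // 1
    System.out.println(roundTrips(emoji));                      // true
  }
}
```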


>
> Julian
>
> [1] https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html
>
>
>
> > On Nov 15, 2018, at 4:13 AM, Ted Xu <frankxus@xxxxxxxxx> wrote:
> >
> > Hi folks,
> >
> > I created a design doc
> > https://docs.google.com/document/d/1wo5byn_6K_YOKiPdXNav1zgzt9IBC3SbPvpPnIShtXk/edit?usp=sharing
> > for supporting charsets in Calcite, per previous discussions on this
> > topic.
> >
> > One thing I'm not sure about is the runtime change (codegen on Enumerable,
> > RelExecutor, etc.). Since I/O is decoupled by pluggable points like
> > Schemas#enumerable, that part looks good to me already.
> >
> > I'm sure there are a lot of misunderstandings and missing pieces in the doc
> > above, so please feel free to leave comments.
>
>