
Re: Charset support proposal


Looks great - thank you for writing this. I have some questions. If they are already answered in your document, forgive me, and just say “That’s answered in the document.”

I very much like the idea of adding default charset and collation to RelDataTypeSystem. This will help to carry them to all points in the code where they are needed.

I also like the idea of adding charset and collation as table options. It seems that this feature is non-essential, and could be done in phase 2, if necessary. Also, it mainly applies to SQL DDL, i.e. the “server” module. I don’t think we need to add default charset and collation to the Table or RelOptTable interfaces, just SqlCreateTable.

Regarding the column options: could charset and collation not be specified as part of the column’s data type?

When we are parsing a SQL character literal, the characters of that literal are in the same encoding as the SQL string itself. The parser (see the line 'UNICODE_INPUT = true;' in the generated Parser.jj file) seems to assume that input is Unicode. That seems fine to me. Do you agree?

Unqualified character literals (e.g. 'hello', vs qualified _UTF8'hello') are always UTF16. Is that correct? Should we provide a way to change that default? Do any major databases provide a way to change that default?

In a scenario where different columns have different charsets/collations, I assume that there will be a lot of implicit conversion going on. (Not to mention explicit conversion, using CONVERT.) Are there concerns about this? Are the rules well-defined if, say, we compare a UTF8 string with a UTF16 string, or concatenate a UTF8 string with a UTF16 string?
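
To make the comparison question concrete, here is a minimal sketch in plain Java. It is not Calcite code, and the class and method names (CrossCharsetSketch, sameText, concat) are hypothetical. It assumes one possible rule: decode both operands to Java strings, then compare, or concatenate and re-encode in a chosen target charset. Collation is ignored entirely.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; none of these names exist in Calcite.
final class CrossCharsetSketch {
    /** Compares two encoded values by decoding each with its own charset first. */
    static boolean sameText(byte[] a, Charset ca, byte[] b, Charset cb) {
        return new String(a, ca).equals(new String(b, cb));
    }

    /** Concatenates two encoded values and re-encodes the result in the target charset. */
    static byte[] concat(byte[] a, Charset ca, byte[] b, Charset cb, Charset target) {
        String joined = new String(a, ca) + new String(b, cb);
        return joined.getBytes(target);
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo"; // contains one non-ASCII character
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE);

        // The raw bytes differ, so a byte-wise comparison reports a mismatch.
        System.out.println(Arrays.equals(utf8, utf16));
        // Decoding first gives the answer a user would expect.
        System.out.println(sameText(utf8, StandardCharsets.UTF_8,
            utf16, StandardCharsets.UTF_16BE));
    }
}
```

The point is only that any comparison or concatenation across charsets needs an agreed intermediate form and, for concatenation, a rule for picking the result charset.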

I saw that MySQL has problems with 3-byte utf8 (aka utf8mb3) and 4-byte utf8mb4[1]. Are we going to avoid those problems?
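
For reference, the distinction comes down to characters outside the Basic Multilingual Plane, which need 4 bytes in UTF-8 and a surrogate pair in UTF-16. A small plain-Java sketch follows; the class and method names (Utf8WidthSketch, fitsInThreeByteUtf8, utf8Length) are hypothetical and not part of Calcite or MySQL.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not Calcite or MySQL code.
final class Utf8WidthSketch {
    /** True if every code point is in the BMP, i.e. needs at most 3 UTF-8 bytes. */
    static boolean fitsInThreeByteUtf8(String s) {
        return s.codePoints().allMatch(cp -> cp <= 0xFFFF);
    }

    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String bmp = "h\u00e9llo";      // all characters are in the BMP
        String emoji = "\uD83D\uDE00";  // U+1F600, outside the BMP

        System.out.println(fitsInThreeByteUtf8(bmp));    // true
        System.out.println(fitsInThreeByteUtf8(emoji));  // false
        System.out.println(utf8Length(emoji));           // 4
        // One code point, but two UTF-16 code units (a surrogate pair).
        System.out.println(emoji.length());              // 2
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1
    }
}
```

If lengths or precision are ever counted in bytes or UTF-16 code units rather than code points, such characters are where the counts diverge.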

Julian

[1] https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html



> On Nov 15, 2018, at 4:13 AM, Ted Xu <frankxus@xxxxxxxxx> wrote:
> 
> Hi folks,
> 
> I created a design doc
> https://docs.google.com/document/d/1wo5byn_6K_YOKiPdXNav1zgzt9IBC3SbPvpPnIShtXk/edit?usp=sharing
> for supporting charset in calcite, per previous discussions on this topic.
> 
> One thing I'm not sure about is the runtime change (codegen on Enumerable,
> RelExecutor, etc.). Since I/O is decoupled by pluggable points like
> Schemas#enumerable, that part looks good to me already.
> 
> I'm sure there are a lot of misunderstandings and missing pieces in that doc
> above, so please feel free to leave comments.