Re: Charset support proposal

I agree that a change to RelDataTypeSystem is probably not necessary. But at the code level, we have to solve the following problem: there are many, many places where CHAR or VARCHAR types are created, and each of those will need to be assigned the correct char set. In some cases that will be the table, session or system default; in others it will be taken from existing CHAR or VARCHAR types. So the challenge is to do this without verbosity while minimizing the chance of coder error. After all, the person maintaining this code will likely not be an expert in NLS.
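
To make the idea concrete, here is a minimal sketch of how the choice could be funneled through one place. It assumes a hypothetical CharsetResolver helper, which is not an existing Calcite class, and it assumes the fallback order is table, then session, then system; both the names and the order are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of Calcite): one place that decides the char set. */
public final class CharsetResolver {
  private final Charset tableDefault;   // may be null if the table declares none
  private final Charset sessionDefault; // may be null if the session declares none
  private final Charset systemDefault;  // required, last resort

  public CharsetResolver(Charset tableDefault, Charset sessionDefault,
      Charset systemDefault) {
    if (systemDefault == null) {
      throw new IllegalArgumentException("system default charset is required");
    }
    this.tableDefault = tableDefault;
    this.sessionDefault = sessionDefault;
    this.systemDefault = systemDefault;
  }

  /** Char set for a brand-new CHAR/VARCHAR type: table, then session, then system. */
  public Charset forNewType() {
    if (tableDefault != null) {
      return tableDefault;
    }
    if (sessionDefault != null) {
      return sessionDefault;
    }
    return systemDefault;
  }

  /** Char set for a type derived from an existing CHAR/VARCHAR: inherit if known. */
  public Charset forDerivedType(Charset existing) {
    return existing != null ? existing : forNewType();
  }

  public static void main(String[] args) {
    CharsetResolver resolver = new CharsetResolver(
        null, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
    // No table default is set, so the session default is used.
    System.out.println(resolver.forNewType());
    // A derived type keeps the char set of the type it came from.
    System.out.println(resolver.forDerivedType(StandardCharsets.UTF_16LE));
  }
}
```

With something like this, code that creates a character type would call forNewType() or forDerivedType(...) instead of choosing a char set itself, which keeps call sites short and leaves little room for mistakes.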

> On Nov 17, 2018, at 5:31 AM, Ted Xu <frankxus@xxxxxxxxx> wrote:
> Thanks Stamatis for the comment and advice!
> You've made a good point on RelDataType. I agree that a RelDataTypeSystem
> change is not a necessity.
> I also agree with the work prioritization: the DDL changes are not urgent
> compared to core correctness. I'd like to create JIRA tickets for the
> topics on which we already have agreement, such as types and rules.
> On Fri, Nov 16, 2018 at 6:29 AM Stamatis Zampetakis <zabetak@xxxxxxxxx>
> wrote:
>> Thanks for the initiative Ted!
>> The document mentions various interesting features, like enriching DDL
>> operations with charset and collation, but I think it is also worth
>> mentioning a few points in the existing API that are related to these
>> notions.
>> The RelDataType (column data type) already includes the notions of charset
>> (RelDataType#getCharset) and collation (RelDataType#getCollation). Moreover,
>> the RelDataTypeFactory allows creating types with a specific charset and
>> collation through the method
>> RelDataTypeFactory#createTypeWithCharsetAndCollation, and it also provides a
>> method to obtain the default charset
>> (RelDataTypeFactory#getDefaultCharset).
>> Basically, with the above it is already possible for users to pass their own
>> charset and collation as far as tables and columns are concerned. Also, by
>> providing a custom RelDataTypeFactory they can control which default
>> charset is going to be used.
>> Regarding the operations performed on columns with charsets/collations, the
>> SQL standard has specific rules on what should or should not be done, along
>> with implicit/explicit conversions [1]. I don't know to what extent these
>> rules exist in Calcite, but from looking in various places they don't seem
>> to be present.
>> Apart from the DDL, which can be extended relatively easily, I think the
>> biggest challenge is to incorporate the Standard's rules in the necessary
>> places.
>> From my point of view, it seems more urgent to improve the runtime to
>> behave correctly in the presence of charset and collations. Overall, I find
>> the subject very interesting and at the same time very important so thanks
>> a lot for working on it.
>> Best,
>> Stamatis
>> [1] 4.2.3 Rules determining collating sequence usage
>> <http://web.cecs.pdx.edu/~len/sql1999.pdf>
>> On Thu, Nov 15, 2018 at 7:06 PM, Julian Hyde <jhyde@xxxxxxxxxx> wrote:
>>> Looks great - thank you for writing this. I have some questions. If they
>>> are already answered in your document, forgive me, and just say "That's
>>> answered in the document."
>>> I very much like the idea of adding default charset and collation to
>>> RelDataTypeSystem. This will help to carry them to all points in the code
>>> where they are needed.
>>> I also like the idea of adding charset and collation as table options. It
>>> seems that this feature is non-essential, and could be done in phase 2, if
>>> necessary. Also, it mainly applies to SQL DDL, i.e. the "server" module. I
>>> don't think we need to add default charset and collation to the Table or
>>> RelOptTable interfaces, just SqlCreateTable.
>>> Regarding the column options: could charset and collation not be
>>> specified as part of the column's data type?
>>> When we are parsing a SQL character literal, the characters of that
>>> literal are in the same encoding as the SQL string itself. The parser (see
>>> the line 'UNICODE_INPUT = true;' in the generated Parser.jj file) seems to
>>> assume that input is Unicode. That seems fine to me. Do you agree?
>>> Unqualified character literals (e.g. 'hello', vs qualified _UTF8'hello')
>>> are always UTF16. Is that correct? Should we provide a way to change that
>>> default? Do any major databases provide a way to change that default?
>>> In a scenario where different columns have different charsets/collations,
>>> I assume that there will be a lot of implicit conversion going on. (Not to
>>> mention explicit conversion, using CONVERT.) Are there concerns about this?
>>> Are the rules well-defined if, say, we compare a UTF8 with a UTF16 string,
>>> or concatenate a UTF8 with a UTF16 string?
>>> I saw that MySQL has problems with 3-byte utf8 (aka utf8mb3) and 4-byte
>>> utf8mb4[1]. Are we going to avoid those problems?
>>> Julian
>>> [1] https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html
>>>> On Nov 15, 2018, at 4:13 AM, Ted Xu <frankxus@xxxxxxxxx> wrote:
>>>> Hi folks,
>>>> I created a design doc
>>>> https://docs.google.com/document/d/1wo5byn_6K_YOKiPdXNav1zgzt9IBC3SbPvpPnIShtXk/edit?usp=sharing
>>>> for supporting charset in Calcite, per previous discussions on this
>>>> topic.
>>>> One thing I'm not sure about is the runtime change (codegen on Enumerable,
>>>> RelExecutor, etc.). Since I/O is decoupled by pluggable points like
>>>> Schemas#enumerable, that part looks good to me already.
>>>> I'm sure there are a lot of misunderstandings and missing pieces in the
>>>> doc above, so please feel free to leave comments.