RE: UTF-8 progress: msg#00186db.tds.freetds
> Freddy,
>
> Trying to get src/tds/unittests/utf_1.c to work.
>
> Nice test, by the way.
>
> When the column metadata arrive, we call
> adjust_character_column_size().
> As far as the client's concerned, the column is as wide as
> need be, for post-converted data. An nvarchar(10) would have
> column_size of 5 for an ISO-8859-1 client, 10 for UCS-2 (one
> day), and 40 for UTF-8 (allowing for worst case scenario).
>
> For nchar/nvarchar, we then allocate a fixed buffer for the
> column to read the row data into. We can't do that for
> blobs, because their stated maximum length is 2 GB.
>
> But we were doing something both unnecessary and ugly,
> afaict. Instead of passing blob_info->textvalue to
> tds_get_char_data() as dest, we cast blob_info to char*.
> Then in tds_get_char_data(), reversed the process.
>

IMHO, allocating a buffer for text that can hold any string convertible from n wire bytes is sometimes a waste of memory, so I thought tds_get_char_data (or read_and_convert on its behalf) should reallocate the text buffer, not the caller. Originally we passed a buffer for characters, which is why the parameter is named dest (perhaps row_buffer is better). In tds_get_char_data, if the wire data is zero bytes and it's a blob, you free the buffer but do not reset text_value, leaving a pointer to freed memory that leads to a subsequent double free and/or heap corruption.

> I changed it to pass blob_info->textvalue. And fixed a bunch
> of other things.
>
> utf_1.c now works with nvarchar for all strings[], and with text for:
>
> english,
> spanish,
> french,
> portuguese
>
> It breaks on russian. I'm sure that's because text is
> single-byte encoded, and Russian can't be represented in my
> server's charset. So, I think the test is broken.
>

I removed the code for partial utf8 (see attachment) and the test worked (it used to fail only because of the text problem). This test uses NVARCHAR/NTEXT on the server and UTF8 (forced) on the client, so Russian can be represented without problems.
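The dangling-pointer problem described above (freeing the blob buffer without resetting text_value) can be illustrated with a minimal sketch. The struct and function names here are hypothetical stand-ins, not the real FreeTDS definitions; the only point is that the pointer is cleared in the same place the buffer is freed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the blob descriptor; not the real FreeTDS struct. */
struct blob_sketch {
    char *text_value;
};

/* Release the text buffer and clear the pointer, so that a later
 * free() on the same field is a harmless free(NULL) instead of a
 * double free. */
static void
blob_sketch_clear(struct blob_sketch *b)
{
    assert(b != NULL);
    free(b->text_value);
    b->text_value = NULL;
}
```

Calling blob_sketch_clear() twice in a row on the same struct is safe, because the second call only sees a NULL pointer.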
> read_and_convert() is now simpler and more robust (if I do
> say so myself). And handles UTF-8, as promised. I haven't
> tested the chunk-boundary logic yet; I was kinda hoping the
> unit test would do that for me.
>

Try --enable-extra-checks and run the utf8_2 test; you will see some conversion problems. Removing the special utf8 code from read_and_convert fixes the conversion problems (but not the error reporting). Also, your code does not handle big5 or other strange encodings... IMHO we should use tds_iconv for partial conversion and report errors from read_and_convert (or whatever) in the chunk cases. The tds parameter in tds_iconv is only used to report errors/warnings, so we can pass a NULL value to disable error reporting. In some cases EINVAL overwrites an EILSEQ error (see utf_2 results).

iconv returns three types of error (as the documentation says):
- E2BIG: destination buffer too short
- EILSEQ: invalid multibyte sequence OR input that is impossible to convert (usually not documented, but the distinction is important)
- EINVAL: incomplete multibyte sequence

About EILSEQ. On an inconvertible sequence we can discard the character and replace it with a '?'; however, on an invalid input sequence (like 0xC2C2 or 0xFE for UTF-8) we can't (and I don't know what the correct FreeTDS behaviour should be in this case...)

> I also manually re-indented some header files, so the
> comments line up and things like that. Please don't run them
> through indent(1) again. :-)
>

Oops... committed :) OK, I'll keep it in mind.

A question about discarding unconverted bytes from the server: does the server check character consistency when an application stores data? I don't know, so it's better to discard unconverted bytes. I.e., if a client stores 0xC2C2303030... in a utf8 column and the server does not check consistency, we get an invalid sequence and conversion stops at the first byte. Discarding the remaining wire data keeps the dialog (FreeTDS <-> SQL Server) consistent.
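The three iconv error cases listed above can be told apart by checking errno after a failed call. The following is only a sketch using plain iconv(3), not tds_iconv; convert_chunk and the conv_result codes are hypothetical names made up for illustration. Whether an inconvertible (as opposed to invalid) character yields EILSEQ depends on the iconv implementation.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch; not FreeTDS names. */
enum conv_result {
    CONV_OK,            /* whole chunk converted */
    CONV_OUTPUT_FULL,   /* E2BIG: destination buffer too short */
    CONV_BAD_SEQUENCE,  /* EILSEQ: invalid or inconvertible input */
    CONV_INCOMPLETE,    /* EINVAL: chunk ends inside a multibyte sequence */
    CONV_FAILED         /* iconv_open failed or unexpected errno */
};

/* Convert one chunk from charset `from` to charset `to` and
 * classify why iconv() stopped. */
static enum conv_result
convert_chunk(const char *from, const char *to,
              const char *in, size_t in_len, char *out, size_t out_len)
{
    iconv_t cd;
    char *ip = (char *) in;   /* iconv() takes char **; input bytes are not modified */
    char *op = out;
    size_t rc;
    int err;

    assert(in != NULL && out != NULL);

    cd = iconv_open(to, from);
    if (cd == (iconv_t) -1)
        return CONV_FAILED;

    rc = iconv(cd, &ip, &in_len, &op, &out_len);
    err = errno;              /* save before iconv_close() can change it */
    iconv_close(cd);

    if (rc != (size_t) -1)
        return CONV_OK;

    switch (err) {
    case E2BIG:
        return CONV_OUTPUT_FULL;
    case EILSEQ:
        return CONV_BAD_SEQUENCE;
    case EINVAL:
        return CONV_INCOMPLETE;
    default:
        return CONV_FAILED;
    }
}
```

A chunk boundary that splits a character shows up as CONV_INCOMPLETE: for instance, passing the single byte 0xC3 as UTF-8 input stops with EINVAL, because 0xC3 only starts a two-byte sequence. The caller would keep that byte and prepend it to the next chunk.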
I also made some fixes in token.c:
- adjust_character_column_size should be called after curcol->iconv_info is initialized
- realloc returns NULL if it can't reallocate the buffer but never frees the original buffer, so blob_info->text_value = realloc(blob_info->text_value, size) causes a leak in low-resource conditions. I usually use a temporary pointer to prevent this leak.

freddy77
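For reference, the temporary-pointer idiom for realloc mentioned above looks roughly like this. It is a minimal sketch: resize_text_buffer is a hypothetical helper, not a function in token.c.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: grow or shrink *bufp to `size` bytes.
 * Returns 0 on success, -1 on failure. On failure *bufp still
 * points to the original, still-valid buffer, so nothing leaks
 * and the caller can free it or keep using it. */
static int
resize_text_buffer(char **bufp, size_t size)
{
    char *tmp;

    assert(bufp != NULL && size > 0);

    tmp = realloc(*bufp, size);
    if (tmp == NULL)
        return -1;      /* original buffer untouched */

    *bufp = tmp;        /* overwrite only after realloc succeeded */
    return 0;
}
```

Assigning the result of realloc straight back to the only copy of the pointer loses the original block when realloc fails; going through tmp avoids that.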
FreeTDS mailing list FreeTDS@xxxxxxxxxxxxxxxxx http://lists.ibiblio.org/mailman/listinfo/freetds