RE: UTF-8 progress: msg#00190 db.tds.freetds
> From: ZIGLIO Frediano [mailto:Frediano.Ziglio@xxxxxxxxxxxx]
> Sent: November 17, 2003 5:52 AM
>
> IMHO allocating a buffer for text that can contain any convertible
> string from n wire bytes is sometimes a waste of memory so I though
> tds_get_char_data (or read_and_convert for it) should reallocate text
> buffer, not the caller.

Whenever possible, it's best to keep alloc & free together. I'd be
happier if tds_get_char_data() didn't call free(3). I was trying
(unsuccessfully, it seems) to follow your paradigm. Sorry about that.

I see your point about wasting memory. Even if the ntext field has only
ISO-8859-1 data, we allocate 4x the memory, just in case it happens to
be Ancient Sumerian or something.

(N.B. On further reading about UTF-8, it turns out that UCS-2 can
*always* be represented in no more than a three-byte UTF-8 sequence.
Never four bytes. So we get back 25% of our allocation. For our
purposes, UTF-8's max_bytes_per_char is 3, unless the input is UTF-8 or
UCS-4(!).)

I want the caller to allocate the memory. It's better than propagating
special-case memory management all the way down to the wire code, and
much easier to understand. Unless you want to move *all* buffer
allocations down there, instead of in tds_get_data_info().

There are ways to be more intelligent about how much is required. For
example, determine_adjusted_size() could be wiser; we know ISO-8859-1
data never need more than 2x as UTF-8. UCS-2 data never need more than
1.5x as UTF-8. I think even "strange" stateful encodings never expand
by more than 50% when converted to UCS-2 or UTF-8.

Perhaps instead of:

	size = client_charset.max_bytes_per_char /
	       server_charset.min_bytes_per_char;

it could be:

	size = client_charset.max_bytes_per_char /
	       server_charset.max_bytes_per_char;
	                      ^^^

That, and changing UTF-8's max_bytes_per_char to 3 will answer most of
your concerns, I think.

> I removed the code for partial utf8 (see attachment) and test worked (it
> used to fail only for text problem).
> This test use NVARCHAR/NTEXT for server and UTF8 (forced) for
> client so Russian can be represented without problems.

OK. You know, I worked really hard on that comment. ;-)

> IMHO use should use tds_iconv for partial conversion and reporting error
> from read_and_convert (or whatever) in chunk cases.

tds_iconv is iconv(3) + memcpy(3) + error messages. We don't want to
suppress all messages, only silly ones, so we can't pass a NULL socket.

iconv(3) sets errno to EINVAL only when "an incomplete multibyte
sequence is encountered in the input, and the input byte sequence
terminates after it." I.e., EINVAL is reported only for end-of-buffer
errors (mid-buffer would be EILSEQ, I guess).

Why not this: Let tds_iconv() never emit an EINVAL message. Just
propagate errno. The caller will notice that *inbytesleft > 0, will
know if it's because of chunking, can move the partial character to
&temp[0], and continue. If it's not due to chunking, the caller can
emit the message.

If there are many places where tds_iconv() is called and an EINVAL
error would be potentially ignored, we can make it easy for callers to
emit the appropriate message by offering a special-purpose function:
tds_complain_einval().

> In some cases EINVAL overwrite EILSEQ error (see utf_2 results).

I'll look into that.

> About EILSEQ. On inconvertible sequence we can discard a characters and
> replace it with a '?' however on invalid input sequence (like 0xC2C2 or
> 0xFE for UTF-8) we can't (and I don't know what should be the correct
> FreeTDS behaviour in this case...)

Sure we can:

	0xFE => '?'
	0xC2C2 => '?' '?'
	0xC2C2C2C2C2C2 => '?' '?' '?' '?' '?' '?'

but a valid input sequence lacking merely a corresponding character in
the output character set would get just one '?'. Every invalid sequence
(even a sequence of 1) results in a '?'. That's good enough. The data
are lost anyway.

Idea: We probably don't need skip_one_input_sequence(). Just
++*inbuf; --*inbytesleft; and retry.
Suppress successive EILSEQ messages until we get a good one.

> A question about discarding unconverted bytes from server: does server
> test characters consistency when application store data? I don't know so
> it's better to discard unconverted bytes. ie: if client store
> 0xC2C2303030... in a utf8 column and server do not test consistency we
> get an invalid sequence and conversion stop at first byte. Discarding
> others wire data keep dialog (FreeTDS <-> sql server) consistency.

I'm not sure I understand. I think what the server stores and verifies
is distinct from any protocol issue.

> I did also some fixes in token.c:
> - adjust_character_column_size should be called after
>   curcol->iconv_info initialization
> - realloc return NULL if it can't reallocate buffer but never free
>   original buffer so blob_info->text_value =
>   realloc(blob_info->text_value, size) cause a leak in low resource
>   conditions. Usually I use a temporary pointer to prevent this leak.

Thank you, and I'm sorry I messed up your code.

I don't think we even need the malloc/realloc branch. realloc(3)
behaves like malloc(3) if its first argument is NULL.

--jkl
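The two realloc(3) points above (use a temporary pointer so a failed
call does not leak the original buffer, and rely on a NULL first
argument instead of a separate malloc branch) can be sketched roughly
as below. This is only an illustration, not the code in token.c:
grow_buffer() is a hypothetical helper name, and the sketch assumes the
requested size is never zero.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper, not a FreeTDS function: grow *buf to new_size bytes.
 * Returns 0 on success. On failure returns -1 and leaves *buf untouched,
 * so the caller still owns (and can later free) the original block. */
static int grow_buffer(char **buf, size_t new_size)
{
    char *tmp;

    assert(buf != NULL);
    assert(new_size > 0);   /* assumption: zero-size requests are not made */

    /* realloc(NULL, n) behaves like malloc(n), so a first allocation
     * needs no separate malloc branch. */
    tmp = realloc(*buf, new_size);
    if (tmp == NULL)
        return -1;          /* *buf is still valid here; nothing has leaked */

    *buf = tmp;
    return 0;
}
```

A caller would start with a NULL pointer, call grow_buffer(&p, size)
whenever more room is needed, and free(p) once at the end, whether or
not the last call succeeded.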