RE: UTF-8 progress: msg#00186

db.tds.freetds

Subject: RE: UTF-8 progress

>
> Freddy,
>
> Trying to get src/tds/unittests/utf_1.c to work.
>
> Nice test, by the way.
>
> When the column metadata arrive, we call
> adjust_character_column_size().
> As far as the client's concerned, the column is as wide as
> need be, for post-converted data. An nvarchar(10) would have
> column_size of 5 for an ISO-8859-1 client, 10 for UCS-2 (one
> day), and 40 for UTF-8 (allowing for worst case scenario).
>
> For nchar/nvarchar, we then allocate a fixed buffer for the
> column to read the row data into. We can't do that for
> blobs, because their stated maximum length is 2 GB.
>
> But we were doing something both unnecessary and ugly,
> afaict. Instead of passing blob_info->textvalue to
> tds_get_char_data() as dest, we cast blob_info to char*.
> Then in tds_get_char_data(), reversed the process.
>

IMHO allocating a buffer big enough to hold any string that n wire bytes
could convert to is sometimes a waste of memory, so I thought
tds_get_char_data (or read_and_convert on its behalf) should reallocate
the text buffer, not the caller. Originally we passed a buffer for
characters, so the parameter name is dest (perhaps row_buffer is better).
In tds_get_char_data, if the wire data is zero bytes and it's a blob, you
free the buffer but do not reset textvalue, leaving a pointer to freed
memory that leads to a subsequent double free and/or heap corruption.
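
A minimal sketch of the fix described above: free the buffer and clear the
pointer in the same place. The struct and function names here are
hypothetical stand-ins for illustration, not the real FreeTDS definitions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the blob descriptor; not the real FreeTDS struct. */
struct blob_sketch {
    char *textvalue;
};

/* Release the blob's text buffer and clear the pointer, so that a later
 * free() or realloc() on the same field cannot touch freed memory. */
static void blob_release_text(struct blob_sketch *blob)
{
    assert(blob != NULL);
    free(blob->textvalue);   /* free(NULL) is a no-op, so repeated calls are safe */
    blob->textvalue = NULL;  /* no dangling pointer left behind */
}
```

With the pointer reset, calling the release path twice for an empty blob
becomes harmless instead of a double free.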

> I changed it to pass blob_info->textvalue. And fixed a bunch
> of other things.
>
> utf_1.c now works with nvarchar for all strings[], and with text for:
>
> english,
> spanish,
> french,
> portuguese
>
> It breaks on russian. I'm sure that's because text is
> single-byte encoded, and Russian can't be represented in my
> server's charset. So, I think the test is broken.
>

I removed the code for partial UTF-8 (see attachment) and the test worked
(it used to fail only because of the text problem).
This test uses NVARCHAR/NTEXT on the server and UTF-8 (forced) on the
client, so Russian can be represented without problems.

> read_and_convert() is now simpler and more robust (if I do
> say so myself). And handles UTF-8, as promised. I haven't
> tested the chunk-boundary logic yet; I was kinda hoping the
> unit test would do that for me.
>

Try --enable-extra-checks and run the utf8_2 test; you will see some
conversion problems. Removing the special UTF-8 code from
read_and_convert fixes the conversion problems (but not the error
reporting). Also, your code does not handle Big5 or other strange
encodings...
IMHO we should use tds_iconv for partial conversion and report errors
from read_and_convert (or whatever) in chunk cases. The tds parameter in
tds_iconv is only used to report errors/warnings, so we can pass NULL to
disable error reporting. In some cases EINVAL overwrites an EILSEQ
error (see utf_2 results).

iconv returns three types of error (as the documentation says):
- E2BIG destination buffer too short
- EILSEQ invalid multibyte sequence OR input that cannot be converted
(the second case is usually not documented, but the distinction is
important)
- EINVAL incomplete multibyte sequence
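
To make the three cases concrete, here is a small sketch that calls plain
iconv(3) (not tds_iconv) once on a chunk and maps errno to a status. The
enum and the function name convert_chunk are assumptions made for this
example only.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

enum conv_status {
    CONV_OK,
    CONV_DEST_FULL,    /* E2BIG: destination buffer too short */
    CONV_BAD_INPUT,    /* EILSEQ: invalid OR inconvertible sequence */
    CONV_INCOMPLETE,   /* EINVAL: input ends inside a multibyte sequence */
    CONV_OTHER_ERROR
};

/* Convert one chunk from charset `from` to charset `to` and classify the
 * result. *written receives the number of bytes stored in dst. */
static enum conv_status
convert_chunk(const char *to, const char *from,
              const char *src, size_t src_len,
              char *dst, size_t dst_len, size_t *written)
{
    char *in = (char *) src;   /* iconv does not modify the input bytes */
    char *out = dst;
    size_t in_left = src_len, out_left = dst_len;
    enum conv_status status = CONV_OK;
    iconv_t cd;

    *written = 0;
    cd = iconv_open(to, from);
    if (cd == (iconv_t) -1)
        return CONV_OTHER_ERROR;

    if (iconv(cd, &in, &in_left, &out, &out_left) == (size_t) -1) {
        switch (errno) {       /* read errno before any other library call */
        case E2BIG:  status = CONV_DEST_FULL;   break;
        case EILSEQ: status = CONV_BAD_INPUT;   break;
        case EINVAL: status = CONV_INCOMPLETE;  break;
        default:     status = CONV_OTHER_ERROR; break;
        }
    }
    *written = dst_len - out_left;
    iconv_close(cd);
    return status;
}
```

Note that errno alone cannot separate the two EILSEQ cases; the caller has
to inspect the input at the point where conversion stopped.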

About EILSEQ: on an inconvertible sequence we can discard the character
and replace it with a '?'; however, on an invalid input sequence (like
0xC2C2 or 0xFE for UTF-8) we can't (and I don't know what the correct
FreeTDS behaviour should be in this case...)
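
One possible way to tell the two EILSEQ cases apart for UTF-8 input, as a
sketch only: check whether the bytes at the failure point form a
well-formed sequence. If they do, the character is merely inconvertible
and can be skipped and replaced with '?'; if not, the input itself is
invalid. The function name utf8_seq_len is hypothetical and this is not
FreeTDS code.

```c
#include <stddef.h>

/* Return the byte length of the well-formed UTF-8 sequence starting at s,
 * or 0 if the bytes are ill-formed or truncated. */
static size_t utf8_seq_len(const unsigned char *s, size_t avail)
{
    size_t need, i;
    unsigned char lo = 0x80, hi = 0xBF;   /* allowed range for the 2nd byte */

    if (avail == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;                         /* plain ASCII */
    if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        need = 2;
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        need = 3;
        if (s[0] == 0xE0) lo = 0xA0;      /* reject overlong forms */
        if (s[0] == 0xED) hi = 0x9F;      /* reject surrogates */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        need = 4;
        if (s[0] == 0xF0) lo = 0x90;      /* reject overlong forms */
        if (s[0] == 0xF4) hi = 0x8F;      /* stay within U+10FFFF */
    } else {
        return 0;   /* 0x80..0xC1 and 0xF5..0xFF never start a sequence */
    }

    if (avail < need)
        return 0;
    if (s[1] < lo || s[1] > hi)
        return 0;
    for (i = 2; i < need; i++)
        if (s[i] < 0x80 || s[i] > 0xBF)
            return 0;
    return need;
}
```

With this check, 0xC2C2 and 0xFE are reported as ill-formed (0), while a
Cyrillic letter such as 0xD094 is a valid two-byte sequence that a
single-byte client charset may simply be unable to represent.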

> I also manually re-indented some header files, so the
> comments line up and things like that. Please don't run them
> through indent(1) again. :-)
>

Oops... already committed :) OK, I'll keep that in mind.

A question about discarding unconverted bytes from the server: does the
server check character consistency when an application stores data? I
don't know, so it's better to discard unconverted bytes. I.e., if a
client stores 0xC2C2303030... in a UTF-8 column and the server does not
check consistency, we get an invalid sequence and conversion stops at the
first byte. Discarding the remaining wire data keeps the dialog
(FreeTDS <-> SQL Server) consistent.
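
A rough sketch of the idea, using a hypothetical in-memory stand-in for
the wire (the types and function names are invented for illustration and
are not FreeTDS APIs): once conversion gives up partway through a column,
the rest of that column's bytes are skipped so the next read starts at the
next field.

```c
#include <stddef.h>

/* Hypothetical in-memory stand-in for the wire; not a FreeTDS type. */
struct wire_sketch {
    const unsigned char *data;
    size_t len;
    size_t pos;
};

/* Skip up to n bytes; returns the number actually skipped. */
static size_t wire_skip(struct wire_sketch *w, size_t n)
{
    size_t left = w->len - w->pos;

    if (n > left)
        n = left;
    w->pos += n;
    return n;
}

/* Conversion stopped after `consumed` bytes of a column that is `col_len`
 * wire bytes long. Drop the remainder so the stream stays in sync. */
static size_t discard_rest_of_column(struct wire_sketch *w,
                                     size_t col_len, size_t consumed)
{
    return consumed < col_len ? wire_skip(w, col_len - consumed) : 0;
}
```

For the 0xC2C2303030 case, conversion consumes nothing, so all five bytes
of the column are discarded and the reader lands on the next field.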

I also made some fixes in token.c:
- adjust_character_column_size should be called after curcol->iconv_info
is initialized
- realloc returns NULL if it can't reallocate the buffer but never frees
the original buffer, so blob_info->textvalue =
realloc(blob_info->textvalue, size) causes a leak in low-resource
conditions. I usually use a temporary pointer to prevent this leak.
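
The temporary-pointer pattern mentioned above can be sketched like this.
The helper name grow_buffer is hypothetical, and the sketch assumes
new_size is greater than zero.

```c
#include <stdlib.h>

/* Resize *buf to new_size bytes (new_size > 0 assumed).
 * Returns 0 on success and -1 on failure. On failure *buf still points to
 * the original, untouched allocation, so the caller can free or reuse it. */
static int grow_buffer(char **buf, size_t new_size)
{
    char *tmp = realloc(*buf, new_size);

    if (tmp == NULL)
        return -1;   /* original block is still valid and still owned */
    *buf = tmp;
    return 0;
}
```

Because the result goes into tmp first, a failed realloc never overwrites
the only pointer to the old block.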

freddy77

Attachment: vedi.diff.gz
Description: vedi.diff.gz

_______________________________________________
FreeTDS mailing list
FreeTDS@xxxxxxxxxxxxxxxxx
http://lists.ibiblio.org/mailman/listinfo/freetds