RE: UTF-8 progress: msg#00194

db.tds.freetds

Subject: RE: UTF-8 progress

> >
> > IMHO allocating a buffer for text that can contain any convertible
> > string from n wire bytes is sometimes a waste of memory, so I thought
> > tds_get_char_data (or read_and_convert for it) should reallocate the
> > text buffer, not the caller.
>
> Whenever possible, it's best to keep alloc & free together.
> I'd be happier if tds_get_char_data() didn't call free(3). I
> was trying (unsuccessfully, it seems) to follow your
> paradigm. Sorry about that.
>

I left tds_get_char_data/tds_get_data in a half-implemented state, so this was
bound to happen; don't worry about it...

> I see your point about wasting memory. Even if the ntext
> field has only ISO-8859-1 data, we allocate 4x the memory,
> just in case it happens to be Ancient Sumerian or something.
>
> (N.B. On further reading about UTF-8, it turns out that
> UCS-2 can *always* be represented in no more than a
> three-byte UTF-8 sequence. Never four bytes. So we get back
> 25% of our allocation. For our purposes, UTF-8's
> max_bytes_per_char is 3, unless the input is UTF-8 or UCS-4(!).)
>

Many charsets can be converted using at most 3 UTF-8 bytes per character;
however, some (UTF-8, UCS-4, ISO-2022-CN-EXT, others?) require 4 bytes. I'll
finish my test program for the iconv stuff (I don't know when)...

> I want the caller to allocate the memory. It's better than
> propagating special-case memory management all the way down
> to the wire code, much easier to understand. Unless you want
> to move *all* buffer allocations down there, instead of in
> tds_get_data_info().
>

As you said, "it's best to keep alloc & free together". We have a free in
tds_get_char_data and a malloc in tds_get_data. You can reduce memory waste in
many cases (like ISO-8859-1 -> UTF-8); however, you can't handle all cases (like
SJIS -> BIG5 or whatever), nor some other common ones: if you want to convert an
English ntext to UTF-8, you have to allow 3 bytes per char, but the average
ratio is usually 0.52 and not 1.5... This can only be fixed by reallocating in
tds_get_char_data.
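
A minimal sketch of what reallocating during conversion could look like,
assuming a plain iconv(3) descriptor rather than FreeTDS's own structures.
convert_growing is a hypothetical helper, not an existing FreeTDS function. It
starts with one output byte per wire byte and doubles the buffer only when
iconv reports E2BIG:

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: convert src_len bytes, growing the result on demand.
 * Returns a malloc'ed buffer (caller frees) and stores its length in *out_len,
 * or returns NULL on allocation failure or a conversion error. */
char *convert_growing(iconv_t cd, const char *src, size_t src_len, size_t *out_len)
{
    size_t cap = src_len ? src_len : 1;  /* optimistic first guess */
    size_t used = 0;
    size_t in_left = src_len;
    char *in = (char *)src;              /* iconv(3) wants char **, input is not modified */
    char *buf;

    assert(out_len != NULL);
    buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    while (in_left > 0) {
        char *outp = buf + used;
        size_t out_left = cap - used;
        size_t rc = iconv(cd, &in, &in_left, &outp, &out_left);

        used = cap - out_left;
        if (rc != (size_t)-1)
            break;                       /* everything converted */
        if (errno != E2BIG) {            /* EILSEQ/EINVAL: not handled in this sketch */
            free(buf);
            return NULL;
        }
        /* Output buffer full: double it and resume where iconv stopped. */
        char *bigger = realloc(buf, cap * 2);
        if (bigger == NULL) {
            free(buf);
            return NULL;
        }
        buf = bigger;
        cap *= 2;
    }
    *out_len = used;
    return buf;
}
```

The trade-off is the one discussed above: alloc and free no longer sit side by
side in the caller, but memory use follows the actual data instead of the
worst case.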

> There are ways to be more intelligent about how much is
> required. For example, determine_adjusted_size() could be
> wiser; we know ISO-8859-1 data never need more than 2x as
> UTF-8. UCS-2 data never need more than 1.5x as UTF-8. I
> think even "strange" stateful encodings never expand by more
> than 50% when converted to UCS-2 or UTF-8.
>
> Perhaps instead of:
>     size = client_charset.max_bytes_per_char
>          / server_charset.min_bytes_per_char;
> it could be:
>     size = client_charset.max_bytes_per_char
>          / server_charset.max_bytes_per_char;
>                           ^^^
> That, and changing UTF-8's max_bytes_per_char to 3 will
> answer most of your concerns, I think.
>

With ISO-8859-1 as client and UTF-8 as server you would get

size = 1 / 3

This means that converting the UTF-8 string "123456789" for the client would
give you "123"... I understand that, with the current implementation,
converting UTF-8 to UTF-8 requires

size = 3 / 1

However, we can extend the test to check the memcpy flags in TDSICONVINFO.
As for being wiser in determine_adjusted_size, perhaps it is not such a good
idea to use max_bytes_per_char/min_bytes_per_char to compute the conversion
range... Adding other fields to TDSICONVINFO and initializing them better could
be simpler and clearer.
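
The arithmetic above can be checked with a small sketch. The struct and both
function names are hypothetical stand-ins for the per-charset limits, not the
real TDSICONVINFO layout, and UTF-8 is given a maximum of 3 bytes per character
as proposed in the quoted text:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-charset limits (not the actual FreeTDS structure). */
struct charset_limits {
    unsigned min_bytes_per_char;
    unsigned max_bytes_per_char;
};

/* Current rule: assume every server character is as short as possible and
 * every client character as long as possible. Never too small. */
size_t worst_case_size(struct charset_limits client,
                       struct charset_limits server, size_t wire_len)
{
    assert(server.min_bytes_per_char > 0);
    return wire_len * client.max_bytes_per_char / server.min_bytes_per_char;
}

/* Proposed rule: divide by the server's maximum instead. Smaller, but it
 * under-allocates whenever the server data uses short characters. */
size_t proposed_size(struct charset_limits client,
                     struct charset_limits server, size_t wire_len)
{
    assert(server.max_bytes_per_char > 0);
    return wire_len * client.max_bytes_per_char / server.max_bytes_per_char;
}
```

For the 9-byte string "123456789" with a UTF-8 server and an ISO-8859-1 client,
the current rule reserves 9 bytes and the proposed rule only 3, which is the
truncation described above. For UTF-8 on both sides the current rule reserves
27 bytes for the same 9 wire bytes.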

> > I removed the code for partial utf8 (see attachment) and the test
> > worked (it used to fail only because of the text problem).
> > This test uses NVARCHAR/NTEXT for the server and UTF8 (forced) for
> > the client, so Russian can be represented without problems.
>
> OK. You know, I worked really hard on that comment. ;-)
>
> > IMHO we should use tds_iconv for partial conversion and report
> > errors from read_and_convert (or whatever) in chunk cases.
>
> tds_iconv is iconv(3) + memcpy(3) + error messages.
>
> We don't want to suppress all messages, only silly ones, so
> we can't pass a NULL socket.
>
> iconv(3) returns EINVAL only when "an incomplete multibyte
> sequence is encountered in the input, and the input byte
> sequence terminates after it." I.e., EINVAL is returned only
> for end of the buffer errors (mid-buffer would be EILSEQ, I guess).
>
> Why not this: Let tds_iconv() never emit an EINVAL message.
> Just propagate errno. The caller will notice that
> *inbytesleft > 0, will know if it's because of chunking, can
> move the partial character to &temp[0], and continue. If
> it's not due to chunking, the caller can emit the message.
>
> If there are many places where tds_iconv() is called and an
> EINVAL error would be potentially ignored, we can make it
> easy for callers to emit the appropriate message by offering
> a special-purpose function: tds_complain_einval().
>
> > In some cases EINVAL overwrites the EILSEQ error (see utf8_2
> > results).
>
> I'll look into that.
>

Assume you have two chunks:

1- "bytes. EILSEQ. bytes. EINVAL (unterminated)": tds_iconv does NOT emit a
message and returns EINVAL in errno, so the EILSEQ is lost
2- "bytes"

Another situation:

1- "bytes. EILSEQ. bytes.": tds_iconv emits a message and returns EILSEQ in
errno
2- "bytes. EILSEQ": tds_iconv emits ANOTHER message and returns EILSEQ in errno

(these situations are tested by utf8_2).
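
A minimal sketch of the chunk handling under discussion, written against plain
iconv(3). The struct, the function name and the buffer sizes are assumptions
made for illustration, not FreeTDS code. An incomplete trailing sequence
(EINVAL) is not treated as an error; it is kept and prepended to the next
chunk:

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

#define CARRY_MAX 8  /* assumed upper bound for one incomplete sequence */

/* Hypothetical per-stream state: the tail of a character split across chunks. */
struct chunk_state {
    char carry[CARRY_MAX];
    size_t carry_len;
};

/* Convert one wire chunk into out. Returns the number of bytes written, or
 * (size_t)-1 with errno set for real errors (EILSEQ, E2BIG). */
size_t convert_chunk(iconv_t cd, struct chunk_state *st,
                     const char *chunk, size_t chunk_len,
                     char *out, size_t out_len)
{
    char temp[256];
    char *in = temp;
    char *outp = out;
    size_t in_left = st->carry_len + chunk_len;
    size_t out_left = out_len;

    assert(in_left <= sizeof temp);

    /* Put the partial character from the previous chunk in front. */
    memcpy(temp, st->carry, st->carry_len);
    memcpy(temp + st->carry_len, chunk, chunk_len);
    st->carry_len = 0;

    if (iconv(cd, &in, &in_left, &outp, &out_left) == (size_t)-1) {
        if (errno != EINVAL || in_left > CARRY_MAX)
            return (size_t)-1;
        /* Chunk ended mid-character: keep the tail, emit no message. */
        memcpy(st->carry, in, in_left);
        st->carry_len = in_left;
    }
    return out_len - out_left;
}
```

If carry_len is still non-zero after the last chunk, the stream really did end
in the middle of a character, and that is the point where the caller would emit
the EINVAL message.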

> > About EILSEQ. On an inconvertible sequence we can discard the
> > character and replace it with a '?'; however, on an invalid input
> > sequence (like 0xC2C2 or 0xFE for UTF-8) we can't (and I don't know
> > what the correct FreeTDS behaviour should be in this case...)
>
> Sure we can:
>
> 0xFE => '?'
> 0xC2C2 => '?' '?'
> 0xC2C2C2C2C2C2 => '?' '?' '?' '?' '?' '?'
> but
> a valid input sequence lacking merely a corresponding
> character in the output character set would get just one '?'
>
> Every invalid sequence (even sequence of 1) results in a '?'.
> That's good enough. The data are lost anyway.
>
> Idea: We probably don't need skip_one_input_sequence(). Just
> *inbuf++; *inbytesleft--; and retry. Suppress successive
> EILSEQ messages until we get a good one.
>

EILSEQ is also returned when the input is valid but iconv(3) is not able to
convert it to the target charset (like UTF-8 0xCF8F to ISO-8859-1).
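
A minimal sketch of the "skip one byte, emit '?', retry" idea from the quoted
text, again using plain iconv(3). convert_lossy is a hypothetical name. Because
iconv reports EILSEQ for both invalid and merely inconvertible input, this
sketch cannot tell the two cases apart and treats them the same way:

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert src, replacing each offending input byte with
 * '?'. Returns the number of bytes written, or (size_t)-1 on any error other
 * than EILSEQ (for example E2BIG, or EINVAL for a truncated tail). */
size_t convert_lossy(iconv_t cd, const char *src, size_t src_len,
                     char *out, size_t out_len)
{
    char *in = (char *)src;   /* iconv(3) wants char **, input is not modified */
    char *outp = out;
    size_t in_left = src_len;
    size_t out_left = out_len;

    assert(src != NULL && out != NULL);

    while (in_left > 0) {
        if (iconv(cd, &in, &in_left, &outp, &out_left) != (size_t)-1)
            break;            /* the rest converted cleanly */
        if (errno != EILSEQ || out_left == 0)
            return (size_t)-1;
        /* Drop one input byte and retry. With pointer-to-pointer arguments
         * this would have to be (*inbuf)++ and (*inbytesleft)--, since
         * *inbuf++ parses as *(inbuf++). */
        *outp++ = '?';
        out_left--;
        in++;
        in_left--;
    }
    return out_len - out_left;
}
```

Skipping byte by byte means a valid but inconvertible multibyte character is
likely to produce one '?' per byte rather than a single '?', which is the
practical consequence of not being able to distinguish the two EILSEQ cases.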

> > A question about discarding unconverted bytes from the server: does
> > the server check character consistency when an application stores
> > data? I don't know, so it's better to discard unconverted bytes. I.e.,
> > if a client stores 0xC2C2303030... in a utf8 column and the server
> > does not check consistency, we get an invalid sequence and conversion
> > stops at the first byte. Discarding the remaining wire data keeps the
> > dialog (FreeTDS <-> sql server) consistent.
>
> I'm not sure I understand. I think what the server stores
> and verifies is distinct from any protocol issue.
>

A simple example:
Your db has a UTF-8 field and you are storing 0xC2C2303030 (invalid UTF-8).
Does the server refuse the insert, or does it insert these bytes just as they
were sent by the client? IMHO it just stores the bytes...

[***]$ perl -e 'print "\x80";' | iconv -t utf-8 -f windows-1252

(the euro symbol)
[***]$ perl -e 'print "\x81";' | iconv -t utf-8 -f windows-1252
iconv: illegal input sequence at position 0
(windows-1252 is the default encoding on my test server; 0x81 is an invalid
sequence in windows-1252)

[...]$ ./tsql ***
1> create table #tmp (c varchar(10))
2> go
1> insert into #tmp values (convert(varchar(10), 0x81))
2> go
1> select convert(varbinary(10), c) from #tmp
2> go

81

As expected, the server stores even invalid data...

>
> Thank you, and I'm sorry I messed up your code. I don't
> think we even need the malloc/realloc branch. realloc(3)
> becomes malloc(3) if its first argument is NULL.
>

Some (old) implementations do not accept a NULL pointer in realloc, just as
others do not accept NULL in free... Perhaps a test in the configure script plus
TDS_REALLOC and TDS_FREE macros would help (I'll add a TODO for future
versions; it's just an optimization).
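
A minimal sketch of what such wrappers might look like. TDS_REALLOC and
TDS_FREE are only the names suggested above; the helper functions behind them
are hypothetical, and a real version would presumably be selected by a
configure check rather than always taking the defensive path:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: never hands a NULL pointer to realloc(3). */
void *tds_realloc_compat(void *p, size_t n)
{
    assert(n > 0);  /* realloc(p, 0) behaves differently across implementations */
    return p != NULL ? realloc(p, n) : malloc(n);
}

/* Hypothetical helper: never hands a NULL pointer to free(3). */
void tds_free_compat(void *p)
{
    if (p != NULL)
        free(p);
}

#define TDS_REALLOC(p, n) tds_realloc_compat((p), (n))
#define TDS_FREE(p)       tds_free_compat((p))
```

Using functions behind the macros avoids evaluating the pointer argument more
than once.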

freddy77

