Re: implementation of UTF-8 conversion for text I/O: iconv vs hand-made (msg#00172, lang.haskell.libraries)
FWIW, there's a fairly complete pure-Haskell UTF-8 converter implementation in HXML toolbox, which I "stole" and adapted for a version of HaXml; e.g.:
http://www.ninebynine.org/Software/HaskellUtils/HaXml-1.12/src/Text/XML/HaXml/Unicode.hs

(Please ignore me if I miss your point.)

#g
--

Bulat Ziganshin wrote:
> Hello all
>
> this letter describes why i think that using a hand-made (de)coder to
> support UTF-8 encoded files is better than using iconv. for readers
> who don't know it, iconv is a widespread C library that performs
> buffer-to-buffer conversion between text encodings (utf-8, utf-16,
> latin-1, ucs-2, ucs-4 and more). the hand-made (en)coder implemented
> by me is just a "converter", i.e. a higher-order function, between the
> getByte/putByte and getChar/putChar operations. so it can be used in
> any monad and for any purpose, not only for text I/O
>
> one can find an example of a library that uses iconv in the "System\IO\Text.hs"
> module from http://haskell.org/~simonmar/new-io.tar.gz and an example of
> a hand-made encoder in module "Data\CharEncoding.hs",
> with its usage in "System\Stream\Transformer\CharEncoding.hs",
> from http://freearc.narod.ru/Streams.tar.gz
>
> i crossposted this letter to Marcin and Simon because you have
> discussed this question with me, and to Einar because he once asked
> me about one specific feature in this area.
>
>
> why iconv is better:
>
> 1) it's lightning fast, with virtually zero speed overhead
> 2) it's robust
> 3) it contains already implemented and debugged algorithms for all
>    possible encodings we can encounter
> 4) it has highly developed error processing facilities
>    (i mean signalling errors in input data and/or masking them)
>
> why hand-made conversion is better:
>
> 1) i don't know whether iconv will be available on every Hugs and GHC
>    installation
>
> 2) Einar once asked me about changing the encoding on the
>    fly, which is needed for some HTML processing.
>    it is also possible that
>    some program will need to intersperse text I/O with
>    buffer/array/byte/bits I/O. these are things that are absolutely
>    impossible with iconv
>
> 3) my library supports Streams that work in ANY monad (not only IO, ST
>    and their derivatives). it's impossible to implement iconv conversion
>    for such stream types
>
> as you can see, while the last arguments concern very specific
> situations, these situations absolutely can't be handled by iconv, so
> we need to implement hand-made conversions anyway. on the other side,
> iconv's strong points are not of principal importance - the speed of
> hand-made routines will be enough, about several mb/s; all possible
> encodings can be implemented and debugged sooner or later; only
> processing of errors in input data is a weak point of the current design
> itself
>
> moreover, there are implementation issues that make me more enthusiastic
> about the hand-made solution. it is already implemented and really works.
> the implementation of CharEncoding for streams is in module
> "System\Stream\Transformer\CharEncoding.hs", which is very trivial.
> the implementation of the different encoders in "Data\CharEncoding.hs"
> is slightly more complex, but these routines are also used in
> "instance Binary String", i.e. to serialize strings. also, i think
> that the "Data\CharEncoding.hs" module should be a part of the standard
> Haskell library, so the implementation of the CharEncoding stream transformer
> is almost "free"
>
> on the other side, the implementation of text encoding in the "new I/O"
> library is about 1000 lines long. while i don't need to copy them all,
> using iconv will anyway be much more complex than using hand-made routines.
> this includes the complexity of interacting with iconv itself and the
> complexity of implementing various I/O operations over a buffer that contains
> 4-byte characters. i have already implemented 3 buffering transformers and
> adding one more buffering scheme is the last thing i want to do.
> vice versa - now i'm searching for ways to avoid repetition of code by joining
> them all into one. it's very boring to have 3 or 4 similar things
> and replicate every change to them all
>
> at the same time, the library design is open and it's entirely
> possible to have two alternative char encoding transformers. anyone
> can develop additional transformers even without interaction with me -
> in this case, one should just implement the vGetChar/bPutChar operations
> via the vGetBuf/vPutBuf ones. i just propose to leave things as
> they are, and to implement an iconv-based transformer only when we
> are actually bothered by its restrictions
>

-- 
Graham Klyne
For email: http://www.ninebynine.org/#Contact
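
A minimal sketch of the "converter between getByte/putByte and getChar/putChar" idea from the quoted message, assuming nothing about the actual Streams or HaXml code. The names getCharUtf8, putCharUtf8, encodeUtf8List and decodeUtf8List are hypothetical, chosen only for illustration. Error handling is deliberately crude: malformed input yields U+FFFD, and overlong forms and surrogates are not rejected.

```haskell
import Control.Monad.ST (runST)
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.STRef (newSTRef, readSTRef, writeSTRef)
import Data.Word (Word8)

-- Hypothetical encoder: turns a byte-writing action into a char-writing
-- action. It only needs Monad, so it is not tied to IO.
putCharUtf8 :: Monad m => (Word8 -> m ()) -> Char -> m ()
putCharUtf8 putByte c
  | n < 0x80    = put n
  | n < 0x800   = put (0xC0 .|. (n `shiftR` 6)) >> cont 0
  | n < 0x10000 = put (0xE0 .|. (n `shiftR` 12)) >> cont 6 >> cont 0
  | otherwise   = put (0xF0 .|. (n `shiftR` 18)) >> cont 12 >> cont 6 >> cont 0
  where
    n      = ord c
    put    = putByte . fromIntegral
    -- Emit one continuation byte carrying the six bits starting at bit s.
    cont s = put (0x80 .|. ((n `shiftR` s) .&. 0x3F))

-- Hypothetical decoder: turns a byte-reading action (Nothing = end of
-- input) into a char-reading action. Simplification: a byte that fails
-- the continuation check is consumed and dropped.
getCharUtf8 :: Monad m => m (Maybe Word8) -> m (Maybe Char)
getCharUtf8 getByte = do
  mb <- getByte
  case mb of
    Nothing -> return Nothing
    Just b
      | b < 0x80           -> return (Just (chr (fromIntegral b)))
      | b .&. 0xE0 == 0xC0 -> more 1 (fromIntegral b .&. 0x1F)
      | b .&. 0xF0 == 0xE0 -> more 2 (fromIntegral b .&. 0x0F)
      | b .&. 0xF8 == 0xF0 -> more 3 (fromIntegral b .&. 0x07)
      | otherwise          -> return (Just '\xFFFD')
  where
    -- Read k continuation bytes, accumulating the code point in acc.
    more 0 acc = return (Just (if acc > 0x10FFFF then '\xFFFD' else chr acc))
    more k acc = do
      mb <- getByte
      case mb of
        Just b | b .&. 0xC0 == 0x80 ->
          more (k - 1) ((acc `shiftL` 6) .|. fromIntegral (b .&. 0x3F))
        _ -> return (Just '\xFFFD')

-- Run the encoder in the pair ("writer") monad from base: no IO involved.
encodeUtf8List :: String -> [Word8]
encodeUtf8List s = fst (mapM_ (putCharUtf8 (\b -> ([b], ()))) s)

-- Run the decoder in ST, reading bytes from a list held in an STRef.
decodeUtf8List :: [Word8] -> String
decodeUtf8List bytes = runST $ do
  ref <- newSTRef bytes
  let getByte = do
        bs <- readSTRef ref
        case bs of
          []         -> return Nothing
          (b : rest) -> writeSTRef ref rest >> return (Just b)
      loop acc = do
        mc <- getCharUtf8 getByte
        case mc of
          Nothing -> return (reverse acc)
          Just c  -> loop (c : acc)
  loop []
```

With these definitions, `encodeUtf8List "\xE9"` evaluates to `[195,169]`, and `decodeUtf8List . encodeUtf8List` is the identity on strings without surrogate code points. The same two converters could be handed an IO-based getByte/putByte instead, which is the flexibility the quoted message argues iconv's buffer-to-buffer interface does not offer.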