[Python-Dev] The future of the wchar_t cache
Serhiy Storchaka schrieb am 20.10.2018 um 13:06:
> Currently the PyUnicode object contains two caches: for UTF-8
> representation and for wchar_t representation. They are needed not for
> optimization but for supporting C API which returns borrowed references for
> such representations.
> The UTF-8 cache always was in unicode objects (but in Python 2 it was not a
> UTF-8 cache, but a 8-bit representation cache). Initially it was needed for
> compatibility with 8-bit str, for implementing the "s" and "z" format units
> in PyArg_Parse(). Now it is used also for PyUnicode_AsUTF8() and
> The wchar_t cache was added with PEP 393 in 3.3 as a replacement for the
> former Py_UNICODE representation. Now Py_UNICODE is defined as an alias of
> wchar_t, and the C API which returned a pointer to Py_UNICODE content
> returns now a pointer to the cached wchar_t representation. It is the "u"
> and "Z" format units in PyArg_Parse(), PyUnicode_AsUnicode(),
> PyUnicode_AsUnicodeAndSize(), PyUnicode_GET_SIZE(),
> PyUnicode_GET_DATA_SIZE(), PyUnicode_AS_UNICODE(), PyUnicode_AS_DATA().
> All this increase the size of the unicode object. It includes the constant
> overhead of additional pointer and size fields, and the overhead of the
> cached representation proportional to the string length. The following
> table contains number of bytes per character for different kinds, with and
> without filling specified caches.
> ?????? raw? +utf8???? +wchar_t?????? +utf8+wchar_t
> ?????????????????? Windows? Linux?? Windows? Linux
> ASCII?? 1???? 1?????? 3?????? 5??????? 3?????? 5
> UCS1??? 1??? 2-3????? 3?????? 5?????? 4-5???? 6-7
> UCS2??? 2??? 3-5????? 2?????? 6?????? 3-5???? 7-9
> UCS4??? 4??? 5-8???? 6-8????? 4?????? 7-12??? 5-8
> There is also a new C API added in 3.3 for getting wchar_t representation
> without using the cache: PyUnicode_AsWideChar() and
> PyUnicode_AsWideCharString(). Currently it uses the cache, this has
> benefits and disadvantages.
> Old Py_UNICODE based API is deprecated, and will be removed eventually.
> I want to ask about the future of the wchar_t cache. Is the benefit of
> caching the wchar_t representation larger the disadvantage of spending more
> memory? The wchar_t representation is so natural for Windows API as the
> UTF8 representation for POSIX API. But in all other cases it is just waste
> of memory. Are there reasons of keeping the wchar_t cache after removing
> the deprecated API?
I'd be happy to get rid of it. But regarding the use under Windows, I
wonder if there's interest in keeping it as a special Windows-only feature,
e.g. to speed up the data exchange with the Win32 APIs. I guess it would
have to provide a visible (performance?) advantage to justify such special
casing over the code removal.