osdir.com


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Python-Dev] Encoding variable-length integers/counts in pickle


Google Protocol Buffers use something similar, which they call "
<https://developers.google.com/protocol-buffers/docs/encoding#top_of_page>Base
128 Varints"

    https://developers.google.com/protocol-buffers/docs/encoding#varints

I prefer the way this handles negative numbers.

- Andrew

On 10 July 2018 at 02:53, MRAB <python at mrabarnett.plus.com> wrote:

> In the regex module I use an encoding scheme when pickling pattern objects
> which is based on the way MIDI encodes variable-length integers, and I
> think it might have a use in a pickle protocol.
>
> In the basic format, an integer is split up into 7-bit chunks, each chunk
> put into a byte, and the most-significant bit of the byte used to signal
> whether the value continues into the following byte.
>
> And integer must be encoded into the minimum number of bytes, so an
> encoded sequence of bytes would never start with 0x80.
>
> MIDI deviates from the basic idea in order to reduce the number of bytes,
> so as sequence of bytes in MIDI _can_ start with x080; this is fine for
> MIDI because it doesn't need to represent negative integers.
>
> The format I'm suggesting uses an initial 0x80 as a way of letting it
> encode negative integers.
>
> Here are a couple of Python functions that summarise the encoding and
> decoding (minus obvious optimisations for simplicity):
>
>
> def encode_varint(value: int) -> List[int]:
>     negative = value < 0
>     encoded = []
>
>     if negative:
>         final = -1
>     else:
>         final = 0
>
>     while True:
>         encoded.append(0x80 | (value & 0x7F))
>         value >>= 7
>
>         if value == final:
>             break
>
>     if negative:
>         encoded.append(0x80)
>
>     encoded.reverse()
>
>     encoded[-1] &= 0x7F
>
>     return encoded
>
>
> def decode_varint(encoded: Iterable[int]) -> int:
>     byte = next(encoded)
>
>     if byte == 0x80:
>         value = -1
>         byte = next(encoded)
>     else:
>         value = 0
>
>     value = (value << 7) | (byte & 0x7F)
>
>     while (byte & 0x80) != 0:
>         byte = next(encoded)
>         value = (value << 7) | (byte & 0x7F)
>
>     return value
>
>
> The advantage of encoding integers in this way is that there's no limit to
> their size, so there's no need to add a new protocol to support larger
> counts.
>
> They can also make pickles smaller.
>
> Example:
>
>     # Pickle (None, )
>     0: \x80 PROTO      4
>     2: \x95 FRAME      4
>    11: N    NONE
>    12: \x85 TUPLE1
>    13: \x94 MEMOIZE    (as 0)
>    14: .    STOP
>
> Here, FRAME takes an argument of 8 bytes. If you replaced FRAME with a
> version that accepted a variable-length count, you could reduce that
> argument to 1 byte.
>
> You also wouldn't need to have different fixed-length versions of an
> opcode, e.g. BINUNICODE and SHORT_BINUNICODE.
>
> Whether you do anything with this is entirely up to the core devs, I just
> thought someone might find it useful.
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: https://mail.python.org/mailman/options/python-dev/lists%
> 40andros.org.uk
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20180710/117c6eb9/attachment.html>