Assign/update : NA bitmap vs sentinel
(First post to this mailing list.)
I tweeted here and Wes invited me to follow up on this list :
Wes - it was great to meet you at Stanford in September. There I mentioned
the assign/update aspect which is a downside of bitmap for NA, imho. I did
see your recent article but I didn't see the update aspect mentioned there.
It's rarely about speed for me and I'm uncomfortable when the longest
runtime presented is just 0.04s.
The concern surrounds updating (UPDATE in SQL, sub-assign in Python/R). If
there are no NAs in the column, the first assignment of an NA has to
allocate the bitmap vector and link that properly to the column. This
allocation could fail (albeit rarely) due to out-of-memory so code has to
exist to deal with that properly, from multi-threaded code too. I mentioned
cache lines in the tweet not to focus on speed, but to focus on the
complexity of updating the several places correctly from multi-threaded
code: updating the NA count, the bitmap value and the column itself is 3
assigns all to be correctly wrapped in one critical section?
In contrast, the sentinel approach (INT_MIN) is one assign to one place.
That's simpler. There's less scope for a corruption since it isn't possible
for a junk value to be wrongly included due to the bitmap values not being
set properly, somehow. Simpler means less code. When we're talking about
multi-threaded update to memory-mapped shared data, small simplifications
like this (sentinel) can help a lot for robustness, safety and correctness
while keeping the internal code required to achieve that to a minimum.
Here is an example benchmark which I consider more relevant to make design
decisions based on: https://h2oai.github.io/db-benchmark/. More relevant
because the scale is in minutes and not-working. It's this not-working
aspect that has often been in my mind when making data.table more memory
efficient, for example. Not really raw sub-second speed per se. Bitmap for
NA has a higher chance of resulting in instability (crashes or incorrect
results) than sentinel simply due to there being more parts (three things
to keep in sync when updating, including a possible allocate): more to go
wrong at the low level.
No? I'm interested to hear further and I would like to use Arrow. I really
would. But this is a major sticking point.