Re: Assign/update : NA bitmap vs sentinel
On Mon, Nov 5, 2018 at 3:43 PM Matt Dowle <mattjdowle@xxxxxxxxx> wrote:
> 1. I see. Good idea. Can we assume bitmap is always present in Arrow then?
> I thought I'd seen Wes argue that if there were no NAs, the bitmap doesn't
> need to be allocated. Indeed I wasn't worried about the extra storage,
> although for 10,000 columns I wonder about the number of vectors.
I think different implementations currently handle this differently. In
the Java code, we always allocate the validity buffer at initial
allocation. We're also looking to enhance the allocation strategy so that
the fixed part of the values is always allocated together with the validity
buffer (a single allocation), which avoids any extra object housekeeping.
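
To make the single-allocation idea concrete, here is a minimal Python sketch.
It is not the Arrow Java allocator. The function names, the int64 value type,
and the 8-byte padding between the two regions are all assumptions made for
illustration. It only shows one backing buffer exposing a validity view and a
values view, so there is a single object to track per column.

```python
def allocate_int64_column(length, alignment=8):
    """Hypothetical helper: one backing buffer, two zero-copy views.

    Assumed layout: [validity bits, padded to `alignment`][int64 values].
    """
    validity_bytes = (length + 7) // 8
    # Pad the validity region so the values region starts aligned.
    values_offset = (validity_bytes + alignment - 1) // alignment * alignment
    backing = bytearray(values_offset + length * 8)  # the only allocation
    view = memoryview(backing)
    validity = view[:validity_bytes]          # one bit per value
    values = view[values_offset:].cast("q")   # native signed 64-bit ints
    return backing, validity, values


def set_value(validity, values, i, v):
    """Store v at slot i and mark it valid (least-significant bit first)."""
    values[i] = v
    validity[i >> 3] |= 1 << (i & 7)


def is_valid(validity, i):
    """Return True if slot i holds a value, False if it is null."""
    return bool(validity[i >> 3] & (1 << (i & 7)))


backing, validity, values = allocate_int64_column(10)
set_value(validity, values, 3, 42)
print(is_valid(validity, 3), is_valid(validity, 4), values[3])
```

Both views alias the same `bytearray`, so a freshly allocated column starts
as all-null (every validity bit is zero) without any second allocation.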
> 2. It's only subjective until the code complexity is measured, then it's
> not subjective. I suppose after 20 years of using sentinels, I'm used to it
> and trust it. I'll keep an open mind on this.
Yup, fair enough.
> 3. Since I criticized the scale of Wes' benchmark, I felt I should show how
> I do benchmarks myself to show where I'm coming from. Yes none-null,
> some-null and all-null paths offer savings. But that's the same under both
> sentinel and bitmap approaches. Under both approaches, you just need to
> know which case you're in. That involves storing the number of NAs in the
> header/summary which can be done under both approaches.
What we appreciate is that a single comparison per 64 values (one 64-bit
word of the bitmap) tells you which of the three cases you are in, so the
choice becomes a local decision. This means you don't have to do any
housekeeping ahead of time. It also means that the window each decision
covers is narrow, which minimizes the penalty when invalid values (or
valid values) are rare.
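
A rough Python sketch of that per-word decision follows. It is an
illustration, not code from any Arrow implementation: `pack_validity` and
`sum_valid` are hypothetical names, and the bitmap is modelled as a list of
64-bit integers with the least-significant bit first. Each word is compared
against all-ones and against zero to pick the path for its 64 values.

```python
ALL_VALID = (1 << 64) - 1  # every bit set: no nulls in this block of 64


def pack_validity(flags):
    """Hypothetical helper: pack booleans into 64-bit words, LSB first."""
    words = [0] * ((len(flags) + 63) // 64)
    for i, ok in enumerate(flags):
        if ok:
            words[i >> 6] |= 1 << (i & 63)
    return words


def sum_valid(values, validity_words):
    """Sum the non-null values, choosing a path per 64-value block."""
    total = 0
    for w, word in enumerate(validity_words):
        base = w * 64
        block = values[base:base + 64]
        if word == ALL_VALID:
            # No nulls here: tight loop with no per-value check.
            total += sum(block)
        elif word == 0:
            # All null: skip the whole block.
            continue
        else:
            # Mixed block: test each bit. A partial trailing word also
            # lands here because its padding bits are zero.
            for j, v in enumerate(block):
                if (word >> j) & 1:
                    total += v
    return total


flags = [True] * 64 + [False] * 64 + [True, False]
values = list(range(130))
print(sum_valid(values, pack_validity(flags)))
```

No null count is consulted up front. A rare null only pushes its own block
of 64 onto the slow path, and every other block keeps the fast one.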