Re: valid NaNs versus invalid NaNs?
On Mon, Dec 10, 2018 at 9:53 AM Rhys Ulerich <Rhys.Ulerich@xxxxxxxxxxxx> wrote:
> Regarding https://arrow.apache.org/docs/memory_layout.html, how should is_valid be interpreted for primitive types that have their own notions of is_valid?
> Concretely, how should folks interpret a "valid NaN" (is_valid 1 with float NaN) versus an "invalid NaN" (is_valid 0 with float NaN)? In RFC-ese, MUST individual NaNs be valid? Or, MUST floats all be valid by omitting the validity bitset?
In floating point types, NaN is a valid value. I think you're talking
about systems that use sentinel values to represent nulls. The Arrow
columnar format does not have any notion of sentinel values. So if you
want other Arrow systems to recognize your values as being null, then
you must construct the validity bitmap accordingly.
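To make the distinction concrete, here is a minimal pure-Python sketch of how nullness lives in a separate bitmap rather than in the value itself. The helper names (build_validity_bitmap, is_valid) are hypothetical and not part of any Arrow library; the sketch only assumes the least-significant-bit numbering described in the memory layout document, and it ignores buffer padding and alignment.

```python
import math


def build_validity_bitmap(valid_flags):
    """Pack booleans into bytes, least-significant bit first (1 = valid)."""
    bitmap = bytearray((len(valid_flags) + 7) // 8)
    for i, ok in enumerate(valid_flags):
        if ok:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def is_valid(bitmap, i):
    """Return True if slot i is marked valid in the bitmap."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))


# Slot 1 holds a NaN that is a real value; slot 2 is null, and whatever
# sits in its value slot (0.0 here) is an arbitrary placeholder.
values = [1.0, float("nan"), 0.0]
bitmap = build_validity_bitmap([True, True, False])

for i, v in enumerate(values):
    state = "valid" if is_valid(bitmap, i) else "null"
    print(i, v, state)
```

The NaN in slot 1 is a valid value because its bit is set; only slot 2 is null, and nothing about the stored float decides that.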
> I ask because otherwise I can see a bunch of different systems interpreting this detail in many different ways. That'd be an interop nightmare. Especially since understanding why NaNs sneak into large datasets is already quite a hassle.
It is up to applications to determine what NaN means. It would not be
appropriate for Arrow to assume anything, particularly since most
database systems (AFAIK) distinguish NaN and NULL.
For example, in Python interop, we recognize NaN as null when
converting to Arrow, but _only_ if the data originated from pandas:
In : import pyarrow as pa
In : import numpy as np
In : arr = np.array([1, np.nan])
In : arr1 = pa.array(arr)
In : arr2 = pa.array(arr, from_pandas=True)
In : arr1
<pyarrow.lib.DoubleArray object at 0x7ffa3c8a1188>
In : arr2
<pyarrow.lib.DoubleArray object at 0x7ffa1ef42bd8>
In : arr1.null_count
0
In : arr2.null_count
1
In R, NaN and NA are distinct.
> Anyhow, it seems worth addressing this gap at the written specification level.
What would you suggest? We could add a statement to be explicit that
no special / sentinel values (which includes NaN) are recognized as null.
> (Apologies if this has been discussed previously-- I've found no searchable mailing list archives under http://mail-archives.apache.org/mod_mbox/arrow-dev/ or https://cwiki.apache.org/confluence/display/ARROW.)