Re: valid NaNs versus invalid NaNs?


Hi Rhys,

On Mon, Dec 10, 2018 at 9:53 AM Rhys Ulerich <Rhys.Ulerich@xxxxxxxxxxxx> wrote:
>
> 'Morning,
>
>
>
> Regarding https://arrow.apache.org/docs/memory_layout.html, how should is_valid be interpreted for primitive types that have their own notions of is_valid?
>
>
>
> Concretely, how should folks interpret a "valid NaN" (is_valid 1 with float NaN) versus an "invalid NaN" (is_valid 0 with float NaN)?  In RFC-ese, MUST individual NaNs be valid?  Or, MUST floats all be valid by omitting the validity bitset?
>

In floating point types, NaN is a valid value. I think you're talking
about systems that use sentinel values to represent nulls. The Arrow
columnar format does not have any notion of sentinel values. So if you
want other Arrow systems to recognize your values as being null, then
you must construct the validity bitmap accordingly.
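
To make the "construct the validity bitmap accordingly" point concrete, here is a
minimal pure-Python sketch of how a validity bitmap can be packed (one bit per
slot, least-significant bit first, as in the Arrow memory layout). The helper
names build_validity_bitmap and is_valid are hypothetical, made up for this
illustration; they are not Arrow APIs. Note that NaN gets a 1 bit like any other
float, and only None is marked null:

```python
def build_validity_bitmap(values):
    """Pack one validity bit per slot, least-significant bit first.

    Hypothetical helper for illustration only, not part of any Arrow library.
    """
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:  # NaN is not None, so it is marked valid
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def is_valid(bitmap, i):
    """Return True if slot i has its validity bit set."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


values = [1.0, float("nan"), None]
bitmap = build_validity_bitmap(values)

# Count the slots whose validity bit is 0.
null_count = sum(not is_valid(bitmap, i) for i in range(len(values)))
```

Here slot 1 (the NaN) is valid and slot 2 (the None) is null, so the null count
is 1 rather than 2.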

>
>
> I ask because otherwise I can see a bunch of different systems interpreting this detail in many different ways.  That'd be an interop nightmare.  Especially since understanding why NaNs sneak into large datasets is already quite a hassle.
>

It is up to applications to determine what NaN means. It would not be
appropriate for Arrow to assume anything, particularly since most
database systems (AFAIK) distinguish NaN and NULL.

For example, in Python interop, we recognize NaN as null when
converting to Arrow, but _only_ if the caller indicates that the data
originated from pandas (from_pandas=True):

https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/type_traits.h#L102

In [1]: import pyarrow as pa

In [2]: import numpy as np

In [3]: arr = np.array([1, np.nan])

In [4]: arr1 = pa.array(arr)

In [5]: arr2 = pa.array(arr, from_pandas=True)

In [6]: arr1
Out[6]:
<pyarrow.lib.DoubleArray object at 0x7ffa3c8a1188>
[
  1,
  nan
]

In [7]: arr2
Out[7]:
<pyarrow.lib.DoubleArray object at 0x7ffa1ef42bd8>
[
  1,
  null
]

In [8]: arr1.null_count
Out[8]: 0

In [9]: arr2.null_count
Out[9]: 1

In R, NaN and NA are distinct:

https://github.com/apache/arrow/commit/3ab4a0f481211c5d115845519eb9398dc02e2e24#diff-4b43b0aee35624cd95b910189b3dc231

>
>
> Anyhow, it seems worth addressing this gap at the written specification level.
>

What would you suggest? We could add a statement to be explicit that
no special / sentinel values (which includes NaN) are recognized as
null.

- Wes

>
>
> (Apologies if this has been discussed previously-- I've found no searchable mailing list archives under http://mail-archives.apache.org/mod_mbox/arrow-dev/ or https://cwiki.apache.org/confluence/display/ARROW.)
>
>
>
> Thanks,
>
> Rhys