[jira] [Created] (ARROW-3210) Creating ParquetDataset with PyArrow creates partitioned ParquetFiles with mismatched Parquet schemas


Ying Wang created ARROW-3210:
--------------------------------

             Summary: Creating ParquetDataset with PyArrow creates partitioned ParquetFiles with mismatched Parquet schemas
                 Key: ARROW-3210
                 URL: https://issues.apache.org/jira/browse/ARROW-3210
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.9.0
         Environment: Ubuntu 16.04 LTS, System76 Oryx Pro
            Reporter: Ying Wang
         Attachments: environment.yml, repro.csv, repro.py, repro_2.py

STEPS TO REPRODUCE:

1. Create a conda environment reflecting [^environment.yml]

2. Execute script [^repro.py], replacing various config variables to create a ParquetDataset on S3 given [^repro.csv]

3. Create a reference to the ParquetDataset using script [^repro_2.py], again replacing the config variables.

 

EXPECTED:

Reference is created correctly.

GOT:

Mismatched Arrow schemas in validate_schemas() method:

 

```python

*** ValueError: Schema in partition[Draught=1, Name=1, VesselType=0, x=1, Heading=1] s3://kio-tests-files/_tmp/test_parquet_dataset/Draught=10.3/Name=MSC RAFAELA/VesselType=Cargo/x=130.43158/Heading=270.0/e9e3cea5a5c24c4da587c263ec817c98.parquet was different. 
Record_ID: int64
y: double
TRACKID: string
MMSI: int64
IMO: int64
AgeMinutes: double
SoG: double
Width: int64
Length: int64
Callsign: string
Destination: string
ETA: int64
Status: string
ExtraInfo: string
TIMESTAMP: int64
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
 b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
 b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
 b' [{"name": "Record_ID", "field_name": "Record_ID", "pandas_type"'
 b': "int64", "numpy_type": "int64", "metadata": null}, {"name": "y'
 b'", "field_name": "y", "pandas_type": "float64", "numpy_type": "f'
 b'loat64", "metadata": null}, {"name": "TRACKID", "field_name": "T'
 b'RACKID", "pandas_type": "unicode", "numpy_type": "object", "meta'
 b'data": null}, {"name": "MMSI", "field_name": "MMSI", "pandas_typ'
 b'e": "int64", "numpy_type": "int64", "metadata": null}, {"name": '
 b'"IMO", "field_name": "IMO", "pandas_type": "int64", "numpy_type"'
 b': "int64", "metadata": null}, {"name": "AgeMinutes", "field_name'
 b'": "AgeMinutes", "pandas_type": "float64", "numpy_type": "float6'
 b'4", "metadata": null}, {"name": "SoG", "field_name": "SoG", "pan'
 b'das_type": "float64", "numpy_type": "float64", "metadata": null}'
 b', {"name": "Width", "field_name": "Width", "pandas_type": "int64'
 b'", "numpy_type": "int64", "metadata": null}, {"name": "Length", '
 b'"field_name": "Length", "pandas_type": "int64", "numpy_type": "i'
 b'nt64", "metadata": null}, {"name": "Callsign", "field_name": "Ca'
 b'llsign", "pandas_type": "unicode", "numpy_type": "object", "meta'
 b'data": null}, {"name": "Destination", "field_name": "Destination'
 b'", "pandas_type": "unicode", "numpy_type": "object", "metadata":'
 b' null}, {"name": "ETA", "field_name": "ETA", "pandas_type": "int'
 b'64", "numpy_type": "int64", "metadata": null}, {"name": "Status"'
 b', "field_name": "Status", "pandas_type": "unicode", "numpy_type"'
 b': "object", "metadata": null}, {"name": "ExtraInfo", "field_name'
 b'": "ExtraInfo", "pandas_type": "unicode", "numpy_type": "object"'
 b', "metadata": null}, {"name": "TIMESTAMP", "field_name": "TIMEST'
 b'AMP", "pandas_type": "int64", "numpy_type": "int64", "metadata":'
 b' null}, {"name": null, "field_name": "__index_level_0__", "panda'
 b's_type": "int64", "numpy_type": "int64", "metadata": null}], "pa'
 b'ndas_version": "0.21.0"}'}

vs

Record_ID: int64
y: double
TRACKID: string
MMSI: int64
IMO: int64
AgeMinutes: double
SoG: double
Width: int64
Length: int64
Callsign: string
Destination: string
ETA: int64
Status: string
ExtraInfo: null
TIMESTAMP: int64
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
 b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
 b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
 b' [{"name": "Record_ID", "field_name": "Record_ID", "pandas_type"'
 b': "int64", "numpy_type": "int64", "metadata": null}, {"name": "y'
 b'", "field_name": "y", "pandas_type": "float64", "numpy_type": "f'
 b'loat64", "metadata": null}, {"name": "TRACKID", "field_name": "T'
 b'RACKID", "pandas_type": "unicode", "numpy_type": "object", "meta'
 b'data": null}, {"name": "MMSI", "field_name": "MMSI", "pandas_typ'
 b'e": "int64", "numpy_type": "int64", "metadata": null}, {"name": '
 b'"IMO", "field_name": "IMO", "pandas_type": "int64", "numpy_type"'
 b': "int64", "metadata": null}, {"name": "AgeMinutes", "field_name'
 b'": "AgeMinutes", "pandas_type": "float64", "numpy_type": "float6'
 b'4", "metadata": null}, {"name": "SoG", "field_name": "SoG", "pan'
 b'das_type": "float64", "numpy_type": "float64", "metadata": null}'
 b', {"name": "Width", "field_name": "Width", "pandas_type": "int64'
 b'", "numpy_type": "int64", "metadata": null}, {"name": "Length", '
 b'"field_name": "Length", "pandas_type": "int64", "numpy_type": "i'
 b'nt64", "metadata": null}, {"name": "Callsign", "field_name": "Ca'
 b'llsign", "pandas_type": "unicode", "numpy_type": "object", "meta'
 b'data": null}, {"name": "Destination", "field_name": "Destination'
 b'", "pandas_type": "unicode", "numpy_type": "object", "metadata":'
 b' null}, {"name": "ETA", "field_name": "ETA", "pandas_type": "int'
 b'64", "numpy_type": "int64", "metadata": null}, {"name": "Status"'
 b', "field_name": "Status", "pandas_type": "unicode", "numpy_type"'
 b': "object", "metadata": null}, {"name": "ExtraInfo", "field_name'
 b'": "ExtraInfo", "pandas_type": "empty", "numpy_type": "object", '
 b'"metadata": null}, {"name": "TIMESTAMP", "field_name": "TIMESTAM'
 b'P", "pandas_type": "int64", "numpy_type": "int64", "metadata": n'
 b'ull}, {"name": null, "field_name": "__index_level_0__", "pandas_'
 b'type": "int64", "numpy_type": "int64", "metadata": null}], "pand'
 b'as_version": "0.21.0"}'}

```
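
The two schemas in the message above are long, and the single difference is easy to miss. As a reading aid, the field lists can be compared mechanically. This is a stdlib-only sketch: `parse_fields` and `diff_fields` are hypothetical helper names, not part of PyArrow, and the field lists are copied from the error message.

```python
def parse_fields(schema_text):
    """Turn 'name: type' lines (as printed in the error) into a dict."""
    fields = {}
    for line in schema_text.strip().splitlines():
        name, _, type_name = line.partition(": ")
        fields[name] = type_name
    return fields


def diff_fields(left, right):
    """Return {name: (left_type, right_type)} for fields whose types differ."""
    names = list(left) + [n for n in right if n not in left]
    return {n: (left.get(n), right.get(n))
            for n in names if left.get(n) != right.get(n)}


# First schema in the error message (the partition's file).
piece_text = """
Record_ID: int64
y: double
TRACKID: string
MMSI: int64
IMO: int64
AgeMinutes: double
SoG: double
Width: int64
Length: int64
Callsign: string
Destination: string
ETA: int64
Status: string
ExtraInfo: string
TIMESTAMP: int64
__index_level_0__: int64
"""

# Second schema in the error message: identical except for ExtraInfo.
dataset_text = piece_text.replace("ExtraInfo: string", "ExtraInfo: null")

mismatches = diff_fields(parse_fields(piece_text), parse_fields(dataset_text))
print(mismatches)  # {'ExtraInfo': ('string', 'null')}
```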

The mismatch is in column *ExtraInfo*: the partitioned ParquetDatasetPiece referencing the 2nd Parquet file created has *pandas_type* *unicode* (Arrow type string) for that column, while the ParquetDataset schema, which references the 1st Parquet file created, has *pandas_type* *empty* (Arrow type null) for the same column.


