
[jira] [Created] (ARROW-3138) 'Couldn't deserialize thrift' error when reading large binary column


Jeremy Heffner created ARROW-3138:
-------------------------------------

             Summary: 'Couldn't deserialize thrift' error when reading large binary column
                 Key: ARROW-3138
                 URL: https://issues.apache.org/jira/browse/ARROW-3138
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.10.0
         Environment: Ubuntu 16.04; Python 3.6; Pandas 0.23.4; Numpy 1.14.3 
            Reporter: Jeremy Heffner
         Attachments: parquet-issue-example.py

We've run into issues reading Parquet files that contain long binary columns (UTF-8 strings). In particular, we hit the error while generating WKT representations of polygons that contained ~34 million characters.

The attached example generates a dataframe with one record and one column containing a random string with 10^7 characters.
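
For readers without access to the attachment, the following is a minimal sketch of what such a reproduction might look like. It is not the attached parquet-issue-example.py; the helper names (make_random_string, roundtrip) and the column name "wkt" are made up for illustration, and the path 'test.parquet' is taken from the traceback in this report.

```python
import random
import string


def make_random_string(n_chars, seed=0):
    """Return a random ASCII-letter string of exactly n_chars characters."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same string
    return "".join(rng.choices(string.ascii_letters, k=n_chars))


def roundtrip(n_chars=10**7, path="test.parquet"):
    """Write a one-row, one-column frame to Parquet, then read it back.

    Assumes pandas is installed with pyarrow available as the Parquet engine.
    The column name "wkt" is hypothetical.
    """
    import pandas as pd

    df = pd.DataFrame({"wkt": [make_random_string(n_chars)]})
    df.to_parquet(path)           # the write step is reported to succeed
    return pd.read_parquet(path)  # the read step is where the error is reported
```

Calling roundtrip() with the default 10^7 characters should exercise the same write-then-read path as described in this report. Whether it fails will depend on the installed pyarrow version.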

Pandas (using the default pyarrow engine) successfully writes the file, but fails upon reading the file:
{code:java}
---------------------------------------------------------------------------
ArrowIOError Traceback (most recent call last)
<ipython-input-25-25d21204cbad> in <module>()
----> 1 df_read_in = pd.read_parquet('test.parquet')

~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
286 
287 impl = get_engine(engine)
--> 288 return impl.read(path, columns=columns, **kwargs)

~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
129 kwargs['use_pandas_metadata'] = True
130 result = self.api.parquet.read_table(path, columns=columns,
--> 131 **kwargs).to_pandas()
132 if should_close:
133 try:

~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read_table(source, columns, nthreads, metadata, use_pandas_metadata)
1044 fs = _get_fs_from_path(source)
1045 return fs.read_parquet(source, columns=columns, metadata=metadata,
-> 1046 use_pandas_metadata=use_pandas_metadata)
1047 
1048 pf = ParquetFile(source, metadata=metadata)

~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/filesystem.py in read_parquet(self, path, columns, metadata, schema, nthreads, use_pandas_metadata)
175 filesystem=self)
176 return dataset.read(columns=columns, nthreads=nthreads,
--> 177 use_pandas_metadata=use_pandas_metadata)
178 
179 def open(self, path, mode='rb'):

~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, use_pandas_metadata)
896 partitions=self.partitions,
897 open_file_func=open_file,
--> 898 use_pandas_metadata=use_pandas_metadata)
899 tables.append(table)
900 

~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, partitions, open_file_func, file, use_pandas_metadata)
459 table = reader.read_row_group(self.row_group, **options)
460 else:
--> 461 table = reader.read(**options)
462 
463 if len(self.partition_keys) > 0:

~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, use_pandas_metadata)
150 columns, use_pandas_metadata=use_pandas_metadata)
151 return self.reader.read_all(column_indices=column_indices,
--> 152 nthreads=nthreads)
153 
154 def scan_contents(self, columns=None, batch_size=65536):

~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.read_all()

~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowIOError: Couldn't deserialize thrift: No more data to read.
Deserializing page header failed.
{code}
 


