(Ab)using parquet files on S3 storage for a huge logging database


Hi,

I'm looking for a way to store huge amounts of logging data in the cloud
from about 100 different data sources, each producing about 50MB/day (so
it's something like 5GB/day).
The target storage would be an S3 object storage for cost-efficiency
reasons.
I would like to be able to store data in real time (i.e. in an append-like
fashion) and to quickly retrieve data by time frame and data source. I was
thinking of partitioning the data by data source and calendar day, so as to
have one file per day per data source, each about 50MB.
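
For illustration, the partitioning scheme described above could map to object keys along these lines. This is only a sketch: the key layout, the "logs" prefix, and the function names object_key and keys_for_range are assumptions made for the example, not something fixed by the setup described here.

```python
from datetime import date, timedelta


def object_key(source: str, day: date, prefix: str = "logs") -> str:
    # Hypothetical layout: one Parquet object per data source per calendar day.
    return f"{prefix}/source={source}/day={day.isoformat()}/data.parquet"


def keys_for_range(source: str, first: date, last: date, prefix: str = "logs") -> list:
    # Every object that has to be opened to cover [first, last], inclusive.
    n_days = (last - first).days + 1
    return [object_key(source, first + timedelta(days=i), prefix) for i in range(n_days)]


# Rough sizing from the figures above: 100 sources at about 50MB/day each.
daily_mb = 100 * 50  # 5000MB, i.e. about 5GB/day
```

With such a layout, a query for one source over a few days only needs to touch the handful of objects returned by keys_for_range, regardless of how much data the other sources produced.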

I played around with pyarrow and parquet (using s3fs), and came across the
following limitations:

1) I found no way to append to existing files. I believe that's a
limitation of S3, but it could be worked around by using datasets
instead. In principle, I believe I could also trigger a daily job which
coalesces today's data into a single file, if too much fragmentation
turns out to be a problem. Would that make any sense?

2) When reading, if I'm only interested in a small portion of the data (for
instance, based on a timestamp field), I obviously don't want to have to
read (i.e. download) the whole file. I believe Parquet was designed to
handle huge amounts of data with relatively fast access, yet I can't tell
whether there is a way to get random access, particularly when the file is
stored in S3.
The following code snippet refers to a 150MB file composed of 1000
row groups of 150KB each. I was expecting it to run very fast, yet it
apparently downloads the whole file (pyarrow 0.9.0):

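import pyarrow.parquet as pq
import s3fs

# access_key, secret_key, client_kwargs and bucket_uri are defined elsewhere
fs = s3fs.S3FileSystem(key=access_key, secret=secret_key,
                       client_kwargs=client_kwargs)
with fs.open(bucket_uri, "rb") as f:
    pf = pq.ParquetFile(f)
    print(pf.num_row_groups)  # yields 1000
    pf.read_row_group(1)  # read one row group only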
3) I was also expecting to be able to run some sort of query, but I can't
see how to specify index columns or the like.

What am I missing? Did I get it all wrong?

Thank you!
Gerlando