Re: (Ab)using parquet files on S3 storage for a huge logging database

Gerlando is correct that S3 objects, once created, are immutable. They cannot be updated in place, appended to, or even renamed. However, S3 supports ranged reads, so a reader can seek to an offset within an object and fetch only the bytes it needs. The challenge is knowing where to read. To perform well, that requires metadata that can be obtained with a minimal number of I/O operations before seeking to and reading the needed parts of the object.
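
To make the "minimal I/O before seeking" point concrete for Parquet: a Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes PAR1, and the footer holds the row-group metadata. A rough sketch of locating that footer with one small read is below. The helper name locate_footer and the in-memory stand-in file are made up for illustration, and the fake footer bytes are not real Parquet metadata. With a file object from s3fs, I would expect the same seek/read calls to turn into ranged requests rather than a full download, but I have not measured that here.

```python
import io
import struct

PARQUET_MAGIC = b"PAR1"

def locate_footer(f):
    """Return (offset, length) of the footer metadata in a seekable file object.

    Hypothetical helper: reads only the last 8 bytes of the file.
    """
    f.seek(-8, io.SEEK_END)            # tail = 4-byte footer length + 4-byte magic
    tail = f.read(8)
    if len(tail) != 8 or tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing trailing PAR1 magic; not a Parquet file?")
    (footer_len,) = struct.unpack("<I", tail[:4])
    f.seek(0, io.SEEK_END)
    file_size = f.tell()
    return file_size - 8 - footer_len, footer_len

# Stand-in for an S3 object: leading magic, 100 bytes of "data", a fake footer.
fake_footer = b"META"
blob = (PARQUET_MAGIC + b"x" * 100 + fake_footer
        + struct.pack("<I", len(fake_footer)) + PARQUET_MAGIC)

with io.BytesIO(blob) as f:
    offset, length = locate_footer(f)
    f.seek(offset)                     # second small read: the footer itself
    footer_bytes = f.read(length)
```

In a real file the footer bytes would then be decoded to find the byte ranges of the row groups and column chunks worth fetching, which is what a Parquet reader does internally.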



On 9/19/18, 9:23 AM, "Gerlando Falauto" <gerlando.falauto@xxxxxxxxx> wrote:

    I'm looking for a way to store huge amounts of logging data in the cloud
    from about 100 different data sources, each producing about 50MB/day (so
    it's something like 5GB/day).
    The target storage would be an S3 object storage for cost-efficiency.
    I would like to be able to store (i.e. append-like) data in realtime, and
    retrieve data based on time frame and data source with fast access. I was
    thinking of partitioning data based on datasource and calendar day, so as to
    have one file per day, per data source, each 50MB.
    I played around with pyarrow and parquet (using s3fs), and came across the
    following limitations:
    1) I found no way to append to existing files. I believe that's some
    limitation with S3, but it could be worked around by using datasets
    instead. In principle, I believe I could also trigger some daily job which
    coalesces today's data into a single file, if having too much
    fragmentation causes any disturbance. Would that make any sense?
    2) When reading, if I'm only interested in a small portion of the data (for
    instance, based on a timestamp field), I obviously wouldn't want to have to
    read (i.e. download) the whole file. I believe Parquet was designed to
    handle huge amounts of data with relatively fast access. Yet I fail to
    understand if there's some way to allow for random access, particularly
    when dealing with a file stored within S3.
    The following code snippet refers to a 150MB dataset composed of 1000
    rowgroups of 150KB each. I was expecting it to run very fast, yet it
    apparently downloads the whole file (pyarrow 0.9.0):
    fs = s3fs.S3FileSystem(key=access_key, secret=secret_key,
    with fs.open(bucket_uri) as f:
        pf = pq.ParquetFile(f)
        print(pf.num_row_groups) # yields 1000
    3) I was also expecting to be able to perform some sort of query, but I'm
    also failing to see how to specify index columns or such.
    What am I missing? Did I get it all wrong?
    Thank you!