
Re: (Ab)using parquet files on S3 storage for a huge logging database


Hi Paul,
I see your point. I'm probably worrying too much about indices, since
partitioning alone may reduce the problem to a bearable size.
I need to understand better how Apache Drill can be interfaced with. If it
can be easily deployed somewhere in the cloud -- where fetching files
from storage should be quite fast -- and easily queried from a local
machine, perhaps that's more than enough.

Thank you!
Gerlando


On Wed, Sep 19, 2018, 23:04 Paul Rogers <par0328@xxxxxxxxx.invalid>
wrote:

> Hi Gerlando,
>
> Parquet does not allow row-level indexing because some data for a row
> might not even exist on its own; it is encoded within data about a group of
> similar rows.
>
> In the world of Big Data, it seems that the most common practice is to
> simply scan all the data to find the bits you want. Indexing is very hard
> in a distributed system. (See "Designing Data-Intensive Applications" from
> O'Reilly for a good summary.) Parquet is optimized for this case.
>
> You use partitions to whittle down the haystacks (set of Parquet files)
> you must search. Then you use Drill to scan those haystacks to find the
> needle.
>
> Thanks,
> - Paul
>
>
>
>     On Wednesday, September 19, 2018, 12:30:36 PM PDT, Brian Bowman <
> Brian.Bowman@xxxxxxx> wrote:
>
>  Gerlando,
>
> AFAIK Parquet does not yet support indexing.  I believe it does store
> min/max values at the row group (or maybe it's the page) level, which may help
> eliminate large "swaths" of data depending on how actual data values
> corresponding to a search predicate are distributed across large Parquet
> files.
>
> I have an interest in the future of indexing within the native Parquet
> structure as well.  It will be interesting to see where this discussion
> goes from here.
>
> -Brian
>
> On 9/19/18, 3:21 PM, "Gerlando Falauto" <gerlando.falauto@xxxxxxxxx>
> wrote:
>
>     Thank you all guys, you've been extremely helpful with your ideas.
>     I'll definitely have a look at all your suggestions to see what others
> have
>     been doing in this respect.
>
>     What I forgot to mention was that while the service uses the S3 API,
> it's
>     not provided by AWS so any solution should be based on a cloud offering
>     from a different big player (it's the three-letter big-blue one, in
> case
>     you're wondering).
>
>     However, I'm still not clear as to how Drill (or pyarrow) would be
> able to
>     gather data with random access. In any database, you just build an
> index on
>     the fields you're going to run most of your queries over, and then the
>     database takes care of everything else.
>
>     With Parquet, as I understand, you can do folder-based partitioning (is
>     that called "hive" partitioning?) so that you can get random access
> over
>     let's say
>     source=source1/date=20180918/*.parquet.
>     I assume Drill could be instructed to do this, or even figure it out by
>     itself just by looking at the folder structure.
>     What I still don't get though, is how to "index" the parquet file(s),
> so
>     that random (rather than sequential) access can be performed over the
> whole
>     file.
>     Brian mentioned metadata; I had a quick look at the Parquet specification
>     and I sort of understand that it somehow resembles an index.
>     Yet I fail to understand how such an index could be built (if at all
>     possible), for instance using pyarrow (or any other tool, for that
> matter)
>     for reading and/or writing.
>
>     Thank you!
>     Gerlando
>
>     On Wed, Sep 19, 2018 at 7:55 PM Ted Dunning <ted.dunning@xxxxxxxxx>
> wrote:
>
>     > The effect of rename can be had by maintaining a small inventory file
>     > that is updated atomically.
>     >
>     > Having real file semantics is sooo much nicer, though.
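A local-filesystem sketch of that inventory idea (all names are made up); on S3 the final os.replace step would instead be a single PUT of the manifest key, which replaces the whole object atomically:

```python
import json
import os
import tempfile

def publish_manifest(dir_path, file_names, manifest_name="manifest.json"):
    """Atomically replace the inventory listing the 'live' data files.

    Readers that load the manifest always see a complete file list,
    never a half-written one, so a consolidation job can swap in a big
    file and retire the small ones in a single step.
    """
    fd, tmp = tempfile.mkstemp(dir=dir_path)
    with os.fdopen(fd, "w") as f:
        json.dump({"files": sorted(file_names)}, f)
    os.replace(tmp, os.path.join(dir_path, manifest_name))  # atomic on POSIX

d = tempfile.mkdtemp()
publish_manifest(d, ["part-000.parquet", "part-001.parquet"])
publish_manifest(d, ["consolidated-20180919.parquet"])  # after compaction

with open(os.path.join(d, "manifest.json")) as f:
    live = json.load(f)["files"]
print(live)  # ['consolidated-20180919.parquet']
```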
>     >
>     >
>     >
>     > On Wed, Sep 19, 2018 at 1:51 PM Bill Glennon <wglennon@xxxxxxxxx>
> wrote:
>     >
>     > > Also, may want to take a look at https://aws.amazon.com/athena/.
>     > >
>     > > Thanks,
>     > > Bill
>     > >
>     > > On Wed, Sep 19, 2018 at 1:43 PM Paul Rogers
> <par0328@xxxxxxxxx.invalid>
>     > > wrote:
>     > >
>     > > > Hi Gerlando,
>     > > >
>     > > > I believe AWS offers an entire logging pipeline. If you want
>     > > > something quick, perhaps look into that offering.
>     > > >
>     > > > What you describe is pretty much the classic approach to log
>     > aggregation:
>     > > > partition data, gather data incrementally, then later
> consolidate. A
>     > > while
>     > > > back, someone invented the term "lambda architecture" for this
> idea.
>     > You
>     > > > should be able to find examples of how others have done something
>     > > similar.
>     > > >
>     > > > Drill can scan directories of files. So, in your buckets
> (source-date)
>     > > > directories, you can have multiple files. If you receive data,
> say,
>     > > every 5
>     > > > or 10 minutes, you can just create a separate file for each new
> drop of
>     > > > data. You'll end up with many files, but you can query the data
> as it
>     > > > arrives.
>     > > >
>     > > > Then, later, say once per day, you can consolidate the files
> into a few
>     > > > big files. The only trick is the race condition of doing the
>     > > consolidation
>     > > > while running queries. Not sure how to do that on S3, since you
> can't
>     > > > exploit rename operations as you can on Linux. Anyone have
> suggestions
>     > > for
>     > > > this step?
>     > > >
>     > > > Thanks,
>     > > > - Paul
>     > > >
>     > > >
>     > > >
>     > > >    On Wednesday, September 19, 2018, 6:23:13 AM PDT, Gerlando
> Falauto
>     > <
>     > > > gerlando.falauto@xxxxxxxxx> wrote:
>     > > >
>     > > >  Hi,
>     > > >
>     > > > I'm looking for a way to store huge amounts of logging data in
> the
>     > cloud
>     > > > from about 100 different data sources, each producing about
> 50MB/day
>     > (so
>     > > > it's something like 5GB/day).
>     > > > The target storage would be an S3 object storage for
> cost-efficiency
>     > > > reasons.
>     > > > I would like to be able to store (i.e. append-like) data in
> realtime,
>     > and
>     > > > retrieve data based on time frame and data source with fast
> access. I
>     > was
>     > > > thinking of partitioning data based on datasource and calendar
> day, so
>     > to
>     > > > have one file per day, per data source, each 50MB.
>     > > >
>     > > > I played around with pyarrow and parquet (using s3fs), and came
> across
>     > > the
>     > > > following limitations:
>     > > >
>     > > > 1) I found no way to append to existing files. I believe that's
> some
>     > > > limitation with S3, but it could be worked around by using
> datasets
>     > > > instead. In principle, I believe I could also trigger some daily
> job
>     > > which
>     > > > coalesces today's data into a single file, if having too much
>     > > > fragmentation causes any disturbance. Would that make any sense?
>     > > >
>     > > > 2) When reading, if I'm only interested in a small portion of
> the data
>     > > (for
>     > > > instance, based on a timestamp field), I obviously wouldn't want
> to
>     > have
>     > > to
>     > > > read (i.e. download) the whole file. I believe Parquet was
> designed to
>     > > > handle huge amounts of data with relatively fast access. Yet I
> fail to
>     > > > understand if there's some way to allow for random access,
> particularly
>     > > > when dealing with a file stored within S3.
>     > > > The following code snippet refers to a 150MB dataset composed of
> 1000
>     > > > rowgroups of 150KB each. I was expecting it to run very fast,
> yet it
>     > > > apparently downloads the whole file (pyarrow 0.9.0):
>     > > >
>     > > > import s3fs
>     > > > import pyarrow.parquet as pq
>     > > >
>     > > > fs = s3fs.S3FileSystem(key=access_key, secret=secret_key,
>     > > >                        client_kwargs=client_kwargs)
>     > > > with fs.open(bucket_uri) as f:
>     > > >     pf = pq.ParquetFile(f)
>     > > >     print(pf.num_row_groups)  # yields 1000
>     > > >     pf.read_row_group(1)      # expected to fetch just one row group
>     > > >
>     > > > 3) I was also expecting to be able to perform some sort of
> query, but
>     > I'm
>     > > > also failing to see how to specify index columns or such.
>     > > >
>     > > > What am I missing? Did I get it all wrong?
>     > > >
>     > > > Thank you!
>     > > > Gerlando
>     > > >
>     > >
>     >
>
>
>