
Re: Does Calcite hold all records output from a node before passing them to a higher node ?

I believe that scan, filter, project do not buffer; aggregate, join and sort do buffer; join perhaps buffers a little more than it should. 
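To make the distinction concrete, here is a minimal sketch in plain Java. It is not Calcite's code; BufferingSketch, CountingIterator, filter and sort are names made up for this illustration. The filter hands a row on as soon as it finds a match, whereas the sort has to drain its entire input before it can return the first row.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

/** Hypothetical illustration; not part of Calcite or linq4j. */
public class BufferingSketch {
  /** Wraps an iterator and counts how many rows have been pulled from it. */
  public static final class CountingIterator<T> implements Iterator<T> {
    private final Iterator<T> input;
    public int pulled = 0;

    public CountingIterator(Iterator<T> input) {
      this.input = input;
    }

    @Override public boolean hasNext() {
      return input.hasNext();
    }

    @Override public T next() {
      pulled++;
      return input.next();
    }
  }

  /** Streaming filter: pulls input rows only until the next match is found. */
  public static <T> Iterator<T> filter(Iterator<T> input, Predicate<T> predicate) {
    return new Iterator<T>() {
      private T nextRow;
      private boolean ready = false;

      @Override public boolean hasNext() {
        while (!ready && input.hasNext()) {
          T row = input.next();
          if (predicate.test(row)) {
            nextRow = row;
            ready = true;
          }
        }
        return ready;
      }

      @Override public T next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        ready = false;
        return nextRow;
      }
    };
  }

  /** Blocking sort: reads every input row into a buffer before returning. */
  public static <T extends Comparable<T>> Iterator<T> sort(Iterator<T> input) {
    List<T> buffer = new ArrayList<>();
    while (input.hasNext()) {
      buffer.add(input.next());
    }
    buffer.sort(Comparator.naturalOrder());
    return buffer.iterator();
  }
}
```

Wrapping the source in a CountingIterator shows the difference: after the first next() on the filtered iterator only as many rows have been pulled as were needed to find one match, while sort has already pulled every row by the time it returns.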

Read methods in EnumerableDefaults, for example EnumerableDefaults.join, to see where a blocking collection is created and from which input.

Ideally the operators would exploit sorted input (e.g. we could have an aggregate that assumes input is sorted by the GROUP BY key and only buffers records that have the same key) but Enumerable does not aim to be a high-performance, scalable engine, so this never got prioritized.
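A rough sketch of what such a sorted-input aggregate could look like, again in plain Java rather than anything that exists in Calcite (SortedAggregateSketch and sumByKey are hypothetical names). It computes SUM(value) GROUP BY key and assumes the input is already sorted by a non-null key. For SUM it does not even need to keep the rows of the current group, only a running total and one look-ahead row.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical illustration; not part of Calcite or linq4j. */
public class SortedAggregateSketch {
  /**
   * SUM(value) GROUP BY key over input that is ASSUMED to be sorted by key.
   * Keys are assumed non-null. Holds one look-ahead row and one running sum.
   */
  public static Iterator<Map.Entry<String, Long>> sumByKey(
      Iterator<Map.Entry<String, Long>> input) {
    return new Iterator<Map.Entry<String, Long>>() {
      // The first row of the next group, or null once input is exhausted.
      private Map.Entry<String, Long> pending = input.hasNext() ? input.next() : null;

      @Override public boolean hasNext() {
        return pending != null;
      }

      @Override public Map.Entry<String, Long> next() {
        if (pending == null) {
          throw new NoSuchElementException();
        }
        String key = pending.getKey();
        long sum = 0;
        // Consume rows until the key changes; sorted input guarantees
        // that this key will not appear again later.
        while (pending != null && pending.getKey().equals(key)) {
          sum += pending.getValue();
          pending = input.hasNext() ? input.next() : null;
        }
        return new SimpleEntry<String, Long>(key, sum);
      }
    };
  }
}
```

If the input is not actually sorted, this sketch silently emits the same key more than once, which is why a planner would have to know the input's sort order before choosing such an operator.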

On a related note, I was pleased to see progress on an Arrow adapter and convention in https://issues.apache.org/jira/browse/CALCITE-2173. If we were to write a high-performance engine that scales across many threads, it would be based on Arrow. So anyone with complaints about the performance of Enumerable convention should start contributing to Arrow convention!


> On May 29, 2018, at 7:20 AM, Michael Mior <mmior@xxxxxxxxxx> wrote:
> In theory it certainly should be possible to stream the results. This isn't
> guaranteed however. You would have to look at the entire query pipeline to
> see where things are being materialized. A full stack trace without
> elements removed would be a good start.
> --
> Michael Mior
> mmior@xxxxxxxxxx
> On Mon, May 28, 2018 at 7:05 PM, Muhammad Gelbana <m.gelbana@xxxxxxxxx> wrote:
>> I'm not sure if I phrased my question correctly so let me explain more.
>> I'm running a (SELECT * FROM TABLE) query against a 50-million-record
>> table (following the BINDABLE convention, so it sends its rows through a
>> "sink"). Since the extracted rows aren't processed in any way, I was
>> expecting that the output JDBC resultset would be able to enumerate through
>> all the results in a matter of seconds, but instead, my machine didn't
>> print anything. What exactly happens is that
>> (PreparedStatement.executeQuery) doesn't return a resultset promptly even
>> after a few minutes have passed.
>> I tried a table with hundreds of rows and my testing code printed those
>> results right away so it's not something I missed there, but probably a
>> configuration I didn't set? Or maybe that's just how it is? Does anyone
>> else believe that the behaviour I expected is reasonable? It would also
>> lower the amount of memory consumed to hold the complete results before
>> bursting them to their final destination, if that's the case in the first
>> place.
>> Thanks,
>> Gelbana