Re: Does Calcite hold all records output from a node before passing them to a higher node ?
Yes, when you use a Sink there is an assumption that there is a Node running that is consuming from the deque. Currently the Interpreter only runs one Node at a time, which means that the full output of that Node sits in a deque for a while.
Clearly the Interpreter has much room for improvement.
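To make the difference concrete, here is a minimal sketch of the two styles. It is not Calcite's actual code: the class BufferVsStream, the CountingSource helper and both method names are hypothetical, made up for illustration. It only counts how many rows the producer has to generate before the consumer sees its first row.

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.NoSuchElementException;

public class BufferVsStream {
    /** Hypothetical stand-in for a data source; counts the rows it has produced. */
    static final class CountingSource implements Iterator<Integer> {
        private final int limit;
        int produced = 0;

        CountingSource(int limit) {
            this.limit = limit;
        }

        @Override public boolean hasNext() {
            return produced < limit;
        }

        @Override public Integer next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return produced++;
        }
    }

    /** Sink style: the producer fills a deque completely before the consumer runs. */
    static int rowsBeforeFirstResultBuffered(int totalRows) {
        CountingSource source = new CountingSource(totalRows);
        ArrayDeque<Integer> deque = new ArrayDeque<>();
        while (source.hasNext()) {
            deque.add(source.next()); // every row is held in memory here
        }
        deque.poll(); // the consumer gets its first row only now
        return source.produced;
    }

    /** Pull style: the consumer asks for one row at a time, so nothing is buffered. */
    static int rowsBeforeFirstResultStreamed(int totalRows) {
        CountingSource source = new CountingSource(totalRows);
        if (source.hasNext()) {
            source.next(); // the first row is available immediately
        }
        return source.produced;
    }

    public static void main(String[] args) {
        System.out.println(rowsBeforeFirstResultBuffered(1000)); // prints 1000
        System.out.println(rowsBeforeFirstResultStreamed(1000)); // prints 1
    }
}
```

With 50 million rows the first method holds all of them before anything is returned, which matches the memory use and delay described below.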
> On May 29, 2018, at 3:22 PM, Muhammad Gelbana <m.gelbana@xxxxxxxxx> wrote:
> I found out what was consuming the memory and delaying the results at the
> same time. I was pushing all obtained rows from the datasource into a sink
> created by this method
> Pushing rows into the sink halts execution of downstream nodes until all
> rows are fully loaded. I thought that, since the sink is backed by an
> "ArrayDeque", the rows would be consumed while being pushed to the sink.
> The other approach I applied was to use the "enumerable" method instead.
> This way, returned rows from my nodes are available for successive nodes
> without delay.
> Thank you all and thank you Julian for the Arrow adapter code.
> On Tue, May 29, 2018 at 5:50 PM, Julian Hyde <jhyde@xxxxxxxxxx> wrote:
>> I believe that scan, filter, project do not buffer; aggregate, join and
>> sort do buffer; join perhaps buffers a little more than it should.
>> Read methods in EnumerableDefaults, for example EnumerableDefaults.join,
>> to see where a blocking collection is created and from which input.
>> Ideally the operators would exploit sorted input (e.g. we could have an
>> aggregate that assumes input is sorted by the GROUP BY key and only buffers
>> records that have the same key) but Enumerable does not aim to be a
>> high-performance, scalable engine, so this never got prioritized.
>> On a related note, I was pleased to see progress on an Arrow adapter and
>> convention in https://issues.apache.org/jira/browse/CALCITE-2173. If we
>> were to write
>> a high-performance engine that scales across many threads, it would be
>> based on Arrow. So anyone with complaints about the performance of
>> Enumerable convention should start contributing to Arrow convention!
>>> On May 29, 2018, at 7:20 AM, Michael Mior <mmior@xxxxxxxxxx> wrote:
>>> In theory it certainly should be possible to stream the results. This
>>> isn't guaranteed, however. You would have to look at the entire query
>>> pipeline to see where things are being materialized. A full stack trace
>>> without elements removed would be a good start.
>>> Michael Mior
>>> On Mon, May 28, 2018 at 7:05 PM, Muhammad Gelbana <m.gelbana@xxxxxxxxx>
>>> wrote:
>>>> I'm not sure if I phrased my question correctly so let me explain more.
>>>> I'm running a (SELECT * FROM TABLE) query against a 50-million-record
>>>> table (following the BINDABLE convention, so it sends its rows through a
>>>> "sink"). Since the extracted rows aren't processed in any way, I was
>>>> expecting that the output JDBC resultset would be able to enumerate
>>>> all the results in a matter of seconds, but instead, my machine didn't
>>>> print anything. What exactly happens is that
>>>> (PreparedStatement.executeQuery) doesn't return a resultset, even
>>>> after a few minutes have passed.
>>>> I tried a table with hundreds of rows and my testing code printed those
>>>> results right away, so it's not something I missed there, but probably a
>>>> configuration I didn't set? Or maybe that's just how it is? Does anyone
>>>> else believe that the behaviour I expected is reasonable? It would also
>>>> lower the amount of memory consumed to hold the complete results before
>>>> bursting them to their final destination, if that's the case in the