osdir.com

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: scrolling in ElasticSearch adapter


You’re proposing lazy consensus. We don’t do that for commits. (Yeah, we use a mixture of CTR and RTC. But adding lazy consensus muddies things too much, I think.)

For this change, I think you should do RTC, and require a +1 from at least one committer. Therefore, someone please review https://issues.apache.org/jira/browse/CALCITE-2651 <https://issues.apache.org/jira/browse/CALCITE-2651>.


> On Nov 14, 2018, at 1:27 PM, Andrei Sereda <andrei@xxxxxxxxx> wrote:
> 
> Thanks, all, for your input.
> 
> Following this discussion, I have prepared a PR
> https://github.com/apache/calcite/pull/919
> 
> Let me know if you agree with the approach. If there are no objections /
> comments, I will commit it in a couple of days.
> 
> Propagation of Statement.setFetchSize(int) into actual query will be done
> in separate commit (I’m still checking DataContext usages)
> 
> On Fri, Oct 26, 2018 at 10:45 AM Kevin Risden <krisden@xxxxxxxxxx> wrote:
> 
>> #1 sounds reasonable.
>> #2 will have to see what that means for actual aggregations. Sounds like it
>> could work.
>> #3 not sure what is meant here. I think fetch size is a hint and not
>> required. I wouldn't default to 10k though.
>> 
>> Kevin Risden
>> 
>> 
>> On Thu, Oct 25, 2018 at 4:25 PM Andrei Sereda <andrei@xxxxxxxxx> wrote:
>> 
>>> There is new Composite Aggregation
>>> <
>>> 
>> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html
>>>> 
>>> (still in beta) which allows pagination with aggregates. It is available
>> on
>>> versions >= 5.6 (unfortunately we have to support 2.6+).
>>> 
>>> To sum up:
>>> 
>>>   1. For non-aggregate queries always use scrolling or search-after by
>>>   default (not pagination) .
>>>   2. For aggregates use pagination. Either Terms Aggregation (with
>>>   partitioning) or Composite Aggregation.
>>>   3. If possible Retrieve batch size from JDBC api (known as
>>>   Statement.setFetchSize(int)). If not default to 10k ?
>>> 
>>> Does it sound reasonable ?
>>> 
>>> 
>>> On Thu, Oct 25, 2018 at 3:43 PM
>>> Kevin Risden
>>> <krisden@xxxxxxxxxx> wrote:
>>> 
>>>> What I was saying is that scrolling is the only way to ensure you get
>>>> correct results coming back from Elasticsearch if you are going to do
>>> more
>>>> processing.
>>>> 
>>>> 
>>>> 
>>> 
>> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
>>>> 
>>>> As long as you delete the scroll at the end its not the end of the
>> world.
>>>> There is no deep paging unless you use scrolling or a search after
>> which
>>> is
>>>> basically the same idea as scrolling
>>>> 
>>>> 
>>> 
>> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-after.html
>>>> .
>>>> 
>>>> I would say that scrolling should be the default behavior and then
>> should
>>>> look at optimizing for aggregations, smaller result sets, etc. If you
>>> look
>>>> at scrolling as an optimization the default results won't be correct
>> (ie:
>>>> only returning 10 results when there are a lot more). For default
>>>> pagination there is a limit of 10000 results anyway unless you use
>>>> scrolling or search after.
>>>> 
>>>> Kevin Risden
>>>> 
>>>> 
>>>> On Thu, Oct 25, 2018 at 3:33 PM Andrei Sereda <andrei@xxxxxxxxx>
>> wrote:
>>>> 
>>>>> Hi Kevin,
>>>>> 
>>>>> You suggest to use scrolling for all elastic queries ? Even when
>> there
>>>>> are predicates ?  Some questions :
>>>>> 
>>>>> 1) Scrolling has a runtime overhead for elastic cluster. Having it
>>>>> enabled by default (against vendor recommendation) is risky. Does it
>>>>> (not) cause issues in Solr ?
>>>>> 2) Scrolling does not work with aggregates, which means I'll have to
>>>>> fallback to pagination (for whole result set).
>>>>> 3) When result set is small scrolling / pagination is not necessary.
>>>>> 
>>>>> Having said that I don't see any way to automatically enable /
>> disable
>>>>> scrolling based on RelTree (pls let me know if there is one).
>>>>> 
>>>>> You seem to agree with Stamatis that `select * from elastic` (1)
>>>>> should return all documents (rows) not just 10 (ES default). If so,
>>>>> one either has to use pagination or scrolling.
>>>>> What about queries `select count(*), foo from elastic group by foo`
>>>>> (2) (term aggregation). In this case only pagination works. If
>> calcite
>>>>> returns all rows for (1) it should do the same for (2).  Should one
>>>>> use pagination with aggregates (2) and scrolling for simple queries
>>>>> (1) ?
>>>>> What if user specifies a high `LIMIT` (eg 10M) should we batch the
>>>> result ?
>>>>> Should user explicitly enable scrolling (eg. JDBC api) ? If so,
>> should
>>>>> one throw an exception when scrolling and aggregation is used or
>>>>> quietly default to pagination ?
>>>>> 
>>>>> Thanks everybody for your input, it helps.
>>>>> 
>>>>> Regards,
>>>>> Andrei.
>>>>> 
>>>>> 
>>>>> On Thu, Oct 25, 2018 at 2:08 PM
>>>>> Kevin Risden
>>>>> <krisden@xxxxxxxxxx> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> There is one more “issue”. Currently select * from elastic
>> returns
>>> at
>>>>> most
>>>>>>> 10 rows (in calcite). This is consistent with elastic behaviour
>>> which
>>>>>>> limits result set to 10 documents (unless size is specified).
>>>>>> 
>>>>>> 
>>>>>> In Solr land for the Calcite integration it uses the /export
>> handler
>>> or
>>>>>> streaming expressions when there is no limit specified (these are
>>> just
>>>>> like
>>>>>> scrolling or called CursorMark in Solr). It falls back to /select
>>> with
>>>>> the
>>>>>> default limit of 10 rows.
>>>>>> 
>>>>>> I think that scroll should be the default almost always since it is
>>> the
>>>>>> only way to return all the results. Even with a limit = 100000
>> would
>>>>> cause
>>>>>> problems with the default query handler. Paging deep into the
>> result
>>>> set
>>>>> is
>>>>>> not something that Lucene (Elasticsearch or Solr) does well.
>>>>>> 
>>>>>> 
>>>>>> Kevin Risden
>>>>>> 
>>>>>> 
>>>>>> On Thu, Oct 25, 2018 at 1:31 PM Julian Hyde <jhyde@xxxxxxxxxx>
>>> wrote:
>>>>>> 
>>>>>>> Do you need to generate a different plan (i.e. a different tree
>> of
>>>>>>> RelNodes) for scrolling vs non-scrolling? If so, it’s certainly
>>>>>>> inconvenient that you don’t know until execute time whether they
>>> want
>>>>>>> scrolling. A possible solution would be to generate TWO plans -
>> one
>>>>>>> scrolling, one non-scrolling inside the prepared statement - and
>>> pick
>>>>> which
>>>>>>> one based on the runtime context.
>>>>>>> 
>>>>>>> Julian
>>>>>>> 
>>>>>>>> On Oct 25, 2018, at 1:42 AM, Christian Beikov <
>>>>>>> christian.beikov@xxxxxxxxx> wrote:
>>>>>>>> 
>>>>>>>> Hey Andrei,
>>>>>>>> 
>>>>>>>> I don't have an answer for how you can access these settings
>> from
>>>>> within
>>>>>>> the adapter nor how one could do that via RelNodes but the
>>> suggestion
>>>>> to
>>>>>>> use DataContext for that purpose sounds reasonable. Maybe someone
>>>> else
>>>>> has
>>>>>>> an idea?
>>>>>>>> 
>>>>>>>> Anyway, since these are settings that don't affect the general
>>>>> semantics
>>>>>>> of the query/statement and also usually require a special API to
>> be
>>>>> used,
>>>>>>> I'd rather see these aspects not end up in the query string.
>>>>>>>> 
>>>>>>>> Am 25.10.2018 um 02:15 schrieb Andrei Sereda:
>>>>>>>>> Christian,
>>>>>>>>> 
>>>>>>>>> I like TYPE_SCROLL_INSENSITIVE / fetchSize in
>> PreparedStatement
>>>>>>>>> generally but have some reservations (questions) :
>>>>>>>>> 
>>>>>>>>> How to pass resultSetType / fetchSize from PreparedStatement
>> to
>>>>>>> RelNodes ?
>>>>>>>>> What if user doesn’t use JDBC (eg. RelBuilders) ?
>>>>>>>>> On Wed, Oct 24, 2018 at 6:28 PM Christian Beikov
>>>>>>>>> <christian.beikov@xxxxxxxxx> wrote:
>>>>>>>>>> In JDBC one can configure a fetch size which would reflect
>> the
>>>>> amount
>>>>>>> of
>>>>>>>>>> rows to be fetched initially, but also subsequently.
>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://docs.oracle.com/javase/7/docs/api/java/sql/Statement.html#setFetchSize(int)
>>>>>>>>>> 
>>>>>>>>>> According to what you are writing, ES behvior is what
>>>>>>>>>> TYPE_SCROLL_INSENSITIVE would do i.e. provide a snapshot view
>>>> that
>>>>>>> isn't
>>>>>>>>>> affected by changes.
>>>>>>>>>> 
>>>>>>>>>> IMO TYPE_SCROLL_SENSITIVE means that if you have rows R1, R2,
>>> R3,
>>>>> R4,
>>>>>>>>>> ... and view R1, R2, then R3 is deleted and you fetch the
>> next
>>>>> rows,
>>>>>>> you
>>>>>>>>>> wouldn't see R3.
>>>>>>>>>> 
>>>>>>>>>> According to the JDBC spec
>>>>>>>>>> (
>>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://docs.oracle.com/javase/7/docs/api/java/sql/Connection.html#prepareStatement(java.lang.String,%20int,%20int)
>>>>>>>>>> ) you don't have to support all modes. Usually, user code
>>> doesn't
>>>>> use
>>>>>>>>>> scrolling that much, but at least forward scrolling makes
>>> sense.
>>>>>>>>>> 
>>>>>>>>>> Am 24.10.2018 um 21:38 schrieb Andrei Sereda:
>>>>>>>>>>> Hi Julian,
>>>>>>>>>>> 
>>>>>>>>>>> Scrolling (in elastic) does not only mean “open a cursor”
>> but
>>>> also
>>>>>>> iterate
>>>>>>>>>>> over consistent snapshot. From docs:
>>>>>>>>>>> 
>>>>>>>>>>> The results that are returned from a scroll request reflect
>>> the
>>>>> state
>>>>>>> of
>>>>>>>>>>> the index at the time that the initial search request was
>>> made,
>>>>> like a
>>>>>>>>>>> snapshot in time. Subsequent changes to documents (index,
>>> update
>>>>> or
>>>>>>> delete)
>>>>>>>>>>> will only affect later search requests.
>>>>>>>>>>> 
>>>>>>>>>>> So pagination (fetch / offset) can’t exactly replicate this
>>>>>>> functionality.
>>>>>>>>>>> 
>>>>>>>>>>> The problem with scrolling (in elastic) is that it is
>>> expensive
>>>>> and
>>>>>>> can’t
>>>>>>>>>>> (shouldn’t) be enabled it by default.
>>>>>>>>>>> 
>>>>>>>>>>> There is one more “issue”. Currently select * from elastic
>>>>> returns at
>>>>>>> most
>>>>>>>>>>> 10 rows (in calcite). This is consistent with elastic
>>> behaviour
>>>>> which
>>>>>>>>>>> limits result set to 10 documents (unless size is
>> specified).
>>>> When
>>>>>>>>>>> returning a cursor (eg. using JDBC TYPE_SCROLL_SENSITIVE
>>>>>>>>>>> <
>>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://docs.oracle.com/javase/7/docs/api/java/sql/ResultSet.html#TYPE_SCROLL_SENSITIVE
>>>>>>>> 
>>>>>>>>>>> or SQL hint) does it mean return whole elastic index ? I’m
>> not
>>>> at
>>>>>>> ease with
>>>>>>>>>>> returning different results based on hints or cursor
>> settings.
>>>>>>>>>>> 
>>>>>>>>>>> Andrei.
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Oct 24, 2018 at 3:02 PM Julian Hyde <
>>>>> jhyde.apache@xxxxxxxxx>
>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> It seems to me that Elasticsearch scroll means return a
>>> cursor
>>>> -
>>>>> a
>>>>>>>>>>>> collection of rows that you iterate over, and you may not
>>> read
>>>>> all
>>>>>>> of them.
>>>>>>>>>>>> This is the default operation of JDBC.
>>>>>>>>>>>> 
>>>>>>>>>>>> So, I guess we need to give the user a way to signal their
>>>>> intent to
>>>>>>> read
>>>>>>>>>>>> all rows versus only the first few. Oracle’s FIRST_ROWS and
>>>>> ALL_ROWS
>>>>>>>>>>>> hints[1] seem close to this. We would want the hints to be
>>>> acted
>>>>>>> upon by
>>>>>>>>>>>> both the optimizer and the JDBC transport.
>>>>>>>>>>>> 
>>>>>>>>>>>> Related is pagination. SQL has FETCH and OFFSET, which
>> allow
>>>> you
>>>>> to
>>>>>>>>>>>> retrieve different pieces of a large result set in separate
>>>>>>> statements or
>>>>>>>>>>>> (using query parameters) executions. It would be useful if
>>> the
>>>>>>> server could
>>>>>>>>>>>> be given a hint to cache a statement across page requests.
>>>>>>>>>>>> 
>>>>>>>>>>>> Julian
>>>>>>>>>>>> 
>>>>>>>>>>>> [1]
>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> https://docs.oracle.com/cd/B10500_01/server.920/a96533/hintsref.htm#4924
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Oct 24, 2018, at 11:19 AM, Christian Beikov <
>>>>>>>>>>>> christian.beikov@xxxxxxxxx> wrote:
>>>>>>>>>>>>> Hey,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> not sure if this should be an SQL keyword. JDBC specifies
>>>>> various
>>>>>>>>>>>> constants that can be used at statement creation time:
>>>>>>>>>>>> 
>>>>> https://docs.oracle.com/javase/7/docs/api/java/sql/ResultSet.html
>>>>>>>>>>>>> Not sure though if or how these configurations are
>>> accessible
>>>>> for
>>>>>>> data
>>>>>>>>>>>> stores or dialects, but IMO using these would be the proper
>>>> way.
>>>>>>>>>>>>> Regards
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Christian
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Am 24.10.2018 um 18:44 schrieb Andrei Sereda:
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I was thinking about adding [scrolling functionality](
>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
>>>>>>>>>>>> )
>>>>>>>>>>>>>> to elastic search adapter. Since scrolling has
>>> non-negligible
>>>>>>> effect on
>>>>>>>>>>>> the
>>>>>>>>>>>>>> cluster it should be selectively enabled on per query
>>> basis.
>>>>> So,
>>>>>>> likely,
>>>>>>>>>>>>>> user has to explicitly set "scroll flag" somewhere.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Most natural way seems in SQL. [Calcite sql grammar](
>>>>>>>>>>>>>> https://calcite.apache.org/docs/reference.html) has
>>> `SCROLL`
>>>>>>> keyword
>>>>>>>>>>>>>> (unused to my knowledge). There were also discussions
>> about
>>>>> adding
>>>>>>>>>>>> hints to
>>>>>>>>>>>>>> Calcite.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> ### Examples
>>>>>>>>>>>>>> ```sql
>>>>>>>>>>>>>> -- special sql keyword ?
>>>>>>>>>>>>>> SCROLL select * from elastic;
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -- assuming hints are available in calcite
>>>>>>>>>>>>>> /* HINT: scroll */ select * from elastic;
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> What people think about this use-case ? Are there better
>>>> ideas
>>>>> ?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Andrei.
>>>>>>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>> 
>>