osdir.com

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: scrolling in ElasticSearch adapter


Thanks, all, for your input.

Following this discussion, I have prepared a PR
https://github.com/apache/calcite/pull/919

Let me know if you agree with the approach. If there are no objections /
comments, I will commit it in a couple of days.

Propagation of Statement.setFetchSize(int) into actual query will be done
in separate commit (I’m still checking DataContext usages)

On Fri, Oct 26, 2018 at 10:45 AM Kevin Risden <krisden@xxxxxxxxxx> wrote:

> #1 sounds reasonable.
> #2 will have to see what that means for actual aggregations. Sounds like it
> could work.
> #3 not sure what is meant here. I think fetch size is a hint and not
> required. I wouldn't default to 10k though.
>
> Kevin Risden
>
>
> On Thu, Oct 25, 2018 at 4:25 PM Andrei Sereda <andrei@xxxxxxxxx> wrote:
>
> > There is new Composite Aggregation
> > <
> >
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html
> > >
> > (still in beta) which allows pagination with aggregates. It is available
> on
> > versions >= 5.6 (unfortunately we have to support 2.6+).
> >
> > To sum up:
> >
> >    1. For non-aggregate queries always use scrolling or search-after by
> >    default (not pagination) .
> >    2. For aggregates use pagination. Either Terms Aggregation (with
> >    partitioning) or Composite Aggregation.
> >    3. If possible Retrieve batch size from JDBC api (known as
> >    Statement.setFetchSize(int)). If not default to 10k ?
> >
> > Does it sound reasonable ?
> >
> >
> > On Thu, Oct 25, 2018 at 3:43 PM
> > Kevin Risden
> > <krisden@xxxxxxxxxx> wrote:
> >
> > > What I was saying is that scrolling is the only way to ensure you get
> > > correct results coming back from Elasticsearch if you are going to do
> > more
> > > processing.
> > >
> > >
> > >
> >
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
> > >
> > > As long as you delete the scroll at the end its not the end of the
> world.
> > > There is no deep paging unless you use scrolling or a search after
> which
> > is
> > > basically the same idea as scrolling
> > >
> > >
> >
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-after.html
> > > .
> > >
> > > I would say that scrolling should be the default behavior and then
> should
> > > look at optimizing for aggregations, smaller result sets, etc. If you
> > look
> > > at scrolling as an optimization the default results won't be correct
> (ie:
> > > only returning 10 results when there are a lot more). For default
> > > pagination there is a limit of 10000 results anyway unless you use
> > > scrolling or search after.
> > >
> > > Kevin Risden
> > >
> > >
> > > On Thu, Oct 25, 2018 at 3:33 PM Andrei Sereda <andrei@xxxxxxxxx>
> wrote:
> > >
> > > > Hi Kevin,
> > > >
> > > > You suggest to use scrolling for all elastic queries ? Even when
> there
> > > > are predicates ?  Some questions :
> > > >
> > > > 1) Scrolling has a runtime overhead for elastic cluster. Having it
> > > > enabled by default (against vendor recommendation) is risky. Does it
> > > > (not) cause issues in Solr ?
> > > > 2) Scrolling does not work with aggregates, which means I'll have to
> > > > fallback to pagination (for whole result set).
> > > > 3) When result set is small scrolling / pagination is not necessary.
> > > >
> > > > Having said that I don't see any way to automatically enable /
> disable
> > > > scrolling based on RelTree (pls let me know if there is one).
> > > >
> > > > You seem to agree with Stamatis that `select * from elastic` (1)
> > > > should return all documents (rows) not just 10 (ES default). If so,
> > > > one either has to use pagination or scrolling.
> > > > What about queries `select count(*), foo from elastic group by foo`
> > > > (2) (term aggregation). In this case only pagination works. If
> calcite
> > > > returns all rows for (1) it should do the same for (2).  Should one
> > > > use pagination with aggregates (2) and scrolling for simple queries
> > > > (1) ?
> > > > What if user specifies a high `LIMIT` (eg 10M) should we batch the
> > > result ?
> > > > Should user explicitly enable scrolling (eg. JDBC api) ? If so,
> should
> > > > one throw an exception when scrolling and aggregation is used or
> > > > quietly default to pagination ?
> > > >
> > > > Thanks everybody for your input, it helps.
> > > >
> > > > Regards,
> > > > Andrei.
> > > >
> > > >
> > > > On Thu, Oct 25, 2018 at 2:08 PM
> > > > Kevin Risden
> > > > <krisden@xxxxxxxxxx> wrote:
> > > > >
> > > > > >
> > > > > > There is one more “issue”. Currently select * from elastic
> returns
> > at
> > > > most
> > > > > > 10 rows (in calcite). This is consistent with elastic behaviour
> > which
> > > > > > limits result set to 10 documents (unless size is specified).
> > > > >
> > > > >
> > > > > In Solr land for the Calcite integration it uses the /export
> handler
> > or
> > > > > streaming expressions when there is no limit specified (these are
> > just
> > > > like
> > > > > scrolling or called CursorMark in Solr). It falls back to /select
> > with
> > > > the
> > > > > default limit of 10 rows.
> > > > >
> > > > > I think that scroll should be the default almost always since it is
> > the
> > > > > only way to return all the results. Even with a limit = 100000
> would
> > > > cause
> > > > > problems with the default query handler. Paging deep into the
> result
> > > set
> > > > is
> > > > > not something that Lucene (Elasticsearch or Solr) does well.
> > > > >
> > > > >
> > > > > Kevin Risden
> > > > >
> > > > >
> > > > > On Thu, Oct 25, 2018 at 1:31 PM Julian Hyde <jhyde@xxxxxxxxxx>
> > wrote:
> > > > >
> > > > > > Do you need to generate a different plan (i.e. a different tree
> of
> > > > > > RelNodes) for scrolling vs non-scrolling? If so, it’s certainly
> > > > > > inconvenient that you don’t know until execute time whether they
> > want
> > > > > > scrolling. A possible solution would be to generate TWO plans -
> one
> > > > > > scrolling, one non-scrolling inside the prepared statement - and
> > pick
> > > > which
> > > > > > one based on the runtime context.
> > > > > >
> > > > > > Julian
> > > > > >
> > > > > > > On Oct 25, 2018, at 1:42 AM, Christian Beikov <
> > > > > > christian.beikov@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > Hey Andrei,
> > > > > > >
> > > > > > > I don't have an answer for how you can access these settings
> from
> > > > within
> > > > > > the adapter nor how one could do that via RelNodes but the
> > suggestion
> > > > to
> > > > > > use DataContext for that purpose sounds reasonable. Maybe someone
> > > else
> > > > has
> > > > > > an idea?
> > > > > > >
> > > > > > > Anyway, since these are settings that don't affect the general
> > > > semantics
> > > > > > of the query/statement and also usually require a special API to
> be
> > > > used,
> > > > > > I'd rather see these aspects not end up in the query string.
> > > > > > >
> > > > > > > Am 25.10.2018 um 02:15 schrieb Andrei Sereda:
> > > > > > >> Christian,
> > > > > > >>
> > > > > > >> I like TYPE_SCROLL_INSENSITIVE / fetchSize in
> PreparedStatement
> > > > > > >> generally but have some reservations (questions) :
> > > > > > >>
> > > > > > >> How to pass resultSetType / fetchSize from PreparedStatement
> to
> > > > > > RelNodes ?
> > > > > > >> What if user doesn’t use JDBC (eg. RelBuilders) ?
> > > > > > >> On Wed, Oct 24, 2018 at 6:28 PM Christian Beikov
> > > > > > >> <christian.beikov@xxxxxxxxx> wrote:
> > > > > > >>> In JDBC one can configure a fetch size which would reflect
> the
> > > > amount
> > > > > > of
> > > > > > >>> rows to be fetched initially, but also subsequently.
> > > > > > >>>
> > > > > >
> > > >
> > >
> >
> https://docs.oracle.com/javase/7/docs/api/java/sql/Statement.html#setFetchSize(int)
> > > > > > >>>
> > > > > > >>> According to what you are writing, ES behvior is what
> > > > > > >>> TYPE_SCROLL_INSENSITIVE would do i.e. provide a snapshot view
> > > that
> > > > > > isn't
> > > > > > >>> affected by changes.
> > > > > > >>>
> > > > > > >>> IMO TYPE_SCROLL_SENSITIVE means that if you have rows R1, R2,
> > R3,
> > > > R4,
> > > > > > >>> ... and view R1, R2, then R3 is deleted and you fetch the
> next
> > > > rows,
> > > > > > you
> > > > > > >>> wouldn't see R3.
> > > > > > >>>
> > > > > > >>> According to the JDBC spec
> > > > > > >>> (
> > > > > >
> > > >
> > >
> >
> https://docs.oracle.com/javase/7/docs/api/java/sql/Connection.html#prepareStatement(java.lang.String,%20int,%20int)
> > > > > > >>> ) you don't have to support all modes. Usually, user code
> > doesn't
> > > > use
> > > > > > >>> scrolling that much, but at least forward scrolling makes
> > sense.
> > > > > > >>>
> > > > > > >>> Am 24.10.2018 um 21:38 schrieb Andrei Sereda:
> > > > > > >>>> Hi Julian,
> > > > > > >>>>
> > > > > > >>>> Scrolling (in elastic) does not only mean “open a cursor”
> but
> > > also
> > > > > > iterate
> > > > > > >>>> over consistent snapshot. From docs:
> > > > > > >>>>
> > > > > > >>>> The results that are returned from a scroll request reflect
> > the
> > > > state
> > > > > > of
> > > > > > >>>> the index at the time that the initial search request was
> > made,
> > > > like a
> > > > > > >>>> snapshot in time. Subsequent changes to documents (index,
> > update
> > > > or
> > > > > > delete)
> > > > > > >>>> will only affect later search requests.
> > > > > > >>>>
> > > > > > >>>> So pagination (fetch / offset) can’t exactly replicate this
> > > > > > functionality.
> > > > > > >>>>
> > > > > > >>>> The problem with scrolling (in elastic) is that it is
> > expensive
> > > > and
> > > > > > can’t
> > > > > > >>>> (shouldn’t) be enabled it by default.
> > > > > > >>>>
> > > > > > >>>> There is one more “issue”. Currently select * from elastic
> > > > returns at
> > > > > > most
> > > > > > >>>> 10 rows (in calcite). This is consistent with elastic
> > behaviour
> > > > which
> > > > > > >>>> limits result set to 10 documents (unless size is
> specified).
> > > When
> > > > > > >>>> returning a cursor (eg. using JDBC TYPE_SCROLL_SENSITIVE
> > > > > > >>>> <
> > > > > >
> > > >
> > >
> >
> https://docs.oracle.com/javase/7/docs/api/java/sql/ResultSet.html#TYPE_SCROLL_SENSITIVE
> > > > > > >
> > > > > > >>>> or SQL hint) does it mean return whole elastic index ? I’m
> not
> > > at
> > > > > > ease with
> > > > > > >>>> returning different results based on hints or cursor
> settings.
> > > > > > >>>>
> > > > > > >>>> Andrei.
> > > > > > >>>>
> > > > > > >>>> On Wed, Oct 24, 2018 at 3:02 PM Julian Hyde <
> > > > jhyde.apache@xxxxxxxxx>
> > > > > > wrote:
> > > > > > >>>>
> > > > > > >>>>> It seems to me that Elasticsearch scroll means return a
> > cursor
> > > -
> > > > a
> > > > > > >>>>> collection of rows that you iterate over, and you may not
> > read
> > > > all
> > > > > > of them.
> > > > > > >>>>> This is the default operation of JDBC.
> > > > > > >>>>>
> > > > > > >>>>> So, I guess we need to give the user a way to signal their
> > > > intent to
> > > > > > read
> > > > > > >>>>> all rows versus only the first few. Oracle’s FIRST_ROWS and
> > > > ALL_ROWS
> > > > > > >>>>> hints[1] seem close to this. We would want the hints to be
> > > acted
> > > > > > upon by
> > > > > > >>>>> both the optimizer and the JDBC transport.
> > > > > > >>>>>
> > > > > > >>>>> Related is pagination. SQL has FETCH and OFFSET, which
> allow
> > > you
> > > > to
> > > > > > >>>>> retrieve different pieces of a large result set in separate
> > > > > > statements or
> > > > > > >>>>> (using query parameters) executions. It would be useful if
> > the
> > > > > > server could
> > > > > > >>>>> be given a hint to cache a statement across page requests.
> > > > > > >>>>>
> > > > > > >>>>> Julian
> > > > > > >>>>>
> > > > > > >>>>> [1]
> > > > > > >>>>>
> > > > > >
> > > >
> > https://docs.oracle.com/cd/B10500_01/server.920/a96533/hintsref.htm#4924
> > > > > > >>>>>
> > > > > > >>>>>> On Oct 24, 2018, at 11:19 AM, Christian Beikov <
> > > > > > >>>>> christian.beikov@xxxxxxxxx> wrote:
> > > > > > >>>>>> Hey,
> > > > > > >>>>>>
> > > > > > >>>>>> not sure if this should be an SQL keyword. JDBC specifies
> > > > various
> > > > > > >>>>> constants that can be used at statement creation time:
> > > > > > >>>>>
> > > > https://docs.oracle.com/javase/7/docs/api/java/sql/ResultSet.html
> > > > > > >>>>>> Not sure though if or how these configurations are
> > accessible
> > > > for
> > > > > > data
> > > > > > >>>>> stores or dialects, but IMO using these would be the proper
> > > way.
> > > > > > >>>>>> Regards
> > > > > > >>>>>>
> > > > > > >>>>>> Christian
> > > > > > >>>>>>
> > > > > > >>>>>>> Am 24.10.2018 um 18:44 schrieb Andrei Sereda:
> > > > > > >>>>>>> Hello,
> > > > > > >>>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>>>> I was thinking about adding [scrolling functionality](
> > > > > > >>>>>>>
> > > > > > >>>>>
> > > > > >
> > > >
> > >
> >
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
> > > > > > >>>>> )
> > > > > > >>>>>>> to elastic search adapter. Since scrolling has
> > non-negligible
> > > > > > effect on
> > > > > > >>>>> the
> > > > > > >>>>>>> cluster it should be selectively enabled on per query
> > basis.
> > > > So,
> > > > > > likely,
> > > > > > >>>>>>> user has to explicitly set "scroll flag" somewhere.
> > > > > > >>>>>>>
> > > > > > >>>>>>> Most natural way seems in SQL. [Calcite sql grammar](
> > > > > > >>>>>>> https://calcite.apache.org/docs/reference.html) has
> > `SCROLL`
> > > > > > keyword
> > > > > > >>>>>>> (unused to my knowledge). There were also discussions
> about
> > > > adding
> > > > > > >>>>> hints to
> > > > > > >>>>>>> Calcite.
> > > > > > >>>>>>>
> > > > > > >>>>>>> ### Examples
> > > > > > >>>>>>> ```sql
> > > > > > >>>>>>> -- special sql keyword ?
> > > > > > >>>>>>> SCROLL select * from elastic;
> > > > > > >>>>>>>
> > > > > > >>>>>>> -- assuming hints are available in calcite
> > > > > > >>>>>>> /* HINT: scroll */ select * from elastic;
> > > > > > >>>>>>> ```
> > > > > > >>>>>>>
> > > > > > >>>>>>> What people think about this use-case ? Are there better
> > > ideas
> > > > ?
> > > > > > >>>>>>>
> > > > > > >>>>>>> Regards,
> > > > > > >>>>>>> Andrei.
> > > > > > >>>>>>>
> > > > > >
> > > > > >
> > > >
> > >
> >
>