
Re: Relational algebra and signal processing


I think the difficulty with JulianF’s signal processing domain is that he needs there to be precisely one record at every clock tick (or more generally, at every point in an N-dimensional discrete space).

Consider stock trading. A stock trade is an event that happens in continuous time, say

  (9:58:02 ORCL 41), (10:01:55 ORCL 43)

Our query wants to know the stock price at 10:00 (or at every 1-minute boundary). Therefore we have to convert the event-oriented data into a dense array, carrying the last price forward until the next trade:

  (9:59 ORCL 41), (10:00 ORCL 41), (10:01 ORCL 41), (10:02 ORCL 43).

JulianF’s domain may be more naturally in the realm of array databases [1], but there are a lot of advantages to relational algebra and SQL, not least that we have a reasonable story for streaming data, so let’s try to bridge the gap. Suppose we add a FILL operator that converts an event-based relation into a dense array:

 SELECT *
 FROM TABLE(FILL(Trades, 'ROWTIME', INTERVAL '1' MINUTE))

Now we can safely join with other data at the same granularity.
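
For example (Quotes, TICKER, PRICE and BID are illustrative names; assume Quotes has been filled to the same 1-minute grid):

  SELECT t.ROWTIME, t.TICKER, t.PRICE, q.BID
  FROM TABLE(FILL(Trades, 'ROWTIME', INTERVAL '1' MINUTE)) AS t
  JOIN TABLE(FILL(Quotes, 'ROWTIME', INTERVAL '1' MINUTE)) AS q
    ON t.ROWTIME = q.ROWTIME AND t.TICKER = q.TICKER

Because every minute is present on both sides, this is a plain equi-join; no “closest timestamp” logic is needed.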

Is this a step in the right direction?

Julian

[1] https://en.wikipedia.org/wiki/Array_DBMS

> On Dec 18, 2018, at 7:05 AM, Michael Mior <mmior@xxxxxxxxxx> wrote:
> 
> I would say a similar theory applies. Some things are different when you're
> dealing with streams, mainly joins and aggregations. Semantics are
> necessarily different whenever you have operations involving more than one
> row at a time from the input stream. When dealing with a relation, an
> aggregation is straightforward: you just consume the entire input and
> output the result of the aggregation. Since streams don't end, you need to
> decide how this is handled, which usually amounts to a choice of windowing
> algorithm. There are a few other things to think about. The presentation
> linked below from Julian Hyde has a nice overview.
> 
> https://www.slideshare.net/julianhyde/streaming-sql-62376119
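> 
> For instance, a tumbling-window aggregation in Calcite's streaming SQL
> groups the unbounded stream into finite hourly buckets (the Orders
> stream and its columns are illustrative):
> 
>   SELECT STREAM FLOOR(rowtime TO HOUR) AS rowtime,
>     productId,
>     COUNT(*) AS c
>   FROM Orders
>   GROUP BY FLOOR(rowtime TO HOUR), productId;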
> 
> --
> Michael Mior
> mmior@xxxxxxxxxx
> 
> 
> On Tue, Dec 18, 2018 at 02:28, Julian Feinauer <j.feinauer@xxxxxxxxxxxxxxxxx>
> wrote:
> 
>> Hi Michael,
>> 
>> Yes, our workloads are usually in the context of streaming (but for replay
>> and the like we also use batch).
>> But, if I understand it correctly, the same theory applies to both tables
>> ("relations") and streaming tables, right?
>> I hope to find time soon to write a PLC4X-Calcite source which creates
>> one or more streams based on readings from a PLC.
>> 
>> Julian
>> 
>> On 18.12.18, 03:19, "Michael Mior" <mmior@xxxxxxxxxx> wrote:
>> 
>>    Perhaps you've thought of this already, but it sounds like streaming
>>    relational algebra could be a good fit here.
>> 
>>    https://calcite.apache.org/docs/stream.html
>>    --
>>    Michael Mior
>>    mmior@xxxxxxxxxx
>> 
>> 
>>    On Sun, Dec 16, 2018 at 18:39, Julian Feinauer
>> <j.feinauer@xxxxxxxxxxxxxxxxx> wrote:
>> 
>>> Hi Calcite-devs,
>>> 
>>> I just had a very interesting mail exchange with Julian (Hyde) on the
>>> incubator list [1]. It was about our project CRUNCH (which is mostly
>>> about time series analyses and signal processing) and its relation to
>>> relational algebra, and I wanted to bring the discussion to this list
>>> to continue here. We already had some discussion about how time series
>>> would work in Calcite [2], and it's closely related to MATCH_RECOGNIZE.
>>> 
>>> But I have a more general question in mind, to ask the experts here on
>>> the list: I wonder whether we can see signal processing and analysis
>>> tasks as a proper application of relational algebra.
>>> Disclaimer: I'm a mathematician, so I know the formalism of (relational)
>>> algebra pretty well, but I'm lacking a lot of experience and knowledge
>>> in database theory. (Most of my knowledge there comes from Calcite's
>>> source code and the book by Garcia-Molina and Ullman.)
>>> 
>>> So if we take, for example, a stream of signals from a sensor, then we
>>> can of course do filtering or smoothing on it, and this can be seen as
>>> a mapping between the input relation and the output relation. But as
>>> we usually need more than just one tuple at a time, we lose many of
>>> the advantages of relational theory (see the sketch below).
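>>>
>>> For instance, a five-sample moving average needs a window of rows
>>> (Signals, rowtime and val are made-up names):
>>>
>>>   SELECT rowtime,
>>>       AVG(val) OVER (ORDER BY rowtime
>>>         ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS smoothed
>>>   FROM Signals;
>>>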
>>> And then, if we analyze the signal, we can again model it as a mapping
>>> between relations, but the input relation is a "time series" and the
>>> output relation consists of "events", so these are in some way
>>> different dimensions. In this situation it becomes most obvious where
>>> the main differences between time series and relational algebra are.
>>> Think of something simple: an event should be registered whenever the
>>> signal switches from FALSE to TRUE (so not for every TRUE). This could
>>> also be modelled with MATCH_RECOGNIZE pretty easily (see the sketch
>>> below). But, for me, it feels "unnatural" because we cannot use any
>>> indices (we don't care about the ratio of TRUE and FALSE in the DB,
>>> except for probably some very rough outer bounds). And we are lacking
>>> the "right" information for the optimizer, like estimations on the
>>> number of analysis results.
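>>>
>>> A sketch of the FALSE-to-TRUE detection (Signals, rowtime and val are
>>> made-up names):
>>>
>>>   SELECT *
>>>   FROM Signals MATCH_RECOGNIZE (
>>>     ORDER BY rowtime
>>>     MEASURES UP.rowtime AS event_time
>>>     ONE ROW PER MATCH
>>>     AFTER MATCH SKIP PAST LAST ROW
>>>     PATTERN (DOWN UP)
>>>     DEFINE DOWN AS DOWN.val = FALSE,
>>>            UP AS UP.val = TRUE
>>>   ) AS edges;
>>>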
>>> It gets even more complicated when moving to continuous-valued signals
>>> (INT, DOUBLE, …), e.g., temperature readings or something.
>>> If we want to count the number of times the temperature changes by
>>> more than 5 degrees in under 4 hours, this should also be doable with
>>> MATCH_RECOGNIZE (again, a sketch below), but there is no index to help
>>> us and we have no information for the optimizer, so it feels very
>>> "black box" to the relational algebra.
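>>>
>>> Again as a sketch (Temps, rowtime and temp are made-up names): match a
>>> starting reading A, any rows B in between, and a reading C within 4
>>> hours of A that differs by more than 5 degrees:
>>>
>>>   SELECT *
>>>   FROM Temps MATCH_RECOGNIZE (
>>>     ORDER BY rowtime
>>>     MEASURES A.rowtime AS start_time, C.rowtime AS end_time
>>>     ONE ROW PER MATCH
>>>     AFTER MATCH SKIP PAST LAST ROW
>>>     PATTERN (A B* C)
>>>     DEFINE B AS B.rowtime < A.rowtime + INTERVAL '4' HOUR,
>>>            C AS C.rowtime < A.rowtime + INTERVAL '4' HOUR
>>>              AND ABS(C.temp - A.temp) > 5
>>>   ) AS changes;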
>>> 
>>> I'm not sure if you get my point, but for me, the elegance of
>>> relational algebra was always this optimization story, which starts
>>> from a declarative query and ends in an "optimal" physical plan. And I
>>> do not see how we can use much of this for the examples given above.
>>> 
>>> Perhaps one solution would be to do the same as for spatial queries
>>> (or the JSON / JSONB support in Postgres [3]): add specialized
>>> indices, statistics, and optimizer rules. Then this would make it more
>>> "relational algebra"-esque, in the sense that there really is a
>>> possibility to apply transformations to a given query.
>>> 
>>> What do you think? Do I see things too complicated, or am I missing
>>> something?
>>> 
>>> Julian
>>> 
>>> [1]
>>> https://lists.apache.org/thread.html/1d5a5aae1d4f5f5a966438a2850860420b674f98b0db7353e7b476f2@%3Cgeneral.incubator.apache.org%3E
>>> [2]
>>> https://lists.apache.org/thread.html/250575a56165851ab55351b90a26eaa30e84d5bbe2b31203daaaefb9@%3Cdev.calcite.apache.org%3E
>>> [3] https://www.postgresql.org/docs/9.4/datatype-json.html
>>> 
>>> 