logo       

rx_compile and FSAs: msg#00304

lang.perl.perl6.internals

Subject: rx_compile and FSAs

I've done quite a lot of thinking about Parrot's rx_compile op, as I was
thinking about implementing it. However, I've come to the conclusion
that the definition of the op as it stands is too shallow. Please
consider this definition and let me know if implementing it would be
worth it to Parrot as a whole, or if this is a case of being too
generic.

rx_compile out P0, in S0, in I0

Produce an FSA continuation (see fsa_to_continuation) in P0
which matches the regular expression in S0. The syntax of the
regular expression is determined by the type value in I0. This
type must be a valid type as returned by rx_load_type. This op
is simply a combination of rx_to_fsa and fsa_to_continuation

rx_to_fsa out P0, in S0, in I0

Pass the regular expression string parameter to the specfied
regular expression parser based on the type parameter (as
returned from rx_load_type). Returns an FSA PMC.

rx_load_type out I0, in S0

Dynamically load an rx compiler by name and set the first
parameter to the identifier for that compiler (for use with
rx_compile). The default, minimalistic regular expression syntax
has identifier 0. The compiler itself must invoke fsa_new at
some point in order to generate its return value.

fsa_new out P0, in I0, in P1, in P3, in I1

Given inputs: max state, alphabet, transition matrix and start
state, produce an FSA output object which implements the
requirements. The alphabet can be one of several values, TBD,
but noted that "characters" are only one possible type of
alphabet. The transition matrix is an array of arrays, each of
which contains 3 to 5 values: start state, input range(s),
target state and an optional integer set to 1 or 0 to indicate
if this is a final state or not and an optional integer set to
0, 1 or 2 to indicate if this state consumes its input and
records it (0), consumes its input and does not record it (1) or
does not consume the input token. Input ranges are going to be
similar values to alphabet. More detail may be added to the
transition matrix as needed. Note that target state may end up
needing to be either an integer state value or a continuation.

fsa_to_continuation out P0, in P1

Take as input a valid FSA and return a continuation which, when
invoked, will simulate the FSA on its input parameter, and
return an FSA status PMC. If the FSA has never been compiled to
Parrot, this op compiles it, otherwise the same continuation is
returned.

For all intents and purposes, assume that this is a black box,
and although in most cases the continuation returned will invoke
newly generated bytecode (representing the FSA's state
transitions), that need not be the case.

fsa_minimize out P0, in P1

Attempt to minimize the FSA and return a new FSA as output.

The goal here is to provide a generic FSA mechanism which can accomodate
the various regular expression syntaxes as input, or can be generated
"by hand" by a compiler writer who wishes to have control over the
details (or provide FSA semantics over more complex tokens than just
characters). Either way, the result is an FSA which can be executed (aka
"simulated") by invoking its continuation.

--
Aaron Sherman <ajs@xxxxxxx>
Senior Systems Engineer and Toolsmith
"It's the sound of a satellite saying, 'get me down!'" -Shriekback





<Prev in Thread] Current Thread [Next in Thread>
Google Custom Search

News | FAQ | advertise