logo       
Bookmark and Share

a self-optimizing alternative parser: msg#00025

parsers.spirit.devel

Subject: a self-optimizing alternative parser

Hi,

I find this idea so great, that I hope it's worth a forward and some more
thoughts...
It's inspired by Elviin's post and Carl Barron's response on spirit.general
lately (conversation and names got cut).

Tobias Schwinger wrote:

We're talking about a self-adjusting alternative parser -- one that tries the most probable (based on initialization and experience) alternative first?

That's pretty cool of a thing!

<snip code>

OK. I might have a simpler algorithm for you:

First of all we, store the number of hits rather than probabilities.
The probabilities can be calculated like this:

prob(parser[i]) = hits(parser[i]) / sum(j: hits(parser[j]))

We start with a sorted array of hit_count-parser-pairs (where hit_count is the sorting criterion):

hits | parser
--------------------------
10 | pizza_parser
9 | gnocci_parser
9 | lasagne_parser
8 | spaghetti_parser
--------------------------
sum: 36

Sidenote: say the values come from pre-initialization, here -- zero values would work too, but I'd have to run the algorithm several times to make things transparent for the example.

Sidenote: the probability of e.g. "pizza_parser" would be currently at 10/36 ~= 0.28.

OK. Now we process the table from top to bottom. Every time the hit count changes (compared to the predecessor) we remember the current index:

We start with 10 and the index of "pizza_parser" (0). Then we remember the index of the "gnocci_parser" (1) because its hit count is less than the hit count of its "pizza_parser". The remembered index is left unchanged when we step forward to "lasagne_parser", because it has the same hit count than "gnocci_parser".

Let's say "lasagne_parser" matches our input. Now the hit count increments to 10. We swap positions with "gnocci_parser" (at our remembered index) and this sorts our table, then we increment the total hit count (so we can calculate the probability in O(1), later) and leave -- our parse was successful.

If none of our parsers matches, our parse failed. The table is left unchanged in this case.

Pseudo code:

swap_index = 0;
current_hit_count = 0;
for(i = 0; i < size(table); ++i)
{
if (table[i].hit_count != current_hit_count)
{
current_hit_count = table[i].hit_count;
swap_index = i;
}

if (table[i].parser.parse( /* [...] */ ))
{
++table[i].hit_count;
swap(table[i],table[swap_index]);

++total_hit_count;

return match;
}
}
return no_match;

With tons of input we might want to add some code to scale down the counters before there's an overflow. Further note, that I did not address backtracking issues in the pseudo code (as Spirit 1.2 would require us to restore the scanner position on no match).

struct grammar_base
{
virtual int which() const = 0;
vitual ~grammar_base(){}
};


Given a scanner type (talking Spirit 1.X), a pointer to a function (template instantiation) that contains a static cast and the invocation of parse could be used instead. This way we could implement the polymorphism without the requirement of a common base class and with more efficiency.


The information in this last paragraph might be too dense and maybe some code
is useful to illustrate what I'm talking about:

template<typename Parser, typename Scanner>
bool parse_function(void* parser, Scanner & scan)
{
return static_cast<Parser*>(parser)->parse(scan).hit;
}

Now the expression "& parse_function<Parser,Scanner>" gives us a function
pointer
of the type "bool(*)(void*,Scanner & scan)".

A probability parser instance would use a fixed scanner type (given as template argument,
just like "rule") so the scanner type is known and the parser type vanishes, so
parsers can be stored as a void pointer (this technique must be applied again for
destruction and maybe copying, I don't bother here for the sake of simplicity).

In Spirit-2 (based on the information that there is so far) we're talking about
iterators instead of scanners (it makes the client interface more convenient).
Still, fixed scanner type or fixed iterator type -- it's still somewhat
inconvenient. Further there's one dynamic call that won't go away.

Given we have a tuple of parsers we can use a (runtime) index instead of a void
pointer to as a reference to a parser and code like that:

switch(parser_index)
{
case 0: return at_c<Parsers,0>(parsers)->parse [...]
case 1: return at_c<Parsers,1>(parsers)->parse [...]
case 2: return at_c<Parsers,2>(parsers)->parse [...]
// [...]
}
// naive code for illustration (fusion::accumulate // allows us to implement a scalable version)

The nicest thing about it is: we don't have to hard-wire scanner or iterator.

For a small number of component parsers we might have more efficient dispatch
(especially on modern CPUs with BTC), however, if the number of parsers gets
too high, the dynamic call O(1) becomes faster than the cascade of conditional
branches O(N).

Since the O(1) version implies a less convenient interface we should push
things even further by bringing the dispatcher into a tree structure to have
O(log N) so things still work well with comparatively many component parsers.

Comments welcome!


Regards,

Tobias



-------------------------------------------------------
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642


<Prev in Thread] Current Thread [Next in Thread>
Google Custom Search

News | Mail Home | sitemap | FAQ | advertise