
Pythonic custom multi-line parsers

On 7/10/19 10:50 AM, Johannes Bauer wrote:
> Hi list,
> I'm looking for ideas as to a pretty, Pythonic solution for a specific
> problem that I am solving over and over but where I'm never happy about
> the solution in the end. It always works, but never is pretty. So see
> this as an open-ended brainstorming question.
> Here's the task: There's a custom file format. Each line can be parsed
> individually and, given the current context, the meaning of each
> individual line is always clearly distinguishable. I'll give an easy
> example to demonstrate:
> moo = koo
> bar = foo
> foo :=
>     abc
>     def
> baz = abc
> Let's say the root context knows only two regexes and give them names:
> keyvalue: \w+ = \w+
> start-multiblock: \w+ :=
> The keyvalue is self-contained: once the line is successfully
> parsed, all the information is present. The start-multiblock however
> gives us only part of the puzzle, namely the name of the following
> block. In the multiblock context, there are different regexes that can
> match (actually only one):
> multiblock-item: \s\w+
> Now obviously when the block is finished, there's no delimiter. It's
> implicit by the multiblock-item regex not matching and therefore we
> backtrack to the previous parser (root parser) and can successfully
> parse the last line baz = abc.
> Especially consider that even though this is a simple example, generally
> you'll have multiple contexts, many more regexes and especially nesting
> inside these contexts.
> Without having to use a parser generator (for those the examples I deal
> with are usually too much overhead) what I usually end up doing is
> building a state machine by hand. I.e., I memorize the context, match
> those and upon no match manually delegate the input data to backtracked
> matchers.
> This results in AWFULLY ugly code. I'm wondering what your ideas are to
> solve this neatly in a Pythonic fashion without having to rely on
> third-party dependencies.
> Cheers,
> Joe

That's pretty much what I do.  I generally make the parser a class and each 
state a method.  Every line the parser takes out of the file it passes to 
self.statefn, which processes the line in the current context and updates 
self.statefn to a different method if necessary.
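
As a rough sketch of that approach applied to the example format above 
(the class name BlockParser, the method names, and the result layout are 
all made up for illustration; the item regex is loosened to \s+(\w+) and 
blank lines are skipped, both assumptions on my part):

```python
import re


class BlockParser:
    """Line-oriented parser; each state is a method, self.statefn is the current one."""

    KEYVALUE = re.compile(r"(\w+) = (\w+)")
    START_MULTIBLOCK = re.compile(r"(\w+) :=")
    MULTIBLOCK_ITEM = re.compile(r"\s+(\w+)")  # assumption: any leading whitespace

    def __init__(self):
        self.result = {}
        self.block = None  # list collecting items of the open multiblock
        self.statefn = self.state_root

    def parse(self, lines):
        for line in lines:
            line = line.rstrip()
            if line:  # assumption: blank lines carry no meaning
                self.statefn(line)
        return self.result

    def state_root(self, line):
        m = self.KEYVALUE.fullmatch(line)
        if m:
            self.result[m.group(1)] = m.group(2)
            return
        m = self.START_MULTIBLOCK.fullmatch(line)
        if m:
            # Only the block name is known so far; switch state to collect items.
            self.block = self.result[m.group(1)] = []
            self.statefn = self.state_multiblock
            return
        raise ValueError(f"unparseable line: {line!r}")

    def state_multiblock(self, line):
        m = self.MULTIBLOCK_ITEM.fullmatch(line)
        if m:
            self.block.append(m.group(1))
            return
        # No match means the block ended implicitly: go back to the root
        # state and hand it the same line again.
        self.statefn = self.state_root
        self.statefn(line)


text = "moo = koo\nbar = foo\nfoo :=\n    abc\n    def\nbaz = abc\n"
parsed = BlockParser().parse(text.splitlines())
print(parsed)
# {'moo': 'koo', 'bar': 'foo', 'foo': ['abc', 'def'], 'baz': 'abc'}
```

The "backtracking" is just the last two lines of state_multiblock: reset 
self.statefn and re-dispatch the line.  For nested contexts, keeping a 
stack of state methods instead of a single attribute would probably work 
the same way, though I haven't shown that here.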

Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order.  See above to fix.