"Data blocks" syntax specification draft

On Tue, May 22, 2018 at 9:01 AM, Christian Gollwitzer <auriocus at gmx.de> wrote:
> Am 22.05.18 um 04:17 schrieb Mikhail V:
>>> YAML comes to mind
>> Actually plugging a data syntax in existing language is not a new idea.
>> Though I don't know real success stories.
> Thing is, you can do it already now in the script, without modifying the
> Python interpreter, by parsing a triple-quoted string. See the examples
> right here: http://pyyaml.org/wiki/PyYAMLDocumentation

Yes. That is exactly what I wanted to discuss actually.
So the feature that makes it possible in this case is the
triple-quoted string (TQS).

I think it would be appropriate to propose an alternative
to TQS for this specific purpose, namely for making it
easier to implement parsers and embedded syntaxes.

So what do I have now with triple-quoted strings?
A simple example:

if 1:
    s = """\
    print ("\n") \\
        foo = 5
    """

So it is _possible_ to do this today: let's say I have a lib with a
parser, etc. But both the developer and the user will face quite real
issues:

- TQS already has its own specific purpose in many contexts,
  which may mean, for example, hard-coded syntax highlighting.
- There are a lot of things happening here: e.g. in the above example
  I use "\n", which I intend as part of the string, and \\ - but both
  are interpreted as escapes. There may be other escaping rules as well.
  This issue may be a blocker for using TQS in some data cases,
  say if the target source text needs these very characters.
- Indentation is part of the TQS. That is of course by design
  and quite logical, but it is hard-coded behaviour, so the string's
  presentation does not fit naturally into the block that contains it.
- Appearance: imagine you have several small chunks of embedded
  code, and you still have the closing """ everywhere -
  that would be really hairy.
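
To make the escaping and indentation points above concrete, here is a small sketch of what current Python does with such a string. The function names `cooked` and `raw` are made up for this illustration; raw strings and `textwrap.dedent` are the usual workarounds available today, not part of the proposal.

```python
import textwrap


def cooked():
    # Ordinary triple-quoted string: "\n" and "\\" are processed as escapes,
    # and the four spaces of source indentation end up inside the value.
    return """\
    print ("\n") \\
        foo = 5
    """


def raw():
    # Raw triple-quoted string: backslashes survive verbatim,
    # but the source indentation is still part of the value.
    return r"""
    print ("\n") \\
        foo = 5
    """


c = cooked()
r = raw()

print(repr(c))                   # escapes already interpreted
print(repr(r))                   # backslashes preserved
print(repr(textwrap.dedent(r)))  # common indentation removed after the fact
```

So the text can be recovered, but only by combining a raw prefix with a post-processing call, and the closing quotes still have to appear.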

The alternative proposal therefore comes down to a "data block" syntax,
without much assumption about the contents of the block.

This should be simpler to implement, because it should not need a lot
of parsing rules - only some basic options. At the same time it enables
the 'embedding' of user-defined blocks/syntax in a way that looks
more natural than TQS.

My thoughts on a possible solution.

Problem one: make it look natural inside Python source.
Current Python behaviour: very simply speaking, the leading whitespace
on a line is taken and compared with that of the next line.
It's okay for statements, but not okay for raw text data,
because I probably want custom leading whitespace:

string =

(TQS keeps it simple: it grabs everything from the beginning of the line.)

So the idea:
- add such a block to the syntax
- *force an explicit parameter for the indent characters.*

[Here I'll use the same symbol /// for the data entry point, but of course
it can be changed if a better idea comes later. Also for now, just for
simplicity, the rule is that the contents of a block always start on a new
line.]

So, e.g. this:

data = /// s4
    first line
    last line
the rest python code

- will parse the block and strip the 4 leading spaces,
i.e. if the first line has 5 leading spaces then 1 space will be left
in the string. Block parsing terminates when the next line does not
satisfy the indent sequence (4 spaces in this case).
Another obvious type is tabs:

data = /// t1
    first line
    last line
the rest python code

This will do the same but with one tab character.
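
The stripping and termination rule described above could be sketched roughly as follows. This is only an illustration working on a list of source lines, not a tokenizer change; `read_block` and its parameters are hypothetical names, and the indent string is passed in directly rather than parsed from `s4` or `t1`.

```python
def read_block(lines, start, indent):
    """Collect lines from lines[start:] that begin with `indent`.

    The indent prefix is stripped from each collected line. Returns the
    block text and the index of the first line that is not part of it.
    """
    body = []
    i = start
    while i < len(lines) and lines[i].startswith(indent):
        body.append(lines[i][len(indent):])  # strip exactly the indent sequence
        i += 1
    return "\n".join(body), i


source = [
    "data = /// s4",
    "    first line",
    "     second line has five spaces",
    "    last line",
    "the rest python code",
]

# "s4" corresponds to an indent sequence of four spaces.
text, nxt = read_block(source, 1, " " * 4)
print(repr(text))
print(source[nxt])

# "t1" corresponds to a single tab character.
tab_text, _ = read_block(["data = /// t1", "\tfirst line", "\tlast line"], 1, "\t")
print(repr(tab_text))
```

Note that in this sketch an empty line also fails the indent test and therefore ends the block; how blank lines inside a block should be treated is left open here.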

Actually that's it!
Some further ideas:

data = /// ts
- "any whitespace" (mimic current Python behaviour)

data = /// s        # or
data = /// t
- simply count the number of spaces (tabs) on the first
  line and proceed with that count, otherwise terminate.

data = /// "???"
??? abc foo bar

- defines indent character by string: crazy idea but why not.

Language parameter, e.g.:
data = /// t1."yaml"

- this can be reserved for future usage by code analysis tools
or dynamic syntax highlighting.
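
As a rough illustration of how the parameter forms listed above (`s4`, `t1`, `ts`, `s`, `t`, `"???"`, `t1."yaml"`) might be recognised, here is a sketch using a regular expression. The `Descriptor` type, its field names, and `parse_descriptor` are all hypothetical; the exact grammar is an assumption, since only a rough specification is given.

```python
import re
from typing import NamedTuple, Optional


class Descriptor(NamedTuple):
    kind: str                # "s", "t", "ts", or "literal"
    count: Optional[int]     # None: take the count from the first line
    literal: Optional[str]   # indent string for the quoted form
    language: Optional[str]  # optional language tag such as "yaml"


_PATTERN = re.compile(
    r'^(?:(?P<kind>ts|s|t)(?P<count>\d+)?|"(?P<literal>[^"]+)")'
    r'(?:\."(?P<lang>[^"]+)")?$'
)


def parse_descriptor(spec):
    """Parse the text that follows `///` into a Descriptor."""
    m = _PATTERN.match(spec.strip())
    if m is None:
        raise ValueError(f"unrecognised block descriptor: {spec!r}")
    if m["literal"] is not None:
        return Descriptor("literal", None, m["literal"], m["lang"])
    count = int(m["count"]) if m["count"] else None
    return Descriptor(m["kind"], count, None, m["lang"])


for spec in ["s4", "t1", "ts", "s", '"???"', 't1."yaml"']:
    print(spec, "->", parse_descriptor(spec))
```

Keeping the language tag as a separate optional suffix means tools that only care about highlighting can read it without understanding the indent part.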

That's just a rough specification.

What it should give as a result:

1. No clash with current TQS rules - less worries
  about reserved characters.

2. Built-in indentation parsing parameter makes it more or
  less natural continuation of Python blocks and is char-precise,
  which is very important here.

3. Independent of the indent of containing block!

4. Parameter descriptor can be developed in such manner
   that it allows more customisation and additions in the future.

This seems to be a more general solution to the problem.

One problem, as usual: tabs may be implicitly converted
to spaces by some software. That obviously could break
something, but the same is true of any tabs, and it's not
a Python-specific problem.

Is there something I am missing here?
What caveats might there be with such an approach?