[Python-Dev] Why aren't escape sequences in literal strings handled by the tokenizer?

On 5/17/2018 3:01 PM, Larry Hastings wrote:
> I fed this into tokenize.tokenize():
>     b''' x = "\u1234" '''
> I was a bit surprised to see \Uxxxx in the output. Particularly because 
> the output (t.string) was a *string* and not *bytes*.

For those (like me) who have no idea how to use tokenize.tokenize's 
wacky interface, the test code is:

import io, tokenize
list(tokenize.tokenize(io.BytesIO(b''' x = "\u1234" ''').readline))
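
A slightly expanded sketch of that test, in case it helps anyone reproduce the observation. It assumes a raw bytes literal (so Python itself leaves the backslash alone) and drops the leading indentation; the helper name string_tokens is made up for illustration and is not part of tokenize:

```python
import io
import tokenize

def string_tokens(source: bytes):
    """Return the STRING tokens that tokenize finds in source (hypothetical helper)."""
    readline = io.BytesIO(source).readline
    return [tok for tok in tokenize.tokenize(readline)
            if tok.type == tokenize.STRING]

tokens = string_tokens(rb'x = "\u1234"' + b"\n")
for tok in tokens:
    # tok.string is a str holding the literal exactly as written in the
    # source, quotes and backslash escape included; nothing is decoded.
    print(type(tok.string).__name__, repr(tok.string))
```

The token text comes back as str because tokenize decodes the input bytes using the detected source encoding, but the escape sequence inside the literal is left untouched.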

> Maybe I'm making a parade of my ignorance, but I assumed that string 
> literals were parsed by the parser--just like everything else is parsed 
> by the parser, hey it seems like a good place for it--and in particular 
> that the escape sequence substitutions would be done in the tokenizer.  
> Having stared at it a little, I now detect a whiff of "this design 
> solved a real problem". So... what was the problem, and how does this 
> design solve it?

I assume the intent is to not throw away any information in the lexer, 
and give the parser full access to the original string. But that's just 
a guess.

> BTW, my use case is that I hoped to use CPython's tokenizer to parse 
> some Python-ish-looking text and handle double-quoted strings for me.  
> *Especially* all the escape sequences--leveraging all CPython's support 
> for funny things like \U{penguin}. The current behavior of the 
> tokenizer makes me think it'd be easier to roll my own!

Can you feed the token text to the ast?

 >>> ast.literal_eval('"\u1234"')
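
Roughly, that idea could be combined with the tokenizer as in the sketch below. This is only an illustration under a few assumptions: the input is well-formed enough for tokenize to accept, only plain string (or bytes) literals matter, and f-strings are ignored. The function name string_literal_values is invented for the example.

```python
import ast
import io
import tokenize

def string_literal_values(source: bytes):
    """Yield (raw_text, value) for each string literal token in source.

    Hypothetical helper: tokenize supplies the literal as written, and
    ast.literal_eval does the escape-sequence processing.
    """
    for tok in tokenize.tokenize(io.BytesIO(source).readline):
        if tok.type == tokenize.STRING:
            yield tok.string, ast.literal_eval(tok.string)

# Raw bytes so the backslash reaches the tokenizer unchanged.
source = rb'x = "\u1234"' + b"\n"
pairs = list(string_literal_values(source))
for raw, value in pairs:
    print(repr(raw), "->", repr(value), "length", len(value))
```

Here the raw token text is eight characters long, while the evaluated value is the single character U+1234, so the escape handling is delegated to the same code path CPython uses for literals.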