Commit 2bfe6d5

PEP 701: Incorporate more feedback from the discussion thread

Signed-off-by: Pablo Galindo <[email protected]>

1 parent b173099 commit 2bfe6d5

File tree: 1 file changed

pep-0701.rst (+102 -31 lines)
@@ -271,17 +271,19 @@ New tokens
 ----------
 
 Three new tokens are introduced: ``FSTRING_START``, ``FSTRING_MIDDLE`` and
-``FSTRING_END``. This PEP does not mandate the precise definitions of these tokens
-as different lexers may have different implementations that may be more efficient
-than the ones proposed here given the context of the particular implementation. However,
-the following definitions are provided as a reference so that the reader can have a
-better understanding of the proposed grammar changes and how the tokens are used:
+``FSTRING_END``. Different lexers may have different implementations that may be
+more efficient than the ones proposed here given the context of the particular
+implementation. However, the following definitions will be used as part of the
+public APIs of CPython (such as the ``tokenize`` module) and are also provided
+as a reference so that the reader can have a better understanding of the
+proposed grammar changes and how the tokens are used:
 
 * ``FSTRING_START``: This token includes the f-string character (``f``/``F``) and the open quote(s).
-* ``FSTRING_MIDDLE``: This token includes the text between the opening quote
-  and the first expression brace (``{``) and the text between two expression braces (``}`` and ``{``).
-* ``FSTRING_END``: This token includes everything after the last expression brace (or the whole literal part
-  if no expression exists) until the closing quote.
+* ``FSTRING_MIDDLE``: This token includes a portion of text inside the string that's not part of the
+  expression part and isn't an opening or closing brace. This can include the text between the opening quote
+  and the first expression brace (``{``), the text between two expression braces (``}`` and ``{``) and the text
+  between the last expression brace (``}``) and the closing quote.
+* ``FSTRING_END``: This token includes the closing quote.
 
 These tokens are always string parts and they are semantically equivalent to the
 ``STRING`` token with the restrictions specified. These tokens must be produced by the lexer
@@ -292,7 +294,7 @@ differently to the one used by the PEG parser).
 
 As an example::
 
-    f'some words {a+b} more words {c+d} final words'
+    f'some words {a+b:.3f} more words {c+d=} final words'
 
 will be tokenized as::
 
@@ -302,33 +304,85 @@ will be tokenized as::
     NAME - 'a'
     PLUS - '+'
     NAME - 'b'
+    OP - ':'
+    FSTRING_MIDDLE - '.3f'
     RBRACE - '}'
     FSTRING_MIDDLE - ' more words '
     LBRACE - '{'
     NAME - 'c'
     PLUS - '+'
     NAME - 'd'
+    OP - '='
     RBRACE - '}'
-    FSTRING_END - ' final words' (without the end quote)
+    FSTRING_MIDDLE - ' final words'
+    FSTRING_END - "'"
 
 while ``f"""some words"""`` will be tokenized simply as::
 
     FSTRING_START - 'f"""'
-    FSTRING_END - 'some words'
-
-One way existing lexers can be adapted to emit these tokens is to incorporate a stack of "lexer modes"
-or to use a stack of different lexers. This is because the lexer needs to switch from "regular Python
-lexing" to "f-string lexing" when it encounters an f-string start token and as f-strings can be nested,
-the context needs to be preserved until the f-string closes. Also, the "lexer mode" inside an f-string
-expression part needs to behave as a "super-set" of the regular Python lexer (as it needs to be able to
-switch back to f-string lexing when it encounters the ``}`` terminator for the expression part as well
-as handling f-string formatting and debug expressions). Of course, as mentioned before, is not possible to
-provide a precise specification of how this should be done as it will depend on the specific implementation
-and nature of the lexer to be changed.
-
-The specifics of how (or if) the ``tokenize`` module will emit these tokens (or others) and what
-is included in the emitted tokens are left out of this document and must be decided later in a regular
-CPython issue.
+    FSTRING_MIDDLE - 'some words'
+    FSTRING_END - '"""'
+
+Changes to the tokenize module
+------------------------------
+
+The ``tokenize`` module will be adapted to emit these tokens as described in the previous section
+when parsing f-strings so tools can take advantage of this new tokenization schema and avoid having
+to implement their own f-string tokenizer and parser.
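
For tool authors, a minimal sketch of what consuming this could look like; it assumes an interpreter on which this PEP is implemented, so that the f-string parts actually surface as ``FSTRING_START``/``FSTRING_MIDDLE``/``FSTRING_END`` tokens in the output::

    import io
    import tokenize

    # Dump the token stream of an f-string; on an interpreter implementing
    # this PEP, the f-string parts are expected to appear as FSTRING_START,
    # FSTRING_MIDDLE and FSTRING_END instead of a single STRING token.
    source = 'f"some words {a+b:.3f} more words"'
    for token in tokenize.generate_tokens(io.StringIO(source).readline):
        print(tokenize.tok_name[token.type], repr(token.string))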
+
+How to produce these new tokens
+-------------------------------
+
+One way existing lexers can be adapted to emit these tokens is to incorporate a
+stack of "lexer modes" or to use a stack of different lexers. This is because
+the lexer needs to switch from "regular Python lexing" to "f-string lexing" when
+it encounters an f-string start token and, as f-strings can be nested, the
+context needs to be preserved until the f-string closes. Also, the "lexer mode"
+inside an f-string expression part needs to behave as a "super-set" of the
+regular Python lexer (as it needs to be able to switch back to f-string lexing
+when it encounters the ``}`` terminator for the expression part as well as
+handling f-string formatting and debug expressions). For reference, here is a
+draft of the algorithm to modify a CPython-like tokenizer to emit these new
+tokens:
+
+1. If the lexer detects that an f-string is starting (by detecting the letter
+   'f/F' and one of the possible quotes), keep advancing until a character that is
+   not equal to the first character after the 'f/F' is found, and emit a ``FSTRING_START``
+   token with the contents captured (the 'f/F' and the starting quote). Push a new
+   tokenizer mode to the tokenizer mode stack for "F-string tokenization". Go to step 2.
+2. Keep consuming tokens until one of the following is encountered:
+
+   * A closing quote equal to the opening quote.
+   * An opening brace (``{``) or a closing brace (``}``) that is not immediately
+     followed by another opening/closing brace.
+
+   In all cases, if the character buffer is not empty, emit a ``FSTRING_MIDDLE``
+   token with the contents captured so far, but transform any double
+   opening/closing braces into single opening/closing braces. Now, proceed as
+   follows depending on the character encountered:
+
+   * If a closing quote matching the opening quote is encountered, go to step 4.
+   * If an opening bracket (not immediately followed by another opening bracket)
+     is encountered, go to step 3.
+   * If a closing bracket (not immediately followed by another closing bracket)
+     is encountered, emit a token for the closing bracket and go to step 2.
+
+3. Push a new tokenizer mode to the tokenizer mode stack for "Regular Python
+   tokenization within f-string" and proceed to tokenize with it. This mode
+   tokenizes as the "Regular Python tokenization" until a ``!``, ``:`` or ``=``
+   character is encountered, or until a ``}`` character is encountered with the same
+   level of nesting as the opening bracket token that was pushed when we entered the
+   f-string part. Using this mode, emit tokens until one of the stop points is
+   reached. When this happens, emit the corresponding token for the stopping
+   character encountered, pop the current tokenizer mode from the tokenizer mode
+   stack and go to step 2.
+4. Emit a ``FSTRING_END`` token with the contents captured, pop the current
+   tokenizer mode (corresponding to "F-string tokenization") and go back to
+   "Regular Python mode".
+
+Of course, as mentioned before, it is not possible to provide a precise
+specification of how this should be done for an arbitrary tokenizer as it will
+depend on the specific implementation and nature of the lexer to be changed.
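
As a rough illustration of the mode-stack idea above (a toy sketch, not the CPython
implementation: it only handles single-quoted, non-nested f-strings, and every name
in it is invented for the example)::

    # Toy "stack of lexer modes" sketch: handles only single-quoted,
    # non-nested f-strings and uses a trivial stand-in for the regular
    # Python lexer inside expression parts.
    def fstring_tokens(source):
        assert source.startswith(("f'", "F'"))
        yield ("FSTRING_START", source[:2])    # prefix plus opening quote
        pos, buffer, modes = 2, [], ["fstring"]
        while pos < len(source):
            char = source[pos]
            if modes[-1] == "fstring":
                if char == "'":                # closing quote: flush and stop
                    if buffer:
                        yield ("FSTRING_MIDDLE", "".join(buffer))
                    yield ("FSTRING_END", char)
                    return
                if char == "{" and source[pos + 1 : pos + 2] != "{":
                    if buffer:                 # flush text before the brace
                        yield ("FSTRING_MIDDLE", "".join(buffer))
                        buffer = []
                    yield ("LBRACE", char)
                    modes.append("expression") # enter expression lexing
                elif char in "{}" and source[pos + 1 : pos + 2] == char:
                    buffer.append(char)        # doubled brace collapses to one
                    pos += 1
                else:
                    buffer.append(char)
            else:                              # "expression" mode stand-in
                if char == "}":
                    yield ("RBRACE", char)
                    modes.pop()                # back to f-string lexing
                else:
                    yield ("EXPR_CHAR", char)
            pos += 1

    for tok in fstring_tokens("f'some words {a} {{literal}}'"):
        print(*tok)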
 
 Consequences of the new grammar
 -------------------------------
@@ -340,11 +394,22 @@ All restrictions mentioned in the PEP are lifted from f-string literals, as expl
 * Backslashes may now appear within expressions just like anywhere else in
   Python code. In case of strings nested within f-string literals, escape sequences are
   expanded when the innermost string is evaluated.
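
For instance, assuming an interpreter implementing this PEP, the following (currently
a syntax error) would evaluate as::

    >>> words = ["hello", "world"]
    >>> f"{'\n'.join(words)}"
    'hello\nworld'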
-* Comments, using the ``#`` character, are possible only in multi-line f-string literals,
-  since comments are terminated by the end of the line (which makes closing a
-  single-line f-string literal impossible). Comments in multi-line f-string literals require
-  the closing ``{`` of the expression part to be present in a different line as the one the
-  comment is in.
+* New lines are now allowed within expression brackets. This means that these are now allowed::
+
+      >>> x = 1
+      >>> f"___{
+      ... x
+      ... }___"
+      '___1___'
+
+      >>> f"___{(
+      ... x
+      ... )}___"
+      '___1___'
+
+* Comments, using the ``#`` character, are allowed within the expression part of an f-string.
+  Note that comments require the closing bracket (``}``) of the expression part to be present on
+  a different line than the one the comment is on, as otherwise it will be ignored as part of the comment.
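
For instance, assuming an interpreter implementing this PEP, a comment in the
expression part, with the closing bracket on a different line, would look like::

    >>> x = 1
    >>> f"___{
    ...     x  # a comment in the expression part
    ... }___"
    '___1___'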
 
 .. _701-considerations-of-quote-reuse:
 
@@ -499,6 +564,12 @@ Rejected Ideas
 
     >>> f'Useless use of lambdas: { (lambda x: x*2) }'
 
+#. We have decided to not allow, for the time being, escaped braces (``\{`` and ``\}``)
+   in addition to the ``{{`` and ``}}`` syntax. Although the authors of the PEP believe that
+   this is a good idea, we have decided to not include it in this PEP as it is not strictly
+   necessary for the formalization of f-strings proposed here and it can be
+   added independently in a regular Python issue.
+
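
For context, the doubled-brace escaping that remains the only way to include literal
braces already works today::

    >>> f"{{literal braces}}"
    '{literal braces}'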
 
 Open Issues
 ===========
 