@@ -271,17 +271,19 @@ New tokens
----------

Three new tokens are introduced: ``FSTRING_START``, ``FSTRING_MIDDLE`` and
- ``FSTRING_END``. This PEP does not mandate the precise definitions of these tokens
- as different lexers may have different implementations that may be more efficient
- than the ones proposed here given the context of the particular implementation. However,
- the following definitions are provided as a reference so that the reader can have a
- better understanding of the proposed grammar changes and how the tokens are used:
+ ``FSTRING_END``. Different lexers may have implementations that are more
+ efficient than the ones proposed here given the context of the particular
+ implementation. However, the following definitions will be used as part of the
+ public APIs of CPython (such as the ``tokenize`` module) and are also provided
+ as a reference so that the reader can have a better understanding of the
+ proposed grammar changes and how the tokens are used:

* ``FSTRING_START``: This token includes the f-string character (``f``/``F``) and the open quote(s).
- * ``FSTRING_MIDDLE``: This token includes the text between the opening quote
-   and the first expression brace (``{``) and the text between two expression braces (``}`` and ``{``).
- * ``FSTRING_END``: This token includes everything after the last expression brace (or the whole literal part
-   if no expression exists) until the closing quote.
+ * ``FSTRING_MIDDLE``: This token includes a portion of the text inside the string that is
+   not part of an expression part and is not an opening or closing brace. This can be the
+   text between the opening quote and the first expression brace (``{``), the text between
+   two expression braces (``}`` and ``{``), or the text between the last expression brace
+   (``}``) and the closing quote.
+ * ``FSTRING_END``: This token includes the closing quote.

These tokens are always string parts and they are semantically equivalent to the
``STRING`` token with the restrictions specified. These tokens must be produced by the lexer
@@ -292,7 +294,7 @@ differently to the one used by the PEG parser).

As an example::

-   f'some words {a+b} more words {c+d} final words'
+   f'some words {a+b:.3f} more words {c+d=} final words'

will be tokenized as::

@@ -302,33 +304,85 @@ will be tokenized as::
    NAME - 'a'
    PLUS - '+'
    NAME - 'b'
+   OP - ':'
+   FSTRING_MIDDLE - '.3f'
    RBRACE - '}'
    FSTRING_MIDDLE - ' more words '
    LBRACE - '{'
    NAME - 'c'
    PLUS - '+'
    NAME - 'd'
+   OP - '='
    RBRACE - '}'
-   FSTRING_END - ' final words' (without the end quote)
+   FSTRING_MIDDLE - ' final words'
+   FSTRING_END - "'"

while ``f"""some words"""`` will be tokenized simply as::

    FSTRING_START - 'f"""'
-   FSTRING_END - 'some words'
-
- One way existing lexers can be adapted to emit these tokens is to incorporate a stack of "lexer modes"
- or to use a stack of different lexers. This is because the lexer needs to switch from "regular Python
- lexing" to "f-string lexing" when it encounters an f-string start token and as f-strings can be nested,
- the context needs to be preserved until the f-string closes. Also, the "lexer mode" inside an f-string
- expression part needs to behave as a "super-set" of the regular Python lexer (as it needs to be able to
- switch back to f-string lexing when it encounters the ``}`` terminator for the expression part as well
- as handling f-string formatting and debug expressions). Of course, as mentioned before, is not possible to
- provide a precise specification of how this should be done as it will depend on the specific implementation
- and nature of the lexer to be changed.
-
- The specifics of how (or if) the ``tokenize`` module will emit these tokens (or others) and what
- is included in the emitted tokens are left out of this document and must be decided later in a regular
- CPython issue.
+   FSTRING_MIDDLE - 'some words'
+   FSTRING_END - '"""'
+
+ Changes to the tokenize module
+ ------------------------------
+
+ The ``tokenize`` module will be adapted to emit these tokens as described in the
+ previous section when parsing f-strings, so that tools can take advantage of this
+ new tokenization scheme and avoid having to implement their own f-string
+ tokenizer and parser.
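+
+ For instance, a tool could inspect the tokens emitted for an f-string with
+ something like the following snippet (the exact token stream depends on the
+ final implementation, so the output should be treated as illustrative only)::
+
+     import io
+     import tokenize
+
+     source = "f'some words {a+b:.3f} more words'"
+     for token in tokenize.generate_tokens(io.StringIO(source).readline):
+         print(tokenize.tok_name[token.type], repr(token.string))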
+
+ How to produce these new tokens
+ -------------------------------
+
+ One way existing lexers can be adapted to emit these tokens is to incorporate a
+ stack of "lexer modes" or to use a stack of different lexers. This is because
+ the lexer needs to switch from "regular Python lexing" to "f-string lexing" when
+ it encounters an f-string start token, and as f-strings can be nested, the
+ context needs to be preserved until the f-string closes. Also, the "lexer mode"
+ inside an f-string expression part needs to behave as a "super-set" of the
+ regular Python lexer (as it needs to be able to switch back to f-string lexing
+ when it encounters the ``}`` terminator for the expression part as well as
+ handling f-string formatting and debug expressions). For reference, here is a
+ draft of the algorithm to modify a CPython-like tokenizer to emit these new
+ tokens:
+
+ 1. If the lexer detects that an f-string is starting (by detecting the letter
+    'f/F' and one of the possible quotes), keep advancing until it finds a
+    character that is not equal to the first character after the 'f/F' (this
+    consumes the full opening quote, including triple quotes) and emit a
+    ``FSTRING_START`` token with the contents captured (the 'f/F' and the starting
+    quote). Push a new tokenizer mode to the tokenizer mode stack for "F-string
+    tokenization". Go to step 2.
+ 2. Keep consuming characters until one of the following is encountered:
+
+    * A closing quote equal to the opening quote.
+    * An opening brace (``{``) or a closing brace (``}``) that is not immediately
+      followed by another opening/closing brace.
+
+    In all cases, if the character buffer is not empty, emit a ``FSTRING_MIDDLE``
+    token with the contents captured so far, but transform any double
+    opening/closing braces into single opening/closing braces. Now, proceed as
+    follows depending on the character encountered:
+
+    * If a closing quote matching the opening quote is encountered, go to step 4.
+    * If an opening brace (not immediately followed by another opening brace)
+      is encountered, emit a token for it and go to step 3.
+    * If a closing brace (not immediately followed by another closing brace)
+      is encountered, emit a token for the closing brace and go to step 2.
+
+ 3. Push a new tokenizer mode to the tokenizer mode stack for "Regular Python
+    tokenization within f-string" and proceed to tokenize with it. This mode
+    tokenizes as the "Regular Python tokenization" until a ``!``, ``:`` or ``=``
+    character is encountered, or until a ``}`` character is encountered at the
+    same level of nesting as the opening brace token that was emitted when we
+    entered the f-string part. Using this mode, emit tokens until one of the stop
+    points is reached. When this happens, emit the corresponding token for the
+    stopping character encountered, pop the current tokenizer mode from the
+    tokenizer mode stack and go to step 2.
+ 4. Emit a ``FSTRING_END`` token with the contents captured, pop the current
+    tokenizer mode (corresponding to "F-string tokenization") and go back to
+    "Regular Python mode".
+
+ Of course, as mentioned before, it is not possible to provide a precise
+ specification of how this should be done for an arbitrary tokenizer as it will
+ depend on the specific implementation and nature of the lexer to be changed.

Consequences of the new grammar
-------------------------------
@@ -340,11 +394,22 @@ All restrictions mentioned in the PEP are lifted from f-string literals, as expl
* Backslashes may now appear within expressions just like anywhere else in
  Python code. In case of strings nested within f-string literals, escape sequences are
  expanded when the innermost string is evaluated.
- * Comments, using the ``#`` character, are possible only in multi-line f-string literals,
-   since comments are terminated by the end of the line (which makes closing a
-   single-line f-string literal impossible). Comments in multi-line f-string literals require
-   the closing ``{`` of the expression part to be present in a different line as the one the
-   comment is in.
+ * Newlines are now allowed within expression braces. This means that these are now allowed::
+
+     >>> x = 1
+     >>> f"___{
+     ... x
+     ... }___"
+     '___1___'
+
+     >>> f"___{(
+     ... x
+     ... )}___"
+     '___1___'
+
+ * Comments, using the ``#`` character, are allowed within the expression part of an f-string.
+   Note that comments require the closing brace (``}``) of the expression part to be present on
+   a different line from the one the comment is on, as otherwise it will be ignored as part of
+   the comment.
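+
+   As an illustration (reusing ``x`` from the snippet above), the following would be
+   valid under this PEP::
+
+     >>> f"___{
+     ... x  # a comment inside the expression part
+     ... }___"
+     '___1___'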

.. _701-considerations-of-quote-reuse:

@@ -499,6 +564,12 @@ Rejected Ideas

    >>> f'Useless use of lambdas: { (lambda x: x*2) }'

+ #. We have decided not to allow, for the time being, using escaped braces (``\{`` and ``\}``)
+    in addition to the ``{{`` and ``}}`` syntax. Although the authors of the PEP believe that
+    this is a good idea, we have decided not to include it in this PEP as it is not strictly
+    necessary for the formalization of f-strings proposed here and it can be
+    added independently in a regular Python issue.
+
Open Issues
===========
