ERR5RS:Lexical Syntax

From SchemePunks

Requirements and Recommendations

Formal Account

<lexemes> → <empty>
         | <lexeme> <lexemes>
         | <atmosphere> <lexemes>

<lexeme> → <identifier> | <boolean> | <number> | <character> | <string> | <directive>
         | ( | ) | #( | ' | ` | , | ,@ | .
<delimiter> → ( | ) | " | ; | # | <whitespace>

<atmosphere> → <whitespace> | <comment>
<whitespace> → <space> | <character tabulation>
         | <line feed> | <line tabulation> | <form feed>
         | <carriage return> | <next line>
         | <any other character whose category is Zs, Zl, or Zp>
<comment> → ; 〈all subsequent characters up to a <line ending>〉
         | #; <atmosphere>* <datum>
<line ending> → <line feed> | <carriage return>  | <next line>  | <line separator>
         | <carriage return> <line feed> | <carriage return> <next line>

<identifier>, <boolean>, <number>, <character>, <directive>, and . (dot) lexemes must be followed immediately by a <delimiter> or by the end of the input.

The following characters are reserved for future extensions to the language: { } | [ ]

<identifier> → <initial> <subsequent>* | <peculiar identifier>
<initial> → <constituent> | <special initial> | <inline hex escape>
<constituent> → <letter> | 〈any character whose Unicode scalar value is greater than
             127, and whose category is Lu, Ll, Lt, Lm, Lo, Mn,
             Nl, No, Pd, Pc, Po, Sc, Sm, Sk, So, or Co〉
<letter> → a | b | c | d | e | f | g | h | i | j | k | l | m 
         | n | o | p | q | r | s | t | u | v | w | x | y | z
         | A | B | C | D | E | F | G | H | I | J | K | L | M
         | N | O | P | Q | R | S | T | U | V | W | X | Y | Z
<special initial> → ! | $ | % | & | * | / | : | < | = | > | ? | ^ | _ | ~
<subsequent> → <initial> | <digit> | <any character whose category is Nd, Mc, or Me> 
         | <special subsequent>
<digit> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<special subsequent> → + | - | . | @
<peculiar identifier> → + | - | ...
<inline hex escape> → \x<hex value>;
<hex value> → <digit 16>+

<boolean> → #<t> | #<f>

<character> → #\<any character> | #\<character name> | #\x<hex value>
<character name> → <n><u><l> 
         | <a><l><a><r><m>
         | <b><a><c><k><s><p><a><c><e>
         | <t><a><b>
         | <l><i><n><e><f><e><e><d>
         | <n><e><w><l><i><n><e>
         | <v><t><a><b>
         | <p><a><g><e>
         | <r><e><t><u><r><n>
         | <e><s><c>
         | <s><p><a><c><e>
         | <d><e><l><e><t><e>

<string> → " <string element>* "
<string element> → <any character other than " or \> 
         | \a | \b | \t | \n | \v | \f | \r| \" | \\ 
         | \<intraline whitespace>*<line ending><intraline whitespace>*
         | <inline hex escape>
<intraline whitespace> → <space> | <character tabulation>

<directive> → #!fold-case | #!no-fold-case

The rules for <num R>, <complex R>, <real R>, <ureal R>, <uinteger R>, and <prefix R> below should be replicated for R = 2, 8, 10, and 16. There are no rules for <decimal 2>, <decimal 8>, and <decimal 16>, which means that number representations containing decimal points or exponents must be in decimal radix.

<number> → <num 2> | <num 8> | <num 10> | <num 16>
<num R> → <prefix R> <complex R>
<complex R> → <real R> 
         | <real R> @ <real R>
         | <real R> + <ureal R> <i> | <real R> - <ureal R> <i>
         | <real R> + <i> | <real R> - <i>
         | + <ureal R> <i> | - <ureal R> <i>
         | + <i> | - <i>
<real R> → <sign> <ureal R>
         | + <naninf> | - <naninf>
<naninf> → nan.0 | inf.0
<ureal R> → <uinteger R>
         | <uinteger R> / <uinteger R>
         | <decimal R>
<decimal 10> → <uinteger 10> <suffix>
         | . <digit 10>+ #* <suffix>
         | <digit 10>+ . <digit 10>* #* <suffix>
         | <digit 10>+ #+ . #* <suffix>
<uinteger R> → <digit R>+ #*
<prefix R> → <radix R> <exactness> | <exactness> <radix R>

<suffix> → <empty> | <exponent marker> <sign> <digit 10>+
<exponent marker> → <e> | <s> | <f> | <d> | <l>
<sign> → <empty> | + | -
<exactness> → <empty> | #<i> | #<e>
<radix 2> → #<b>
<radix 8> → #<o>
<radix 10> → <empty> | #<d>
<radix 16> → #<x>
<digit 2> → 0 | 1
<digit 8> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
<digit 10> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<digit 16> → <digit 10> | <a> | <b> | <c> | <d> | <e> | <f>
<a> → a | A
<b> → b | B
<c> → c | C
...
<z> → z | Z

Line Endings

Line endings are significant in Scheme in comments and within string literals. In Scheme source code, a line feed (U+000A), carriage return (U+000D), next line (U+0085), or line separator (U+2028) character marks the end of a line. Moreover, the two-character line endings <carriage return> <line feed> (U+000D U+000A) and <carriage return> <next line> (U+000D U+0085) each count as a single line ending.

In a string literal, a <line ending> not preceded by a \ stands for a line feed character, which is the standard line-ending character of Scheme. A <line ending> preceded by a \ is elided, along with any surrounding <intraline whitespace> characters.
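As a rough illustration, the six line endings can be normalized to a single line feed with one regular expression. This is only a sketch in Python; the names LINE_ENDING and normalize_line_endings are hypothetical and not part of ERR5RS, and the sketch covers only line endings that are not preceded by a backslash.

```python
import re

# The six <line ending> sequences. The two-character sequences come first
# so that CR LF and CR NEL are consumed as one line ending, not two.
LINE_ENDING = re.compile("\r\n|\r\x85|\n|\r|\x85|\u2028")


def normalize_line_endings(text: str) -> str:
    """Replace every <line ending> in text with a single line feed."""
    return LINE_ENDING.sub("\n", text)
```

Putting the two-character alternatives first matters: with the single-character alternatives first, a CR LF pair would be counted as two line endings.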

Whitespace and Comments

The whitespace characters are space (U+0020), character tabulation (U+0009), line feed (U+000A), line tabulation (U+000B), form feed (U+000C), carriage return (U+000D), next line (U+0085), and any other character whose category is Zs, Zl, or Zp. Whitespace is used for improved readability and as necessary to separate lexemes from each other. Whitespace may occur between any two lexemes, but not within a lexeme. Whitespace may also occur inside a string, where it is significant.
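A whitespace test along these lines might look as follows in Python. This is a sketch, not a normative definition: the names NAMED_WHITESPACE and is_scheme_whitespace are hypothetical, and the result for the category test depends on the Unicode version shipped with the Python interpreter.

```python
import unicodedata

# The whitespace characters that the <whitespace> rule names explicitly.
NAMED_WHITESPACE = {
    "\u0020",  # space
    "\u0009",  # character tabulation
    "\u000A",  # line feed
    "\u000B",  # line tabulation
    "\u000C",  # form feed
    "\u000D",  # carriage return
    "\u0085",  # next line
}


def is_scheme_whitespace(ch: str) -> bool:
    """True if the single character ch is <whitespace>."""
    return ch in NAMED_WHITESPACE or unicodedata.category(ch) in {"Zs", "Zl", "Zp"}
```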

The lexical syntax includes two comment forms. In all cases, comments are invisible to Scheme, except that they act as delimiters, so, for example, a comment cannot appear in the middle of an identifier or representation of a number object.

A semicolon (;) indicates the start of a line comment. The comment continues to the end of the line on which the semicolon appears.

The octothorpe followed by a semicolon (#;) indicates the start of a datum comment. The comment includes any subsequent sequence of <atmosphere> and continues to the end of the next <datum>. This notation is useful for “commenting out” sections of code.
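The following Python sketch shows how a lexer might skip whitespace and semicolon comments before reading the next lexeme. The names is_whitespace and skip_atmosphere are hypothetical. The #; form is deliberately left out, because skipping a <datum> requires a full reader rather than a character scan.

```python
import unicodedata

# Characters that can begin a <line ending>.
LINE_ENDING_CHARS = "\n\r\x85\u2028"


def is_whitespace(ch: str) -> bool:
    """True if the single character ch is <whitespace>."""
    return ch in " \t\n\x0b\x0c\r\x85" or unicodedata.category(ch) in {"Zs", "Zl", "Zp"}


def skip_atmosphere(text: str, pos: int = 0) -> int:
    """Return the index of the first character at or after pos that is
    neither whitespace nor part of a ';' line comment."""
    n = len(text)
    while pos < n:
        ch = text[pos]
        if is_whitespace(ch):
            pos += 1
        elif ch == ";":
            # The comment runs up to, but does not include, the line ending.
            # The line ending itself is whitespace and is skipped next pass.
            while pos < n and text[pos] not in LINE_ENDING_CHARS:
                pos += 1
        else:
            break
    return pos
```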

Identifiers

Most identifiers allowed by other programming languages are also acceptable to Scheme. In general, a sequence of letters, digits, and “extended alphabetic characters” is an identifier when it begins with a character that cannot begin a representation of a number object. In addition, +, -, and ... are identifiers. Here are some examples of identifiers:

lambda         q                soup
list->vector   +                V17a
<=             a34kTMNs         the-word-recursion-has-many-meanings

Extended alphabetic characters may be used within identifiers as if they were letters. The following are extended alphabetic characters:

! $ % & * + - . / : < = > ? @ ^ _ ~ 

Moreover, all characters whose Unicode scalar values are greater than 127 and whose Unicode category is Lu, Ll, Lt, Lm, Lo, Mn, Nl, No, Pd, Pc, Po, Sc, Sm, Sk, So, or Co are treated like letters and can appear anywhere in an identifier. All characters whose Unicode scalar values are greater than 127 and whose Unicode category is Nd, Mc, or Me are treated like digits and can appear after the first character in an identifier. In addition, any character can be used anywhere in an identifier when specified via an <inline hex escape>. For example, the identifier H\x65;llo is the same as the identifier Hello, and the identifier \x3BB; is the same as the identifier λ.
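The effect of inline hex escapes can be sketched in Python as a simple substitution. The names INLINE_HEX_ESCAPE and expand_inline_hex_escapes are hypothetical. Python's chr raises ValueError for values above #x10FFFF, which is a limit of this sketch rather than a statement about ERR5RS.

```python
import re

# \x<hex value>; where <hex value> is one or more hexadecimal digits.
INLINE_HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]+);")


def expand_inline_hex_escapes(identifier: str) -> str:
    """Replace each <inline hex escape> with the character it denotes."""
    return INLINE_HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), identifier)
```

With this definition, expand_inline_hex_escapes(r"H\x65;llo") yields Hello, matching the example in the text.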

Implementations of ERR5RS may by default choose to fold the case or not fold the case of characters in identifiers. This can be overridden by the #!fold-case and #!no-fold-case directives described below.

Booleans

The standard boolean objects for true and false have external representations #t or #T and #f or #F. Case is not significant in external representations of boolean objects.

Numbers

The syntax of external representations for number objects is described formally by the <number> rule in the formal grammar. Case is not significant in external representations of number objects.

A representation of a number object may be written in binary, octal, decimal, or hexadecimal by the use of a radix prefix. The radix prefixes are #b or #B (binary), #o or #O (octal), #d or #D (decimal), and #x or #X (hexadecimal). With no radix prefix, a representation of a number object is assumed to be expressed in decimal.

A representation of a number object may be specified to be either exact or inexact by a prefix. The prefixes are #e or #E for exact and #i or #I for inexact. An exactness prefix may appear before or after any radix prefix that is used. If the representation of a number object has no exactness prefix, the constant is inexact if it contains a decimal point, an exponent, or a “#” character in the place of a digit; otherwise it is exact.

For the purposes of determining the value of a number, a # in the place of a digit should be interpreted as 0.
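The prefix and exactness rules can be sketched in Python as follows. The names RADIX, split_prefix, is_exact, and read_uinteger are hypothetical, and the sketch assumes its input is already a well-formed <number> lexeme; it does not validate the digits themselves.

```python
RADIX = {"b": 2, "o": 8, "d": 10, "x": 16}


def split_prefix(literal: str):
    """Split a numeric literal into (radix, exactness, body).

    exactness is "e", "i", or None when no exactness prefix is present.
    """
    radix, exactness, rest = None, None, literal
    # At most one radix prefix and one exactness prefix, in either order.
    while len(rest) >= 2 and rest[0] == "#":
        tag = rest[1].lower()  # case is not significant in numbers
        if tag in RADIX and radix is None:
            radix = RADIX[tag]
        elif tag in ("e", "i") and exactness is None:
            exactness = tag
        else:
            raise ValueError(f"bad numeric prefix in {literal!r}")
        rest = rest[2:]
    return radix or 10, exactness, rest


def is_exact(literal: str) -> bool:
    """Apply the exactness rules to a numeric literal."""
    radix, exactness, body = split_prefix(literal)
    if exactness is not None:
        return exactness == "e"
    if "#" in body or "." in body:
        return False
    # Exponent markers exist only in decimal radix; in radix 16 the
    # letters d, e, and f are ordinary digits.
    if radix == 10 and any(c in "esfdl" for c in body.lower()):
        return False
    return True


def read_uinteger(body: str, radix: int) -> int:
    """Value of a <uinteger R>, reading each # as the digit 0."""
    return int(body.replace("#", "0"), radix)
```

For example, 12## is inexact and has the value 1200, while #e1.5 is exact despite its decimal point because the explicit prefix takes priority.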

In systems with inexact number objects of varying precisions, it may be useful to specify the precision of a constant. For this purpose, representations of number objects may be written with an exponent marker that indicates the desired precision of the inexact representation. The letters s or S, f or F, d or D, and l or L specify the use of short, single, double, and long precision, respectively. (When fewer than four internal inexact representations exist, the four size specifications are mapped onto those available. For example, an implementation with two internal representations may map short and single together and long and double together.) In addition, the exponent marker e or E specifies the default precision for the implementation. The default precision has at least as much precision as double, but implementations may wish to allow this default to be set by the user.

3.1415926535898F0    Round to single, perhaps 3.141593
0.6L0                Extend to long, perhaps .600000000000000

The literals +inf.0 and -inf.0 represent positive and negative infinity, respectively. The +nan.0 and -nan.0 literals represent the NaN that is the result of (/ 0.0 0.0), and may represent other NaNs as well.

If x is an external representation of an inexact real number object and contains no exponent marker other than e, the inexact real number object it represents is a flonum. Some or all of the other external representations of inexact real number objects may also represent flonums, but that is not required.

Characters

Characters are represented using the notation #\<character>, #\<character name>, or #\x<hex value>.

Representation   Character
#\a              lower case letter a
#\A              upper case letter A
#\(              left parenthesis (
#\               space SP
#\nul            U+0000 NUL
#\alarm          U+0007 BEL
#\backspace      U+0008 BS
#\tab            U+0009 TAB
#\linefeed       U+000A LF
#\newline        U+000A LF
#\vtab           U+000B VT
#\page           U+000C FF
#\return         U+000D CR
#\esc            U+001B ESC
#\space          U+0020 SP
#\SPACE          U+0020 SP
#\delete         U+007F DEL
#\x7f            U+007F DEL
#\xA             U+000A LF
#\x41            U+0041 A

Case is significant in #\<character> and in the x in #\x<hex value> but not in #\<character name> or in the <hex value> in #\x<hex value>. A <character> must be followed by a <delimiter> or by the end of the input.
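A reader for complete <character> lexemes might be sketched in Python like this. The names CHARACTER_NAMES and read_character are hypothetical, and the sketch assumes the lexeme has already been separated from what follows it by a <delimiter>.

```python
import re

CHARACTER_NAMES = {
    "nul": 0x00, "alarm": 0x07, "backspace": 0x08, "tab": 0x09,
    "linefeed": 0x0A, "newline": 0x0A, "vtab": 0x0B, "page": 0x0C,
    "return": 0x0D, "esc": 0x1B, "space": 0x20, "delete": 0x7F,
}


def read_character(lexeme: str) -> str:
    """Return the character denoted by a complete <character> lexeme."""
    if not lexeme.startswith("#\\") or len(lexeme) < 3:
        raise ValueError(f"not a character lexeme: {lexeme!r}")
    body = lexeme[2:]
    if len(body) == 1:
        return body  # #\<any character>: case is significant
    name = body.lower()  # character names are case-insensitive
    if name in CHARACTER_NAMES:
        return chr(CHARACTER_NAMES[name])
    # #\x<hex value>: the x must be lower case, the digits may be either case.
    if re.fullmatch(r"x[0-9A-Fa-f]+", body):
        return chr(int(body[1:], 16))
    raise ValueError(f"unknown character lexeme: {lexeme!r}")
```

Checking the one-character case first means that #\x on its own denotes the letter x rather than an incomplete hex escape.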

Strings

Strings are represented by sequences of characters enclosed within double quotes ("). Within a string literal, various escape sequences represent characters other than themselves. Escape sequences always start with a backslash (\).

Escape sequence   Character
\a                U+0007 BEL
\b                U+0008 BS
\t                U+0009 TAB
\n                U+000A LF
\v                U+000B VT
\f                U+000C FF
\r                U+000D CR
\"                U+0022 "
\\                U+005C \
\x7f;             U+007F DEL
\xA;              U+000A LF
\x41;             U+0041 A

A <line ending> not preceded by a \ stands for a line feed character, which is the standard line-ending character of Scheme. A <line ending> preceded by a \ is elided, along with any surrounding <intraline whitespace> characters.

Scheme does not specify the effect of a backslash within a string that is not part of one of the above escape sequences.

Case is significant in strings except in the <hex value> in <inline hex escapes>.
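The string rules can be sketched in Python as a decoder for the text between the double quotes. The names SIMPLE_ESCAPES, STRING_ELEMENT, and decode_string_body are hypothetical. Because Scheme leaves other backslash sequences unspecified, this sketch chooses to reject them; an implementation is free to do otherwise.

```python
import re

SIMPLE_ESCAPES = {
    "a": "\a", "b": "\b", "t": "\t", "n": "\n", "v": "\v",
    "f": "\f", "r": "\r", '"': '"', "\\": "\\",
}

# Alternatives are tried left to right at each position.
STRING_ELEMENT = re.compile(
    r"\\x([0-9A-Fa-f]+);"                                # 1: inline hex escape
    r"|\\[ \t]*(?:\r\n|\r\x85|\n|\r|\x85|\u2028)[ \t]*"  # escaped line ending
    r"|\\(.)"                                            # 2: any other backslash
    r"|(\r\n|\r\x85|\n|\r|\x85|\u2028)"                  # 3: bare line ending
    r"|([^\\])",                                         # 4: ordinary character
    re.DOTALL,
)


def decode_string_body(body: str) -> str:
    """Decode the characters between the quotes of a <string> literal."""
    out = []
    pos = 0
    while pos < len(body):
        m = STRING_ELEMENT.match(body, pos)
        if m is None:
            raise ValueError(f"dangling backslash at index {pos}")
        if m.group(1) is not None:
            out.append(chr(int(m.group(1), 16)))
        elif m.group(2) is not None:
            if m.group(2) not in SIMPLE_ESCAPES:
                raise ValueError(f"unspecified escape \\{m.group(2)}")
            out.append(SIMPLE_ESCAPES[m.group(2)])
        elif m.group(3) is not None:
            out.append("\n")  # a bare line ending stands for a line feed
        elif m.group(4) is not None:
            out.append(m.group(4))
        # Otherwise an escaped line ending matched, which contributes nothing.
        pos = m.end()
    return "".join(out)
```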

Directives

When either the #!fold-case or the #!no-fold-case directive is read from a port by the read procedure, the read procedure returns unspecified values but also, as a side effect, folds the case (#!fold-case) or does not fold the case (#!no-fold-case) of characters in identifiers read from that port by the read procedure. This effect lasts until another of the two directives is read from the port by the read procedure.

To prevent the read procedure from returning the unspecified values of the #!fold-case or #!no-fold-case directives, they may be converted into datum comments by prefixing them with #;. Their effect on case-folding is the same when read as a datum comment as when read as a datum.

Since the #!fold-case and #!no-fold-case directives may read as zero values or as more than one value, they cannot be embedded within compound data such as lists or vectors; they are effectively restricted to the outermost level of programs and data. As datum comments, however, they may be embedded within compound data.

Case is significant in directives.
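The per-port folding state can be modeled with a small Python generator. This is only a sketch under several assumptions: the name fold_identifiers is hypothetical, the input is taken to be a sequence of already separated identifier and directive lexemes, case folding is approximated with Python's str.casefold (ERR5RS does not say which folding to use), and the unspecified values returned by the directives are not modeled.

```python
def fold_identifiers(lexemes, fold_by_default=False):
    """Yield the identifiers in lexemes, folding case as directed.

    fold_by_default stands for the implementation's default choice,
    which ERR5RS leaves open.
    """
    fold = fold_by_default
    for lexeme in lexemes:
        if lexeme == "#!fold-case":
            fold = True
        elif lexeme == "#!no-fold-case":
            fold = False
        elif lexeme.startswith("#!"):
            # Case is significant in directives, so #!FOLD-CASE is not one.
            raise ValueError(f"unknown directive: {lexeme}")
        else:
            yield lexeme.casefold() if fold else lexeme
```

The state persists from one lexeme to the next, which mirrors the rule that a directive's effect lasts until another directive is read from the same port.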

Rationale and Notes

Background

The R5RS lexical syntax has been both complex and limiting. When writing portable code, R5RS programmers have had to restrict themselves to a subset of the ASCII characters, not all of which can appear within string literals or identifiers. The variety and popularity of implementation-specific extensions that remove some of these lexical restrictions have been obstacles to portability.

The portability of R5RS programs has also been impeded by several other implementation-dependent extensions and deviations, notably case-sensitivity, square brackets, notations for NaNs and infinities, and nonstandard external notations for vectors.

Principles? In a syntax fight?

ERR5RS addresses these problems by adding some R6RS extensions to the R5RS lexical syntax. Some of these extensions are essential, while others are recommended. The essential extensions may be relied upon by any ERR5RS program, while recommended extensions improve the portability of ERR5RS programs between systems that support the recommended portions of the R6RS syntax.

As a rule, ERR5RS adopts or recommends the non-controversial lexical extensions proposed by the R6RS, while ignoring the more controversial lexical extensions. Ignoring certain lexical extensions would be as controversial as adopting them, however, so the general rule does not suffice to decide issues such as case sensitivity, square brackets, and Unicode.

Unicode and Portability

The lexical syntax includes R6RS-style escapes that allow programs written in the ASCII character set to express any Unicode scalar value as a character or element of a string or identifier. A program that expresses non-ASCII characters may encounter implementation restrictions in systems that don't fully support Unicode, but the lexical syntax allows such programs to be portable between systems that do support Unicode.

Case Sensitivity

This section is a stub.

We need to consider case sensitivity for

  • #t and #f (insensitive in R6RS)
  • prefixes and exponent markers in numbers (insensitive in R6RS)
  • i (insensitive in R6RS)
  • character names (sensitive in R6RS)
  • hex digits (insensitive in R6RS)
  • identifiers (sensitive in R6RS in the absence of #!fold-case)
  • string escapes (sensitive in R6RS)
  • the x in hex characters (insensitive in R6RS) and inline character escapes in strings (sensitive in R6RS) and identifiers (unclear in R6RS).

Square Brackets

This section is a stub.

Use of matching square brackets in place of parentheses should be standard.
Matching square brackets have been used at least as far back as Scheme84 (MIT 1984). Matching square brackets are standard in R6RS and supported today in Bigloo, Chez, Chicken, Gambit, Gosh, PLT/mzscheme, and Stklos Scheme implementations. Forbidding use of square brackets would break a significant amount of pre-R6 code. After nearly a quarter century of use in Schemes, it is time to let this de facto standard become a recognized standard. User: kend

Characters

The origin of the character names is not clear to me; they do not correspond to either the ASCII or Unicode names (e.g., #\page represents the character named FF in ASCII and form feed in Unicode).

Strings

The string escapes are adopted from C.

Lexical Syntax of the R5RS and R6RS

R5RS Lexical Syntax

The rest of this subsection is taken directly from R5RS 7.1.1.

<Intertoken space> may occur on either side of any token, but not within a token.
Tokens which require implicit termination (identifiers, numbers, characters, and dot) may be terminated by any <delimiter>, but not necessarily by anything else.
The following five characters are reserved for future extensions to the language: [ ] { } |
<token> --> <identifier> | <boolean> | <number>
     | <character> | <string>
     | ( | ) | #( | ' | ` | , | ,@ | .
<delimiter> --> <whitespace> | ( | ) | " | ;
<whitespace> --> <space or newline>
<comment> --> ;  <all subsequent characters up to a
                 line break>
<atmosphere> --> <whitespace> | <comment>
<intertoken space> --> <atmosphere>*

<identifier> --> <initial> <subsequent>*
     | <peculiar identifier>
<initial> --> <letter> | <special initial>
<letter> --> a | b | c | ... | z

<special initial> --> ! | $ | % | & | * | / | : | < | =
     | > | ? | ^ | _ | ~
<subsequent> --> <initial> | <digit>
     | <special subsequent>
<digit> --> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<special subsequent> --> + | - | . | @
<peculiar identifier> --> + | - | ...
<syntactic keyword> --> <expression keyword>
     | else | => | define 
     | unquote | unquote-splicing
<expression keyword> --> quote | lambda | if
     | set! | begin | cond | and | or | case
     | let | let* | letrec | do | delay
     | quasiquote

<variable> --> <any <identifier> that isn't
                also a <syntactic keyword>>

<boolean> --> #t | #f
<character> --> #\ <any character>
     | #\ <character name>
<character name> --> space | newline

<string> --> " <string element>* "
<string element> --> <any character other than " or \>
     | \" | \\ 

<number> --> <num 2>| <num 8>
     | <num 10>| <num 16>
The following rules for <num R>, <complex R>, <real R>, <ureal R>, <uinteger R>, and <prefix R> should be replicated for R = 2, 8, 10, and 16. There are no rules for <decimal 2>, <decimal 8>, and <decimal 16>, which means that numbers containing decimal points or exponents must be in decimal radix.
<num R> --> <prefix R> <complex R>
<complex R> --> <real R> | <real R> @ <real R>
    | <real R> + <ureal R> i | <real R> - <ureal R> i
    | <real R> + i | <real R> - i
    | + <ureal R> i | - <ureal R> i | + i | - i
<real R> --> <sign> <ureal R>
<ureal R> --> <uinteger R>
    | <uinteger R> / <uinteger R>
    | <decimal R>
<decimal 10> --> <uinteger 10> <suffix>
    | . <digit 10>+ #* <suffix>
    | <digit 10>+ . <digit 10>* #* <suffix>
    | <digit 10>+ #+ . #* <suffix>
<uinteger R> --> <digit R>+ #*
<prefix R> --> <radix R> <exactness>
    | <exactness> <radix R>

<suffix> --> <empty> 
    | <exponent marker> <sign> <digit 10>+
<exponent marker> --> e | s | f | d | l
<sign> --> <empty>  | + |  -
<exactness> --> <empty> | #i | #e
<radix 2> --> #b
<radix 8> --> #o
<radix 10> --> <empty> | #d
<radix 16> --> #x
<digit 2> --> 0 | 1
<digit 8> --> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
<digit 10> --> <digit>
<digit 16> --> <digit 10> | a | b | c | d | e | f 

R6RS Lexical Syntax

The rest of this subsection is taken directly from R6RS 4.2 and 4.3.

Case is significant except in representations of booleans, number objects, and in hexadecimal numbers specifying Unicode scalar values. For example, #x1A and #X1a are equivalent. The identifier Foo is, however, distinct from the identifier FOO.
<Interlexeme space> may occur on either side of any lexeme, but not within a lexeme.
<Identifier>s, ., <number>s, <character>s, and <boolean>s, must be terminated by a <delimiter> or by the end of the input.
The following two characters are reserved for future extensions to the language: { }
<lexeme> → <identifier> | <boolean> | <number>
         | <character> | <string>
         | ( | ) | [ | ] | #( | #vu8( | ' | ` | , | ,@ | .
         | #' | #` | #, | #,@
<delimiter> → ( | ) | [ | ] | " | ; | #
         | <whitespace>
<whitespace> → <character tabulation>
         | <linefeed> | <line tabulation> | <form feed>
         | <carriage return> | <next line>
         | <any character whose category is Zs, Zl, or Zp>
<line ending> → <linefeed> | <carriage return>
         | <carriage return> <linefeed> | <next line>
         | <carriage return> <next line> | <line separator>
<comment> → ; 〈all subsequent characters up to a
         <line ending> or <paragraph separator>〉
         | <nested comment>
         | #; <interlexeme space> <datum>
         | #!r6rs
<nested comment> → #| <comment text>
         <comment cont>* |#
<comment text> → 〈character sequence not containing
         #| or |#〉
<comment cont> → <nested comment> <comment text>
<atmosphere> → <whitespace> | <comment>
<interlexeme space> → <atmosphere>*

<identifier> → <initial> <subsequent>*
         | <peculiar identifier>
<initial> → <constituent> | <special initial>
         | <inline hex escape>
<letter> → a | b | c | ... | z
         | A | B | C | ... | Z
<constituent> → <letter>
         | 〈any character whose Unicode scalar value is greater than
             127, and whose category is Lu, Ll, Lt, Lm, Lo, Mn,
             Nl, No, Pd, Pc, Po, Sc, Sm, Sk, So, or Co〉
<special initial> → ! | $ | % | & | * | / | : | < | =
         | > | ? | ^ | _ | ~
<subsequent> → <initial> | <digit>
         | <any character whose category is Nd, Mc, or Me>
         | <special subsequent>
<digit> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<hex digit> → <digit>
         | a | A | b | B | c | C | d | D | e | E | f | F
<special subsequent> → + | - | . | @
<inline hex escape> → \x<hex scalar value>;
<hex scalar value> → <hex digit>+
<peculiar identifier> → + | - | ... | -> <subsequent>*
<boolean> → #t | #T | #f | #F
<character> → #\<any character>
         | #\<character name>
         | #\x<hex scalar value>
<character name> → nul | alarm | backspace | tab
         | linefeed | newline | vtab | page | return
         | esc | space | delete
<string> → " <string element>* "
<string element> → <any character other than " or \>
         | \a | \b | \t | \n | \v | \f | \r
         | \" | \\
         | \<intraline whitespace><line ending>
            <intraline whitespace>
         | <inline hex escape>
<intraline whitespace> → <character tabulation>
         | <any character whose category is Zs>
A <hex scalar value> represents a Unicode scalar value between 0 and #x10FFFF, excluding the range [#xD800, #xDFFF].
The rules for <num R>, <complex R>, <real R>, <ureal R>, <uinteger R>, and <prefix R> below should be replicated for R = 2, 8, 10, and 16. There are no rules for <decimal 2>, <decimal 8>, and <decimal 16>, which means that number representations containing decimal points or exponents must be in decimal radix.
<number> → <num 2> | <num 8>
         | <num 10> | <num 16>
<num R> → <prefix R> <complex R>
<complex R> → <real R> | <real R> @ <real R>
         | <real R> + <ureal R> i | <real R> - <ureal R> i
         | <real R> + <naninf> i | <real R> - <naninf> i
         | <real R> + i | <real R> - i
         | + <ureal R> i | - <ureal R> i
         | + <naninf> i | - <naninf> i
         | + i | - i
<real R> → <sign> <ureal R>
         | + <naninf> | - <naninf>
<naninf> → nan.0 | inf.0
<ureal R> → <uinteger R>
         | <uinteger R> / <uinteger R>
         | <decimal R> <mantissa width>
<decimal 10> → <uinteger 10> <suffix>
         | . <digit 10>+ <suffix>
         | <digit 10>+ . <digit 10>* <suffix>
         | <digit 10>+ . <suffix>
<uinteger R> → <digit R>+
<prefix R> → <radix R> <exactness>
         | <exactness> <radix R>

<suffix> → <empty>
         | <exponent marker> <sign> <digit 10>+
<exponent marker> → e | E | s | S | f | F
         | d | D | l | L
<mantissa width> → <empty>
         | | <digit 10>+
<sign> → <empty> | + | -
<exactness> → <empty>
         | #i| #I | #e| #E
<radix 2> → #b| #B
<radix 8> → #o| #O
<radix 10> → <empty> | #d | #D
<radix 16> → #x| #X
<digit 2> → 0 | 1
<digit 8> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
<digit 10> → <digit>
<digit 16> → <hex digit>

The following grammar describes the syntax of syntactic data in terms of various kinds of lexemes defined in the grammar in section 4.2:

<datum> → <lexeme datum>
         | <compound datum>
<lexeme datum> → <boolean> | <number>
         | <character> | <string> | <symbol>
<symbol> → <identifier>
<compound datum> → <list> | <vector> | <bytevector>
<list> → (<datum>*) | [<datum>*]
         | (<datum>+ . <datum>) | [<datum>+ . <datum>]
         | <abbreviation>
<abbreviation> → <abbrev prefix> <datum>
<abbrev prefix> → ' | ` | , | ,@
         | #' | #` | #, | #,@
<vector> → #(<datum>*)
<bytevector> → #vu8(<u8>*)
<u8> → 〈any <number> representing an exact
                   integer in {0, ..., 255}〉

Changes That Should Have Been Made by the R6RS

ERR5RS should correct these errors in the R5RS lexical syntax, even though the R6RS didn't.

  • The R6RS defines <interlexeme space> to be zero or more occurrences of <atmosphere>, so every adjacent pair of characters is separated by <interlexeme space> even though <interlexeme space> should not occur within lexemes.
The ERR5RS does not follow the R6RS here, but instead eliminates <interlexeme space> and incorporates <atmosphere> directly in the <lexemes> non-terminal.
  • Several other ambiguities in the R5RS lexical syntax were left ambiguous by the R6RS.

Meaningless Changes Made by the R6RS

ERR5RS should probably ignore these changes, but listing them here might make it easier for people to compare the R5RS lexical syntax with that of the R6RS.

  • The R6RS speaks of lexemes, the R5RS of tokens.
The ERR5RS follows the R6RS here.

Uncontroversial Changes Made by the R6RS

ERR5RS should probably require these changes.

  • The R6RS requires identifiers, numbers, characters, and booleans to be terminated by a <delimiter> or by the end of input.
The ERR5RS follows the R6RS here.
  • The R6RS allows several new kinds of whitespace: tabs, line feeds, vertical tabs, form feeds, carriage returns, the Unicode next line character, and all Unicode scalar values whose Unicode general category is Zs, Zl, or Zp. (The R5RS allows only spaces and newlines; in the R6RS, newlines become synonymous with line feeds.)
The ERR5RS follows the R6RS here.
  • The R6RS allows identifiers to contain <inline hex escape>s.
The ERR5RS follows the R6RS here.
  • The R6RS adds a new syntax for character literals that can express any Unicode character by specifying its scalar value in hexadecimal, e.g. #\x3bb.
The ERR5RS follows the R6RS here.
  • The R6RS adds ten new character names, e.g. #\linefeed.
The ERR5RS follows the R6RS here, but allows case-insensitive names for backwards compatibility with #\SPACE and #\NEWLINE.
  • The R6RS adds seven new single-character escapes for use in string literals, e.g. "\n".
The ERR5RS follows the R6RS here.
  • The R6RS allows end-of-line sequences to appear within string literals. (The six specific end-of-line sequences might be slightly controversial; see below.)
The ERR5RS follows the R6RS here.
  • The R6RS allows <inline hex escape>s within string literals.
The ERR5RS follows the R6RS here.

Slightly Controversial Changes Made by the R6RS

ERR5RS might require some of these while recommending others.

  • The R6RS begins a bytevector literal with the #vu8( token. ERR5RS should require this token if and only if bytevectors are an essential data type of ERR5RS.
The ERR5RS does not follow the R6RS here.
  • The R6RS adds four tokens for the convenience of syntax-case macros: #' #` #, #,@.
The ERR5RS does not follow the R6RS here.
  • The R6RS treats # as a delimiter.
The ERR5RS follows the R6RS here.
Below, it is noted that #s may be used in numbers. I would not think # can be an exactness specifier as well as a delimiter. User:kend
  • The R6RS requires six specific character sequences to be recognized as line endings that terminate single-line comments or count as a single linefeed character when embedded within a string literal.
The ERR5RS follows the R6RS here.
  • The R6RS allows identifiers to begin with any character whose Unicode scalar value is greater than 127 and whose Unicode general category is Lu, Ll, Lt, Lm, Lo, Mn, Nl, No, Pd, Pc, Po, Sc, Sm, Sk, So, or Co.
The ERR5RS follows the R6RS here.
  • The R6RS allows the subsequent characters of an identifier to include any character whose Unicode general category is Nd, Mc, or Me.
The ERR5RS follows the R6RS here.
  • The numerical value of an <inline hex escape> is restricted to Unicode scalar values.
The ERR5RS does not follow the R6RS here.
  • The R6RS adds nan.0 and inf.0 tokens, which must be preceded by explicit signs when used within numeric literals.
The ERR5RS follows the R6RS here.
  • The R6RS removes the # notation for insignificant digits.
The ERR5RS does not follow the R6RS here and retains the # notation for insignificant digits.

Controversial Changes Made by the R6RS

ERR5RS should probably not require these changes, although it might recommend some of them.

  • The R6RS reserves [ and ] for use as synonyms of ( and ).
The ERR5RS does not follow the R6RS here.
  • The R6RS reserves the vertical bar for specification of mantissa widths.
The ERR5RS reserves the vertical bar, but does not assign it any particular meaning.
  • The R6RS adds nested comments of the form #| ... |#.
The ERR5RS does not follow the R6RS here.
  • The R6RS adds datum comments of the form #;<datum>.
The ERR5RS follows the R6RS here.
  • The R6RS adds #!r6rs comments.
The ERR5RS does not follow the R6RS here.
  • The R6RS adds an infinite set of peculiar identifiers that begin with a hyphen followed immediately by a greater-than sign.
The ERR5RS does not follow the R6RS here.
  • The R6RS adds optional mantissa widths to numeric literals.
The ERR5RS does not follow the R6RS here.
  • The R6RS only recognizes #\space and #\newline in lower case.
The ERR5RS does not follow the R6RS here, but has case-insensitive character names to preserve backwards compatibility with the R5RS.
  • The R6RS suggests that implementors may wish to implement #!fold-case and #!no-fold-case directives, which act as comments but as a side effect fold and do not fold the case of characters in both identifiers and named characters (e.g., #\SPACE).
The ERR5RS adopts these directives, but not as comments. Instead, they return unspecified values. Their values, but not their side effect, may be hidden by making them into datum comments. They do not affect the interpretation of character names, as these are already case insensitive in the ERR5RS.