ERR5RS talk:Lexical Syntax
From SchemePunks
I have copied the structure of the page to facilitate discussion of the corresponding parts of the article. The intent was to copy the discussion bits and leave the draft bits in the article. --OLW 16:59, 10 September 2007 (PDT)
- Kumoyuki 03:21, 11 September 2007 (PDT) - I might have waited until some more substantive discussion had taken place, but whatever. The question is who's going to talk where?
Lexical Syntax
Background
Principles? In a syntax fight?
Unicode
Kumoyuki 03:37, 11 September 2007 (PDT) - It's worth noting that Bear has real issues with the language used for the adoption of Unicode in R6RS. Having thought about it a bit, I believe that the real problem is that the specification of Unicode is proceeding the wrong way around. Instead of explicitly adopting Unicode as the universal set for characters, I think that the correct thing is to specify a limited set of specific glyphs which have lexical significance, and then delegate to the implementation the responsibility for mapping those glyphs to chars and integers. This will allow oddball character sets like EBCDIC and/or Shift-JIS to host conformant implementations.
There are two issues that come immediately to mind.
- Characters which have been serialized using char->integer may not produce consistent results on implementations with divergent character sets.
- Source code is not truly portable - EBCDIC source may not even compile on an ASCII system.
Of these the latter is more serious (mostly because I consider the first to be a case of sloppy coding). But I'm not sure that source portability without transcoding is possible in any case.
- Has anyone out there done i18n (I assume) work in the glyph centric style that Bear likes? When I last looked hard at this, I didn't have a chance to go beyond planning, but it looked to be the right approach for the reasons outlined above. And I suspect but don't know that there are some number of systems that still have rather messy legacy character systems like Shift-JIS.
- The glyph-centric style, in which each "character" consists of the equivalent of a Unicode base code point plus a nondefective sequence of modifiers and variant selectors, has several advantages:
- Firstly, it eliminates several ways for character sequences to contain nonsense. All characters are maintained in normalized form all the time. This heads off a fair number of other problems before they happen, including string matching and the existence of characters with canonical decompositions in strings.
- Secondly, it buries the normalization form during string manipulation below programmer notice, which is exactly where it belongs. If you are doing something where you care, then you're not using the entities as characters at all; it is strictly a binary-format issue and should be expressed as operations on binary blobs. In fact it is reasonable to suppose that I/O is the only time when anyone should ever care, and that normalization form info can and should be buried in the ports code.
- Thirdly, it is much more stable in character counts under case changes and other linguistic operations. With only a single exception, (German Eszett), the number of glyph-characters does not change at all in case changes. Some may point at specialcasing.txt thinking that it contains exceptions, but every single character in that file except for Eszett ceases to exist in an eagerly normalized environment because they all have canonical decompositions.
- Fourthly, and I believe most important, it makes the abstraction much more closely match the model of language favored by end-users and linguists, who don't give a flying damn how many codepoints of some particular encoding they are talking about when they mention a particular character. These characters existed long before Unicode came along, and they were and are single entities in the minds of the people who use them. The fact that Unicode happens to take several codepoints to represent some of them does not change that fact in any way. A mismatch to the end-users' abstraction is now, and will always remain, a "speedbump" in learning and using the language - not a terribly serious hazard, but a completely unnecessary one that will hit every single user again and again and again. --Bear 14:46, 17 September 2007 (PDT)
- EBCDIC: how much is that needed nowadays? Linux on zSeries is said to be native Unicode and ASCII, so unless a Scheme program needs to talk to the legacy part of the world, it wouldn't seem that EBCDIC would be in the picture, but I have no idea what's common practice today. Hga 04:29, 11 September 2007 (PDT)
- I think EBCDIC is used as an example of a "gratuitously incompatible character set" rather than as an argument about a particular thing that we will have a real need to support. The point is not that the standard needs to support EBCDIC; the point is that if someone actually is, Gods help him or her, working in an EBCDIC environment, then the standard should not *forbid* a native scheme system working in that character set. EBCDIC may be just for the legacy dinosaur iron, but shift-JIS and etc, which also aren't unicode, are highly current and suffer from the same ban.
- With respect to issue 1: I don't know what David meant by "serialized using char->integer", but I do know that no portable R5RS code can possibly rely on char->integer returning non-Unicode results, so the proposed requirement that char->integer return Unicode scalar values cannot possibly break any portable R5RS code. It could conceivably break some implementation-dependent code, which would have to be revised to use some implementation-specific procedure such as char->ebcdic, but I don't know of any implementations that have both of the properties that are necessary to create a problem: (1) the implementation's char->integer is currently returning non-Unicode results, and (2) the implementation permits programmers to rely on those results by guaranteeing they will remain the same from session to session or from version to version. Until someone can identify an R5RS-conforming system with those properties, I will consider this to be a non-issue. (Wclinger 11:24, 12 September 2007 (PDT))
- The problem is that there are good technical reasons for using a different character set. If char->integer is specified as returning a unicode scalar value, what does it do when I feed it a character that can't be expressed as a unicode scalar value? MITscheme for example has used keystroke-descriptions such as ctrl-alt-J as part of its character set forever, and those flatly don't exist in unicode. There are also the infinite number of characters that unicode expresses as multiple codepoints, which although they exist in unicode aren't unicode scalar values. Finally, you have single shift-JIS characters and sinograms which may be rendered as any of several different unicode scalar values, because the unicode committee copied them from several different sources. --Bear 14:46, 17 September 2007 (PDT)
- The character set has nothing to do with the specification of char->integer that has been proposed for ERR5RS. If char->integer is passed an argument that doesn't correspond to a Unicode scalar value, then it returns a result that is outside the range of Unicode scalar values; since there are infinitely many such possible results, there is no conflict with any of the scenarios that concern you. (Wclinger 13:15, 20 September 2007 (PDT))
- Anyway, if we want to specify unicode behavior, I say it should be in a library and 'Unicode' should be in its name somewhere. char->unicode is the proper name for what you want, and it should be in a unicode support library.
- You are free to say that, but the goal here is to maximize compatibility with both the R5RS and R6RS specifications of char->integer. You are basically arguing about whether we should call this thing char->integer or char->unicode. As I have noted previously, no portable R5RS code can possibly depend upon char->integer behaving differently from the behavior proposed for ERR5RS, and no known implementation of R5RS Scheme has guaranteed its users that char->integer will behave differently from the behavior proposed for ERR5RS, so the proper name for this thing is the one used by both the R5RS and the R6RS for its closest equivalent, namely char->integer. (Wclinger 13:15, 20 September 2007 (PDT))
- Kumoyuki 14:35, 3 October 2007 (PDT) - actually since char->integer is massively underspecified in R5RS and conversely overspecified in R6RS I'm not sure that the compatibility is maximized by defining in either direction. How about the 'recommends but not requires' fudge that we are using elsewhere in this ERR5RS discussion?
- Kumoyuki 14:15, 13 September 2007 (PDT) - basically, yes. but this has to do more with multiple implementations in a networked system. It would actually be a colossally stupid way to do data interchange, but one that I can imagine some of my past colleagues using. I can think of one or two variants which might not quite be colossally stupid, as well. But if implementation A encodes char->integer one way and implementation B does it differently there is a potential data file format mismatch - which the programmer soundly deserves. This straw man was the best I could come up with for portability problems in external-representations of data if the mapping of char->integer was not clearly codified.
- With respect to issue 2: The statement of that issue confuses specification with representation. Nothing that has been proposed for ERR5RS would prevent source code from being represented using non-Unicode encodings, just as nothing that has been proposed would rule out any of the seven Unicode encoding schemes: UTF-8, UTF-16, UTF-16le, UTF-16be, UTF-32, UTF-32le, UTF-32be. With the R5RS, implementations are responsible for decoding the OS-specific representation of the source code into some internal representation (possibly but not necessarily the same one) that can be mapped onto the character set assumed by the R5RS. ERR5RS does not propose to change that in any way. (Wclinger 11:24, 12 September 2007 (PDT))
- Kumoyuki 14:15, 13 September 2007 (PDT) - You are correct, and that is a common confusion. It arises because we expect that READ groks source code the same way that the compiler does. As you point out, that does not have to be the case, but it *is* a usability issue connecting to why char->integer should remain underspecified. There is a natural character set and character-set encoding (different things!) for any given O/S, filesystem and implementation combination. I would expect any implementation to choose the most harmonious (and efficient) character set and encoding for its target environment because the representation of characters extends all the way to userland. And users will feel the pain if there is a mismatch. Issues of character set and encoding are best left in the individual programmer's hands. I do think that char->unicode is a highly useful addition, but not if it's going to be used to test for particular glyphs.
- And now that I've gotten that far, perhaps the real argument in favor of standardizing scheme characters as Unicode scalar values is that it gives a clear and portable meaning to the generic character escaping mechanisms for strings and symbols, but in the interest of full disclosure, I'm not entirely sure that I think being able to syntactically embed calls to char->integer in strings and symbols is a Good Thing (TM).
- ...and I'd really like to get Bear's opinions on all of this, as well
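Bear's points above about eager normalization and stable character counts can be checked with a small experiment. The following is only a rough sketch in Python (chosen because its standard library ships the Unicode tables), not a proposal for ERR5RS. The helper name glyph_count is hypothetical, and counting NFD code points with combining class 0 is a crude stand-in for the "base code point plus modifiers" notion of a character.

```python
import unicodedata

def glyph_count(s):
    """Approximate number of glyph-characters in s: code points of the
    NFD form whose canonical combining class is 0 (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", s)
    return sum(1 for ch in decomposed if unicodedata.combining(ch) == 0)

composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# As raw code-point sequences the two spellings differ in content and length,
# but they normalize to the same thing and count as one glyph-character each.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
counts = (len(composed), len(decomposed),
          glyph_count(composed), glyph_count(decomposed))

# U+01F0 (j with caron) appears in SpecialCasing.txt: its uppercase form is
# J plus a combining caron, so the code-point count changes under upcasing
# while the glyph count does not.
j_caron = "\u01f0"
j_caron_upper = j_caron.upper()

# Eszett is the exception mentioned above: one glyph becomes two.
eszett_upper = "\u00df".upper()
```

Under these assumptions the code-point length of the j-with-caron example grows from 1 to 2 on upcasing while glyph_count stays at 1, and only the Eszett case changes the glyph count.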
If we are going to allow character sets that are subsets or supersets of Unicode, we should consider what happens when integer->char is passed an invalid integer and whether we wish to provide a means to determine which integers are valid. In this context, "valid" means an integer that can be returned by "char->integer" in a given implementation. In the R5RS, the behaviour of integer->char on invalid integers is not well defined, but I think the combined requirements on integer->char and char->integer in §6.3.4 mean that it cannot return a character. Therefore, an unobtrusive change might be to require that integer->char return a non-character, or possibly #f, for invalid integers. However, this would mean that integer->char would be forbidden from signaling an error for invalid integers, and this will probably be unacceptable to many. One alternative would be to leave the specification of integer->char as it is and invent a new procedure to determine whether an integer is valid. A couple of ideas are "integer->possible-char" which would return #f for invalid integers but otherwise behave as integer->char (i.e., it would behave as the modified integer->char mentioned previously) and "valid-for-integer->char?" which would return #t or #f depending on whether its argument is valid. I am sure someone with more creativity than I can come up with better names. Alan Watson
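To make the two candidate procedures concrete, here is a minimal model written in Python rather than Scheme. Everything in it is an assumption for illustration: the CharSet class is hypothetical, Python's None stands in for #f, and the "implementation" is modelled as nothing more than a set of integers that char->integer may return.

```python
from typing import Iterable, Optional

class CharSet:
    """Hypothetical model of one implementation's character repertoire."""

    def __init__(self, valid_integers: Iterable[int]):
        self._valid = frozenset(valid_integers)

    def char_to_integer(self, ch: str) -> int:
        n = ord(ch)
        if n not in self._valid:
            raise ValueError("character not in this implementation's set")
        return n

    def valid_for_integer_to_char(self, n: int) -> bool:
        # Models "valid-for-integer->char?": a pure predicate.
        return n in self._valid

    def integer_to_possible_char(self, n: int) -> Optional[str]:
        # Models "integer->possible-char": None (for #f) on invalid input.
        return chr(n) if n in self._valid else None

    def integer_to_char(self, n: int) -> str:
        # The unchanged integer->char is free to signal an error instead.
        if n not in self._valid:
            raise ValueError("invalid integer for integer->char")
        return chr(n)

# An ASCII-only implementation, i.e. a strict subset of Unicode.
ascii_only = CharSet(range(128))
```

With this model, ascii_only.integer_to_possible_char(0x3BB) yields the false value while ascii_only.integer_to_char(0x3BB) raises, which is the split in behaviour the paragraph above describes.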
What happens if an implementation encounters a character or hex-escape that it does not support? Do we require it to signal an error or do we leave this unspecified? Alan Watson
Case Sensitivity
ERR5RS should require implementations to support the two flags recommended by the R6RS non-normative appendix B while requiring portable programs:
- to use consistent case for all identifiers and symbols, so the program will run in case-sensitive systems
- not to use distinct identifiers and symbols that differ only by case, so the program will run in case-insensitive systems
(Wclinger 11:38, 10 September 2007 (PDT))
- Kumoyuki 14:03, 10 September 2007 (PDT) - I think ERR5RS should require the flags, not recommend them. That will ensure a greater uptake among the R6RS implementations as well, and while I don't like case-folding identifiers, Brian Harvey made a good case for them in an educational setting
- That's okay with me, provided ERR5RS doesn't require those flags to be treated as comments, and doesn't require the peculiar semantics that was originally proposed for them when they are embedded within an external representation. The R5.97RS draft fixed the second problem but not the first. As for why comments that have side effects are a bad idea: (1) They violate the Law of Least Astonishment, because programmers don't expect comments to have side effects. (2) Comments with side effects may require hand-coding within the lexer, because scanner-generation tools generally aren't set up to handle such things; this point has special relevance to Scheme because its lexical syntax is so complex that hand-coded implementations, such as the reference implementation of the R6RS reader, are likely to contain errors. (Wclinger 11:41, 12 September 2007 (PDT))
- I confess that I think I must still be taking the nap I just woke up from, and having a dream (or rather nightmare) in which you just said the R6RS specifies comments with side effects (that aren't e.g. directives to EMACS)....
- I find this to be in strong competition for the most astonishing thing in the R6RS, this is so self-evidently wrong that words fail (although yours are a good start). Hga 13:16, 12 September 2007 (PDT)
- I would suggest that #!fold-case and #!no-fold-case be restricted to appear only at the top-level and to not be comments but rather evaluate to something (perhaps symbols of the same spelling). Will has explained above why they should not be comments. Nevertheless, making them non-comments is incompatible with R6RS. However, if they can only appear at the top-level (i.e., as a <command or definition> in §7.1.6 of the R5RS), then the observable effects of that incompatibility are minimized and I think most real uses would still be covered. Alan Watson 13:57, 13 September 2007 (PDT)
- My suggestion would probably work quite well for programs, but would not work well for data. On the other hand, we could have a variant of read that folded case for symbols. If we were starting from scratch, combining my suggestion for programs with such a read might well have been the best solution (not least because it does not require modification to data).
- Restricting #!fold-case and #!no-fold-case to the top level makes just as much sense for data as for programs. Indeed, both read and get-datum must recognize the difference between top-level data and embedded data in order to know whether to raise an exception on end-of-file or to return an end-of-file object. (Wclinger 12:46, 16 November 2007 (PST))
- However, I worry about incompatibilities with the R6RS. Semantic comments in programs are quite common; for a recent example see OpenMP. Programs and data are not the same, of course. Are semantic comments really such a problem for parser generating tools? Which ones? If they are a problem, is this not a weakness in the tools rather than in the formal grammar? Is it not possible to have the lexer return magic tokens for the flags, which cause the read procedure to switch state and call the lexer again to obtain a real datum? Alan Watson
- The R6RS doesn't require #!fold-case and #!no-fold-case to be comments; it doesn't even require implementations to accept those flags at all. They are mentioned only in a non-normative appendix, which R6RS-conforming implementations are free to ignore and ERR5RS is even freer to ignore. As for difficulty of implementation, I don't know how many parser generation tools allow action routines to call the parser recursively; some do while others probably don't. Anything out of the ordinary in an action routine makes parsers harder to construct; this isn't a show-stopper, but it is definitely a consideration. Consider Ikarus, for example, which is being marketed on the basis of R6RS conformance but has yet to implement the mandatory #!r6rs flag, presumably because the implementation of semantic comments takes some extra effort with the tools Aziz is using. (Wclinger 12:46, 16 November 2007 (PST))
- How about restricting the flags to the top level but making them comments nevertheless. (In my earlier comment, I was thinking about restricting the flags to the top level but not making them comments; this indeed causes problems in data.) With this, there is no need for recursion in the (potentially machine-generated) lexer. The read procedure calls the lexer; if the lexer returns one of the flags, the read procedure changes the state appropriately and tail-calls itself. Alan Watson 13:40, 16 November 2007 (PST)
- Then we're back to comments with side effects. That wouldn't be so bad by itself, but the flag syntax is the only form of extension permitted by the R6RS, so implementations of the R6RS have to use flags for things like #!unspecified, which is definitely not a comment. Using flags for both comments and non-comments is confusing. The R6RS is stuck with that confusion because of #!r6rs, but ERR5RS should not make the same mistake. I have added the two flags to the essential syntax of ERR5RS, but they denote unspecified values (which effectively restricts them to top level) and are specified to have the same side effect even when they appear as datum comments (which I also added to ERR5RS syntax). (Wclinger 08:00, 17 November 2007 (PST))
- I am not thrilled by this. First of all, I think the ERR5RS should avoid needless incompatibilities with both the R5RS and the R6RS. Making the case-selection flags return datums is an incompatibility with the R5RS and is needless, in my opinion. Second, the ERR5RS does not have to follow the R6RS and restrict extensions to be of the form #!. The ERR5RS could allow implementations to use other lexemes for extensions (such as the common #<foo> form) and, for example, could allow or encourage them to treat all lexemes of the form #! as comments, which would remove one source of confusion. Third, you use datum comments to hide the incompatibility between your suggestion and the R6RS, but I am against requiring datum comments in the ERR5RS. If implementations want to include them, I have no problem, but I do not think they need standardizing, both because they are mainly of use in development and because they require that the whole parser be reentrant (unless I am mistaken), and you have reminded us that this could be a problem for some parsers.
- My suggestion would be to include the flags as atmosphere and treat them as comments. Given the grammar of <lexemes>, this would restrict them to the top level, would be compatible with their use at the top level in the R6RS, and would avoid the nasty scoping issues of allowing them to appear in compound data. Alan Watson
- The light may be dawning. I think there are two distinct problems here: reading the flags and acting on them. Therefore, I suggest the following: the flags can appear anywhere a normal datum can; each returns a datum that is distinct from the standard R5RS data and from the end-of-file object; they have no effect on the standard read procedure; but if one of these datums appears at the top level in a program, then the compiler/interpreter (and not the reader) treats subsequent identifiers appropriately. This allows me to write a procedure to transform another program and duplicate these flags in the output and it allows me to write a procedure that emulates the R6RS read procedure with respect to its treatment of these flags. The ERR5RS might provide this procedure; in any case it should provide both a case-sensitive and a case-insensitive read procedure. Alan Watson
- Alan claimed that "Making the case-selection flags return datums is an incompatibility with the R5RS...". That is untrue. The R5RS does not have any case-selection flags, does not specify any lexical syntax of the form #!..., and explicitly permits lexical extensions that do not conflict with the lexical syntaxes it specifies. (Wclinger 11:06, 19 November 2007 (PST))
- I meant that it was an incompatibility with the appendix to the R6RS, which I suspect will be a de facto standard. Sorry for the confusion. Alan Watson
- I agree that ERR5RS should not restrict extensions to be of the form #!.... If we are designing a lexical extension we expect to use in both ERR5RS and R6RS data, however, then it will have to be an extension the R6RS will allow. (Wclinger 11:06, 19 November 2007 (PST))
- Case-insensitive reading of symbols is a lossy operation. No program, not even a compiler/interpreter, can recover the original case once it has been folded by the reader. If we want ERR5RS programs to be able to read both R5RS and R6RS data (and we do), then there has to be a way to force the reader not to fold case. Case-folding should be a property of ports, not readers; otherwise data won't be able to tell readers how it wants to be read. (Wclinger 11:06, 19 November 2007 (PST))
- With my latest suggestion, that the flags return data and have no side effects on the standard reader, a program that wanted to see the original case of the data would have to arrange to read the data in a case-sensitive manner. I do not see this as a problem. I am happy for case-sensitivity to be a (perhaps changeable) property of ports. I believe my suggestion combined with selectable case-sensitivity in ports allows us: to specify the case-sensitivity of programs; to write a R6RS-appendix-compatible read procedure; and to easily specify the case-sensitivity with which data should be read. I think these are the problems we are trying to solve. Furthermore, I think it does it in a Schemey manner by avoiding having magic comments or flags with side effects; the full expression of the program or the data is also data. Alan Watson
- I understand the need for selective case insensitivity in programs. However, I am still waiting for someone to explain to me why case-sensitivity flags are needed for data. Can't this be handled at the level of the port or reader? Without such a justification, Will's combination of yucky datum comments and yuckier magical-side-effect flags leaves me reaching for the Pepto Bismol. Alan Watson
- Yes, it can be handled at the port level, and I believe the #!fold-case and #!no-fold-case flags that are suggested by a non-normative appendix of the R6RS should affect only the port from which they are read, not the global state of the read and/or get-datum procedures. (Wclinger 17:59, 24 November 2007 (PST))
- By the way, it was Chris Hanson who suggested those flags, not me. (Wclinger 17:59, 24 November 2007 (PST))
Here are some relevant URLs:
- http://lists.r6rs.org/pipermail/r6rs-discuss/2006-November/000922.html
- http://lists.r6rs.org/pipermail/r6rs-discuss/2006-November/000923.html
- http://lists.r6rs.org/pipermail/r6rs-discuss/2006-November/000950.html
- http://lists.r6rs.org/pipermail/r6rs-discuss/2006-November/001045.html
- http://lists.r6rs.org/pipermail/r6rs-discuss/2006-November/001088.html
- http://practical-scheme.net/wiliki/schemexref.cgi?Concept%3aCaseSensitivity
- http://www.cs.utah.edu/~mflatt/scheme-case-sensitivity-poll/
- You have suggested that they be adopted in the ERR5RS. Perhaps I am wrong, but the sense I get from Chris Hanson's original suggestion cited above is that he was most concerned with case-sensitivity in programs rather than data. Perhaps I have not explained myself clearly enough. I am suggesting: (a) including the flags in the reader and having them correspond to values (as in your suggestion); (b) not having them produce side effects on ports or the reader; (c) requiring that their presence at the top level in programs produce an observed behaviour compatible with the appendix to the R6RS; and (d) having a pair of procedures that take a port and cause it to be case-sensitive or case-insensitive (or some equivalent interface). This has useful compatibility with the appendix to the R6RS, but does not require that the flags be comments or have a special significance to the reader, does not require datum comments, and allows one, if one so desires, to write a reader that is completely compatible with the appendix to the R6RS. It has the disadvantage that for standard use the programmer must know whether the data should be read in a case-sensitive or case-insensitive manner (and convert the port appropriately), but I think this will apply in most cases and, as I have mentioned, it is possible to implement a fully-compatible reader if necessary. I think this suggestion covers the most common cases simply and elegantly but allows for generality. It can be criticized because it is not completely compatible with the appendix to the R6RS, but the other suggestions are also incompatible. Alan Watson 10:14, 26 November 2007 (PST)
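Alan's four-part suggestion, (a) through (d), can be modelled in a few lines. This is only a sketch in Python under loud assumptions: the "reader" handles nothing but whitespace-separated symbols, and the names Port, Flag, Symbol, read, load_program, set_port_case_sensitive and set_port_case_insensitive are all hypothetical. The point is the division of labour: read returns the flags as ordinary data and never changes state, while the top-level loader is the part that acts on them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Symbol:
    name: str

@dataclass(frozen=True)
class Flag:
    name: str   # a datum distinct from symbols and from the EOF object

FOLD_CASE = Flag("fold-case")
NO_FOLD_CASE = Flag("no-fold-case")
EOF = object()

class Port:
    """Toy input port; case sensitivity is a property of the port (item d)."""
    def __init__(self, text, fold_case=False):
        self.tokens = text.split()
        self.pos = 0
        self.fold_case = fold_case

def set_port_case_insensitive(port):
    port.fold_case = True

def set_port_case_sensitive(port):
    port.fold_case = False

def read(port):
    """Read one datum. Flags come back as values (item a) and have no
    side effect on the port or on the reader (item b)."""
    if port.pos >= len(port.tokens):
        return EOF
    token = port.tokens[port.pos]
    port.pos += 1
    if token == "#!fold-case":
        return FOLD_CASE
    if token == "#!no-fold-case":
        return NO_FOLD_CASE
    return Symbol(token.casefold() if port.fold_case else token)

def load_program(port):
    """Top-level loader: the loader, not the reader, honours the flags
    (item c) by folding the identifiers that follow a #!fold-case."""
    folding = False
    forms = []
    while True:
        datum = read(port)
        if datum is EOF:
            return forms
        if datum == FOLD_CASE:
            folding = True
        elif datum == NO_FOLD_CASE:
            folding = False
        else:
            forms.append(Symbol(datum.name.casefold()) if folding else datum)
```

Because read itself never folds unless the port was switched explicitly, a program that wants the original case of data simply leaves the port case-sensitive, and a program transformer can copy the flag data through to its output unchanged.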
My thinking is that ERR5RS should recommend that implementations support the two flags recommended by the R6RS non-normative appendix B while requiring portable programs:
- to use consistent case for all identifiers and symbols, so the program will run in case-sensitive systems
- not to use distinct identifiers and symbols that differ only by case, so the program will run in case-insensitive systems
(Wclinger 11:38, 10 September 2007 (PDT))
- Rblaa 2007/09/15 - The standard case against case-folding is that in the presence of Unicode, folding is not sensical for all characters. These problems are avoided by using case-sensitive identifiers. Regarding the use of case folding to aid in education, presumably this is to emphasize certain symbols via the use of upper case, etc. How are case-sensitive languages like Java taught? Why can't one simply use color, italics, or bolding to emphasize the desired identifiers when presenting or discussing? At any rate, if R6RS defined a way for case-folding to be turned on or off, then that is probably a reasonable compromise, especially if one needs to ensure correct execution of legacy programs.
- Although this is indeed a standard argument against case-folding, it does not stand up under examination. Case-insensitivity depends upon case-folding of strings, not characters, and locale-independent case-folding of strings is not only sensible but is moderately well-defined by the Unicode standard and standard annexes. Furthermore we have a reference implementation of locale-independent case-folding written in portable R5RS/R6RS Scheme. (Wclinger 10:42, 18 September 2007 (PDT))
- Another argument against case-folding is the linking to external symbols in e.g. C and machine language programs, which traditionally are case sensitive. The presence of case-folding makes name-mangling schemes necessary.
- If Unicode was allowed in chars and strings but not in identifiers, the problem would vanish. IMO restricting identifiers to ASCII is a good idea anyway. Unicode is hard to edit and adds lots of extra visual ambiguities like '1' versus 'l'. To support Unicode (or other character sets) as data, it would be sufficient to extend the domain and range of integer->char and char->integer. --Nmh 00:56, 16 September 2007 (PDT)
- Do non-Latin cultures, e.g. the Japanese, program in ASCII (possibly romaji) or in their own glyph sets (kana and possibly kanji in the case of the Japanese) when they have a choice? If they will use native glyphs when given a choice, then ERR5RS should support that if it's not too much trouble.
- If no one knows of an affirmative answer to the above, we could ask the author of Gauche; it is a design goal of his implementation, see this Google machine-translated page. Hga 03:59, 16 September 2007 (PDT)
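Will's distinction above between case-folding of characters and case-folding of strings can be seen in a small experiment. The sketch below uses Python's str.casefold, which applies the locale-independent Unicode full case folding; it is not the portable Scheme reference implementation he mentions, and the helper names fold_per_character and same_identifier are hypothetical.

```python
def fold_per_character(s):
    """Naive folding: map each character to a single lowercase character,
    leaving it alone when no one-to-one mapping exists."""
    return "".join(ch.lower() if len(ch.lower()) == 1 else ch for ch in s)

def same_identifier(a, b):
    """Case-insensitive comparison by folding whole strings."""
    return a.casefold() == b.casefold()

lower_form = "Stra\u00dfe"   # "Strasse" spelled with Eszett
upper_form = "STRASSE"

# Character-at-a-time folding cannot relate the two spellings, because the
# Eszett has no single-character counterpart to map to.
per_char_equal = fold_per_character(lower_form) == fold_per_character(upper_form)

# String folding maps the Eszett to "ss", so the identifiers compare equal.
string_equal = same_identifier(lower_form, upper_form)
```

Here per_char_equal comes out false and string_equal comes out true, which is the sense in which the "folding is not sensible for all characters" objection applies to characters but not to strings.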
Square Brackets
My thinking is that ERR5RS should recommend that ERR5RS programs not use square brackets, and should also recommend that implementations allow square brackets as in the R6RS. (Wclinger 11:38, 10 September 2007 (PDT))
- Kumoyuki 14:07, 10 September 2007 (PDT) - I hate square brackets in Lisp code; does anyone outside of the PLT community use them heavily? That said, I think this may be a useful middle ground for ERR5RS. IIRC, R5RS allowed them through silence. R6RS explicitly requires them. Clearly something needs to be said at this point.
- Sjamaan 05:33, 13 September 2007 (PDT) - What about supporting them in a library that works by extending the reader? That way it can be kept out of the core. This would need very extensive reader macro support, but if quote/quasiquote can be generalized with reader support, surely this can be done too.
- The quote/quasiquote/etc abbreviations are special purpose hacks that are generally built into a Scheme reader, not added as reader macros. Adding a special hack that treats square brackets as an abbreviation would work for ERR5RS code, but would cause problems for external representations of data that contain square brackets.
- Kumoyuki 15:01, 3 October 2007 (PDT) - I really hate this. I have written libraries in C# and Java to function as Scheme readers and, while it's not a big deal to support squares in addition to parens, it is an annoyance that does nothing to help the essential simplicity of S-expressions as a representation of structured data (including programs). It is probably not practical to ban squares, but I wish we could.
- For the reader macro approach to work, ERR5RS would have to specify an extensible reader as an essential component of ERR5RS. That is problematic because (1) several implementors of the R5RS already have extensible readers, and would be likely to resist any mandate for a second, probably incompatible form of extensibility; (2) the reader's performance exerts a strong influence on users' impressions of the overall performance of a Scheme system, and extensible readers may not perform as well as non-extensible readers; (3) some implementors of the R5RS that would otherwise consider implementing ERR5RS have written their reader in languages other than Scheme (because a reader written in Scheme would be too slow for interpreted systems or for some of the slower compiled systems), so a reference implementation written in Scheme wouldn't entirely solve the problem. It's hard to argue that square brackets (by themselves) are important enough to justify the pain that would be involved in standardizing on an extensible reader and implementing it in all those implementation languages. (Wclinger 11:27, 18 September 2007 (PDT))
- OLW 16:59, 10 September 2007 (PDT) - There is a strong culture of using square brackets at Indiana University. The PLT practice probably originated there. Recommending against them will not go down well at one of Scheme's traditional strongholds, so I recommend against recommending against square brackets.
- Kumoyuki 03:21, 11 September 2007 (PDT) - Now that Will posted the actual square-bracket text from R5RS, I'm less happy about this than I was. I'm thinking very hard about how to build a user-level extensible reader and only having curlies left feels ... restrictive. OTOH, the #token( convention is fairly well accepted. Just not sure how to painlessly extend that to the syntax for numbers :P
- If I understand correctly, if implementations are only recommended to support square brackets, then portable ERR5RS programs must be required not to use square brackets. It would be good to have a consistent criterion for ERR5RS to enumerate a specific subset of non-portable programs, given that a full constructive description of non-portability would of course be infinite. I am a little worried that if ERR5RS contains 16 binary recommendations, there will be 2^16 possible combinations of these that any implementation might support, so the spectrum of being close to portable but not quite becomes large and may be confusing for the user. (AvT)
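To make the reader-side cost discussed in this thread concrete, here is a minimal sketch in Python (not Scheme, and not taken from any implementation mentioned above) of a reader that accepts square brackets as an alternative to parentheses and insists that each opener is closed by its own kind of bracket. The names `read_brackets` and `_read_one` are hypothetical, and atoms are left as plain strings.

```python
import re

# Each opening delimiter and the closer it requires.
CLOSER = {"(": ")", "[": "]"}

# Assumption: a token is a single bracket or a run of non-space, non-bracket characters.
BRACKET_TOKEN = re.compile(r"\s*([()\[\]]|[^\s()\[\]]+)")


def _read_one(tokens, pos):
    """Read one datum starting at tokens[pos]; return (datum, next position)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok in CLOSER:
        want = CLOSER[tok]
        items = []
        pos += 1
        # Collect elements until any closing delimiter appears.
        while pos < len(tokens) and tokens[pos] not in (")", "]"):
            item, pos = _read_one(tokens, pos)
            items.append(item)
        # The closer must be the partner of the opener.
        if pos >= len(tokens) or tokens[pos] != want:
            raise SyntaxError("expected " + want)
        return items, pos + 1
    if tok in (")", "]"):
        raise SyntaxError("unexpected " + tok)
    return tok, pos + 1


def read_brackets(text):
    """Read exactly one datum from text, treating [ ] like ( )."""
    tokens = BRACKET_TOKEN.findall(text)
    datum, pos = _read_one(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError("trailing tokens after datum")
    return datum
```

In this sketch `read_brackets("(let ([x 1]) x)")` and `read_brackets("(let ((x 1)) x)")` produce the same nested list, while a mismatched form such as `(a]` raises `SyntaxError`. The extra work is one table lookup per list, which matches the "annoyance rather than big deal" view expressed above.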
Formal Grammar
R5RS Lexical Grammar
Changes That Should Have Been Made by the R6RS
Will wrote: "Every adjacent pair of characters is separated by <intertoken space>." I don't understand this. Would someone enlighten me, please? Alan Watson
- The formal grammar in R6RS 4.2.1 defines <interlexeme space> as zero or more occurrences of <atmosphere>. Zero occurrences of anything is nothing. When nothing separates a pair of adjacent characters, therefore, those characters are separated by <interlexeme space>. That's nuts. The problem was reported by formal comment 242 ( http://www.r6rs.org/formal-comments/comment-272.txt ), but the editors didn't want to fix it "this late in the process". (Wclinger 12:54, 16 November 2007 (PST))
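The point about "zero or more occurrences" can be checked mechanically. The following Python sketch assumes a simplified <atmosphere> (whitespace or a `;` line comment only) and uses hypothetical names throughout; it is an illustration of the argument, not a rendering of the R6RS grammar.

```python
import re

# Assumption: <atmosphere> is simplified to whitespace or a ";" line comment.
ATMOSPHERE = r"(?:\s|;[^\n]*)"

# Zero or more occurrences, as in the R6RS definition described above.
ZERO_OR_MORE = re.compile(ATMOSPHERE + "*")
# One or more occurrences, shown only for contrast.
ONE_OR_MORE = re.compile(ATMOSPHERE + "+")


def matches_between(pattern, text, i):
    """True if pattern matches at the boundary between text[i - 1] and text[i]."""
    return pattern.match(text, i) is not None


text = "abc"
# Every boundary inside "abc" is matched by the zero-or-more form (empty match).
zero_or_more = [matches_between(ZERO_OR_MORE, text, i) for i in range(1, len(text))]
# No boundary inside "abc" is matched by the one-or-more form.
one_or_more = [matches_between(ONE_OR_MORE, text, i) for i in range(1, len(text))]
```

Here `zero_or_more` is `[True, True]` and `one_or_more` is `[False, False]`: with the zero-or-more definition, the characters of a single identifier count as "separated", which is the oddity being objected to.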
Meaningless Changes Made by the R6RS
Uncontroversial Changes Made by the R6RS
Slightly Controversial Changes Made by the R6RS
# as a Delimiter
Why is this controversial? Alan Watson
I doubt whether it's very controversial at all, but the fact that some implementations have allowed # to appear within identifiers means the programmers who have introduced latent portability bugs into their programs by using that extension are going to be inconvenienced if ERR5RS doesn't allow it. In my opinion, however, ERR5RS should not be overly constrained by latent bugs in existing programs. Compatibility with R5RS-conforming code is far more important than bug-compatibility with R5RS-violating code. (Wclinger 13:03, 16 November 2007 (PST))
Insignificant Digits Marked by #
The draft R6RS removes these. I think part of the reason is that at least some of the editors thought that the |p notation was an adequate replacement. I disagree. Removing # seems like needless tinkering. Alan Watson
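For readers who have not met the notation: in the R5RS grammar, trailing digits of a decimal number may be written as `#`, and a number written that way is inexact by default. A rough Python sketch of the unsigned decimal forms follows; it omits signs, radix and exactness prefixes, and exponent suffixes, and the names `HASHED_DECIMAL` and `parse_hashed_decimal` are hypothetical.

```python
import re

# Unsigned decimal forms with optional trailing "#" digits (simplified from R5RS):
#   digits #*    |   . digits #*    |   digits . digits* #*    |   digits #+ . #*
HASHED_DECIMAL = re.compile(
    r"[0-9]+#*|\.[0-9]+#*|[0-9]+\.[0-9]*#*|[0-9]+#+\.#*"
)


def parse_hashed_decimal(text):
    """Return the inexact value of text, reading each '#' as an insignificant 0."""
    if HASHED_DECIMAL.fullmatch(text) is None:
        raise ValueError("not a decimal with trailing # digits: " + text)
    # A float stands in for a Scheme inexact number in this sketch.
    return float(text.replace("#", "0"))
```

Under these assumptions `parse_hashed_decimal("15##")` gives `1500.0` and `parse_hashed_decimal("12#.#")` gives `120.0`, while `1#5` is rejected because a `#` may not be followed by a real digit.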
Controversial Changes Made by the R6RS
Kumoyuki 03:44, 11 September 2007 (PDT) - Nestable block comments are stupid. Datum comments are the answer to gently excising that annoying debug code. I vote for *not* recommending the former and requiring the latter :)
- Re: "stupid": well, maybe, but my personal (and not exactly defensible) style when doing some sorts of C/C++ work was to use #if 0/#endif comments, sometimes nested. This is easy, fast, sometimes very useful, and in the #|/|# form a lot prettier. Is there real harm to implementations in recommending or requiring nestable block comments?
- Note that I don't use these for debug code, but rather in the debugging process when I want an informal and very fast way to stub out some code. Why is this disliked? Hga 03:57, 11 September 2007 (PDT)
- Kumoyuki 14:50, 11 September 2007 (PDT) - I dislike it because it interferes with virtually any closing-parenthesis placement style. Datum comments don't. And lisp is supposed to be s-expression centric, innit?
- True enough. When I think about it, my preference for line oriented block comments is partly aesthetic (and therefore not subject to useful discussion), partly habit from C and Lisp Machine/Common Lisp. Neither is a strong argument, and I can do fine with datum comments. Hga 18:44, 11 September 2007 (PDT)
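On the question of implementation cost raised in this thread, nestable `#| ... |#` comments need little more than a depth counter in the reader. The Python sketch below is an illustration under that assumption; `skip_block_comment` is a hypothetical helper, not code from any Scheme system.

```python
def skip_block_comment(text, start):
    """Return the index just past the nested #| ... |# comment that begins at start."""
    if not text.startswith("#|", start):
        raise ValueError("no block comment at this position")
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("#|", i):
            # An opener, including the first one, increases the nesting depth.
            depth += 1
            i += 2
        elif text.startswith("|#", i):
            # A closer decreases it; depth 0 means the outermost comment has ended.
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unterminated block comment")
```

For the input `#| a #| b |# c |# (d)`, the function returns the index just after the second `|#`, so reading resumes at ` (d)`. Note that the scan works on characters, not data, which is why such a comment can begin or end in the middle of an s-expression.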
Token versus Lexeme
R5RS uses token. R6RS uses lexeme. I've plumped to go with the R6RS term, not because I think it is better, but to avoid (what I hope will be) two similar standards using different terms for the same thing. (Something the editors of the R6RS might have borne in mind.) If you disagree, shout. Alan Watson
Datum Comments
The proposed ERR5RS syntax allows <atmosphere> (including comments) between the #; and the datum; the R6RS syntax only allows <interlexeme space> (excluding comments). Is this intended? Does it create problems with nested comments, e.g., "#; #; <datum> <datum>"? Alan Watson 14:28, 3 December 2007 (PST)
- My mistake. The R6RS definition of <interlexeme space> includes comments. Therefore, the ERR5RS and R6RS syntax for datum comments are effectively the same. Alan Watson 12:39, 20 December 2007 (PST)
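The nested case asked about above can be traced with a small reader sketch in Python. It assumes that a datum comment counts as part of the space that may precede a datum, so `#; #; <datum> <datum>` discards both data. The tokenizer handles only whitespace, parentheses, `#;` and atoms (no line or block comments), and all function names are hypothetical.

```python
import re

# Assumption: tokens are "#;", a parenthesis, or a run of other non-space characters.
DATUM_TOKEN = re.compile(r"\s*(#;|[()]|[^\s()]+)")


def skip_datum_comments(tokens, pos):
    """Skip every '#; <datum>' starting at pos; return the next position."""
    while pos < len(tokens) and tokens[pos] == "#;":
        # Read the commented-out datum and throw it away.
        _, pos = read_datum(tokens, pos + 1)
    return pos


def read_datum(tokens, pos):
    """Read one datum (atom or list) at pos; return (datum, next position)."""
    pos = skip_datum_comments(tokens, pos)
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        items = []
        pos += 1
        while True:
            pos = skip_datum_comments(tokens, pos)
            if pos >= len(tokens):
                raise SyntaxError("missing )")
            if tokens[pos] == ")":
                return items, pos + 1
            item, pos = read_datum(tokens, pos)
            items.append(item)
    if tok == ")":
        raise SyntaxError("unexpected )")
    return tok, pos + 1


def read_all(text):
    """Read every datum in text, dropping datum comments."""
    tokens = DATUM_TOKEN.findall(text)
    data = []
    pos = skip_datum_comments(tokens, 0)
    while pos < len(tokens):
        datum, pos = read_datum(tokens, pos)
        data.append(datum)
        pos = skip_datum_comments(tokens, pos)
    return data
```

With this reading, `read_all("#; #; a b c")` yields `["c"]`: the inner `#;` removes `a`, the outer `#;` then removes `b`. Likewise `read_all("(x #;(debug y) z)")` yields `[["x", "z"]]`, which is the "gently excising debug code" use mentioned earlier on this page.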

