Posted to tcl by apn at Sat Jul 08 04:31:00 GMT 2023

  1. Jan,
  2.  
  3. Please see responses below.
  4.  
  5. > -----Original Message-----
  6. > From: Jan Nijtmans <jan.nijtmans@gmail.com>
  7. > Yes, I think that's the point in TIP #671: If glob encounters a
  8. > filename "\xE6", it will be auto-corrected by Tcl to mean "æ".
  9.  
  10. Auto "correct" is not the term I would use for this behavior!
  11.  
  12. > However, when trying to open this file:
  13. > % open æ r
  14. > couldn't open "æ": no such file or directory
  15. > Tcl will never produce such filenames, when the system-encoding
  16. > is UTF-8. So - to me - this issue has only minor consequences,
  17. > I don't think anyone filed a bug-report on this, even though this
  18. > 'problem' is already present in Tcl for ages.
  19.  
  20. It is a fair position to take (though not one I agree with) that if no one
  21. has complained about a bug it is not worth fixing. However, there is
  22. always the danger that the behavior is so egregious, on Far Eastern
  23. locales for example, that folks simply do not bother to use Tcl and
  24. thus never report it. I dunno. Maybe the above inconsistency bothers
  25. me more than it should.
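
To make the inconsistency concrete, here is a minimal sketch using only the encoding command. It assumes the tcl8 profile behavior (invalid bytes decoded as their Latin-1 equivalents) and tries the Tcl 9 syntax first, falling back to the Tcl 8.6 syntax, which has no -profile option. The variable names are mine, chosen for illustration.

```tcl
# Raw bytes of a file name as stored on disk: one byte, invalid as UTF-8.
set raw [binary format H* e6]

# Decode the way glob would under the tcl8 profile.
# Tcl 9 takes an explicit -profile; Tcl 8.6 rejects the extra arguments,
# so fall back to the plain form there.
if {[catch {encoding convertfrom -profile tcl8 utf-8 $raw} name]} {
    set name [encoding convertfrom utf-8 $raw]
}

# Encode the way open would when passing the name back to the system.
if {[catch {encoding convertto -profile tcl8 utf-8 $name} back]} {
    set back [encoding convertto utf-8 $name]
}

binary scan $raw H* rawHex
binary scan $back H* backHex
puts "bytes on disk     : $rawHex"
puts "bytes sent to open: $backHex"
```

If the two hex strings differ, the name returned by glob no longer identifies the file it came from, which is why the open in the quoted example fails.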
  26.  
  27. > First, let's comment on the discussion going on between
  28. > Ashok and Nathan. I'm reading it with joy, not because
  29. > it's a useful discussion: I see it as a fight throwing mud
  30. > at each other without really going into the main point.
  31.  
  32. Glad we brought some joy to you! But I cannot recall
  33. any mud slinging. If I did, I apologize to Nathan. And the debate
  34. was very much on point. I proposed a solution to a specific
  35. problem. Nathan suggested it was not necessary because
  36. of the iso8859-1 wrapping idiom. I disagreed and so there
  37. has been a lot of back and forth. If that's not relevant
  38. discussion, I don't know what is.
  39. >
  40. > So, let's start criticising the TIP:
  41. > 1) First, I like the name "lossless" more than "passthrough". The
  42. > reason is that "passthrough" suggests that the bytes/codepoints
  43. > are just passed through, which isn't what this profile does.
  44.  
  45. Opinion. Which is fine.
  46.  
  47. > 2) The TIP doesn't describe well what "lossless" does in other
  48. > encodings than UTF-8.
  49.  
  50. The TIP does not talk about utf-8 at all outside of the examples
  51. given so I do not understand that comment. It distinguishes (at
  52. the top of the Specification section) ASCII-compatible
  53. versus non-ASCII-compatible encodings and defines the behavior
  54. for each. The two classes obviously cover all encodings so I am
  55. at a loss as to what was unclear. I can clarify if you tell me.
  56.  
  57. > I think that "lossless" can (and
  58. > should) be implemented for all encodings, but just in one
  59. > direction: so
  60. > encoding convertto -profile lossless $enc \
  61. > [encoding convertfrom -profile lossless $enc $data]
  62. > should always return the unmodified $data for any encoding, but
  63. > encoding convertfrom -profile lossless $enc \
  64. > [encoding convertto -profile lossless $enc $data]
  65. > should throw an exception if $data contains code-points
  66. > not valid in the $enc encoding (such as lone surrogates
  67. > other than U+DC80 - U+DCFF)
  68.  
  69. There are a couple of issues with this. The first is a practical one.
  70. The current Tcl implementation has many places where errors
  71. (a return code of TCL_ERROR) are not tolerated. Either the caller
  72. panics or it does not check the error code and has no means to
  73. handle it. ckalloc() is the obvious example but the same is true
  74. in the encoding and I/O modules as well, in how internal transforms
  75. are implemented. This works because the 8.6 / tcl8 profile never
  76. fails. Any profile replacing the tcl8 profile must behave the same
  77. way (here the passthrough profile is replacing the tcl8 profile
  78. on glob etc.). Thus exceptions are not an option until Tcl's internals
  79. are fixed to handle and propagate errors, and I don't see that as feasible
  80. in the 9.0 time frame.
  81.  
  82. Second, I don't see the purpose in the above so it would help
  83. if you could elaborate on the intent or use case. In general, I
  84. tend to be somewhat skeptical about asymmetrical behavior
  85. as you might recall from the discussion of making the strict
  86. profile the default only in one direction.
  87.  
  88. > 3) I have my doubts about making the "lossless" profile the
  89. > default for filenames and environment variables. For commands
  90. > like "glob" and "open" I could imagine a "-profile" option for
  91. > glob/open. For environment variables, I don't know how to do that.
  92.  
  93. There is no point adding this profile then. I see little value if not
  94. used for system interfaces including file names.
  95.  
  96. > 4) In the TIP:
  97. > "Nevertheless, to mitigate this, this specification (following
  98. > PEP 383** will not map byte values < 128 into the U+DC00
  99. > surrogate space. Instead they are mapped to the encoding
  100. > specific replacement character"
  101. > That's reasonable for UTF-8/16/32, but for other encodings I see
  102. > no reason why \x7F cannot be mapped into U+DC7F if it's
  103. > missing as a valid code-point in the encoding.
  104.  
  105. TL;DR Because security holes (from using different encodings
  106. in the two directions) almost always arise in the ASCII code space,
  107. 0-127 are not passed through.
  108.  
  109. I tried to explain this in the Security consideration section but since
  110. my explanations are not clear, please see the references I listed
  111. at the bottom.
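
The rule can be sketched as follows. This is an illustration of the mapping only, not the TIP's implementation: escapedCodePoint is a hypothetical helper, and U+FFFD is assumed to stand in for the encoding-specific replacement character.

```tcl
# Hypothetical helper for illustration only; it is not part of Tcl or of
# the TIP's implementation. Given a byte that is invalid in the source
# encoding, return the code point it would decode to.
proc escapedCodePoint {byte} {
    if {$byte >= 0x80} {
        # High bytes are carried as lone low surrogates U+DC80 - U+DCFF,
        # following PEP 383, so they can be restored on output.
        return [expr {0xDC00 + $byte}]
    }
    # Bytes in the ASCII range are never carried through.
    # Assumption: U+FFFD plays the role of the replacement character.
    return [expr {0xFFFD}]
}

foreach byte {0xE6 0x7F} {
    puts [format "0x%02X -> U+%04X" $byte [escapedCodePoint $byte]]
}
```

Since no escaped code point ever stands for a byte below 128, re-encoding with a different encoding cannot produce characters such as "/" or a quote that were not visible in the decoded string.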
  112.  
  113. > 5) What should
  114. > encoding convertto -profile lossless $data
  115. > do for code-points from U+D800 - U+DBFF and from
  116. > U+DD00 - U+DFFF ? Since those codepoints cannot
  117. > be produced by "encoding convertfrom", I think it's
  118. > most logical to throw an exception in this case.
  119. >
  120.  
  121. As I said earlier, throwing exceptions is not an option for any
  122. profile that aims to replace tcl8. As is done for tcl8 and
  123. the replace profile, such code points are replaced with an
  124. encoding-specific fallback character.
  125.  
  126. Thank you for your review.
  127.  
  128. /Ashok
  129.