Posted to tcl by apn at Sat Jul 08 04:31:00 GMT 2023
- Jan,
- Please see responses below.
- > -----Original Message-----
- > From: Jan Nijtmans <jan.nijtmans@gmail.com>
- > Yes, I think that's the point in TIP #671: If glob encounters a
- > filename "\xE6", it will be auto-corrected by Tcl to mean "æ".
- Auto "correct" is not the term I would use for this behavior!
- > However, when trying to open this file:
- > % open æ r
- > couldn't open "æ": no such file or directory
- > Tcl will never produce such filenames, when the system-encoding
- > is UTF-8. So - to me - this issue has only minor consequences,
- > I don't think anyone filed a bug-report on this, even though this
- > 'problem' is already present in Tcl for ages.
- It is a fair position to take (though not one I agree with) that if no one
- has complained about a bug it is not worth fixing. However, there is
- always the danger that the behavior is so egregious on far eastern
- locales for example that folks simply do not bother to use Tcl and
- thus do not report it. I dunno. Maybe the above inconsistency bothers
- me more than it should.
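To make the inconsistency concrete, here is a minimal sketch. It assumes a UTF-8 system encoding and a file whose on-disk name is the byte 0xE6 followed by ".txt" (a hypothetical name chosen for illustration). It never touches the file system; it only shows that decoding the name the way Tcl 8.6 does, then encoding it again, yields different bytes from the ones on disk.

```tcl
# On-disk name: byte 0xE6 followed by ".txt". The 0xE6 byte is "æ" in
# ISO 8859-1 but is not a valid UTF-8 sequence.
set raw [binary format H* e62e747874]

# Decode the way Tcl 8.6 does. On Tcl 9 this needs "-profile tcl8";
# on 8.6 that option does not exist, so fall back to the plain call.
if {[catch {encoding convertfrom -profile tcl8 utf-8 $raw} name]} {
    set name [encoding convertfrom utf-8 $raw]
}

# Encode the name again, as happens when it is handed back to the OS.
set roundTrip [encoding convertto utf-8 $name]

puts "on disk:     [binary encode hex $raw]"
puts "passed back: [binary encode hex $roundTrip]"
```

The second line of output starts with c3a6, the UTF-8 encoding of U+00E6, rather than the single byte e6, which is why the subsequent open cannot find the file.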
- > First, let's comment on the discussion going on between
- > Ashok and Nathan. I'm reading it with joy, not because
- > it's a useful discussion: I see it as a fight throwing mud
- > at each other without really going into the main point.
- Glad we brought some joy to you! But I cannot recall
- any mud slinging. If I did, I apologize to Nathan. And the debate
- was very much on point. I proposed a solution to a specific
- problem. Nathan suggested it was not necessary because
- of the iso8859-1 wrapping idiom. I disagreed and so there
- has been a lot of back and forth. If that's not relevant
- discussion, I don't know what is.
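For readers who have not followed that thread, the sketch below shows what I take the iso8859-1 wrapping idiom to mean; that reading is an assumption, since the idiom is only named here. Every byte is decoded as the code point of the same value and encoded back the same way, so arbitrary bytes survive a round trip.

```tcl
# Arbitrary bytes, including some that are not valid UTF-8.
set raw [binary format H* e6ff802e747874]

# Wrap: each byte 0x00-0xFF becomes the code point of the same value.
set wrapped [encoding convertfrom iso8859-1 $raw]

# Unwrap: each code point U+0000-U+00FF becomes the byte of the same value.
set unwrapped [encoding convertto iso8859-1 $wrapped]

puts [expr {$unwrapped eq $raw ? "round trip ok" : "round trip lost data"}]
```

The round trip is lossless, but the wrapped string is not the name a user would expect to see if the bytes were really UTF-8.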
- >
- > So, let's start criticising the TIP:
- > 1) First, I like the name "lossless" more than "passthrough". The
- > reason is that "passthrough" suggests that the bytes/codepoints
- > are just passed through, which isn't what this profile does.
- Opinion. Which is fine.
- > 2) The TIP doesn't describe well what "lossless" does in other
- > encodings than UTF-8.
- The TIP does not talk about utf-8 at all outside of the examples
- given so I do not understand that comment. It distinguishes (at
- the top of the Specification section) ASCII-compatible
- versus non-ASCII compatible encodings and defines the behavior
- for each. The two classes obviously cover all encodings so I am
- at a loss as to what was unclear. I can clarify if you point out what is.
- > I think that "lossless" can (and
- > should) be implemented for all encodings, but just in one
- > direction: so
- > encoding convertto -profile lossless $enc \
- > [encoding convertfrom -profile lossless $enc $data]
- > should always return the unmodified $data for any encoding, but
- > encoding convertfrom -profile lossless $enc \
- > [encoding convertto -profile lossless $enc $data]
- > should throw an exception if $data contains code-points
- > not valid in the $enc encoding (such as lone surrogates
- > other than U+DC80 - U+DCFF)
- There are a couple of issues with this. The first is a practical one.
- The current Tcl implementation has many instances where errors
- (return code of TCL_ERROR) are not tolerated. Either the caller
- panics or does not check the error code and has no means to
- handle it. ckalloc() is the obvious example but the same is true
- in the encodings and i/o modules as well in how internal transforms
- are implemented. This works because the 8.6 / tcl8 profile never
- fails. Any profile replacing the tcl8 profile must behave the same
- way (here the passthrough profile is replacing the tcl8 profile
- on glob etc.). Thus exceptions are not an option until Tcl's internals
- are fixed to handle and propagate errors, and I don't see that as feasible
- in the 9.0 time frame.
- Second, I don't see the purpose in the above so it would help
- if you elaborate on the intent or use case. In general, I
- tend to be somewhat skeptical about asymmetrical behavior
- as you might recall from the discussion of making the strict
- profile the default only in one direction.
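On the first point, the difference between profiles that can fail and profiles that cannot is visible at script level. The sketch below assumes Tcl 9.0, where encoding convertfrom accepts -profile with the values tcl8, replace and strict; it will not behave this way on 8.6, where the option does not exist.

```tcl
# Requires Tcl 9.0: the -profile option does not exist in 8.6.
# The input is "\xE6.txt", which is not valid UTF-8.
set raw [binary format H* e62e747874]

foreach profile {tcl8 replace strict} {
    if {[catch {encoding convertfrom -profile $profile utf-8 $raw} result]} {
        puts "$profile: error"
    } else {
        puts "$profile: ok, [string length $result] characters"
    }
}
```

Only strict is expected to raise an error here. The tcl8 and replace profiles always produce a result, which is the property the internal callers depend on.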
- > 3) I have my doubts about making the "lossless" profile the
- > default for filenames and environment variables. For commands
- > like "glob" and "open" I could imagine a "-profile" option for
- > glob/open. For environment variables, I don't know how to do that.
- There is no point adding this profile then. I see little value if not
- used for system interfaces including file names.
- > 4) In the TIP:
- > "Nevertheless, to mitigate this, this specification (following
- > PEP 383** will not map byte values < 128 into the U+DC00
- > surrogate space. Instead they are mapped to the encoding
- > specific replacement character"
- > That's reasonable for UTF-8/16/32, but for other encodings I see
- > no reason why \x7F cannot be mapped into U+DC7F if it's
- > missing as a valid code-point in the encoding.
- TL;DR Because security holes (from using different encodings
- in the two directions) almost always arise in the ASCII code space,
- bytes 0-127 are not passed through.
- I tried to explain this in the Security consideration section but since
- my explanations are not clear, please see the references I listed
- at the bottom.
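As an illustration of the rule quoted from the TIP, the sketch below maps one undecodable byte to the code point it would become. The proc name escapeCodePoint is hypothetical and not part of Tcl, and U+FFFD is only an assumed stand-in for the encoding-specific replacement character.

```tcl
# Hypothetical helper, not part of Tcl: returns the code point that a
# single undecodable byte maps to under the rule quoted above.
proc escapeCodePoint {byte {replacement 0xFFFD}} {
    if {$byte < 0 || $byte > 255} {
        error "not a byte value: $byte"
    }
    if {$byte < 128} {
        # ASCII range: never passed through. The replacement character
        # used here (U+FFFD by default) is an assumption.
        return [format U+%04X $replacement]
    }
    # Bytes 0x80-0xFF land in U+DC80-U+DCFF.
    return [format U+%04X [expr {0xDC00 + $byte}]]
}

puts [escapeCodePoint 0xE6]   ;# U+DCE6
puts [escapeCodePoint 0x2F]   ;# U+FFFD
```

If a byte such as 0x2F ("/") could travel as U+DC2F and later be written out as a real "/", a check made on the decoded string could miss a path separator. That is the kind of hole the restriction is meant to close.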
- > 5) What should
- > encoding convertto -profile lossless $data
- > do for code-points from U+D800 - U+DBFF and from
- > U+DD00 - U+DFFF ? Since those codepoints cannot
- > be produced by "encoding convertfrom", I think it's
- > most logical to throw an exception in this case.
- >
- As I said earlier, throwing exceptions is not an option for any
- profile that aims to replace tcl8. As is done for tcl8 and
- the replace profile, such code points are replaced with an
- encoding-specific fallback character.
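A minimal sketch of the encoding direction, under stated assumptions: the proc name escapeToByte is hypothetical, "?" (0x3F) stands in for the encoding-specific fallback character, and U+DC00 - U+DC7F is treated like the other surrogates that convertfrom never produces.

```tcl
# Hypothetical helper, not part of Tcl: returns the byte written for a
# surrogate code point when encoding.
proc escapeToByte {codePoint {fallback 0x3F}} {
    if {$codePoint >= 0xDC80 && $codePoint <= 0xDCFF} {
        # Escaped byte: recover the original value.
        return [format 0x%02X [expr {$codePoint - 0xDC00}]]
    }
    if {$codePoint >= 0xD800 && $codePoint <= 0xDFFF} {
        # Any other surrogate: no error, emit the fallback (assumed "?").
        return [format 0x%02X $fallback]
    }
    error "not a surrogate code point: $codePoint"
}

puts [escapeToByte 0xDCE6]   ;# 0xE6
puts [escapeToByte 0xD800]   ;# 0x3F
puts [escapeToByte 0xDD00]   ;# 0x3F
```

The error branch only marks the helper's scope. Ordinary code points go through the normal encoding tables and are not covered by this sketch.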
- Thank you for your review.
- /Ashok