RE: [TCLCORE] ANNOUNCE - TIP 671: Lossless encoding for system APIs

Jan,

Please see responses below.

> -----Original Message-----
> From: Jan Nijtmans <jan.nijtmans@gmail.com>
> Yes, I think that's the point in TIP #671: If glob encounters a
> filename "\xE6", it will be auto-corrected by Tcl to mean "æ".

Auto "correct" is not the term I would use for this behavior!

> However, when trying to open this file:
>     % open æ r
>     couldn't open "æ": no such file or directory
> Tcl will never produce such filenames, when the system-encoding
> is UTF-8. So - to me - this issue has only minor consequences,
> I don't think anyone filed a bug-report on this, even though this
> 'problem' is already present in Tcl for ages.

It is a fair position to take (though not one I agree with) that if no one
has complained about a bug it is not worth fixing. However, there is
always the danger that the behavior is so egregious, on Far Eastern
locales for example, that folks simply do not bother to use Tcl and
thus never report it. I dunno. Maybe the above inconsistency bothers
me more than it should.
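
To make the inconsistency concrete, here is a rough model in Python
rather than Tcl. It assumes, per your description, that the stray byte
0xE6 is read as the Latin-1 character "æ" on input and that names are
written back out as UTF-8. The variable names are mine and this is not
Tcl's actual code path.

```python
# Rough model of the glob/open mismatch. Assumption: an invalid UTF-8
# byte is reinterpreted as the Latin-1 character with the same value.
on_disk = b"\xe6"                      # file name bytes as stored by the OS

seen_by_script = on_disk.decode("latin-1")     # what glob would report
sent_to_open = seen_by_script.encode("utf-8")  # what open would pass back

print(sent_to_open)              # the bytes handed back to the OS
print(sent_to_open == on_disk)   # whether they still name the same file
```

The name the script sees looks perfectly ordinary, yet it no longer
refers to the file that glob found.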

> First, let's comment on the discussion going on between
> Ashok and Nathan. I'm reading it with joy, not because
> it's a useful discussion: I see it as a fight throwing mud
> at each other without really going into the main point.

Glad we brought some joy to you! But I cannot recall
any mud slinging. If I did, I apologize to Nathan. And the debate
was very much on point. I proposed a solution to a specific
problem. Nathan suggested it was not necessary because
of the iso8859-1 wrapping idiom. I disagreed and so there
has been a lot of back and forth. If that's not relevant
discussion, I don't know what is.
> 
> So, let's start criticising the TIP:
> 1) First, I like the name "lossless" more than "passthrough". The
> reason is that "passthrough" suggests that the bytes/codepoints
> are just passed through, which isn't what this profile does.

Opinion. Which is fine.

> 2) The TIP doesn't describe well what "lossless" does in other
> encodings than UTF-8. 

The TIP does not talk about UTF-8 at all outside of the examples
given, so I do not understand that comment. It distinguishes (at
the top of the Specification section) ASCII-compatible
versus non-ASCII-compatible encodings and defines the behavior
for each. The two classes obviously cover all encodings, so I am
at a loss as to what was unclear. I can clarify if you tell me
which part is.

> I think that "lossless" can (and
> should) be implemented for all encodings, but just in one
> direction: so
>     encoding convertto -profile lossless $enc \
>         [encoding convertfrom -profile lossless $enc $data]
> should always return the unmodified $data for any encoding, but
>     encoding convertfrom -profile lossless $enc \
>         [encoding convertto -profile lossless $enc $data]
> should throw an exception if $data contains code-points
> not valid in the $enc encoding (such as lone surrogates
> other than U+DC80 - U+DCFF)

There are a couple of issues with this. The first is a practical one.
The current Tcl implementation has many places where errors
(a return code of TCL_ERROR) are not tolerated. Either the caller
panics, or it does not check the error code and has no means to
handle it. ckalloc() is the obvious example, but the same is true
in the encoding and I/O modules as well, in how internal transforms
are implemented. This works because the 8.6 / tcl8 profile never
fails. Any profile replacing the tcl8 profile must behave the same
way (here the passthrough profile is replacing the tcl8 profile
on glob etc.). Thus exceptions are not an option until Tcl's internals
are fixed to handle and propagate errors, and I don't see that as
feasible in the 9.0 time frame.

Second, I don't see the purpose of the above, so it would help
if you could elaborate on the intent or use case. In general, I
tend to be somewhat skeptical about asymmetrical behavior,
as you might recall from the discussion of making the strict
profile the default in only one direction.
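
For what it is worth, the asymmetry you describe is how I understand
Python's surrogateescape error handler (the PEP 383 implementation) to
behave, so a small Python sketch may help pin down what is being
proposed. It illustrates that handler only. It says nothing about how a
Tcl profile would or should be implemented.

```python
# PEP 383 behavior in Python, shown as an analogue of the proposal.
data = b"abc\xe6\xff"    # contains bytes that are not valid UTF-8

# bytes -> string -> bytes: invalid bytes become U+DC80..U+DCFF and
# are restored on the way back, so the round trip is lossless.
text = data.decode("utf-8", "surrogateescape")
roundtrip = text.encode("utf-8", "surrogateescape")

# string -> bytes: a lone surrogate outside U+DC80..U+DCFF has no byte
# to map back to, so the encoder raises instead of returning data.
try:
    "\ud800".encode("utf-8", "surrogateescape")
    raised = False
except UnicodeEncodeError:
    raised = True

print(roundtrip == data, raised)
```

Python can afford the exception in the second case because its callers
are expected to handle it, which is precisely what Tcl's internals
cannot do today.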

> 3) I have my doubts about making the "lossless" profile the
> default for filenames and environment variables. For commands
> like "glob" and "open" I could imagine a "-profile" option for
> glob/open. For environment variables, I don't know how to do that.

There is no point in adding this profile then. I see little value in
it if it is not used for system interfaces, including file names.

> 4) In the TIP:
>     "Nevertheless, to mitigate this, this specification (following
>     PEP 383) will not map byte values < 128 into the U+DC00
>     surrogate space. Instead they are mapped to the encoding
>     specific replacement character"
> That's reasonable for UTF-8/16/32, but for other encodings I see
> no reason why \x7F cannot be mapped into U+DC7F if it's
> missing as a valid code-point in the encoding.

TL;DR: Because security holes (from using different encodings
in the two directions) almost always arise in the ASCII code space,
0-127 are not passed through.

I tried to explain this in the Security consideration section but since
my explanations are not clear, please see the references I listed
at the bottom.
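
As a concrete illustration of the concern, here is how Python's
surrogateescape handler (PEP 383) treats the ASCII range. The scenario
is hypothetical and the snippet shows Python's behavior, not the TIP's
implementation: if U+DC2F were allowed to encode back to the byte 0x2F,
a string that contains no "/" could turn into a path separator on
output.

```python
# U+DC2F would correspond to byte 0x2F ("/") if the ASCII range were
# passed through. Python's handler only accepts U+DC80..U+DCFF.
suspicious = "etc\udc2fpasswd"     # contains no "/" character

try:
    suspicious.encode("utf-8", "surrogateescape")
    smuggled = True
except UnicodeEncodeError:
    smuggled = False

# Escapes for bytes >= 0x80 are still restored as usual.
high = "\udce6".encode("utf-8", "surrogateescape")

print("/" in suspicious, smuggled, high)
```

Python raises here where the TIP substitutes a replacement character,
but in both cases the ASCII range is never reconstructed from the
surrogate space.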

> 5) What should
>         encoding convertto -profile lossless $data
> do for code-points from U+D800 - U+DBFF and from
> U+DD00 - U+DFFF ?  Since those codepoints cannot
> be produced by "encoding convertfrom", I think it's
> most logical to throw an exception in this case.
>

As I said earlier, throwing exceptions is not an option for any
profile that aims to replace tcl8. As is done for the tcl8 and
replace profiles, such code points are replaced with an
encoding-specific fallback character.

Thank you for your review.

/Ashok