On 15 December 2016 at 22:39, Toshio Kuratomi a.badger@gmail.com wrote:
- I'm not 100% certain that LC_CTYPE is the best thing to check.
People will set LC_CTYPE in conjunction with LC_COLLATE to C to get a predictable sort order. (CTYPE is needed because bytes can be interpreted as different characters and those differences can affect sort order.) Changing this will mean that their command line sort order (ls, sort, etc) could differ from python's sort order. I haven't thought this through fully, or come up with a better way to check for what we actually mean, though, and perhaps python already uses LC_CTYPE in ways that would differ from other unix tools?
Yep, LC_CTYPE is the source of all the pain, as it's what controls the encoding used for everything that CPython needs to decode before it gets its own codec machinery bootstrapped. Victor Stinner made a couple of attempts at overriding it later in the interpreter bootstrapping process (e.g. based on environment variables) after the codec system was fully up and running, and the problems you end up with are that:
1. Within CPython it's easy to lose track of how you decoded system provided text like sys.argv, sys.warnoptions, sys._xoptions, sys.executable, os.environ, etc, so "fixing" incorrectly decoded values later is fragile
2. Even if you *do* get all the details right within CPython, you may be in trouble again as soon as you call out to a third party C/C++ library, especially GUI toolkits like Tcl/Tk, Qt or Gtk that have a lot of locale dependent behaviour
His conclusion was that letting the locale-as-seen-by-CPython diverge from that seen by the rest of the process simply doesn't work, and I don't have any reason to second guess that conclusion.
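To make the first point concrete, here is a minimal sketch (not taken from CPython's source) of what a late "fix" involves. It assumes the original bytes were UTF-8 and were decoded as ASCII with the surrogateescape error handler, which is how CPython handles undecodable bytes from the operating system; the variable names are purely illustrative.

```python
# Simulate UTF-8 bytes from the OS being decoded under an ASCII-only
# (C locale) assumption, then repaired after the fact.
raw = "caf\u00e9".encode("utf-8")  # the bytes as the OS supplied them

# Undecodable bytes are smuggled through as lone surrogate code points.
mis_decoded = raw.decode("ascii", errors="surrogateescape")

# The repair only works if you still know the value was decoded this way.
repaired = mis_decoded.encode("ascii", errors="surrogateescape").decode("utf-8")

print(ascii(mis_decoded))
print(ascii(repaired))
```

The round trip recovers the text here, but only because the code knows exactly which codec and error handler produced the string, and that is the knowledge that's easy to lose once values have been passed around the interpreter.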
However, I'll also note that any tooling written in Rust or Go already makes the "UTF-8 everywhere" assumption at the level of the language design, so the proposed change would just move tools written in Python 3 into the same category as those written in those languages (unless you set PYTHONALLOWCLOCALE to request the old behaviour).
- Thinking about whether this belongs in the library or the interpreter some more I'm seeing some hefty cons in both directions. You've already noted that the con for doing it in the interpreter is that we get out of sync with other things linking to libpython, therefore making debugging harder.
Note that CPython already offers a range of "preconfiguration" APIs that allow applications embedding the runtime to override otherwise environment based configuration settings. In particular, Py_SetStandardStreamEncoding (https://docs.python.org/3/c-api/init.html#c.Py_SetStandardStreamEncoding) was added specifically so Blender could just tell the Python 3 runtime "configure the standard streams like *this*", rather than having to persuade CPython to guess the right answer.
So the fact that embedded runtimes can give you different results from what you get at the command prompt isn't a *new* problem.
[Copying-and-pasting some of your comments from the other subthread to consolidate the two discussions]
I'd almost say that internalizing the click behaviour could be the correct design here. Having the library check that it has a locale with non-ascii capabilities, and fail if it doesn't, would be helpful. That would quickly point to differences in behaviour when running under mod_wsgi vs /usr/bin/python, for instance, prompting the user to fix the mod_wsgi deployment in advance.
While I don't like the idea of locale *coercion* inside the library, I'd be fine with emitting a proper Python level warning inside Py_Initialize after we get the warnings machinery up and running.
OTOH, users don't run into the problem all the time (it depends on the data being processed and how it is handled) so it seems heavy handed to do it this way.
I think erroring out would be unduly harsh, but a warning seems reasonable given the availability of C.UTF-8.
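As a rough Python-level illustration of the warn-versus-fail choice (the real check would live in C inside Py_Initialize, and this is not how click implements its own check), something along these lines would do. The helper name check_locale_encoding and its strict flag are hypothetical, invented for this sketch.

```python
import codecs
import locale
import warnings


def check_locale_encoding(strict=False):
    """Return the normalised locale encoding, complaining if it is ASCII.

    Hypothetical helper for illustration; not part of CPython or click.
    """
    # Passing False reports the current setting without calling setlocale().
    encoding = codecs.lookup(locale.getpreferredencoding(False)).name
    if encoding == "ascii":
        message = (
            "The current locale only supports ASCII text; "
            "consider a UTF-8 based locale such as C.UTF-8"
        )
        if strict:
            # The "fail early" behaviour: refuse to continue at all.
            raise RuntimeError(message)
        # The softer option: warn and carry on.
        warnings.warn(message, RuntimeWarning, stacklevel=2)
    return encoding


if __name__ == "__main__":
    print(check_locale_encoding())
```

Going through codecs.lookup() means aliases such as ANSI_X3.4-1968, which is what glibc reports for the C locale, are all normalised to "ascii" before the comparison.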
(I suppose by the same argument I'd have to say that click is doing it wrong to force users to address ascii-only locales...)
click is younger than Python 3, so Armin did make some initial attempts to get it working in the C locale on both 2 & 3. However, he eventually gave the latter up as unsupportable and the error makes it clear that "I don't need to support ASCII based locales on Python 3" is a key constraint in deciding whether or not to adopt click.
Cheers, Nick.