[Dev-luatex] Utf-8 too dominant?

David Kastrup dak at gnu.org
Tue Mar 27 12:36:23 CEST 2007


Taco Hoekwater <taco at elvenkind.com> writes:

> Arthur Reutenauer wrote:
>>> Instead, LuaTeX barfs on "\^^9d" and similar ASCII _transliterations_
>>> of characters which happen to be legal _characters_ in Unicode (though
>>> not legal _bytes_ in utf-8).
>>
>>   Good spot, I already noticed there were many problems with LaTeX but I
>> thought it was mainly due to pattern files (and I gave up very early on
>> LaTeX in LuaTeX anyway). I suppose the ^^ notation should yield a UTF-8
>> encoded sequence and not an individual byte (XeTeX indeed is perfectly
>> happy with it).
>
> It worked before, so I probably messed up something along the way.
> It is safe to assume there will be a fix in the next snapshot.

Anyway: I think it is a safe assumption that LuaTeX should be able to
deal with current versions of LaTeX (I think it would be a mistake to
have to rely on lambda).

So the kind of utf-8 support (OTP or something) used for Omega needs
to be somewhat optional.

I don't have any clue about the current implementation, but the amount
of error messages I got suggests there are several areas involved.

Here is my take on what would constitute a sane environment (some of
that is probably already implemented in XeTeX):

Single characters: encoded as Unicode code points (21 bits, "UCS-21" or
similar).

Input line buffer: array of single characters.  Characters are created
from input by using the input coding system of the file (basically one
of 8-bit, utf-8, at some later point in time possibly also things like
utf-16-le or utf-16-be).
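
A minimal sketch of that decoding step, in Python for brevity.  The
function name decode_line and the coding-system labels are made up for
illustration; nothing here is taken from the LuaTeX or XeTeX sources.

```python
def decode_line(raw, coding="8-bit"):
    """Turn one raw input line (bytes) into a list of Unicode code points.

    Hypothetical helper: "8-bit" is assumed to mean a transparent mapping
    where every byte becomes the character with the same number.
    """
    if coding == "8-bit":
        # Iterating over bytes yields integers 0..255, one per byte.
        return list(raw)
    if coding in ("utf-8", "utf-16-le", "utf-16-be"):
        # Let the codec do the work, then keep only the code point numbers.
        return [ord(ch) for ch in raw.decode(coding)]
    raise ValueError("unknown input coding system: %s" % coding)
```

With this, the two bytes C3 A9 become two characters in 8-bit mode but
the single character U+00E9 in utf-8 mode, so everything after the line
buffer only ever sees characters, never bytes.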

LaTeX would be fixed to "transparent" at first, which would make it
work like before.  However, one would eventually want to add something
like a utf8l input encoding in order to have it behave more sanely.

String space: utf-8 encoded.  This is probably incompatible with
previous code, but saves space.
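
To put a rough number on the space argument, here is a small Python
sketch comparing utf-8 storage with a fixed four bytes per character.
The four-byte figure is only an assumption about what an unpacked
representation might use, and storage_bytes is a hypothetical name.

```python
def storage_bytes(text):
    """Return (utf-8 size, size at a fixed 4 bytes per character)."""
    return len(text.encode("utf-8")), 4 * len(text)
```

For a typical control sequence name like "\relax" this gives 6 bytes
against 24, since ASCII characters take one byte each in utf-8.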

Log and console output: switchable utf-8 or 8-bit, probably depending
on locale and/or inherited from the mode of the current input file.
In "8-bit" mode, obviously all characters with a code point above 255
need to be output as ^^^^abcd or ^^^^^^01abcd or similar.
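
A sketch of what that escaping could look like, in Python.  The name
log_repr and the mode labels are hypothetical, and the sketch ignores
the usual ^^xx treatment of unprintable characters below 256.

```python
def log_repr(codepoints, mode="8-bit"):
    """Render a list of code points for the log or the console.

    Assumption: four hex digits after ^^^^ for the BMP, six hex digits
    after ^^^^^^ for everything above it.
    """
    out = []
    for cp in codepoints:
        if mode == "utf-8" or cp <= 0xFF:
            # Representable as is (one byte in 8-bit mode).
            out.append(chr(cp))
        elif cp <= 0xFFFF:
            out.append("^^^^%04x" % cp)
        else:
            out.append("^^^^^^%06x" % cp)
    return "".join(out)
```

In 8-bit mode the euro sign U+20AC would then show up as ^^^^20ac,
while in utf-8 mode it is passed through unchanged.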

Write streams: similar.  It might be possible to generally write
utf-8, but then it might be a good idea to add a byte order mark at
the start of files so that \input on such files will flip the coding
system appropriately.
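
The byte order mark check on \input could be as simple as the following
Python sketch.  sniff_coding is a hypothetical name, and falling back
to 8-bit when no mark is present is my assumption, not existing
behaviour.

```python
import codecs


def sniff_coding(first_bytes, default="8-bit"):
    """Guess a file's coding system from a leading byte order mark.

    Returns (coding system, number of bytes to skip).
    """
    if first_bytes.startswith(codecs.BOM_UTF8):
        return "utf-8", len(codecs.BOM_UTF8)
    if first_bytes.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le", len(codecs.BOM_UTF16_LE)
    if first_bytes.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be", len(codecs.BOM_UTF16_BE)
    # No mark: keep whatever coding system was already in effect.
    return default, 0
```

Files written by older TeX runs carry no mark and so would still be
read in the default mode.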

I really need to take a look at XeTeX.

-- 
David Kastrup

