[Dev-luatex] Incorrect UTF-8 decoding in \luaescapestring (+ fix)

Jonathan Sauer Jonathan.Sauer at silverstroke.com
Tue Apr 8 10:09:51 CEST 2008


Hello,

The following plain TeX example illustrates the problem:

%&luatex

\directlua0{%
	% The following character is DESERET CAPITAL LETTER LONG I, 
	% Unicode 0x10400, encoded in UTF-8 as F0 90 90 80:
	local s = '\luaescapestring{𐐀}'% A
% 	local s = '𐐀'% B

	for c in string.bytes(s) do texio.write_nl(c) end
}%

\end

(A) results in:

This is LuaTeX, Version snapshot-0.25.0-2008031419 (Web2C 7.5.6)
(EscapeUTF8.tex! Pool contains an invalid utf-8 sequence
.
l.4 	local s = '\luaescapestring{????}
                                      '% A
? 
240
144
144
128
239
191
189 )
No pages of output.
Transcript written on EscapeUTF8.log.



(B) results in:

This is LuaTeX, Version snapshot-0.25.0-2008031419 (Web2C 7.5.6)
(EscapeUTF8.tex
240
144
144
128 )
No pages of output.
Transcript written on EscapeUTF8.log.


It seems that \luaescapestring does not handle long UTF-8 sequences
correctly. The three additional bytes in (A), 239 191 189 or EF BF BD,
are the UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER, the character
LuaTeX inserts when it encounters an invalid UTF-8 sequence.
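
The byte values in both transcripts can be checked independently of
LuaTeX. The following is a minimal sketch in Python, used here only for
illustration and not part of the original test file:

```python
# U+10400 DESERET CAPITAL LETTER LONG I and U+FFFD REPLACEMENT CHARACTER,
# encoded with Python's built-in UTF-8 codec.
deseret = chr(0x10400).encode("utf-8")
replacement = chr(0xFFFD).encode("utf-8")

def show(label, data):
    # Print the bytes in decimal (as texio.write_nl does) and in hex.
    decimal = " ".join(str(b) for b in data)
    hexadecimal = " ".join(f"{b:02X}" for b in data)
    print(f"{label}: {decimal}  ({hexadecimal})")

show("U+10400", deseret)      # U+10400: 240 144 144 128  (F0 90 90 80)
show("U+FFFD", replacement)   # U+FFFD: 239 191 189  (EF BF BD)
```

The first line matches the four bytes printed in both (A) and (B); the
second matches the three extra bytes that appear only in (A).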

I think that the error lies in luatex.web, line 10911:

@d unicode_incr(#)==if str_pool[#]>@"F0 then #:=#+4 else if str_pool[#]>@"E0 
     then #:=#+3 else if str_pool[#]>@"C0 then #:=#+2 else incr(#)

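The lead byte here is exactly F0, so the test str_pool[#]>@"F0 fails and
the macro falls through to the three-byte case. Instead of skipping the
entire four-byte sequence, it skips only the first three bytes and leaves
the last byte, hex 80, to be processed as the next character. Since a
lone 0x80 is a continuation byte and not a valid UTF-8 sequence, LuaTeX
displays the above error message and continues with U+FFFD.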

So I think '>=' should be used, instead of '>':

@d unicode_incr(#)==if str_pool[#]>=@"F0 then #:=#+4 else if str_pool[#]>=@"E0 
     then #:=#+3 else if str_pool[#]>=@"C0 then #:=#+2 else incr(#)

I tried this modification, and the error disappeared along with the 
additional three bytes.
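
The effect of the two comparisons can be reproduced outside of LuaTeX.
The sketch below is a Python model of the macro quoted above, not the
LuaTeX code itself; the names unicode_incr (reused from the macro),
split_sequences and the parameter fixed are chosen here for illustration
only:

```python
def unicode_incr(pool, i, fixed):
    """Return the index just past the sequence that starts at pool[i].

    fixed=False models the original strict '>' tests,
    fixed=True models the proposed '>=' tests.
    """
    b = pool[i]
    # For integers, 'b > t' is the same as 'b >= t + 1'.
    off = 0 if fixed else 1
    if b >= 0xF0 + off:
        return i + 4
    if b >= 0xE0 + off:
        return i + 3
    if b >= 0xC0 + off:
        return i + 2
    return i + 1

def split_sequences(pool, fixed):
    # Walk the byte string the way the macro would, collecting each chunk.
    out, i = [], 0
    while i < len(pool):
        j = unicode_incr(pool, i, fixed)
        out.append(pool[i:j].hex())
        i = j
    return out

pool = chr(0x10400).encode("utf-8")         # F0 90 90 80
print(split_sequences(pool, fixed=False))   # ['f09090', '80']
print(split_sequences(pool, fixed=True))    # ['f0909080']
```

In this model the strict comparison leaves the stray 80 behind, as in
(A). The same off-by-one also shows up for a lead byte of exactly E0:
U+0800, encoded as E0 A0 80, is split into 'e0a0' and '80' with
fixed=False.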


Jonathan

P.S.: How does the bug tracker work? I tried to register some weeks ago,
but never got the confirmation mail with the password. When trying to
register again now, it says the username is already being used, even
though non-activated accounts should be purged after a week.


