# [NTG-context] Basic question on Unicode and ConTeXt

Mojca Miklavec mojca.miklavec.lists at gmail.com
Thu Jul 21 02:52:31 CEST 2005

Christopher Creutzig wrote:
> Hans Hagen wrote:
> >> So why not mapping the characters to unicode first and defining the
> >> mapping from unicode to \TeXcommand only once? regi-* files (at least
> >> in the meaning they have now) could be prepared automatically by a
> >> script, less error-prone and without the need to say "Some more
> >> definitions will be added later."
> >>
> > you mean ...
> >
> > \defineactivetoken 123 {\uchar{...}{...}}
> >
> > it is an option but it's much slower and takes much more memory
>
>   I may be wrong, of course, but I think Mojca proposed something
> different (and something that should be really easy to implement):  Have
> the unicode vectors stored in a format easily parsed by an external ruby
> script and create the regi-* files from that, using the conversion
> tables provided by your operating system or iconv or wherever ruby gets
> them from.

Yes, I had something different in mind.

A1.) prepare the files that serve as the source of the transformation
from "any" character set to Unicode, and prepare a list of synonyms
for encoding names

(example: a file that says that in ISO-8859-2, the character 0xA3
represents the Unicode character 0x0141 (lstroke); one such entry for
every character, for every Mac/Windows/ISO/[...] encoding that we want
to support)

A2.) write a script which automatically generates the regi-* files
from those source files, but the regi-* files would contain only the
mapping to the Unicode numbers

(example:
\startregime[iso-8859-2]
...
\somecommandtomapacharactertounicode {163}{1}{65} % lstroke
...
\stopregime)

A3.) prepare one huge file with the mapping from Unicode numbers to ConTeXt commands

(example:
...
\somecommandtomapfromunicodetocontext {1}{65}{\lstroke}
...)
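The A3 file itself could be generated from a hand-maintained table of code points and the ConTeXt commands chosen for them; a minimal Ruby sketch (the command name and the table are hypothetical):

```ruby
# Sketch of generating A3 lines: one hand-kept table, one line of TeX per
# entry. Changing a choice (e.g. \dots -> \textellipsis) means editing
# exactly one table entry and regenerating.
UNICODE_TO_CONTEXT = {
  0x0141 => '\lstroke',      # LATIN CAPITAL LETTER L WITH STROKE
  0x2026 => '\textellipsis', # HORIZONTAL ELLIPSIS
}

def context_map_lines(table)
  table.map do |cp, cmd|
    format('\somecommandtomapfromunicodetocontext {%d}{%d}{%s}',
           cp >> 8, cp & 0xFF, cmd)
  end
end

puts context_map_lines(UNICODE_TO_CONTEXT)
```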

A4.) ... I don't mind what ConTeXt does with this \lstroke afterwards,
but it seems ConTeXt is already clever enough to produce the proper
glyph in the end

What should ConTeXt do with that?
B1.) The file under A3 should be processed at the beginning. As it may
become really huge, exotic definitions should only be preloaded when
asked for (\usemodule[korean]), while there is probably no harm in
preloading (accented) latin, greek, cyrillic and punctuation (TM,
copyright, ...) by default.
B2.) Once \enableregime[iso-8859-2] (or any other regime) is
requested, the file with the corresponding regime definitions is
processed. However, when \somecommandtomapacharactertounicode
{163}{1}{65} is processed, the character '163' is not stored as
\uchar{1}{65} but as \lstroke: '\somecommandtomapacharactertounicode'
would first look up which ConTeXt command is saved under \uchar{1}{65}
and issue
\defineactivetoken 163 {\lstroke} as a result.
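The B2 resolution step can be modelled outside TeX: the point is that the unicode->command table is consulted only once, while the regime file is being read, so the active character ends up bound directly to the final command. A Ruby illustration (all names hypothetical):

```ruby
# Model of B2: the preloaded unicode->command table (from A3) is consulted
# at regime-load time, so the active token gets the final command,
# not an intermediate \uchar call.
UNICODE_COMMANDS = { [1, 65] => '\lstroke' } # filled when the A3 file is read

def map_character_to_unicode(byte, hi, lo)
  cmd = UNICODE_COMMANDS.fetch([hi, lo]) # fails loudly if A3 lacks the entry
  format('\defineactivetoken %d {%s}', byte, cmd)
end
```

With the entry above, map_character_to_unicode(163, 1, 65) yields the \defineactivetoken line for the lstroke example.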

I don't know the details of the ConTeXt internals, but I think (hope)
that it should be possible to do it this way. B1 (preloading the
mapping from Unicode to TeX commands) is probably the only "hungry"
step in the whole story.

I think that it doesn't make any sense to ask the user to "\input
regi-whatever". \enableregime and some additional definitions should
be clever enough to find out which file to process in order to enable
the proper regime.

%%%%%%%%%%%%%%%%%%%%%

Christopher's idea is actually yet another alternative, which combines
steps A2 and A3. If the unicode->ConTeXt mapping is in some
easy-to-parse format, there is actually no additional effort if the
script writes the ConTeXt commands directly into the regi-* files
instead of the Unicode numbers, so that B2 has less work to do. As
long as it is guaranteed that nobody changes these files manually,
this is OK. The only drawback is that if someone notices that
"\textellipsis" is more suitable than "\dots", the script has to be
changed and the files have to be regenerated. If the character is
mapped to the Unicode number (0x2026 HORIZONTAL ELLIPSIS) instead,
only one line in the file with the unicode->ConTeXt mapping (A3) has
to be changed.

If B2 cannot work as described, Christopher's proposal would be the
only proper way to go.

%%%%%%%%%%%%%%%%%%%%%

I wanted to test \showcharacters on the live.contextgarden.net (as
Hans suggested that my map files are probably not OK), but it didn't
compile there. (I hope it's not because of my buggy contributions in
the last few days.)

Is there any tool or macro to visualize all the glyphs available in a
font? \showcharacters (if it works) shows only the glyphs that ConTeXt
is aware of. What about the rest?

Mojca