# [NTG-pdftex] Patch to support CMap namespaces

Vasile Gaburici vgaburici at gmail.com
Wed Aug 27 23:59:51 CEST 2008

There are a couple of LaTeX packages out there that provide CMaps.
They don't work as well as \pdfglyphtounicode, i.e. virtual fonts
don't get CMaps at all (the CMap is included in the PDF but not
referenced), and otftotfm-installed fonts lack the CMap entries for
the ligatures that otftotfm sneaks into empty slots. As you know,
\pdfglyphtounicode fixes these problems.

On the other hand, these two packages let the user specify a CMap for
each LaTeX encoding, so the user can give different Unicode values to
the same PS glyph name in different LaTeX encodings. Of course that
works properly only if the fonts invoked by the different LaTeX
encodings are different; otherwise only one can win the \pdffontattr.
A compelling application of this feature is CMaps that set math code
points (usually above the BMP) for TeX math fonts; those glyphs have
exactly the same names as in text fonts (/A etc.). Adding namespaces to
\pdfglyphtounicode makes those two packages obsolete in their current
implementation.

Another advantage of namespaces is the ability to (reliably) fix
TrueType font CMaps. The troublesome glyphs are usually ligatures that
don't have a Unicode entry (Th, ti, tf, ffb, etc.), which otftotfm
writes as /indexZZZ in the enc file. Putting those in a per-font
namespace avoids any potential clashes.

So, I've patched pdftex to provide namespaces using the following
syntax extension: the first argument of \pdfglyphtounicode can now
carry a namespace prefix, in either of these forms:

\pdfglyphtounicode{fnt:tex-font-name/ps-glyph-name}{...}
\pdfglyphtounicode{enc:ps-enc-name/ps-glyph-name}{...}

Since fonts for which the built-in encoding is used happen to be
exactly those that have multiple design sizes (cmr, stmary etc.),
the 'enc' namespace for those is obtained by dropping any final digits
from the font name, e.g. cmr10 has PS encoding cmr (for CMap purposes
only).
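
A minimal sketch of that naming rule, written in Python rather than in the
C of the actual patch. The function name enc_namespace_for_builtin is
hypothetical, and the patch may treat edge cases differently:

```python
import re


def enc_namespace_for_builtin(tex_font_name: str) -> str:
    """Derive the 'enc' namespace name for a font that uses its built-in
    encoding by dropping any trailing digits (the design size)."""
    return re.sub(r"[0-9]+$", "", tex_font_name)


print(enc_namespace_for_builtin("cmr10"))  # cmr
```

With this rule all design sizes of one family (cmr5, cmr10, cmr17, ...)
end up sharing a single enc: namespace.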

The search policy is to first search the font namespace, then the
encoding namespace, and finally the global namespace, for which the syntax
remains unchanged. All these namespaces are implemented in the same AVL
tree, simply using the above strings as key names. In theory this makes
the search 3 times slower, but that particular phase of pdftex hardly
takes any time, so it seemed premature to implement any optimization.
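
The lookup order can be sketched roughly as follows. This is an
illustration in Python, not the patch itself: a plain dict stands in for
the AVL tree, and the names lookup_tounicode, table, font_name and
enc_name are hypothetical.

```python
from typing import Dict, Optional


def lookup_tounicode(table: Dict[str, str], font_name: str,
                     enc_name: str, glyph_name: str) -> Optional[str]:
    """Return the Unicode value for a glyph, trying the font namespace,
    then the encoding namespace, then the global (unprefixed) namespace."""
    candidate_keys = (
        f"fnt:{font_name}/{glyph_name}",  # most specific: per-font entry
        f"enc:{enc_name}/{glyph_name}",   # per-encoding entry
        glyph_name,                       # global entry, old syntax
    )
    for key in candidate_keys:
        value = table.get(key)
        if value is not None:
            return value
    return None


# One table holds all three kinds of keys, as in the single-tree design.
table = {
    "A": "0041",
    "enc:cmmi/A": "D835 DC34",
}
print(lookup_tounicode(table, "cmmi10", "cmmi", "A"))  # D835 DC34
print(lookup_tounicode(table, "cmr10", "cmr", "A"))    # 0041
```

Each glyph costs at most three lookups, which is where the "3 times
slower" worst case comes from.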

Some usage examples:

% make the ti ligature searchable in Calibri regular
\pdfglyphtounicode{fnt:calibly1--base/index415}{0074 0069}
% go crazy with Unicode math; TeX math italic gives above-BMP math A
\pdfglyphtounicode{enc:cmmi/A}{D835 DC34} % UTF16BE required
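
For anyone who wants to double-check hex values like the ones above, here
is a small Python helper (the name utf16be_hex is made up for this
illustration) that prints the UTF-16BE code units of a string, including
the surrogate pair needed for code points above the BMP:

```python
def utf16be_hex(text: str) -> str:
    """Return the UTF-16BE code units of text as space-separated
    four-digit uppercase hex groups."""
    data = text.encode("utf-16-be")
    units = [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]
    return " ".join(units)


print(utf16be_hex("ti"))          # 0074 0069
print(utf16be_hex("\U0001D434"))  # D835 DC34 (MATHEMATICAL ITALIC CAPITAL A)
```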

Note that search behavior for math letters varies with PDF viewers.
Acrobat implements only canonical equivalence, so you need to enter
the exact code point, and copy/paste preserves the code points, so you
can paste into a LaTeX document if it's using utf8x input encoding.
Evince implements compatibility equivalence, so it's easier to find
those math As by searching for plain A, but they also copy/paste as
plain A. You can, however, use pdftotext, which uses the same poppler
backend, to have the code points preserved. I'm not really
advocating Unicode math letters, but now they're easily supported in
pdftex -- no need for manual CMaps anymore.
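
The difference between the two kinds of equivalence can be seen with
Python's standard unicodedata module. This only illustrates the Unicode
normalization forms themselves; it makes no claim about how Acrobat or
Evince actually implement their search.

```python
import unicodedata

math_a = "\U0001D434"  # MATHEMATICAL ITALIC CAPITAL A

# Canonical normalization (NFC) leaves the math letter untouched, so a
# search limited to canonical equivalence needs the exact code point.
canonical = unicodedata.normalize("NFC", math_a)

# Compatibility normalization (NFKC) folds it to the plain letter, so a
# search for plain "A" can match it.
compatibility = unicodedata.normalize("NFKC", math_a)

print(canonical == math_a)   # True
print(compatibility == "A")  # True
```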

BTW, \pdfglyphtounicode now really needs to be documented in the
manual, so that people stop writing (buggy) CMaps by hand. I
volunteer to do it if you accept the patch :)

I also wrote some CMap handling tools, mostly for verification; I'll
send a separate announcement about those.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: CMap-ns.patch
Type: text/x-patch
Size: 8725 bytes
Desc: not available
Url : http://www.ntg.nl/pipermail/ntg-pdftex/attachments/20080828/5dd40072/attachment.bin