XeLaTeX: Unicode font fallback for unsupported characters

Until now I have used plain LaTeX to typeset my documents, and it works perfectly when you have a single script (e.g. only English or German). But as soon as you want to typeset Unicode text in multiple languages, you’re quickly out of luck. LaTeX was not designed for Unicode, and you need a lot of helper packages, documentation reading, and complicated configuration in your document to get it all right.

All I wanted was to typeset the following Unicode text. It contains regular Latin characters, Chinese characters, modern Greek, and polytonic (ancient) Greek.

Latin text. Chinese text: 紫薇北斗星  Modern greek: Διαμ πριμα εσθ ατ, κυο πχιλωσοπηια Ancient greek: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος. And regular latin text.

I thought it was a simple task. I thought: let’s just use XeLaTeX, which has out-of-the-box Unicode support. In the end, it was a simple task, but only after struggling to solve a particular problem. To show you the problem, I ran the following straightforward code through XeLaTeX…
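The original listing is not reproduced here. A minimal sketch of such a test document, assuming nothing more than the fontspec package with its default font and the sample text from above, might look like this:

```latex
% Minimal test document (a sketch, not the original listing).
% Compile with: xelatex test.tex
\documentclass{article}
\usepackage{fontspec} % assumption: fontspec is loaded, default font left unchanged

\begin{document}
Latin text. Chinese text: 紫薇北斗星
Modern greek: Διαμ πριμα εσθ ατ, κυο πχιλωσοπηια
Ancient greek: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος.
And regular latin text.
\end{document}
```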

… and the following PDF was produced:

[Figure: XeLaTeX output in the default Computer Modern font, with the unsupported Unicode characters missing]

It turns out that the missing Unicode characters are not XeLaTeX’s fault. The problem is that the font in use (by default, XeLaTeX uses a slightly more encompassing variant of Computer Modern) does not implement all Unicode characters. Covering all of Unicode in a single font is a monumental task: the code space has room for about 1.1 million code points, and more than 100,000 of them are already assigned. Only a small handful of fonts have maintainers who aim for such broad coverage (one of them is GNU FreeFont, which is already part of the Debian distribution and therefore available to XeLaTeX).

So, I thought, let’s just use a font which is dedicated to Unicode. I selected the pretty Junicode font in my document:
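The original preamble lines are not shown here. With fontspec, selecting a main font usually takes a single command, so the change was presumably along these lines (a sketch that assumes Junicode is installed as a system font under that name):

```latex
% Preamble sketch: make Junicode the main document font.
\usepackage{fontspec}
\setmainfont{Junicode} % assumption: the font is installed and named "Junicode"
```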

The result was:

[Figure: XeLaTeX output in the Junicode font; the Greek text renders, but the Chinese characters are still missing]

Now Greek worked, but there were still no Chinese characters. It turned out that even fonts which are dedicated to Unicode do not yet have all possible characters implemented, because it’s a lot of work to produce high-quality glyphs in a matching style for every possible character.

So, how do regular web browsers or office applications do it? They use a mechanism called font fallback. When a particular character is not implemented in the chosen main font, another font which does implement it is silently used instead. XeLaTeX can do the same with a package called ucharclasses, which gives you full control over the fallback font selection process. The ucharclasses documentation gives an example that uses the \fontspec command for font selection. I decided to use the IPAexMincho font, which supports Chinese characters. So I added to my document:
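The exact lines are not reproduced here. A sketch of such an addition, modeled on the style of the ucharclasses documentation, could look like the following. The choice of \setTransitionsForCJK as the transition command is an assumption, and this is the \fontspec-based variant that the post goes on to describe as problematic:

```latex
% Preamble sketch (assumed reconstruction, not the original listing).
\usepackage{fontspec}
\usepackage{ucharclasses}
\setmainfont{Junicode}

% First argument: code run when entering a CJK block.
% Second argument: code run when leaving it again.
\setTransitionsForCJK{\fontspec{IPAexMincho}}{\fontspec{Junicode}}
```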

… but when I ran XeLaTeX with this addition, ucharclasses entered an endless loop with high CPU usage under the TeX Live 2014 distribution (as packaged in Debian). It hung at the line:

Endless googling didn’t bring up any useful hints. Something must have changed in the internals, and the ucharclasses documentation needs updating. In any event, it took me 4 hours to find a fix. The solution was to select fonts with something other than \fontspec{}, because it no longer seems to be compatible with ucharclasses. Instead, I used the \newfontfamily mechanism suggested by fontspec. Here is the final working code:
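The original listing is not reproduced here, so the following is a sketch of a complete document using that approach. The family names \myregularfont and \mychinesefont are illustrative, and the use of \setTransitionsForCJK is an assumption:

```latex
% Complete sketch (not the original listing). Compile with xelatex.
\documentclass{article}
\usepackage{fontspec}
\usepackage{ucharclasses}

\setmainfont{Junicode}

% Define each font family once in the preamble instead of calling
% \fontspec{...} at every script transition.
\newfontfamily\myregularfont{Junicode}    % illustrative name
\newfontfamily\mychinesefont{IPAexMincho} % illustrative name

% Entering a CJK block switches to the fallback font,
% leaving it switches back to the regular font.
\setTransitionsForCJK{\mychinesefont}{\myregularfont}

\begin{document}
Latin text. Chinese text: 紫薇北斗星
Modern greek: Διαμ πριμα εσθ ατ, κυο πχιλωσοπηια
Ancient greek: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος.
And regular latin text.
\end{document}
```

With this setup the transitions only execute font switches that were already defined in the preamble. Greek needs no transition of its own here, because Junicode covers it.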

Here is the result: mixed Latin, Chinese, and Greek scripts set in two different fonts, Junicode and IPAexMincho:

[Figure: XeLaTeX output with Unicode font fallback, using Junicode and IPAexMincho]

Pretty!

4 Responses to XeLaTeX: Unicode font fallback for unsupported characters

  1. Scottie December 10, 2014 at 10:34 am #

    Oy vey… Thanks for this post. It just goes to show that something as simple as displaying characters in different languages – which most people take for granted – is a HUGE problem that can occupy lots of time and energy.

    It’s never as simple as “Just use unicode!” Heck, most programmers don’t even understand that utf-8 is an encoding, and unicode is a charset! =P

  2. Pomax February 27, 2017 at 1:32 am #

    As a small addendum to “even fonts which are dedicated to unicode do not yet have all possible characters implemented”, OpenType fonts can’t even fit all Unicode codepoints. They can only contain one USHORT’s worth of glyphs (and you typically need more glyphs than there are characters), which means only 65,535 available spots for encoding outline shapes. That’s nowhere near enough to cover all Unicode code points. OpenType fonts haven’t been able to cover “all of Unicode” since Unicode 3.0 back in the nineties =)

    • Michael Franzl February 27, 2017 at 8:47 am #

      I did not know that. Thanks for pointing this out!

  3. Paul C Roberts April 20, 2017 at 4:17 am #

    It may also be worth using the Google Noto fonts, which aim to cover all defined Unicode codepoints, including emoji.

    I think that you’d just need

    \newfontfamily\myregularfont{Noto Sans}
    \newfontfamily\mychinesefont{Noto Sans CJK TC}

    Unfortunately, I can’t try it out, because the Noto CJK fonts expose a bug in TeXLive 2013, and I don’t currently have access to TeXLive 2016.

    To be more rigorous around Japanese, Korean, etc. you’d need to define all the transitions as in the ucharclasses documentation, but Noto has fonts for all of these!
