Normalization Footprint Description

M. Davis, 2000-11-21

This document describes how much memory the different normalization forms occupy at a minimum (e.g., with an implementation tuned for minimal space consumption).

The code for a simple implementation of Normalization is about 100 lines (discounting comments) in Java, not including code for accessing data. It would take about the same amount of space in C, since it does not need to use any of the Java runtime support, nor even the String class.

The minimal data size is given below. NFC is highlighted because that will be of most interest (see W3C Character Model for the World Wide Web).

Minimal Data Size (Bytes)
Form Can. Class Decomp Comp Total
NFD 828 11,062 N/A 11,890
NFC 828 11,062 5,508 17,398
NFKD 828 26,918 N/A 27,746
NFKC 828 26,918 5,508 33,254

A breakdown of the table sizes used for each of these is given at the end of this document. In each case, an index of keys is binary-searched. If the key not is found, then the index is used to fetch a value from a result table. In the case of decomposition, that result value is used to index into a string array. The result is broken into two pieces: 3 bits for length, and 13 bits for starting offset. If the length is 7, then the string in the array is null terminated. Otherwise, the final result consists of the characters in the string array from startingOffset to startingOffset + length, inclusive. The code for doing this is around 30 lines.

Optimizations

In all cases, a trie table could be used for the main lookup instead of a binary search. Trie tables are somewhat larger, but have excellent performance characteristics. Accessing data by character in a typical trie table is simply:

result = data[index[ch>>7] + (ch & 0x7F)];

Optimized Normalization code would also be somewhat larger than the simple code. The simplest code optimization is to use the quickCheck described in Annex 8: Detecting Normalization Forms (which is about 20 lines of code) and do a quick scan of the source to see if normalization is required. For example, the table of suspect characters for NFC consists of  495 characters (Unicode 3.0). This takes an additional table of 4,492 bytes (using a byte-trie; 1,468 using a bit-trie).

Speed-Optimized Data Size (Bytes)
Form Can. Class Decomp Comp Quick
Check
Total
NFC 2,124 16,238 11,000 4,492 33,854

For more information on trie tables, inversion tables, and other structures for use with Unicode, see http://www.macchiato.com: "Bits of Unicode". To see a Javascript implementation of Normalization, see NamePrep. For other implementations, see Annex 5 in UAX 15.

Decomposition (NFD)

Decomposition (NFKD)

Canonical Class (All)

Composition (NFC, NFKC)