Question 1

What's the difference between a code point and a code unit?

Accepted Answer

A code point is the number Unicode assigns to an abstract character, written like U+1F600. A code unit is the fixed-size chunk an encoding form uses to store it: 8 bits in UTF-8, 16 bits in UTF-16, 32 bits in UTF-32. UTF-8 uses 1 to 4 code units (bytes) per code point; UTF-16 uses 1 or 2. A surrogate pair is the two-unit UTF-16 form, used for every code point above U+FFFF.
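As a rough illustration of the distinction, the sketch below counts the code units that U+1F600 occupies in each encoding form and pulls out its UTF-16 surrogate pair. The variable names are made up for this example.

```python
# U+1F600 GRINNING FACE: one code point, outside the Basic Multilingual Plane.
ch = "\U0001F600"

code_points = len(ch)                            # Python str is indexed by code point
utf8_units = len(ch.encode("utf-8"))             # 8-bit code units (bytes)
utf16_units = len(ch.encode("utf-16-le")) // 2   # 16-bit code units
utf32_units = len(ch.encode("utf-32-le")) // 4   # 32-bit code units

# The two UTF-16 code units form a surrogate pair (high, then low).
raw = ch.encode("utf-16-le")
high = int.from_bytes(raw[0:2], "little")
low = int.from_bytes(raw[2:4], "little")

print(code_points, utf8_units, utf16_units, utf32_units)  # 1 4 2 1
print(hex(high), hex(low))                                # 0xd83d 0xde00
```

Languages that index strings by UTF-16 code unit (JavaScript, Java) report a length of 2 for this same character.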

Question 2

Why does my character display as a box?

Accepted Answer

Tofu (the empty box) means the font lacks a glyph for that character even though the code point is valid. Solutions: install a font that covers the script (Noto Sans family covers most Unicode), use a font stack with fallbacks, or check whether the character is actually a private-use code point with no standard glyph.

Question 3

What are private use areas?

Accepted Answer

Ranges of code points reserved for application-specific or font-specific characters: U+E000 to U+F8FF in the Basic Multilingual Plane, plus the two supplementary private use planes (U+F0000 to U+FFFFD and U+100000 to U+10FFFD). Symbol fonts and icon font systems use them. Different fonts assign completely different glyphs to the same private-use code point, so text that relies on them is not portable across fonts.
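To check whether a mystery character falls in one of these ranges, a plain range test is enough. This is a minimal sketch; `is_private_use` and `PRIVATE_USE_RANGES` are names invented for the example.

```python
# The three private-use ranges defined by Unicode.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area (BMP)
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
)


def is_private_use(ch: str) -> bool:
    """Return True if the single character ch is a private-use code point."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


print(is_private_use("\uE000"))      # True
print(is_private_use("A"))           # False
print(is_private_use("\U0010FFFF"))  # False: a noncharacter, not private use
```

If this returns True for a character that renders as a box, the fix is to find the font the text was written for, not to install a broader font.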

Question 4

Is Unicode versioned?

Accepted Answer

Yes. New characters arrive in major versions, typically once a year. Unicode 16.0 came out in September 2024 with about 5,000 new characters, mostly Egyptian hieroglyphs and newly encoded scripts, plus a handful of emoji. Older systems may not render newer characters at all. New emoji typically take a year or two to reach all major platforms.
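A runtime's character database is versioned too, separately from the fonts installed on the machine. In Python, for instance, the standard `unicodedata` module reports which Unicode version it was built against. The helper `is_assigned` below is a hypothetical name for this sketch.

```python
import unicodedata

# Version of the Unicode Character Database bundled with this interpreter.
print(unicodedata.unidata_version)


def is_assigned(ch: str) -> bool:
    """True if this interpreter's database knows ch as an assigned code point.

    General category "Cn" covers unassigned code points and noncharacters.
    """
    return unicodedata.category(ch) != "Cn"


print(is_assigned("A"))           # True
print(is_assigned("\U0010FFFF"))  # False: permanently a noncharacter
```

A character added in a newer Unicode version than the one printed will show up as unassigned here, even if the operating system can already display it.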

Question 5

What's normalization (NFC, NFD, NFKC, NFKD)?

Accepted Answer

Normalization forms are standard ways of picking one representation when the same text can be encoded as different code point sequences. NFC composes (é as one code point, U+00E9); NFD decomposes (é as e + combining acute, U+0065 U+0301). NFKC and NFKD additionally fold compatibility characters (full-width Latin letters to ASCII, ligatures to letter pairs), which discards some distinctions. Different applications expect different forms, so comparing strings without normalizing both to the same form first is a common source of bugs.
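The comparison bug is easy to reproduce. The sketch below uses Python's standard `unicodedata.normalize`; the `same_text` helper is a hypothetical name, and choosing NFC as the common form is an assumption (any one form works as long as both sides use it).

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

# Visually identical, but unequal as raw code point sequences.
print(composed == decomposed)  # False

# NFC and NFD convert between the two canonical representations.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True

# NFKC also folds compatibility characters.
ligature_folded = unicodedata.normalize("NFKC", "\ufb01")   # fi ligature -> "fi"
fullwidth_folded = unicodedata.normalize("NFKC", "\uff21")  # full-width A -> "A"
print(ligature_folded, fullwidth_folded)  # fi A


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_text(composed, decomposed))  # True
```

NFKC is lossy, so it suits search keys and identifiers better than stored user text.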

Question 6

Can I use any Unicode character in code identifiers?

Accepted Answer

JavaScript and Python both allow most Unicode letters and digits in identifiers (a digit cannot come first). Mixing scripts is a known spoofing vector in package names: Latin a (U+0061) and Cyrillic а (U+0430) look identical but are different code points. Some linters and compilers can flag mixed-script identifiers. Stick to a single script in identifiers, and to Latin unless you're deliberately working in a non-Latin language.
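A rough way to spot the look-alike problem is to collect the scripts that an identifier's letters come from. The sketch below is only a heuristic: it takes the first word of each letter's Unicode name as a stand-in for its script, which is an assumption made for illustration and not the confusable-detection algorithm from Unicode's security guidance. `scripts_used` and `looks_mixed_script` are hypothetical names.

```python
import unicodedata


def scripts_used(identifier: str) -> set[str]:
    """Approximate the scripts of an identifier's letters.

    Heuristic: the first word of a letter's Unicode name, for example
    "LATIN" or "CYRILLIC". Digits and underscores are ignored.
    """
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def looks_mixed_script(identifier: str) -> bool:
    return len(scripts_used(identifier)) > 1


latin = "pay"
spoofed = "p\u0430y"  # middle letter is CYRILLIC SMALL LETTER A

print(latin == spoofed)             # False, although they render alike
print(spoofed.isidentifier())       # True: Python accepts it as a name
print(looks_mixed_script(latin))    # False
print(looks_mixed_script(spoofed))  # True
```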

Question 7

How does Unicode handle emoji modifiers?

Accepted Answer

Skin-tone modifiers (U+1F3FB through U+1F3FF) and ZWJ sequences (built with the zero-width joiner, U+200D) combine multiple code points into a single visible emoji. One emoji can span 7 or more code points: the four-person family emoji is four person emoji joined by three ZWJs, and skin tones add more. Length counting and string reversal both need to operate on whole sequences (grapheme clusters), not individual code points, to handle emoji correctly.
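The following sketch shows how far naive counting drifts from what the user sees. It uses only built-in string operations, with the emoji written as escapes so the code points are explicit; the variable names are made up for the example.

```python
ZWJ = "\u200D"

# Family: man + ZWJ + woman + ZWJ + girl + ZWJ + boy, rendered as one emoji.
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])

# Thumbs up followed by a medium skin-tone modifier, rendered as one emoji.
thumbs = "\U0001F44D\U0001F3FD"

print(len(family))                           # 7 code points
print(len(family.encode("utf-16-le")) // 2)  # 11 UTF-16 code units
print(len(family.encode("utf-8")))           # 25 bytes
print(len(thumbs))                           # 2 code points

# Naive reversal works per code point, so the modifier ends up
# in front of the base emoji and no longer applies to it.
reversed_thumbs = thumbs[::-1]
print(reversed_thumbs == "\U0001F3FD\U0001F44D")  # True
```

Python's standard library does not segment grapheme clusters, so counting what the user perceives as characters needs a third-party library or an ICU binding.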

Question 8

What's the maximum Unicode code point?

Accepted Answer

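U+10FFFF, which gives 1,114,112 possible code points (about 1.1 million). Around 155,000 are currently assigned to characters; most of the rest are unassigned and reserved for future use, apart from the surrogate, private-use, and noncharacter ranges. The 21-bit code point space was set in 1996 and is unlikely to be expanded, because the UTF-16 surrogate pair design cannot address anything higher.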
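The arithmetic behind that ceiling can be checked directly. This is a small worked example; the constant and function names are invented for the sketch.

```python
# 17 planes of 65,536 code points each: plane 0 (the BMP) plus 16 more.
PLANES = 17
TOTAL_CODE_POINTS = PLANES * 0x10000
MAX_CODE_POINT = TOTAL_CODE_POINTS - 1

# A surrogate pair carries 10 bits in the high unit (0xD800-0xDBFF) and
# 10 bits in the low unit (0xDC00-0xDFFF): 1024 * 1024 combinations,
# which is exactly the 16 supplementary planes.
SUPPLEMENTARY = 1024 * 1024


def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(TOTAL_CODE_POINTS)                          # 1114112
print(hex(MAX_CODE_POINT))                        # 0x10ffff
print(MAX_CODE_POINT.bit_length())                # 21
print(hex(decode_surrogate_pair(0xDBFF, 0xDFFF))) # 0x10ffff, the largest pair
```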
