Text Encoding Explained — UTF-8, ASCII, and Beyond

Understanding Text Encoding: Why Characters Break

Every piece of text on your computer is stored as a sequence of numbers. Text encoding is the system that defines which number represents which character. When you see garbled text — Ã©motionnel instead of émotionnel, or ??? instead of Chinese characters, or â€™ instead of an apostrophe — you are experiencing an encoding mismatch: the numbers were written using one encoding system and read using a different one.
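The mismatch can be reproduced in a few lines. The sketch below is only an illustration (the variable names are ours, not part of any tool mentioned here): it writes text as UTF-8 bytes and then reads the same bytes back with a different encoding.

```python
# Encode with one encoding, decode with another: the mismatch described above.
original = "émotionnel"
stored_bytes = original.encode("utf-8")      # é becomes the two bytes C3 A9
misread = stored_bytes.decode("latin-1")     # each byte is read as its own character

print(stored_bytes.hex(" "))
print(misread)  # Ã©motionnel

# The apostrophe case: a curly apostrophe (U+2019) is three bytes in UTF-8,
# and Windows-1252 turns those three bytes into three separate characters.
apostrophe = "\u2019"
misread_apostrophe = apostrophe.encode("utf-8").decode("cp1252")
print(misread_apostrophe)  # â€™
```

Nothing is lost at this stage: the bytes are intact and only the interpretation is wrong, which is why this kind of damage can usually be undone.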

This problem has plagued computing for decades and still causes issues in 2026. Email attachments, CSV imports, web pages, database migrations, and file transfers between systems are all vulnerable to encoding mismatches. Understanding the basics of text encoding helps you diagnose and fix these issues rather than just being frustrated by them.

ASCII: The Original Encoding

ASCII (American Standard Code for Information Interchange), first published in 1963, assigns numbers 0 through 127 to 128 characters: English letters (uppercase and lowercase), digits 0-9, punctuation marks, and control characters (like newline and tab). ASCII works perfectly for English text but has no provision for accented characters (é, ñ, ü), non-Latin scripts (Chinese, Arabic, Hindi, Korean), or even basic symbols like the Euro sign (€).
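As a rough illustration of that limit, the following sketch prints the ASCII numbers for a few characters and checks whether a string fits in ASCII at all. The helper name is_ascii_encodable is hypothetical, chosen for this example.

```python
def is_ascii_encodable(text: str) -> bool:
    """Return True if every character in text fits in 7-bit ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

# A few characters and the numbers ASCII assigns to them.
for ch in ["A", "a", "0", "\t", "\n"]:
    print(repr(ch), ord(ch))

print(is_ascii_encodable("Hello, world"))  # True
print(is_ascii_encodable("señor"))         # False: ñ is outside 0-127
print(is_ascii_encodable("€5"))            # False: so is the Euro sign
```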

Because ASCII uses only 7 bits per character, the eighth bit in each byte was available for extensions. Different regions and vendors created different encodings that used byte values 128-255 for their local characters: Latin-1 (Western European), Latin-2 (Central European), Windows-1251 (Cyrillic), and dozens more. East Asian encodings such as Shift-JIS (Japanese) and Big5 (Traditional Chinese) went further and used those high bytes as part of multi-byte sequences. The same number (say, 233) might represent é in Latin-1, й in Windows-1251, or half of a Japanese character in Shift-JIS. This is why a Russian email opened with Western European encoding settings displays meaningless accented Latin characters.
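To make the ambiguity concrete, here is a small sketch that decodes the single byte 233 under several legacy encodings, then reproduces the Russian email case. The codec names are the ones Python's standard library uses; the sample word is our own.

```python
raw = bytes([233])  # one byte, value 0xE9

# The same byte means a different character in each single-byte encoding.
for name in ("latin-1", "cp1251", "iso8859_7"):
    print(name, "->", raw.decode(name))

# In Shift-JIS this byte is only part of a two-byte character,
# so decoding it alone is expected to fail.
try:
    print("shift_jis ->", raw.decode("shift_jis"))
except UnicodeDecodeError as exc:
    print("shift_jis -> cannot decode:", exc.reason)

# A Russian word stored as Windows-1251 and opened as Latin-1.
russian = "Привет"
misread_russian = russian.encode("cp1251").decode("latin-1")
print(misread_russian)  # Ïðèâåò
```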

Unicode: The Universal Solution

Unicode assigns a unique number (called a code point) to every character it covers, and it aims to cover every script used by humans: recent versions define more than 150,000 characters across over 160 scripts, including historical scripts, mathematical symbols, musical notation, and yes, emoji. The letter é is always U+00E9, the Chinese character 中 is always U+4E2D, and the emoji 😊 is always U+1F60A, regardless of the platform, language, or application.
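Code points are easy to inspect. The sketch below uses Python's built-in ord and the standard unicodedata module; code_point_label is a hypothetical helper name used only here.

```python
import unicodedata

def code_point_label(ch: str) -> str:
    """Format a single character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"

for ch in ["é", "中", "😊"]:
    # unicodedata.name returns the official character name from the
    # Unicode database bundled with the running Python version.
    print(ch, code_point_label(ch), unicodedata.name(ch))
```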

Unicode solves the character identification problem — every character has one unambiguous number. But Unicode itself is not an encoding — it does not specify how those numbers are stored as bytes in a file. That job falls to encoding formats, the most important of which is UTF-8.
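The distinction between a code point and its bytes can be seen by encoding one character several ways. This is a minimal sketch; the big-endian variants are chosen here only so that no byte order mark is added to the output.

```python
ch = "中"  # code point U+4E2D

# One code point, three different byte sequences depending on the encoding.
encoded = {
    "utf-8": ch.encode("utf-8"),
    "utf-16-be": ch.encode("utf-16-be"),
    "utf-32-be": ch.encode("utf-32-be"),
}

for name, data in encoded.items():
    print(f"{name:10} {len(data)} bytes  {data.hex(' ')}")
```

Decoding any of these with its matching encoding returns the same character, which is the sense in which Unicode identifies the character and the encoding decides the storage.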

UTF-8: The Dominant Encoding

UTF-8 is the most widely used text encoding on the internet, used by over 98 percent of web pages. It is a variable-length encoding: ASCII characters (0-127) use one byte (making UTF-8 backwards-compatible with ASCII); most accented Latin letters, along with Greek, Cyrillic, Hebrew, and Arabic, use two bytes; most other characters in common use, including nearly all Chinese, Japanese, and Korean characters, use three bytes; and characters beyond U+FFFF, which include most emoji, use four bytes. This variable-length design means English text in UTF-8 is the same size as ASCII, while still being able to represent every character in Unicode.
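The byte counts can be checked directly. In this sketch, utf8_length is a hypothetical helper name; the last lines illustrate the ASCII compatibility with a sample string of our own.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 needs to store the given text."""
    return len(ch.encode("utf-8"))

for ch in ["A", "é", "я", "中", "😊"]:
    print(ch, f"U+{ord(ch):04X}", utf8_length(ch), "byte(s)")

# Pure ASCII text produces identical bytes in ASCII and in UTF-8.
english = "plain English text"
same_bytes = english.encode("ascii") == english.encode("utf-8")
print(same_bytes)  # True
```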

UTF-8's backwards compatibility with ASCII is the key to its dominance — existing ASCII documents are automatically valid UTF-8 documents. This meant the internet could gradually migrate to UTF-8 without breaking existing content. Our Text Encoding Converter at tristanconvert.com detects the encoding of any text file and converts between UTF-8, Latin-1, Windows-1252, and other common encodings.

How to Fix Encoding Problems

When you encounter garbled text, the fix is straightforward in principle: determine what encoding the text was written in, then read it with that encoding. The challenge is determining the original encoding, because automatic detection is a heuristic rather than a guarantee: the same bytes can be valid in multiple encodings with different interpretations, so a detector can only pick the most plausible candidate.
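One common heuristic, sketched below under our own assumptions, is to try strict UTF-8 first (random non-UTF-8 bytes rarely pass as valid UTF-8) and then fall back to a list of legacy encodings. The function name decode_with_fallback and the candidate order are illustrative choices, not a standard API.

```python
def decode_with_fallback(data: bytes,
                         candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    This is a guess, not a detection: the first encoding that does not
    raise an error wins, even if it is not the one the author used.
    """
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")

print(decode_with_fallback("café".encode("utf-8")))    # decoded as utf-8
print(decode_with_fallback("café".encode("latin-1")))  # falls back to cp1252

# The limitation: Windows-1251 bytes also decode without error as
# Windows-1252, so the guess is wrong and the text comes out garbled.
wrong_text, wrong_guess = decode_with_fallback("Привет".encode("cp1251"))
print(wrong_guess, wrong_text)
```

The last case is why it helps to pass a candidate list that matches what you know about the source, for example putting cp1251 first for files from a Russian-language system.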

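Clues for identifying encoding: if accented characters appear as two-character sequences (Ã© for é) or punctuation appears as three-character sequences (â€™ for an apostrophe), the text is UTF-8 being read as Latin-1 or Windows-1252. If you see question marks where special characters should be, the characters were most likely lost at an earlier step, when the text was converted to an encoding that cannot represent them; empty squares usually mean the text was decoded correctly but the font has no glyph for those characters. If Cyrillic text appears as a run of accented Latin characters, the text is probably Windows-1251 (or another Cyrillic encoding such as KOI8-R) being read as Latin-1. Use these patterns to identify the mismatch and specify the correct encoding in your import settings.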
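When the mismatch has been identified, already-garbled text can often be repaired by reversing the wrong step: turn the characters back into the original bytes, then decode those bytes correctly. The sketch below assumes the damage happened exactly once and that no bytes were dropped along the way; repair_mojibake is a hypothetical helper name.

```python
def repair_mojibake(text: str, wrong: str = "cp1252", right: str = "utf-8") -> str:
    """Undo one round of 'decoded with the wrong encoding'.

    If the text does not fit the assumed pattern, it is returned unchanged.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

print(repair_mojibake("Ã©motionnel"))   # émotionnel
print(repair_mojibake("donâ€™t"))       # don’t
print(repair_mojibake("émotionnel"))    # already correct, left as-is

# Cyrillic read as Latin-1: reverse with the matching pair of encodings.
print(repair_mojibake("Ïðèâåò", wrong="latin-1", right="cp1251"))  # Привет
```

This only works while the original bytes are recoverable. Text in which characters have already been replaced by question marks cannot be repaired this way and has to be re-imported from the source.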

Best Practices for Avoiding Encoding Issues

Use UTF-8 for everything. Specify UTF-8 explicitly when creating files, database connections, email headers, and HTTP responses. Do not rely on default encoding settings because defaults vary between operating systems, applications, and locales. When receiving data from external sources, ask what encoding was used rather than assuming. When problems occur, fix the encoding at the source rather than applying character-by-character corrections in the output.
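In code, specifying UTF-8 explicitly usually means passing an encoding argument instead of relying on the platform default. The following is a minimal sketch using Python's built-in open; the helper names save_text and load_text and the temporary file are ours, for illustration only.

```python
import tempfile
from pathlib import Path

def save_text(path: Path, text: str) -> None:
    # encoding is stated explicitly, so the result does not depend on the
    # operating system or locale; newline="" writes line endings unchanged.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(text)

def load_text(path: Path) -> str:
    with open(path, "r", encoding="utf-8", newline="") as handle:
        return handle.read()

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    save_text(sample_path, "émotionnel 中 😊\n")
    round_tripped = load_text(sample_path)
    raw_bytes = sample_path.read_bytes()

print(round_tripped, end="")
print(len(raw_bytes), "bytes on disk")
```

The same principle applies outside file I/O, for example by declaring charset=utf-8 in an HTTP Content-Type header so that the reader does not have to guess.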