Character Set in Computers | Jackey Song's Blog

Computer data is represented in binary in all computing systems, which is not human-friendly to read. Therefore, character encoding schemes were developed to map different binary data to corresponding characters. When we read electronic documents, computers decode binary data into readable characters for users. Different encoding schemes use unique methods to encode and decode characters, and using the decoding method of one character set to interpret documents encoded in another character set can result in garbled text.
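To make the "garbled text" problem concrete, here is a minimal sketch in Python. The sample string and the choice of Latin-1 as the mismatched decoder are illustrative assumptions, not something taken from a particular system.

```python
# Encode a short Chinese string with one scheme, then decode it with another.
text = "中文"
utf8_bytes = text.encode("utf-8")  # six bytes: e4 b8 ad e6 96 87

# Decoding with the matching scheme recovers the original text.
correct = utf8_bytes.decode("utf-8")

# Decoding the same bytes as Latin-1 does not fail, but every byte is
# interpreted as its own character, so the result is unreadable.
garbled = utf8_bytes.decode("latin-1")

print(correct)
print(repr(garbled))
```

The bytes never change; only the interpretation does, which is why the same file can look fine in one program and broken in another.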

Computers originated in the United States, so initially only English text was supported. English has 26 letters (52 counting upper and lower case), along with digits, punctuation marks, and control characters such as carriage return and line feed. Together these come to just over a hundred characters, which fit easily in one byte (8 bits). An 8-bit number ranges from 0 to 255, giving 256 possible values, but only 7 of the 8 bits (128 values) were used at first.

As computers became more widespread globally, they were sold in Europe, Asia, and beyond, leading to the development of additional character encodings to meet different language needs. The following are commonly used character sets:

ASCII Character Set

Definition: ASCII (American Standard Code for Information Interchange) is a computer encoding system based on the Roman alphabet, primarily for displaying modern English and other Western European languages.
Development: ASCII was first published as a standard in 1963, received a major revision in 1967, and was last updated in 1986. It is one of the earliest and most fundamental character encoding standards used in computers.
Features: ASCII uses 7-bit encoding, comprising 128 characters. Codes 32 to 126 are printable (space, letters, digits, and punctuation); the remaining codes are control characters. To accommodate more European characters, various 8-bit extensions of ASCII were later defined, each using the extra bit to reach 256 characters.
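As a small illustration of the 7-bit range, the sketch below uses Python's built-in `ord`, `chr`, and the `ascii` codec. The helper name `fits_in_ascii` is made up for this example.

```python
# Print the code and 7-bit pattern for a few ASCII characters.
for ch in ["A", "a", "0", " ", "~"]:
    code = ord(ch)
    print(repr(ch), code, format(code, "07b"))

# Codes 32..126 are the printable characters.
printable = [chr(c) for c in range(32, 127)]

def fits_in_ascii(s):
    """Return True if every character of s can be encoded as 7-bit ASCII."""
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

print(len(printable))
print(fits_in_ascii("Hello"), fits_in_ascii("café"))
```

The second call returns `False` because "é" has no code in the 128-character set, which is exactly the gap the 8-bit extensions tried to fill.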

GB2312 Character Set

Definition: GB2312, also known as GB2312-80, is the Chinese national standard character set for simplified Chinese characters.
Development: GB2312 was published by China's national standards body in 1980 and took effect in May 1981. It met the computing needs for Chinese characters, covering 99.75% of usage in mainland China at that time.
Features: GB2312 includes 7,445 characters, comprising 6,763 Chinese characters and 682 other characters (Latin letters, Greek letters, Japanese kana, etc.). It uses a two-byte encoding format, with characters organized into 94 “zones” of 94 positions each.
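The zone layout can be sketched with Python's `gb2312` codec. This assumes the common EUC-CN byte form, where each of the two bytes is the zone or position number plus 0xA0; the function name `gb2312_zone_position` is hypothetical.

```python
def gb2312_zone_position(ch):
    """Return (zone, position) for a character that GB2312 encodes in two bytes.

    Assumes the EUC-CN form used by Python's gb2312 codec, where
    byte value = zone (or position) + 0xA0. ASCII input is not handled.
    """
    high, low = ch.encode("gb2312")
    return high - 0xA0, low - 0xA0

for ch in ["啊", "中"]:
    zone, position = gb2312_zone_position(ch)
    print(ch, ch.encode("gb2312").hex(), f"zone {zone}, position {position}")
```

For example, "啊" is the first Chinese character in the table and sits in zone 16, position 1, so its bytes are 0xB0 0xA1.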

BIG5 Character Set

Definition: BIG5, also known as Big Five, is the traditional Chinese character set used in Taiwan.
Development: BIG5 was established in 1984 by the Institute for Information Industry and five software companies in Taiwan to address compatibility issues among different vendors’ proprietary encodings.
Features: BIG5 includes 13,053 Chinese characters using a two-byte storage method. It covers a vast number of traditional Chinese characters but does not include all characters used in names and places.

GB18030 Character Set

Definition: GB18030-2000 is the Chinese national standard character set for information interchange. Its full title describes it as an extension of the basic set, that is, of GB2312.
Development: GB18030 was issued by the Chinese government in 2000 after extensive deliberation to resolve encoding issues for Chinese, Japanese kana, Korean, and Chinese minority languages.
Features: GB18030 provides more than 1.5 million code positions, encoded in one, two, or four bytes. It includes 27,484 Chinese characters and is compatible with Unicode 3.0, so it can also represent other written languages from around the world.
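GB18030 is a variable-width encoding that uses one, two, or four bytes per character. A rough sketch with Python's `gb18030` codec shows the three lengths; the sample characters are arbitrary choices for illustration.

```python
# One sample per byte length: ASCII, a common Chinese character, an emoji.
samples = ["A", "中", "😀"]

# Map each character to the number of bytes GB18030 needs for it.
lengths = {ch: len(ch.encode("gb18030")) for ch in samples}

for ch, n in lengths.items():
    print(ch, ch.encode("gb18030").hex(), n)

# Characters already in GB2312 keep the same two bytes in GB18030.
same_as_gb2312 = "中".encode("gb18030") == "中".encode("gb2312")
print(same_as_gb2312)
```

The last check reflects the backward compatibility with GB2312: existing two-byte codes are unchanged, and the four-byte form is only used for characters outside the older tables.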

Unicode Character Set

Definition: Unicode is a character encoding standard developed by the Unicode Consortium. It is kept synchronized with ISO/IEC 10646, the Universal Multiple-Octet Coded Character Set (UCS).
Development: Work on Unicode began in the late 1980s, and version 1.0 of the standard was published in 1991. The latest version as of this writing is Unicode 6.1 (note: Unicode continues to be updated).
Features: Unicode assigns a unique number, called a code point, to every character in every language, facilitating text conversion, processing, and display across different languages and platforms. How a code point is stored as bytes is decided separately by an encoding form.
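A quick way to see the number Unicode assigns to a character is Python's `ord`, with `chr` going the other way. The helper `code_point_label` below is a made-up name that formats the number in the usual `U+XXXX` notation.

```python
import unicodedata

def code_point_label(ch):
    """Format a character's Unicode code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

for ch in ["A", "é", "中", "😀"]:
    print(ch, code_point_label(ch), unicodedata.name(ch))

# chr() maps a code point back to its character.
print(chr(0x4E2D))
```

Note that the label is just a number; nothing here says how many bytes the character occupies in a file.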

Unicode itself only assigns code points. Encoding forms such as UTF-8, UTF-16, and UTF-32 define how those code points are stored as bytes, each with distinct features and applications.

UTF-8

Definition and Features:

UTF-8 (Unicode Transformation Format-8) is a variable-width character encoding that is backward compatible with ASCII and widely used for internet data exchange.
It uses 1 to 4 bytes to represent a character based on its Unicode position.
For ASCII characters (U+0000 to U+007F), UTF-8 encoding is identical to ASCII encoding, i.e., single-byte encoding with the highest bit as 0.
For non-ASCII characters, UTF-8 uses two to four bytes. The first byte starts with as many 1 bits as there are bytes in the sequence (110, 1110, or 11110), and every following byte starts with 10 to mark it as a continuation byte.
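The byte layout described above can be sketched as a small hand-written encoder. This is an illustration only; in practice `str.encode("utf-8")` does the job, and the function name `utf8_encode` is hypothetical.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx (identical to ASCII)
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

for ch in "Aé中😀":
    encoded = utf8_encode(ord(ch))
    print(ch, " ".join(format(b, "08b") for b in encoded))
```

Printing the bytes in binary makes the prefixes visible: a leading 0 for ASCII, and 110, 1110, or 11110 followed by 10-prefixed continuation bytes for everything else.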

Advantages:

ASCII compatibility minimizes impact on existing systems.
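A single encoding can represent text in any language, avoiding mismatches between regional character sets.
Saves storage space for ASCII-heavy text, since those characters take only one byte (most Chinese characters take three).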

Disadvantages:

Some operations may be slower because of the variable length; for example, finding the n-th character requires scanning from the start.

UTF-16

Definition and Features:

UTF-16 is another Unicode encoding scheme. It represents each character with either one 16-bit code unit (2 bytes) or two 16-bit code units (4 bytes).
It evolved from UCS-2. Code points in the Basic Multilingual Plane (BMP) are stored directly as a single two-byte code unit.
For supplementary planes (planes 1 to 16, code points ranging from 0x10000 to 0x10FFFF), UTF-16 introduces surrogate pairs to represent characters, making it a variable-length encoding.
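The surrogate pair calculation can be sketched as follows. The arithmetic is the standard UTF-16 rule; the function name `to_surrogate_pair` is made up for this example.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000          # a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(ord("😀"))   # U+1F600
print(f"{high:04X} {low:04X}")

# Python's own encoder produces the same two code units.
print("😀".encode("utf-16-be").hex())
```

Both lines show the code units D83D and DE00, so the hand calculation agrees with the built-in codec for this character.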

Advantages:

Most commonly used characters (those in the BMP) take exactly two bytes, which keeps processing simple and fast.
Supports all characters in the Unicode character set.

Disadvantages:

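Byte order (endianness) must be agreed on or marked with a byte order mark (BOM); otherwise text may be decoded incorrectly during information exchange.
Uses more storage than UTF-8 for ASCII text (two bytes per character), and characters outside the BMP require four bytes.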
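The endianness issue can be shown with a short sketch using Python's `utf-16-be`, `utf-16-le`, and `utf-16` codecs together with the BOM constants from the `codecs` module. The sample character is an arbitrary choice.

```python
import codecs

text = "中"                                # U+4E2D
big = text.encode("utf-16-be")             # bytes 4e 2d
little = text.encode("utf-16-le")          # bytes 2d 4e

# Reading big-endian bytes as little-endian silently gives a different character.
wrong = big.decode("utf-16-le")
print(repr(wrong), f"U+{ord(wrong):04X}")

# With a byte order mark in front, the generic "utf-16" codec picks the right order.
from_be = (codecs.BOM_UTF16_BE + big).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + little).decode("utf-16")
print(from_be, from_le)
```

Nothing raises an error in the mismatched case, which is what makes byte order mistakes easy to miss.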

UTF-32

Definition and Features:

UTF-32 is the simplest Unicode encoding scheme, representing each Unicode character code point directly as a 32-bit code unit.
It is a fixed-width (or fixed-length) encoding scheme for Unicode characters.

Advantages:

Simple, fast processing: because every character is exactly four bytes, the n-th character can be located directly.
Directly represents all characters in the Unicode character set without surrogate pairs.

Disadvantages:

Wastes storage space, as even ASCII characters require four bytes.
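To compare the storage trade-offs of the three encoding forms side by side, here is a minimal sketch. The helper name `encoded_sizes` is hypothetical, and the little-endian variants are used only so that no BOM is counted.

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16, and UTF-32 (no BOM)."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}

for sample in ["A", "中", "😀", "Hello, 世界"]:
    print(repr(sample), encoded_sizes(sample))
```

An ASCII letter takes 1, 2, and 4 bytes respectively, a common Chinese character takes 3, 2, and 4, and an emoji takes 4 bytes in all three, which is why the best choice depends on the kind of text being stored.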
