What is the difference between Character Sets, Encoding & Collations


You have probably come across the terms encoding, character set and collation before.

So what exactly do these terms mean?

A character set is a set of symbols and their encodings. It can also be described as a list of characters, each assigned a unique number. For example, in the Unicode character set, the number for “A” is 65, usually written in hexadecimal as U+0041.
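As a quick check of the character-to-number idea, here is a minimal Python sketch. It relies only on the built-in functions ord and chr; the variable names are just for illustration. Note that the Unicode number for “A” is conventionally written in hexadecimal as U+0041, which is 65 in decimal.

```python
# Look up the Unicode number (code point) of a character, and go back again.
code_point = ord("A")               # the number assigned to "A", in decimal
hex_form = f"U+{code_point:04X}"    # the conventional hexadecimal notation

print(code_point)   # 65
print(hex_form)     # U+0041
print(chr(0x41))    # A
```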

A collation is a set of rules for comparing characters in a character set.

Encoding is an algorithm that translates a list of numbers to binary so it can be stored on disk. For example, UTF-8 would translate the number sequence 1, 2, 3, 4 into the bytes “00000001 00000010 00000011 00000100”.
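The byte sequence in this example can be reproduced with a short Python sketch. It treats each number as a code point, encodes the result as UTF-8 and prints every byte as eight bits. The variable names are illustrative only.

```python
# Treat each number as a code point, then encode the text as UTF-8.
numbers = [1, 2, 3, 4]
text = "".join(chr(n) for n in numbers)
encoded = text.encode("utf-8")

# Show every stored byte as an 8-bit binary string.
bits = " ".join(f"{byte:08b}" for byte in encoded)
print(bits)  # 00000001 00000010 00000011 00000100
```

Each of these small numbers fits into a single byte. UTF-8 uses more than one byte per number only for values of 128 and above.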

Let’s explore these terms with the help of an example:

Suppose we have four letters: “S”, “T”, “s”, “t”. We assume that each letter corresponds to a number:

“S”=0, “T”=1, “s”=2, “t”=3

Here the letter “S” is a symbol, and its corresponding number, 0, is the encoding for “S”. The combination of all four letters and their encodings is a character set.

If you want to compare two letters, you can simply compare their encodings.

For example, to compare “S” with “T”, look at their encodings, which are 0 and 1 respectively. Since 0 is less than 1, we can quickly conclude that “S” is less than “T”.
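The toy character set above can be written down directly. The following is a minimal sketch: the dictionary TOY_CHARSET and the function compare are hypothetical names made up for this example, not part of any library.

```python
# The four-letter toy character set from the example: symbol -> encoding.
TOY_CHARSET = {"S": 0, "T": 1, "s": 2, "t": 3}

def compare(a, b):
    """Return -1, 0 or 1 by comparing the toy encodings of a and b."""
    x, y = TOY_CHARSET[a], TOY_CHARSET[b]
    return (x > y) - (x < y)

print(compare("S", "T"))  # -1, because 0 is less than 1
print(compare("t", "s"))  # 1, because 3 is greater than 2
```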

A collation is a set of rules for comparing characters in a character set, for example when sorting data in MySQL. Choosing an appropriate collation is very important if you are dealing with sorting of multilingual data. In MySQL, “utf8_unicode_ci” is a commonly preferred collation in such cases (“utf8mb4_unicode_ci” is its counterpart for the utf8mb4 character set).
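To see how the choice of comparison rules changes a sort, here is a rough Python sketch. It only imitates the difference between a binary comparison and a case-insensitive one by using str.casefold as the sort key. This is an assumption made for illustration; real MySQL collations implement much richer, language-aware rules.

```python
letters = ["t", "S", "s", "T"]

# Binary-style ordering: compare the raw code points.
binary_order = sorted(letters)

# Case-insensitive ordering: "S" and "s" compare as equal, so the
# stable sort keeps equal letters in their original input order.
ci_order = sorted(letters, key=str.casefold)

print(binary_order)  # ['S', 'T', 's', 't']
print(ci_order)      # ['S', 's', 't', 'T']
```

The same four letters come out in a different order depending on the rules applied, which is exactly what a collation controls.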

There are some other terms that are closely related to what we are discussing, such as Unicode and UTF-8. Let’s try to understand these as well.

UTF-8 is an encoding that translates between binary data and numbers, whereas Unicode is a character set that translates between numbers and characters.
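The two steps can be followed in a small Python sketch that starts from raw bytes. The byte values are sample data chosen for this illustration.

```python
# Three bytes as they might be stored on disk (sample data).
raw = b"\x53\xc3\xa9"

# Step 1 (UTF-8): binary data -> numbers (code points).
text = raw.decode("utf-8")
code_points = [ord(ch) for ch in text]
print(code_points)  # [83, 233]

# Step 2 (Unicode): numbers -> characters.
characters = [chr(n) for n in code_points]
print(characters)   # ['S', 'é']
```

Python performs both steps inside decode, so the sketch separates them only to make the division of labour visible.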

Unicode is a standard that defines the Universal Character Set (UCS), a superset of existing character sets intended to cover the characters of all known languages. Unicode assigns a name and a number (code point) to each character in its repertoire.

UTF-8, on the other hand, is a way to represent these characters digitally in computer memory. It maps each code point to a sequence of one to four octets (8-bit bytes).
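The mapping from code points to octets can be inspected with a minimal Python sketch. The helper utf8_octets is a hypothetical name introduced here, and the sample characters are chosen to show sequences of different lengths.

```python
def utf8_octets(ch):
    """Return the UTF-8 octets of a single character as hex strings."""
    return [f"{byte:02X}" for byte in ch.encode("utf-8")]

# Sample characters written as escapes: A, e-acute, euro sign, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    # Print the code point followed by the octets UTF-8 stores for it.
    print(f"U+{ord(ch):04X}", utf8_octets(ch))

# U+0041 ['41']
# U+00E9 ['C3', 'A9']
# U+20AC ['E2', '82', 'AC']
# U+1F600 ['F0', '9F', '98', '80']
```

The larger the code point, the more octets UTF-8 needs, from one octet for “A” up to four for characters beyond U+FFFF.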

