Binary level

Encoding numbers and text

System Programming Fundamentals

Encoding numbers and text

Integers
Hexadecimal form
Real numbers
Text

Integers

In decimal digits, each digit has 10 times the value of the next one, for example in the number 134725 (one hundred thirty-four thousand seven hundred twenty-five):

the digit 1 by itself is the number 1 but because of its position has its value multiplied by 100000
the digit 3 has its value multiplied by 10000
the digit 4 has its value multiplied by 1000
the digit 7 has its value multiplied by 100
the digit 2 has its value multiplied by 10
the digit 5 has its value multiplied by 1

1 * 100000 = 100000
3 * 10000  =  30000
4 * 1000   =   4000
7 * 100    =    700
2 * 10     =     20
5 * 1      =      5
-------------------
total      = 134725

The simplest way to represent positive integers in binary format is to follow the same pattern and assign to each bit a value that's 2 times the next one, like this:


*128	*64	*32	*16	*8	*4	*2	*1

(As you can see, the value of each bits follows the same progression as the powers of 2)

For example, the number 0 would be represented like this:

0	0	0	0	0	0	0	0	total
*128	*64	*32	*16	*8	*4	*2	*1	0

The number 5 would be represented as 4+1:

0	0	0	0	0	1	0	1	total
*128	*64	*32	*16	*8	*4	*2	*1	5

100 would be represented as 64+32+4:

0	1	1	0	0	1	0	0	total
*128	*64	*32	*16	*8	*4	*2	*1	100

With this system, the biggest number we can represent by switching all bits on a single byte (11111111) is 255. For bigger numbers, we would need to use more than one byte. Common sizes for numbers are 2 bytes (16 bits), 4 bytes (32 bits), and 8 bytes (64 bits). Each bit we add can hold twice the value of the previous one. 64 bits are enough to store the number 18 446 744 073 709 551 615!

In most cases, the bytes are ordered the same way the bits are, with the "most significant" one (the biggest) coming first, and the "least significant" (the smallest) coming last. This is called big endianness.

Certain specific software systems though (like the TCP/IP internet protocols, the jpeg format, or the Java bytecode) have big endianness, with the most significant byte coming last.

Let's take for example the number 512. In little-endian notation, it could be represented with the binary string 00000010 00000000, with the 7th bit of the first byte holding the value. That's because the bits in the first byte can hold bigger numbers than the bits of the second byte, just the first bits of a byte carry more value than the last ones.

But in big-endian notation, the bytes would be inverted, so you'd have 00000000 00000010, with the last byte carrying the bit that holds the value. As mentioned, big-endian notation is not very common, so most of the time you can just think in little-endian.

What if we wanted to represent negative numbers? We could just dedicate the first bit of our binary sequence to indicate whether the number is negative. This is the distinction between "unsigned" and "signed" integers. When we talk about "unsigned integers", we're talking about a numeric encoding that can only represent positive numbers, as all of the bits are dedicated to hold the value of the number. Signed integers, on the other hand, allow us to also store negative numbers, by dedicating one bit to record whether the value is positive or negative. The tradeoff is that signed integers have one bit less to hold the value, so they can store smaller number; for example, the biggest number you can store in an unsigned byte is 255 (as mentioned before), so the range for an unsigned byte is 0 to 255; that's 256 possible states. By splitting those 256 states between positive and negative, a signed byte can store the numbers from -128 to 127.

Modern programing languages have many integer types you can use, for example in Rust you have i8, i16, i32, i64, i128 (for storing signed integers using 8 bits, 16 bits, 32 bits etc), and their unsigned counterpart u8, u16, u32, u64, u128.

Hexadecimal form

When we start to work with encodings that require a lot of bytes, it becomes tedious to visualize them as long strings of 0 and 1. For example, this is an u32 encoding for the number 3224443499:

11000000 00110001 00011010 01101011

Kinda hard for humans to visualize. That's why programmers often write binary in hexadecimal format. We can group for bits into a single symbol, like this:

0000 -> 0
0001 -> 1
0010 -> 2
0011 -> 3
0100 -> 4
0101 -> 5
0110 -> 6
0111 -> 7
1000 -> 8
1001 -> 9
1010 -> a
1011 -> b
1100 -> c
1101 -> d
1110 -> e
1111 -> f

We are using 16 symbols (the numbers from 0 to 9 and the letters from a to f) to create a hexadecimal (from the greek work "hexadeca", which means sixteen) encoding.

So if we take the previous 32 bit string

11000000 00110001 00011010 01101011

and we break it into eight 4-bits segments

and we assign to each segment the corresponding hexadecimal symbol

1100 -> c
0000 -> 0
0011 -> 3
0001 -> 1
0001 -> 1
1010 -> a
0110 -> 6
1011 -> b

and the u32 encoding for the number 3224443499 can now be represented with the hexadecimal string c031 1a6b. This makes it much easier to compare with other numbers, for example, if we want to compare said number with the number 3214443499, we would have this in binary form:

11000000 00110001 00011010 01101011
10111111 10011000 10000011 11101011

And this in hexadecimal form:

c031 1a6b
bf98 83eb

Hexadecimal is particularly useful when you need to take a quick look to huge binary sequences, like the ones used for files. There are a lot of programs that allow you to "hex dump" some binary data, aka print it in a hexadecimal form.

Real numbers

So far we have seen how to encode positive and negative integers. But what about real numbers?

One simple way to encode them could be to dedicate the first 4 bits of a byte to store the integer part, and the last 4 bits to store the fractional part. We could assign to the bits a value like this:


*8	*4	*2	*1	*1/2	*1/4	*1/8	*1/16

So that if we wanted to represent the value 5.75, we would encode it as

0	1	0	1	1	1	0	0
*8	*4	*2	*1	*1/2	*1/4	*1/8	*1/16

(1 * 4) + (1 * 1) + (1 * 1/2) + (1 * 1/4) =
(4)     + (1)    +  (0.5)     + (0.25) = 
5.75

01011100 = 5.75

The decimal point would therefore be situated after the first 4 bits, you could think of 01011100 like it was 0101.1100

This is called a fixed point representation. One problem with such a simple approach is that there's infinite real numbers between any two adjacent integers, and not only 8 bits would only be able to represent a small range between 0.0625 (with 00000001) and 15.9375 (with 11111111), but it would also skip a lot of very common ones, like for example 0.3, which cannot be represented using this method. Take a look, we can represent 0.25 with00000100

0	0	0	0	0	1	0	0
*8	*4	*2	*1	*1/2	*1/4	*1/8	*1/16

But if we flip the last bit to get to a slightly bigger number, we get

0	0	0	0	0	1	0	1
*8	*4	*2	*1	*1/2	*1/4	*1/8	*1/16

00000101, which has the value of 0.3125. Any real number between 0.25 and 0.3125, between 00000100 and 00000101, cannot be represented using this 8-bit fixed point encoding. We would need to use a lot more bits to get to something useful. Also, what about negative real numbers?

For this and other reasons, the technique that is most commonly used for encoding real numbers on modern computers is a different one. We use floating point numbers.

Are you familiar with scientific notation for numbers? It's a way to express very big or very small numbers in a concise way, as simpler number multiplied by a power of 10. For example, the number 12 100 000 000 could be written as

1.21 × 10¹⁰ (or simply 1.21e10 to avoid having to type special characters)

and the number 0.000 000 142 03 could be written as

1.4203 × 10⁻⁷ (or 1.4203e-7)

The first number in these expressions is called mantissa and the second one is called exponent.

	mantissa	exponent
1.21 × 10¹⁰	1.21	10¹⁰
1.4203 × 10⁻⁷	1.4203	10⁻⁷

Floating point numbers work in a similar way.

Single-precision floating point numbers (sometimes called simply floats, or f32) are stored in 32 bits of memory and dedicate 1 bit to the sign, 8 bits to the exponent, and 23 bits to the mantissa.

Double-precision floating point numbers (also called double or f64) are stored in 32 bits of memory and dedicate 1 bit to the sign, 11 bits to the exponent, and 52 bits to the mantissa.

These formats allow us to encode a giant range of numbers, but it's important to remember that we're still trying to simulate something infinite (the range of real numbers, of which there's an infinite amount between each adjacent pair of integers) using a finite resource (a measly 4 byte or 8 byte binary string). That's why floating point numbers can behave in counter-intuitive ways.

For example, if you're using 32 bit floats and you try to calculate 251000.0 + 0.1, you'd expect the result to be 251000.1, but you'll actually get 251000.093750, again because of the infinite gaps between integers being impossible to cover exhaustively; the encoding will just give you the closest approximation possible.

Text

The first major text encoding standard was ASCII (American Standard Code for Information Interchange). It's a format in which text characters are stored using 7 bits, allowing for 128 (2^7) possible characters. For example, the capital letter M can be represented as 01001101; the at symbol @ is 01000000, and so on. The first 32 of the 128 characters were special "control characters", like "End of Transmission", and the remaining 96 were printable characters like latin letters, numbers, and a few common symbols like # or +.

ASCII was developed during the 1960s, and almost everything about both hardware and software has changed since then. The text that is used in modern software includes characters from non-latin alphabets (like Ж) and emojis (like 😀). ASCII has proven to be extremely limited in its range; for this reason, a more complex standard was developed: Unicode.

Unicode is more like a meta-encoding; it defines an abstract model of text, which can be implemented in different ways. The abstract model is primarily a set of graphemes and code points. A grapheme is what we see as a single character on the screen, to which it can correspond one or more code points. For example, the grapheme à is formed by the combination of a code point for the letter "a" and a code point for the accent. Unicode has potential space for 1,114,112 code points, although the vast majority of them have not been assigned yet (the committee wanted to be sure to never run out of space this time).

This abstract model can be implemented in different encodings. The most common one is UTF-8, which is the one most web pages use.

UTF-8 is optimized for saving space. It implements Unicode code points as binary strings of varying size, from 1 to 4 bytes. For example, the 128 characters of ASCII are stored exactly the same as in ASCII; and the first bit (the one ASCII doesn't use, remember it only used 7 bits and a byte has 8) is used to signal that the code point is not among the 128 ASCII ones. This means that if a UTF-8 file only uses ASCII-compatible characters, it will be able to be processed even by old software that predates the encoding.

The first bits of a UTF-8 code unit signals how many bytes it will use to represent that code point:

if it starts with 0 -> 1 byte, ASCII character
if it starts with 110 -> 2 bytes, covers most languages
if 1110 -> 3 bytes, covers East Asian languages like Chinese 
if 11110 -> 4 bytes, covers emojis and other stuff

UTF-8 is compact, retro-compatible, and it covers our modern necessities. It has its own problems though, deriving from having such a flexible length for its code points.

Since in ASCII each character had the same size, it was pretty intuitive in the way the characters map to the binary represenation. "hello" is 5 letters long, and in ASCII it requires 5 bytes to be represented. If I tell my code to give me its second character, it will just need to take the position in memory of the beginning character of the string, move to the next byte, and it will find "e".

This doesn't work with modern UTF-8 strings. If you have a string "hello 😀" and you try to access its bytes in sequence in modern C, you'll get:

h
e
l
l
o

�

Why � ? You reached the first of the 4 bytes that form the emoji 😀, and that have no grapheme to represent them by themselves. In fact, if you continue printing the string byte by byte, you'll get

h
e
l
l
o

�
�
�
�

Those �� correspond to the 4 bytes that combine to form 😀, but if you try to print them individually they have no way to be printed correctly. You're trying to print 1/4th of the emoji, that's why you get `�.

This can create confusion and more importantly, makes it harder to optimize the performance of text manipulation in programs, since you cannot just skip across strings as you please, you need to process them in the order they were written in.

An alternative encoding that tries to solve this issue is UTF-32, which stores every code point as a 32 bit integer. This takes up a lot of space, but allows programs to skip to the n character of a string like they did with ASCII, since each code point is always represented with a 32 bit integer...except that, as mentioned previously, a visible Unicode grapheme is not necessarily represented by a single code point. For example, the family emoji👨‍👩‍👦 is a grapheme composed of 5 code points, and if you try to access them one by one you'll still get some weird surprises.

There are other ways to represent Unicode in memory, and many ways to access and manipulate its components. Different programming languages have opted for different solutions, but UTF-8 is by far the most common.