base64 in, text out
written on June 6th, 2023 by alexDecoding base64 into readable text
A lot of systems don’t transmit plain binary data over networks. There are many systems that would misinterpret incoming binary data as control commands … or something. Others are designed to always expect plain text. Moreover, some methods of transmission are not utf-8/ASCII compliant and so would lose data on send. Idk a lot about this. But the point is that sending raw binary over the wire is not common practice.
So before transmitting dataj, it is encoded into a (smaller) set of characters. A commonly used set of is base64. There are versions of this set but here’s one,
A-Z
have values0-25
a-z
have values26-51
0-9
have values52-61
- the
+
character has value62
- the
/
character has value63
For a total of 64 unique character-value pairs.
Note that this paring is different than the ASCII character set. In the ACII character table
A-Z
have values65-90
a-z
have values97-122
0-9
have values48-57
- the
+
character has value43
- the
/
character has value47
There are more characters in the ASCII table. 256 total.
So, some message is encoded into base64, sent to us, and it arrives looking like this, YmVydC1sZWFybnMtYmFzZTY0
. If we know the spec of base64 encoding used, how do we decode it? Here are the steps,
- Remove trailing
=
signs.=
signs are used to pad the string when encoding to base64 in order to meet bit requirements. Notice that=
is not in the base64 character set. In this example no padding was added so we remove nothing. - Convert each character in the string to it’s base64 integer value. So
Y
is24
. - Convert each of these integers into its binary value. So
24
is11000
- Note that this is a 5-digit binary sequence. Base64 requires 6 digits (
2^6 = 64
). - In the case of decimal value
24
, we simply didn’t need the 6th digit. But we’ll add the leading0
s to get the 6-bit word,011000
- Note that this is a 5-digit binary sequence. Base64 requires 6 digits (
Now we have 6-bit words representing integer values 0-63
, which in turn represent characters in the base64 spec.
The ASCII spec has 256 possible characters though…. So how does encoding ensure that each ASCII character can map to a base64 character without losing data? Note that in order to represent 256 unique characters we’d need 8-bit words (2^8=256
). So, when encoding, a sequence of 8-bit words is parsed into a “longer” sequence of 6-bit words. The range of values that each 6-bit word can represent is smaller, but there are more words.
When decoding, we do the oppsoite. We pad the 6-bit word binary sequence with trailing 0
s until the entire length of the string is divisible by 8
.
Then,
- Convert the 6-bit words into 8-bit words.
- If the first
3
6-bit words (representing base64 charactersY
,m
, andV
from our input string) are011000 100110 010101
, then the 8-bit conversion gives us01100010 01100101 01.....
. - We just borrow bits from the next word to complete each 8-bits word.
- The trailing
0
s we added will come into play when we form the last 8-bit word.
- If the first
- Convert each of these 8-bit binary words into their integer value. So
01100010
gives us98
- And finally, convert each of these integer values to an ASCII character. For
98
, we getb
When we decode the entire string YmVydC1sZWFybnMtYmFzZTY0
we get bert-learns-base64
.
Here is the python code to do the above. It’s not well written but it works for the cases tested.
import re
def strip_padding(in_str: str) -> str:
return in_str.split("=")[0]
def convert_base64_char_to_index_int(in_str: str) -> list:
digits = [str(digit) for digit in range(10)]
indexes = []
capitals = "[A-Z]"
for char in in_str:
if char == "+":
indexes.append(62)
elif char == "/":
indexes.append(63)
elif char in digits:
# this method of finding the int value of a base64 char is kinda interesting
indexes.append(ord(char) - ord('0') + 52)
elif re.search(capitals, char):
indexes.append(ord(char) - ord('A'))
else:
indexes.append(ord(char) - ord('a') + 26)
return indexes
def convert_int_list_to_binary_str(int_list: list) -> str:
# remove prefix 0b
# pad to 6 bits
binaries = []
for num in int_list:
binaries.append(str(bin(num)[2:]).rjust(6, "0"))
binary_string = "".join(binaries)
return binary_string
def convert_binary_str_to_8_bit_binary(binary_str: str) -> list:
bit_words = []
true_length = len(binary_str) - (len(binary_str) % 8)
slow = 0
fast = 8
while slow < true_length:
bit_words.append(binary_str[slow:fast])
slow += 8
fast += 8
return bit_words
def convert_8_bits_to_int(bit_words: list) -> list:
return [int(word, 2) for word in bit_words]
def convert_ints_to_ascii(int_list: list) -> str:
plain_list = []
for num in int_list:
plain_list.append(chr(num))
return "".join(plain_list)
def main():
# user input keeps going until empty string is passed
while True:
try:
user_input = input()
if user_input == "":
return
stripped = strip_padding(user_input)
# print(stripped)
stuff = convert_base64_char_to_index_int(stripped)
# print(stuff)
binary_string = convert_int_list_to_binary_str(stuff)
# print(binary_string)
binaries = convert_binary_str_to_8_bit_binary(binary_string)
# print(binaries)
ints = convert_8_bits_to_int(binaries)
# print(ints)
plain = convert_ints_to_ascii(ints)
print(plain)
except EOFError:
return
if __name__ == "__main__":
main()
Here are some other strings to decode
dHJ5aW5nLWl0LW15c2VsZg==
->trying-it-myself
bGVhcm4taW4tcHVibGlj
->learn-in-public
cnVubmluZy1weXRob24tMy44Ljk=
->running-python-3.8.9