base64 in, text out
written on June 6th, 2023 by alex
Decoding base64 into readable text
A lot of systems don’t transmit plain binary data over networks. There are many systems that would misinterpret incoming binary data as control commands … or something. Others are designed to always expect plain text. Moreover, some methods of transmission are not 8-bit clean (they only reliably carry 7-bit ASCII text) and so would lose data on send. Idk a lot about this. But the point is that sending raw binary over the wire is not common practice.
So before transmitting data, it is encoded into a (smaller) set of characters. A commonly used set is base64. There are several versions of this set, but here’s one,
- `A-Z` have values `0-25`
- `a-z` have values `26-51`
- `0-9` have values `52-61`
- the `+` character has value `62`
- the `/` character has value `63`
For a total of 64 unique character-value pairs.
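The table above can be built in a couple of lines of Python. This is only a sketch, assuming the standard alphabet listed here; the names `ALPHABET` and `VALUE_OF` are made up for illustration.

```python
import string

# Standard base64 alphabet: A-Z, a-z, 0-9, then '+' and '/'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

# Hypothetical lookup table: character -> base64 value (its position in ALPHABET).
VALUE_OF = {char: value for value, char in enumerate(ALPHABET)}

print(len(ALPHABET))  # 64
print(VALUE_OF["A"], VALUE_OF["a"], VALUE_OF["0"], VALUE_OF["+"], VALUE_OF["/"])  # 0 26 52 62 63
```

A string plus a dict is enough here because a character's base64 value is just its position in the alphabet.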
Note that this pairing is different than the ASCII character set. In the ASCII character table
- `A-Z` have values `65-90`
- `a-z` have values `97-122`
- `0-9` have values `48-57`
- the `+` character has value `43`
- the `/` character has value `47`
There are more characters in the ASCII table: 128 in standard ASCII, or 256 if you count the extended 8-bit tables.
So, some message is encoded into base64, sent to us, and it arrives looking like this, YmVydC1sZWFybnMtYmFzZTY0. If we know the spec of base64 encoding used, how do we decode it? Here are the steps,
- Remove trailing `=` signs. `=` signs are used to pad the string when encoding to base64 in order to meet bit requirements. Notice that `=` is not in the base64 character set. In this example no padding was added so we remove nothing.
- Convert each character in the string to its base64 integer value. So `Y` is `24`.
- Convert each of these integers into its binary value. So `24` is `11000`
  - Note that this is a 5-digit binary sequence. Base64 requires 6 digits (`2^6 = 64`).
  - In the case of decimal value `24`, we simply didn’t need the 6th digit. But we’ll add the leading `0`s to get the 6-bit word, `011000`
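The three steps so far can be sketched like this. It is a compact illustration rather than the full program, it assumes the standard alphabet listed earlier, and the variable names are my own.

```python
import string

# Standard base64 alphabet (assumed): A-Z, a-z, 0-9, '+', '/'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

encoded = "YmVydC1sZWFybnMtYmFzZTY0"

# Step 1: drop trailing '=' padding (there is none in this example).
stripped = encoded.rstrip("=")

# Step 2: look up each character's base64 value (its position in the alphabet).
values = [ALPHABET.index(char) for char in stripped]

# Step 3: write each value as a zero-padded 6-bit word.
six_bit_words = [format(value, "06b") for value in values]

print(values[:3])         # [24, 38, 21]
print(six_bit_words[:3])  # ['011000', '100110', '010101']
```

`format(value, "06b")` does the "add leading 0s" part in one go, so `24` comes out as `011000` rather than `11000`.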
Now we have 6-bit words representing integer values 0-63, which in turn represent characters in the base64 spec.
An 8-bit character table has 256 possible values though…. So how does encoding ensure that each of those can be represented with base64 characters without losing data? Note that in order to represent 256 unique values we’d need 8-bit words (2^8=256). So, when encoding, a sequence of 8-bit words is parsed into a “longer” sequence of 6-bit words: every 3 of the 8-bit words (24 bits) become 4 of the 6-bit words. The range of values that each 6-bit word can represent is smaller, but there are more words.
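To see that re-slicing in the encoding direction, here is a rough sketch that encodes just the first three characters of our example, `ber`. It assumes the standard alphabet from earlier and an input whose length is a multiple of 3 (so no `=` padding is needed); the variable names are mine.

```python
import string

# Standard base64 alphabet (assumed): A-Z, a-z, 0-9, '+', '/'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

text = "ber"  # first three characters of the decoded example

# Each character becomes one 8-bit word: 3 characters -> 24 bits.
bits = "".join(format(byte, "08b") for byte in text.encode("ascii"))

# Slice the same 24 bits into 6-bit words: 24 bits -> 4 words.
six_bit_words = [bits[i:i + 6] for i in range(0, len(bits), 6)]

# Map each 6-bit value back to its base64 character.
encoded = "".join(ALPHABET[int(word, 2)] for word in six_bit_words)

print(six_bit_words)  # ['011000', '100110', '010101', '110010']
print(encoded)        # YmVy
```

`YmVy` is exactly how our input string starts, so three 8-bit words really do turn into four 6-bit words with no bits lost.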
When decoding, we do the opposite. We join the 6-bit words into one long binary string and cut it into 8-bit words. If the length of that string is not divisible by 8, the leftover bits at the end are filler from the encoder and get dropped.
Then,
- Convert the 6-bit words into 8-bit words.
  - If the first `3` 6-bit words (representing base64 characters `Y`, `m`, and `V` from our input string) are `011000 100110 010101`, then the 8-bit conversion gives us `01100010 01100101 01......`
  - We just borrow bits from the next word to complete each 8-bit word.
  - Any bits left over after the last complete 8-bit word are discarded.
- Convert each of these 8-bit binary words into their integer value. So `01100010` gives us `98`
- And finally, convert each of these integer values to an ASCII character. For `98`, we get `b`
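Here is a small sketch of that regrouping step on its own. The function name `six_bit_words_to_text` is hypothetical, and it throws away any bits at the end that do not fill a whole 8-bit word, which is the same thing the full program further down does with `true_length`.

```python
def six_bit_words_to_text(six_bit_words):
    """Regroup 6-bit words into 8-bit words and decode them as ASCII."""
    bit_string = "".join(six_bit_words)
    # Only whole 8-bit words count; leftover bits at the end are discarded.
    usable = len(bit_string) - len(bit_string) % 8
    byte_words = [bit_string[i:i + 8] for i in range(0, usable, 8)]
    return "".join(chr(int(word, 2)) for word in byte_words)


# Y, m, V, y -> 24 bits -> 3 characters, nothing left over.
print(six_bit_words_to_text(["011000", "100110", "010101", "110010"]))  # ber

# Z, g (from "Zg==") -> 12 bits -> 1 character, 4 leftover bits dropped.
print(six_bit_words_to_text(["011001", "100000"]))  # f
```

The second call shows why the leftovers have to go: keeping them and filling up to 16 bits would tack an extra zero byte onto the output.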
When we decode the entire string YmVydC1sZWFybnMtYmFzZTY0 we get bert-learns-base64.
Here is the Python code to do the above. It’s not well written but it works for the cases tested.
import re


def strip_padding(in_str: str) -> str:
    # everything before the first "=" is the actual payload
    return in_str.split("=")[0]


def convert_base64_char_to_index_int(in_str: str) -> list:
    digits = [str(digit) for digit in range(10)]
    indexes = []
    capitals = "[A-Z]"
    for char in in_str:
        if char == "+":
            indexes.append(62)
        elif char == "/":
            indexes.append(63)
        elif char in digits:
            # this method of finding the int value of a base64 char is kinda interesting
            indexes.append(ord(char) - ord('0') + 52)
        elif re.search(capitals, char):
            indexes.append(ord(char) - ord('A'))
        else:
            # anything else is assumed to be a lowercase letter
            indexes.append(ord(char) - ord('a') + 26)
    return indexes


def convert_int_list_to_binary_str(int_list: list) -> str:
    # remove prefix 0b
    # pad to 6 bits
    binaries = []
    for num in int_list:
        binaries.append(bin(num)[2:].rjust(6, "0"))
    binary_string = "".join(binaries)
    return binary_string


def convert_binary_str_to_8_bit_binary(binary_str: str) -> list:
    bit_words = []
    # ignore leftover bits that don't fill a whole 8-bit word
    true_length = len(binary_str) - (len(binary_str) % 8)
    slow = 0
    fast = 8
    while slow < true_length:
        bit_words.append(binary_str[slow:fast])
        slow += 8
        fast += 8
    return bit_words


def convert_8_bits_to_int(bit_words: list) -> list:
    return [int(word, 2) for word in bit_words]


def convert_ints_to_ascii(int_list: list) -> str:
    plain_list = []
    for num in int_list:
        plain_list.append(chr(num))
    return "".join(plain_list)


def main():
    # user input keeps going until empty string is passed
    while True:
        try:
            user_input = input()
            if user_input == "":
                return
            stripped = strip_padding(user_input)
            # print(stripped)
            stuff = convert_base64_char_to_index_int(stripped)
            # print(stuff)
            binary_string = convert_int_list_to_binary_str(stuff)
            # print(binary_string)
            binaries = convert_binary_str_to_8_bit_binary(binary_string)
            # print(binaries)
            ints = convert_8_bits_to_int(binaries)
            # print(ints)
            plain = convert_ints_to_ascii(ints)
            print(plain)
        except EOFError:
            return


if __name__ == "__main__":
    main()
Here are some other strings to decode
- `dHJ5aW5nLWl0LW15c2VsZg==` -> `trying-it-myself`
- `bGVhcm4taW4tcHVibGlj` -> `learn-in-public`
- `cnVubmluZy1weXRob24tMy44Ljk=` -> `running-python-3.8.9`
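As a sanity check, Python's standard library `base64` module can decode the same strings, which is handy for comparing against a hand-rolled decoder. A quick sketch, where the `samples` dict is just the strings from this post and their expected text:

```python
import base64

# Encoded strings from this post, mapped to the text they should decode to.
samples = {
    "YmVydC1sZWFybnMtYmFzZTY0": "bert-learns-base64",
    "dHJ5aW5nLWl0LW15c2VsZg==": "trying-it-myself",
    "bGVhcm4taW4tcHVibGlj": "learn-in-public",
    "cnVubmluZy1weXRob24tMy44Ljk=": "running-python-3.8.9",
}

# b64decode returns bytes, so decode those bytes as ASCII to get text.
results = {encoded: base64.b64decode(encoded).decode("ascii") for encoded in samples}

for encoded, decoded in results.items():
    print(encoded, "->", decoded, "(ok)" if decoded == samples[encoded] else "(mismatch)")
```

Note that `base64.b64decode` expects the `=` padding to be present, so pass it the original string rather than the stripped one.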