base64 in, text out

Decoding base64 into readable text



A lot of systems don’t transmit raw binary data over networks. Some would misinterpret incoming bytes as control characters. Others are designed to always expect plain text. Moreover, some transmission channels only handle a limited set of text characters reliably (7-bit ASCII, for example), so arbitrary bytes could get mangled or lost on send. I don’t know a lot about this. But the point is that binary data often has to travel through channels that are only safe for text.

So before transmitting data, it is encoded using a small set of characters that are safe to send. A commonly used set is base64. There are several versions of this set, but here’s one:

  • A-Z have values 0-25
  • a-z have values 26-51
  • 0-9 have values 52-61
  • the + character has value 62
  • the / character has value 63

For a total of 64 unique character-value pairs.
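As a rough sketch, the table above can be written down in Python using the standard `string` constants. The name `B64_ALPHABET` is just a label made up for this illustration.

```python
import string

# The base64 character set described above: A-Z, a-z, 0-9, then "+" and "/".
# B64_ALPHABET is a name chosen for this sketch, not something from a library.
B64_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

print(len(B64_ALPHABET))        # 64
print(B64_ALPHABET.index("Y"))  # 24
print(B64_ALPHABET[63])         # /
```

A character's position in the string is its base64 value, so `index()` goes from character to value and plain indexing goes back the other way.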

Note that this pairing is different from the ASCII character set. In the ASCII character table

  • A-Z have values 65-90
  • a-z have values 97-122
  • 0-9 have values 48-57
  • the + character has value 43
  • the / character has value 47

There are more characters in the ASCII table: 128 in total. A full 8-bit byte can hold 256 different values.
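To see the two numberings side by side, here is a small sketch that prints the ASCII code (via the built-in `ord`) next to the base64 value for a few characters. `B64_ALPHABET` is an illustrative name for the character set listed earlier.

```python
import string

# Illustrative name for the base64 character set: A-Z, a-z, 0-9, "+", "/".
B64_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

for char in "Aa0+/":
    # ord() gives the ASCII code, index() gives the base64 value
    print(char, ord(char), B64_ALPHABET.index(char))

# A 65 0
# a 97 26
# 0 48 52
# + 43 62
# / 47 63
```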



So, some message is encoded into base64, sent to us, and it arrives looking like this, YmVydC1sZWFybnMtYmFzZTY0. If we know the spec of base64 encoding used, how do we decode it? Here are the steps,

  • Remove trailing = signs. = signs are used to pad the string when encoding to base64 so that the encoded length is a multiple of 4 characters. Notice that = is not in the base64 character set. In this example no padding was added so we remove nothing.
  • Convert each character in the string to its base64 integer value. So Y is 24.
  • Convert each of these integers into its binary value. So 24 is 11000.
    • Note that this is a 5-digit binary sequence. Base64 requires 6 digits (2^6 = 64).
    • In the case of decimal value 24, we simply didn’t need the 6th digit. But we’ll add the leading 0s to get the 6-bit word, 011000

Now we have 6-bit words representing integer values 0-63, which in turn represent characters in the base64 spec.
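Here is a minimal sketch of those steps for the first three characters of the example string. The names `B64_ALPHABET` and `to_six_bit_words` are made up for this illustration.

```python
import string

# Illustrative name for the base64 character set: A-Z, a-z, 0-9, "+", "/".
B64_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def to_six_bit_words(encoded: str) -> list:
    """Map each base64 character to its value, written as a zero-padded 6-bit word."""
    stripped = encoded.rstrip("=")  # drop trailing padding, if any
    return [format(B64_ALPHABET.index(char), "06b") for char in stripped]


print(to_six_bit_words("YmV"))  # ['011000', '100110', '010101']
```

The `"06b"` format spec does the "add leading 0s" step: binary, at least 6 digits wide, padded with zeros on the left.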



A byte can hold 256 possible values though (ASCII text is stored one byte per character). So how does encoding ensure that every byte can be represented with only 64 base64 characters without losing data? Note that in order to represent 256 unique values we’d need 8-bit words (2^8 = 256). So, when encoding, a sequence of 8-bit words is regrouped into a “longer” sequence of 6-bit words: every 3 bytes (24 bits) become 4 six-bit words. The range of values that each 6-bit word can represent is smaller, but there are more words.
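The encoding direction can be sketched like this. It is a simplified illustration, not a full encoder: it assumes ASCII input whose length is a multiple of 3, so no padding is ever needed. `B64_ALPHABET` and `encode_whole_groups` are names made up for the example.

```python
import string

# Illustrative name for the base64 character set: A-Z, a-z, 0-9, "+", "/".
B64_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode_whole_groups(text: str) -> str:
    """Encode ASCII text whose length is a multiple of 3 (so no padding is needed)."""
    data = text.encode("ascii")
    if len(data) % 3 != 0:
        raise ValueError("this sketch only handles input lengths divisible by 3")
    # Write every byte as 8 bits, then cut the same bit string into 6-bit words.
    bits = "".join(format(byte, "08b") for byte in data)
    six_bit_words = [bits[i:i + 6] for i in range(0, len(bits), 6)]
    return "".join(B64_ALPHABET[int(word, 2)] for word in six_bit_words)


print(encode_whole_groups("ber"))  # YmVy
```

Three 8-bit words go in and four 6-bit words come out, which is why base64 text is about a third longer than the data it carries.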

When decoding, we do the opposite. We join the 6-bit words into one long binary sequence and cut it into 8-bit words. If the length of the sequence is not divisible by 8, the leftover bits at the end are filler that was added during encoding, so they get dropped.

Then,

  • Convert the 6-bit words into 8-bit words.
    • If the first 3 6-bit words (representing base64 characters Y, m, and V from our input string) are 011000 100110 010101, then the 8-bit conversion gives us 01100010 01100101 01......
    • We just borrow bits from the next word to complete each 8-bit word.
    • Any bits left over after the last complete 8-bit word are discarded.
  • Convert each of these 8-bit binary words into their integer value. So 01100010 gives us 98
  • And finally, convert each of these integer values to an ASCII character. For 98, we get b
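The regrouping steps above can be sketched as follows. The function name `six_bit_words_to_text` is made up for this illustration, and bits that do not fill a whole byte at the end are ignored, which is also what the full script further down does.

```python
def six_bit_words_to_text(words: list) -> str:
    """Join 6-bit words, cut the bit string into whole bytes, and map each to a character."""
    bits = "".join(words)
    usable = len(bits) - len(bits) % 8  # ignore bits that do not fill a whole byte
    byte_values = [int(bits[i:i + 8], 2) for i in range(0, usable, 8)]
    return "".join(chr(value) for value in byte_values)


# The 6-bit words for Y, m and V give 18 bits: two whole bytes plus 2 spare bits.
print(six_bit_words_to_text(["011000", "100110", "010101"]))  # be
```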

When we decode the entire string YmVydC1sZWFybnMtYmFzZTY0 we get bert-learns-base64.
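As a sanity check on the manual steps, Python's standard `base64` module gives the same answer:

```python
import base64

# b64decode returns bytes, so decode them as ASCII to get a str
decoded = base64.b64decode("YmVydC1sZWFybnMtYmFzZTY0").decode("ascii")
print(decoded)  # bert-learns-base64
```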



Here is the Python code to do the above. It’s not well written but it works for the cases tested.

def strip_padding(in_str: str) -> str:
    # "=" only ever appears as trailing padding, so keep what comes before the first one
    return in_str.split("=")[0]

def convert_base64_char_to_index_int(in_str: str) -> list:
    indexes = []

    for char in in_str:
        if char == "+":
            indexes.append(62)
        elif char == "/":
            indexes.append(63)
        elif "0" <= char <= "9":
            # this method of finding the int value of a base64 char is kinda interesting:
            # take the distance from '0' in ASCII, then shift to where digits start in base64
            indexes.append(ord(char) - ord('0') + 52)
        elif "A" <= char <= "Z":
            indexes.append(ord(char) - ord('A'))
        elif "a" <= char <= "z":
            indexes.append(ord(char) - ord('a') + 26)
        else:
            # anything else is not part of this base64 character set
            raise ValueError(f"not a base64 character: {char!r}")

    return indexes

def convert_int_list_to_binary_str(int_list: list) -> str:
    # "06b" formats each number as binary without the 0b prefix,
    # padded with leading 0s to 6 bits
    return "".join(format(num, "06b") for num in int_list)

def convert_binary_str_to_8_bit_binary(binary_str: str) -> list:
    # only whole bytes carry data; leftover bits at the end are filler and get dropped
    true_length = len(binary_str) - (len(binary_str) % 8)

    return [binary_str[start:start + 8] for start in range(0, true_length, 8)]

def convert_8_bits_to_int(bit_words: list) -> list:
    return [int(word, 2) for word in bit_words]

def convert_ints_to_ascii(int_list: list) -> str:
    # chr() maps each integer value to its character
    return "".join(chr(num) for num in int_list)

def main():
    # user input keeps going until empty string is passed
    while True:
        try:
            user_input = input()
            
            if user_input == "":
                return
            
            stripped = strip_padding(user_input)
            # print(stripped)
            stuff = convert_base64_char_to_index_int(stripped)
            # print(stuff)
            binary_string = convert_int_list_to_binary_str(stuff)
            # print(binary_string)
            binaries = convert_binary_str_to_8_bit_binary(binary_string)
            # print(binaries)
            ints = convert_8_bits_to_int(binaries)
            # print(ints)
            plain = convert_ints_to_ascii(ints)
            print(plain)
        except EOFError:
            return

if __name__ == "__main__":
    main()

Here are some other strings to decode

  • dHJ5aW5nLWl0LW15c2VsZg== -> trying-it-myself
  • bGVhcm4taW4tcHVibGlj -> learn-in-public
  • cnVubmluZy1weXRob24tMy44Ljk= -> running-python-3.8.9
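These examples also show when padding appears. A small sketch using the standard `base64` module encodes the three plain strings again and counts the = signs. The padding depends on the input length modulo 3.

```python
import base64

for plain in ["trying-it-myself", "learn-in-public", "running-python-3.8.9"]:
    encoded = base64.b64encode(plain.encode("ascii")).decode("ascii")
    padding = len(encoded) - len(encoded.rstrip("="))  # number of trailing "=" signs
    print(f"{plain} ({len(plain)} chars) -> {encoded} ({padding} padding)")

# trying-it-myself (16 chars) -> dHJ5aW5nLWl0LW15c2VsZg== (2 padding)
# learn-in-public (15 chars) -> bGVhcm4taW4tcHVibGlj (0 padding)
# running-python-3.8.9 (20 chars) -> cnVubmluZy1weXRob24tMy44Ljk= (1 padding)
```

An input length divisible by 3 needs no padding, one leftover byte needs two = signs, and two leftover bytes need one.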