A monoalphabetic cipher is really just a cereal box secret decoder ring, or the daily jumble in the newspaper. It's an encryption technique whereby every instance of a particular letter is replaced with some other letter throughout the message. The mapping from plaintext letters to ciphertext letters is the key for the cipher - anyone who has that key can encrypt and decrypt messages.
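To make this concrete, here is a toy sketch in Python (the key and the messages are made up for illustration; the actual assignment code later in this post is in julia):

```python
import string

# A toy substitution key: each plaintext letter maps to one ciphertext letter.
# (This particular key is made up for illustration.)
key = dict(zip(string.ascii_lowercase, "qwertyuiopasdfghjklzxcvbnm"))

def encrypt(message, key):
    # Replace every letter via the key; leave spaces and punctuation alone.
    return "".join(key.get(c, c) for c in message.lower())

def decrypt(message, key):
    # Decryption is the same lookup through the inverted mapping.
    inverse = {v: k for k, v in key.items()}
    return "".join(inverse.get(c, c) for c in message)

ciphertext = encrypt("attack at dawn", key)
print(ciphertext)                 # qzzqea qz rqvf
print(decrypt(ciphertext, key))   # attack at dawn
```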
There's only one problem - this encryption technique is very easy to break, especially with modern computers (or any computer at all, for that matter). The primary issue is that every language has a relatively fixed letter frequency, meaning certain letters are much more likely to occur than others in any given document. For example, in English 'e' is the most common letter (followed by 't', 'a', etc.). An attacker can exploit this fact when trying to break a monoalphabetic cipher. They can find the most common letters in the encrypted message, and match those up with the expected letter frequency in order to deduce the key.
It's also possible to go a step further: instead of just analyzing letter frequencies, an attacker can analyze "digram" (pairs of letters) and "trigram" (sets of three letters) frequencies. Just like single letters, certain groupings of letters, like 'th', will be much more likely to occur. This is probably the approach you use for solving the jumble in the newspaper without even knowing it: you look for certain patterns (words like "the", common letters like "e", or words starting with "th") and use those clues to recover the entire key.
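The counting step of the attack can be sketched in a few lines of Python (the sample string is made up, and a real attack needs far more ciphertext; the full julia analysis follows below):

```python
from collections import Counter

def letter_frequencies(text):
    # Count only alphabetic characters, case-insensitively.
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    # Normalize counts into frequencies that sum to 1.
    return {c: n / total for c, n in counts.items()}

# A short, made-up sample document.
sample = "the quick brown fox jumps over the lazy dog the end"
freqs = letter_frequencies(sample)
# Sort letters by frequency, most common first - in typical English text
# the top of this list is dominated by 'e', 't', 'a', and 'o'.
print(sorted(freqs, key=freqs.get, reverse=True)[:5])
```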
Last Fall, one of my ECE5560 homework assignments was to write a program for analyzing letter frequency in an encrypted message. I had just heard about a fledgling language called julia, and I had been looking for something to use it on - this project seemed like a perfect candidate. I also decided to take it a step beyond just measuring letter frequency, and write a program to completely decrypt the message. At first I thought this was something I could just tack on quickly to impress the professor, but it ended up being a bit more involved.
Analyzing Letter Frequency
First things first: we need to read in some data. I'll first read in the encrypted message, saved in cipher.txt, then I will read in a plaintext version of A Tale of Two Cities from Project Gutenberg. The latter will represent a typical English document, so that I can measure the letter frequency of plain English.
function read_file(filename)
    open(filename, "r") do fh
        return readall(fh);
    end
end
ciphertext = read_file("cipher.txt");
println("Ciphertext:");
println("-----------");
print(ciphertext);
println("");
println("Plaintext (A Tale of Two Cities):");
println("---------------------------------");
plaintext = read_file("tale_of_two_cities.txt");
print(plaintext[1:1000]);
Next we need to actually analyze the letter frequency of these two documents. I wrote the following get_frequency function, which iterates through a document and counts letters, digrams, and trigrams. It outputs three normalized arrays: a 26-element vector of letter frequencies, a 26x26 matrix of digram frequencies, and a 26x26x26 array of trigram frequencies.
function get_frequency(str)
    alpha_freq = zeros(Int, 26)
    digram_freq = zeros(Int, (26, 26))
    trigram_freq = zeros(Int, (26, 26, 26))
    alpha_count = 0;
    digram_count = 0;
    trigram_count = 0;
    char_queue = Array(Int, 0);
    for c in str
        # Map 'a'-'z' to 1-26; any non-letter resets the queue so digrams
        # and trigrams never span word boundaries or punctuation.
        c = lowercase(c) - 'a' + 1;
        if 1 <= c <= 26
            push!(char_queue, c)
        else
            empty!(char_queue)
            continue
        end
        if length(char_queue) >= 1
            alpha_freq[char_queue[end]] += 1;
            alpha_count += 1;
        end
        if length(char_queue) >= 2
            digram_freq[char_queue[end-1], char_queue[end]] += 1;
            digram_count += 1;
        end
        if length(char_queue) >= 3
            trigram_freq[char_queue[end-2], char_queue[end-1], char_queue[end]] += 1;
            # Drop the oldest letter so the queue never grows past three.
            shift!(char_queue)
            trigram_count += 1;
        end
    end
    # Normalize the raw counts into frequencies.
    alpha_freq = alpha_freq./alpha_count;
    digram_freq = digram_freq./digram_count;
    trigram_freq = trigram_freq./trigram_count;
    return (alpha_freq, digram_freq, trigram_freq)
end
plain_freq = get_frequency(plaintext);
cipher_freq = get_frequency(ciphertext);
The first thing I want to do is plot the results, to get a handle on what I'm dealing with. I'll use the excellent Gadfly plotting library to plot just the letter frequencies for each document.
using Gadfly;
function plot_frequency(frequency, title)
    letters = ['a':'z'];
    plot(x=1:26, y=frequency, Geom.bar, Scale.x_discrete(labels = i->letters[i]),
         Guide.xlabel("Letter"), Guide.ylabel("Frequency"), Guide.title(title))
end
plot_frequency(plain_freq[1], "Plaintext Letter Frequencies")
plot_frequency(cipher_freq[1], "Ciphertext Letter Frequencies")
The plaintext frequencies look pretty good, closely matching what can be found on Wikipedia. Letters like 'e', 't', 'a', and 'o' are in fact the most common letters in A Tale of Two Cities. The ciphertext letter frequencies look substantially different, but it is clear that 'x', 't', and 'j' are the most common letters, so these probably correspond to 'e', 't', and 'a'. The best way to find out is to write a function for extracting a key from this mess.
Generating a Key
I grappled for quite a while with the problem of how to distill letter, digram, and trigram frequencies down to a key. I got in over my head trying to use Bayesian statistics and conditional probabilities, but eventually I settled on the following simple approach.
The first step is to build what I call the "monoalphabet matrix" - a 26x26 matrix where one dimension represents ciphertext letters and the other represents plaintext letters. We generate this matrix from the letter, digram, and trigram information, but I will start by explaining how the letters are used. We sort both letter frequency arrays so that the most common letters are first, then we iterate through both arrays simultaneously. For each pair of ciphertext and plaintext letters we increment the corresponding element in the monoalphabet matrix, weighting each increment by the inverse of its frequency rank so that matches among the most common letters count for more. So in this case we will start by incrementing the element at ('x', 'e'), then ('t', 't'), ('j', 'a'), etc.
We follow a similar procedure for digrams and trigrams, but in those cases all two or three letter pairs are incremented.
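To make that concrete, here is a minimal Python sketch of the letter-only part of the scheme (the function name and the toy frequency vectors are mine for illustration; the full julia version, which also handles digrams and trigrams, follows):

```python
def build_matrix(plain_freq, cipher_freq):
    # plain_freq, cipher_freq: lists of 26 frequencies, index 0 = 'a'.
    matrix = [[0.0] * 26 for _ in range(26)]
    # Rank each alphabet by frequency, most common letter first.
    plain_rank = sorted(range(26), key=lambda i: plain_freq[i], reverse=True)
    cipher_rank = sorted(range(26), key=lambda i: cipher_freq[i], reverse=True)
    for rank, (p, c) in enumerate(zip(plain_rank, cipher_rank), start=1):
        # Letters with matching rank get a vote, weighted by 1/rank so
        # high-frequency (more reliable) pairings count for more.
        matrix[p][c] += 1.0 / rank
    return matrix

# Toy input: pretend 'e' dominates the plaintext and 'x' the ciphertext.
plain = [0.0] * 26; plain[ord('e') - ord('a')] = 1.0
cipher = [0.0] * 26; cipher[ord('x') - ord('a')] = 1.0
matrix = build_matrix(plain, cipher)
# The ('e', 'x') cell now holds the strongest vote.
```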
function build_monoalphabet_matrix(plain_freq, cipher_freq)
    rtrn = zeros(Float64, 26, 26);
    # Each vote is weighted by 1/rank, so high-frequency matches count more.
    increment_element(idx, p, c) = rtrn[p,c] += 1.0/idx;
    flatten(m) = reshape(m, length(m));
    # Letters: pair up the two frequency-sorted alphabets.
    map(increment_element,
        1:length(plain_freq[1]),
        sortperm(plain_freq[1], rev=true),
        sortperm(cipher_freq[1], rev=true));
    # Digrams: increment both letter pairs of each matched digram.
    increment_element_digram(idx, p, c) = map((x, y)->increment_element(idx, x, y),
                                              ind2sub(size(plain_freq[2]), p),
                                              ind2sub(size(cipher_freq[2]), c));
    map(increment_element_digram,
        1:length(plain_freq[2]),
        sortperm(flatten(plain_freq[2]), rev=true),
        sortperm(flatten(cipher_freq[2]), rev=true));
    # Trigrams: same again, for all three letter pairs.
    increment_element_trigram(idx, p, c) = map((x, y)->increment_element(idx, x, y),
                                               ind2sub(size(plain_freq[3]), p),
                                               ind2sub(size(cipher_freq[3]), c));
    map(increment_element_trigram,
        1:length(plain_freq[3]),
        sortperm(flatten(plain_freq[3]), rev=true),
        sortperm(flatten(cipher_freq[3]), rev=true));
    return rtrn
end
ma_matrix = build_monoalphabet_matrix(plain_freq, cipher_freq);
letters = ['a':'z'];
spy(ma_matrix, Scale.x_discrete(labels = i->letters[i]), Scale.y_discrete(labels = i->letters[i]), Guide.xlabel("Ciphertext Letter"), Guide.ylabel("Plaintext Letter"), Guide.title("Monoalphabet Matrix"))
Again, we use Gadfly to look at the data, and we can see a couple of clear peaks. For example, a plaintext 'h' is definitely a ciphertext 'p', and 'e' does appear to correspond to 'x'. Finally, we can search this matrix for peaks in order to identify the key.
function key_from_matrix(mat)
    rtrn = Dict();
    for idx in 1:26
        # Find the largest remaining peak; rows of the matrix are plaintext
        # letters, columns are ciphertext letters.
        plain_idx, cipher_idx = ind2sub(size(mat), indmax(mat));
        # Remember this mapping from ciphertext letter to plaintext letter
        rtrn['a' + cipher_idx - 1] = 'a' + plain_idx - 1;
        # Zero out this row and column so we don't find it again
        mat[plain_idx, :] = zeros(1, 26);
        mat[:, cipher_idx] = zeros(26, 1);
    end
    return rtrn
end
ma_key = key_from_matrix(ma_matrix)
Initial Decryption
Great, we have a key - let's decrypt this puppy! For every character in the ciphertext, use the key to find the associated plaintext letter, then join the resulting array together into a string.
decrypt = join([get!(ma_key, c, c) for c in ciphertext]);
print(decrypt)
Hmmm, well, that doesn't look so great. It turns out that letter frequency analysis can only get you so far, particularly for a short message. Only the highest-frequency letters like 'e', 't', and 'a' will be properly decrypted. We could probably just stop here and use this key as a starting point for deducing the entire true key manually.
But take a second to think about how you would approach this problem when you did it manually. Personally, I would probably search through the initial decrypt until I found a word that looks familiar - for example 'ezwellent' should clearly be 'excellent' - and correct the key so that it decrypts properly. Then I would repeat this process until there are no more mistaken words.
Well why bother doing this manually? We can automate this process using a computer and a dictionary file.
Correcting the Key
First let's load in the dictionary file...
dict_words = filter(x->!contains(x, "\'"), [chomp(lowercase(word)) for word in readlines(open("words"))]);
Then we will define a function, guess_word, which takes a garbled word and searches for the closest match in the dictionary.
function distance(str1, str2)
    # Words of different lengths can never match under a monoalphabetic cipher.
    if length(str1) != length(str2)
        return Inf
    end
    # Count the positions where the two strings differ (Hamming distance).
    return sum(map((x,y)->x != y, str1, str2));
end

function guess_word(garbled)
    dist, ind = findmin([distance(word, garbled) for word in dict_words]);
    return (dict_words[ind], dist);
end
I'll be honest, these are the functions that made me fall in love with julia. With two <5 line functions I'm doing a fuzzy dictionary search. I'm sure anyone can show me similar examples in other languages, but what I really liked about julia is that it felt like the language was encouraging me to write the functions this way. Since iterating through strings by index is painful, the best option is to use functional techniques to compute distance, and julia's great standard library makes it easy to use a functional approach for guess_word as well.
Alright, enough gushing - does guess_word work? Let's try it out:
println(guess_word("ezwellent")[1]) # excellent
println(guess_word("awejime")[1]) # awesome
Great! But can we break it? Probably.
println(guess_word("ezwettenf")[1]) # excellent
Obviously if the word is too garbled, we can't expect it to work. 'ezwettenf' would be hard for me to correct manually...
Also, it's important to note that this function is designed specifically for the monoalphabetic cipher case, where we always have the same number of characters - it's just that some of them may be incorrect. We would need something more intelligent to handle character insertions or deletions.
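The standard tool for that more general case is edit (Levenshtein) distance; a minimal sketch in Python, purely for illustration:

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance: the minimum number
    # of insertions, deletions, and substitutions turning a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(edit_distance("ezwellent", "excellent"))  # 2 (two substitutions)
```

Unlike the Hamming-style distance above, this would also tolerate a dropped or doubled letter, at the cost of a slower search.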
Now we can start to use guess_word to correct the decryption. We'll start by defining some helper functions for swapping two characters in the key. When we correct 'ezwellent', for example, we will use fix_key to swap 'z' and 'x' in the key, and then 'w' and 'c'.
find_in_dict(d, element) = collect(keys(d))[findfirst(collect(values(d)), element)]
fix_key(good_char, bad_char) = begin
    if good_char == bad_char
        return;
    end
    # Find the ciphertext letters that currently decrypt to each character...
    good_key = find_in_dict(ma_key, good_char);
    bad_key = find_in_dict(ma_key, bad_char);
    # ...and swap their mappings.
    ma_key[good_key] = bad_char;
    ma_key[bad_key] = good_char;
end
Now we split the decrypt up into words and iterate through each word, correcting the key as we go. We start by sorting by length, since longer words are easier to correct with this approach. Once 30 words in a row match a dictionary word perfectly, we consider the key complete.
decrypt_words = [rstrip(word, [',', '.', ';']) for word in split(decrypt)];
length_order = sortperm([length(word) for word in decrypt_words], rev=true);
match_counter = 0
for word_idx in length_order
    if contains(decrypt_words[word_idx], "\'")
        continue
    end
    @printf("Correcting \'%s\'... ", decrypt_words[word_idx]);
    # guess a word
    corrected_word, dist = guess_word(decrypt_words[word_idx]);
    println(corrected_word);
    # use differences to correct ma_key
    map(fix_key, corrected_word, decrypt_words[word_idx]);
    # re-decrypt
    decrypt = join([get!(ma_key, c, c) for c in ciphertext]);
    decrypt_words = [rstrip(word, [',', '.', ';']) for word in split(decrypt)];
    if dist == 0
        match_counter += 1;
        if match_counter > 30
            break;
        end
    else
        match_counter = 0;
    end
end
Uh oh, this doesn't look good. By the end we're correcting words like 'gwmyiwgc' and 'brieimc' - it looks like we've actually made things worse! Looking through the list of corrections from the beginning, everything seems ok until we get to 'belteshaxxar', which gets corrected to 'netherlander'. Clearly we've encountered a proper noun that is not actually in the dictionary! We'll need to add logic to handle this edge case.
Here is a revised version of the search. Now, if we have to correct more than one third of the word's characters, we won't consider it a valid correction, and we'll skip it.
decrypt_words = [rstrip(word, [',', '.', ';']) for word in split(decrypt)];
length_order = sortperm([length(word) for word in decrypt_words], rev=true);
match_counter = 0
for word_idx in length_order
    if contains(decrypt_words[word_idx], "\'")
        continue
    end
    @printf("Correcting \'%s\'... ", decrypt_words[word_idx]);
    # guess a word
    corrected_word, dist = guess_word(decrypt_words[word_idx]);
    println(corrected_word);
    if dist > length(corrected_word)/3
        println("Corrected word is not close enough! Skipping")
        continue
    end
    # use differences to correct ma_key
    map(fix_key, corrected_word, decrypt_words[word_idx]);
    # re-decrypt
    decrypt = join([get!(ma_key, c, c) for c in ciphertext]);
    decrypt_words = [rstrip(word, [',', '.', ';']) for word in split(decrypt)];
    if dist == 0
        match_counter += 1;
        if match_counter > 30
            break;
        end
    else
        match_counter = 0;
    end
end
print(decrypt);
A successful decryption! You can see that this time we ignored words like 'chaldeans' and 'forasmuch', which were correct, but just not available in our dictionary.