Created: 2014-11-07 22:12:02
Last modified: 2014-11-11 18:00:44
There's an authorization code for some Thyrin Labs information here, along with someone's favorite song. But it's been encrypted! Find the authorization code.
You may want to look at what the relative frequencies of letters in English text are.
This is a simple substitution cipher. It is susceptible to frequency analysis. There are numerous tools online for this. Below, I explain how to write your own in Python.
A simple substitution cipher maps every plaintext character to a different character in the ciphertext. You can do a 'frequency analysis' on this kind of cipher. The most common letter in English is 'e', and the most common letter in the ciphertext is 'h'. Therefore 'h' probably maps to 'e'.
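As a minimal sketch of that first step, the snippet below counts the letters in a ciphertext and reports the most common one. The sample string and the helper name most_common_letter are made up for illustration; they are not part of the challenge data.

```python
from collections import Counter

def most_common_letter(text):
    """Return the most frequent letter in text and its share of all letters."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = Counter(letters)
    letter, count = counts.most_common(1)[0]
    return letter, count / float(len(letters))

# Hypothetical ciphertext: "the tree" with t -> r, h -> i, e -> h, r -> w
sample = "rih rwhh"
letter, share = most_common_letter(sample)
print(letter, round(share * 100, 1))  # prints: h 42.9
```

On a text this short the ranking means very little; the counts only become trustworthy on a few hundred letters or more.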
We can actually do a lot more than this. I will write a script in Python that builds two lists: the first is a sorted list of n-grams, the second is a sorted list of n-letter words. An n-gram is a string of n consecutive letters within a word. For example, the most common bigrams are 'th', then 'he', 'in', 'er', etc. The most common trigrams are 'the', 'and', 'tha', 'ent', etc. If I find the most common n-grams in the ciphertext, I can compare these to the most common n-grams in English. The list of n-letter words is a lot more straightforward. I can guess common one-letter words like 'I' and 'a', then common two-letter words like 'an' or 'it'. We also want to go through the list of words and find the most common first and last letters.
To find the n-grams, we can loop through i from zero to the length of the string minus n. Then we slice the string from i to i plus n. If the slice contains a space or newline, ignore it. Otherwise add it to a counter. The Python standard library provides a Counter class, which is very useful for doing these frequency analyses. To find n-letter words, we can use a regex to split the text into words and add each word to the appropriate counter.
import re
from collections import Counter

from data import ctext

# Split on whitespace and drop empty strings. This is a list (not an
# iterator) so it can be looped over more than once.
ctext_words = [w for w in re.split(r'\s+', ctext) if w]

max_gram = 15  # up to 15-gram, up to 15-letter words
n_grams = []
n_words = []
for i in range(1, max_gram + 1):
    # for i = 2, do bigrams and 2-letter words
    # for i = 3, do trigrams and 3-letter words
    a = Counter()
    for j in range(len(ctext) - i + 1):  # + 1 so the final gram is counted
        gram = ctext[j:j + i]
        if ' ' in gram or '\n' in gram:
            continue  # skip grams separated by a space or newline
        a[gram] += 1
    n_grams.append(a)

    b = Counter()
    for word in ctext_words:
        if len(word) != i:
            continue
        b[word] += 1
    n_words.append(b)

starting_letters = Counter()
ending_letters = Counter()
for word in ctext_words:
    starting_letters[word[0]] += 1
    ending_letters[word[-1]] += 1
To use this script, just let ctext be the encrypted text you have and run it. Now we need to display the results in a meaningful way.
code = {}  # fill this in along the way

def decrypt(word):
    # Replace each solved letter; leave unsolved letters unchanged.
    return ''.join(code.get(letter, letter) for letter in word)

def table(counter, n=10):
    # Print the n most common entries with their share of the total in percent.
    total = sum(counter.values())
    for word, freq in counter.most_common(n):
        print('{0: <10.1f}{1}'.format(float(freq) / total * 100, decrypt(word)))
Now just type table(starting_letters) to get a table of the most common starting letters and their frequency as a percentage, or table(n_words[2]) to get the most common 3-letter words (the lists are zero-indexed, so n_words[0] holds the one-letter words).
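One way to turn such a table into first guesses is to line up the most common cipher letters with the most common English letters. The sketch below does that for the top six. The ranking "etaoin" is an approximation that varies a little between corpora, and the function name first_guesses and the toy input are made up for illustration.

```python
from collections import Counter

# Approximate ranking of the six most common English letters; the exact
# order differs slightly between corpora, so treat it as a rough guide.
ENGLISH_TOP = "etaoin"

def first_guesses(text, top=ENGLISH_TOP):
    """Pair the most common cipher letters with the most common English letters."""
    counts = Counter(ch for ch in text if ch.isalpha())
    ranked = [letter for letter, _ in counts.most_common(len(top))]
    return list(zip(ranked, top.upper()))

# Hypothetical toy ciphertext, not the challenge text
for cipher_letter, guess in first_guesses("hhhhh rrrr eee cc"):
    print(cipher_letter, '->', guess, '?')
```

Only the top one or two pairings tend to be reliable on real text, so everything past the first row is a hypothesis to check against bigrams and short words.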
These websites may be helpful:
- http://www.cryptograms.org/letter-frequencies.php
- http://www3.nd.edu/~busiforc/handouts/cryptography/cryptography%20hints.html
- http://norvig.com/mayzner.html
- http://www.letterfrequency.org/
From here, we make multiple passes. Here are the most useful results. I made educated guesses on what the letters should be.
Full text: first_pass.txt
Single letters
12.3 h -> E
9.8 c
9.0 r
8.4 k
We have inconclusive evidence on all of the other letters.
Letter bigrams
6.9 ri -> TH
5.3 ih -> HE
Letter trigrams
6.9 rih -> THE
All letter tetragrams
2.6 hshw -> EVER
2.0 bqri
1.7 khsh -> WERE? (conflicts with s -> V above; it turns out to be NEVE, from NEVER)
One-letter words
70.0 e
30.0 q
I know these map to either 'a' or 'I'.
Four-letter words
8.8 bqri -> WITH (consistent with q -> I)
Therefore, we fill in code with the appropriate letters. I have decided to make the deciphered text uppercase, so I can tell what is already deciphered apart from what is not.
code = {'h':'E', 'r':'T', 'i': 'H', 'b':'W', 'e':'A', 'q':'I'}
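To see what this mixed-case output looks like, here is a small self-contained sketch that applies the first-pass mapping to a few cipher words. It passes the mapping as an argument instead of using a global, which is only a convenience for the example.

```python
code = {'h': 'E', 'r': 'T', 'i': 'H', 'b': 'W', 'e': 'A', 'q': 'I'}

def decrypt(word, code):
    """Replace known cipher letters; leave unknown ones as they are."""
    return ''.join(code.get(letter, letter) for letter in word)

for word in ['rih', 'bqri', 'qk', 'ekt']:
    print(word, '->', decrypt(word, code))
# rih -> THE
# bqri -> WITH
# qk -> Ik
# ekt -> Akt
```

Uppercase letters are solved, lowercase letters are still ciphertext, so a half-solved word like Akt shows exactly which letter is left to guess.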
Full text: second_pass.txt
Two-letter words
20.5 co -> OF
11.4 Tc -> TO
9.1 WE
9.1 Ia
6.8 ac
6.8 Ik -> IN|IS|IF -> IN?
Three-letter words
31.5 THE
18.5 gcj
9.3 Auu
9.3 Akt -> ANt|ASt|AFt -> AND
6.5 fAk
5.6 AwE
2.8 cWk -> OWN|OWS|OWF -> OWN
You can see how the encrypted and decrypted letters mix together. The decrypted ones are the ones we solved last pass. The decrypted letters are capital and the encrypted letters are lowercase.
Apparently the text qk showed up a lot. We know that q maps to I from the first pass, so it shows up as Ik (I is decrypted, k is encrypted). This could be IN, IS, or IF. This means k decrypts to N or S or F. Trying that out on the three-letter words, AS_ and AF_ give no word anywhere near as frequent as AND, so k most likely decrypts to N (which also gives t -> D).
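The elimination above can be sketched in code: try each candidate for k and see which partially decrypted words can still become real words. The word list here is tiny and hand-picked purely for illustration, and the helper name fits is made up; a real run would load a proper dictionary, which would contain rarer words such as ASK or AFT and so would need frequencies to break ties.

```python
# Tiny hand-picked word list; a real run would use a proper dictionary file.
COMMON_WORDS = {'IN', 'IS', 'IF', 'AND', 'OWN', 'THE', 'TO', 'OF'}

def fits(pattern, cipher_letter, plain_letter, words):
    """True if substituting plain_letter for cipher_letter can yield a known word.

    Lowercase letters left in the pattern are still unknown and match anything.
    """
    filled = pattern.replace(cipher_letter, plain_letter)
    for word in words:
        if len(word) != len(filled):
            continue
        if all(p.islower() or p == w for p, w in zip(filled, word)):
            return True
    return False

patterns = ['Ik', 'Akt', 'cWk']  # partially decrypted words from the tables
for candidate in 'NSF':
    ok = [p for p in patterns if fits(p, 'k', candidate, COMMON_WORDS)]
    print(candidate, ok)
```

With this word list only N satisfies all three patterns, while S and F explain Ik and nothing else.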
Five-letter words
14.6 kEsEw -> NEVER
9.8 gcjuu
9.8 THIkd -> THINK
9.8 EAwTH -> EARTH
9.8 lAIkT
4.9 aTIuu
2.4 uEAwk
2.4 gcjsE
2.4 EsEwg -> EVERY
Here I have begun guessing whole words. I have gotten other letters the same way (you can see the full text here). Now we have:
code = {'h':'E', 'r':'T', 'i': 'H', 'b':'W', 'e':'A', 'q':'I', 'k':'N', 'c':'O',
'd':'K', 's':'V', 'w':'R'}
Full text: third_pass.txt
Thirteen-letter words
100.0 AjTHORIxATION -> AuTHORIzATION
Eight-letter words
12.5 zRINNINz -> GRINNING
12.5 agfAnORE -> S_CA_ORE
12.5 IzNORANT -> IGNORANT
12.5 WHATEVER
12.5 vROTHERa -> BROTHERS
12.5 ajNaWEET -> SUNSWEET
12.5 fREATjRE -> CREATURE
12.5 aTRANzER -> STRANGER
Seven-letter words
20.0 aKINNEt -> SKINNED
20.0 vERRIEa -> BERRIES
20.0 WHETHER
20.0 oRIENta -> _RIEN_S
20.0 zRINNEt -> GRINNED?
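Guessing long words like these is pattern matching: solved letters must match exactly, and a repeated unsolved letter must stand for the same plaintext letter each time. A rough sketch of that idea with a regex is below. The helper names pattern_to_regex and candidates and the short word list are assumptions for the example, and it does not check that an unsolved letter avoids plaintext letters that are already taken.

```python
import re

def pattern_to_regex(pattern):
    """Turn a partially decrypted word into a compiled regex.

    Uppercase letters are already solved and must match exactly. Each
    lowercase (unsolved) letter may be any letter, but repeats of the same
    cipher letter must stand for the same plaintext letter.
    """
    parts = []
    seen = set()
    for ch in pattern:
        if ch.isupper():
            parts.append(ch)
        elif ch in seen:
            parts.append('(?P=%s)' % ch)       # backreference to the first use
        else:
            seen.add(ch)
            parts.append('(?P<%s>[A-Z])' % ch)  # named group for this letter
    return re.compile('^' + ''.join(parts) + '$')

def candidates(pattern, words):
    """Return the words from the list that are compatible with the pattern."""
    regex = pattern_to_regex(pattern)
    return [w for w in words if regex.match(w)]

# Hypothetical word list; a real run would use a dictionary file.
WORDS = ['GRINNING', 'BRINGING', 'IGNORANT', 'STRANGER', 'STRANDED']
print(candidates('zRINNINz', WORDS))
print(candidates('aTRANzER', WORDS))
```

When several patterns agree on the same letter (here z -> G in both GRINNING and STRANGER), the guess is much safer to add to code.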
Now the whole thing comes tumbling down. I got a lot more letters this way. You can see the full text here. Now our code is:
code = {'h':'E', 'r':'T', 'i': 'H', 'b':'W', 'e':'A', 'q':'I', 'k':'N', 'c':'O',
        'd':'K', 's':'V', 'w':'R', 'j':'U', 'x':'Z', 'z':'G', 'v':'B', 'a':'S',
        'f':'C', 'u':'L', 'l':'P', 't':'D', 'o':'F'}
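Since a simple substitution cipher is one-to-one, a quick sanity check at this stage is to confirm that no two cipher letters map to the same plaintext letter, and to list which cipher letters are still unsolved. The helper check_mapping below is a made-up name for this sketch; the mapping is the one listed above.

```python
import string

code = {'h': 'E', 'r': 'T', 'i': 'H', 'b': 'W', 'e': 'A', 'q': 'I', 'k': 'N',
        'c': 'O', 'd': 'K', 's': 'V', 'w': 'R', 'j': 'U', 'x': 'Z', 'z': 'G',
        'v': 'B', 'a': 'S', 'f': 'C', 'u': 'L', 'l': 'P', 't': 'D', 'o': 'F'}

def check_mapping(code):
    """Return (plaintext letters used more than once, unmapped cipher letters)."""
    seen = {}
    duplicates = {}
    for cipher_letter, plain_letter in code.items():
        if plain_letter in seen:
            # Record every cipher letter that claims this plaintext letter.
            duplicates.setdefault(plain_letter, [seen[plain_letter]]).append(cipher_letter)
        else:
            seen[plain_letter] = cipher_letter
    unmapped = sorted(set(string.ascii_lowercase) - set(code))
    return duplicates, unmapped

duplicates, unmapped = check_mapping(code)
print(duplicates)  # prints: {}
print(unmapped)    # prints: ['g', 'm', 'n', 'p', 'y']
```

An empty duplicates result means the guesses so far are at least consistent with each other; the unmapped list shows what is left for the later passes.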
I started guessing single letters and bigrams. Then I guessed words. Now I am ready to guess sentences and paragraphs. You can see the fourth_pass.txt and the final decrypted message in fifth_pass.txt. For these, I am using:
print(decrypt(ctext))
Now we've got it! Lots of people just used an online tool. Those work, but for educational purposes, I want to show you how to do it yourself so you will know how the online tools work. Here is the full script and all of my work:
- subst.py has the bulk of this guide
- data.py has a copy of the encrypted words
- first_pass.txt was individual letters and bigrams
- second_pass.txt was bigrams and words
- third_pass.txt was words
- fourth_pass.txt was sentences
- fifth_pass.txt was the full decrypted text
WITHALLTHECOLORSOFTHEWIND