Vigenere Cipher System

Perhaps the "cutest" aspect of the Vigenere cipher is that it was invented by Giovan Batista Belaso in 1553 based on a table created by Johannes Trithemius in 1508. Vigenere created his own cipher later (which is similar but different, and considered stronger), in the 1580's, but was assigned the work cipher to his name about 300 years later. The attribution is so strong that calling it a Belaso cipher would probably just warrant blank stares, and besides...it's all just a name, anyhow.

What the name describes is a system of three parts. First you have a given alphabet which is generally of a "numerical" aspect (in that its letters have a given place inside the alphabet and can be thought of as both letters and markers for the value of that place (in other words, A = 1, B = 2, C = 3...etc)), you have an indefinitely long "plain text" composed of this alphabet, and you have a indefinitely long "key text" also composed of letters of this alphabet. The key text has historically been a word or phrase fairly easy to remember, though it can also be a length of "random" characters or some combination. The basic algorithm is to shift the characters of the plain text by a cycle of the characters given by the key text, in which the place of the key text letters is also the number of shifts to apply to the plain text. This solved as a modulus against the maximum number of places in the alphabet.

An example will make this all better. In the paragraph below, a portion of the plain text "I am going to say that I have a fever and can't make it." is processed in the standard english alphbet by the key text of EXCUSES (assuming I haven't made too many errors). Since "EXCUSES" isn't enough to cover the entirety of the word, it is cycled over the "top" of it. It is 7 characters long and so the 8th character of the plaintext (ignoring spaces) is back to the start of the key text (as is the 15th plain text letter, the 22nd, the 29th, and so on until the end of the plain text is reached). If the plain text length is not a multiple of the key text length, you will have some characters left over in the key text, and this is ok.

cipher text EXCUSESEXCU... plain text IAMGOINGTOS... cipher text MXOAGMFKQQM...

While their are tables and charts to help (one of which is available to view in the Wikipedia article on Tabula Recta, I find it makes much more sense to think of it in terms of modulur addition. In Roman Alphabets, letters can be thought of as having a value of 0..25 with A = 0 and Z = 25 (a slight shift from thinking of A = 1). If you think of the key text and the plain text in these terms, you can "add" them together. In the above example, E from excuses and I from "I am..." becomes 4 and 8, respectively, and adds up to 12, which is the equivalent of M. If two letters added up to something greater than 25, you can subtract 26 from the total (modulo 26, in other words) and get the new letter. The fourth letters of key = U(20) and plain = G(6) end up with a result of 26 which is sent back to 0 (26-26=0) and therefore is an A.

In the above example, we see both the strength and the weakness of the system at the same time. The keyword can be short enough, and plain enough, to be readily remembered. It would not be a stretch of the imagination to remember dozens of that length and use them at some pre-determined system. Also, the act of sliding the plain text by the key text does help to obscure the plain text more than a single shift cipher (i.e. Caesar Cipher) or more than a simple monoalphabetic substitution. The pair of Qs in the cipher text, for instance, might lead to the false assumption that the starting letters were also a pair. Whatever time chasing that red herring is warranted might be the difference between the cipher working long enough to do its job and it failing early.

The weakness is that in the short example above, the letter M shows up three times in the cipher text. If the person trying to break the cipher assumes either the plain text or key text has an "E" (a natural assumption) in that place, then they would be right as far as the key is concerned. While this might not cause them to break it wide open, it is a break in the cover and every little break helps to weaken the whole. Considering there are only something like 11 characters in the above example and there are already two-three possible holes in it, this is not a good sign.

How Secure Is it?

It should generally be known that it is not. This is a blanket statment, and as all blanket statements go, it has problems. But, given the sort of key words that people tend to use and the sort of plain text they tend to encrypt, the security is low at best. It might be enough, in a given situation. If a school age kid encrypts a message about where he can be found at lunch, even if he uses obvious keywords like "lunch", there is quite likely no one who will be around to decrypt it in time to find him before lunch, unless they know the keyword. There are better real word examples than this, and some of great importance.

Past the scope of this page itself, I have done some statistical analysis on Vigenere schemes. More of the discussion will be carried out there. You can read about the results of my test, here.

But the general outcome can be expressed by saying that a Vigenere-type schemes' security is most readily influenced by the relative length between the key and the plain text, greatly influenced by how much the key and the plain text fit "real world statistics", and also somewhat influenced by the size of the alphabet. Each of these I will look at more in the next section.

Good and Bad Obfuscation

As said before, this is a tends-to-be insecure system. There have been many ways, some really effective, most only kind of so, to make the cipher more obscure. The first general class of "bad" obfuscations are the multiple "pass" methods. Many assume, at first glance, that if one pass through with a key word is good, multiple pass throughs (with different key words) are better. This, though, can very well warrant less security than its worth. Like those schemes that try and subtract key words, the multiple pass system can be thought of as merely creating a new, possibly really hard to read, keyword. It is most obvious in using two key words of the same length. If two 4-letter keywords (GOSH and DARN) are used, then a wee bit of common sense will show you that you could have equally used whatever 4-letter word results from passing them through each other (JOJU) and so the only real increase in security is that the new word is not an English ones, which isn't much security as all and might buy only a couple of hours. In the case of subtraction, the best example is the key letter "N", in which whether you subtract it or add it, you come to the same cipher letter. All key words, if subtracted, can be thought of as adding a complimentary key word. This again might warrant a few extra hours of protection against active breaking, but all-in-all is substituting one key word for another.

You can say that all multiple passes (and subtraction) do is generate a key word/phrase/stream not readily linkable to the originals. One way, though, this can be used to your advantage would be to use a handful of keywords, generally of prime number lengths (or any, really, besides the same). The resultant key word (key-prime, as it were) will be of a length equal to the least common multiple of all the words involve. In the case of key words, which can be everyday words, that are 3, 5, 7 and 11 letters in length, the resultant key-prime is 1155 characters long. Even just the 3 and 5 and 7 letter words, which any school kid can recall a new triple word set once a week, will result in a 105 character key-prime.

Which brings me to the next point. Every study I have done on the subject has led me to the fact that the number one determinant of the ease to break a Vigenere cipher is the ratio of the length of the key and the length of the plain text. Not only a single given plain text, but all plain text that are passed through the same key. More is in the article I linked above. Keep your messages short as possible, and your keystreams as long as possible. This is balanced by larger keystreams needing memory aids to be recalled, if not needing to exist at both sides of the transmission altogether. It is easy to recall a relatively long sentence as a keystream, but not so easy to remember a 1000-character long key of random parts. A balance will be needed, or some alternate path used. If you put keys at some agreed upon URL or e-mail them from a different account to a different account, for instance, then a very long key might be share among all who need it. Of course, if any gets a copy, then it is all for naught.

This is the advantage of multipass systems, but they only work at maximum when words and phrases of prime number lengths are used, and it seems to be best when the words and phrases share as few common letters as possible. At any rate, recalling the 11-character "A Child of Men" and the 13-String "Botany Posters" (both of which are fairly easy to recall), and doing a two-pass system ends with something like a 143-character keystream. Not only does this increase the amount of plain text that can be processed for the ratio gets too high, but also keeps the key phrases from being readily apparent as plain text.

Another version of what I would consider good obfuscation is to avoid, to what extent possible, real word give aways. Not only is the word "the" used a lot, it contains two letters -- e and t -- that are at the top of the letter frequency list. Some methods of breaking a code involve doing nothing more than guessing that every trigraph (three letters) that repeats is the word "the" (or the word "and") and using that as a starting scheme.

In other words, don't use "the". Or, use it sparingly. Most times, articles can be dropped off with little to no loss in meaning. "I took the book to the library" can just as easily be "took book to library". That sentence gets by with a much small percentage of "give aways". Another solution is to generate short "code phrases" (for example, use "dha" or "skh" or "fdd" instead of "the" for "the" or "an" or "and") that can get rid of give aways. The extreme version would be to generate several five-seven letter codes that can cover a wide range of common words but look like enciphered text upon solving. Articles and common prepositions should be minimized in both plain and key texts.

Secondly, in this mode, words can be misspelled, letters can be left out (i.e. "go to the beach" can be "go ta bech", if your friends will understand this), puns can be used, or phonetics applied. A lot of people can make sense of weird speech as long as they know for sure the right keyword was used, but it will be a lot harder to easily solve a keyword if words look like they make no sense in part. Lastly, in a similar mode, be wary of those blocks of texts we write from habit. Letters should never begin with "dear" or any other standard greeting. They should not end with any standard valediction. "Fluff" words like "although" and "maybe", the sort of things added between sentences should be as minimized as the words above.

Transpostion can also obscure text, and in the case of Vigenere scheme ciphers has two basic ways of doing this. Since the point of a Vigenere is to add key words in order to plain text in order then taking things back out of order can really mess up the process, or slow it down (and in this game, slowing it down is better than just about anything you can hope for). There are hundreds of relative easy tranposition schemes. You can write the words in a spiral, in a zig-zag, or in some sort of rectangle and then read them off in a different way. You can write them into columns, and then move the columns around. You can put odd number characters on one line and evens on another. You could even go for backwards writing or reverse words in blocks of five or something. None of these are unbreakable, but they can make it hard enough to break to make it not worth it.

Steganography and other encryption schemes can also be combined. There are various methods of encryption (ranging from something like a simple monoalphabetic cipher to a more complex matrix system) that can be applied on top, but stop and ask yourself if it is really making the test harder, or if it is merely complicating the process? Steganography, the act of hiding messages, won't make the Vigenere cipher harder to crack, but it might make it harder to find. There are numerous ways to embed short strings of text into files, or to put text up (briely) on some pre-arranged message board online. If someone does not realize there is text to crack, they won't crack it.

In a world of easy PGP, what's the point?

The main point I have for sticking with and loving Vigenere systems, despite their weaknesses, is they just strike me as some of the most fun. They are almost puzzle like, they have a certain riddle quality to them. GPG can encrypt a large file, with complicated formatting, and then rebuild it at the end and no one without the key combination can touch it. It is relatively unbreakable, compared to many old schemes, and is built into good operating systems or close enough to it that it requires minimal work to make it complete. Compared to that, Vigenere doesn't test out so well, I admit. But, for something that can be done in class, or even in your head, and can be transported by scratching a message out in sidewalk chalk, well, in some ways the old ways have a bit of life to them.

Besides, how important are you to need more encryption than this, heh.

The Real Vigenere Cipher

How do you do the Vigenere system that he actually described? Like this. You start with the keyword. Except, rather than repeat it, you then take the original plain text message and append it to the end of the keyword, so the message (with the offset of the key length) begins ciphering itself. This avoids the pesky repetition problems.

But it has a few problems of its own. For instance, in a longer cipher text, you will occasionally have repeated words ("the" for instance) ciphering themselves, or portions. THE + THE = NOG. If "nog" ever shows up, it could be a case of the plaintext "the" ciphering a plain text "the". As always, small breaks are the beginnings of longer breaks.

Caesar and Vignere Ciphers

A little redundant, but I want to repeat that you can think of Vigenere systems as being extensions of Caesar systems, or, as I tend to think about, you can think of Caesar Systems being a Vigenere system with a single character key.

My Vigenere Scripts

I have a small "suite" of Python programs that handle Vigenere-esqe things: vigenere.tar.gz. There are some key stream generators (one pulls words out of a dictionary, the other just spouts out random characters), enchiper scripts and decipher scripts, and an implementation of a "one-time pad" system based on the tabula recta.

Written by W Doug Bolden

For those wishing to get in touch, you can contact me in a number of ways

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

The longer, fuller version of this text can be found on my FAQ: "Can I Use Something I Found on the Site?".

"The hidden is greater than the seen."