Insanity continues,..Oh no..I’m not insane….This is the bad joke for the first time, yeah because I have already know about BWT algorithm after read it from some resources, and now that I’m gonna to share it to you with the hope you can also share it to others based on your easier grasp, or at least you have something to publish in your own blog to improve your page rank, or just for enjoy :-). Anyway, this article just as a sharing and discussion about BWT, not perfection. There are a lot of better articles about this algorithm out there. I found Michael’s O’Stuff or Dr Dobb’s Journal is a good article so far. I highly recommend you to read them if you are serious studying about data compression. However, in this discussion I will keep try to explain the concept behind it and some issues due to the implementation of this algorithm. I interest to choose this algorithm as a topic since it’s really simple and easy to understand, but it always brings good effects for some data compression algorithms. In other words, BWT makes something become easier to be compressed. So, even if a simple algorithm like RLE (Run-Length Encoding), can achieve a good ratio if it’s helped by BWT especially for text file. Besides that, Huffman Coding or Move-to-Front (MTF) also a good choice for BWT. That’s way I interested to this algorithm, even I’ve chosen it as one of compression algorithms that I compared to LZW and Arithmetic Coding for my paper’s topic. But, I’m not still get the executive conclusion from this comparison, cos I still need more time to learn them. Okay, let’s get the ball rolling.
BWT Algorithm Overview
BWT algorithm was invented by two scientists David Wheeler and Michael Burrows. But, i don’t know exactly when it was invented, but based on …..(sorry, I forget) it came from their paper titled “A Block-Sorting Lossless Data Compression” in 1994 which explain about compression based on transformation. But, actually it was invented by David Wheeler in 1983 in advance. Thus, this transformation called as a Burrows-Wheeler Transform (BWT).
Actually BWT algorithm is NOT a compression algorithm, but it’s just something like a processing which perform the transformation of a block data to a new form which still CONTAIN the same elements. As I said, this transformation is very helpful for those algorithms since it tends to group a same characters respectively, it happens because the main principle of BWT is to reordering the characters lexicographically. So that, it more compressible by RLE, Huffman or MTV, oh sorry MTF. 🙂
Okay, rather than talk much, let’s see how well this algorithm transform a block of text.To do this, I’ve quoted a text from “Canterbury Corpus” as an experiment object.
“Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,’ thought Alice `without pictures or conversation?’ So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White
Rabbit with pink eyes ran close by her.”
yggergelteersd,,ten?, dtkgds)kyron ÿ Rhhmehddmr `bwww (ehsshd a bu inii-i iinleaneannnil uadia tchhshhhdhslrchchhchbrlprfplbewhdeehthhhhthdvvvvrryighg ooooo nnnnnnnnn uett cw ttststtttsw t wtW ttglwlppppss amktrvkhntdpg ttt assa b hwwsnonoacleuubpsAAecn aiiweoiooiaaaaiiiiiiiiiniaei ooioottnStd oo i icccbb f wlhnrhwch ue u eooeeeeeoeoee u iuueteeoeeaaneeeaaiaerrou ni ii aihoiauueiiissir eo ittaa n iescc osooo tstt bonn a t `o lpbbarrse”
Wow, amazing! As you can see no magic happens here, BWT just transforms its order to a new form that still contain the same elements but consist a lot of same characters respectively, and then we just need to compress it with RLE or Huffman. So, the best question right now is “How does BWT do that?”
To transform (encoding) a string S which contains N characters, the first thing that BWT do is to rotate cyclic shift characters of string S as much as N-1, so that the matrix M would be produced with the ordo N x N. Keep in mind that N is the length of string S. Columns and the rows are the string from rotations. The rotation process could be described as follows:
Assume we have a string S with the length N which contain C characters. The process of rotation can be seen as follows:
S0 = C0,…,CN – 1
S1 = C1,…,CN – 1, C0
S2 = C2,…,CN – 1, C0, C1
SN – 1 = CN – 1, C0,…,CN – 2
(Hmm…I got this idea from Michael’s page, hopefully he doesn’t mind)
After we have the matrix M, it must be reordered lexicographically, in which one of the matrix’s row is the original string which identified by Index I. So, the encoding result is the LAST CHARACTERS every rows of matrix M followed by index I.
To make everything easier let’s to transform string “BANANA” using BWT. Here are the steps:
1. Build a Matrix M with ordo N x N
To do that, rotate the string cyclic lift as much as N – 1. So we would get result as follows:
2. Reorder the matrix M
Reorder the string (rows of the matrix) lexicographically so we would get result as follows:
The most important thing that need be considered after we got the rotation of the string is how to reverse it to original string. The BWT algorithm would be useless if it couldn’t reverse it’s transformed string to original, right? The process decoding of BWT algorithm is more complicated than transform process. However, during the reversing, we don’t need to build a matrix M with the whole elements as in earlier matrix, we just need the last (L) and the first (F) column (in our case the last column is NNBAAA) of the matrix. Last column can be derived by reordered string NNBAAA lexicographically, I’ll explain it later why it is ok). To reverse all characters of our original string, we then need to create a vector transformation T which contains the array (index) which derived from relation between L and F.
As I said, F could be derived by ordering the L lexicographically since our last matrix is ordered lexicographically (do you recall it?) and since both L and F contain the same elements, it would be ok ordering L to get F. In other words F only could be derived by ordering the L lexicographically as shown bellow.
Since we’ve known the index of our original string (consider the red asterik), we can then guess that the first and the last character of the original string must be “B” and “A”. But, the new problem is how to find the characters between B and A?
Well, consider again our previous vector picture which tell us the mapping both of F and L, assume we want to find the position of character “A” (index 1) in F, by the help of T, it’s actually fall in index 4 belong to L. Thus we can write a formula to find every characters F in L as follows:
F[i] = L[T[i]]………(i)
The above formula is very important to prevent the ambiguous in L, since there are three A’s in it which might be causes the decoder confuse to choose which one is the right.
After knowing the characters in L, the next task is to find up the next character in F, since during the encoding process only involves the permutation by rotating the first character to the last, so both F and L is in the same index. In other words, character A (index 2 in F) is the next character in L. The illustration of these steps can be seen as follows:
To reverse the original string, do the same process until N-1 and then reorder the result ascending.
After followed all the above steps we should get the original string, in this case is “BANANA”. What do you think, huh?
Although the theory of BWT algorithm might looks like the back of your hand, actually the implementation of this algorithm is extremely complicated especially if you have to consider its speed. It turns out that the most much time consuming in BWT algorithm is the “sorting” of the rotated strings since it has to rotated the block of string N times and then sort them lexicographically as well. Performing sorting is an expensive thing in BWT, why not? the rotated strings of BWT would be N size and all of them not in a sorted condition, performing the sorting must be achieved by comparing the character by character until N times.
Optimizing the sorting of this algorithm seems had became a hot topic for those who have been playing with compression. Please read this paper if you’re one of the “insane” scientists about data compression “A Block-sorting Lossless Data Compression Algorithm“, by M.Burrows and David J Wheeler. The good news is that they had have an idea to optimize the sorting instead of comparing all characters N times in the block data, the main idea of it is by placing an unique EOF in a string being encoded. So, if you have a string S0, simply concatenate it with an EOF character. Notice: EOF must unique, in other words it must be the smallest or the largest character in the block data based on its index in ASCII characters. So, each time the decoder found this character, the comparison of characters then can be terminated which means it doesn’t need performing the comparison N times, anyway in some programming languages there is a pretty standard sorting method like in C is qsort(), but if you would like to use another naive method, I recommend you to use QuickSort or MegerSort algorithm, but if you’re really serious to consider its speed, you should learn the Suffix Tree or something else. Good Luck!