Assignment #4

Huffman Codes

(Based on an assignment given by Owen Astrachan at Duke University.)

You are urged to work in groups of three or four. Each group should submit ONE program per group.

Introduction

This assignment involves implementing a compression/uncompression program based on the greedy Huffman algorithm discussed in class.

The resulting program will be a complete and useful compression program although not, perhaps, as powerful as the standard Windows utilities such as winzip which use different algorithms permitting a higher degree of compression than Huffman coding. You should take care to develop the program in an incremental fashion. If you try to write the whole program at once, you probably will not get a completely working program! Design, develop, and test so that you have a working program at each step.

Because the compression scheme involves reading and writing in a bits-at-a-time manner as opposed to a char-at-a-time manner, the program can be hard to debug. In order to facilitate the design/code/debug cycle, an iterative enhancement development of the program is suggested below. You are by no means constrained to adopt this approach, but it might prove useful if your final program doesn't work to show useful and working programs along the way.

You are to write two programs:

huff.cpp This program compresses a file
unhuff.cpp This program uncompresses a (previously huffed/compressed) file

Both programs may eventually read from a file specified by the user on the command line, but initially the user can be prompted for the file name (or you could hardwire the filename into the program, but this is strongly discouraged). The program huff.cpp should write its (compressed) output to a file also specified in some way (by user, on command line, or to cout). The line below might compress the file foo.txt storing the compressed version in foo.txt.H.

               huff foo.txt

Later you might provide the user the option of naming the output file explicitly:

               huff foo.txt -o foo.txt-huffed

As described below in the development section, you MUST use a class-based approach to this problem. This means, for example, that you should implement at least one class. Some suggestions are provided in the development section.

Assignment Details

The files for this assignment include the following:

You should copy these files to get started. They contain wrapper classes that allow you to perform bit-at-a-time I/O. Wrapper.h is used to wrap a pointer variable with meaningful comparison semantics.

The program huff.cpp

You must write a program huff.cpp that will compress a file so that it can be uncompressed by unhuff.cpp. In writing huff.cpp you'll implement several classes, usually a class will take care of one part of the Huffman algorithm. These classes will also provide support for the decompression program unhuff.cpp.

Before getting into particulars of the program, we'll outline the steps in the Huffman algorithm.

Steps in the Huffman Algorithm

Your implementation of Huffman coding has four principle steps:

Count how many times every character occurs in a file. These counts are used to build weighted nodes that will be leaves in the Huffman tree. Here we use character to mean a conceptual character, not a C++ char value, although it might be.
From the counts, build the Huffman tree. To do this, first create one node per character, weighted with the number of times the character occurs, insert each node into a priority queue. Then choose two minimal nodes, join these nodes together as children of a newly created node, and insert the newly created node into the priority queue. The new node is weighted with the sum of the two minimal nodes taken from the priority queue. Continue this process until only one node is left, this is the root of the Huffman tree.
By traversing the Huffman tree, create a code for each leaf/character. The code is formed from the path from the root of the Huffman tree to each leaf, going left adds a zero to the path, going right adds a one. At each leaf, the character at the leaf is mapped to the sequence/path of zeros and ones used to reach the leaf. All characters/encoding bit pairs are stored in some kind of table or map. Depending on how you store the path of bits you may need the number of bits for each character as well. The number of bits may be needed since the bit patterns 00101 and 0101 are different, but are both represented as numbers by 5. However, if you store the bit patterns as strings, e.g., "00101" and "0101" then the string's length can be used to determine the number of encoding bits.
Read the input file a second time. For each character read, instead of writing the character, write the sequence of zero and one bits used to encode the character. This sequence of bits was determined in the previous step.

Steps in Implementing huff.cpp

Some steps that may be useful in developing the program are described below. It's important to develop your program a few steps at a time. At each step, you should have a functioning program, although it may not do everything the first time it's run. By developing in stages, you'll find it easier to isolate bugs and you'll be more likely to get a program working faster. Do NOT write hundreds of lines of code before compiling and testing.

It's probably a good idea to create several classes. A class can be responsible for one part of the Huffman compression. For example, a class could be responsible for creating the initial counts of how many times each character occurs. You could design, implement, and test the class in isolation from the rest of the huff.cpp program. Doing this for each of several classes will help both in developing huff.cpp and in re-using code in writing the uncompress program unhuff.cpp.

For example, one possibility for a character counting class is shown below:
CharCounter cc;
cc.process( "poe.txt" );

     for( int k = 0; k < 256; k++ )
     {
         int occs = cc.count( k );
         if( occs > 0 )
         {
             cout << char( k ) << " " << occs << endl;
         }
     }

Here the member function CharCounter::process reads and counts all the characters in a file. The function CharCounter::count returns the number of occurrences of a given character. You may decide that process is not a good identifier. For example, some people think that readSource is a better name, indicating that the function reads a source file as opposed to a compressed file. You might use CharCounter to read compressed-file headers too in unhuff.cpp, for example. Note that there are 256 different 8-bit values, so you'll need 256 different counters. (There are 257 possible characters if you include the pseudo-EOF chararacter described below.)

Do NOT use any variables of type char! You should use int variables when you think you might need a char everywhere in your program. The only time you might want to use a char variable is to print for debugging purposes -- you can cast an int to a printable char as shown in the code fragment above.

Some other ideas for classes are given below. You may decide to combine some of these into one class, or you may decide to implement more classes. These are a start and some suggestions, not requirements.

A class to create the Huffman tree used in compression and uncompression. The tree is created from character counts, so a Huffman-tree creating class might use a CharCounter object in creating the Huffman tree.
A class that represents the table of character and encoding bit pattern pairs. You could use a map for example to map char values to string values that represent the sequence of zeros and ones that encode the character, e.g., that map 'A' to "0101011".

Implementing and Debugging

You will probably find it useful to implement debugging functions in your classes. These are functions you use in the debugging stages, but not in the final version of the program. For example, you could implement CharCounter::print to print all characters and frequencies read from a source file --- this could use the code fragment shown above with the CharCounter description. Although you could write such a printing/testing function outside the class, encapsulating the testing/debugging functions within the class makes the class more coherent.

Similarly, you may want to implement a function to print the Huffman tree to help you determine if the tree you've built is correct.

Designing debugging functions as part of the original program will make the program development go more quickly since you'll be able to verify that pieces of the program, or certain classes, work properly.

Building in the debugging scaffolding from the start will make it much easier to test and develop your program. When testing, use small examples of test files maybe even as simple as "go go gophers" that help you verify that your program and classes are functioning as intended.

You might want to write encoding bits out first as strings or printable int values rather than as raw bits of zeros and ones which won't be readable except to other computer programs. A Compress class, for example, could support printAscii functions and printBits to print in human readable or machine readable formats.

Final Stages of Implementation

When you think your huff.cpp program is compressing properly (you'll probably need to write unhuff.cpp to really verify this) you'll need to add a few finishing touches to the program to make it more usable and more robust.

huff

-f

      teer4> huff myprog.c
      no compression yielded, no output file written
      teer4> huff -f myprog.c

huff

As shown in the example above, your program will need to process command line arguments like -f or the name of the file being compressed.

To determine if compression results in a smaller file, you'll need to determine the number of characters in the original file (your program will compute this by determining character counts). The size of the compressed file can be calculated from the same character counts using the size of each character's encoded number of bits. You must also remember to calculate the file-header information stored in the compressed program. To be more precise, if there are 52 A's, and each A requires 4 bits to encode, then the A's contribute 52*4 = 108 bits to the compressed file. You'll need to make calculations like this for all characters.

Don't spend time worrying how to process command-line arguments until you're sure the program works, this isn't the most important part of the program. You can add command-line parameters after everything works.

The program unhuff.cpp

The program should be robust enough not to crash if it is given a program to uncompress that wasn't compressed using the corresponding Huffman compression program. The robustness of the program will be an important criterion in grading the program. There are a variety of methods that you can use to ensure this works, but it will probably always be possible to ``fool'' the program (although very unlikely).

One easy way to ensure that compression/decompression work in tandem is to write a "magic number" at the beginning of a compressed file. This could be any number of bits that are specifically written as the first N bits of a huffman-compressed file (for some N). The corresponding uncompression program first reads N bits, if the bits don't represent the "magic number" then the compressed file is not properly formed. You can read/write bits using the classes declared in bitstream.h.

In writing unhuff.cpp you should try to use some of the same classes that use used in huff.cpp. For example, to uncompress you must have the same Huffman tree that was used to compress. This tree might be stored directly in the compressed file, or it might be created from character counts stored in the compressed file. In either case, a class to create a Tree could create the tree from either a CharCounter object or directly from a compressed file. Once the tree object can be used once it's created.

Coding and Algorithmic Details

The number of characters counted is at most 257. Although only 256 values can be represented by 8 bits, one character is used as the pseudo-EOF character.

Every time a file is compressed the count of the the number of times that pseudo-EOF occurs should be one --- this should be done explicitly in the code that determines frequency counts. In other words, a pseudo-char EOF with number of occurrences (count) of 1 must be explicitly created.

When you open files, you should assume they are binary files, not text files. Modes that will work are

static const int READ_MODE = ios::in | ios::binary;
static const int WRITE_MODE = ios::out | ios::binary;

Pseudo-EOF character

The operating system will buffer output, that is output to disk actually occurs when some internal buffer is full. In particular, it is not possible to write just one single bit to a file, all output is actually done in "chunks", e.g., it might be done in eight-bit chunks. In any case, when you write 3 bits, then 2 bits, then 10 bits, all the bits are eventually written, but you can't be sure precisely when they're written during the execution of your program. Also, because of buffering, if all output is done in eight-bit chunks and your program writes exactly 61 bits explicitly, then 3 extra bits will be written so that the number of bits written is a multiple of eight. Because of the potential for the existence of these "extra" bits when reading one bit at a time, you cannot simply read bits until there are no more left since your program might then read the extra bits written due to buffering. This means that when reading a compressed file, you'll use a pseudo-EOF character and write a loop that stops when the pseudo-EOF character is read in (in compressed form). The code below is pseudo-code for reading a compressed file.

       int bit;
       while( ( bit = in.readBit( ) ) != EOF ) )  // Should never actually hit EOF this way
       {
           // use the zero/one value of the bit read
           // to traverse Huffman coding tree
           // if a leaf is reached, decode the character and print UNLESS
           // if the character is pseudo-EOF, then decompression done
           // loop to read more bits
       }

When a compressed file is written the last bits written should be the bits that correspond to the pseudo-EOF char. You'll have to write these bits explicity. These bits will be recognized by the program unhuff.cpp and used in the decompression process. In particular, when using unhuff a well-formed compressed file will be terminated with the encoded form of the pseudo-EOF char (see code above). This means that your decompression program will never actually run out of bits if it's processing a properly compressed file (think about this). In other words, when decompressing (unhuffing) you'll read bits, traverse a tree, and eventually find a leaf-node representing some character. When the pseudo-EOF leaf is found, the program can terminate because all decompression is done. If reading a bit fails because there are no more bits (the bitreading function returns false) the compressed file is not well-formed.

Compressed Header

For decompression to work with Huffman coding, information must be stored in the compressed file that allows the Huffman tree to be re-created so that decompression can take place. There are many options here. You can store all codes and lengths as normal (32 bit) C/C++ ints or you can try to be inventive and save space. For example, it's possible to store just character counts and recreate the codes from the counts. It's also possible to store code-lengths and codes using bit-at-a-time operations. Any solution to storing information in the compressed file is acceptable. If you use a successful space-saving technique you can earn several points of extra credit (up to 5). Space-saving techniques are defined as those using less space than simply storing 257 counts as 32 bit ints. One useful technique is to write the tree to the file. You can use a 0 or 1 bit to differentiate between internal nodes and leaves, for example.

Reading Files Twice

Finally, because Huffman coding requires two passes over the input file you will need to rewind the underlying stream. The effect of rewinding the input stream is to allow it to be ``read again''. You can do so as follows:

in.clear( );
in.seekg( 0, ios::beg ); // Rewind the stream

Priority Queues

You should use the STL priority_queue class.Assuming that you store a tree node in an object of type HuffNode, then a respectable declaration would be

priority_queue<Pointer<HuffNode>, vector<Pointer<HuffNode> >, less<Pointer<HuffNode> > > pq;

Note two things:

The STL priority queue gives the max, not the min.
You need a comparison function; pointers don't compare well, but if you wrap the HuffNode in a Pointer object, then all you have to do is define operator< for HuffNode. Note that because of point #1 above, you will want to reverse the meaning of operator<.

Creating a Table from a Huffman-tree

To create a table or map of coded bit values for each character you'll need to traverse the Huffman tree (e.g., inorder, preorder, etc.) making an entry in the table each time you reach a leaf. For example, if you reach a leaf that stores the character 'C', following a path left-left-right-right-left, then an entry in the 'C'-th location of the table should be set to 00110. You'll need to make a decision about how to store the bit patterns the table. One method is based on storing strings. This makes it easy to add a character (using +) to a string during tree traversal and makes it possible to store just one entry in the table: the string. An alternative is to use a vector<int>; this makes it easy to interact with writeBits (see next section)

Using bitstream.h and bitstream.cpp

In order to read and write in a bit-at-a-time manner, the file bitstream.cpp and the corresponding header file bitstream.h must be used. Two classes are specified for reading/writing bits-at-a-time: ibstreamand obstream, respectively. These classes wrap a corresponding stream, and then provide readBit and writeBit, respectively.

Bit read/write subprograms

Here is an example that reads the bits in a file and outputs Z for zero and O for one. Needless to say, it is short on error checks.

    ifstream fin( "input.txt" );
    ibstream in( fin );
    int bit;

    while( ( bit = in.readbit( ) ) != EOF )
    {
        cout << ( bit == 1 ) ? 'O' : 'Z';
    }

Similarly, writebit can be called to write a single bit at a time. Or you can use writeBits, sending to it a vector of 0s and 1s.

When using writebits to write a specified number of bits, some bits may not be written because of some buffering that takes place. To ensure that all bits are written, the last bits must be explicitly flushed. The function flush is called automatically if your program exits properly when the obstream destructor is called. You should not need to call it explicitly.

Command-line Arguments

When a C/C++ program is invoked, arguments are often given as command-line arguments/parameters, i.e., options entered on the same line as the program. To run a program from the command line, bring up an MS/DOS window. You should find the executable in the DEBUG directory of your project.

   int main( int argc, char *argv[] )
   {
       if( argc == 1 )
       {
          cout << "program has NO command-line arguments" << endl;
          cout << "name of program is " << argv[ 0 ] << endl;
       }
       else
       {
           cout << "program " << argv[ 0 ] << " has arguments:" << endl;
           for( int k = 1; k < argc; k++ )
           {
               cout << "arg: " << k << " = " << argv[ k ] << endl;
           }
       }
       return 0;
   }

All programs have at least one command-line argument, the name of the program being run. This is stored as the first entry in the array argv (the first entry has index zero). The array argv is an array of c-style strings. These strings are just pointers to characters, with a special NUL-character '\0' to signify the last character in a C-style string. You do NOT need to know this to manipulate these char *, C-style strings. The easiest thing to do is to assign each element of argv to a C++ string variable. Then you can use "standard" C++ string functions to manipulate the values, e.g., you can call length(), you can use substr(), you can concatenate strings with +, etc. None of these operations work with C-style, char * strings. Assign each element of argv to a C++ string variable for processing.

What to submit

Submit this program with all source code, sample output and complete documentation (report)

Description of the report

1/ How to use the program, what options are available, and what error checks are performed.

2/ Brief description of the problem and the way you have used to solve
it (main algorithm, basic design, different techniques used i.e how information is stored in compressed files so that uncompression works...etc)

2/ Brief description of the software(program) :
    - list of files of the project
    - List of Classes, libraries and functions used (including those
defined by the author)
    - List of Classes you have defined (for each class, list all public
and private methods)
    - Description of the Makefile (how are the different modules
compiled and linked ...static or dynamic linking..etc).
        - Compilers that can or cannot be used to compile your program.

3/ Brief description of the project :
        - Students involved.
        - Tasks performed by each student (specify the approximative duration
of each task).
        - Dependencies(ordering) between the different tasks.
        - Total time required.

4/ General comments
What was frustrating or enjoyable about the assignment, a list of collaborators with whom you worked (if any)...etc.

Description of the individual presentation(interview) :

I will choose one student from each team to be interviewed to make sure
that any student from each team has a good idea about the project.

Hand In :

1/ create a directory named "assign4"

2/ move all files required by the program to the directory "assign4"
i.e :
        - README (how to use the program, what options are available, and what error checks are performed)
        - headers (.h)
        - implementations (.cc , .cpp...etc)
        - the Makefile :
                - should be named "makefile"

3/ tar your directory

tar cf assign4.tar assign4

4/ compress the result with gzip

gzip assign4.tar

5/ The result will be a file named :

assing4.tar.gz

6/ Execute the following script :

/home/lib3620/assignments/a4submit_tar_gz

7/ Check(read the message) that the script has found your file

8/ Ignore the message about uuencode !

9/ The marker will send you an e-mail confirming the reception
of all files.

Grading

The programs are worth 40 points + 5 points (bonus).

**Huffman Coding Grading Standards**
description	points
report	10 points
individual presentation (interview)	5 points
compression of any text file	10 points
compression of any file (including binary files) [BONUS]	5 points
decompression	8 points
user-interface (how easy and intuitive program is to use)	2 points
robustness (does unhuff program crash on non-huffed files?)	2 points
program style (class design/implementation, program design)	3 points