(Based on an assignment given by Owen Astrachan at Duke University.)
The resulting program will be a complete and useful compression program although not, perhaps, as powerful as the standard Windows utilities such as winzip which use different algorithms permitting a higher degree of compression than Huffman coding. You should take care to develop the program in an incremental fashion. If you try to write the whole program at once, you probably will not get a completely working program! Design, develop, and test so that you have a working program at each step.
Because the compression scheme involves reading and writing in a bits-at-a-time manner as opposed to a char-at-a-time manner, the program can be hard to debug. In order to facilitate the design/code/debug cycle, an iterative enhancement development of the program is suggested below. You are by no means constrained to adopt this approach, but it might prove useful if your final program doesn't work to show useful and working programs along the way.
huff foo.txtLater you might provide the user the option of naming the output file explicitly:
huff foo.txt -o foo.txt-huffedAs described below in the development section, you MUST use a class-based approach to this problem. This means, for example, that you should implement at least one class. Some suggestions are provided in the development section.
Before getting into particulars of the program, we'll outline the steps in the Huffman algorithm.
It's probably a good idea to create several classes. A class can be responsible for one part of the Huffman compression. For example, a class could be responsible for creating the initial counts of how many times each character occurs. You could design, implement, and test the class in isolation from the rest of the huff.cpp program. Doing this for each of several classes will help both in developing huff.cpp and in re-using code in writing the uncompress program unhuff.cpp.
For example, one possibility for a character counting class is shown
below:
CharCounter
cc;
cc.process( "poe.txt" );
for( int k = 0; k < 256; k++ )
{
int occs = cc.count(
k );
if( occs > 0 )
{
cout << char( k ) << " " << occs << endl;
}
}
Here the member function CharCounter::process reads and counts all the characters in a file. The function CharCounter::count returns the number of occurrences of a given character. You may decide that process is not a good identifier. For example, some people think that readSource is a better name, indicating that the function reads a source file as opposed to a compressed file. You might use CharCounter to read compressed-file headers too in unhuff.cpp, for example. Note that there are 256 different 8-bit values, so you'll need 256 different counters. (There are 257 possible characters if you include the pseudo-EOF chararacter described below.)
Do NOT use any variables of type char! You should use int variables when you think you might need a char everywhere in your program. The only time you might want to use a char variable is to print for debugging purposes -- you can cast an int to a printable char as shown in the code fragment above.
Some other ideas for classes are given below. You may decide to combine some of these into one class, or you may decide to implement more classes. These are a start and some suggestions, not requirements.
Similarly, you may want to implement a function to print the Huffman tree to help you determine if the tree you've built is correct.
Designing debugging functions as part of the original program will make the program development go more quickly since you'll be able to verify that pieces of the program, or certain classes, work properly.
Building in the debugging scaffolding from the start will make it much easier to test and develop your program. When testing, use small examples of test files maybe even as simple as "go go gophers" that help you verify that your program and classes are functioning as intended.
You might want to write encoding bits out first as strings or printable int values rather than as raw bits of zeros and ones which won't be readable except to other computer programs. A Compress class, for example, could support printAscii functions and printBits to print in human readable or machine readable formats.
teer4> huff myprog.c no compression yielded, no output file written teer4> huff -f myprog.cHere, the second time huff is executed the compression is forced because of the -f argument. Only the first argument to the program can be the -f argument.
Don't spend time worrying how to process command-line arguments until you're sure the program works, this isn't the most important part of the program. You can add command-line parameters after everything works.
One easy way to ensure that compression/decompression work in tandem is to write a "magic number" at the beginning of a compressed file. This could be any number of bits that are specifically written as the first N bits of a huffman-compressed file (for some N). The corresponding uncompression program first reads N bits, if the bits don't represent the "magic number" then the compressed file is not properly formed. You can read/write bits using the classes declared in bitstream.h.
In writing unhuff.cpp you should try to use some of the same classes that use used in huff.cpp. For example, to uncompress you must have the same Huffman tree that was used to compress. This tree might be stored directly in the compressed file, or it might be created from character counts stored in the compressed file. In either case, a class to create a Tree could create the tree from either a CharCounter object or directly from a compressed file. Once the tree object can be used once it's created.
Every time a file is compressed the count of the the number of times that pseudo-EOF occurs should be one --- this should be done explicitly in the code that determines frequency counts. In other words, a pseudo-char EOF with number of occurrences (count) of 1 must be explicitly created.
When you open files, you should assume they are binary files, not text files. Modes that will work are
static const int READ_MODE = ios::in
| ios::binary;
static const int WRITE_MODE = ios::out | ios::binary;
int bit; while( ( bit = in.readBit( ) ) != EOF ) ) // Should never actually hit EOF this way { // use the zero/one value of the bit read // to traverse Huffman coding tree // if a leaf is reached, decode the character and print UNLESS // if the character is pseudo-EOF, then decompression done // loop to read more bits }When a compressed file is written the last bits written should be the bits that correspond to the pseudo-EOF char. You'll have to write these bits explicity. These bits will be recognized by the program unhuff.cpp and used in the decompression process. In particular, when using unhuff a well-formed compressed file will be terminated with the encoded form of the pseudo-EOF char (see code above). This means that your decompression program will never actually run out of bits if it's processing a properly compressed file (think about this). In other words, when decompressing (unhuffing) you'll read bits, traverse a tree, and eventually find a leaf-node representing some character. When the pseudo-EOF leaf is found, the program can terminate because all decompression is done. If reading a bit fails because there are no more bits (the bitreading function returns false) the compressed file is not well-formed.
in.clear( ); in.seekg( 0, ios::beg ); // Rewind the stream
priority_queue<Pointer<HuffNode>, vector<Pointer<HuffNode> >, less<Pointer<HuffNode> > > pq;Note two things:
ifstream fin( "input.txt" ); ibstream in( fin ); int bit;
while( ( bit = in.readbit( ) ) != EOF ) { cout << ( bit == 1 ) ? 'O' : 'Z'; }Similarly, writebit can be called to write a single bit at a time. Or you can use writeBits, sending to it a vector of 0s and 1s.
When using writebits to write a specified number of bits, some bits may not be written because of some buffering that takes place. To ensure that all bits are written, the last bits must be explicitly flushed. The function flush is called automatically if your program exits properly when the obstream destructor is called. You should not need to call it explicitly.
int main( int argc, char *argv[] )
{
if( argc == 1 )
{
cout <<
"program has NO command-line arguments" << endl;
cout <<
"name of program is " << argv[ 0 ] << endl;
}
else
{
cout
<< "program " << argv[ 0 ] << " has arguments:" <<
endl;
for(
int k = 1; k < argc; k++ )
{
cout << "arg: " << k << " = " << argv[ k ] <<
endl;
}
}
return 0;
}
All programs have at least one command-line argument, the name of the
program being run. This is stored as the first entry in the array
argv
(the first entry has index zero). The array argv is an array of
c-style strings. These strings are just pointers to characters, with a
special NUL-character '\0' to signify the last character in a C-style string.
You do NOT need to know this to manipulate these char *, C-style strings.
The easiest thing to do is to assign each element of argv to a C++
string variable. Then you can use "standard" C++ string functions to manipulate
the values, e.g., you can call length(), you can use substr(),
you can concatenate strings with +, etc. None of these operations work
with C-style, char * strings. Assign each element of argv to
a C++ string variable for processing.
Description of the report
1/ How to use the program, what options are available, and what error checks are performed.
2/ Brief description of the problem and the way you have used to solve
it (main algorithm, basic design, different techniques used i.e how
information is stored in compressed files so that uncompression works...etc)
2/ Brief description of the software(program) :
- list of files of the project
- List of Classes, libraries and functions used
(including those
defined by the author)
- List of Classes you have defined (for each class,
list all public
and private methods)
- Description of the Makefile (how are the different
modules
compiled and linked ...static or dynamic linking..etc).
- Compilers that can or
cannot be used to compile your program.
3/ Brief description of the project :
- Students involved.
- Tasks performed by each
student (specify the approximative duration
of each task).
- Dependencies(ordering)
between the different tasks.
- Total time required.
4/ General comments
What was frustrating or enjoyable about the assignment, a list
of collaborators with whom you worked (if any)...etc.
I will choose one student from each team to be interviewed to
make sure
that any student from each team has a good idea about the project.
1/ create a directory named "assign4"
2/ move all files required by the program to the directory "assign4"
i.e :
- README (how to use the
program, what options are available, and what error checks are performed)
- headers (.h)
- implementations (.cc ,
.cpp...etc)
- the Makefile :
- should be named "makefile"
3/ tar your directory
tar cf assign4.tar assign4
4/ compress the result with gzip
gzip assign4.tar
5/ The result will be a file named :
assing4.tar.gz
6/ Execute the following script :
/home/lib3620/assignments/a4submit_tar_gz
7/ Check(read the message) that the script has found your file
8/ Ignore the message about uuencode !
9/ The marker will send you an e-mail confirming the reception
of all files.
description | points |
---|---|
report | 10 points |
individual presentation (interview) | 5 points |
compression of any text file | 10 points |
compression of any file (including binary files) [BONUS] | 5 points |
decompression | 8 points |
user-interface (how easy and intuitive program is to use) | 2 points |
robustness (does unhuff program crash on non-huffed files?) | 2 points |
program style (class design/implementation, program design) | 3 points |