Pure Programmer
Blue Matrix


Cluster Map

Project: Huffman Coding

[[Huffman_coding|Huffman Coding]] is a particular type of code commonly used for lossless data compression. By using shorter codes for more common symbols in a file and longer codes for less common symbols, we can save space overall.

Write a program that reads in a file and writes out a shorter compressed file using a Huffman Code. You will need to process the file to compute the frequency counts for each symbol in the file before you construct the Huffman tree used to encode the file. You will also need to write out the Huffman tree at the beginning of the file so that it can be used to decompress the file.

The program should take three arguments on the command line: name of input file, name of output file, and C/D for compress or decompress.

Even though this is an "optimal" compression technique, it doesn't always result in a smaller file. Why is this? What is the worst case?

Output
$ node HuffmanCoding.js ../../data/text/UnicodeTest.utf8 output/testUnicodeTest.hz C $ node HuffmanCoding.js output/testUnicodeTest.hz output/testUnicodeTest.txt D $ ls -al output/testUnicodeTest.* -rw-r--r--@ 1 rich staff 2535 Sep 8 21:02 output/testUnicodeTest.cc20 -rw-r--r--@ 1 rich staff 2535 Sep 8 21:02 output/testUnicodeTest.cc20.utf8 -rw-r--r--@ 1 rich staff 2567 Sep 8 21:03 output/testUnicodeTest.hz -rw-r--r--@ 1 rich staff 2535 Sep 8 21:03 output/testUnicodeTest.txt $ diff ../../data/text/UnicodeTest.utf8 output/testUnicodeTest.txt $ node HuffmanCoding.js ../../data/text/100.txt output/test100.hz C $ node HuffmanCoding.js output/test100.hz output/test100.txt D $ ls -al output/test100.* -rw-r--r--@ 1 rich staff 3274838 Sep 8 21:03 output/test100.hz -rw-r--r--@ 1 rich staff 5589889 Sep 8 21:04 output/test100.txt $ diff ../../data/text/100.txt output/test100.txt

Solution