Day 42: Huffman Coding
Table of Contents
Huffman Coding #
Welcome to Day 42 of our 60 Days of Coding Algorithm Challenge! Today, we’ll explore Huffman Coding, a greedy algorithm used for lossless data compression.
What is Huffman Coding? #
Huffman Coding is an efficient method for data compression. It assigns variable-length codes to characters based on their frequencies. More frequent characters get shorter codes, while less frequent characters get longer codes.
How Huffman Coding Works #
- Calculate the frequency of each character in the input.
- Create a leaf node for each character and add it to a priority queue.
- While there’s more than one node in the queue:
- Remove the two nodes with the lowest frequency.
- Create a new internal node with these two nodes as children.
- Add the new node to the queue.
- The remaining node is the root of the Huffman tree.
- Traverse the tree to assign codes to characters.
Implementation #
Let’s implement Huffman Coding in Python:
import heapq
from collections import defaultdict
class Node:
def __init__(self, char, freq):
self.char = char
self.freq = freq
self.left = None
self.right = None
def __lt__(self, other):
return self.freq < other.freq
def build_huffman_tree(text):
# Count frequency of characters
frequency = defaultdict(int)
for char in text:
frequency[char] += 1
# Create a priority queue of nodes
heap = [Node(char, freq) for char, freq in frequency.items()]
heapq.heapify(heap)
# Build the Huffman tree
while len(heap) > 1:
left = heapq.heappop(heap)
right = heapq.heappop(heap)
internal = Node(None, left.freq + right.freq)
internal.left = left
internal.right = right
heapq.heappush(heap, internal)
return heap[0]
def generate_codes(root):
def traverse(node, code):
if node.char:
codes[node.char] = code
return
traverse(node.left, code + '0')
traverse(node.right, code + '1')
codes = {}
traverse(root, '')
return codes
def huffman_encoding(text):
root = build_huffman_tree(text)
codes = generate_codes(root)
encoded_text = ''.join(codes[char] for char in text)
return encoded_text, root
def huffman_decoding(encoded_text, root):
decoded_text = []
current = root
for bit in encoded_text:
if bit == '0':
current = current.left
else:
current = current.right
if current.char:
decoded_text.append(current.char)
current = root
return ''.join(decoded_text)
# Example usage
text = "this is an example for huffman encoding"
encoded, tree = huffman_encoding(text)
print(f"Encoded text: {encoded}")
decoded = huffman_decoding(encoded, tree)
print(f"Decoded text: {decoded}")
Time Complexity #
- Building the frequency table: O(n)
- Creating and building the Huffman tree: O(k log k), where k is the number of unique characters
- Generating codes: O(k)
- Encoding: O(n)
- Decoding: O(n)
Overall time complexity: O(n + k log k), where n is the length of the text and k is the number of unique characters.
Space Complexity #
O(k) for the Huffman tree and codes table, where k is the number of unique characters.
Advantages of Huffman Coding #
- Lossless compression
- Variable-length codes for efficient compression
- Optimal prefix codes (no code is a prefix of another)
Disadvantages #
- Requires knowledge of character frequencies in advance
- Two passes over the data (one for counting frequencies, one for encoding)
- Compressed data size depends on the accuracy of frequency statistics
Applications #
- Data compression in file archiving tools (e.g., zip)
- Image compression (used in JPEG and PNG)
- Implementation of data structures like succinct trees
Exercise #
- Modify the Huffman Coding algorithm to work with bytes instead of characters for more general data compression.
- Implement a function to calculate the compression ratio achieved by Huffman Coding for a given input.
- Create a variant of Huffman Coding that uses a predefined frequency table for English text, eliminating the need for the first pass over the data.
Summary #
Today, we explored Huffman Coding, a fundamental algorithm in data compression. We implemented the algorithm, discussed its time and space complexity, and looked at its advantages, disadvantages, and real-world applications.
Understanding Huffman Coding provides insights into how efficient data compression works and demonstrates another powerful application of greedy algorithms and priority queues.
Tomorrow, we’ll study the Dijkstra’s Algorithm. Stay tuned!