Huffman Coding
Welcome to Day 42 of our 60 Days of Coding Algorithm Challenge! Today, we’ll explore Huffman Coding, a greedy algorithm used for lossless data compression.
What is Huffman Coding?
Huffman Coding is an efficient method for data compression. It assigns variable-length codes to characters based on their frequencies. More frequent characters get shorter codes, while less frequent characters get longer codes.
How Huffman Coding Works
- Calculate the frequency of each character in the input.
- Create a leaf node for each character and add it to a priority queue keyed by frequency.
- While there's more than one node in the queue:
  - Remove the two nodes with the lowest frequencies.
  - Create a new internal node with these two nodes as children; its frequency is the sum of theirs.
  - Add the new node back to the queue.
- The remaining node is the root of the Huffman tree.
- Traverse the tree, appending '0' for each left branch and '1' for each right branch, to assign codes to the characters at the leaves.
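The merge loop above can be traced on a small hypothetical frequency table (the symbols and counts here are invented for illustration, not taken from the example later in the post):

```python
import heapq

# Hypothetical frequencies for a three-symbol alphabet
freqs = {'a': 5, 'b': 2, 'c': 1}

# Each heap entry is (frequency, concatenation of the symbols merged so far)
heap = [(f, ch) for ch, f in freqs.items()]
heapq.heapify(heap)

# Repeatedly merge the two lowest-frequency entries
while len(heap) > 1:
    f1, s1 = heapq.heappop(heap)
    f2, s2 = heapq.heappop(heap)
    print(f"merge {s1}({f1}) + {s2}({f2}) -> {s1 + s2}({f1 + f2})")
    heapq.heappush(heap, (f1 + f2, s1 + s2))

print(heap[0])  # the root: total frequency 8 covering all symbols
```

In this trace, 'b' and 'c' are merged first, and the result is then merged with 'a'. That leaves 'a' one edge from the root (a one-bit code) while 'b' and 'c' sit two edges deep, matching the rule that more frequent characters get shorter codes.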
Implementation
Let’s implement Huffman Coding in Python:
```python
import heapq
from collections import defaultdict

class Node:
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

    def __lt__(self, other):
        return self.freq < other.freq

def build_huffman_tree(text):
    # Count the frequency of each character
    frequency = defaultdict(int)
    for char in text:
        frequency[char] += 1

    # Create a priority queue of leaf nodes
    heap = [Node(char, freq) for char, freq in frequency.items()]
    heapq.heapify(heap)

    # Repeatedly merge the two lowest-frequency nodes
    while len(heap) > 1:
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)
        internal = Node(None, left.freq + right.freq)
        internal.left = left
        internal.right = right
        heapq.heappush(heap, internal)

    return heap[0]

def generate_codes(root):
    codes = {}

    def traverse(node, code):
        if node.char is not None:
            # Leaf node; fall back to '0' when the tree is a single leaf
            codes[node.char] = code or '0'
            return
        traverse(node.left, code + '0')
        traverse(node.right, code + '1')

    traverse(root, '')
    return codes

def huffman_encoding(text):
    root = build_huffman_tree(text)
    codes = generate_codes(root)
    encoded_text = ''.join(codes[char] for char in text)
    return encoded_text, root

def huffman_decoding(encoded_text, root):
    # Special case: a single-leaf tree (text with only one unique character)
    if root.left is None and root.right is None:
        return root.char * len(encoded_text)

    decoded_text = []
    current = root
    for bit in encoded_text:
        current = current.left if bit == '0' else current.right
        if current.char is not None:
            decoded_text.append(current.char)
            current = root
    return ''.join(decoded_text)

# Example usage
text = "this is an example for huffman encoding"
encoded, tree = huffman_encoding(text)
print(f"Encoded text: {encoded}")
decoded = huffman_decoding(encoded, tree)
print(f"Decoded text: {decoded}")
```
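As a self-contained cross-check (a condensed re-statement of the same algorithm, not the listing above), a dictionary-based variant builds the code table directly during the merges: each heap entry carries a map from symbol to partial code, and merging prepends a '0' or '1'. A greedy lookup in the inverse table then decodes, which is valid only because the code is prefix-free:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table; heap entries are
    (frequency, tiebreak id, {symbol: partial code})."""
    heap = [(f, i, {ch: ''}) for i, (ch, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        # Prepend one bit to every code in each merged subtree
        merged = {ch: '0' + c for ch, c in c1.items()}
        merged.update({ch: '1' + c for ch, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, i, merged))
        i += 1
    return heap[0][2]

text = "this is an example for huffman encoding"
codes = huffman_codes(text)
encoded = ''.join(codes[ch] for ch in text)

# Decode greedily with the inverse table (safe because the code is prefix-free)
inverse = {v: k for k, v in codes.items()}
decoded, buf = [], ''
for bit in encoded:
    buf += bit
    if buf in inverse:
        decoded.append(inverse[buf])
        buf = ''

assert ''.join(decoded) == text  # lossless round trip
print(f"{len(text) * 8} bits -> {len(encoded)} bits")
```

The final print compares against a fixed 8-bit-per-character baseline, showing the size reduction on this particular input.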
Time Complexity
- Building the frequency table: O(n)
- Creating and building the Huffman tree: O(k log k), where k is the number of unique characters
- Generating codes: O(k)
- Encoding: O(n) table lookups (the output holds one code per input character)
- Decoding: O(m), where m is the number of bits in the encoded text
Overall time complexity: O(n + k log k), where n is the length of the text and k is the number of unique characters.
Space Complexity
O(k) for the Huffman tree and codes table, where k is the number of unique characters.
Advantages of Huffman Coding
- Lossless compression
- Variable-length codes for efficient compression
- Produces an optimal prefix code: no code is a prefix of another, and no other prefix code yields a shorter expected length for the given frequencies
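The prefix property can be checked mechanically. The table below is a hypothetical example (one valid Huffman assignment for frequencies a=5, b=2, c=1), and `is_prefix_free` is a helper written for this check, not part of the implementation above:

```python
# A hypothetical Huffman code table for frequencies a=5, b=2, c=1
codes = {'a': '0', 'c': '10', 'b': '11'}

def is_prefix_free(codes):
    """Return True if no code is a proper prefix of another."""
    values = list(codes.values())
    return not any(
        x != y and y.startswith(x)
        for x in values for y in values
    )

print(is_prefix_free(codes))                  # True: decodable without delimiters
print(is_prefix_free({'a': '0', 'b': '01'}))  # False: '0' is a prefix of '01'
```

The prefix property is what lets the decoder emit a character the moment it reaches a leaf, with no separators between codes in the bit stream.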
Disadvantages
- Requires knowledge of character frequencies in advance
- Two passes over the data (one for counting frequencies, one for encoding)
- Compressed data size depends on the accuracy of frequency statistics
Applications
- Data compression in file archiving tools (e.g., zip)
- Image compression (JPEG uses Huffman coding directly; PNG uses it inside DEFLATE)
- Implementation of data structures like succinct trees
Exercise
- Modify the Huffman Coding algorithm to work with bytes instead of characters for more general data compression.
- Implement a function to calculate the compression ratio achieved by Huffman Coding for a given input.
- Create a variant of Huffman Coding that uses a predefined frequency table for English text, eliminating the need for the first pass over the data.
Summary
Today, we explored Huffman Coding, a fundamental algorithm in data compression. We implemented the algorithm, discussed its time and space complexity, and looked at its advantages, disadvantages, and real-world applications.
Understanding Huffman Coding provides insights into how efficient data compression works and demonstrates another powerful application of greedy algorithms and priority queues.
Tomorrow, we’ll study Dijkstra’s Algorithm. Stay tuned!