To Compress or Not to Compress - That is the Question
Introduction
When using cloud storage, both upload speed and storage costs matter. Compression can affect both, for better or for worse.
In this post, we measure how compression affects the total time needed to compress and upload data. We use the Python library cshelve with Azure Blob Storage for our benchmarks.
Test Setup
We ran two sets of tests:
- Small data
- Large data
For each test, we measured upload times with and without compression.
The tests were conducted on an Ubuntu 24.04 system using Python 3.12, cshelve 1.0.0, and a premium Azure Blob Storage account located in the same Azure region as the virtual machine.
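To make "total time" concrete, here is a minimal sketch of timing a single write, with compression time counted in the total. It is not the benchmark code from this post: the helper `timed_store` is hypothetical, a plain `dict` stands in for the cloud-backed store, and compression is done explicitly with `zlib` as an assumption, whereas cshelve applies compression itself based on its configuration.

```python
import time
import zlib


def timed_store(store, key, payload, compress=False):
    """Return the seconds spent (optionally) compressing and writing one value."""
    start = time.perf_counter()
    # Compression time counts toward the total, matching the metric described above.
    data = zlib.compress(payload) if compress else payload
    store[key] = data
    return time.perf_counter() - start


# A plain dict stands in for the cloud-backed store, so no network is involved.
store = {}
payload = b"example payload " * 1024
plain_seconds = timed_store(store, "plain", payload)
compressed_seconds = timed_store(store, "compressed", payload, compress=True)
print(f"plain: {plain_seconds:.6f}s, compressed: {compressed_seconds:.6f}s")
```

With a real remote store, the write dominates the timing, which is why a smaller payload can make up for the CPU time spent compressing.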
What is cshelve?
cshelve is a Python library that provides a key-value storage interface, similar to Python’s built-in shelve module. It simplifies cloud storage interactions, making it easy to store data in Azure Blob Storage.
If you find cshelve useful, please consider starring the project on GitHub.
Experiment Details
We tested compression using two datasets:
- Small dataset:
  - String data: 30 bytes (uncompressed) → 34 bytes (compressed)
  - Binary data: 946 bytes (uncompressed) → 540 bytes (compressed)
- Large dataset:
  - String data: 160 MiB (uncompressed) → 122.99 MiB (compressed)
  - Binary data: 167.46 MiB (uncompressed) → 34.22 MiB (compressed)
We measured upload times with and without compression for both cases.
Note: The size values reflect data stored in Azure Blob Storage.
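The small string growing after compression is expected: every compressed stream carries a fixed header and checksum, and a very short random input leaves nothing to shrink. The sketch below illustrates this with Python's standard `zlib` module. Using `zlib` is an assumption made for illustration; the exact byte counts depend on the algorithm and on any metadata stored with the value, so they need not match the figures above.

```python
import random
import string
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
alphabet = string.ascii_letters + string.digits
small_text = "".join(rng.choices(alphabet, k=30)).encode("ascii")

compressed = zlib.compress(small_text)
# For a short random input, the compressed form is larger than the original.
print(len(small_text), "->", len(compressed))
```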
Configuration Files
We used two configuration files:
With Compression (compression.ini)
Without Compression (no-compression.ini)
Benchmarking Code
Small Data Upload Test
Large Data Upload Test
Results
Small Data Test
Large Data Test
Insights
For small data, compression increases completion times, making it generally unnecessary. However, for large data, compression improved upload times for binary files but slowed down string uploads. This is because our benchmark used fully random string data, which is hard to compress. If we replace the random string with:
the compression time drops to 5 seconds, and the file size shrinks to 715 KiB!
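The gap between random and repetitive text is easy to reproduce on a smaller scale. The following sketch compares compression ratios with the standard `zlib` module. The repeated phrase is a hypothetical stand-in, not the replacement string used in the benchmark, and `zlib` itself is an assumption about the algorithm.

```python
import random
import string
import zlib

SIZE = 256 * 1024  # 256 KiB keeps the demo fast; the benchmark used far larger data

rng = random.Random(0)
alphabet = string.ascii_letters + string.digits
random_text = "".join(rng.choices(alphabet, k=SIZE)).encode("ascii")

# Hypothetical repetitive input, chosen only for illustration.
phrase = b"the quick brown fox jumps over the lazy dog "
repetitive_text = (phrase * (SIZE // len(phrase) + 1))[:SIZE]


def ratio(data):
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data)) / len(data)


random_ratio = ratio(random_text)
repetitive_ratio = ratio(repetitive_text)
print(f"random: {random_ratio:.3f}, repetitive: {repetitive_ratio:.3f}")
```

Random alphanumeric text still shrinks somewhat, because each character uses only about 6 of its 8 bits, but the repetitive text compresses far better.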
Should You Compress?
It depends on your data. Testing on real-world datasets is ideal. Since this can be cumbersome, tools like cshelve are working on algorithms to automatically decide the best approach. Until then, a good rule of thumb is:
- Compression helps with large datasets
- Compression may hurt performance for small or highly random data
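One way to turn this rule of thumb into code is to compress a small sample and check how much it shrinks. The function below is a hypothetical sketch, not cshelve's algorithm. The name `should_compress` and every threshold in it are assumptions that would need tuning against a real workload.

```python
import random
import zlib


def should_compress(data, min_size=1024, sample_size=64 * 1024, max_ratio=0.9):
    """Guess whether compressing `data` is worthwhile by compressing a sample.

    All thresholds are illustrative assumptions, not tuned values.
    """
    if len(data) < min_size:
        return False  # small payloads: overhead tends to outweigh the savings
    sample = data[:sample_size]
    # Level 1 is the fastest zlib setting, which keeps the probe cheap.
    ratio = len(zlib.compress(sample, level=1)) / len(sample)
    return ratio <= max_ratio  # skip data that barely shrinks


rng = random.Random(0)
random_bytes = bytes(rng.getrandbits(8) for _ in range(32 * 1024))
repetitive_bytes = b"sensor=42;" * 10_000

print(should_compress(b"tiny"))
print(should_compress(random_bytes))
print(should_compress(repetitive_bytes))
```

Sampling only the start of the payload is cheap but can mislead when the data is not uniform, so a check like this complements testing on real data rather than replacing it.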
Conclusion
Whether to use compression depends on your data type and size.
From experience, compression is usually beneficial, but testing on your specific workload is the best approach.
Thankfully, cshelve makes it easy to enable or disable compression with a simple configuration change.