Smoke – A unified means of generating, transmitting, encapsulating, and validating multiple hash digests simultaneously to replace existing stand-alone hash digest software. The software generates digests in parallel and is notably faster than using individual algorithms serially on large files. Smoke operates much the same way as existing hash digest tools, like md5sum, and is designed to be a full replacement.

Impetus

An important component of Information Security is the CIA triad: Confidentiality, Integrity, and Availability. Today, I talk about file integrity, weak hashing, and how it affects some of my clients. Clients who use legacy hash validation algorithms like md5 need stronger hashes while maintaining backward compatibility and speeding up hash digest generation, so I created a new mechanism for hash digest management: Smoke.

I always recommend using stronger hashing algorithms since both md5 and sha1 are subject to collisions, even for small files, as demonstrated here, or for valid files in complex formats, such as in these jpeg images. For the large data dumps or the small files of financial directives, an injected collision could have disastrous consequences for my clients and their customers.

My clients distribute sets of files to many of their customers. These files can be quite different: one client sends very large multi-terabyte data files and other clients send smaller files containing transactional commands. For a variety of legacy and mainframey reasons, the downstream customers of our clients only use older hashing algorithms, like md5 and sha1, to validate that the files were correctly transferred. My clients would like to force their customers to upgrade, but these downstream systems do not support the more secure hash validation mechanisms.

For the client who sends large files out, the pseudo-code used in their system is essentially:

The clients’ customers are finally demanding better hashing algorithms. Unfortunately, downstream Customer A wants to move to sha256 while Customer B desires sha512. Yet Customers C & D still need legacy sha1 and md5 for the foreseeable future, so these need to stay too. A lazy code change to add the two new algorithms might result in something like:

But reading a massive file, anywhere from multiple gigabytes to a few terabytes in size, four times sequentially from disk is an inefficient means of computing all of these sums. Doing this below won’t help either:

That might actually be worse, as an amazing amount of disk head thrash will occur when one of the more complex hash processes falls behind the others and the disk cache is flushed. Ideally, we want:

Now we’re talking about a real program. So, I wrote one for these clients.
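The read-once idea can be sketched in a few lines of Python. This is a minimal illustration and not Smoke's actual code: the function name `multi_digest`, its parameters, and the 1 MB chunk size are assumptions made for the example. Each chunk is read from disk a single time and handed to every hash object.

```python
import hashlib


def multi_digest(path, algorithms=("md5", "sha1", "sha256", "sha512"),
                 chunk_size=1024 * 1024):
    """Read the file once and feed every chunk to every hash object.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

This version still hashes serially on one core; it only removes the repeated disk reads.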

Existing Software

I looked for a simple, pre-existing tool to generate these multiple hashes. Fancy tee commands are suggested in this article, but it gets really complicated with four hash algorithms and using /proc dependencies. A simple Python script for multiple hashes is shown there too, but the script only dumps the hash – there is no validation.

Another bit of software called Quickhash will also generate and compare file hashes for multiple algorithms, but it does not have a command line or API interface. The HashDeep suite does contain command line and API, but after consideration of HashDeep’s complexity and implementation, it was not used. It is probably possible to extend HashDeep to implement all features of Smoke and remove the need for a Python install; but it would be a complex undertaking.

The Python Snakeoil project does perform threaded hashing using multiple algorithms, but contains no file checksum verification functionality. The code was designed as a compatibility crutch for older Python versions or missing OS utilities. For this and other reasons, I decided not to use this software as a starting point.

Introducing Smoke

As none of the existing multi-hash tools would work, I created a new tool that can generate or validate hashes for files and called it Smoke. In determining operational needs, I examined the existing single-hash command line tools (both OS-based and GNU-based). Running them, we can observe semi-standardized formats for reporting hashes on the cleartext “t123”:

The BSD-based version of md5 used in OS X produces a verbosely formatted “tagged” output that is to be avoided. In fact, both BSD-native md5 and shasum programs produce this tagged format, yet OS X remixes things up by using the BSD md5 and a Perl script for shasum, thus the output formats differ. The GNU and Perl versions of md5 and sha1 are basically “hash whitespace filename”. Sometimes whitespace is spaces, sometimes it is a tab. There is also an asterisk (*) added in instances where hashes are computed on binary files versus treating files as text (not shown here). Some software used a dash (-) for standard in, others did not.

From these various existing output format ideas, I decided to use delimited formatting for Smoke, but with a single tab for the whitespace for easier downstream parsing. I also decided that a binary-only assumption is better and eschewed the asterisk completely. Since Smoke utilizes multiple hashes, the output must contain the hash name. Thus, in keeping the same tab delimitation, each line might be:

An argument can be made that algorithms have known bit-lengths, thus you can completely eschew providing the algorithm name for each line. So, 32 hex digits means md5, 96 means sha384. Bad idea, because: 1) a future algorithm may have the same output bit length as an existing one and 2) there are commonly used truncated hashes, like sha-512/256, which happens to have the same bit-length as sha256. Thus, the hash algorithm must be specified.
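The length ambiguity is easy to see with Python's hashlib. The short check below uses three algorithms that hashlib guarantees on every platform (sha256, sha3_256, and blake2s) rather than sha-512/256, whose availability depends on the OpenSSL build. All three emit 64 hex digits, so length alone cannot identify the algorithm.

```python
import hashlib

# Three unrelated algorithms that all produce 256-bit digests.
same_length = ["sha256", "sha3_256", "blake2s"]

digests = {name: hashlib.new(name, b"t123").hexdigest() for name in same_length}
lengths = {name: len(value) for name, value in digests.items()}

print(lengths)  # {'sha256': 64, 'sha3_256': 64, 'blake2s': 64}
```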

After some testing and getting feedback from clients, we determined the duplication of filenames per line is also unneeded. The tab/newline format assists when a human reads the file, but this is not necessary for a machine. This format also makes stream processing harder if files become sorted; you’d have to consume the whole SUMS file to find all hash digests for a single filename. So, I chose a single-line implementation instead: hash1=val1;hash2=val2 (tab) filename. The previous example would then be like so:

This is slightly less readable for a human, but much nicer to process and conceptualize. One item per line; first part is the multi-hash, second is filename. Stdin is specified with a dash (-) as the filename.
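Producing a line in this format takes very little code. The helper below is a hypothetical sketch (the name `format_smoke_line` is not from the Smoke source); it joins name=value pairs with semicolons and separates them from the filename with a single tab.

```python
def format_smoke_line(digests, filename="-"):
    """Build one 'hash1=val1;hash2=val2<TAB>filename' line.

    digests maps algorithm name to hex digest; "-" stands for stdin.
    Hypothetical helper for illustration only.
    """
    pairs = ";".join(f"{name}={value}" for name, value in digests.items())
    return f"{pairs}\t{filename}"
```

For instance, `format_smoke_line({"md5": "aa", "sha1": "bb"}, "t123")` yields `md5=aa;sha1=bb`, a tab, then `t123` (the digest values here are placeholders, not real hashes).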

The Smoke file format will also ignore empty lines and lines starting with “#” as a convenience. I coded my Python parser to be more forgiving by stripping whitespace from the hashes and to deal with spaces versus tabs – but the standard file should always use tabs to separate fields.

The hash names are normalized: lowercase name, no dashes. Thus, SHA-1 becomes sha1. I did not define a special case for “sha”, it must be written as “sha1”. “SHA” by itself is too ambiguous.
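A forgiving parser along these lines might look like the sketch below. It is an assumption-laden illustration, not the parser shipped with Smoke: the function names are invented, and the fallback of splitting on the first run of whitespace when no tab is present is one possible reading of "deal with spaces versus tabs".

```python
def normalize_name(name):
    """Lowercase the algorithm name and drop dashes: 'SHA-1' -> 'sha1'."""
    return name.strip().lower().replace("-", "")


def parse_smoke_line(line):
    """Return (digests, filename), or None for blank and '#' comment lines.

    Hypothetical helper for illustration only.
    """
    text = line.rstrip("\r\n")
    if not text.strip() or text.lstrip().startswith("#"):
        return None
    if "\t" in text:
        # Standard form: a tab separates the hashes from the filename
        hash_field, filename = text.split("\t", 1)
    else:
        # Forgiving fallback (assumption): split on the first run of spaces
        parts = text.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"malformed smoke line: {line!r}")
        hash_field, filename = parts
    digests = {}
    for pair in hash_field.split(";"):
        if not pair.strip():
            continue
        name, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in {pair!r}")
        digests[normalize_name(name)] = value.strip().lower()
    return digests, filename
```

Splitting on the tab first keeps filenames that contain spaces intact, which is one practical reason the standard file should always use tabs.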

Additionally, the hash is always lowercase hex digits. By using hex, the hash’s length is doubled in size. Base64 would only increase the size by about 33%, but there are too many problems with Base64 and transmission. The special characters / and + get lost in a URL, plus there are the = and == string endings that might get in the way of name=value pair divination. There is the alternate Base64URL encoding which changes those / and + characters and removes the = and == endings in certain conditions. Too many variables for too little gain – thus Smoke’s input & output format requires Hex-Digits.
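The size trade-off can be checked directly with the standard library. The snippet below encodes one sha512 digest three ways; the variable names are only for illustration.

```python
import base64
import hashlib

digest = hashlib.sha512(b"t123").digest()  # 64 raw bytes

hex_text = digest.hex()
b64_text = base64.b64encode(digest).decode("ascii")
# URL-safe alphabet with the trailing '=' padding stripped
b64url_text = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")

print(len(digest), len(hex_text), len(b64_text), len(b64url_text))
# 64 128 88 86
```

Hex costs 128 characters against 88 for padded Base64, but the hex string contains no characters that need escaping in a URL or that collide with the = and ; separators of the Smoke format.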

Catching Collisions

Does Combining MD5 plus SHA1 Create A More Awesomely Secure Hash? Nope. There is a wonderful PhD dissertation by Anja Lehmann which goes into great depth as to why the combination of two different hashes is not significantly better. Some less technical explanations are offered here.

Smoke’s combining of the hashes, if anything, provides some bit of future-proofing for when (not if) a hash algorithm is deemed “broken”. Thus, if techniques to produce reasonable collisions in O(1) time for sha1 are created, the Smoked Hash will still contain the other “safe” hashes which are validated simultaneously when sha1 is computed.

Here is an example which shows how Smoke will catch a hash collision in one algorithm. Using the 128-byte md5 collision file created by Xiaoyun Wang and Hongbo Yu, here is a run of md5 and smoke in checksum mode:

As can be seen above, smoke operates just like existing hash software using -c as a checksum function, but smoke will perform the checksum operation using all hashes provided. There are two debugging lines, #INFO and #WARN, which show the validations used; these can be turned off. The software’s normal output is that the digest checksum has failed for the file, exactly as we want.
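The check-everything behaviour can be approximated as follows. This is a sketch under stated assumptions, not Smoke's verification code: `verify_bytes` is a hypothetical name, it works on in-memory data rather than a file, and it treats any name that hashlib cannot construct as unsupported and skips it.

```python
import hashlib


def verify_bytes(data, expected):
    """Check data against every digest in expected (name -> hex string).

    Returns (ok, checked, skipped). ok is True only if at least one
    algorithm was checked and every checked digest matched.
    Hypothetical helper for illustration only.
    """
    checked, skipped = [], []
    all_match = True
    for name, want in expected.items():
        try:
            hasher = hashlib.new(name, data)
        except ValueError:
            # Algorithm unknown to this Python/OpenSSL build: skip it
            skipped.append(name)
            continue
        checked.append(name)
        if hasher.hexdigest() != want.strip().lower():
            all_match = False
    return all_match and bool(checked), checked, skipped
```

A file that collides with the original under md5 would still match the md5 entry, but the sha1 and sha512 entries would differ, so the overall result is a failure.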

And even with all of these extra hashes, the software is nearly as fast as using a single hash generator.

Speed and More Speed

When generating hashes serially, hashing is very slow. Smoke tries to optimize the process by performing the slowest part of hashing only once: the disk read. Even with SSD drives, disk I/O is still slower than CPU computation. Smoke also does a second speed up: use multiple CPU cores, one per algorithm.
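One way to combine the single disk read with one worker per algorithm is sketched below. It is not the threading code from Smoke; the function names, queue depth, and chunk size are assumptions. The approach relies on a documented hashlib behaviour: for OpenSSL-backed algorithms, the GIL is released during updates on data larger than 2047 bytes, so the threads can run on separate cores.

```python
import hashlib
import queue
import threading


def _hash_worker(hasher, chunks):
    """Consume chunks from the queue until the None sentinel arrives."""
    while True:
        chunk = chunks.get()
        if chunk is None:
            break
        hasher.update(chunk)


def threaded_multi_digest(path, algorithms=("sha1", "md5", "sha512"),
                          chunk_size=1024 * 1024, depth=4):
    """Read the file once; hash each chunk in one thread per algorithm.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    # Bounded queues keep a slow algorithm from piling up unread chunks
    queues = {name: queue.Queue(maxsize=depth) for name in algorithms}
    threads = [threading.Thread(target=_hash_worker,
                                args=(hashers[name], queues[name]))
               for name in algorithms]
    for thread in threads:
        thread.start()
    try:
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(chunk_size), b""):
                for chunks in queues.values():
                    chunks.put(chunk)
    finally:
        # Always send the sentinel so the workers exit, even on read errors
        for chunks in queues.values():
            chunks.put(None)
        for thread in threads:
            thread.join()
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Because the reader blocks when any queue is full, overall throughput is bounded by the slowest algorithm.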

Here are some metrics for a 24GB file, a size large enough to remove the disk and memory caching factors. This test was performed using a five-year-old MacBook Pro with 8GB RAM and a 5400rpm drive. The average times for five runs of OS X and GNU software versus Smoke:

alg/impl        real     user      sys
md5/osx        61.83    58.53    12.34
sha1/osx       76.03    60.78    10.12
sha512/osx    124.85   111.67    10.86
md5/gnu        66.78    53.16     8.27
sha1/gnu       66.09    51.46     8.45
sha512/gnu    106.69    94.36     8.52
smoke          74.58   133.07    14.25

If you’ve never seen the real/user/sys notation, Real time is time passed in the real world (e.g., look at a wall clock). User time is how long your program runs across all CPU cores. Sys time is how much time the OS spends loading files, context switching your threads, etc.

As my implementation of smoke is multi-threaded, the user time is higher than with the other tools, but not by very much. Remember that Smoke has generated all three results. Thus, to get a real comparison with the other tools, you need to combine their sha1/md5/sha512 times together. So:

comparison                 real     user      sys
sum of osx               262.70   230.98    33.31
sum of gnu               239.57   198.99    25.25
smoke speedup vs osx     3.52 ×   1.73 ×   2.33 ×
smoke speedup vs gnu     3.21 ×   1.49 ×   1.77 ×

From this standpoint, Smoke’s real world wall clock time is 3.2-3.5 times faster overall than the other programs run individually.
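The wall-clock figure can be reproduced from the published averages. The snippet below sums the "real" column for each toolset and divides by Smoke's time. Since it works from the rounded averages in the first table, the sums can differ from the table above in the last digit.

```python
# "real" (wall-clock) averages copied from the first timing table
real_times = {
    "osx": {"md5": 61.83, "sha1": 76.03, "sha512": 124.85},
    "gnu": {"md5": 66.78, "sha1": 66.09, "sha512": 106.69},
}
smoke_real = 74.58

# Speedup = total serial time for the three tools / Smoke's single run
speedup = {impl: sum(times.values()) / smoke_real
           for impl, times in real_times.items()}

for impl, factor in speedup.items():
    print(f"smoke vs {impl}: {factor:.2f}x")
```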

The test harness shown below was run via ./time-test.sh >> outname-num.txt 2>&1 five times and the real/user/sys lines extracted into a spreadsheet. The averages for each software+algorithm were published above.

During testing, the verbose version of /usr/bin/time was used, producing extra information such as memory sizes, page faults, file blocking statistics, etc. There were differences in each piece of software, but the one that stands out was the “maximum resident set size” – the RAM used.

alg/impl              bytes           MB
md5/osx       2,406,634,291   2,295.1 MB
sha1/osx          4,265,574       4.1 MB
sha512/osx        4,294,246       4.1 MB
md5/gnu             861,798       0.8 MB
sha1/gnu            847,872       0.8 MB
sha512/gnu          864,256       0.8 MB
smoke             9,601,024       9.2 MB

The OS X / BSD version of md5 tried to map the whole 24-ish GB file into RAM, perhaps using mmap(); the others did not. Smoke’s memory footprint was around 9 MB. Reducing the memory cache buffer by half to 0.5 MB reduced the memory footprint by 0.5 MB and increased the runtime by 5-6 seconds. There is probably a happy middle ground that could be determined for speed/size trade-off. Smoke does have a command-line option to reduce the memory buffer for low memory situations. However, the Python+OpenSSL overhead does present a limit to the memory savings for this Smoke implementation.

Hashing in Smoke is done using Python’s default hashlib implementation. Each hashing algorithm could be made a little faster if a different underlying crypto library were used. In Timo Bingmann’s report, certain libraries are definitely faster than others with the same algorithm. I couldn’t find more recent speed tests for new versions of OpenSSL (used by Python) versus, say, Apple’s CoreCrypto library. But, the library used really does not matter — especially compared to disk read time and threading. A hardware implementation of the hashing algorithms should best even the fastest of libraries.

Even with threading, the speed of Smoke is still limited by the slowest algorithm. There are diminishing returns on additional parallelization of Smoke.

Algorithms Supported

So far, this paper has mentioned md5, sha1, sha256, and sha512. But, the smoke file format is algorithm agnostic. Thus, any hashing digest is supported, including md4, ripemd160, whirlpool, etc. Since this project was coded in Python, Smoke automatically inherits all digest algorithms that Python, né OpenSSL, supports. On MacOS High Sierra, this is a large set including many algorithms that I’ve only heard of at Ballmer Peak infosec events.
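To see which algorithms a given Python install offers, hashlib exposes two sets: `algorithms_guaranteed` (present on every platform) and `algorithms_available` (everything the local build and its OpenSSL provide). The exact contents of the second set vary by system, so no output is shown here.

```python
import hashlib

guaranteed = hashlib.algorithms_guaranteed
# Extras come from the OpenSSL build linked into this interpreter
extras = hashlib.algorithms_available - guaranteed

print("guaranteed:", ", ".join(sorted(guaranteed)))
print("extras:    ", ", ".join(sorted(extras)))
```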

In general, a smoked hash digest should include a minimal set of algorithms: sha1, md5, and sha512 for maximal support. Including additional algorithms has no negative impact on downstream systems. If a downstream system performing a hash verification does not support streebog512 as provided by the system creating the smoke digest, then the downstream system will simply consume the other digests provided, such as sha1 & sha512.

Output and Compatibility

Since one of my clients has downstream customers that want a single SUMS file and other customers want a different hash digest file per file on the disk, I added options to generate all of these scenarios at the same time. So, smoke can generate filename.md5 and filename.sha1 along with SUMS.smoke and SUMS.md5. I made output flexible to minimize the hash generation time.
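Once all digests for a file have been computed in one pass, fanning them out to several legacy files is just bookkeeping. The sketch below is hypothetical and does not reflect how Smoke writes its files: it names each output as the uppercase algorithm plus "SUMS" and uses the two-space separator of the GNU coreutils text-mode convention, both of which are assumptions for the example.

```python
def sums_files(results):
    """Group digests by algorithm into legacy SUMS-style line lists.

    results maps filename -> {algorithm: hexdigest}. Returns a dict that
    maps an output name such as 'MD5SUMS' to its lines.
    Hypothetical helper for illustration only.
    """
    outputs = {}
    for filename, digests in results.items():
        for name, value in digests.items():
            # Assumed layout: '<hash><two spaces><filename>' per line
            outputs.setdefault(f"{name.upper()}SUMS", []).append(
                f"{value}  {filename}")
    return outputs
```

The same `results` mapping could also feed a per-file writer (filename.md5, filename.sha1) without touching the disk data again.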

So, for a download structure like Ubuntu uses, Smoke can generate all of the SUMS files in a single optimized run.

Smoke can only validate a checksum using a SUMS.smoke file. I did not implement logic for checking filename.smoke or for the other singular hashes. Perhaps a future endeavor for someone else.

I also did not create bindings for other languages. The beauty of smoke is the generic file format, not the Python script which I coded that generates the hashes. Thus, it is relatively simple to create the smoke output format in whichever language is desired. Just make the output format hash1=val1;hash2=val2 (tab) filename (newline) and add a tiny bit of optional threading.

Implementation

Smoke is a concept: many hashes combined into a single, simple transmission format. The implementation is a command line utility called smoke and it is a feature-complete hash digest generator and validator. Here is the help execution:

For our sample runs, we use two 4-byte files t123 and t456 that contain “t123” and “t456” respectively; there is no newline at the end of either data file.

First, let’s start with the operational basics, such as how to create a hash digest from stdin:

That looks like any other digest software; the hashes followed by a tab, then by a dash (-). Next, get a digest from two input files:

Same thing, one line per file, with the hashes first, the tab, and then the name of the file. Earlier, Ubuntu’s download folder was mentioned as being easily computable with the digest files MD5SUMS, SHA1SUMS, and SHA256SUMS. How would that look on the command line?

The option --multiple-sums-digests produces the “SUMS” collection of files. Since the default is to use sha512 and not sha256, the option --use-only-algs turns off the defaults and the --use-algs and -a flags start building up the hashes you wish to use. As can be seen, you can separate the algorithms with commas or use multiple command line entries. Finally, the --smoke-file option produces the combined SMOKESUMS file, since we are going for the future of hashing here. After running this command line, here is what we get:

If the command line ./smoke -a sha256 --multiple-sums-digests --smoke-file t123 t456 was used instead, then the default algorithms (sha1 md5 sha512) would have been used plus the additionally specified sha256. As such, there would also be a SHA512SUMS file generated.

The other command usage pattern is a single digest per file, such as fileA.md5 / fileB.md5. If someone wants this single digest file for two specific algorithms, this command could be used:

To generate both the individual filename.hashtype and the HASHSUMS files (aka “Kitchen Sink”), you might do something like:

You can mix and match the options to generate the output you require.

Code

All code has been published under the Apache License on GitHub. The code’s quality is, frankly, bad – I’m not a Python expert. The code runs and does what it needs to do, but not in the “Python way”. Someone more talented than I may need to lend a bit of assistance to make the code pretty, readable, and expandable. Feel free to submit bug fixes, pulls, etc. Some future ideas:

  • Add generation of CRC32, et al as those are extremely useful. It would simply be another algorithm.
  • Get ideas for the best “short flags” on the command line. While --multiple-sums-digests is needed, maybe -M is better. Think this out before creating them and getting these command line flags set in stone.
  • The threading in Python is lazy – it could be made faster if threads were reused between files.
  • Standardized API or bindings for languages. There is a Python class, but it is probably very non-Python-like.
  • Add support to validate non-smoke checksum files, like MD5SUMS or filename.sha1.

Conclusions

Smoke aims to make hash generation quicker (lower disk I/O & parallelization), easier (one command line call to rule them all), more flexible (file format with multiple algorithms), and expandable (legacy support & future algorithms). I designed Smoke to be flexible to the needs of my clients and my Python implementation does all that they require and more. Eventually, I’d like to see smoke become part of the standard Unixy tool-set, alongside the other existing hashing tools in /usr/bin or even /sbin.

Etymology

Why name this Smoke? The hash command on Unix was already taken. Simply put, hashes are smoked. A list of hashes is a smoke stack. Transmission of them is done via smoke signal. I’m sure there are more puns to be had.

Author

Jay is a NYC #infosec professional who goes by @veggiespam on Twitter, GitHub, LinkedIn, and other networks while occasionally writing articles on Personal Site.

Feedback on this article is appreciated, contact via Direct Message on social media.