Multi-Resolution Similarity Hashing

August 13, 2007

Full Text

Multi-Resolution Similarity Hashing, DFRWS 2007

Abstract

Large-scale digital forensic investigations present at least two fundamental challenges. The first is meeting the computational demands of processing a large amount of data. The second is extracting useful information from the raw data in an automated fashion. Either problem can result in long processing times that seriously hamper an investigation.

In this paper, we discuss a new approach to one of the basic operations that is invariably applied to raw data: hashing. The essential idea is to produce an efficient and scalable hashing scheme that can supplement traditional cryptographic hashing during the initial pass over the raw evidence. The goal is to retain enough information to allow binary data to be queried for similarity at various levels of granularity without any further pre-processing or indexing.

The specific solution we propose, called a multi-resolution similarity hash (or MRS hash), is a generalization of recent work in the area. Its main advantages are robust performance (raw speed comparable to a high-grade block-level crypto hash), scalability (the ability to compare targets that vary in size by orders of magnitude), and space efficiency (a hash that is typically below 0.5% of the size of the target).
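
To put the space-efficiency figure in concrete terms, the following is a minimal sketch that treats the stated 0.5% as an upper bound on hash size and computes the resulting storage budget for a few target sizes. The function name `max_digest_bytes` and the per-mille parameter are illustrative assumptions, not part of the MRS hash itself, and the actual hash size will depend on the data and on the scheme's parameters.

```python
# Illustrative only: treats the "typically below 0.5%" figure from the
# abstract as an upper bound. The helper name and its parameter are
# assumptions made for this sketch.

def max_digest_bytes(target_bytes: int, overhead_permille: int = 5) -> int:
    """Return the largest hash size (in bytes) that stays within the
    given overhead, expressed in parts per thousand (5 = 0.5%)."""
    if target_bytes < 0:
        raise ValueError("target_bytes must be non-negative")
    # Integer arithmetic avoids floating-point rounding on large targets.
    return target_bytes * overhead_permille // 1000


if __name__ == "__main__":
    GIB = 1024 ** 3
    MIB = 1024 ** 2
    targets = [("1 GiB", GIB), ("100 GiB", 100 * GIB), ("1 TiB", 1024 * GIB)]
    for label, size in targets:
        budget = max_digest_bytes(size)
        print(f"{label} target: hash of at most {budget / MIB:.2f} MiB")
```

Under this assumption, a 1 GiB target corresponds to a hash of roughly 5 MiB at most, which is small enough to keep alongside the conventional cryptographic hashes produced during the same pass.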