Updated 16/9:php_ssdeep is now in PECL so I have updated this post to reflect that.
On a recent project I needed a fast way to compare documents for likeness and return a percentage match. With much research and one unanswered Stackoverflow post later I came across Jesse Kornblum’s ssdeep utility intended for computer forensics such as looking for signatures in files when hunting rootkits etc. All the technical details of fuzzy hashing are described in his 2006 journal article Identifying almost identical files using context triggered piecewise hashing.
I was using a wrapper around the ssdeep binary written in pure PHP, but I recently decided to write my first PHP extension by tying into the fuzzy hashing API ssdeep provides. The result is now a PECL extension or module and the BSD licensed source code is hosted on PHP.net’s SVN.
There are full instructions provided in the PHP manual and the php_ssdeep site to install the exension and use the provided functions. If you end up using this extension or its code I would be very interested to find out more about your project.