Shuffling hashing on builtin ints in Python
Background
Python builtin hashing
Python has this really convenient feature where hashing is a builtin function. It works great for many things, including dictionnaries (hashmaps). It also does a great job at making sure you don’t need to take a look at its inner workings or even just the result of the hashing function itself. It works, and that’s good enough.
Hashing for lossy label compression
However, I had a situation where I have labeled regions of space in a volumetric image. A lot of regions, more than would comfortably fit in a 16-bit integer. But I really want to use 16 bits of resolution so as not to blow up the file size - in volumetric images, this adds up quickly and can be the difference between a manageable 4GB file and an unmanageable 8GB one.
So, I figured out that having a 16-bit hash of the “true” label would be a good enough approximation. There is the obvious problem of hash collision, which I deemed insignificant enough for my use-case (the probability of having two regions that touch each other and have the same hash out of 65536 possible values is small enough). Additionally, a hash function applied on the labels would break the structure given by the iteration order of the labelling function, which is a net positive for visualization (two regions close have different labels, so they have different colors).
Why the builtin hash is not good enough
My first idea was to simply use Python’s hash function. I never really used it explicitely before, so I tried and… I didn’t see any difference.
Wait a minute.
1
2
3
4
5
>>> hash(1)
1
>>> hash(70_000)
70000
>>> hash(int(2**32))
Shoot. Let’s see how it holds up.
1
2
for pow in range(64):
print(f"For power {pow}: {hash(int(2**pow - 1)) == int(2**pow - 1)=}")
This snippet shows me that it is true until pow=60 inclusive. In fact, 2**61 - 1 is the first value where this pattern breaks down.
I wanted to make sure why this happens: the hash is based on value modulo 2**p - 1 with p being 31 or 61 depending on the system being 32 or 64 bits respectively. It’s no coincidence that these are Mersenne prime exponents. See the CPython source code for computing hashes and for defining the relevant constants. It’s an implementation detail and can change for other Python implementations and/or versions.
Anyways, what matters for me here is that for small integer values of x, hash(x)==x, which is not what I want.
Solution
It turns out that compound objects have their hash computed based on their address in memory. So, if we can encapsulate our label in a lightweight wrapper that changes the hashing strategy, we can get the properties we want. In my case, I simply used a 1-tuple.
As an added bonus, we clip the value to only take the least significant bits to ensure our hash fits in 16-bits (or however many you need).
A full implementation is given below.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
from typing import Callable
def hash_factory(n) -> Callable[..., int]:
"""
Creates a hash function that produces non-negative
integer hashes of a specified bit length.
Parameters
----------
n : int
The number of bits for the hash output.
The resulting hash will be in the range [0, 2^n - 2].
Returns
-------
function
A function that takes an input (int or tuple)
and returns a non-negative integer hash of n bits.
Notes
-----
- If the input is an int, it is converted to a tuple before hashing.
This is to ensure that hash(n) != n.
- The hash value is always non-negative and fits within n bits.
"""
def _hash(input) -> int:
if isinstance(input, int):
x = (input,)
else:
x = input
h = hash(x)
# turn as non-negative and clip to n bits
return abs(h) % ((1<<n)-1)
_hash.__doc__ = f"""\
Computes a non-negative integer hash of the input, constrained to {n} bits.
Parameters
----------
input : int or tuple
The value to hash. If an int is provided, it is converted
to a tuple to avoid collisions with tuple inputs.
Returns
-------
int
A non-negative integer hash value in the range [0, 2^{n} - 1].
"""
return _hash