Haskell: Int and Integer
In my last post about the Rabin Karp algorithm I mentioned that I was having some problems when trying to write a hash function which closely matched its English description.
rm-1 * ascii char) + (rm-2 * ascii char) + … (r0 * ascii char % q where r = 256, q = 1920475943
This is my current version of the hash function:
hash = hash' globalR globalQ
hash' r q string m = foldl (\acc x -> (r * acc + ord x) `mod` q) 0 $ take m string
And my initial attempt to write the alternate version was this:
hash2 :: [Char] -> Int -> Int
hash2 = hash2' globalR globalQ
hash2' r q string m = (flip mod q . sum . map (\(pow, char) -> ord char * (r ^ pow)) . zip [m-1, m-2..0]) string
Unfortunately this version breaks down as we try to create hashes for bigger lists of characters:
> hash "markusaere" 10
849698536
> hash2 "markusaere" 10
1239828806
We can see what’s going on if we execute the core bit of the hash function in the REPL:
> map (\(pow, char) -> ord char * 256 ^ pow) $ zip [9, 8..0] "markusaere"
[0,0,8214565720323784704,30117822508040192,128642860449792,493921239040,1627389952,6619136,29184,101]
In this case the hash value for the first character is:
> 256 ^ 9 * ord 'm'
0
Be default if we do calculations on integers the 'Int' type is used:
"Int" is the more common 32 or 64 bit integer. Implementations vary, although it is guaranteed to be at least 30 bits.
The maximum value a 64 bit Int can hold is 263, a value that we’re exceeding with the first two calculations in our list.
Since our calculation has exceeded the maximum value of a 64 bit integer we need to coerce our calculation to use the 'Integer' type which has arbitrary precision up to the limit of the machine’s memory.
We can use the 'toInteger' function to do this:
> toInteger (256 ^ 9) * toInteger (ord 'm')
514737946632791328292864
We can change the core of the function to use 'toInteger' like so:
> map (\(pow, char) -> toInteger (ord char) * 256 ^ pow) $ zip [9, 8..0] "markusaere"
[514737946632791328292864,1789334175149826506752,8214565720323784704,30117822508040192,128642860449792,493921239040,1627389952,6619136,29184,101]
We can then change the hash function to read like this:
hash2 :: [Char] -> Int -> Int
hash2 value m = fromInteger $ hash2' (toInteger globalR) (toInteger globalQ) value m
hash2' :: Integer -> Integer -> [Char] -> Int -> Integer
hash2' r q string m = (sum $ map (\(pow, char) -> toInteger (ord char) * toInteger (r ^ pow)) $ zip [m-1, m-2..0] string) `mod` q
When we call the function with the original parameters it now returns the correct answer:
> hash2 "markusaere" 10
849698536
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.