This post is primarily about my experience with the bloomfilter package in Haskell.
Coding it upCoding it up was much of a muchness, aside from a late realisation that there's no Hashable instance for Text which caused a quick change from Text to ByteString, but aside form that, it was very easy.
Even querying the index is easy, I shall post up the code when I am back at home, since I'm with my parents over the Christmas period.
Overall, it's roughly 38 lines, but around 7 lines of that are simple module imports.
TestingTo test it, I grabbed roughly 2.5MB of data from Project Gutenberg, and simply started querying the index. I haven't done any "proper" performance testing so I've just been feeling it out with no timers.
I didn't bother forcing anything, so the first query can take a little while as the index is built, but it's nothing prohibitive. After that, every search feels instantaneous.
After building with -rtsopts, I had it generate a heap profile, which showed that the peak memory allocated with my test set was roughly 8MB, but that fell back down quite rapidly.
ImprovementsI suspect that a carefully used deepseq, and ByteString.Lazy could really improve the memory foot print of the system, and remove the initial "hang" when doing the first search. I shall have to investigate this at a later date.
ExtensionsIf I could just get access to the underlying UArray Int Hash in a way that I could manipulate, I could make searching more probabalistic, but also much faster.
Similarly, I could extend the usage of a bloom filter into a similarity metric, but unfortunately, I'm not entirely sure how to use the underlying representation at this time.
I basically need access to the cardinality of the underlying representation, and the ability to union and intersect the underlying bit arrays. Once I have these (or worked out how to do it on the underlying representation) performing similarity matching becomes very simple.
ConclusionHaskell's bloom filter library seems to be very fast whilst maintaining an easy to use facade.
However this is not flexible enough to really play with bloom filters in all the ways they can be very easily. Perhaps basing the underlying representation on the bitset package, or offering a "serialisable" version which can be easily manipulated would be all it takes.