Bloom filter is a probabilistic data structure designed to tell you if a member is in a set in a highly memory and time efficient manner. In that sense, it is similar to a set, but it does it with an adjustable false positive rate.
A brilliant example is provided in this blog post.
Bloom filter is especially fit for large set, where it is hard to fit the whole set into main memory as it is, and when it is okay to allow certain rate of false positives.
One typical example is account name selection while creating your gmail account. You might wonder how does google know so fast that if an email account is taken in a blink? Think about it, Gmail has around 1.5 billions users, it is impossible to search through all of them. Nor it is ideal to store all those existing email addresses in main memory.
Instead, we can use a bit array with some help from a few hash functions to encode existing email addresses. When a user types a desired new address, we just need to check if the hash outputs (indices in the bit array) are all 1’s. If they answer is yes, there is high probability this address is already taken, otherwise, this address is definitely not taken.
The false positive rate is approximated by:
where m is number of bits in the bit array, n is the number of elements in the existing set, k is the number of hash functions.
To minimize the false positive rate given fixed m and n, we can choose the number of hash functions following:
- For the calculation of False Positive rate: see wikipedia: https://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives
- For an interactive example: https://llimllib.github.io/bloomfilter-tutorial
- For another easy read on Medium: https://medium.com/techspace-usict/bloom-filter-a-probabilistic-search-approach-in-data-structures-cb3dd48c6632
- A python C extension: http://axiak.github.io/pybloomfiltermmap/ref.html