Bloom Filter

What's bloom filter

Bloom Filter is a data structure design to tell you, rapidly and memory-efficiently, whether an element is present in a set.

The price paid for this efficiency is that bloom filter is a probabilistic data structure. It tells us that the element:

  • definitely is not in the set
  • maybe in the set

适用范围:

  • 数据字典判重
  • 集合求交集
  • ...

concept

use bit array.

  • when enter an element, we hash it k times and set the bits in the bit array to 1 according to hash results
  • when test if an element exist, hash it using the same hash functions, check the bits

hash function

使用k个相互对立的hash function来将bit array中的bit set为1

false positive

如何估算false positive rate我就跳过了……lol

bloom filter不适合那么“0 tolerance"的场合,应用在容错率的场合下, 可以通过少量错误换取大量存储空间。

如何根据输入元素个数n, 确定Bitarray的大小m,hash function个数K

  • k = (ln2) * (m/n)是错误率最小
  • m >= n * lg(1/E) * 1.44 E为error rate

将设error rate为0.01, 那么 m 约为 13*n,k大概是9个

counting bloom filter

原本的Bloom filter不支持delete, 因为会影响其他元素,将Bit换为counter, 就可以支持delete了.

Reference