
wzjs 2025/7/28 22:03:29


1. Basic concepts

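A Bloom filter is a highly space-efficient probabilistic data structure proposed by Burton Howard Bloom in 1970. It is fast and uses little memory, but it has a nonzero false positive rate and does not support deleting elements. It consists of a long bit array (bitmap) and a set of hash functions, and its main use is testing whether an element belongs to a set.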

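It suits workloads that need to test membership of a large number of elements efficiently and can tolerate some false positive rate:

Preventing Redis cache penetration;
Blacklist filtering (spam email addresses, phone numbers, IP addresses, domain names, and so on);
Avoiding recommending items that were already recommended (news, videos, and so on);
Some databases, such as HBase, embed Bloom filters to check whether data may exist and so reduce disk I/O requests.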

2. How it works

1. Data structure

A Bloom filter is made up of a fixed-length bit array of m bits and k hash functions; its space complexity is O(m).

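Bit array

- Initially every bit is 0;
- After an element is stored, k bits set to 1 (not necessarily exclusive to that element) mark its presence;
- The bit array is what makes the Bloom filter memory-efficient. A bit array with 1,000,000 bits takes only 1,000,000 bit / 8 = 125,000 bytes = 125,000 / 1024 KB ≈ 122 KB.

Hash functions: map an input element to k positions in the bit array, which later decide whether the element is definitely absent or possibly present.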
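To make the memory arithmetic above concrete, here is a minimal sketch of a bit array backed by a long[]. The class name BitArrayDemo and its methods are hypothetical and chosen for illustration; the sketch is single-threaded and is not Guava's implementation.

```java
/** Minimal single-threaded bit array backed by long words (illustrative only). */
final class BitArrayDemo {
    private final long[] words;

    BitArrayDemo(long bits) {
        if (bits <= 0) {
            throw new IllegalArgumentException("bits must be positive");
        }
        // ceil(bits / 64) long words are enough to hold all bits
        this.words = new long[(int) ((bits + 63) >>> 6)];
    }

    /** Sets the bit and returns true if it was previously 0. */
    boolean set(long index) {
        int word = (int) (index >>> 6);   // which long holds the bit
        long mask = 1L << index;          // the shift uses only the low 6 bits of index
        boolean changed = (words[word] & mask) == 0;
        words[word] |= mask;
        return changed;
    }

    boolean get(long index) {
        return (words[(int) (index >>> 6)] & (1L << index)) != 0;
    }

    /** Memory taken by the words themselves, ignoring object headers. */
    long byteSize() {
        return (long) words.length * Long.BYTES;
    }

    public static void main(String[] args) {
        BitArrayDemo bits = new BitArrayDemo(1_000_000);
        bits.set(42);
        System.out.println(bits.get(42));    // true
        System.out.println(bits.get(43));    // false
        System.out.println(bits.byteSize()); // 125000
    }
}
```

One million bits fit in 15,625 long words, which is the 125,000 bytes computed above.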

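In practice, a Bloom filter is initialized from the expected number of insertions (expectedInsertions) and the false positive probability (fpp). For example, Guava creates one like this: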

/**
 * Creates a {@link BloomFilter} with the expected number of insertions and expected false
 * positive probability.
 *
 * <p>Note that overflowing a {@code BloomFilter} with significantly more elements than specified,
 * will result in its saturation, and a sharp deterioration of its false positive probability.
 *
 * <p>The constructed {@code BloomFilter} will be serializable if the provided {@code Funnel<T>}
 * is.
 *
 * <p>It is recommended that the funnel be implemented as a Java enum. This has the benefit of
 * ensuring proper serialization and deserialization, which is important since {@link #equals}
 * also relies on object identity of funnels.
 *
 * @param funnel the funnel of T's that the constructed {@code BloomFilter} will use
 * @param expectedInsertions the number of expected insertions to the constructed {@code
 *     BloomFilter}; must be positive
 * @param fpp the desired false positive probability (must be positive and less than 1.0)
 * @return a {@code BloomFilter}
 */
public static <T extends @Nullable Object> BloomFilter<T> create(
    Funnel<? super T> funnel, int expectedInsertions, double fpp) {
  return create(funnel, (long) expectedInsertions, fpp);
}

Guava's BloomFilter stores its bit array (also called a bit vector) in a custom LockFreeBitArray class, which is essentially an array of long values (an AtomicLongArray):

@VisibleForTesting
static <T extends @Nullable Object> BloomFilter<T> create(
    Funnel<? super T> funnel, long expectedInsertions, double fpp, Strategy strategy) {
  checkNotNull(funnel);
  checkArgument(
      expectedInsertions >= 0, "Expected insertions (%s) must be >= 0", expectedInsertions);
  checkArgument(fpp > 0.0, "False positive probability (%s) must be > 0.0", fpp);
  checkArgument(fpp < 1.0, "False positive probability (%s) must be < 1.0", fpp);
  checkNotNull(strategy);

  if (expectedInsertions == 0) {
    expectedInsertions = 1;
  }
  /*
   * TODO(user): Put a warning in the javadoc about tiny fpp values, since the resulting size
   * is proportional to -log(p), but there is not much of a point after all, e.g.
   * optimalM(1000, 0.0000000000000001) = 76680 which is less than 10kb. Who cares!
   */
  long numBits = optimalNumOfBits(expectedInsertions, fpp);
  int numHashFunctions = optimalNumOfHashFunctions(fpp);
  try {
    // Core construction logic
    return new BloomFilter<>(new LockFreeBitArray(numBits), numHashFunctions, funnel, strategy);
  } catch (IllegalArgumentException e) {
    throw new IllegalArgumentException("Could not create BloomFilter of " + numBits + " bits", e);
  }
}

LockFreeBitArray(long bits) {
  checkArgument(bits > 0, "data length is zero!");
  // Avoid delegating to this(long[]), since AtomicLongArray(long[]) will clone its input and
  // thus double memory usage.
  // A bit vector of length m needs a long array of length m/64, rounded up.
  this.data =
      new AtomicLongArray(Ints.checkedCast(LongMath.divide(bits, 64, RoundingMode.CEILING)));
  this.bitCount = LongAddables.create();
}

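2. Inserting an element
Inserting into a Bloom filter takes O(k) time. When an element e1 is inserted, the following happens:

Using the k hash functions, compute the k hash values h1, h2, ..., hk of the element, where hi = hashi(e1);
Set the bit at each corresponding index (the hash value reduced modulo m) to 1, that is, bitmap[hi] = 1.

When a second element e2 is inserted, some of its positions may collide with bits that are already set.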

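Guava source code for inserting an element
Notably, Guava's BloomFilter derives its k hash functions from the following formula:

$$g_i(x) = h_1(x) + i \cdot h_2(x) + i^2$$

Here gi(x) is the i-th hash function (0 ≤ i < k). Two base hash functions h1(x) and h2(x) are defined up front, and the k hash values are computed as follows: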

// Preparation: h1 = hash1(input), h2 = hash2(input)
// Compute the k hash values
g0(x) = h1                           // value of hash function 0
g1(x) = h1 + h2 + 1                  // value of hash function 1
g2(x) = h1 + 2*h2 + 4                // value of hash function 2
...
gk-1(x) = h1 + (k-1)*h2 + (k-1)^2    // value of hash function k-1

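Following this scheme, Guava's BloomFilter generates its k hash values as follows.
1. Hash the input element with murmur3_128 and take 64 bits of the result (asLong()), then split them in two: the low 32 bits become hash1 and the high 32 bits become hash2;
2. Compute combinedHash from the formula, omitting the third (quadratic) term of the sum;
3. Take combinedHash modulo the bit array length to get the actual index to set.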
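The three steps above can be sketched without Guava as follows. This is only an illustration: the class DoubleHashingDemo is hypothetical, and an FNV-1a hash stands in for murmur3_128 (an assumption made to keep the example dependency-free), so the indexes it prints will differ from Guava's.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Derives k bit positions from one 64-bit hash, in the style described above. */
final class DoubleHashingDemo {

    /** FNV-1a 64-bit hash; a simple stand-in for murmur3_128 (assumption). */
    static long fnv1a64(byte[] data) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    /** Returns the k indexes in [0, m) that an element maps to. */
    static long[] indexes(String key, int k, long m) {
        long hash64 = fnv1a64(key.getBytes(StandardCharsets.UTF_8));
        int hash1 = (int) hash64;          // low 32 bits
        int hash2 = (int) (hash64 >>> 32); // high 32 bits
        long[] result = new long[k];
        for (int i = 1; i <= k; i++) {
            int combinedHash = hash1 + i * hash2; // h1 + i*h2, quadratic term omitted
            if (combinedHash < 0) {
                combinedHash = ~combinedHash;     // flip the bits to make it non-negative
            }
            result[i - 1] = combinedHash % m;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(indexes("user_1", 5, 1000)));
    }
}
```

Only one real hash computation is needed per element; the remaining positions are cheap integer arithmetic, which is the point of this construction.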

public <T> boolean put(
    T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
  long bitSize = bits.bitSize();
  long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
  int hash1 = (int) hash64;
  int hash2 = (int) (hash64 >>> 32);

  boolean bitsChanged = false;
  for (int i = 1; i <= numHashFunctions; i++) {
    int combinedHash = hash1 + (i * hash2);
    // Flip all the bits if it's negative (guaranteed positive number)
    if (combinedHash < 0) {
      combinedHash = ~combinedHash;
    }
    bitsChanged |= bits.set(combinedHash % bitSize);
  }
  return bitsChanged;
}

// Sets bit bitIndex of the bit array to 1
boolean set(long bitIndex) {
  if (get(bitIndex)) {
    return false;
  }
  // Index of the long word that holds bitIndex
  int longIndex = (int) (bitIndex >>> 6);
  // OR the mask into the existing word to get the new value
  long mask = 1L << bitIndex;
  long oldValue;
  long newValue;
  do {
    // oldValue 0011...1000...1010
    // mask     0000...0100...0000
    //          ------------------
    // newValue 0011...1100...1010
    oldValue = data.get(longIndex);
    newValue = oldValue | mask;
    if (oldValue == newValue) {
      return false;
    }
    // CAS loop for multi-threaded access
  } while (!data.compareAndSet(longIndex, oldValue, newValue));
  // We turned the bit on, so increment bitCount.
  bitCount.increment();
  return true;
}

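3. Looking up an element
Looking up an element takes O(k) time. To check whether a given element ei is in the Bloom filter:
1. Using the k hash functions, compute the k hash values h1, h2, ..., hk, where hi = hashi(ei);
2. Check whether the bits at all of those positions equal 1, that is, bitmap[hi] = 1 for every i.
There are two possible outcomes:

At least one position is not 1: ei is definitely not in the Bloom filter.
All positions are 1: ei may be in the Bloom filter.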

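Bits set by several other elements can together cover all of ei's positions and cause a false positive; both bit sharing and hash collisions contribute. As more elements are inserted the false positive rate therefore keeps rising, and it approaches 100% once the array saturates. In the extreme case, if every hash value computed for ei equals the corresponding value for e1, ei is reported as present even though it was never inserted. There are essentially two ways to lower the false positive rate:

Increase the length of the bit array;
Tune the number of hash functions toward its optimum for the given m and n (adding hash functions beyond that point raises the rate again), or use hash functions with fewer collisions.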
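The insert and lookup procedures, and the way false positives grow as the filter fills up, can be tried out with a small self-contained sketch. SimpleBloomFilter is a hypothetical class written for this illustration; it uses java.util.BitSet and an FNV-1a based hash (both assumptions), not Guava's lock-free bit array or murmur3_128.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Minimal Bloom filter sketch: m bits, k derived positions per key. */
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int m;
    private final int k;

    SimpleBloomFilter(int m, int k) {
        if (m <= 0 || k <= 0) {
            throw new IllegalArgumentException("m and k must be positive");
        }
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    /** FNV-1a plus one mixing round; an illustrative hash, not murmur3_128. */
    private static long hash64(String key) {
        long h = 0xcbf29ce484222325L;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        return h;
    }

    /** i-th position: (h1 + i * h2) mod m, kept non-negative by floorMod. */
    private int position(long hash, int i) {
        int h1 = (int) hash;
        int h2 = (int) (hash >>> 32);
        return Math.floorMod(h1 + i * h2, m);
    }

    void put(String key) {
        long hash = hash64(key);
        for (int i = 0; i < k; i++) {
            bits.set(position(hash, i));
        }
    }

    boolean mightContain(String key) {
        long hash = hash64(key);
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(hash, i))) {
                return false; // one zero bit proves the key was never inserted
            }
        }
        return true; // all bits set: present, or a false positive
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 14, 7);
        int inserted = 0;
        for (int target : new int[] {500, 2000, 8000}) {
            for (; inserted < target; inserted++) {
                filter.put("user_" + inserted);
            }
            int falsePositives = 0;
            for (int i = 0; i < 10_000; i++) {
                if (filter.mightContain("absent_" + i)) {
                    falsePositives++;
                }
            }
            System.out.println(target + " inserted, false positives per 10000 probes: "
                    + falsePositives);
        }
    }
}
```

Running main with a fixed m and k while the number of inserted keys grows should show the false positive count climbing, whereas inserted keys are never reported as absent.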

Guava source code for looking up an element:

public <T> boolean mightContain(
    T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
  long bitSize = bits.bitSize();
  byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
  long hash1 = lowerEight(bytes);
  long hash2 = upperEight(bytes);

  long combinedHash = hash1;
  for (int i = 0; i < numHashFunctions; i++) {
    // Make the combined hash positive and indexable
    if (!bits.get((combinedHash & Long.MAX_VALUE) % bitSize)) {
      return false;
    }
    combinedHash += hash2;
  }
  return true;
}

// Checks whether bit bitIndex of the bit array is 1.
// 1. bitIndex >>> 6: selects the 64-bit word that holds bitIndex;
// 2. & (1L << bitIndex): isolates the bit at bitIndex within that word.
// 0011...1100...1010
// 0000...0100...0000
//         1
boolean get(long bitIndex) {
  return (data.get((int) (bitIndex >>> 6)) & (1L << bitIndex)) != 0;
}

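4. Deleting an element
A plain Bloom filter cannot delete elements.
The reason is simple: when some bits are shared by different elements, deleting one element would reset all of its bits to 0, including the shared ones, so the other elements that use those bits would then be reported as absent.
For example, if we delete e1 and clear its bits 0, 4, and 8, a later lookup of e2, which also maps to bit 4, finds that bit at 0 and reports e2 as absent even though it was never deleted.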
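The failure described above can be reproduced with fixed bit positions. In this sketch e1 uses bits 0, 4, and 8 as in the text; the positions 4, 9, and 13 for e2 are assumed purely for illustration, and the class name NaiveDeleteDemo is hypothetical.

```java
import java.util.BitSet;

/** Shows why clearing bits to "delete" an element breaks other elements. */
final class NaiveDeleteDemo {

    /** A lookup succeeds only if every position of the element is set. */
    static boolean allSet(BitSet bits, int[] positions) {
        for (int p : positions) {
            if (!bits.get(p)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(16);
        int[] e1 = {0, 4, 8};
        int[] e2 = {4, 9, 13}; // assumed positions; bit 4 is shared with e1

        for (int p : e1) bits.set(p);
        for (int p : e2) bits.set(p);
        System.out.println(allSet(bits, e2)); // true

        // Naive deletion of e1: clear all of its bits, including shared bit 4
        for (int p : e1) bits.clear(p);
        System.out.println(allSet(bits, e2)); // false: e2 is now wrongly reported absent
    }
}
```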

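5. Relationships between the Bloom filter parameters
A Bloom filter has the following parameters:

Number of stored elements n
False positive rate p
Bit array length m
Number of hash functions k

In practice, the expected n and p are usually fixed in advance and used to determine the optimal bit array length m and number of hash functions k:

1. $m = -\dfrac{n \ln p}{(\ln 2)^2}$
2. $k = \dfrac{m}{n} \ln 2$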
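As a quick check of the two formulas, the following sketch evaluates them for n = 1,000,000 and p = 0.01. The class and method names are hypothetical, and m is rounded up here so the bit count is never smaller than the formula's value.

```java
/** Evaluates the sizing formulas m = -n ln p / (ln 2)^2 and k = (m / n) ln 2. */
final class BloomSizing {

    /** Optimal number of bits, rounded up. */
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** Optimal number of hash functions, rounded to the nearest integer, at least 1. */
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000L;
        double p = 0.01;
        long m = optimalBits(n, p);
        int k = optimalHashes(n, m);
        System.out.println("m = " + m + " bits (about " + m / 8 / 1024 + " KB)");
        System.out.println("k = " + k); // k = 7
    }
}
```

For a 1% false positive rate this works out to roughly 9.6 bits per element and 7 hash functions.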

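Guava's BloomFilter computes m and k from these formulas. The two-argument optimalNumOfHashFunctions(n, m) shown here differs from the create method quoted earlier, which calls a one-argument form optimalNumOfHashFunctions(fpp), so the two excerpts appear to come from different Guava versions: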

// Computes the optimal bit array length m
static long optimalNumOfBits(long n, double p) {
  if (p == 0) {
    p = Double.MIN_VALUE;
  }
  return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
}

// Computes the optimal number of hash functions k
static int optimalNumOfHashFunctions(long n, long m) {
  // (m / n) * log(2), but avoid truncation due to division!
  return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
}

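Suppose a Bloom filter must hold 100 million elements with a false positive rate of one in 100 million. Plugging these values into the formulas gives the required bit array length and number of hash functions, and even with this many elements and such a low false positive rate the filter takes only about 457 MiB of memory.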
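For reference, the arithmetic behind that figure, using the formulas for m and k with n = 10^8 and p = 10^-8 (values rounded):

```latex
\begin{aligned}
m &= -\frac{n \ln p}{(\ln 2)^2}
   \approx \frac{10^{8} \times 18.42}{0.4805}
   \approx 3.83 \times 10^{9}\ \text{bits}
   \approx 4.79 \times 10^{8}\ \text{bytes}
   \approx 457\ \text{MiB} \\
k &= \frac{m}{n} \ln 2
   \approx 38.34 \times 0.6931
   \approx 26.6
   \;\Rightarrow\; k = 27
\end{aligned}
```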

An online calculator for Bloom filter parameters:

https://hur.st/bloomfilter/?n=100000000&p=1.0E-7&m=&k=

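3. How to delete elements

In a plain Bloom filter each position in the bit array holds only 0 or 1, so deleting an element wipes the information in any shared positions and affects the other elements stored there.

To support deletion, each position of the bit array has to become a counter, which gives the counting Bloom filter (Counting Bloom Filter, CBF).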

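1. How it works

A CBF replaces each bit of the basic Bloom filter's bit array with a counter. Each counter is itself a small group of bits that records how many times the position has been used; 4 bits per counter is generally enough.

Inserting: each counter the hash functions map the element to is incremented by 1.

Looking up: the counters at the mapped positions are checked:

If all of them are greater than 0, the element may be present.
If any of them equals 0, the element is definitely absent.

Deleting: each counter the hash functions map the element to is decremented by 1.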
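A minimal sketch of these three operations follows. CountingBloomFilter is a hypothetical class for illustration: it stores one counter per byte for readability (a real implementation would pack two 4-bit counters into each byte), caps counters at 15 to mimic 4-bit counters, and uses an FNV-1a based hash as an assumption.

```java
import java.nio.charset.StandardCharsets;

/** Counting Bloom filter sketch: counters instead of bits, so elements can be removed. */
final class CountingBloomFilter {
    private static final int MAX_COUNT = 15; // largest value a 4-bit counter can hold
    private final byte[] counters;           // one byte per counter, for simplicity
    private final int k;

    CountingBloomFilter(int m, int k) {
        if (m <= 0 || k <= 0) {
            throw new IllegalArgumentException("m and k must be positive");
        }
        this.counters = new byte[m];
        this.k = k;
    }

    /** The k counter positions of a key, derived from one illustrative 64-bit hash. */
    private int[] positions(String key) {
        long h = 0xcbf29ce484222325L;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        int h1 = (int) h;
        int h2 = (int) (h >>> 32);
        int[] result = new int[k];
        for (int i = 0; i < k; i++) {
            result[i] = Math.floorMod(h1 + i * h2, counters.length);
        }
        return result;
    }

    void add(String key) {
        for (int p : positions(key)) {
            if (counters[p] < MAX_COUNT) {
                counters[p]++; // saturate instead of overflowing
            }
        }
    }

    boolean mightContain(String key) {
        for (int p : positions(key)) {
            if (counters[p] == 0) {
                return false;
            }
        }
        return true;
    }

    /** Decrements the key's counters; returns false if the key is definitely absent. */
    boolean remove(String key) {
        if (!mightContain(key)) {
            return false;
        }
        for (int p : positions(key)) {
            // Saturated counters are left alone: their true count is unknown
            if (counters[p] > 0 && counters[p] < MAX_COUNT) {
                counters[p]--;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        CountingBloomFilter filter = new CountingBloomFilter(1024, 3);
        filter.add("e1");
        filter.add("e2");
        filter.remove("e1");
        System.out.println(filter.mightContain("e2")); // true: e2's counters are still positive
        System.out.println(filter.mightContain("e1"));
    }
}
```

Removing an element that was never inserted but happens to pass the lookup (a false positive) would still corrupt the counters, so callers should only remove elements they know were added.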

4. Using Bloom filters

1. Google Guava BloomFilter

The Bloom filter in Google Guava suits simple filtering in single-instance applications.

Add the pom dependency:

<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>33.4.0-jre</version> <!-- use the latest stable release -->
</dependency>

package com.ds.data.bloomfilter;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;

/**
 * @author: xxx
 * @date: 2025/3/8 4:46 PM
 * @desc:
 */
public class HBaseBloom {
    public static void main(String[] args) {
        // Create the Bloom filter (10 million expected insertions, 0.000001% false positive rate)
        BloomFilter<CharSequence> bloomFilter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 10000000, 0.00000001);

        // Simulate writing HBase row keys
        for (int i = 0; i < 10000000; i++) {
            String rowKey = "user_" + i;
            bloomFilter.put(rowKey);
        }

        // Verify lookups
        System.out.println(bloomFilter.mightContain("user_999"));      // true (was inserted)
        System.out.println(bloomFilter.mightContain("user_10000001")); // false, barring a false positive

        // Count false positives over 10 million keys that were never inserted.
        // (Starting the loop at 0 would also count the 10 million keys that really are present.)
        int count = 0;
        for (int i = 10000000; i < 20000000; i++) {
            // mightContain reports whether the key may be present
            if (bloomFilter.mightContain("user_" + i)) {
                // false positive
                count++;
            }
        }
        System.out.println("Total false positives: " + count);
    }
}

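2. TairBloom

TairBloom is the Bloom filter built into Tair and suits distributed, multi-instance applications. As an implementation of a scalable Bloom filter (Scalable Bloom Filter, SBF), it can grow dynamically while keeping the false positive rate unchanged.

Add the pom dependency:

<dependency>
  <groupId>com.aliyun.tair</groupId>
  <artifactId>tairjedis-mc-sdk</artifactId>
  <version>0.0.1-SNAPSHOT</version>
</dependency>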

@Service
@Slf4j
public class TairBloomFilterServiceImpl implements TairBloomFilterService {

    // Note: SUCCESS is a constant defined elsewhere in the original project (not shown here).

    @Autowired
    private JedisCluster jedisCluster;

    private TairBloomCluster bloomCluster;

    @PostConstruct
    public void init() {
        bloomCluster = new TairBloomCluster(jedisCluster);
    }

    /**
     * Initializes a TairBloom.
     */
    @Override
    public Boolean bfCreate(BloomParameter bloomParameter) {
        log.info("bfCreate.bloomParameter:{}", JSON.toJSONString(bloomParameter));
        try {
            String bloomKey = bloomParameter.getBloomKey();
            // A key that already exists cannot be created again
            Boolean exists = jedisCluster.exists(bloomKey);
            if (exists) {
                log.info("bfCreate.exists, key:{}, exists:{}", bloomKey, exists);
                return false;
            }
            // Create a TairBloom with the given capacity and error rate
            String result = bloomCluster.bfreserve(
                    bloomKey, bloomParameter.getInitCapacity(), bloomParameter.getErrorRate());
            if (!SUCCESS.equals(result)) {
                log.info("bfCreate.fail, key:{}, result:{}", bloomKey, result);
                return false;
            }
            log.info("bfCreate.success, key:{}, result:{}", bloomKey, result);
            return true;
        } catch (Exception e) {
            log.error("bfCreate.error, bloomParameter:{}", JSON.toJSONString(bloomParameter), e);
        }
        return false;
    }

    /**
     * Deletes the TairBloom stored under the given key.
     */
    @Override
    public Boolean keyDelete(String key) {
        log.info("keyDelete.key:{}", key);
        try {
            if (null == key || key.isEmpty()) {
                return false;
            }
            Boolean exists = jedisCluster.exists(key);
            if (!exists) {
                log.info("keyDelete, key:{}, exists:{}", key, exists);
                return false;
            }
            Long result = jedisCluster.del(key);
            log.info("keyDelete.result:{}, key:{}", result, key);
            if (result > 0) {
                return true;
            }
        } catch (Exception e) {
            log.error("keyDelete.error, key:{}", key, e);
        }
        return false;
    }

    /**
     * Checks whether a TairBloom exists under the given key.
     */
    @Override
    public Boolean keyExists(String key) {
        log.info("keyExists.key:{}", key);
        try {
            return jedisCluster.exists(key);
        } catch (Exception e) {
            log.error("keyExists.error, key:{}", key, e);
        }
        return false;
    }

    /**
     * Inserts an element into the TairBloom.
     */
    @Override
    public Boolean bfAdd(String key, Long item) {
        log.info("bfAdd.key:{}, item:{}", key, item);
        try {
            if (null == key || key.isEmpty()) {
                return false;
            }
            Boolean exists = jedisCluster.exists(key);
            if (!exists) {
                return false;
            }
            Boolean result = bloomCluster.bfadd(key, item.toString());
            log.info("bfAdd.key:{}, item:{}, result:{}, ", key, item, result);
            if (!result) {
                return false;
            }
            return true;
        } catch (Exception e) {
            log.error("bfAdd.error, key:{}, item:{}", key, item, e);
        }
        return false;
    }

    /**
     * Looks up an element in the TairBloom.
     */
    @Override
    public Boolean bfExists(String key, Long item) {
        log.info("bfExists.key:{}, item:{}", key, item);
        try {
            if (null == key || key.isEmpty()) {
                return false;
            }
            Boolean exists = jedisCluster.exists(key);
            if (!exists) {
                return false;
            }
            Boolean result = bloomCluster.bfexists(key, item.toString());
            log.info("bfexists.key:{}, item:{}, result:{}, ", key, item, result);
            if (!result) {
                return false;
            }
            return true;
        } catch (Exception e) {
            log.error("bfexists.error, key:{}, item:{}", key, item, e);
        }
        return false;
    }
}

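5. Bloom filters in HBase

HBase uses Bloom filters on its read path: each HFile carries a Bloom filter that is consulted during reads to quickly decide whether a row key may exist in that HFile, avoiding unnecessary disk I/O.

Within HBase's storage architecture, Bloom filters mainly optimize two scenarios:

1. StoreFile-level filtering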

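Read performance is tuned through column-family settings such as the following (shown schematically; in practice these attributes are set through the HBase shell or the Admin API):

<!-- HBase table configuration example -->
<HColumnDescriptor>
  <NAME>user_data</NAME>
  <BLOOMFILTER>ROW</BLOOMFILTER> <!-- ROW or ROWCOL mode -->
  <COMPRESSION>SNAPPY</COMPRESSION>
</HColumnDescriptor>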

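ROW mode: suited to workloads dominated by point lookups; the filter is built on the row key (RowKey)
ROWCOL mode: suited to column-level queries; the filter is built on the row key plus column qualifier

Note: Bloom filters bring no benefit to full table scans, so disable them for scan-only workloads.

Workflow: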

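The client issues a Get request
The RegionServer checks the MemStore first
On a MemStore miss, it checks the Bloom filter:

- The filter returns "absent" → skip that StoreFile
- The filter returns "possibly present" → go on to scan the HFile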

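2. Bloom filter index

In HBase's storage architecture the Bloom filter acts as a kind of secondary index: its job is to rule out irrelevant data blocks quickly through a probabilistic check.

HFile internal structure: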

HFile
├── Data Blocks (the actual stored data)
├── Meta Blocks (metadata)
├── Bloom Block (Bloom filter data)
└── Trailer (index pointers) → locates data blocks quickly via the Bloom filter

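Workflow:

The client issues a Get(rowkey) request
The RegionServer checks the MemStore first
On a MemStore miss, for each HFile it visits:

- Read the HFile's Trailer to find the location of the Bloom Block
- Load the Bloom filter into memory (LRU cache)
- Hash the rowkey to decide whether it may exist in that HFile
- If the Bloom filter returns false, skip scanning that HFile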
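The read path above can be simulated in plain Java to see where the Bloom filter saves work. Everything in this sketch is hypothetical and simplified: BloomReadPathSketch and FileSketch are stand-ins invented for illustration, an in-memory map plays the role of on-disk data, and the hash is an FNV-1a variant. None of it is HBase code.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simulated Get path: MemStore first, then files whose Bloom filter does not rule the key out. */
final class BloomReadPathSketch {

    /** Hypothetical stand-in for an HFile: row data plus a Bloom filter over its row keys. */
    static final class FileSketch {
        private static final int M = 1 << 12; // filter size in bits (arbitrary)
        private static final int K = 3;       // positions per key (arbitrary)
        private final Map<String, String> rows = new HashMap<>();
        private final BitSet bloom = new BitSet(M);
        int reads = 0; // counts simulated disk reads

        void write(String rowKey, String value) {
            rows.put(rowKey, value);
            for (int i = 0; i < K; i++) {
                bloom.set(position(rowKey, i));
            }
        }

        boolean mightContain(String rowKey) {
            for (int i = 0; i < K; i++) {
                if (!bloom.get(position(rowKey, i))) {
                    return false;
                }
            }
            return true;
        }

        String read(String rowKey) {
            reads++; // in a real store this is the expensive step
            return rows.get(rowKey);
        }

        private static int position(String key, int i) {
            long h = 0xcbf29ce484222325L;
            for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
                h ^= (b & 0xff);
                h *= 0x100000001b3L;
            }
            h ^= h >>> 33;
            return Math.floorMod((int) h + i * (int) (h >>> 32), M);
        }
    }

    final Map<String, String> memStore = new HashMap<>();
    final List<FileSketch> files = new ArrayList<>();

    String get(String rowKey) {
        String value = memStore.get(rowKey); // 1. check the MemStore
        if (value != null) {
            return value;
        }
        for (FileSketch file : files) {      // 2. consult each file's Bloom filter
            if (!file.mightContain(rowKey)) {
                continue;                    // "absent": skip the file without reading it
            }
            value = file.read(rowKey);       // "possibly present": read the file
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        BloomReadPathSketch store = new BloomReadPathSketch();
        FileSketch older = new FileSketch();
        older.write("user_1", "sports");
        FileSketch newer = new FileSketch();
        newer.write("user_2", "music");
        store.files.add(older);
        store.files.add(newer);

        System.out.println(store.get("user_2")); // music
        System.out.println("reads of older file: " + older.reads);
    }
}
```

The reads counter on each file makes the saving visible: a file is only read when its filter cannot rule the key out.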

// Specify the Bloom filter type when creating the table
HTableDescriptor tableDesc = new HTableDescriptor(TableName.valueOf("user_profile"));
HColumnDescriptor cfDesc = new HColumnDescriptor("cf");
// Use the ROWCOL Bloom filter type
cfDesc.setBloomFilterType(BloomType.ROWCOL);
tableDesc.addFamily(cfDesc);
admin.createTable(tableDesc);

// Simplified illustration of how the Bloom filter is fed inside the HFile writer
// (a sketch of the idea, not the literal HBase source)
public void append(Cell cell) {
    if (bloomFilter != null) {
        byte[] row = CellUtil.cloneRow(cell);
        if (bloomType == BloomType.ROWCOL) {
            // ROWCOL mode adds the row key followed by the column qualifier
            byte[] qualifier = CellUtil.cloneQualifier(cell);
            bloomFilter.add(Bytes.add(row, qualifier));
        } else {
            // ROW mode adds only the row key
            bloomFilter.add(row);
        }
    }
}

// Write actual data
public class HBaseWriter {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_profile"))) {
            Put put = new Put(Bytes.toBytes("user_1001"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("tags"), Bytes.toBytes("sports,music"));
            // No explicit Bloom filter call is needed: HBase builds the filter
            // when the MemStore is flushed to an HFile
            table.put(put);
        }
    }
}

Write-path data flow: