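[Scenario Question] Deduplicating a List
Deduplicating a List
- 1. The question
- 2. Test scenario
- 3. Example code
- 4. Performance comparison (measured results)
- 5. Conclusions and recommendations
- 6. Additional performance tips
- 7. With complex objects, avoid Stream for tens of millions of elements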
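1. The question
What are the ways to deduplicate a List, and how do they compare in performance, readability, and memory overhead?
2. Test scenario
Assume a List<Integer> with 10 million elements, each drawn at random from [0, 900,000), so at most 900,000 distinct values remain after deduplication.
We compare the following approaches. The nested-loop approach is listed for comparison only: the benchmark code in section 3 does not run it, and labels the Map approach ⑤.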
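| No. | Approach | Preserves order | Key characteristics |
|---|---|---|---|
| ① | HashSet | ❌ No | Fast; iteration order is not guaranteed |
| ② | LinkedHashSet | ✅ Yes | Fast and preserves insertion order |
| ③ | TreeSet | Sorted (natural order), not insertion order | Sorted output, but the slowest of the Sets |
| ④ | Stream + distinct() | ✅ Yes | Concise, modern style; no slower than HashSet |
| ⑤ | Nested loops (traditional) | ✅ Yes | O(n²) time complexity |
| ⑥ | Map keys | ✅ Yes | Flexible; performance close to LinkedHashSet |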
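
The nested-loop approach from the table is not shown in the article's benchmark, so here is a minimal sketch of what it could look like. The class and method names (`NestedLoopDedup`, `dedup`) are made up for illustration. For each element it scans everything kept so far, which is where the O(n²) cost comes from.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical helper, not part of the original benchmark.
class NestedLoopDedup {

    // Keeps the first occurrence of each element, preserving input order.
    static <T> List<T> dedup(List<T> input) {
        List<T> result = new ArrayList<>();
        for (T candidate : input) {
            boolean alreadyKept = false;
            // Inner loop: linear scan over the elements kept so far.
            for (T kept : result) {
                if (Objects.equals(kept, candidate)) {
                    alreadyKept = true;
                    break;
                }
            }
            if (!alreadyKept) {
                result.add(candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(dedup(List.of(3, 1, 3, 2, 1))); // prints [3, 1, 2]
    }
}
```

It needs no extra Set, which is why its memory footprint is the lowest, but the inner scan makes it impractical beyond small lists.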
3. Example code

```java
package com.beijing.controller;

import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

public class DistinctListTest {

    private static final int DATA_SIZE = 10_000_000; // number of elements
    private static final int RUNS = 5;               // repetitions per approach

    public static void main(String[] args) {
        // Build random test data
        List<Integer> list = new ArrayList<>(DATA_SIZE);
        Random random = new Random();
        for (int i = 0; i < DATA_SIZE; i++) {
            list.add(random.nextInt(900_000)); // values in [0, 900000), so duplicates are guaranteed
        }

        System.out.println("========================================");
        System.out.println("Benchmark start, sample size: " + list.size());
        System.out.println("Each approach runs " + RUNS + " times, unit: ms");
        System.out.println("========================================\n");

        testMethod("① HashSet dedup", list, l -> new ArrayList<>(new HashSet<>(l)));
        testMethod("② LinkedHashSet dedup", list, l -> new ArrayList<>(new LinkedHashSet<>(l)));
        testMethod("③ TreeSet dedup", list, l -> new ArrayList<>(new TreeSet<>(l)));
        testMethod("④ Stream distinct dedup", list, l -> l.stream().distinct().collect(Collectors.toList()));
        testMethod("⑤ Map dedup", list, l -> {
            // LinkedHashMap keys are unique and keep insertion order
            Map<Integer, Boolean> map = new LinkedHashMap<>();
            for (Integer i : l) map.put(i, true);
            return new ArrayList<>(map.keySet());
        });

        System.out.println("\n========================================");
        System.out.println("Benchmark complete ✅");
        System.out.println("========================================");
    }

    /**
     * Generic benchmark helper: runs the function several times and averages the elapsed time.
     */
    private static void testMethod(String name, List<Integer> list,
                                   Function<List<Integer>, List<Integer>> func) {
        long total = 0;
        int resultSize = 0;
        System.out.printf("%s:\n", name);
        for (int i = 1; i <= RUNS; i++) {
            long start = System.nanoTime();
            List<Integer> result = func.apply(list);
            long cost = (System.nanoTime() - start) / 1_000_000; // convert ns to ms
            total += cost;
            resultSize = result.size();
            System.out.printf(" Run %d: %4d ms\n", i, cost);
        }
        double avg = total * 1.0 / RUNS;
        System.out.printf("👉 Average: %.2f ms, result size: %d\n\n", avg, resultSize);
    }
}
```
Output of one run on JDK 21 (main class com.beijing.controller.DistinctListTest):

```
========================================
Benchmark start, sample size: 10000000
Each approach runs 5 times, unit: ms
========================================

① HashSet dedup:
 Run 1:  719 ms
 Run 2:  617 ms
 Run 3:  651 ms
 Run 4:  650 ms
 Run 5:  622 ms
👉 Average: 651.80 ms, result size: 899989

② LinkedHashSet dedup:
 Run 1:  699 ms
 Run 2:  678 ms
 Run 3:  677 ms
 Run 4:  708 ms
 Run 5:  710 ms
👉 Average: 694.40 ms, result size: 899989

③ TreeSet dedup:
 Run 1: 4662 ms
 Run 2: 4505 ms
 Run 3: 4432 ms
 Run 4: 3950 ms
 Run 5: 4652 ms
👉 Average: 4440.20 ms, result size: 899989

④ Stream distinct dedup:
 Run 1:  583 ms
 Run 2:  548 ms
 Run 3:  486 ms
 Run 4:  564 ms
 Run 5:  598 ms
👉 Average: 555.80 ms, result size: 899989

⑤ Map dedup:
 Run 1:  634 ms
 Run 2:  591 ms
 Run 3:  584 ms
 Run 4:  699 ms
 Run 5:  566 ms
👉 Average: 614.80 ms, result size: 899989

========================================
Benchmark complete ✅
========================================

Process finished with exit code 0
```
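4. Performance comparison (measured results)

| Approach | Avg. time (10,000,000 elements) | Preserves order | Memory footprint | Notes |
|---|---|---|---|---|
| HashSet | ✅ ~651.80 ms | No | Medium | Fast; order not guaranteed |
| LinkedHashSet | ✅ ~694.40 ms | ✅ Yes | Medium | Most recommended for real projects |
| TreeSet | ❌ ~4440.20 ms | Sorted | Higher | For cases that need sorted output |
| Stream.distinct() | ✅ ~555.80 ms | ✅ Yes | Medium to high | Concise; fastest in this run |
| Nested loops | ❌ >100,000 ms (estimate, not benchmarked) | ✅ Yes | Lowest | Performance disaster, O(n²) |
| Map keySet | ✅ ~614.80 ms | ✅ Yes | Medium | On par with LinkedHashSet |

| Approach | Internal mechanism | Passes over the data | Copies | Typical speed |
|---|---|---|---|---|
| HashSet | Two explicit copies (list → set → list) | 2 | 2 | Medium |
| LinkedHashSet | Insertion order kept + two copies | 2 | 2 | Slightly slower |
| TreeSet | Red-black tree ordering | 2 | 2 | Slowest, yet still far faster than nested loops |
| Stream.distinct() | Single streaming pass | 1 | 1 | Usually fastest (once JIT-compiled) |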
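
To make the "passes" and "copies" columns concrete, here is a rough sketch of a hand-written single-pass dedup. It is only an analogy for what a sequential `distinct()` does conceptually, not the JDK's actual implementation; `SinglePassDedup` and `dedup` are names invented for this example. It relies on `Set.add` returning `false` when the element is already present.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: one pass over the input, one output list.
class SinglePassDedup {

    static <T> List<T> dedup(List<T> input) {
        Set<T> seen = new HashSet<>();
        List<T> result = new ArrayList<>();
        for (T item : input) {
            // add() returns true only the first time an element is seen
            if (seen.add(item)) {
                result.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(dedup(List.of(5, 5, 4, 5, 4, 1))); // prints [5, 4, 1]
    }
}
```

By contrast, `new ArrayList<>(new HashSet<>(list))` first copies every element into the set and then copies the set into a new list, which is the "2 passes, 2 copies" pattern in the table.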
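5. Conclusions and recommendations

| Scenario | Recommended approach | Rationale |
|---|---|---|
| Order does not matter | new HashSet<>(list) | Fast and simple |
| Must preserve original order | new LinkedHashSet<>(list) | Fast, with stable ordering |
| Need sorted output | new TreeSet<>(list) | Sorts automatically, but slowest |
| Concise code | list.stream().distinct().toList() | Readable (JDK 16+) |
| Performance-sensitive with custom logic | LinkedHashMap | More flexible; can be extended with counting logic, etc. |
| Tens of millions of elements | Avoid Stream (many objects created) | Plain collections are more efficient |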
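
As an illustration of the "can be extended with counting logic" remark, the sketch below deduplicates with a `LinkedHashMap` while also counting how often each element occurred. `CountingDedup` and `countOccurrences` are hypothetical names; `Map.merge` is the standard JDK method.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: dedup and count in one pass.
class CountingDedup {

    // Keys are the distinct elements in first-seen order; values are occurrence counts.
    static <T> Map<T, Integer> countOccurrences(List<T> input) {
        Map<T, Integer> counts = new LinkedHashMap<>();
        for (T item : input) {
            counts.merge(item, 1, Integer::sum); // insert 1, or add 1 to the existing count
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countOccurrences(List.of("b", "a", "b", "c", "a", "b"));
        System.out.println(counts.keySet()); // prints [b, a, c]
        System.out.println(counts);          // prints {b=3, a=2, c=1}
    }
}
```

The key set is the deduplicated result; the values come for free, which a plain Set cannot provide.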
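6. Additional performance tips
- For simple element types (Integer, String): LinkedHashSet is the best all-round choice.
- For objects (e.g. a custom Person class): hashCode() and equals() must be implemented correctly, otherwise duplicates are not actually removed.
- For very large data sets (>10M elements): consider sharded deduplication (add batches to a Set, then merge) to lower peak memory.
- For more than 10^6 elements: list.parallelStream().distinct().collect(Collectors.toList()) can be faster, provided the elements are simple types such as Integer.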
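
The point about `hashCode()` and `equals()` is easy to demonstrate. In this sketch, `RawPerson` and `Person` are hypothetical classes invented for the example: the first inherits identity-based equality from `Object`, the second compares by field values.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;

// Hypothetical demo classes, not from the original article.
class PersonDedup {

    // No equals/hashCode: two instances with the same fields are still "different".
    static class RawPerson {
        final String name;
        final int age;
        RawPerson(String name, int age) { this.name = name; this.age = age; }
    }

    // Value-based equals/hashCode: equal fields mean equal objects.
    static final class Person {
        final String name;
        final int age;
        Person(String name, int age) { this.name = name; this.age = age; }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Person)) return false;
            Person other = (Person) o;
            return age == other.age && Objects.equals(name, other.name);
        }

        @Override
        public int hashCode() {
            return Objects.hash(name, age);
        }
    }

    static <T> int distinctCount(List<T> items) {
        return new LinkedHashSet<>(items).size();
    }

    public static void main(String[] args) {
        System.out.println(distinctCount(List.of(new RawPerson("Ann", 30), new RawPerson("Ann", 30)))); // prints 2
        System.out.println(distinctCount(List.of(new Person("Ann", 30), new Person("Ann", 30))));       // prints 1
    }
}
```

The same rule applies to `HashSet`, `LinkedHashMap` keys, and `Stream.distinct()`, since all of them rely on `equals` and `hashCode`.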
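7. With complex objects, avoid Stream for tens of millions of elements
Stream is implemented as a "data pipeline" abstraction.
list.stream().filter(...).map(...).distinct().collect(Collectors.toList());
Under the hood, the Stream library (not the JVM itself) builds a chain of nested pipeline objects:
ArrayListSpliterator → Head → FilterOp → MapOp → DistinctOp → Collector
Each stage is an object, and each stage wraps a lambda expression.
As data flows through, each stage invokes its callback in turn.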
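
The per-element callback chain can be observed by recording when each lambda runs. This is a small sketch with made-up names (`PipelineTrace`, `trace`); the side effects inside the lambdas are there only for demonstration on a sequential stream and are not something to do in production code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo: logs the order in which pipeline stages are invoked.
class PipelineTrace {

    static List<String> trace(List<Integer> input) {
        List<String> events = new ArrayList<>();
        List<Integer> result = input.stream()
                .filter(x -> { events.add("filter " + x); return x > 0; })
                .map(x -> { events.add("map " + x); return x % 3; })
                .distinct()
                .collect(Collectors.toList());
        events.add("result " + result);
        return events;
    }

    public static void main(String[] args) {
        // Each element travels through filter and map before the next one starts.
        trace(List.of(1, -2, 4)).forEach(System.out::println);
    }
}
```

For the input `[1, -2, 4]` the recorded order is `filter 1`, `map 1`, `filter -2`, `filter 4`, `map 4`, `result [1]`: elements pass through the stages one at a time, and `distinct()` drops the second `1` produced by `map`.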
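| Type | Role | Number created |
|---|---|---|
| Stream instance | Pipeline entry point | 1 |
| PipelineHelper | Maintains context | 1 |
| Sink chain | One per intermediate operation | 3 to 5 |
| Spliterator | Splits the data source | 1 (or more) |
| Lambda objects | At least one per operation | N (one per operation) |
| Temporary objects inside the Stream framework | Collected inside distinct | Roughly the number of elements |

At the scale of tens of millions of elements (10^7):
- even at a few dozen bytes per object, this produces hundreds of MB of short-lived objects;
- these objects are all allocated in Eden (the young generation);
- so the JVM ends up triggering Minor GC frequently.

| Element count | Plain LinkedHashSet (approx. memory) | Stream.distinct() (approx. memory) |
|---|---|---|
| 100 thousand | ~15 MB | ~20 MB |
| 1 million | ~100 MB | ~160 MB |
| 10 million | ~900 MB | >1.6 GB |
| 50 million | ~4.5 GB | >9 GB (prone to OOM) |

💡 Why: Stream produces many short-lived objects, which adds GC pressure and allocation overhead.

| Scenario | Recommended approach | Rationale |
|---|---|---|
| Up to millions of elements | Stream.distinct() | Readable, fast enough |
| Tens of millions of elements or more | new LinkedHashSet<>(list) | More predictable memory, no intermediate objects |
| Need multi-threaded speedup | parallelStream().distinct() | Uses multiple cores, but peak memory still needs watching |
| Very large scale (>100 million) | Batch processing + HashSet merge | Avoids a single pipeline holding too much memory |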
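
The "batch processing + merge" row could be sketched as follows. This is only one possible reading of the idea, with invented names (`BatchDedup`, `mergeBatches`), and it assumes the batches arrive one at a time from some external source (for example paged database reads) so that the full input never has to sit in memory at once. The merged set still holds every distinct element.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: dedup each batch locally, then merge into one ordered set.
class BatchDedup {

    static <T> List<T> mergeBatches(Iterable<? extends Collection<T>> batches) {
        Set<T> merged = new LinkedHashSet<>();
        for (Collection<T> batch : batches) {
            // Small, short-lived set for the current batch only
            Set<T> local = new LinkedHashSet<>(batch);
            merged.addAll(local);
        }
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        List<List<Integer>> batches = List.of(List.of(1, 2, 2), List.of(2, 3), List.of(1, 4));
        System.out.println(mergeBatches(batches)); // prints [1, 2, 3, 4]
    }
}
```

Using `LinkedHashSet` keeps first-seen order across batches; swapping in `HashSet` would match the table's wording when order does not matter.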
