计算机视觉与模式识别前沿一览:2025年8月arXiv 热点研究趋势解析
本推文分析了arXiv中Computer Vision and Patteren Recognition(计算机视觉与模式识别)领域2025年8月发布的近50篇论文的研究热点,旨在帮助读者快速了解近期领域内的前沿技术与研究方向。
arXiv是全球最具影响力的开放电子预印本平台之一,由美国国家科学基金会和美国能源部资助,在美国Los Alamos国家实验室创立,现由美国康奈尔大学负责管理并维护。arXiv涵盖了计算机科学、物理、数学、量化金融等多个领域学科。目前,越来越多的研究人员选择在论文正式发表之前,将最新研究成果提前发布于arXiv,极大促进了全球科研社区的交流与共享。
本推文作者为许东舟,审核为黄星宇和邱雪。
一、计算机视觉与模式识别
计算机视觉与模式识别在计算机科学与人工智能领域具有核心地位,两者相互支撑、共同发展。计算机视觉旨在使计算机从图像与视频等数据中自动获取信息并理解场景与目标,典型任务包括目标检测、图像分割、姿态估计和三维重建等;模式识别则侧重于从数据中提取特征并建立判别或生成模型,用于分类、聚类、匹配或异常检测等决策。
随着技术的成熟,它们正逐渐渗透进各行各业,不仅在人脸识别、物流分拣、交通管理等传统任务中具有广泛应用,也为具身智能、自动驾驶、医学影像分析和AIGC等前沿技术的发展奠定了基础。
二、热点分析
本文分析了2025年8月发表在arXiv上计算机视觉与模式识别领域的50篇最新论文。图1为基于本期所有论文标题中研究热点生成的词云图。表1列出了全部的50篇论文(按照时间排序)。为了进一步揭示本期研究热点,表2对论文标题中出现频率最高的10个主题词进行了整理和统计,旨在为相关领域的研究人员提供研究方向上的参考。
图1 2025年8月期Computer Vision and Patteren Recognition研究热点词云图
表1 2025年8月Computer Vision and Patteren Recognition方向的50篇论文标题汇总
编号 | 论文 / 项目标题 |
1 | LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos |
2 | Beyond Simple Edits: Composed Video Retrieval with Dense Modifications |
3 | Distilled-3DGS: Distilled 3D Gaussian Splatting |
4 | GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation |
5 | InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing |
6 | Backdooring Self-Supervised Contrastive Learning by Noisy Alignment |
7 | Online 3D Gaussian Splatting Modeling with Novel View Selection |
8 | ResPlan: A Large-Scale Vector-Graph Dataset of 17,000 Residential Floor Plans |
9 | Self-Supervised Sparse Sensor Fusion for Long Range Perception |
10 | Physics-Based 3D Simulation for Synthetic Data Generation and Failure Analysis in Packaging Stability Assessment |
11 | OmViD: Omni-supervised active learning for video action detection |
12 | ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving |
13 | RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation |
14 | ViT-FIQA: Assessing Face Image Quality using Vision Transformers |
15 | DIME-Net: A Dual-Illumination Adaptive Enhancement Network Based on Retinex and Mixture-of-Experts |
16 | PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis |
17 | SCRNet: Spatial-Channel Regulation Network for Medical Ultrasound Image Segmentation |
18 | Forecasting Smog Events Using ConvLSTM: A Spatio-Temporal Approach for Aerosol Index Prediction in South Asia |
19 | In-hoc Concept Representations to Regularise Deep Learning in Medical Imaging |
20 | RICO Two: Realistic Benchmarks and an In-Depth Analysis for Incremental Learning in Object Detection |
21 | RED.AI Id-Pattern: First Results of Stone Deterioration Patterns with Multi-Agent Systems |
22 | SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation |
23 | Self-Aware Adaptive Alignment: Enabling Accurate Perception for Intelligent Transportation Systems |
24 | Unsupervised Urban Tree Biodiversity Mapping from Street-Level Imagery Using Spatially-Aware Visual Clustering |
25 | Timestep-Compressed Attack on Spiking Neural Networks through Timestep-Level Backpropagation |
26 | A Fully Transformer Based Multimodal Framework for Explainable Cancer Image Segmentation Using Radiology Reports |
27 | VisionLaw: Inferring Interpretable Intrinsic Dynamics from Visual Observations via Bilevel Optimization |
28 | Shape-from-Template with Generalised Camera |
29 | MR6D: Benchmarking 6D Pose Estimation for Mobile Robots |
30 | Mitigating Cross-Image Information Leakage in LVLMs for Multi-Image Tasks |
31 | Enhancing Targeted Adversarial Attacks on Large Vision-Language Models through Intermediate Projector Guidance |
32 | Hierarchical Vision-Language Retrieval of Educational Metaverse Content in Agriculture |
33 | Diversity-enhanced Collaborative Mamba for Semi-supervised Medical Image Segmentation |
34 | HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes |
35 | DeH4R: A Decoupled and Hybrid Method for Road Network Graph Extraction |
36 | OmniTry: Virtual Try-On Anything without Masks |
37 | DiffIER: Optimizing Diffusion Models with Iterative Error Reduction |
38 | RCGNet: RGB-based Category-Level 6D Object Pose Estimation with Geometric Guidance |
39 | TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis |
40 | Two-Factor Authentication Smart Entryway Using Modified LBPH Algorithm |
41 | PersonaVlog: Personalized Multimodal Vlog Generation with Multi-Agent Collaboration and Iterative Self-Correction |
42 | Unleashing Semantic and Geometric Priors for 3D Scene Completion |
43 | Towards Efficient Vision State Space Models via Token Merging |
44 | Bridging Clear and Adverse Driving Conditions |
45 | Temporal-Conditional Referring Video Object Segmentation with Noise-Free Text-to-Video Diffusion Model |
46 | Generative Model-Based Feature Attention Module for Video Action Analysis |
47 | The 9th AI City Challenge |
48 | Learnable SMPLify: A Neural Solution for Optimization-Free Human Pose Inverse Kinematics |
49 | DictAS: A Framework for Class-Generalizable Few-Shot Anomaly Segmentation via Dictionary Lookup |
50 | Color Spike Data Generation via Bio-inspired Neuron-like Encoding with an Artificial Photoreceptor Layer |
表2 高频关键词TOP10
关键词 | 出现次数 |
Image | 8 |
Segmentation | 6 |
3D | 6 |
Video | 6 |
Generation | 5 |
Gaussian/Gaussian Splatting | 4 |
LVLM / Vision-Language / VL | 4 |
Lager Language Model / LLM | 3 |
Multimodal | 3 |
Pose | 3 |
三、总结
从本期arXiv计算机视觉与模式识别方向论文的高频关键词来看(见表 2),研究热点呈现出以下特征与趋势:
本期高频热点榜首为“Image(图像)”(8 次),这表明图像仍然是计算机视觉研究的核心。无论是图像分割、图像生成、目标检测,还是多模态语言模型的构建,都离不开对图像这一基础要素的深入分析与建模。
随后是“Segmentation(分割)”、“3D(三维)”以及“Video(视频)”并列第二(均为6次)。反映出了三个重要方向:首先,分割仍是视觉研究的关键,从医学图像到多模态模型都是不可或缺的一部分;其次,三维视觉的热度依旧居高不下,相关工作涵盖三维重建、三维分割以及三维场景建模等,具有较强的实际应用价值;第三,视频研究已成为新的热点之一,从生成到检索再到动作分析,都展现出了学术界与产业界对动态场景的高度重视。
“Generation(生成,5次)”紧随其后,体现出生成式方法在图像、视频以及三维建模等方向中具有重要意义。Gaussian / Gaussian Splatting(高斯溅射)出现4次,可以看出这一方法正逐渐成为三维建模方向中最热门的领域。
“LVLM / Vision-Language(视觉-语言模型,4次)”与“Large Language Model / LLM(大语言模型,3次)”的频繁出现,则体现出跨模态与大规模预训练模型的快速发展。如何在建立视觉与语言之间更稳健的对齐机制,以及如何借助大模型增强视觉任务的泛化能力,已逐渐成为新的研究趋势。
此外,“Multimodal(多模态)”与“Pose(姿态)”均出现了3次。多模态模型突出了跨模态信息的交互与统一建模,常见于视觉、语言与文本等多源数据的融合,后者则在人机交互、虚拟现实、动作识别等场景中展现出了重要的应用价值。
总体来看,本期的研究热点主要聚焦于图像与视频分析、分割与三维建模、生成式方法、大模型的跨模态应用。随着高斯溅射、扩散模型以及视觉-语言模型的不断发展,计算机视觉正逐步迈向更加贴近真实世界应用的方向。可以预见,未来的研究将持续围绕生成式视觉、视觉-语言融合以及多模态通用大模型展开更深入的探索。