
AAAI 2025 Accepted Papers (Part 2)

| Paper ID | Title | Author(s) |
| --- | --- | --- |
| 5 | JailPO: A Novel Black-box Jailbreak Framework via Preference Optimization against Aligned LLMs | Hongyi Li, Jiawei Ye, Wu Jie, Tianjie Yan, 王楚, Zhixin Li |
| 8 | Verification of Neural Networks against Convolutional Perturbations via Parameterised Kernels | Benedikt Brückner, Alessio Lomuscio |
| 9 | Evaluate with the Inverse: Efficient Approximation of Latent Explanation Quality Distribution | Carlos Eiras-Franco, Anna Hedström, Marina MC Höhne |
| 15 | SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety | Paul Röttger, Fabio Pernisi, Bertie Vidgen, Dirk Hovy |
| 18 | Is poisoning a real threat to DPO? Maybe more so than you think | Pankayaraj Pathmanathan, Souradip Chakraborty, Xiangyu Liu, Yongyuan Liang, Furong Huang |
| 42 | IBAS: Imperceptible Backdoor Attacks in Split Learning with Limited Information | peng xi, Shaoliang Peng, Wenjuan Tang |
| 51 | On the Consideration of AI Openness: Can Good Intent Be Abused? | Yeeun Kim, Hyunseo Shin, Eunkyung Choi, Hongseok Oh, Hyunjun Kim, Wonseok Hwang |
| 58 | Partial Identifiability in Inverse Reinforcement Learning For Agents With Non-Exponential Discounting | Joar Max Viktor Skalse, Alessandro Abate |
| 61 | Risk Controlled Image Retrieval | Kaiwen Cai, Chris Xiaoxuan Lu, Xingyu Zhao, Wei Huang, Xiaowei Huang |
| 66 | Do Transformer Interpretability Methods Transfer to RNNs? | Gonçalo Santos Paulo, Thomas Marshall, Nora Belrose |
| 81 | Aligning Large Language Models for Faithful Integrity against Opposing Argument | Yong Zhao, Yang Deng, See-Kiong Ng, TatSeng Chua |
| 87 | Single Character Perturbations Break LLM Alignment | Leon Lin, Hannah Brown, Kenji Kawaguchi, Michael Shieh |
| 96 | ME: Modelling Ethical Values for Value Alignment | Eryn Rigley, Adriane Chapman, Christine Evers, Will McNeill |
| 100 | Enhance Modality Robustness in Text-Centric Multimodal Alignment with Adversarial Prompting | Yun-Da Tsai, Ting-Yu Yen, Keng-Te Liao, Shou-De Lin |
| 108 | ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates | Fengqing Jiang, Zhangchen Xu, Luyao Niu, Bill Yuchen Lin, Radha Poovendran |
| 109 | UFID: A Unified Framework for Black-box Input-level Backdoor Detection on Diffusion Models | Zihan Guan, Mengxuan Hu, Sheng Li, Anil Kumar Vullikanti |
| 113 | SMLE: Safe Machine Learning via Embedded Overapproximation | Matteo Francobaldi, Michele Lombardi |
| 119 | Increased Compute Efficiency and the Diffusion of AI Capabilities | Konstantin Friedemann Pilz, Lennart Heim, Nicholas Brown |
| 133 | Searching for Unfairness in Algorithms’ Outputs: Novel Tests and Insights | Ian Davidson, S. S. Ravi |
| 140 | Retention Score: Quantifying Jailbreak Risks for Vision Language Models | ZAITANG LI, Pin-Yu Chen, Tsung-Yi Ho |
| 144 | Scaling Laws for Data Poisoning in LLMs | Dillon Bowen, Brendan Murphy, Will Cai, David Khachaturov, Adam Gleave, Kellin Pelrine |
| 166 | Data with High and Consistent Preference Difference Are Better for Reward Model | Qi Lin, Hengtong Lu, Caixia Yuan, Xiaojie Wang, Huixing Jiang, Chen Wei |
| 168 | Neurons to Words: A Novel Method for Automated Neural Network Interpretability and Alignment | Lukas-Santo Puglisi, Fabio Valdés, Jakob Johannes Metzger |
| 171 | Stream Aligner: Efficient Sentence-Level Alignment via Distribution Induction | Hantao Lou, Jiaming Ji, Kaile Wang, Yaodong Yang |
| 173 | Strong Empowered and Aligned Weak Mastered Annotation for Weak-to-Strong Generalization | Yongqi Li, Xin Miao, Mayi Xu, Tieyun Qian |
| 189 | Dynamic Algorithm Termination for Branch-and-Bound-based Neural Network Verification | Konstantin Kaulen, Matthias König, Holger Hoos |
| 196 | Towards a Theory of AI Personhood | Francis Rhys Ward |
| 198 | MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models | Fenghua Weng, Yue Xu, Chengyan Fu, Wenjie Wang |
| 199 | Sequential Decision Making in Stochastic Games with Incomplete Preferences over Temporal Objectives | Abhishek Ninad Kulkarni, Jie Fu, ufuk topcu |
| 213 | CALM: Curiosity-Driven Auditing for Large Language Models | Xiang Zheng, Longxiang WANG, Yi Liu, Xingjun Ma, Chao Shen, Cong Wang |
| 215 | Bias Unveiled: Investigating Social Bias in LLM-Generated Code | Lin Ling, Fazle Rabbi, Song Wang, Jinqiu Yang |
| 221 | SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models | Somnath Banerjee, Sayan Layek, Soham Tripathy, Shanu Kumar, Animesh Mukherjee, Rima Hazra |
| 222 | Align-Pro: A Principled Approach to Prompt Optimization for LLM Alignment | Prashant Trivedi, Souradip Chakraborty, Avinash Reddy, Vaneet Aggarwal, Amrit Singh Bedi, George K. Atia |
| 229 | Maximizing Signal in Human-Model Preference Alignment | Margaret Kroll, Kelsey Kraus |
| 245 | Robust Multi-Objective Preference Alignment with Online DPO | Raghav Gupta, Ryan Sullivan, Yunxuan Li, Samrat Phatale, Abhinav Rastogi |
| 246 | Reinforcement Learning Platform for Adversarial Black-box Attacks with Custom Distortion Filters | Soumyendu Sarkar, Ashwin Ramesh Babu, Sajad Mousavi, Vineet Gundecha, Sahand Ghorbanpour, Avisek Naug, Ricardo Luna Gutierrez, Antonio Guillen, Desik Rengarajan |
| 250 | DR-Encoder: Encode Low-rank Gradients with Random Prior for Large Language Models Differentially Privately | Huiwen Wu, Deyi Zhang, Xiaohan Li, Xiaogang Xu, Jiafei Wu, Zhe Liu |
| 260 | Quantifying Misalignment Between Agents | Aidan Kierans, Avijit Ghosh, Hananel Hazan, Shiri Dori-Hacohen |
| 266 | MAPLE: A Framework for Active Preference Learning Guided by Large Language Models | Saaduddin Mahmud, Mason Nakamura, Shlomo Zilberstein |
| 267 | Is Your Autonomous Vehicle Safe? Understanding the Threat of Electromagnetic Signal Injection Attacks | Wenhao Liao, Sineng Yan, Youqian Zhang, Xinwei Zhai, Yuanyuan Wang, Eugene Fu |
| 268 | Retrieving Versus Understanding Extractive Evidence in Few-Shot Learning | Karl Elbakian, Samuel Carton |
| 272 | Political Bias Prediction Models Focus on Source Cues, Not Semantics | Selin Chun, Daejin Choi, Taekyoung Kwon |
| 280 | Legend: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets | Duanyu Feng, Bowen Qin, Chen Huang, Youcheng Huang, Zheng Zhang, Wenqiang Lei |
| 281 | Sequential Preference Optimization: Multi-Dimensional Preference Alignment With Implicit Reward Modeling | Xingzhou Lou, Junge Zhang, Jian Xie, lifeng Liu, Dong Yan, Kaiqi Huang |
| 282 | AI Emergency Preparedness: Examining the federal government’s ability to detect and respond to AI-related national security threats | Akash Wasil, Everett Thornton Smith, Corin Katzke, Justin Bullock |
| 294 | In Search of Trees: Decision-Tree Policy Synthesis for Black-Box Systems via Search | Emir Demirović, Christian Schilling, Anna Lukina |
| 328 | Generalizing Alignment Paradigm of Text-to-Image Generation with Preferences through f-divergence Minimization | Haoyuan Sun, Bo Xia, Yongzhe Chang, Xueqian Wang |
