Paper translation: BRILLM: BRAIN-INSPIRED LARGE LANGUAGE MODEL
Paper: https://arxiv.org/pdf/2503.11299
Code: https://github.com/brillm05/BriLLM0.5
Abstract
We present BriLLM, a brain-inspired large language model that fundamentally reimagines machine learning foundations through Signal Fully-connected flowing (SiFu) learning. Addressing core limitations of Transformer-based models (black-box opacity, quadratic complexity, and context-length dependency), BriLLM incorporates two key neurocognitive principles: (1) static semantic mapping, where tokens map to specialized nodes analogous to cortical regions, and (2) dynamic signal propagation, simulating electrophysiological information flow. This architecture enables three breakthroughs: full model interpretability, context-length-independent scaling, and the first global-scale simulation of brain-like processing. Initial 1–2B parameter models demonstrate GPT-1-level generative capabilities with stable perplexity reduction. Scalability analyses confirm the feasibility of 100–200B parameter variants processing 40,000-token contexts. BriLLM establishes a new paradigm for biologically grounded AGI development.
1 INTRODUCTION
Artificial general intelligence (AGI), meaning systems that autonomously acquire and generalize skills across domains, requires a departure from narrow, task-specific paradigms. Prevailing large language models (LLMs), built on Transformer and GPT architectures (Radford et al., 2018), are increasingly recognized as having fundamental limitations in achieving AGI (Vaswani et al., 2017). Paradoxically, these frameworks may represent the apex of the conventional machine learning (ML) and deep learning (DL) paradigms. Their failure to scale toward AGI thus reflects not architectural flaws but a critical limitation of the underlying ML/DL foundation: black-box opacity, where only inputs and outputs are interpretable and the internal mechanisms remain inscrutable.
This core flaw, innate to all traditional ML/DL systems, persists regardless of architectural tweaks, including advances in attention mechanisms. Even optimal designs cannot address the fundamental opacity of conventional models. It is quite possible that the Transformer or GPT is already the best design available within the current ML/DL paradigm. Progress toward AGI therefore demands a rewrite of ML's foundations to create interpretable, brain-inspired frameworks. This imperative motivates our introduction of Signal Fully-connected flowing (SiFu) learning, a paradigm shift stemming from two insights: the recognition that GPT's AGI bottlenecks are systemic, not architectural; and principles derived from macroscopic brain function (Huth et al., 2016). SiFu is not an incremental improvement but a complete reimagining of learning, fundamentally distinct from prior ML/DL work.
Conventional LLMs exacerbate this systemic issue through attention-driven inefficiency (complexity quadratic in sequence length) and parameter scaling tied to context length. This contrasts sharply with the brain, which processes arbitrary context without physical expansion. While prior work has borrowed isolated neural features, none has attempted a global simulation of brain-like information processing as a basis for intelligence.
Here, we present three interconnected innovations:
• A novel paradigm that replaces traditional ML/DL with a brain-inspired approach to semantic representation;
• A generative framework grounded in dynamic, energy-maximizing signaling;
• The first global-scale computational model of brain-like semantic and functional mechanisms.
Table 1 situates this work within the evolution of ML, highlighting SiFu's divergence from conventional paradigms toward brain-aligned learning.
2 SIFU MECHANISM
The human brain offers a blueprint for overcoming conventional ML's limitations. As Huth et al. (2016) demonstrated, semantic information maps to specific cortical regions consistently across individuals. Every brain area contributes to interpretable processing, unlike conventional LLMs, where only inputs and outputs are transparent. Additionally, cognition arises from electrophysiological signals (e.g., EEG) propagating across these regions, dynamically activating stored knowledge. These two properties, static distributed semantics and dynamic signaling, define brain function and are absent from current ML/DL and AI.
SiFu (Signal Fully-connected flowing) learning replicates these properties through a directed graph $G = \{V, E\}$, fundamentally redefining learning:
1. Static semantic grounding: Nodes $V$ explicitly map to tokens, mirroring cortical regions that encode specific meanings (Huth et al., 2016). This ensures full interpretability: every component, not just the inputs and outputs, has semantic meaning.
2. Dynamic signal propagation: Edges $E$ enable bidirectional signaling, modeled after neural electrophysiology. Signals flow along “least resistance” paths, maximizing energy, a proxy for the strengthening of neural pathways during cognition.
This design addresses conventional ML's core failures. Prediction relies on signal dynamics, not on forward computation in a black box; model size is decoupled from sequence length (as in the brain); and every component is interpretable.
We first formalize the generative machine learning task addressed here. For language models, generative prediction (also termed autoregressive prediction) originates from the classic n-gram language modeling framework. Such tasks aim to predict the next token $w_i$ from the preceding sequence $w_1, w_2, \dots, w_{i-1}$. In deep learning implementations, this requires training a model $M$ such that
$$Id(w_i) = M_\theta(e(w_1), \dots, e(w_{i-1})),$$
where $Id(\cdot)$ denotes the token's output representation (typically a one-hot vector), $e(\cdot)$ denotes the input representation (commonly called word vectors), and $\theta$ is the set of parameters to be learned by $M$. Here $Id(\cdot)$ and $e(\cdot)$ are the only parts of this DL framework that are directly explainable. Notably, such models must integrate prior outputs into the context for prediction, giving rise to “attention” mechanisms, which are common to both RNN and Transformer architectures. Accordingly, the attention-augmented form becomes
$$Id(w_j) = M_\theta(e(w_1), \dots, e(w_{i-1}); \mathrm{Attention}(w_i, w_{i-1}, \dots, w_{j-1})).$$
In these models, only the input sequence $w_1, w_2, \dots, w_{i-1}$ and the output $w_i$ are directly understood; the model $M$ and its parameters $\theta$ require dedicated analysis to elucidate their roles in the learning process. Moreover, model size scales with input context length, because $M$ must process the entire sequence $w_1, w_2, \dots, w_{i-1}$ in a single pass through one input port.
To address these limitations, SiFu is formally defined over a vocabulary $\{w_1, \dots, w_n\}$ with nodes $V = \{v_1, \dots, v_n\}$ (each $v_i$ mapping to $w_i$). Edges $e_{ij} \in E$ between $v_i$ and $v_j$ govern signal transmission via learnable parameters. A signal tensor $r$ starts at the nodes corresponding to input tokens and propagates through the graph, with transformations at nodes ($\oplus$) and edges ($\otimes$) determined by learnable parameters $\theta_V$ (nodes) and $\theta_E$ (edges).
In the SiFu mechanism, given input tokens $w_1, \dots, w_{i-1}$ (mapped to $v_1, \dots, v_{i-1}$), the signal $r$ propagates through these nodes. The next token $w_i$ is identified as the node $v_i$ where the signal attains maximum energy, computed as:
[Equation image missing in source]
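A hedged reconstruction from the definitions above (signal $r$, node operator $\oplus$, edge operator $\otimes$), offered as a sketch rather than the paper's typeset formula:

$$v_i = \arg\max_{v'} \left\| r \oplus v_1 \otimes e_{1,2} \oplus v_2 \otimes \cdots \otimes e_{i-1,v'} \oplus v' \right\|_2$$

Here $v'$ ranges over all candidate nodes. Measuring energy with the $L_2$ norm and the edge subscript $e_{i-1,v'}$ are assumptions.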
For autoregressive prediction (as in GPT), the corresponding maximum energy is instead computed by
[Equation image missing in source]
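One form consistent with the surrounding text, written under the assumption that the weights $\alpha_k$ mix the signals recorded at each prior node before the final hop (the exact placement of the sum in the paper may differ):

$$v_i = \arg\max_{v'} \left\| \Big( \sum_{k=1}^{i-1} \alpha_k \, s_k \Big) \otimes e_{i-1,v'} \oplus v' \right\|_2, \qquad s_k = r \oplus v_1 \otimes e_{1,2} \oplus \cdots \oplus v_k$$

The helper symbol $s_k$ (the signal after it reaches node $v_k$) is introduced here for illustration only.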
where $\alpha_k$ are learnable weights that enable the model to “attend” to relevant prior nodes, mimicking distributed neural integration.
Figure 2 illustrates SiFu's operation: during forward propagation (Figure 2a), signals flow through nodes, with the next token determined by maximum energy; during training (Figure 2b), parameters are optimized so that correct paths yield the highest energy, analogous to strengthening neural pathways through learning.
SiFu's key advantages arise directly from its brain-inspired design:
1. Full interpretability: Every node maps to a token, making semantic processing transparent at all levels and replicating the brain's distributed interpretability (Huth et al., 2016).
2. Unbounded context processing: Like the brain, SiFu processes arbitrarily long sequences without expanding its structure, because signal propagation rather than parameter scaling handles longer inputs.
3. Dynamic signaling: Signal flow mirrors electrophysiological activity, enabling recall and activation patterns analogous to human cognition.
4. Cognitive traceability: Because signals propagate and predictions activate across interpretable nodes, dynamic prediction behavior is explainable throughout the process. Errors can be localized to specific signal paths (e.g., nodes or edges with abnormal activation), similar to analyzing abnormal brain activity via neuroimaging.
3 BRILLM FORMULATION
BriLLM implements the SiFu mechanism to realize a language model that replicates the brain's macroscopic properties and processing, as shown in Figure 3. Each token corresponds to a node, modeled as a GeLU-activated neuron layer with bias $b \in \mathbb{R}^{d_{node}}$ (where $d_{node}$ is the node dimension), mirroring a cortical region dedicated to specific semantics (Huth et al., 2016). Edges between nodes $u$ and $v$ are bidirectional, with matrices $W_{u,v}, W_{v,u} \in \mathbb{R}^{d_{node} \times d_{node}}$ enabling signal transmission in both directions, analogous to reciprocal neural connections.
Signal propagation in BriLLM mimics electrophysiological activity, starting with an initial tensor:
$$e_0 = [1, 1, \dots, 1]^\top \in \mathbb{R}^{d_{node}}$$
For a sequence $u_1, \dots, u_{L-1}, v_{predict}$, the signal $e_{i+1} \in \mathbb{R}^{d_{node}}$ propagating from $u_i$ to $u_{i+1}$ is:
[Equation image missing in source]
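Based on the components the text names (edge matrices $W$, GeLU activation, biases, and positional encoding), a plausible form of the propagation step is sketched below; the bias subscript and the $PE_i$ notation are assumptions:

$$e_{i+1} = \mathrm{GeLU}\left( W_{u_i,u_{i+1}} \, e_i + b_{u_i,u_{i+1}} + PE_i \right)$$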
Here, positional encoding (PE) ensures that sequence order is preserved, while edge-specific biases modulate signal strength.
To predict the next token, BriLLM integrates signals from all preceding nodes using learnable weights $\alpha \in \mathbb{R}^{L-1}$:
[Equation image missing in source]
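A sketch of the integration step that matches the description (softmax-normalized weights applied to the stored signals); the symbol $E$ for the integrated signal and the summation range are assumptions:

$$A = \mathrm{softmax}(\alpha_{1:L-1}), \qquad E = \sum_{k=1}^{L-1} A_k \, e_k$$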
where $A$ is softmax-normalized to prioritize relevant signals. The final prediction is the node that maximizes the energy of the propagated signal:
[Equation image missing in source]
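Writing $E$ for the softmax-weighted sum of the preceding signals $e_1, \dots, e_{L-1}$, the prediction rule can be sketched as follows. The $L_2$ norm as the energy measure and the reuse of the last input node's outgoing edge are assumptions, not confirmed by the text:

$$v_{predict} = \arg\max_{v} \left\| \mathrm{GeLU}\left( W_{u_{L-1},v} \, E + b_{u_{L-1},v} + PE_{L-1} \right) \right\|_2$$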
Training BriLLM involves optimizing parameters to maximize signal energy for correct sequences, analogous to strengthening neural pathways through learning. As shown in Figure 4, each training sample constructs a dynamic network reflecting the sequence's signal flow, with a cross-entropy loss rewarding accurate energy-based predictions.
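The forward pass and loss described in this section can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the toy sizes, the random initialization, the base-10000 positional encoding, the $L_2$ energy, the position indexing, and the use of energies as cross-entropy logits are all assumptions made for illustration.

```python
import math
import random

D_NODE = 4   # toy node dimension (the paper uses 32)
VOCAB = 5    # toy vocabulary size (the paper uses 4,000)

def gelu(x):
    """Exact GeLU via the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def positional_encoding(pos):
    """Sine-cosine encoding of one position; the base 10000 is an assumption."""
    return [
        math.sin(pos / 10000 ** (i / D_NODE)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / D_NODE))
        for i in range(D_NODE)
    ]

def edge_step(W, b, signal, pos):
    """One hop along an edge: GeLU(W @ signal + b + PE[pos])."""
    pe = positional_encoding(pos)
    return [
        gelu(sum(w * s for w, s in zip(row, signal)) + b_j + pe_j)
        for row, b_j, pe_j in zip(W, b, pe)
    ]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def energy(signal):
    """Signal energy, taken here to be the L2 norm (an assumption)."""
    return math.sqrt(sum(s * s for s in signal))

def init_params(seed=0):
    """Random edge matrices and edge biases for every ordered node pair."""
    rng = random.Random(seed)
    pairs = [(u, v) for u in range(VOCAB) for v in range(VOCAB)]
    W = {p: [[rng.gauss(0, 0.5) for _ in range(D_NODE)] for _ in range(D_NODE)]
         for p in pairs}
    b = {p: [rng.gauss(0, 0.1) for _ in range(D_NODE)] for p in pairs}
    return W, b

def candidate_energies(tokens, W, b, alpha):
    """Energy reached at every candidate next node, given the input token ids."""
    signals = [[1.0] * D_NODE]                      # e_0 = [1, ..., 1]
    for i in range(len(tokens) - 1):                # propagate along the input path
        edge = (tokens[i], tokens[i + 1])
        signals.append(edge_step(W[edge], b[edge], signals[-1], i))
    A = softmax(alpha[:len(signals)])               # softmax-normalized weights
    mixed = [sum(a * s[j] for a, s in zip(A, signals)) for j in range(D_NODE)]
    last, pos = tokens[-1], len(tokens) - 1
    return [energy(edge_step(W[(last, v)], b[(last, v)], mixed, pos))
            for v in range(VOCAB)]

def predict(tokens, W, b, alpha):
    """Index of the candidate node with the highest energy."""
    energies = candidate_energies(tokens, W, b, alpha)
    return max(range(VOCAB), key=lambda v: energies[v])

def cross_entropy(energies, target):
    """Treat energies as logits, so lowering the loss raises the target's share."""
    return -math.log(softmax(energies)[target])

W, b = init_params()
alpha = [0.0, 0.0, 0.0]          # one weight per stored signal (uniform after softmax)
context = [2, 0, 3]              # token ids u_1, u_2, u_3
energies = candidate_energies(context, W, b, alpha)
print("predicted node:", predict(context, W, b, alpha))
print("loss if the true next token is 1:", cross_entropy(energies, 1))
```

Note that no structure in this sketch grows with the length of `context`; only the number of propagation hops does, which is the property the paper emphasizes.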
4 EXPERIMENTS
BriLLM was designed as a generative large model targeting supervised fine-tuning (SFT) capabilities, which distinguishes it from early small pre-trained language models such as GPT-1 (which focused on deep representation learning). SiFu's departure from deep representation learning further distinguishes BriLLM and precludes direct comparison with GPT-1's representation learning benchmarks or standard fine-tuning evaluations. Additionally, current computational constraints limit our checkpoints to sub-scale sizes, which are insufficient for demonstrating GPT-LLM-like emergent abilities (SFT validation is thus future work). Instead, we validate two core generative functions, sequence continuation and stable learning dynamics, which are sufficient to confirm the feasibility of BriLLM's design.
4.1 SETUP
Datasets: BriLLM-Chinese and BriLLM-English were trained on Chinese and English Wikipedia (each >100M tokens), with sequences truncated to 32 tokens and a 4,000-token vocabulary. This setup tests the model's ability to process natural language while maintaining the brain-like property of fixed size regardless of sequence length.
Implementation Details: Implemented in PyTorch, BriLLM uses sine-cosine positional encoding, GeLU activation, and cross-entropy loss. Nodes have dimension $d_{node} = 32$ (neurons per node), with edges as 32 × 32 matrices. Training used the AdamW optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$) on 8 NVIDIA A800 GPUs for 1.5k steps. The theoretical parameter count (≈16B) reflects the fully connected graph, but sparse training (below) greatly reduces this, demonstrating efficiency akin to the brain's sparse connectivity.
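The ≈16B figure can be checked with a short calculation. The sketch below assumes one 32 × 32 matrix per ordered node pair (self-pairs included); whether the paper's count also includes per-edge biases is not stated, so they are shown separately.

```python
def dense_edge_params(vocab_size, d_node):
    """Parameters in the edge matrices of a fully connected graph:
    one d_node x d_node matrix per ordered node pair."""
    return vocab_size * vocab_size * d_node * d_node

n, d = 4000, 32
edge_matrices = dense_edge_params(n, d)
edge_biases = n * n * d          # only if biases are stored per edge (assumption)

print(f"{edge_matrices / 1e9:.1f}B matrix parameters")   # 16.4B
print(f"{edge_biases / 1e9:.2f}B bias parameters")        # 0.51B
```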
Sparse Training: Consistent with the brain's sparse neural connections, BriLLM exploits low-frequency token co-occurrences to reduce parameters. Low-frequency edges share fixed matrices, reducing the size to 2B (Chinese) and 1B (English), roughly 90% smaller than the theoretical count (Table 2). This mirrors the brain's ability to reuse neural pathways for infrequent concepts.
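One way to picture this sharing scheme is an edge table built from bigram counts. The sketch below is hypothetical: the `min_count` threshold, the single shared matrix id, and the function names are illustrative choices, not details taken from the paper.

```python
from collections import Counter

SHARED_ID = 0  # hypothetical id of the single fixed matrix shared by rare edges

def build_edge_table(sequences, min_count):
    """Give each frequent bigram edge its own matrix id; rare edges get SHARED_ID."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))      # adjacent token pairs (u, v)
    table = {}
    next_id = SHARED_ID + 1
    for edge in sorted(counts):
        if counts[edge] >= min_count:
            table[edge] = next_id
            next_id += 1
    return table, next_id                     # next_id == number of distinct matrices

def matrix_id(table, u, v):
    """Edges that are rare or never observed fall back to the shared matrix."""
    return table.get((u, v), SHARED_ID)

corpus = [[1, 2, 3], [1, 2, 4], [1, 2, 3]]
table, n_matrices = build_edge_table(corpus, min_count=2)
d_node = 32
print(n_matrices, "matrices,", n_matrices * d_node * d_node, "parameters")
```

With this layout, the parameter count depends on the number of frequent edges rather than on the square of the vocabulary size.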
4.2 RESULTS
Learning stability: Training loss (Figure 5) decreases consistently, confirming effective pattern learning.
Sequence continuation: Tables 3 and 4 demonstrate contextually relevant completions, matching GPT-1's core generative capability (its most impactful feature, despite its original focus on representation learning).
4.3 SCALABILITY
BriLLM's size scales quadratically with both the number of nodes $n$ and the node dimension, i.e., $O(n^2 \cdot d_{node}^2)$, but as a global brain simulation, mature models will not require drastic expansion for diverse AGI tasks (like the human brain). Even with a 40,000-token vocabulary (matching modern LLMs), sparse training limits size to 100–200B parameters, competitive with current models, while maintaining unique context-length independence ($O(1)$ model complexity vs. Transformers' quadratic $O(L^2)$ scaling, with context length $L$).
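As a rough check of these figures, assuming the node dimension stays at $d_{node} = 32$ for the 40,000-token vocabulary (the text does not state this), the dense count would be:

$$n^2 \cdot d_{node}^2 = (4 \times 10^4)^2 \times 32^2 = 1.6 \times 10^9 \times 1024 \approx 1.64 \times 10^{12}$$

Under that assumption, 100–200B parameters corresponds to about 6–12% of the dense count, the same range as the 1B–2B sparse models relative to the ≈16B theoretical count reported in the setup.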
5 CONCLUSION, LIMITATION AND THE FUTURE
BriLLM and its SiFu foundation represent a paradigmatic shift, addressing AGI's core barrier: the black-box nature of conventional ML/DL. By rewriting learning around brain-inspired principles, namely static semantic nodes (like cortical regions; Huth et al., 2016) and dynamic signal propagation, we achieve full interpretability, unbounded context, and global-scale brain-like processing. This third innovation, modeling the brain as a system-level processor, addresses a critical gap in AI research, where prior work has not attempted to replicate the brain's global operations.
BriLLM's design directly mirrors two defining properties of the brain: static mapping of semantic units to distinct components (nodes, analogous to cortical regions; Huth et al., 2016) and dynamic signal propagation (analogous to electrophysiological activity) driving cognition. This enables three key capabilities absent in conventional LLMs: full interpretability across all components, decoupling of model size from sequence length, and inherent multi-modal compatibility (since nodes can represent any semantic unit, not just language).
Our initial 1–2B parameter models validate the design of BriLLM: they replicate GPT-1's core generative capability (sequence continuation) with stable learning dynamics, despite being engineered to target GPT-3-level performance. The limitations reflect early-stage development (sub-scale size, sparse training still being refined) rather than fundamental flaws. Additionally, while BriLLM theoretically handles infinite sequences, practical performance on very long sequences requires extended training on longer samples, consistent with the brain's need for experience to develop long-term reasoning.
Future work will: (1) scale to larger checkpoints to test emergent abilities; (2) add multi-modal nodes for cross-modal processing; (3) refine signaling to mimic neural plasticity; and (4) develop embodied versions with sensorimotor integration.
Table 5 summarizes BriLLM's advantages over conventional LLMs, highlighting its breakthrough in replicating the brain's global properties. By redefining language modeling as a simulation of the brain's macroscopic mechanisms, BriLLM paves the way for AGI rooted in the principles of biological intelligence.