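Linux: vmlinux section layout and an analysis of the arm64 linker script vmlinux.lds.S
1. Introduction
Kernel engineers run into the vmlinux ELF file all the time, and anyone who is not familiar with it can hardly be called a competent kernel engineer. While analyzing the definition and memory layout of the interrupt vector table, I realized that I had never written an article about the layout of the kernel's vmlinux ELF. This article describes in detail how the vmlinux ELF is produced and how its sections are laid out.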

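2. ELF layout
vmlinux is an ELF file, so to understand how vmlinux is started you first need to know the ELF format.
- text section
The code section: the memory region that holds the program's executable code. Its size is fixed before the program runs.
- data section
The data section: the memory region that holds the program's initialized global variables. It is statically allocated.
- bss section
The memory region that holds the program's uninitialized global and static variables. It is also statically allocated.
- init section
A Linux-specific kind of section that is only used during initialization. Once initialization completes, the memory occupied by these sections is freed; more on this later.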
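The four categories above can be told apart from the type and flags that readelf prints for each section. The following is a minimal sketch, not a kernel tool: the function name `classify` and the category labels are made up for illustration, and treating every `.init*` section as "init" is a simplifying assumption.

```python
def classify(name, sec_type, flags):
    """Map a section (as printed by `readelf -S`) to one of the rough
    categories described above. Hypothetical helper for illustration."""
    if "A" not in flags:
        return "not loaded"       # no ALLOC flag: debug info, symbol tables
    if name.startswith(".init"):
        return "init"             # only needed while the kernel initializes
    if sec_type == "NOBITS":
        return "bss"              # occupies memory but no bytes in the file
    if "X" in flags:
        return "text"             # executable code
    if "W" in flags:
        return "data"             # writable, initialized data
    return "read-only data"


# (name, type, flags) triples in the form readelf reports them.
SAMPLE = [
    (".text", "PROGBITS", "AX"),
    (".data", "PROGBITS", "WA"),
    (".bss", "NOBITS", "WA"),
    (".init.text", "PROGBITS", "AX"),
    (".debug_info", "PROGBITS", ""),
]

for name, sec_type, flags in SAMPLE:
    print(f"{name:12} -> {classify(name, sec_type, flags)}")
```

The order of the checks matters: `.init.text` is executable too, so the name test has to come before the flag tests.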
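3. vmlinux entry point: the first code to run
When Linux boots, what runs is the vmlinux image produced by the kernel build. vmlinux is an ELF file linked according to the rules in ./arch/arm64/kernel/vmlinux.lds, and vmlinux.lds is generated by preprocessing vmlinux.lds.S. To determine the start address of the vmlinux kernel, we therefore analyze the vmlinux.lds.S linker script. First, the ELF header and program headers: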
$ readelf -h vmlinux
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              DYN (Shared object file)
  Machine:                           AArch64
  Version:                           0x1
  Entry point address:               0xffff800010000000
  Start of program headers:          64 (bytes into file)
  Start of section headers:          494679672 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         5
  Size of section headers:           64 (bytes)
  Number of section headers:         38
  Section header string table index: 37
$ readelf -l vmlinux

Elf file type is DYN (Shared object file)
Entry point 0xffff800010000000
There are 5 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000010000 0xffff800010000000 0xffff800010000000
                 0x0000000001beacdc 0x0000000001beacdc  RWE    10000
  LOAD           0x0000000001c00000 0xffff800011c00000 0xffff800011c00000
                 0x00000000000c899c 0x00000000000c899c  R E    10000
  LOAD           0x0000000001cd0000 0xffff800011cd0000 0xffff800011cd0000
                 0x0000000000876200 0x0000000000905794  RW     10000
  NOTE           0x0000000001bfaca0 0xffff800011beaca0 0xffff800011beaca0
                 0x000000000000003c 0x000000000000003c  R      4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     10

 Section to Segment mapping:
  Segment Sections...
   00     .head.text .text .got.plt .rodata .pci_fixup __ksymtab __ksymtab_gpl __ksymtab_strings __param __modver __ex_table .notes
   01     .init.text .exit.text .altinstructions
   02     .init.data .data..percpu .hyp.data..percpu .rela.dyn .data __bug_table .mmuoff.data.write .mmuoff.data.read .pecoff_edata_padding .bss
   03     .notes
   04
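The output above shows that this vmlinux is an ELF file for the AArch64 architecture and that its entry point address is 0xffff800010000000. That address is the start of the first LOAD segment, which contains the sections .head.text .text .got.plt ..., so the entry point of vmlinux lies in the .head.text section.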
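The lookup just described (which segment holds the entry point, and where that address sits in the file) can be reproduced with a few lines of arithmetic. This is a small sketch that uses only the three LOAD entries from the readelf -l output above; the helper names `segment_index` and `file_offset` are illustrative.

```python
# (file offset, virtual address, FileSiz, MemSiz, flags) of the three
# LOAD segments, copied from the `readelf -l vmlinux` output above.
LOAD_SEGMENTS = [
    (0x0000000000010000, 0xffff800010000000, 0x1beacdc, 0x1beacdc, "RWE"),
    (0x0000000001c00000, 0xffff800011c00000, 0x00c899c, 0x00c899c, "R E"),
    (0x0000000001cd0000, 0xffff800011cd0000, 0x0876200, 0x0905794, "RW"),
]
ENTRY = 0xffff800010000000  # "Entry point address" from `readelf -h`


def segment_index(addr):
    """Return the index of the LOAD segment whose memory range holds addr."""
    for i, (_, vaddr, _, memsz, _) in enumerate(LOAD_SEGMENTS):
        if vaddr <= addr < vaddr + memsz:
            return i
    return None


def file_offset(addr):
    """Translate a virtual address into an offset inside the vmlinux file."""
    i = segment_index(addr)
    if i is None:
        raise ValueError("address is not inside any LOAD segment")
    offset, vaddr, filesz, _, _ = LOAD_SEGMENTS[i]
    delta = addr - vaddr
    if delta >= filesz:
        raise ValueError("address has no file backing (for example .bss)")
    return offset + delta


print(segment_index(ENTRY))          # index of the segment holding the entry
print(hex(file_offset(ENTRY)))       # where the first instruction is stored

# The last LOAD segment is larger in memory than in the file; the
# difference is the part that is not stored in the file, such as .bss.
_, _, filesz, memsz, _ = LOAD_SEGMENTS[2]
print(hex(memsz - filesz))
```

The entry point falls into segment 00, at the very start of that segment's file data, which is consistent with .head.text being the first section listed for it.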
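To look at the individual sections in detail, run readelf -S vmlinux. (Judging by the addresses, the listing below comes from a different build than the output above: it starts at 0xffffffc010080000 rather than 0xffff800010000000.)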
There are 38 section headers, starting at offset 0xd7dc238:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .head.text        PROGBITS         ffffffc010080000  00010000
       0000000000001000  0000000000000000  AX       0     0     4096
  [ 2] .text             PROGBITS         ffffffc010081000  00011000
       00000000008dcfd8  0000000000000008  AX       0     0     2048
  [ 3] .rodata           PROGBITS         ffffffc010960000  008f0000
       00000000003032b8  0000000000000000  WA       0     0     65536
  [ 4] .modinfo          PROGBITS         ffffffc010c632b8  00bf32b8
       00000000000115d2  0000000000000000   A       0     0     1
  [ 5] .init.eh_frame    PROGBITS         ffffffc010c74890  00c04890
       0000000000001bb0  0000000000000000   A       0     0     8
  [ 6] .pci_fixup        PROGBITS         ffffffc010c76440  00c06440
       00000000000022b0  0000000000000000   A       0     0     16
  [ 7] __ksymtab         PROGBITS         ffffffc010c786f0  00c086f0
       000000000000d500  0000000000000000   A       0     0     4
  [ 8] __ksymtab_gpl     PROGBITS         ffffffc010c85bf0  00c15bf0
       000000000000da04  0000000000000000   A       0     0     4
  [ 9] __ksymtab_strings PROGBITS         ffffffc010c935f4  00c235f4
       000000000002c292  0000000000000000   A       0     0     1
  [10] __param           PROGBITS         ffffffc010cbf888  00c4f888
       0000000000002710  0000000000000000   A       0     0     8
  [11] __modver          PROGBITS         ffffffc010cc1f98  00c51f98
       00000000000000c0  0000000000000000   A       0     0     8
  [12] __ex_table        PROGBITS         ffffffc010cc3000  00c53000
       0000000000001f10  0000000000000000   A       0     0     8
  [13] .notes            NOTE             ffffffc010cc4f10  00c54f10
       000000000000003c  0000000000000000   A       0     0     4
  [14] .init.text        PROGBITS         ffffffc010cd0000  00c60000
       0000000000049180  0000000000000000  AX       0     0     16
  [15] .exit.text        PROGBITS         ffffffc010d19180  00ca9180
       0000000000002e1c  0000000000000000  AX       0     0     4
  [16] .altinstructions  PROGBITS         ffffffc010d1bf9c  00cabf9c
       0000000000024bc4  0000000000000000   A       0     0     1
  [17] .init.data        PROGBITS         ffffffc010d41000  00cd1000
       00000000000169b0  0000000000000000  WA       0     0     256
  [18] .data..percpu     PROGBITS         ffffffc010d58000  00ce8000
       000000000000c258  0000000000000000  WA       0     0     64
  [19] .rela.dyn         RELA             ffffffc010d64258  00cf4258
       0000000000100050  0000000000000018   A       0     0     8
  [20] .data             PROGBITS         ffffffc010e70000  00e00000
       000000000008f640  0000000000000000  WA       0     0     4096
  [21] .got.plt          PROGBITS         ffffffc010eff640  00e8f640
       0000000000000018  0000000000000008  WA       0     0     8
  [22] __bug_table       PROGBITS         ffffffc010eff658  00e8f658
       0000000000010824  0000000000000000  WA       0     0     4
  [23] .mmuoff.data[...] PROGBITS         ffffffc010f10000  00ea0000
       0000000000000018  0000000000000000  WA       0     0     2048
  [24] .mmuoff.data.read PROGBITS         ffffffc010f10800  00ea0800
       0000000000000008  0000000000000000  WA       0     0     8
  [25] .pecoff_edat[...] PROGBITS         ffffffc010f10808  00ea0808
       00000000000001f8  0000000000000000  WA       0     0     1
  [26] .bss              NOBITS           ffffffc010f11000  00ea0a00
       0000000000060628  0000000000000000  WA       0     0     4096
  [27] .comment          PROGBITS         0000000000000000  00ea0a00
       000000000000005a  0000000000000001  MS       0     0     1
  [28] .debug_line       PROGBITS         0000000000000000  00ea0a5a
       00000000011db9fa  0000000000000000           0     0     1
  [29] .debug_info       PROGBITS         0000000000000000  0207c454
       0000000009285ce2  0000000000000000           0     0     1
  [30] .debug_abbrev     PROGBITS         0000000000000000  0b302136
       000000000048bc4e  0000000000000000           0     0     1
  [31] .debug_aranges    PROGBITS         0000000000000000  0b78dd90
       0000000000021cc0  0000000000000000           0     0     16
  [32] .debug_str        PROGBITS         0000000000000000  0b7afa50
       0000000000291d0d  0000000000000001  MS       0     0     1
  [33] .debug_ranges     PROGBITS         0000000000000000  0ba41760
       0000000000ca18c0  0000000000000000           0     0     16
  [34] .debug_loc        PROGBITS         0000000000000000  0c6e3020
       0000000000d6e817  0000000000000000           0     0     1
  [35] .symtab           SYMTAB           0000000000000000  0d451838
       0000000000230d90  0000000000000018          36 76147     8
  [36] .strtab           STRTAB           0000000000000000  0d6825c8
       0000000000159acb  0000000000000000           0     0     1
  [37] .shstrtab         STRTAB           0000000000000000  0d7dc093
       000000000000019f  0000000000000000           0     0     1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  p (processor specific)
One thing to note: .head.text is 0x1000 bytes, exactly one 4 KB page. The exception code follows immediately after it.
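That .text starts right where .head.text ends can be checked by adding each section's size to its address. The sketch below uses the first three rows of the readelf -S listing; the function name `gaps` is illustrative.

```python
# (name, address, size) for the first three sections in the
# `readelf -S vmlinux` listing above.
SECTIONS = [
    (".head.text", 0xffffffc010080000, 0x1000),
    (".text",      0xffffffc010081000, 0x8dcfd8),
    (".rodata",    0xffffffc010960000, 0x3032b8),
]


def gaps(sections):
    """Return (section, next section, padding bytes between them)."""
    result = []
    for (name, addr, size), (next_name, next_addr, _) in zip(sections, sections[1:]):
        result.append((name, next_name, next_addr - (addr + size)))
    return result


for name, next_name, pad in gaps(SECTIONS):
    print(f"{name} -> {next_name}: {pad:#x} bytes of padding")
```

There is no padding between .head.text and .text, while .rodata is pushed up to its 64 KB alignment (the Align column shows 65536), which leaves a small gap after .text.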
3.1 The .head.text section
We use vmlinux.lds.S to find the entry function of vmlinux. The analysis goes as follows:
/* SPDX-License-Identifier: GPL-2.0 */
/*
 * ld script to make ARM Linux kernel
 * taken from the i386 version by Russell King
 * Written by Martin Mares <mj@atrey.karlin.mff.cuni.cz>
 */

#define RO_EXCEPTION_TABLE_ALIGN	8
#define RUNTIME_DISCARD_EXIT

#include <asm-generic/vmlinux.lds.h>
#include <asm/cache.h>
#include <asm/hyp_image.h>
#include <asm/kernel-pgtable.h>
#include <asm/memory.h>
#include <asm/page.h>

#include "image.h"

OUTPUT_ARCH(aarch64)
ENTRY(_text)
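In linker-script syntax, the OUTPUT_ARCH keyword specifies that the architecture of the linked output file is aarch64. The ENTRY keyword sets the entry address of the output file vmlinux to _text, so finding the definition of _text tells us where vmlinux is entered. The script continues: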
SECTIONS
{
	/*
	 * XXX: The linker does not define how output sections are
	 * assigned to input sections when there are multiple statements
	 * matching the same input section name. There is no documented
	 * order of matching.
	 */
	/DISCARD/ : {
		ARM_EXIT_DISCARD(EXIT_TEXT)
		ARM_EXIT_DISCARD(EXIT_DATA)
		EXIT_CALL
		*(.discard)
		*(.discard.*)
		*(.interp .dynamic)
		*(.dynsym .dynstr .hash .gnu.hash)
		*(.eh_frame)
	}

	. = KIMAGE_VADDR + TEXT_OFFSET;	/* start address of the section that follows */

	.head.text : {
		_text = .;		/* _text takes the start address set above */
		HEAD_TEXT
	}
	.text : {			/* Real text segment		*/
		_stext = .;		/* Text and read-only data	*/
			__exception_text_start = .;
			*(.exception.text)
			__exception_text_end = .;
			IRQENTRY_TEXT
			SOFTIRQENTRY_TEXT
			ENTRY_TEXT
			TEXT_TEXT
			SCHED_TEXT
			CPUIDLE_TEXT
			LOCK_TEXT
			KPROBES_TEXT
			HYPERVISOR_TEXT
			IDMAP_TEXT
			HIBERNATE_TEXT
			TRAMP_TEXT
			*(.fixup)
			*(.gnu.warning)
		. = ALIGN(16);
		*(.got)			/* Global offset table		*/
	}

	. = ALIGN(SEGMENT_ALIGN);
	_etext = .;			/* End of text section */

	RO_DATA(PAGE_SIZE)		/* everything from this point to     */
	EXCEPTION_TABLE(8)		/* __init_begin will be marked RO NX */
	NOTES

	. = ALIGN(PAGE_SIZE);
	idmap_pg_dir = .;
	. += IDMAP_DIR_SIZE;
	idmap_pg_end = .;

#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
	tramp_pg_dir = .;
	. += PAGE_SIZE;
#endif

	reserved_pg_dir = .;
	. += PAGE_SIZE;

	swapper_pg_dir = .;
	. += PAGE_SIZE;

	. = ALIGN(SEGMENT_ALIGN);
	__init_begin = .;
	__inittext_begin = .;

	INIT_TEXT_SECTION(8)

	__exittext_begin = .;
	.exit.text : {
		ARM_EXIT_KEEP(EXIT_TEXT)
	}
	__exittext_end = .;

	. = ALIGN(4);
	.altinstructions : {
		__alt_instructions = .;
		*(.altinstructions)
		__alt_instructions_end = .;
	}

	. = ALIGN(PAGE_SIZE);
	__inittext_end = .;
	__initdata_begin = .;

	.init.data : {
		INIT_DATA
		INIT_SETUP(16)
		INIT_CALLS
		CON_INITCALL
		INIT_RAM_FS
		*(.init.rodata.* .init.bss)	/* from the EFI stub */
	}
	.exit.data : {
		ARM_EXIT_KEEP(EXIT_DATA)
	}

	PERCPU_SECTION(L1_CACHE_BYTES)

	.rela.dyn : ALIGN(8) {
		*(.rela .rela*)
	}

	__rela_offset	= ABSOLUTE(ADDR(.rela.dyn) - KIMAGE_VADDR);
	__rela_size	= SIZEOF(.rela.dyn);

#ifdef CONFIG_RELR
	.relr.dyn : ALIGN(8) {
		*(.relr.dyn)
	}

	__relr_offset	= ABSOLUTE(ADDR(.relr.dyn) - KIMAGE_VADDR);
	__relr_size	= SIZEOF(.relr.dyn);
#endif

	. = ALIGN(SEGMENT_ALIGN);
	__initdata_end = .;
	__init_end = .;

	_data = .;
	_sdata = .;
	RW_DATA_SECTION(L1_CACHE_BYTES, PAGE_SIZE, THREAD_ALIGN)

	/*
	 * Data written with the MMU off but read with the MMU on requires
	 * cache lines to be invalidated, discarding up to a Cache Writeback
	 * Granule (CWG) of data from the cache. Keep the section that
	 * requires this type of maintenance to be in its own Cache Writeback
	 * Granule (CWG) area so the cache maintenance operations don't
	 * interfere with adjacent data.
	 */
	.mmuoff.data.write : ALIGN(SZ_2K) {
		__mmuoff_data_start = .;
		*(.mmuoff.data.write)
	}
	. = ALIGN(SZ_2K);
	.mmuoff.data.read : {
		*(.mmuoff.data.read)
		__mmuoff_data_end = .;
	}

	PECOFF_EDATA_PADDING
	__pecoff_data_rawsize = ABSOLUTE(. - __initdata_begin);
	_edata = .;

	BSS_SECTION(0, 0, 0)

	. = ALIGN(PAGE_SIZE);
	init_pg_dir = .;
	. += INIT_DIR_SIZE;
	init_pg_end = .;

	__pecoff_data_size = ABSOLUTE(. - __initdata_begin);
	_end = .;

	STABS_DEBUG

	HEAD_SYMBOLS
}
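Most of the script manipulates the location counter `.`: a symbol assignment such as `idmap_pg_dir = .;` records the current address, `. += SIZE;` reserves space, and `. = ALIGN(N);` rounds the address up to a multiple of N. The following sketch models just those three operations for the page-table symbols. The starting address, the 4 KB `PAGE_SIZE`, and the three-page `IDMAP_DIR_SIZE` are assumed values for illustration, not taken from a real build.

```python
class LocationCounter:
    """Toy model of the linker's location counter ('.')."""

    def __init__(self, start):
        self.dot = start
        self.symbols = {}

    def align(self, boundary):
        # ALIGN(n): round up to the next multiple of n (n is a power of two).
        self.dot = (self.dot + boundary - 1) & ~(boundary - 1)

    def define(self, name):
        # `name = .;` records the current address under a symbol name.
        self.symbols[name] = self.dot

    def advance(self, size):
        # `. += size;` reserves space without placing any input section.
        self.dot += size


PAGE_SIZE = 0x1000                # assumption: 4 KB pages
IDMAP_DIR_SIZE = 3 * PAGE_SIZE    # assumption: depends on the configuration

lc = LocationCounter(0xffff800010c54f4c)   # hypothetical address after NOTES
lc.align(PAGE_SIZE)
lc.define("idmap_pg_dir")
lc.advance(IDMAP_DIR_SIZE)
lc.define("idmap_pg_end")
lc.define("reserved_pg_dir")
lc.advance(PAGE_SIZE)
lc.define("swapper_pg_dir")
lc.advance(PAGE_SIZE)

for name, addr in lc.symbols.items():
    print(f"{name:16} {addr:#x}")
```

Because the symbols only record addresses, `idmap_pg_end` and `reserved_pg_dir` end up with the same value: nothing advances the counter between the two assignments.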
A simplified view of the script above:

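The simplified diagram above matches the sections reported by readelf -S vmlinux.
- The HEAD_TEXT macro in the diagram is defined in include/asm-generic/vmlinux.lds.h; it expands to the .head.text section.
- idmap_pg_dir and init_pg_dir in the diagram are page tables: idmap_pg_dir is used for the identity mapping, and init_pg_dir for the kernel image mapping.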
/* include/asm-generic/vmlinux.lds.h */
#define HEAD_TEXT  KEEP(*(.head.text))

/* include/linux/init.h */
#define __HEAD		.section	".head.text","ax"
That is how the HEAD_TEXT macro is defined, and it leads us to __HEAD. We therefore continue in arch/arm64/kernel/head.S.
	__HEAD
_head:
	/*
	 * DO NOT MODIFY. Image header expected by Linux boot-loaders.
	 */
#ifdef CONFIG_EFI
	/*
	 * This add instruction has no meaningful effect except that
	 * its opcode forms the magic "MZ" signature required by UEFI.
	 */
	add	x13, x18, #0x16
	b	primary_entry
#else
	b	primary_entry			// branch to kernel start, magic
	.long	0				// reserved
#endif
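The comment about the "MZ" signature can be checked by encoding the instruction by hand. The sketch below assembles `add x13, x18, #0x16` from the A64 ADD (immediate) field layout as I understand it (sf in bit 31, the fixed opcode bits 100010 in bits 28..23, imm12 in bits 21..10, Rn in bits 9..5, Rd in bits 4..0); the helper name is illustrative.

```python
import struct


def encode_add_imm64(rd, rn, imm12):
    """Encode the A64 instruction `add Xd, Xn, #imm12` (64-bit, no shift)."""
    if not (0 <= rd < 32 and 0 <= rn < 32 and 0 <= imm12 < 4096):
        raise ValueError("operand out of range")
    return (1 << 31) | (0b100010 << 23) | (imm12 << 10) | (rn << 5) | rd


word = encode_add_imm64(rd=13, rn=18, imm12=0x16)
raw = struct.pack("<I", word)      # instructions are stored little-endian

print(hex(word))
print(raw[:2])                     # the first two bytes of the image
```

The register numbers and the immediate were evidently chosen so that the two low-order bytes of the instruction word spell the ASCII characters a PE/COFF loader expects at offset 0.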
3.2 primary_entry
This is where the actual initialization flow begins.
SYM_CODE_START(primary_entry)
	bl	preserve_boot_args
	bl	el2_setup			// Drop to EL1, w0=cpu_boot_mode
	adrp	x23, __PHYS_OFFSET
	and	x23, x23, MIN_KIMG_ALIGN - 1	// KASLR offset, defaults to 0
	bl	set_cpu_boot_mode_flag
	bl	__create_page_tables
	/*
	 * The following calls CPU setup code, see arch/arm64/mm/proc.S for
	 * details.
	 * On return, the CPU will be ready for the MMU to be turned on and
	 * the TCR will have been set.
	 */
	bl	__cpu_setup			// initialise processor
	b	__primary_switch
SYM_CODE_END(primary_entry)
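preserve_boot_args saves the arguments passed in by the bootloader and invalidates the dcache for the area where they are stored.
el2_setup sets up the state the core boots in: it drops to EL1 and returns the boot mode in w0.
set_cpu_boot_mode_flag records the exception level (EL) at which the core booted.
__create_page_tables creates the page tables.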
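The two instructions that set x23 in primary_entry keep only the low bits of the physical address the image was loaded at. A minimal sketch of that masking follows; it assumes `MIN_KIMG_ALIGN` is 2 MB (check the kernel headers for the tree you are reading), and the load addresses are made up.

```python
SZ_2M = 0x200000
MIN_KIMG_ALIGN = SZ_2M   # assumption: minimum image alignment of 2 MB


def initial_x23(phys_load_addr):
    """Model of `and x23, x23, MIN_KIMG_ALIGN - 1`: the offset of the
    load address within its MIN_KIMG_ALIGN-sized block."""
    return phys_load_addr & (MIN_KIMG_ALIGN - 1)


# Hypothetical physical load addresses chosen by a bootloader.
print(hex(initial_x23(0x40200000)))   # aligned to 2 MB
print(hex(initial_x23(0x40280000)))   # 512 KB past a 2 MB boundary
```

For an image loaded on a 2 MB boundary the result is zero, which matches the "defaults to 0" remark in the source comment.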
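As noted above, idmap_pg_dir is the page table used for the identity mapping and init_pg_dir is the one used for the kernel image mapping. __create_page_tables fills in both of them. (For the implementation details, see my articles on memory management.)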
SYM_FUNC_START_LOCAL(__create_page_tables)
	mov	x28, lr
	......
	/*
	 * Create the identity mapping.
	 */
	adrp	x0, idmap_pg_dir
	adrp	x3, __idmap_text_start		// __pa(__idmap_text_start)
	......
	adrp	x5, __idmap_text_end
	......
	/*
	 * Map the kernel image (starting with PHYS_OFFSET).
	 */
	adrp	x0, init_pg_dir
	mov_q	x5, KIMAGE_VADDR		// compile time __va(_text)
	add	x5, x5, x23			// add KASLR displacement
	mov	x4, PTRS_PER_PGD
	adrp	x6, _end			// runtime __pa(_end)
	adrp	x3, _text			// runtime __pa(_text)
	sub	x6, x6, x3			// _end - _text
	add	x6, x6, x5			// runtime __va(_end)
	......
SYM_FUNC_END(__create_page_tables)
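The register arithmetic in the kernel-image part of __create_page_tables computes the virtual range to map from the physical addresses the code can observe while the MMU is off. The sketch below repeats that arithmetic step by step; every address in the example call is hypothetical.

```python
def kernel_va_range(kimage_vaddr, kaslr_disp, pa_text, pa_end):
    """Mirror the x3/x5/x6 computation shown in the listing above."""
    x5 = kimage_vaddr          # mov_q x5, KIMAGE_VADDR  (compile-time __va(_text))
    x5 = x5 + kaslr_disp       # add   x5, x5, x23       (KASLR displacement)
    x6 = pa_end                # adrp  x6, _end          (runtime __pa(_end))
    x3 = pa_text               # adrp  x3, _text         (runtime __pa(_text))
    x6 = x6 - x3               # sub   x6, x6, x3        (image size)
    x6 = x6 + x5               # add   x6, x6, x5        (runtime __va(_end))
    return x5, x6


# Hypothetical values: image loaded at physical 0x40200000, 34 MB long.
va_start, va_end = kernel_va_range(
    kimage_vaddr=0xffff800010000000,
    kaslr_disp=0,
    pa_text=0x40200000,
    pa_end=0x42400000,
)
print(hex(va_start), hex(va_end))
```

The point of the subtraction is that the image size is the same in both address spaces, so it can be measured physically and then added to the virtual start.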

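Here is a question left for the reader to look into: why does .idmap.text need an identity mapping?
The layout of the kernel image's sections is the virtual address layout defined by the linker script. The physical addresses are determined by where the bootloader loads the image into memory. Once the MMU is on, the CPU issues virtual addresses, and the MMU translates them to the binary sitting in physical memory.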
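3.3 __cpu_setup: initializing the CPU
This function performs the CPU initialization needed before the MMU can be turned on. Everything up to this point runs with the MMU off.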
SYM_FUNC_START(__cpu_setup)
	tlbi	vmalle1				// Invalidate local TLB
	dsb	nsh

	mov	x1, #3 << 20
	msr	cpacr_el1, x1			// Enable FP/ASIMD
	mov	x1, #1 << 12			// Reset mdscr_el1 and disable
	msr	mdscr_el1, x1			// access to the DCC from EL0
	isb					// Unmask debug exceptions now,
	enable_dbg				// since this is per-cpu
	reset_pmuserenr_el0 x1			// Disable PMU access from EL0
	reset_amuserenr_el0 x1			// Disable AMU access from EL0

	/*
	 * Memory region attributes
	 */
	mov_q	x5, MAIR_EL1_SET
The first part initializes the TLB, FP/ASIMD, DCC, PMU and AMU state; the rest sets up the memory region attributes.
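The two immediates written to system registers in __cpu_setup are plain bit-field values. The sketch below decodes them, assuming the field positions from the Arm architecture as I understand them: FPEN in CPACR_EL1 bits [21:20] and TDCC in MDSCR_EL1 bit 12. The constant names are this sketch's own.

```python
CPACR_EL1_FPEN_SHIFT = 20   # assumption: FPEN occupies bits [21:20]
MDSCR_EL1_TDCC_BIT = 12     # assumption: TDCC is bit 12

cpacr_el1 = 3 << 20         # value from `mov x1, #3 << 20`
mdscr_el1 = 1 << 12         # value from `mov x1, #1 << 12`

fpen = (cpacr_el1 >> CPACR_EL1_FPEN_SHIFT) & 0b11
tdcc = (mdscr_el1 >> MDSCR_EL1_TDCC_BIT) & 1

print(f"cpacr_el1 = {cpacr_el1:#x}, FPEN = {fpen:#04b}")
print(f"mdscr_el1 = {mdscr_el1:#x}, TDCC = {tdcc}")
```

Under these assumptions, FPEN = 0b11 means FP/ASIMD instructions are not trapped, and TDCC = 1 traps EL0 accesses to the debug communication channel, which agrees with the comments in the listing.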
3.4 __primary_switch: enabling the MMU
This function switches to virtual addresses and calls __primary_switched.
3.5 __primary_switched
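__primary_switched mainly does the following:
- sets up the stack address and size for the init task and saves the address of the current task descriptor in sp_el0;
- sets the exception vector table base address register;
- saves the FDT address in the __fdt_pointer variable;
- saves the offset between the kernel image's virtual and physical addresses in kimage_voffset;
- clears the bss;
- jumps to start_kernel.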
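The kimage_voffset step in the list above is a single subtraction, and the saved offset is what later lets the kernel convert between image virtual addresses and physical addresses. A minimal sketch with hypothetical addresses follows; the function names are illustrative, not kernel APIs.

```python
def compute_kimage_voffset(va_text, pa_text):
    """Offset between where the image is mapped and where it was loaded."""
    return va_text - pa_text


def kimg_to_phys(vaddr, voffset):
    """Image virtual address -> physical address."""
    return vaddr - voffset


def phys_to_kimg(paddr, voffset):
    """Physical address inside the image -> image virtual address."""
    return paddr + voffset


# Hypothetical: image mapped at 0xffff800010000000, loaded at 0x40200000.
voffset = compute_kimage_voffset(0xffff800010000000, 0x40200000)
print(hex(voffset))
print(hex(kimg_to_phys(0xffff800010001000, voffset)))
```

One offset is enough because the whole image is loaded contiguously and mapped linearly: every byte of it has the same virtual-to-physical distance.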
3.6 Summary in one diagram:

