The code analysis in this article is based on linux-5.19.13, on the aarch64 (ARM64) architecture.
The page table analysis assumes the following configuration:
(1) 4 levels of page table mapping, i.e. CONFIG_PGTABLE_LEVELS=4;
(2) a 48-bit virtual address width, i.e. CONFIG_ARM64_VA_BITS=48;
(3) a 48-bit physical address width, i.e. CONFIG_ARM64_PA_BITS=48.

1. Entry analysis

1.1 Linker script arch/arm64/kernel/vmlinux.lds.S

  Only the definitions related to memory initialization are listed here; everything else is elided with "……".

......
OUTPUT_ARCH(aarch64)    /* specify aarch64 as the output machine architecture */
ENTRY(_text)            /* set the entry point, implemented in arch/arm64/kernel/head.S */
......
SECTIONS
{
        ......
        /* In 5.8 it was found that TEXT_OFFSET served no purpose, so it was redefined as 0x0 */
        . = KIMAGE_VADDR;       /* start virtual address of the kernel image
                                   (before 5.8 this was KIMAGE_VADDR + TEXT_OFFSET) */
        .head.text : {          /* text section of the early assembly code */
                _text = .;      /* entry address */
                HEAD_TEXT       /* defined in include/asm-generic/vmlinux.lds.h:
                                   #define HEAD_TEXT  KEEP(*(.head.text)) */
        }
        .text : ALIGN(SEGMENT_ALIGN) {  /* Real text segment */
                _stext = .;             /* Text and read-only data: start of the text section */
                ......
        }
        ......
        . = ALIGN(SEGMENT_ALIGN);
        _etext = .;                     /* End of text section */
        /* everything from this point to __init_begin will be marked RO NX */
        RO_DATA(PAGE_SIZE)              /* read-only data section */
        ......
        idmap_pg_dir = .;               /* first-level page table of the identity mapping */
        . += IDMAP_DIR_SIZE;
        idmap_pg_end = .;
#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
        tramp_pg_dir = .;               /* introduced for the Meltdown security vulnerability */
        . += PAGE_SIZE;
#endif
        reserved_pg_dir = .;
        . += PAGE_SIZE;
        swapper_pg_dir = .;
        . += PAGE_SIZE;
        . = ALIGN(SEGMENT_ALIGN);
        __init_begin = .;               /* start of the init section */
        __inittext_begin = .;
        ......
        . = ALIGN(SEGMENT_ALIGN);
        __initdata_end = .;
        __init_end = .;                 /* end of the init section */
        _data = .;
        _sdata = .;                     /* start of the data section */
        RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_ALIGN)
        _edata = .;                     /* end of the data section */
        BSS_SECTION(SBSS_ALIGN, 0, 0)   /* BSS section */
        . = ALIGN(PAGE_SIZE);
        init_pg_dir = .;
        . += INIT_DIR_SIZE;
        init_pg_end = .;
        ......
}

1.2 Entry point

#arch/arm64/kernel/head.S
/*
 * Kernel startup entry point.
 * ---------------------------
 *
 * The requirements are:
 *   MMU = off, D-cache = off, I-cache = on or off,
 *   x0 = physical address to the FDT blob.
 *
 * This code is mostly position independent so you call this at
 * __pa(PAGE_OFFSET).
 *
 * Note that the callee-saved registers are used for storing variables
 * that are useful before the MMU is enabled. The allocations are described
 * in the entry routines.
 */
__HEAD          /* defined in include/linux/init.h as
                   '#define __HEAD .section ".head.text","ax"', placed right after _text */
        /*
         * DO NOT MODIFY. Image header expected by Linux boot-loaders.
         */
        efi_signature_nop       // special NOP to identify as PE/COFF executable
        b       primary_entry   // branch to kernel start, magic
                                /* the startup assembly we will focus on */
        ......

1.3 Calling convention for booting AArch64 Linux

  From power-on until execution reaches the kernel entry point _text, the system is brought up by a bootloader or BIOS. The boot firmware initializes memory, sets up the device tree, decompresses the kernel, and finally jumps to it. Before jumping into the kernel, a standard set of conventions must be honored; see Documentation/translations/zh_CN/arm64/booting.txt. The states that must hold before jumping into the kernel are listed below:

Before jumping into the kernel, the following states must hold:

- Quiesce all DMA-capable devices so that memory does not get corrupted by
  bogus network packets or disk data. This will save you many hours of debugging.

- Primary CPU general-purpose register settings:
  x0 = physical address of the device tree blob (dtb) in system RAM
  x1 = 0 (reserved for future use)
  x2 = 0 (reserved for future use)
  x3 = 0 (reserved for future use)

- CPU mode
  All forms of interrupts must be masked in PSTATE.DAIF (Debug, SError, IRQ
  and FIQ). The CPU must be in either EL2 (recommended, for access to the
  virtualisation extensions) or non-secure EL1. (The bootloader performs this switch.)

- Caches, MMU
  The MMU must be off. (MMU off; the instruction cache may generally stay on,
  the data cache must be off.)
  The instruction cache may be on or off.
  The address range corresponding to the loaded kernel image must be cleaned
  to the Point of Coherency (PoC). In the presence of a system cache or other
  coherent masters with caches enabled, this will typically require cache
  maintenance by virtual address rather than set/way operations.
  System caches which respect the architected cache maintenance by VA
  operations must be configured and may be enabled.
  System caches which do not respect architected cache maintenance by VA
  operations (not recommended) must be configured and disabled.
  * Translator's note: for PoC and other cache-related content, refer to the
    ARMv8 Architecture Reference Manual, ARM DDI 0487A.

- Architected timers
  CNTFRQ must be programmed with the timer frequency, and CNTVOFF must be
  programmed with a consistent value on all CPUs. If entering the kernel at
  EL1, CNTHCTL_EL2 must have EL1PCTEN (bit 0) set.

- Coherency
  All CPUs to be booted by the kernel must be part of the same coherency
  domain on entry to the kernel. This may require IMPLEMENTATION DEFINED
  initialisation to enable the receiving of maintenance operations on each CPU.

- System registers
  All writable architected system registers at the exception level where the
  kernel image will be entered must be initialised by software at a higher
  exception level to prevent execution in an UNKNOWN state.

  For systems with a GICv3 interrupt controller to be used in v3 mode:
  - If EL3 is present:
      ICC_SRE_EL3.Enable (bit 3) must be initialised to 0b1.
      ICC_SRE_EL3.SRE (bit 0) must be initialised to 0b1.
  - If the kernel is entered at EL1:
      ICC_SRE_EL2.Enable (bit 3) must be initialised to 0b1.
      ICC_SRE_EL2.SRE (bit 0) must be initialised to 0b1.
  - The device tree (DT) or ACPI tables must describe a GICv3 interrupt controller.

  For systems with a GICv3 interrupt controller to be used in compatibility (v2) mode:
  - If EL3 is present:
      ICC_SRE_EL3.SRE (bit 0) must be initialised to 0b0.
  - If the kernel is entered at EL1:
      ICC_SRE_EL2.SRE (bit 0) must be initialised to 0b0.
  - The device tree (DT) or ACPI tables must describe a GICv2 interrupt controller.

A key question here: why may the instruction cache be on when jumping to the kernel, while the data cache must be off?
(1) When the CPU fetches data it first looks in the data cache, which may still hold the bootloader's data; that data may be wrong from the kernel's point of view. Therefore the data cache must be off.
(2) The bootloader's and the kernel's instructions do not conflict: once the bootloader's instructions have finished running they are never executed again, and execution proceeds directly to the kernel's instructions. Therefore the instruction cache need not be turned off.

2. Analysis of the startup assembly entry primary_entry

/*
 * The following callee saved general purpose registers are used on the
 * primary lowlevel boot path:
 *
 *  Register   Scope                                    Purpose
 *  x21        primary_entry() .. start_kernel()        FDT pointer passed at boot in x0
 *  x23        primary_entry() .. start_kernel()        physical misalignment/KASLR offset
 *  x28        __create_page_tables()                   callee preserved temp register
 *  x19/x20    __primary_switch()                       callee preserved temp registers
 *  x24        __primary_switch() .. relocate_kernel()  current RELR displacement
 */
SYM_CODE_START(primary_entry)
        bl      preserve_boot_args
        bl      init_kernel_el                  // w0=cpu_boot_mode
        adrp    x23, __PHYS_OFFSET              // load __PHYS_OFFSET into x23
        and     x23, x23, MIN_KIMG_ALIGN - 1    // KASLR offset, defaults to 0
        bl      set_cpu_boot_mode_flag
        bl      __create_page_tables
        /*
         * The following calls CPU setup code, see arch/arm64/mm/proc.S for
         * details.
         * On return, the CPU will be ready for the MMU to be turned on and
         * the TCR will have been set.
         */
        bl      __cpu_setup                     // initialise processor
        b       __primary_switch
SYM_CODE_END(primary_entry)

2.1 preserve_boot_args

  Purpose: save x0 .. x3, passed in by the bootloader, into the boot_args array.

/*
 * Preserve the arguments passed by the bootloader in x0 .. x3
 */
SYM_CODE_START_LOCAL(preserve_boot_args)
        mov     x21, x0                 // x21=FDT: x0 holds the device tree address; stash it in x21
        adr_l   x0, boot_args           // record the contents of: x0 = address of the boot_args array
        stp     x21, x1, [x0]           // x0 .. x3 at kernel entry
        stp     x2, x3, [x0, #16]       // store arguments x0 .. x3 into the boot_args array
        dmb     sy                      // needed before dc ivac with MMU off
                                        // memory barrier ('sy' = one barrier across the whole system)
        add     x1, x0, #0x20           // 4 x 8 bytes
        b       dcache_inval_poc        // tail call: invalidate the cache lines covering boot_args
SYM_CODE_END(preserve_boot_args)

2.2 init_kernel_el

  Determine whether we booted in EL2 or in non-secure EL1, perform the system configuration for that level (in ARMv8, EL2 is the hypervisor mode and EL1 the normal kernel mode), then return the boot mode in w0 (BOOT_CPU_MODE_EL1 or BOOT_CPU_MODE_EL2). Typically the system powers up in EL3, U-Boot drops the processor to EL2, and init_kernel_el switches it to EL1 (unless the kernel stays in EL2, e.g. with VHE).

/*
 * Starting from EL2 or EL1, configure the CPU to execute at the highest
 * reachable EL supported by the kernel in a chosen default state. If dropping
 * from EL2 to EL1, configure EL2 before configuring EL1.
 *
 * Since we cannot always rely on ERET synchronizing writes to sysregs (e.g. if
 * SCTLR_ELx.EOS is clear), we place an ISB prior to ERET.
 *
 * Returns either BOOT_CPU_MODE_EL1 or BOOT_CPU_MODE_EL2 in w0 if
 * booted in EL1 or EL2 respectively.
 */
SYM_FUNC_START(init_kernel_el)
        mrs     x0, CurrentEL           // read the current PSTATE exception level
        cmp     x0, #CurrentEL_EL2
        b.eq    init_el2                // if the exception level is EL2, branch to init_el2

SYM_INNER_LABEL(init_el1, SYM_L_LOCAL)
        mov_q   x0, INIT_SCTLR_EL1_MMU_OFF
        msr     sctlr_el1, x0
        isb                             // required because the system control register was just modified
        mov_q   x0, INIT_PSTATE_EL1
        msr     spsr_el1, x0
        msr     elr_el1, lr
        mov     w0, #BOOT_CPU_MODE_EL1
        eret                            // return from exception

SYM_INNER_LABEL(init_el2, SYM_L_LOCAL)  // switch from EL2 to EL1
        ......
        msr     elr_el1, x0
        eret
1:
        ......
        mov     w0, #BOOT_CPU_MODE_EL2
        eret

__cpu_stick_to_vhe:
        mov     x0, #HVC_VHE_RESTART
        hvc     #0
        mov     x0, #BOOT_CPU_MODE_EL2
        ret
SYM_FUNC_END(init_kernel_el)

2.3 set_cpu_boot_mode_flag

  Sets the __boot_cpu_mode flag according to the CPU boot mode passed in w0.

/*
 * Sets the __boot_cpu_mode flag depending on the CPU boot mode passed
 * in w0. See arch/arm64/include/asm/virt.h for more info.
 */
SYM_FUNC_START_LOCAL(set_cpu_boot_mode_flag)
        adr_l   x1, __boot_cpu_mode     // x1 = address of the __boot_cpu_mode[] array
        cmp     w0, #BOOT_CPU_MODE_EL2  // w0 holds the exception level we booted at
        b.ne    1f                      // if we did not boot at EL2, branch to label 1
        add     x1, x1, #4              // if we booted at EL2, point at __boot_cpu_mode[1]
1:      str     w0, [x1]                // Save CPU boot mode: store at the address in x1
                                        // (a boot at EL1 writes __boot_cpu_mode[0])
        dmb     sy                      // ensure the str has completed
        dc      ivac, x1                // Invalidate potentially stale cache line
        ret
SYM_FUNC_END(set_cpu_boot_mode_flag)

2.4 __create_page_tables

/*
 * Setup the initial page tables. We only setup the barest amount which is
 * required to get the kernel running. The following sections are required:
 *   - identity mapping to enable the MMU (low address, TTBR0)         (1) identity mapping
 *   - first few MB of the kernel linear mapping to jump to once the MMU has
 *     been enabled                                                    (2) kernel image mapping
 */
SYM_FUNC_START_LOCAL(__create_page_tables)
        ...
SYM_FUNC_END(__create_page_tables)

2.4.1 Saving the LR value

        mov     x28, lr         // stash the value of LR in x28

2.4.2 Invalidating and clearing the initial page tables

/*
 * Invalidate the init page tables to avoid potential dirty cache lines
 * being evicted. Other page tables are allocated in rodata as part of
 * the kernel image, and thus are clean to the PoC per the boot
 * protocol.
 */
        adrp    x0, init_pg_dir         // x0 = physical address of init_pg_dir
        adrp    x1, init_pg_end         // x1 = physical address of init_pg_end
        bl      dcache_inval_poc        // invalidate the cache lines covering the init
                                        // page tables (arguments are x0 and x1)
/*
 * Clear the init page tables.         // zero the page table contents
 */
        adrp    x0, init_pg_dir
        adrp    x1, init_pg_end
        sub     x1, x1, x0
1:      stp     xzr, xzr, [x0], #16     // xzr is the zero register
        stp     xzr, xzr, [x0], #16
        stp     xzr, xzr, [x0], #16
        stp     xzr, xzr, [x0], #16
        subs    x1, x1, #64
        b.ne    1b

(1) init_pg_dir and init_pg_end are defined in the arch/arm64/kernel/vmlinux.lds.S linker script:

#arch/arm64/kernel/vmlinux.lds.S
        . = ALIGN(PAGE_SIZE);
        init_pg_dir = .;
        . += INIT_DIR_SIZE;
        init_pg_end = .;

(2) The adrp instruction

Purpose: a long-range, page-granular address-generation instruction; the P stands for page.
How it works: it sign-extends a 21-bit offset (immhi+immlo), shifts it left by 12 bits, clears the low 12 bits of the PC, adds the two together, and writes the result to the Xd register. The result is the base address of the 4KB-aligned memory region containing the label (i.e. the label's address is guaranteed to fall inside that 4KB region, which is what the "Page" in the mnemonic refers to). It can address a range of +/- 4GB (2^33).
In plain terms, ADRP first performs PC + imm (the offset) to locate the 4KB page containing the label and obtain its base address; a subsequent offset is then applied to reach the label itself.

ADRP Xd, label

where Xd is the destination register to load and label is an address expression.

(3) The adrp instruction is used to obtain the addresses of init_pg_dir and init_pg_end; the page size is 4KB. Since the MMU is still off at this point in boot (the PC is a physical address), the addresses obtained are also physical addresses.
(4) adrp computes the target address from an offset relative to the current PC, independent of the actual physical load address, so it is position-independent code.

2.4.3 Saving SWAPPER_MM_MMUFLAGS in the x7 register

        mov_q   x7, SWAPPER_MM_MMUFLAGS

The SWAPPER_MM_MMUFLAGS macro describes the attributes of the section mapping; it is implemented in the arch/arm64/include/asm/kernel-pgtable.h header:

/*
 * Initial memory map attributes.
 */
#define SWAPPER_PTE_FLAGS       (PTE_TYPE_PAGE | PTE_AF | PTE_SHARED | PTE_UXN)
#define SWAPPER_PMD_FLAGS       (PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_S | PMD_SECT_UXN)

#if ARM64_KERNEL_USES_PMD_MAPS
#define SWAPPER_MM_MMUFLAGS     (PMD_ATTRINDX(MT_NORMAL) | SWAPPER_PMD_FLAGS)   // section mapping, used here
#else
#define SWAPPER_MM_MMUFLAGS     (PTE_ATTRINDX(MT_NORMAL) | SWAPPER_PTE_FLAGS)   // page mapping
#endif

2.4.4 Creating the identity mapping

/*
 * Create the identity mapping.
 */
        adrp    x0, idmap_pg_dir                                        ---(1)
        adrp    x3, __idmap_text_start  // __pa(__idmap_text_start)     ---(2)

#ifdef CONFIG_ARM64_VA_BITS_52                                          ---(3)
        mrs_s   x6, SYS_ID_AA64MMFR2_EL1
        and     x6, x6, #(0xf << ID_AA64MMFR2_LVA_SHIFT)
        mov     x5, #52
        cbnz    x6, 1f
#endif
        mov     x5, #VA_BITS_MIN                                        ---(4)
1:
        adr_l   x6, vabits_actual                                       ---(5)
        str     x5, [x6]
        dmb     sy                      // memory barrier
        dc      ivac, x6                // Invalidate potentially stale cache line:
                                        // clean the line covering the vabits_actual variable

/*
 * VA_BITS may be too small to allow for an ID mapping to be created
 * that covers system RAM if that is located sufficiently high in the
 * physical address space. So for the ID map, use an extended virtual
 * range in that case, and configure an additional translation level
 * if needed.
 *
 * Calculate the maximum allowed value for TCR_EL1.T0SZ so that the
 * entire ID map region can be mapped. As T0SZ == (64 - #bits used),
 * this number conveniently equals the number of leading zeroes in
 * the physical address of __idmap_text_end.
 */
        adrp    x5, __idmap_text_end                                    ---(6)
        clz     x5, x5                  // count leading zeroes: zeroes before the first 1
        cmp     x5, TCR_T0SZ(VA_BITS_MIN) // default T0SZ small enough?
        b.ge    1f                      // .. then skip VA range extension
        adr_l   x6, idmap_t0sz
        str     x5, [x6]
        dmb     sy
        dc      ivac, x6                // Invalidate potentially stale cache line

#if (VA_BITS < 48)
#define EXTRA_SHIFT     (PGDIR_SHIFT + PAGE_SHIFT - 3)
#define EXTRA_PTRS      (1 << (PHYS_MASK_SHIFT - EXTRA_SHIFT))

/*
 * If VA_BITS < 48, we have to configure an additional table level.
 * First, we have to verify our assumption that the current value of
 * VA_BITS was chosen such that all translation levels are fully
 * utilised, and that lowering T0SZ will always result in an additional
 * translation level to be configured.
 */
#if VA_BITS != EXTRA_SHIFT
#error "Mismatch between VA_BITS and page size/number of translation levels"
#endif
        mov     x4, EXTRA_PTRS
        create_table_entry x0, x3, EXTRA_SHIFT, x4, x5, x6
#else
/*
 * If VA_BITS == 48, we don't have to configure an additional
 * translation level, but the top-level table has more entries.
 */
        mov     x4, #1 << (PHYS_MASK_SHIFT - PGDIR_SHIFT)
        str_l   x4, idmap_ptrs_per_pgd, x5
#endif
1:
        ldr_l   x4, idmap_ptrs_per_pgd                                  ---(7)
        adr_l   x6, __idmap_text_end    // __pa(__idmap_text_end)       ---(8)

        map_memory x0, x1, x3, x6, x7, x3, x4, x10, x11, x12, x13, x14  ---(9)
(1) Load the physical address of idmap_pg_dir into the x0 register. idmap_pg_dir is the start address of the first-level page table for the identity mapping; it is defined in the vmlinux.lds.S linker script:

        idmap_pg_dir = .;
        . += IDMAP_DIR_SIZE;
        idmap_pg_end = .;

  The amount of memory allocated to idmap_pg_dir is IDMAP_DIR_SIZE, which is implemented in the arch/arm64/include/asm/kernel-pgtable.h header; it is usually 3 contiguous 4K pages. The calculation goes as follows:

#arch/arm64/include/asm/pgtable-hwdef.h
/*
 * Number of page-table levels required to address 'va_bits' wide
 * address, without section mapping. We resolve the top (va_bits - PAGE_SHIFT)
 * bits with (PAGE_SHIFT - 3) bits at each page table level. Hence:
 *
 *  levels = DIV_ROUND_UP((va_bits - PAGE_SHIFT), (PAGE_SHIFT - 3))
 *
 * where DIV_ROUND_UP(n, d) => (((n) + (d) - 1) / (d))
 *
 * We cannot include linux/kernel.h which defines DIV_ROUND_UP here
 * due to build issues. So we open code DIV_ROUND_UP here:
 *
 *      ((((va_bits) - PAGE_SHIFT) + (PAGE_SHIFT - 3) - 1) / (PAGE_SHIFT - 3))
 *
 * which gets simplified as :
 */
#define ARM64_HW_PGTABLE_LEVELS(va_bits) (((va_bits) - 4) / (PAGE_SHIFT - 3))
...
/*
 * Highest possible physical address supported.
 */
#define PHYS_MASK_SHIFT         (CONFIG_ARM64_PA_BITS)  // 48

#arch/arm64/include/asm/kernel-pgtable.h
#if ARM64_KERNEL_USES_PMD_MAPS  // section mapping normally takes this branch
#define SWAPPER_PGTABLE_LEVELS  (CONFIG_PGTABLE_LEVELS - 1)
#define IDMAP_PGTABLE_LEVELS    (ARM64_HW_PGTABLE_LEVELS(PHYS_MASK_SHIFT) - 1)
        // {((48-12)+(12-3)-1) / (12-3) = (36+9-1)/9 = 44/9 = 4} - 1 = 3
#else
#define SWAPPER_PGTABLE_LEVELS  (CONFIG_PGTABLE_LEVELS)
#define IDMAP_PGTABLE_LEVELS    (ARM64_HW_PGTABLE_LEVELS(PHYS_MASK_SHIFT))      // 4
#endif
...
#define IDMAP_DIR_SIZE          (IDMAP_PGTABLE_LEVELS * PAGE_SIZE)

  CONFIG_ARM64_PA_BITS is configured as 48. The point here is to work out how many pages are needed to hold the tables when section mapping is used. ARM64_HW_PGTABLE_LEVELS is the key: it computes the number of page-table levels needed for the configured physical address width; note the calculation in the comment:

((((va_bits) - PAGE_SHIFT) + (PAGE_SHIFT - 3) - 1) / (PAGE_SHIFT - 3))

Combined with vmlinux.lds, the formula above evaluates to ((48-12)+(12-3)-1) / (12-3) = (36+9-1)/9 = 44/9 = 4 (integer division). IDMAP_DIR_SIZE therefore comes to 3 pages, i.e. three page tables (PGD/PUD/PMD) are allocated at once at contiguous addresses, one page per level.

Note that only a single 2MB section mapping is established here; for the identity mapping, a 2MB section mapping is sufficient.

(2) Load the physical address of __idmap_text_start into the x3 register. The __idmap_text_start label is defined in arch/arm64/kernel/vmlinux.lds.S; it is the start address of the region we identity-map (physical address == virtual address):

#define IDMAP_TEXT                      \
        . = ALIGN(SZ_4K);               \
        __idmap_text_start = .;         \
        *(.idmap.text)                  \
        __idmap_text_end = .;

        .text : ALIGN(SEGMENT_ALIGN) {  /* Real text segment */
                _stext = .;             /* Text and read-only data */
                IRQENTRY_TEXT
                SOFTIRQENTRY_TEXT
                ENTRY_TEXT
                TEXT_TEXT
                SCHED_TEXT
                CPUIDLE_TEXT
                LOCK_TEXT
                KPROBES_TEXT
                HYPERVISOR_TEXT
                IDMAP_TEXT              /* the .idmap.text section */
                *(.gnu.warning)
                . = ALIGN(16);
                *(.got)                 /* Global offset table */
        }


Besides enabling the MMU at boot, many other scenarios in the kernel require the identity mapping; those functions are all placed in the .idmap.text section via .section directives:

# arch/arm64/kernel/head.S
/*
 * end early head section, begin head code that is also used for
 * hotplug and needs to have the same protections as the text region
 */
        .section ".idmap.text","awx"

The functions placed in the .idmap.text section can also be seen in System.map:

ffffffc00952f000 T __idmap_text_start
ffffffc00952f000 T init_kernel_el
ffffffc00952f010 t init_el1
ffffffc00952f038 t init_el2
ffffffc00952f270 t __cpu_stick_to_vhe
ffffffc00952f280 t set_cpu_boot_mode_flag
ffffffc00952f2a8 T secondary_holding_pen
ffffffc00952f2d0 t pen
ffffffc00952f2e4 T secondary_entry
ffffffc00952f2f4 t secondary_startup
ffffffc00952f314 t __secondary_switched
ffffffc00952f3b8 t __secondary_too_slow
ffffffc00952f3c8 T __enable_mmu                   // worth close attention
ffffffc00952f42c T __cpu_secondary_check52bitva
ffffffc00952f434 t __no_granule_support
ffffffc00952f45c t __relocate_kernel
ffffffc00952f4a8 t __primary_switch               // worth close attention
ffffffc00952f530 t enter_vhe
ffffffc00952f568 T cpu_resume
ffffffc00952f590 T cpu_soft_restart
ffffffc00952f5c4 T cpu_do_resume
ffffffc00952f66c T idmap_cpu_replace_ttbr1
ffffffc00952f6a4 t __idmap_kpti_flag
ffffffc00952f6a8 T idmap_kpti_install_ng_mappings
ffffffc00952f6e8 t do_pgd
ffffffc00952f700 t next_pgd
ffffffc00952f710 t skip_pgd
ffffffc00952f750 t walk_puds
ffffffc00952f758 t next_pud
ffffffc00952f75c t walk_pmds
ffffffc00952f764 t do_pmd
ffffffc00952f77c t next_pmd
ffffffc00952f78c t skip_pmd
ffffffc00952f79c t walk_ptes
ffffffc00952f7a4 t do_pte
ffffffc00952f7c8 t skip_pte
ffffffc00952f7d8 t __idmap_kpti_secondary
ffffffc00952f820 T __cpu_setup
ffffffc00952f974 T __idmap_text_end
(3) Assume a virtual address width of 48 bits (we configured CONFIG_ARM64_VA_BITS_48);
(4) Move the virtual address width VA_BITS_MIN (48) into the x5 register;
(5) Store the immediate VA_BITS_MIN (48) into the global variable vabits_actual;
(6) Load the physical address of __idmap_text_end into the x5 register, count the zeroes before the first 1 bit of that address, and check whether __idmap_text_end exceeds the address range expressible with VA_BITS_MIN. TCR_T0SZ(VA_BITS_MIN) expresses the size mappable through the TTBR0 page table, since the page table we create shortly will be installed in the TTBR0 register;
(7) Load the number of entries in the PGD page table (2^9) into the x4 register;
(8) Load the physical address of __idmap_text_end into the x6 register;
(9) Invoke the map_memory macro to create the page tables for this identity mapping:

map_memory x0, x1, x3, x6, x7, x3, x4, x10, x11, x12, x13, x14
(1) x0 — idmap_pg_dir
(2) x1 — invalid on entry; map_memory recomputes it from the value of tbl
(3) x3 — __idmap_text_start
(4) x6 — __idmap_text_end
(5) x7 — SWAPPER_MM_MMUFLAGS
(6) x3 — __idmap_text_start
(7) x4 — idmap_ptrs_per_pgd

2.4.5 Analysis of the map_memory macro

The map_memory macro takes 12 parameters, which are explained thoroughly in the annotations of the code below. The important ones are:

  • tbl   : start address of the page table (pgd)
  • rtbl  : start address of the next-level page table (typically tbl + PAGE_SIZE)
  • vstart: start of the virtual address range to map
  • vend  : end of the virtual address range to map
  • flags : attributes for the last-level page table entries
  • phys  : start of the physical address range to map
  • pgds  : number of pgd entries
/*
 * Map memory for specified virtual address range. Each level of page table needed supports
 * multiple entries. If a level requires n entries the next page table level is assumed to be
 * formed from n pages.
 *
 *      tbl:    location of page table                  // start of the page table (pgd)
 *      rtbl:   address to be used for first level page table entry (typically tbl + PAGE_SIZE)
 *                                                      // start of the next-level page table
 *      vstart: virtual address of start of range       // start of the virtual range to map
 *      vend:   virtual address of end of range - we map [vstart, vend - 1]
 *                                                      // end of the virtual range to map
 *      flags:  flags to use to map last level entries  // attributes of the last-level entries
 *      phys:   physical address corresponding to vstart - physical memory is contiguous
 *                                                      // start of the physical range to map
 *      pgds:   the number of pgd entries               // number of pgd entries
 *
 *      Temporaries:    istart, iend, tmp, count, sv - these need to be different registers
 *      Preserves:      vstart, flags
 *      Corrupts:       tbl, rtbl, vend, istart, iend, tmp, count, sv
 */
        .macro map_memory, tbl, rtbl, vstart, vend, flags, phys, pgds, istart, iend, tmp, count, sv
        sub \vend, \vend, #1
        add \rtbl, \tbl, #PAGE_SIZE                                                     ---(1)
        mov \sv, \rtbl
        mov \count, #0

        compute_indices \vstart, \vend, #PGDIR_SHIFT, \pgds, \istart, \iend, \count     ---(2)
        populate_entries \tbl, \rtbl, \istart, \iend, #PMD_TYPE_TABLE, #PAGE_SIZE, \tmp ---(3)
        mov \tbl, \sv
        mov \sv, \rtbl

#if SWAPPER_PGTABLE_LEVELS > 3          // not taken in our configuration
        compute_indices \vstart, \vend, #PUD_SHIFT, #PTRS_PER_PUD, \istart, \iend, \count
        populate_entries \tbl, \rtbl, \istart, \iend, #PMD_TYPE_TABLE, #PAGE_SIZE, \tmp
        mov \tbl, \sv
        mov \sv, \rtbl
#endif

#if SWAPPER_PGTABLE_LEVELS > 2                                                          ---(4)
        compute_indices \vstart, \vend, #SWAPPER_TABLE_SHIFT, #PTRS_PER_PMD, \istart, \iend, \count
        populate_entries \tbl, \rtbl, \istart, \iend, #PMD_TYPE_TABLE, #PAGE_SIZE, \tmp
        mov \tbl, \sv
#endif

        compute_indices \vstart, \vend, #SWAPPER_BLOCK_SHIFT, #PTRS_PER_PTE, \istart, \iend, \count  ---(5)
        bic \count, \phys, #SWAPPER_BLOCK_SIZE - 1
        populate_entries \tbl, \count, \istart, \iend, \flags, #SWAPPER_BLOCK_SIZE, \tmp
        .endm
(1) Compute the PUD base address; rtbl is the next-level table address: PUD = PGD + PAGE_SIZE.

(2) The compute_indices macro computes the index values into each page table level from the virtual address range:

/*
 * Compute indices of table entries from virtual address range. If multiple entries
 * were needed in the previous page table level then the next page table level is assumed
 * to be composed of multiple pages. (This effectively scales the end index).
 *
 *      vstart: virtual address of start of range
 *      vend:   virtual address of end of range - we map [vstart, vend]
 *      shift:  shift used to transform virtual address into index
 *      ptrs:   number of entries in page table
 *      istart: index in table corresponding to vstart
 *      iend:   index in table corresponding to vend
 *      count:  On entry: how many extra entries were required in previous level, scales
 *                our end index.
 *              On exit: returns how many extra entries required for next page table level
 *
 *      Preserves:      vstart, vend, shift, ptrs
 *      Returns:        istart, iend, count
 */
        .macro compute_indices, vstart, vend, shift, ptrs, istart, iend, count
        lsr     \iend, \vend, \shift
        mov     \istart, \ptrs
        sub     \istart, \istart, #1
        and     \iend, \iend, \istart   // iend = (vend >> shift) & (ptrs - 1)
        mov     \istart, \ptrs
        mul     \istart, \istart, \count
        add     \iend, \iend, \istart   // iend += count * ptrs
                                        // our entries span multiple tables
        lsr     \istart, \vstart, \shift
        mov     \count, \ptrs
        sub     \count, \count, #1
        and     \istart, \istart, \count
        sub     \count, \iend, \istart
        .endm
(3) The populate_entries macro fills in the page table entries for the computed index range:
/*
 * Macro to populate page table entries, these entries can be pointers to the next level
 * or last level entries pointing to physical memory.
 *
 *      tbl:    page table address
 *      rtbl:   pointer to page table or physical memory
 *      index:  start index to write
 *      eindex: end index to write - [index, eindex] written to
 *      flags:  flags for pagetable entry to or in
 *      inc:    increment to rtbl between each entry
 *      tmp1:   temporary variable
 *
 *      Preserves:      tbl, eindex, flags, inc
 *      Corrupts:       index, tmp1
 *      Returns:        rtbl
 */
        .macro populate_entries, tbl, rtbl, index, eindex, flags, inc, tmp1
.Lpe\@: phys_to_pte \tmp1, \rtbl
        orr     \tmp1, \tmp1, \flags    // tmp1 = table entry
        str     \tmp1, [\tbl, \index, lsl #3]
        add     \rtbl, \rtbl, \inc      // rtbl = pa next level
        add     \index, \index, #1
        cmp     \index, \eindex
        b.ls    .Lpe\@
        .endm
(4) Set the PUD page table entries;
(5) Set the PMD page table entries (since we use a section mapping, this is the last level; there is no PTE).

2.4.6 Creating the kernel image mapping

/*
 * Map the kernel image (starting with PHYS_OFFSET).
 */
        adrp    x0, init_pg_dir                                         ---(1)
        mov_q   x5, KIMAGE_VADDR        // compile time __va(_text)     ---(2)
        add     x5, x5, x23             // add KASLR displacement (x23 = __PHYS_OFFSET misalignment)
        mov     x4, PTRS_PER_PGD
        adrp    x6, _end                // runtime __pa(_end)           ---(3) end physical address of the kernel image
        adrp    x3, _text               // runtime __pa(_text)          ---(4) start physical address of the kernel image
        sub     x6, x6, x3              // _end - _text: size of the kernel image
        add     x6, x6, x5              // runtime __va(_end)           ---(5) end virtual address of the kernel image
        map_memory x0, x1, x5, x6, x7, x3, x4, x10, x11, x12, x13, x14  ---(6)
(1) Load the physical address of init_pg_dir into the x0 register. init_pg_dir is the start address of the page tables used for the kernel image mapping (distinct from the identity mapping); it is defined in the vmlinux.lds.S linker script:

        BSS_SECTION(SBSS_ALIGN, 0, 0)
        . = ALIGN(PAGE_SIZE);
        init_pg_dir = .;
        . += INIT_DIR_SIZE;
        init_pg_end = .;
(2) Load the start virtual address of the kernel image, KIMAGE_VADDR, into the x5 register; note that the mov_q instruction is used here. KIMAGE_VADDR appears in the vmlinux.lds.S linker script:

SECTIONS
{
        ......
        /* In 5.8 it was found that TEXT_OFFSET served no purpose, so it was redefined as 0x0 */
        . = KIMAGE_VADDR;       /* start virtual address of the kernel image
                                   (before 5.8 this was KIMAGE_VADDR + TEXT_OFFSET) */
        .head.text : {          /* text section of the early assembly code */
                _text = .;      /* entry address */
                HEAD_TEXT
        }
(3) Load the end physical address of the kernel image into the x6 register;
(4) Load the start physical address of the kernel image into the x3 register;
(5) Convert to obtain the end virtual address of the kernel image, and place it in the x6 register;
(6) Invoke the map_memory macro to create the page tables for this kernel image mapping.