记第六章完善内核出现的问题

2018-10-18 本文已影响23人 mecury

[TOC]

在《操作系统真相还原》一书中的第6章：加载内核中，有个问题，需要特别注意一下，避免一下再次踩坑，在这里记录一下：

1. 问题起因

问题：在第6章，实现完成代码后，通过6.3.2完成编译后，程序没有正常执行，异常退出。

执行的操作为：

编译连接程序

nasm -f elf64 -o lib/kernel/print.o lib/kernel/print.S
gcc -I lib/kernel -c -o kernel/main.o  kernel/main.c
ld -Ttext 0xc0001500 -e main -o kernel/kernel.bin kernel/main.o lib/kernel/print.o

写入启动盘

bximage -hd -mode="flat" -size=60 -q hd60M.img
dd if=/home/haifei/software/bookspace/elephant/code/c6/a/boot/mbr.bin of=hd60M.img bs=512 count=1 conv=notrunc
dd if=/home/haifei/software/bookspace/elephant/code/c6/a/boot/loader.bin of=hd60M.img bs=512 count=4 seek=2 conv=notrunc
dd if=/home/haifei/software/bookspace/elephant/code/c6/a/kernel.bin of=hd60M.img bs=512 count=200 seek=9 conv=notrunc

启动

bochs -f .bochsrc

程序没有按照正常的显示打印字符串，下面是正常的情况，错误的情况没有kernel与13显示：

图片.png

2. 问题查找

出现这种问题，我的第一反应是查看编译的源码，先是与书中的代码仔细对比了一下，发现没有什么问题。后来直接编译书中的源码，还是会出现这种情况
代码没有问题，就开始在bochs中开始调试debug。这个过程中，一段异常跳转吸引了我的注意：

(0) [0x000000000d2b] 0008:00000d2b (unk. ctxt): mov cr0, eax              ; 0f22c0
<bochs:96> n
Next at t=156794499
(0) [0x000000000d2e] 0008:00000d2e (unk. ctxt): lgdt ds:0xb04             ; 0f0115040b0000
<bochs:97> n
Next at t=156794500
(0) [0x000000000d35] 0008:00000d35 (unk. ctxt): jmp far 0008:00000d3c     ; ea3c0d00000800
<bochs:98> ^C
Next at t=156794501
(0) [0x000000000d3c] 0008:00000d3c (unk. ctxt): call .+10 (0x00000d4b)    ; e80a000000
<bochs:99> n
(0).[156798010] [0x000000000d9c] 0008:00000d9c (unk. ctxt): add byte ptr ds:[ecx+ebx*2+12174173], ah ; 00a4595dc3b900
Next at t=156798011
(0) [0x0000fffffff0] f000:fff0 (unk. ctxt): jmp far f000:e05b         ; ea5be000f0

在call .+10 (0x00000d4b)调用后，程序出现了跳转，可是直接跳转到了地址 0x0000fffffff0，开始继续执行，正常的话应该是执行完call指令返回（因为使用了ret指令）。
这段对应的汇编指令为, 上面的call对应的是call指令：call kernel_init：

   ;在开启分页后,用gdt新的地址重新加载
   lgdt [gdt_ptr]             ; 重新加载

;;;;;;;;;;;;;;;;;;;;;;;;;;;;  此时不刷新流水线也没问题  ;;;;;;;;;;;;;;;;;;;;;;;;
;由于一直处在32位下,原则上不需要强制刷新,经过实际测试没有以下这两句也没问题.
;但以防万一，还是加上啦，免得将来出来莫句奇妙的问题.
   jmp SELECTOR_CODE:enter_kernel         ;强制刷新流水线,更新gdt
enter_kernel:
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
   call kernel_init
   mov esp, 0xc009f000
   jmp KERNEL_ENTRY_POINT                 ; 用地址0x1500访问测试，结果ok

于是问题定位在call kernel_init,于是继续打断点进入，断点为提示的：0x00000d4b（call .+10 (0x00000d4b) ）
kernel_init:对应的代码：

;-----------------   将kernel.bin中的segment拷贝到编译的地址   -----------
kernel_init:
   xor eax, eax
   xor ebx, ebx     ;ebx记录程序头表地址
   xor ecx, ecx     ;cx记录程序头表中的program header数量
   xor edx, edx     ;dx 记录program header尺寸,即e_phentsize

   mov dx, [KERNEL_BIN_BASE_ADDR + 42]    ; 偏移文件42字节处的属性是e_phentsize,表示program header大小
   mov ebx, [KERNEL_BIN_BASE_ADDR + 28]   ; 偏移文件开始部分28字节的地方是e_phoff,表示第1 个program header在文件中的偏移量
                      ; 其实该值是0x34,不过还是谨慎一点，这里来读取实际值
   add ebx, KERNEL_BIN_BASE_ADDR
   mov cx, [KERNEL_BIN_BASE_ADDR + 44]    ; 偏移文件开始部分44字节的地方是e_phnum,表示有几个program header
.each_segment:
   cmp byte [ebx + 0], PT_NULL        ; 若p_type等于 PT_NULL,说明此program header未使用。
   je .PTNULL

   ;为函数memcpy压入参数,参数是从右往左依然压入.函数原型类似于 memcpy(dst,src,size)
   push dword [ebx + 16]          ; program header中偏移16字节的地方是p_filesz,压入函数memcpy的第三个参数:size
   mov eax, [ebx + 4]             ; 距程序头偏移量为4字节的位置是p_offset
   add eax, KERNEL_BIN_BASE_ADDR      ; 加上kernel.bin被加载到的物理地址,eax为该段的物理地址
   push eax               ; 压入函数memcpy的第二个参数:源地址
   push dword [ebx + 8]           ; 压入函数memcpy的第一个参数:目的地址,偏移程序头8字节的位置是p_vaddr，这就是目的地址
   call mem_cpy               ; 调用mem_cpy完成段复制
   add esp,12                 ; 清理栈中压入的三个参数
.PTNULL:
   add ebx, edx               ; edx为program header大小,即e_phentsize,在此ebx指向下一个program header
   loop .each_segment
   ret

;----------  逐字节拷贝 mem_cpy(dst,src,size) ------------
;输入:栈中三个参数(dst,src,size)
;输出:无
;---------------------------------------------------------
mem_cpy:
   cld
   push ebp
   mov ebp, esp
   push ecx        ; rep指令用到了ecx，但ecx对于外层段的循环还有用，故先入栈备份
   mov edi, [ebp + 8]      ; dst
   mov esi, [ebp + 12]     ; src
   mov ecx, [ebp + 16]     ; size
   rep movsb           ; 逐字节拷贝

   ;恢复环境
   pop ecx
   pop ebp
   ret

进入这个方法后，继续debug，终于在mem_cpy中看到了异常跳转，是应为rep movsb引起的：

<bochs:37> 
Next at t=156794524
(0) [0x000000000d99] 0008:00000d99 (unk. ctxt): mov ecx, dword ptr ss:[ebp+16] ; 8b4d10
<bochs:38> 
Next at t=156794525
(0) [0x000000000d9c] 0008:00000d9c (unk. ctxt): rep movsb byte ptr es:[edi], byte ptr ds:[esi] ; f3a4
<bochs:39> n
(0).[156798010] [0x000000000d9c] 0008:00000d9c (unk. ctxt): add byte ptr ds:[ecx+ebx*2+12174173], ah ; 00a4595dc3b900
Next at t=156798011
(0) [0x0000fffffff0] f000:fff0 (unk. ctxt): jmp far f000:e05b         ; ea5be000f0

这个时候，表面原因很明显了，rep movsb是系统给的汇编指令，不会出错，那么出错的肯定是Stack中的数据。于是我这边又重新debug。打印出stack中的信息：

 Stack address size 4
 | STACK 0xc00008ec [0x00000d86] 
 | STACK 0xc00008f0 [0x00000000] 
 | STACK 0xc00008f4 [0x00080102]//目的地址 
 | STACK 0xc00008f8 [0x003e0002]//source地址，将会把这个地址的字符复制到Dst中
 | STACK 0xc00008fc [0x00000d41]//复制的字节数
 | STACK 0xc0000900 [0x00000000]
 | STACK 0xc0000904 [0x00000000]
 | STACK 0xc0000908 [0x0000ffff]
 | STACK 0xc000090c [0x00cf9900]
 | STACK 0xc0000910 [0x0000ffff]
 | STACK 0xc0000914 [0x00cf9300]
 | STACK 0xc0000918 [0x80000007]
 | STACK 0xc000091c [0xc0c0930b]
 | STACK 0xc0000920 [0x00000000]

这个里面push进去的值，怎么看都是不正确的。里面的地址看着和我们的地址没有关系。

其实这个时候就应该想到的是编译文件出现了问题。可惜我还没有想到，于是又回到第四章重新看了一下代码解释。因为第四章没有给验证测试，这里的代码是没有测试过，就去写了后面的代码。所以就根据书中的说明重新分析了前面的代码，确认stack中的数据不正确。又向上查找了数据来源，发现是编译生成的bin文件ELF格式中的数据格式问题。
于是就通过readelf指令查看数据：

$ readelf -h kernel.bin 
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0xc0001500
  Start of program headers:          64 (bytes into file)
  Start of section headers:          6768 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         2
  Size of section headers:           64 (bytes)
  Number of section headers:         7
  Section header string table index: 4


$ sh ~/xxd.sh kernel.bin 0 7216
0000000: 7F 45 4C 46 02 01 01 00 00 00 00 00 00 00 00 00  .ELF............
0000010: 02 00 3E 00 01 00 00 00 00 15 00 C0 00 00 00 00  ..>.............
0000020: 40 00 00 00 00 00 00 00 70 1A 00 00 00 00 00 00  @.......p.......
0000030: 00 00 00 00 40 00 38 00 02 00 40 00 07 00 04 00  ....@.8...@.....
0000040: 01 00 00 00 05 00 00 00 00 00 00 00 00 00 00 00  ................
0000050: 00 00 00 C0 00 00 00 00 00 00 00 C0 00 00 00 00  ................
0000060: 88 16 00 00 00 00 00 00 88 16 00 00 00 00 00 00  ................
0000070: 00 00 20 00 00 00 00 00 51 E5 74 64 07 00 00 00  .. .....Q.td....
0000080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0000090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000a0: 00 00 00 00 00 00 00 00 10 00 00 00 00 00 00 00  ................
00000b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
*

然后这里对于数据格式进行分析，猛然发现ELF中的某些字段应该是4个字节，可这里显示8个字节。猛然惊醒是不是编译的时候数据格式不正确导致的。上面也明确表示了ELF Version是02，代表64位。
这里推荐几篇关于：elf格式的文章：

ELF文件格式解析
 Linux ELF文件格式分析

分析到这里，答案也呼之欲出了。于是重新搜索了下，开始重新编译
解决：

将文件重新编译为32位程序：

64位Linux 编译32位程序

nasm -f elf -o lib/kernel/print.o lib/kernel/print.S
gcc -m32 -I lib/kernel -c -o kernel/main.o  kernel/main.c
ld -m elf_i386 -Ttext 0xc0001500 -e main -o kernel/kernel.bin kernel/main.o lib/kernel/print.o

导入bochs中成功显示：

图片.png

3. 总结：

从发现问题到解决问题，三天的时间过去了，真的是想打自己。问题原因就是把64位程序当作32位在bochs中操作导致。

仔细想想，在编译kernel.bin的时候，因为报错了下面，导致我加了一些编译参数在里面：

ld: i386 architecture of input file `lib/kernel/print.o' is incompatible with i386:x86-64 output

于是就对于print.o的编译，加上了特殊参数,就解决了问题。这也为后面埋下了伏笔

nasm -f elf -o lib/kernel/print.o lib/kernel/print.S
==》nasm -f elf64 -o lib/kernel/print.o lib/kernel/print.S
```：

记第六章完善内核出现的问题

1. 问题起因

2. 问题查找

3. 总结：

猜你喜欢

热点阅读