[Translation] The QEMU Dynamic Translator Paper

2016-10-23 Elinx

Title

QEMU, a Fast and Portable Dynamic Translator

Abstract

QEMU can emulate several CPUs and supports full system emulation without any modification to the emulated operating system.
It also supports Linux user mode emulation.

Introduction

QEMU runs on several host operating systems, including:
- Linux
- Windows
- Mac OS X
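QEMU runs the target operating system unmodified.
QEMU is mainly used to run one operating system on another and for debugging.
QEMU also includes a user mode emulator, which runs a Linux process built for one target CPU on another CPU. This is useful for testing the output of cross compilers.
QEMU consists of the following subsystems: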
- CPU emulator
- Emulated devices (VGA, serial port, keyboard, mouse, IDE hard disk, network card, ...)
- Generic devices (block device, character device, network device), which connect the emulated devices to the corresponding host devices
- Machine descriptions, which instantiate the emulated devices
- debugger
- user interface
This article focuses on the dynamic translator, which performs a runtime conversion of target CPU instructions into the host instruction set.
The translator's output is stored in a translation cache so that it can be reused.
Unlike an interpreter, it fetches and decodes each target instruction only once.
QEMU's dynamic translator is portable: it simply concatenates pieces of machine code that GCC generated offline.
Challenges in building a CPU emulator:
- code cache management
- register allocation
- condition code optimization
- direct block chaining
- memory management
- self-modifying code support
- exception support
- hardware interrupts
- user mode emulation

Portable dynamic translation

Description

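First, each target CPU instruction is split into simpler instructions called micro operations. Each micro operation is implemented by a small piece of C code, which GCC compiles into an object file.
There are a few hundred micro operations; the translation from target CPU instructions to micro operations is entirely hand coded.
A compile-time tool called dyngen reads the object file and generates a dynamic code generator, which is invoked at run time to produce host functions.
The process is similar to [1], but most of the work is done at compile time.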

A key idea is that in QEMU constant parameters can be given to micro operations. For that purpose, dummy code relocations are generated with GCC for each constant parameter. This enables the dyngen tool to locate the relocations and generate the appropriate C code to resolve them when building the dynamic code. Relocations are also supported to enable references to static data and to other functions in the micro operations.

Example

How is the PowerPC instruction addi r1, r1, -16 # r1 = r1 - 16 translated into x86 code?
The PowerPC code translator generates the following micro operations:

movl_T0_r1        # T0 = r1
addl_T0_im -16    # T0 = T0 - 16
movl_r1_T0        # r1 = T0

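The code for the first two micro operations is shown below (T0 is a temporary variable, env holds the target CPU state, and the 32 PowerPC registers are stored in env->regs[32]):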

void op_movl_T0_r1(void)
{
     T0 = env->regs[1];
}

/* The address of this dummy symbol stands for the constant parameter. */
extern int __op_param1;
void op_addl_T0_im(void)
{
    T0 = T0 + ((long)(&__op_param1));
}

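The code generator produced by dyngen reads a micro operation stream through the pointer opc_ptr and writes host code at gen_code_ptr. Micro operation parameters are read through opparam_ptr. The logic is as follows: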

for (;;) {
  switch (*opc_ptr++) {
  case INDEX_op_movl_T0_r1: {
    /* No parameter: copy the 3 bytes of host code as they are. */
    extern void op_movl_T0_r1();
    memcpy(gen_code_ptr, (char *)&op_movl_T0_r1 + 0, 3);
    gen_code_ptr += 3;
    break;
  }
  case INDEX_op_addl_T0_im: {
    /* One constant parameter: copy 6 bytes, then patch the immediate at offset 2. */
    long param1;
    extern void op_addl_T0_im();
    memcpy(gen_code_ptr, (char *)&op_addl_T0_im + 0, 6);
    param1 = *opparam_ptr++;
    *(uint32_t *)(gen_code_ptr + 2) = param1;
    gen_code_ptr += 6;
    break;
  }
  [...]
  }
  [...]
}

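For most micro operations, such as movl_T0_r1, copying the code generated by GCC is all that is needed.
When constant parameters are used, dyngen relies on the relocations emitted by GCC to patch the parameter into the copied code at run time.
For example, the three micro operations above produce the following host code.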

# movl_T0_r1
# ebx = env->regs[1]
mov 0x4(%ebp), %ebx
# addl_T0_im -16
# ebx = ebx - 16
add $0xfffffff0, %ebx
# movl_r1_T0
# env->regs[1] = ebx
mov %ebx, 0x4(%ebp)
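
The copy-and-patch step can be mimicked without executing any generated code. The sketch below is only an illustration under stated assumptions: the byte templates are my own encoding of the three x86 instructions listed above (in QEMU they would come from GCC's object file), and the names gen_code, put_le32, emit and the OP_* values are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Micro operation indices (hypothetical values for this sketch). */
enum { OP_END, OP_MOVL_T0_R1, OP_ADDL_T0_IM, OP_MOVL_R1_T0 };

/* Assumed encodings of the host instructions from the listing above. */
static const uint8_t tmpl_movl_T0_r1[] = { 0x8b, 0x5d, 0x04 };       /* mov 0x4(%ebp),%ebx */
static const uint8_t tmpl_addl_T0_im[] = { 0x81, 0xc3, 0, 0, 0, 0 }; /* add $imm32,%ebx    */
static const uint8_t tmpl_movl_r1_T0[] = { 0x89, 0x5d, 0x04 };       /* mov %ebx,0x4(%ebp) */

/* Store a 32-bit immediate in x86 (little-endian) byte order. */
static void put_le32(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* Append one template to the output and return the advanced pointer. */
static uint8_t *emit(uint8_t *dst, const uint8_t *end,
                     const uint8_t *tmpl, size_t len)
{
    assert((size_t)(end - dst) >= len);   /* output buffer must be large enough */
    memcpy(dst, tmpl, len);
    return dst + len;
}

/* Concatenate the templates named by opc[] into out[]; returns bytes written. */
size_t gen_code(const uint16_t *opc_ptr, const uint32_t *opparam_ptr,
                uint8_t *out, size_t cap)
{
    uint8_t *gen_code_ptr = out;
    const uint8_t *end = out + cap;

    for (;;) {
        switch (*opc_ptr++) {
        case OP_MOVL_T0_R1:
            gen_code_ptr = emit(gen_code_ptr, end, tmpl_movl_T0_r1, 3);
            break;
        case OP_ADDL_T0_IM:
            gen_code_ptr = emit(gen_code_ptr, end, tmpl_addl_T0_im, 6);
            /* Patch the constant parameter at offset 2 of the copied code. */
            put_le32(gen_code_ptr - 4, *opparam_ptr++);
            break;
        case OP_MOVL_R1_T0:
            gen_code_ptr = emit(gen_code_ptr, end, tmpl_movl_r1_T0, 3);
            break;
        default: /* OP_END or unknown opcode: stop */
            return (size_t)(gen_code_ptr - out);
        }
    }
}
```

Feeding it the stream movl_T0_r1, addl_T0_im, movl_r1_T0 with the parameter -16 should yield 12 bytes, with the immediate 0xfffffff0 stored at offsets 5 to 8.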

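Dyngen implementation

Parse the object file's symbol table, relocation entries and code section.
Use the symbol table to locate the code of each micro operation in the code section. A host-specific method finds the start and the end of the code to copy; the function prologue and epilogue are usually skipped.
Examine the relocations of each micro operation to determine its number of constant parameters. Relevant relocations are recognized by a special symbol name: __op_paramN.
memcpy the micro operation code to the output area. If there are constant parameters, use the relocations to patch the copied code.
On some hosts such as ARM, a separate host-specific pass relocates the constants.
The prologue and epilogue must be stripped. To make this easier, a dummy assembly macro forces GCC to always terminate the function corresponding to each micro operation with a single return instruction. (I will need to read the code to see what this really means.)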
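
The parameter-counting step can be sketched in a few lines. This is not dyngen's actual code: it assumes the relocation symbol names of one micro operation have already been extracted into an array of strings, and the function names op_param_index and count_op_params are made up for the illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Return N if name is exactly "__op_paramN" with N >= 1, otherwise 0. */
int op_param_index(const char *name)
{
    static const char prefix[] = "__op_param";
    const size_t len = sizeof prefix - 1;
    char *end;
    long n;

    if (strncmp(name, prefix, len) != 0)
        return 0;
    if (!isdigit((unsigned char)name[len]))
        return 0;                       /* reject signs, spaces, empty suffix */
    n = strtol(name + len, &end, 10);
    if (*end != '\0' || n < 1 || n > INT_MAX)
        return 0;
    return (int)n;
}

/* The number of constant parameters of a micro operation is taken to be
 * the highest N found among the symbols its relocations refer to. */
int count_op_params(const char *const *reloc_syms, size_t nb_relocs)
{
    int max = 0;

    assert(reloc_syms != NULL || nb_relocs == 0);
    for (size_t i = 0; i < nb_relocs; i++) {
        int n = op_param_index(reloc_syms[i]);
        if (n > max)
            max = n;
    }
    return max;
}
```

Relocations against other symbols, such as a called function, are simply ignored by the count; they would still have to be resolved when the code is copied.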

Implementation details

Translated Blocks and Translation Cache

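A translated block (TB) corresponds to a basic block: QEMU translates target code up to the next jump or the next instruction that modifies the static CPU state.
A 16 MB cache holds the most recently used TBs; when it fills up, it is flushed entirely.
The static CPU state is the part of the CPU state that is considered known at translation time when entering the TB.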
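
The "flush everything when full" policy amounts to a bump allocator that resets itself. The following is a minimal sketch under my own assumptions: the type CodeCache and the functions cache_init and cache_alloc are hypothetical, and the buffer is supplied by the caller so that a small one can be used for experiments instead of 16 MB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *buf;          /* start of the code cache                 */
    size_t size;           /* capacity in bytes (16 MB in the text)   */
    size_t used;           /* bytes handed out since the last flush   */
    unsigned flush_count;  /* how many times the cache was flushed    */
} CodeCache;

void cache_init(CodeCache *c, uint8_t *buf, size_t size)
{
    assert(c != NULL && buf != NULL && size > 0);
    c->buf = buf;
    c->size = size;
    c->used = 0;
    c->flush_count = 0;
}

/* Reserve len bytes for a new translated block. When the request does not
 * fit, every TB is discarded at once; a real implementation would also have
 * to clear its TB lookup structures at this point. */
uint8_t *cache_alloc(CodeCache *c, size_t len)
{
    uint8_t *p;

    assert(c != NULL && len <= c->size);
    if (c->size - c->used < len) {
        c->used = 0;
        c->flush_count++;
    }
    p = c->buf + c->used;
    c->used += len;
    return p;
}
```

The appeal of this design is that no per-block bookkeeping (free lists, LRU order) is needed; the cost is that hot blocks are retranslated after each flush.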

Register allocation

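QEMU uses fixed register allocation: each target CPU register is mapped to a fixed host register or memory address.
On most hosts, all target registers are mapped to memory, and only a few temporary variables are kept in host registers.
The allocation of temporary variables is also hard coded for each target CPU.
A dynamic temporary register allocator may be used in the future.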
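
In plain C, the fixed mapping might look like the sketch below: target registers live in a memory structure, and the temporary T0 is an ordinary local standing in for a host register. The structure layout and the function exec_addi_r1 are assumptions made for the illustration, not QEMU's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Target CPU state kept in memory: every PowerPC register has a fixed slot. */
typedef struct {
    uint32_t regs[32];
} CPUState;

/* Executes "addi r1, r1, imm" as the three micro operations of the example.
 * T0 is a local here; QEMU pins such temporaries to a fixed host register
 * (ebx in the x86 listing of the example). */
uint32_t exec_addi_r1(CPUState *env, int32_t imm)
{
    uint32_t T0;

    assert(env != NULL);
    T0 = env->regs[1];        /* movl_T0_r1: load from the fixed memory slot */
    T0 = T0 + (uint32_t)imm;  /* addl_T0_im: wraps modulo 2^32               */
    env->regs[1] = T0;        /* movl_r1_T0: store back                      */
    return T0;
}
```

Because the mapping never changes, each micro operation can be compiled once, offline, without knowing which micro operations will surround it at run time.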

Condition code optimizations

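The emulation of condition codes is critical for performance.
QEMU uses lazy condition code evaluation: instead of computing the condition codes after every instruction, it stores only the operands and the type of operation. An instruction that needs the condition codes can then recompute them from the data saved by the last instruction that affected them.
Condition code evaluation can be optimized further at translation time.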
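
A small sketch may make the lazy scheme concrete. It assumes 32-bit add and subtract only and derives just the zero and carry flags; the structure LazyFlags and all function names are hypothetical, chosen for this illustration rather than taken from QEMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { CC_OP_ADD, CC_OP_SUB } CCOp;

/* What an arithmetic instruction leaves behind instead of real flags. */
typedef struct {
    uint32_t cc_src;  /* second operand of the last flag-setting operation */
    uint32_t cc_dst;  /* its result                                        */
    CCOp cc_op;       /* which operation it was                            */
} LazyFlags;

uint32_t op_add(LazyFlags *f, uint32_t a, uint32_t b)
{
    uint32_t r = (uint32_t)(a + b);
    assert(f != NULL);
    f->cc_src = b; f->cc_dst = r; f->cc_op = CC_OP_ADD;
    return r;
}

uint32_t op_sub(LazyFlags *f, uint32_t a, uint32_t b)
{
    uint32_t r = (uint32_t)(a - b);
    assert(f != NULL);
    f->cc_src = b; f->cc_dst = r; f->cc_op = CC_OP_SUB;
    return r;
}

/* Flags are computed only when an instruction actually asks for them. */
bool flag_zero(const LazyFlags *f)
{
    return f->cc_dst == 0;
}

bool flag_carry(const LazyFlags *f)
{
    if (f->cc_op == CC_OP_ADD)
        return f->cc_dst < f->cc_src;   /* the sum wrapped around */
    /* CC_OP_SUB: recover a = r + b, then a borrow occurred iff a < b. */
    return (uint32_t)(f->cc_dst + f->cc_src) < f->cc_src;
}
```

A run of arithmetic instructions followed by a single conditional jump then pays for one flag computation instead of one per instruction.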

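Direct block chaining

After a TB finishes executing, QEMU uses the simulated PC and the static CPU state as the key into a hash table to find the next TB. If it has not been translated yet, translation starts; otherwise QEMU jumps to it directly.
If the new simulated PC is known (for example after a conditional jump), QEMU can patch a TB so that it jumps directly to the next one.
The most portable implementation uses an indirect jump. On some hosts, the branch instruction itself is patched so that block chaining has no overhead.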
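
The lookup-then-chain logic can be sketched with a pointer field playing the role of the patched jump. Everything here is an assumption for illustration: the TB layout, the hash function, the table size and the names tb_find, tb_insert and tb_next.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TB_HASH_SIZE 256

typedef struct TB {
    uint32_t pc;            /* simulated PC at block entry               */
    uint32_t flags;         /* the rest of the static CPU state          */
    struct TB *hash_next;   /* next TB in the same hash bucket           */
    struct TB *jmp_next;    /* chained successor, NULL when not patched  */
} TB;

static TB *tb_hash[TB_HASH_SIZE];

static unsigned tb_hash_func(uint32_t pc, uint32_t flags)
{
    return (pc ^ (pc >> 8) ^ flags) % TB_HASH_SIZE;
}

TB *tb_find(uint32_t pc, uint32_t flags)
{
    TB *tb;
    for (tb = tb_hash[tb_hash_func(pc, flags)]; tb != NULL; tb = tb->hash_next)
        if (tb->pc == pc && tb->flags == flags)
            return tb;
    return NULL;            /* not translated yet */
}

void tb_insert(TB *tb)
{
    unsigned h;
    assert(tb != NULL);
    h = tb_hash_func(tb->pc, tb->flags);
    tb->hash_next = tb_hash[h];
    tb_hash[h] = tb;
}

/* Find the block that follows cur. The first time the hash table is used and
 * the result is remembered in cur->jmp_next; later calls skip the lookup.
 * *lookups counts how often the slow path was taken. */
TB *tb_next(TB *cur, uint32_t next_pc, uint32_t flags, unsigned *lookups)
{
    TB *next;

    assert(cur != NULL && lookups != NULL);
    next = cur->jmp_next;
    if (next != NULL && next->pc == next_pc && next->flags == flags)
        return next;        /* chained: no lookup */
    (*lookups)++;
    next = tb_find(next_pc, flags);
    if (next != NULL)
        cur->jmp_next = next;   /* "patch" cur to jump directly next time */
    return next;
}
```

In this sketch the chained path still returns to C code, so it corresponds to the portable indirect-jump variant; patching the host branch instruction removes even that step.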

Memory management

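For system emulation, QEMU uses mmap() to emulate the target MMU.
A software MMU is also supported, in which the virtual to physical address translation is performed in software; an address translation cache speeds it up.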
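
An address translation cache of this kind can be sketched as a direct-mapped table in front of a slow page table walk. The sizes, the SoftMMU structure, mmu_translate and the toy demo_walk mapping are all assumptions for this illustration and do not describe QEMU's actual data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BITS 12            /* 4 KB pages (assumed)           */
#define PAGE_MASK 0xfffu
#define TLB_SIZE  256           /* direct-mapped, assumed size    */

typedef struct {
    uint32_t vpage;             /* virtual page number            */
    uint32_t ppage;             /* physical page number           */
    bool valid;
} TLBEntry;

typedef struct {
    TLBEntry tlb[TLB_SIZE];
    uint32_t (*walk)(uint32_t vpage);  /* slow path: page table walk */
    unsigned misses;                   /* number of slow-path calls  */
} SoftMMU;

/* Translate a virtual address, consulting the cache first. */
uint32_t mmu_translate(SoftMMU *m, uint32_t vaddr)
{
    uint32_t vpage = vaddr >> PAGE_BITS;
    TLBEntry *e;

    assert(m != NULL && m->walk != NULL);
    e = &m->tlb[vpage % TLB_SIZE];
    if (!e->valid || e->vpage != vpage) {
        /* Miss: do the expensive walk once and remember the result. */
        e->vpage = vpage;
        e->ppage = m->walk(vpage);
        e->valid = true;
        m->misses++;
    }
    return (e->ppage << PAGE_BITS) | (vaddr & PAGE_MASK);
}

/* Toy page table for demonstration only: page n maps to page n + 0x100. */
uint32_t demo_walk(uint32_t vpage)
{
    return vpage + 0x100;
}
```

Two accesses to the same page cost one walk; a real software MMU would also have to flush entries when the guest changes its page tables, which this sketch leaves out.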

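Self-modifying code and code invalidation

On most CPUs this is easy to handle: the program executes an instruction cache invalidation instruction after modifying code, so QEMU only has to invalidate the corresponding translated code.
On x86, applications do not signal an instruction cache invalidation when they modify code. When a TB has been translated, the corresponding host page is write protected if it is not already read-only. If a write to that page then occurs, QEMU invalidates all the translated code of the page and re-enables write access.
All the TBs of a page are kept in a linked list, which makes this per-page invalidation easy.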
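
The per-page list can be sketched without touching real page protections. In the code below a boolean stands in for the host's write protection and notify_write stands in for whatever fault handler detects the write; these names, the tiny address space and the TB layout are assumptions for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BITS 12
#define NB_PAGES  16            /* tiny guest address space for the sketch */

typedef struct TB {
    uint32_t pc;                /* guest address the block was translated from */
    bool valid;
    struct TB *page_next;       /* next TB translated from the same page       */
} TB;

typedef struct {
    TB *first_tb;               /* list of all TBs of this page                */
    bool write_protected;       /* stands in for the host page protection      */
} PageDesc;

static PageDesc pages[NB_PAGES];

static PageDesc *page_of(uint32_t addr)
{
    assert((addr >> PAGE_BITS) < NB_PAGES);
    return &pages[addr >> PAGE_BITS];
}

/* Called once a TB has been translated: link it and protect its page. */
void tb_link_page(TB *tb)
{
    PageDesc *p;
    assert(tb != NULL);
    p = page_of(tb->pc);
    tb->valid = true;
    tb->page_next = p->first_tb;
    p->first_tb = tb;
    p->write_protected = true;
}

/* Called when a write hits addr. Returns how many TBs were invalidated. */
int notify_write(uint32_t addr)
{
    PageDesc *p = page_of(addr);
    TB *tb = p->first_tb;
    int n = 0;

    if (!p->write_protected)
        return 0;               /* no translated code on this page */
    while (tb != NULL) {
        TB *next = tb->page_next;
        tb->valid = false;
        tb->page_next = NULL;
        tb = next;
        n++;
    }
    p->first_tb = NULL;
    p->write_protected = false; /* writes are allowed again */
    return n;
}
```

Invalidation is deliberately coarse: a write anywhere in the page drops all of its blocks, which keeps the bookkeeping to one list head per page.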

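Exception support

longjmp() is used to jump to the exception handling code.
When the software MMU is not used, invalid memory accesses are caught by the host.
QEMU also supports precise exceptions: it can always recover the exact target CPU state at the time the exception occurred.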
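
The longjmp() mechanism can be shown with a division helper that bails out to the execution loop. This is a minimal sketch: the exception numbers, op_divl and cpu_exec_div are hypothetical names, and a single static jmp_buf is assumed, so the sketch is not reentrant.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>
#include <stdint.h>

enum { EXCP_NONE = 0, EXCP_DIVZ = 1 };

static jmp_buf cpu_loop_env;    /* where exceptions unwind to */

/* A helper called from translated code that may raise a guest exception. */
static uint32_t op_divl(uint32_t a, uint32_t b)
{
    if (b == 0)
        longjmp(cpu_loop_env, EXCP_DIVZ);   /* abandon the block at once */
    return a / b;
}

/* Runs one stand-in "translated block" computing a / b. Returns the exception
 * number, or EXCP_NONE after storing the quotient in *result. */
int cpu_exec_div(uint32_t a, uint32_t b, uint32_t *result)
{
    assert(result != NULL);
    switch (setjmp(cpu_loop_env)) {
    case 0:                     /* direct return from setjmp: run the block */
        *result = op_divl(a, b);
        return EXCP_NONE;
    case EXCP_DIVZ:             /* arrived here through longjmp */
        return EXCP_DIVZ;
    default:
        return -1;              /* unknown exception number */
    }
}
```

The generated code never has to test for errors after each helper call; the non-local jump takes care of leaving the block from any depth.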

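Hardware interrupts

QEMU does not check whether a hardware interrupt is pending at every TB.
Instead, the user must asynchronously call a specific function to signal that an interrupt is pending. This function resets the chaining of the currently executing TB, so that execution returns quickly to the main loop, which then handles the interrupt.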
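
The effect of resetting the chaining can be imitated with two blocks that loop forever until the link between them is cut. The sketch is a simplification under my own assumptions: cpu_interrupt is called from inside a block body to stand in for an asynchronous caller, and the TB structure, run_chained and cpu_exec are hypothetical names.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

typedef struct TB {
    struct TB *jmp_next;        /* chained successor, NULL returns to the loop */
    void (*body)(void);         /* stands in for the generated host code       */
} TB;

static volatile sig_atomic_t interrupt_request;
static TB *current_tb;
static unsigned executed_blocks;

/* Signal a pending interrupt and unchain the block that is running. */
void cpu_interrupt(void)
{
    interrupt_request = 1;
    if (current_tb != NULL)
        current_tb->jmp_next = NULL;
}

/* Follow the chain with no interrupt check at all between blocks. */
static void run_chained(TB *tb)
{
    while (tb != NULL) {
        current_tb = tb;
        executed_blocks++;
        if (tb->body != NULL)
            tb->body();
        tb = tb->jmp_next;
    }
    current_tb = NULL;
}

/* Main loop step: returns 1 if an interrupt was pending and was consumed. */
int cpu_exec(TB *entry)
{
    assert(entry != NULL);
    run_chained(entry);
    if (interrupt_request) {    /* the only place the flag is tested */
        interrupt_request = 0;
        return 1;
    }
    return 0;
}
```

With two blocks chained in a cycle, the loop would never end on its own; cutting one link is what forces control back to cpu_exec, where the flag is finally examined.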

User mode emulation

No MMU emulation is performed.
Endianness is converted automatically.
Each target thread runs in its own host thread.
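
The endianness conversion boils down to swapping bytes only when host and target disagree. The helpers below are a sketch with hypothetical names (swap_bytes32, host_is_little_endian, tswap32, load_target32); the text does not say how QEMU implements the conversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of a 32-bit value. */
uint32_t swap_bytes32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
           ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}

/* Detect the host byte order at run time. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Convert between target and host byte order; a no-op when they match. */
uint32_t tswap32(uint32_t v, int target_is_little_endian)
{
    if (host_is_little_endian() == (target_is_little_endian != 0))
        return v;
    return swap_bytes32(v);
}

/* Read a 32-bit field of a target structure, e.g. a system call argument
 * block, from raw target memory. */
uint32_t load_target32(const uint8_t *p, int target_is_little_endian)
{
    uint32_t raw;
    assert(p != NULL);
    memcpy(&raw, p, sizeof raw);   /* avoids unaligned access */
    return tswap32(raw, target_is_little_endian);
}
```

Reading the bytes 12 34 56 78 as a big-endian target value gives 0x12345678 on any host, which is the property a system call converter needs.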

Porting work

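dyngen must be ported to the new host.
Temporary variables must be mapped to host registers.
The host CPU's instructions for keeping the instruction cache coherent with memory must be used.
Assembly macros must be provided if direct block chaining is implemented by patching branch instructions.
Porting QEMU is about as difficult as porting a dynamic linker.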

Performance

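Integer code: about 4 times slower than native execution
Floating point code: about 10 times slower
Full system emulation with the software MMU: a further slowdown by a factor of 2
About 30 times faster than Bochs
About 1.2 times faster than valgrind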

Conclusion and future work

"Cache simulation and cycle counters could be added to make a debugger as in SIMICS." Has this been implemented yet?

References

[1] Optimizing direct threaded code by selective inlining.
