This blog post details some internals of the QEMU TCG engine, the machinery responsible for executing target instructions on the host.
You should have already read Execution loop and Breakpoints handling blog posts to have some pointers.
The vCPU thread executes instructions through
tcg_cpu_exec
which finds and/or generates translated blocks.
As previously explained,
tb_gen_code
will generate intermediate representation (IR) code thanks to
gen_intermediate_code
and then host architecture assembly instructions with
tcg_gen_code
:
TranslationBlock *tb_gen_code(CPUState *cpu,
target_ulong pc, target_ulong cs_base,
uint32_t flags, int cflags)
{
tb = tb_alloc(pc);
...
/* generate IR code */
gen_intermediate_code(cpu, tb, max_insns);
...
/* generate machine code */
gen_code_size = tcg_gen_code(tcg_ctx, tb);
...
}
The QEMU TCG has a notion of frontend and backend operations. The frontend-ops are the generated intermediate code which is what QEMU will execute. The backend-ops are the operations implemented on the host CPU, which is where the code is executed.
The QEMU git tree has a README introduction about the TCG which details the IR language. It's an interesting read for sure.
The
gen_intermediate_code
function is a VM architecture dependent wrapper to the
translator_loop
generic function. Let's imagine we were emulating PowerPC code on an
Intel x86 host.
The
gen_intermediate_code
would look like:
static const TranslatorOps ppc_tr_ops = {
.init_disas_context = ppc_tr_init_disas_context,
.tb_start = ppc_tr_tb_start,
.insn_start = ppc_tr_insn_start,
.breakpoint_check = ppc_tr_breakpoint_check,
.translate_insn = ppc_tr_translate_insn,
.tb_stop = ppc_tr_tb_stop,
.disas_log = ppc_tr_disas_log,
};
void gen_intermediate_code(CPUState *cs, TranslationBlock *tb, int max_insns)
{
DisasContext ctx;
translator_loop(&ppc_tr_ops, &ctx.base, cs, tb, max_insns);
}
While the
translator_loop
remains the same for any architecture, it relies on target specific
translator
operators. In
our case, the PowerPC ones (ppc_tr_ops
).
void translator_loop(const TranslatorOps *ops, DisasContextBase *db,
CPUState *cpu, TranslationBlock *tb, int max_insns)
{
ops->init_disas_context(db, cpu);
...
gen_tb_start(db->tb);
ops->tb_start(db, cpu);
while (true) {
ops->translate_insn(db, cpu);
}
...
ops->tb_stop(db, cpu);
gen_tb_end(db->tb, db->num_insns - bp_insn);
}
Each TB has a prologue (tb_start
), and an epilogue (tb_end
) with a
generic and target specific part (if needed). Usually epilogues are
placeholders for block chaining, which is an optimization feature of
the TCG that enables TBs to be called successively after execution,
without the need to get back to the QEMU code and look for the next TB
to execute.
The architecture dependent
DisasContext
is created alongside a generic
DisasContextBase
.
For the PPC target, the
ppc_tr_init_disas_context
handler will record current cpu state information. This means that TBs
are highly contextual and might not be reused in any situation where
we execute at a previously translated location.
static void ppc_tr_init_disas_context(DisasContextBase *dcbase, CPUState *cs)
{
DisasContext *ctx = container_of(dcbase, DisasContext, base);
CPUPPCState *env = cs->env_ptr;
int bound;
ctx->exception = POWERPC_EXCP_NONE;
ctx->spr_cb = env->spr_cb;
ctx->pr = msr_pr;
ctx->mem_idx = env->dmmu_idx;
ctx->dr = msr_dr;
...
ctx->insns_flags = env->insns_flags;
ctx->insns_flags2 = env->insns_flags2;
ctx->access_type = -1;
ctx->need_access_type = !(env->mmu_model & POWERPC_MMU_64B);
ctx->le_mode = !!(env->hflags & (1 << MSR_LE));
ctx->default_tcg_memop_mask = ctx->le_mode ? MO_LE : MO_BE;
ctx->flags = env->flags;
...
The generic TB prologue is generated from
gen_tb_start
static inline void gen_tb_start(TranslationBlock *tb)
{
TCGv_i32 count, imm;
tcg_ctx->exitreq_label = gen_new_label();
if (tb_cflags(tb) & CF_USE_ICOUNT) {
count = tcg_temp_local_new_i32();
} else {
count = tcg_temp_new_i32();
}
tcg_gen_ld_i32(count, cpu_env,
-ENV_OFFSET + offsetof(CPUState, icount_decr.u32));
if (tb_cflags(tb) & CF_USE_ICOUNT) {
imm = tcg_temp_new_i32();
/* We emit a movi with a dummy immediate argument. Keep the insn index
* of the movi so that we later (when we know the actual insn count)
* can update the immediate argument with the actual insn count. */
tcg_gen_movi_i32(imm, 0xdeadbeef);
icount_start_insn = tcg_last_op();
tcg_gen_sub_i32(count, count, imm);
tcg_temp_free_i32(imm);
}
tcg_gen_brcondi_i32(TCG_COND_LT, count, 0, tcg_ctx->exitreq_label);
if (tb_cflags(tb) & CF_USE_ICOUNT) {
tcg_gen_st16_i32(count, cpu_env,
-ENV_OFFSET + offsetof(CPUState, icount_decr.u16.low));
}
tcg_temp_free_i32(count);
}
It injects instructions to check for instruction count and an exit condition.
The epilogue is generated from
gen_tb_end
. It
injects instructions to exit from the TB (tcg_gen_exit_tb
).
static inline void gen_tb_end(TranslationBlock *tb, int num_insns)
{
if (tb_cflags(tb) & CF_USE_ICOUNT) {
/* Update the num_insn immediate parameter now that we know
* the actual insn count. */
tcg_set_insn_param(icount_start_insn, 1, num_insns);
}
gen_set_label(tcg_ctx->exitreq_label);
tcg_gen_exit_tb(tb, TB_EXIT_REQUESTED);
}
The TB_EXIT_REQUESTED
is a special value that tells QEMU to use the
exitreq_label
created during the prologue. This label will
eventually be updated with the address of the next TB to be executed.
The operator used to translate target instructions to IR is
translate_insn
(ppc_tr_translate_insn
for PowerPC). This function makes use of the target CPU
opcodes
handlers table which implements IR generation for every target native
instructions.
static void ppc_tr_translate_insn(DisasContextBase *dcbase, CPUState *cs)
{
opc_handler_t **table, *handler;
table = cpu->opcodes;
handler = table[opc1(ctx->opcode)];
...
(*(handler->handler))(ctx);
}
struct PowerPCCPU {
...
/* Those resources are used only during code translation */
/* opcode handlers */
opc_handler_t *opcodes[PPC_CPU_OPCODES_LEN];
...
}
static opcode_t opcodes[] = {
GEN_HANDLER(cmp, 0x1F, 0x00, 0x00, 0x00400000, PPC_INTEGER),
GEN_HANDLER(cmpi, 0x0B, 0xFF, 0xFF, 0x00400000, PPC_INTEGER),
GEN_HANDLER(cmpl, 0x1F, 0x00, 0x01, 0x00400001, PPC_INTEGER),
...
GEN_LDS(lbz, ld8u, 0x02, PPC_INTEGER)
GEN_LDS(lha, ld16s, 0x0A, PPC_INTEGER)
...
};
For the PowerPC cmp
instruction, the associated handler will be
gen_cmp
which will emit the following IR instructions:
static void gen_cmp(DisasContext *ctx)
{
if ((ctx->opcode & 0x00200000) && (ctx->insns_flags & PPC_64B)) {
gen_op_cmp(cpu_gpr[rA(ctx->opcode)], cpu_gpr[rB(ctx->opcode)],
1, crfD(ctx->opcode));
} else {
gen_op_cmp32(cpu_gpr[rA(ctx->opcode)], cpu_gpr[rB(ctx->opcode)],
1, crfD(ctx->opcode));
}
}
static inline void gen_op_cmp(TCGv arg0, TCGv arg1, int s, int crf)
{
TCGv t0 = tcg_temp_new();
TCGv t1 = tcg_temp_new();
TCGv_i32 t = tcg_temp_new_i32();
tcg_gen_movi_tl(t0, CRF_EQ);
tcg_gen_movi_tl(t1, CRF_LT);
tcg_gen_movcond_tl((s ? TCG_COND_LT : TCG_COND_LTU),
t0, arg0, arg1, t1, t0);
tcg_gen_movi_tl(t1, CRF_GT);
tcg_gen_movcond_tl((s ? TCG_COND_GT : TCG_COND_GTU),
t0, arg0, arg1, t1, t0);
tcg_gen_trunc_tl_i32(t, t0);
tcg_gen_trunc_tl_i32(cpu_crf[crf], cpu_so);
tcg_gen_or_i32(cpu_crf[crf], cpu_crf[crf], t);
tcg_temp_free(t0);
tcg_temp_free(t1);
tcg_temp_free_i32(t);
}
The PowerPC code here after shows 3 types of operations:
- simple instructions: arithmetic, immediate operands (
lis, ori, xor
) - a memory write (
stw
) - a system register write (
mtmsr
)
0xfff00100: lis r1,1
0xfff00104: ori r1,r1,0x409c
0xfff00108: xor r0,r0,r0
0xfff0010c: stw r0,4(r1)
0xfff00110: mtmsr r0
The following code gives the TCG IR equivalent:
0xfff00100: movi_i32 r1,$0x10000
0xfff00104: movi_i32 tmp0,$0x409c
or_i32 r1,r1,tmp0
0xfff00108: movi_i32 r0,$0x0
0xfff0010c: movi_i32 tmp1,$0x4
add_i32 tmp0,r1,tmp1
qemu_st_i32 r0,tmp0,beul,3
0xfff00110: movi_i32 nip,$0xfff00114
mov_i32 tmp0,r0
call store_msr,$0,tmp0
movi_i32 nip,$0xfff00114
exit_tb $0x0
set_label $L0
exit_tb $0x7f5a0caf8043
We notice that the memory write operation as well as the MSR
access
are translated into strange IR opcodes:
- qemu_st_i32
- call store_msr
We will detail them in a blog post dedicated to TCG helpers.
Also notice the operands of the exit_tb
opcode. The first one is
$0x0
and might eventually be fixed with the absolute host
address of the next TB to be executed.
The last one is also a host address ($0x7f5a0caf8043
) where code
is directly executed by the physical CPU. In this case, it goes back
to the QEMU translator.
At this point however, no target translated code is still executed. We only end up with IR code which needs a final translation step to the host architecture.