
Version b1b8eb6930f31b98f10d0d4947a534783dfd397a

xvisor

Contributors

  • Spring 2015
    • 沈宗穎, 李育丞, 蘇誌航, 張仁傑, 鄧維岱

Hackpad

  • xvisor<https://mycpp.hackpad.com/xvisor-xQUUkRHWwmm>_
  • Design doc notes<https://mycpp.hackpad.com/Xvisor--WzicRuv3qqf>_
  • Report notes<https://mycpp.hackpad.com/Xvisor--nSiBMBB9RNN>_

Virtualization

This section introduces virtualization and uses Xvisor on ARMv8</embedded/ARMv8>_ as a case study of how it is implemented.

Hypervisor (virtual machine monitor)

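  • A hypervisor (virtual machine monitor, VMM) manages virtual machines (VMs).

    • Host machine: the physical machine that runs the hypervisor; in some designs the host OS itself runs on top of the hypervisor.

    • Guest machine: a virtual machine that runs on top of the hypervisor.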

    .. image:: http://electronicdesign.com/site-files/electronicdesign.com/files/archive/electronicdesign.com/files/29/20211/fig_01.jpg

  • Use cases:

    • Workload consolidation
    • Support for legacy software
    • Multicore enablement
    • Improved reliability
    • Secure monitoring

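Types of hypervisor

  • Type-1: native (bare-metal) hypervisors
    • The hypervisor runs directly on the host hardware, controls that hardware, and manages the guest operating systems. The host treats each guest OS as a process.
      • Examples: XenServer<http://en.wikipedia.org/wiki/Xen>_ , Hyper-V<http://en.wikipedia.org/wiki/Hyper-V>_ , Xvisor
  • Type-2: hosted hypervisors
    • The hypervisor runs on top of the host operating system and provides virtualization services from there.
      • Examples: VMware<http://en.wikipedia.org/wiki/VMware_Workstation>_ , VirtualBox<http://en.wikipedia.org/wiki/VirtualBox>_
  • Comparison:
    • Type-1: better security and reliability, but because it runs directly on the hardware it is often single-purpose.
    • Type-2: supports more I/O devices and services and is easier to install and use, but performs worse than Type-1. It is typically used on client machines where efficiency matters less.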

Virtualizing instructions (ARM)

Virtualization theorems
.......................

  • Popek and Goldberg virtualization requirements<http://en.wikipedia.org/wiki/Popek_and_Goldberg_virtualization_requirements>_

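  • A virtual machine monitor must provide:

    • Equivalence: a program running under the hypervisor must behave the same as it does when running directly on the machine.
    • Resource control: the hypervisor must be in complete control of the virtualized resources.
    • Efficiency: the statistically dominant fraction of machine instructions must execute without hypervisor intervention.
  • Theorem: for any conventional third-generation computer, a VMM can be constructed if the set of sensitive instructions is a subset of the set of privileged instructions (from Wikipedia).

    • Sensitive instructions that the OS used to execute in kernel mode no longer work once the OS is moved to user mode, so they must trap to the hypervisor, which executes them on the guest's behalf.
  • Privileged instructions: trap when executed in user mode.

  • Control-sensitive instructions: change the processor configuration or mode.

  • Behavior-sensitive instructions: behave differently depending on the processor state.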
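
The subset condition in the theorem can be checked mechanically. The following is a minimal sketch: the two instruction sets are illustrative assumptions (a handful of classic ARM instructions, grouped by whether they trap in user mode), not a complete classification of any real ISA.

```python
# Assumption: a tiny, illustrative classification, not a full ARM ISA survey.
# "Privileged" here means: executing the instruction in user mode traps.
PRIVILEGED = {"MCR", "MRC"}
# "Sensitive" means: control-sensitive or behavior-sensitive.
SENSITIVE = {"MCR", "MRC", "MSR", "MRS", "MOVS PC, LR"}


def is_classically_virtualizable(sensitive, privileged):
    """Popek-Goldberg condition: every sensitive instruction must trap."""
    return sensitive <= privileged


def problem_instructions(sensitive, privileged):
    """Sensitive instructions that do not trap and so need extra handling."""
    return sorted(sensitive - privileged)


print(is_classically_virtualizable(SENSITIVE, PRIVILEGED))
print(problem_instructions(SENSITIVE, PRIVILEGED))
```

Any instruction returned by `problem_instructions` cannot rely on trap and emulate alone; it has to be rewritten (for example as a hypercall) or handled by hardware support.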

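Problematic instructions
........................

  • Type I: raises an undefined-instruction exception when executed in user mode
    • MCR, MRC: depend on a coprocessor
  • Type II: has no effect when executed in user mode
    • MSR, MRS: operate on system registers
  • Type III: behaves unpredictably when executed in user mode
    • MOVS PC, LR: an exception-return instruction that changes the PC and switches back to user mode; executing it in user mode gives unpredictable results
  • Sensitive instructions on ARM:
    • Coprocessor access: MRC / MCR / CDP / LDC / STC
    • SIMD/VFP system register access: VMRS / VMSR
    • Entering the TrustZone secure state: SMC
    • Memory-mapped I/O access: load/store instructions from/into memory-mapped I/O locations
    • Direct CPSR access: MRS / MSR / CPS / SRS / RFE / LDM (conditional execution) / DPSPC
    • Indirect CPSR access: LDRT / STRT (Load/Store Unprivileged, "As User")
    • Banked register access: LDM / STM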

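Solutions
.........

  • Software techniques: trap and emulate
    • Dynamic binary translation
      • Replace problematic instructions with hypercalls so that they trap and can be emulated
    • Hypercall
      • Encode the instruction type and the original instruction bits into the hypercall
      • Trap to the hypervisor, which decodes and emulates the instruction
  • Hardware techniques:
    • Privileged instruction translation (to be added)
    • MMU-enforced traps
    • Virtualization extensions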

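Memory virtualization (without hardware support)
................................................

  • Shadow page tables:
    • Map guest virtual addresses to host physical addresses
    • The guest OS maintains its own page tables, which map to guest physical frames
    • The hypervisor maps every guest physical frame to a host physical frame

.. image:: /embedded/shadow_page_table.png

(from System Virtualization: Memory Virtualization, National Tsing Hua University)

  • A shadow page table is built for every guest page table
  • The hypervisor must protect the host frames that hold the guest page tables, so that guest updates to them can be intercepted

.. image:: /embedded/write_protect.png

(from System Virtualization: Memory Virtualization, National Tsing Hua University)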
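
How a shadow page table is derived from the two mappings above can be sketched with plain dictionaries. This is a toy model under stated assumptions: single-level tables, 4 KB pages, and hypothetical names (`guest_pt`, `p2m`, `build_shadow_table`); a real hypervisor works on multi-level hardware page tables.

```python
PAGE_SIZE = 4096  # assumption: 4 KB pages


def build_shadow_table(guest_pt, p2m):
    """Compose guest VA->guest frame with guest frame->host frame.

    guest_pt: guest virtual page number -> guest physical frame number
    p2m:      guest physical frame number -> host physical frame number
    """
    shadow = {}
    for vpn, gfn in guest_pt.items():
        if gfn in p2m:  # frames the hypervisor has not backed stay unmapped
            shadow[vpn] = p2m[gfn]
    return shadow


def translate(shadow, va):
    """Translate a guest virtual address using only the shadow table."""
    vpn, offset = divmod(va, PAGE_SIZE)
    return shadow[vpn] * PAGE_SIZE + offset


guest_pt = {0x10: 0x2, 0x11: 0x3}
p2m = {0x2: 0x80, 0x3: 0x95}
shadow = build_shadow_table(guest_pt, p2m)
print(hex(translate(shadow, 0x10123)))
```

Because the hardware walks only the shadow table, it has to be rebuilt whenever the guest edits `guest_pt`, which is why the frames holding the guest page tables are protected.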

ARM Virtualization Extensions

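See also: ARMv8#Virtualization<http://wiki.csie.ncku.edu.tw/embedded/ARMv8#%E8%99%9B%E6%93%AC%E5%8C%96-virtualization>_ (register details to be added)

CPU virtualization
..................

  • ARM adds a Hypervisor (Hyp) mode that runs at Non-secure privilege level 2 (PL2), which corresponds to EL2 in ARMv8

  • CPU virtualization extensions

    • The guest OS kernel runs at EL1 and guest userspace runs at EL0
    • Most sensitive instructions run natively at EL1 without trap and emulation
    • Sensitive instructions that still need to trap are trapped to EL2 (Hyp mode)
      • Guest OS loads/stores (for example, to emulated memory-mapped devices)
      • Instructions that would affect other guest OSes
      • The Hypervisor Syndrome Register (HSR; ESR_EL2 in AArch64) records information about the trapped instruction so that the hypervisor can emulate it
  • Xvisor, cpu_vcpu_helper.c<https://github.com/xvisor/xvisor/blob/master/arch/arm/cpu/arm64/cpu_vcpu_helper.c>_: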

.. code-block:: c

/* Initialize Hypervisor Configuration */
INIT_SPIN_LOCK(&arm_priv(vcpu)->hcr_lock);
arm_priv(vcpu)->hcr =  (HCR_TSW_MASK |
    HCR_TACR_MASK |
    HCR_TIDCP_MASK |
    HCR_TSC_MASK |
    HCR_TWE_MASK |
    HCR_TWI_MASK |
    HCR_AMO_MASK |
    HCR_IMO_MASK |
    HCR_FMO_MASK |
    HCR_SWIO_MASK |
    HCR_VM_MASK);

This traps the sensitive EL1 instructions (MCR, MRC, SMC, WFE, WFI) and the interrupts (IRQ, FIQ) to EL2, and enables stage 2 address translation.
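
To see which register bits that configuration sets, the mask can be rebuilt from bit positions. This is a sketch under an assumption: the positions below follow the HCR_EL2 layout in the ARMv8-A architecture manual and should be checked against it; they are not copied from Xvisor's headers, and `HCR_BITS` is a hypothetical name.

```python
# Assumption: HCR_EL2 bit positions as documented in the ARMv8-A manual.
HCR_BITS = {
    "VM": 0,      # enable stage 2 translation
    "SWIO": 1,    # set/way invalidation override
    "FMO": 3,     # route physical FIQ to EL2
    "IMO": 4,     # route physical IRQ to EL2
    "AMO": 5,     # route asynchronous aborts to EL2
    "TWI": 13,    # trap WFI
    "TWE": 14,    # trap WFE
    "TSC": 19,    # trap SMC
    "TIDCP": 20,  # trap implementation-defined coprocessor functionality
    "TACR": 21,   # trap auxiliary control register accesses
    "TSW": 22,    # trap set/way cache maintenance
}


def hcr_value(names):
    """OR together the single-bit masks for the given field names."""
    value = 0
    for name in names:
        value |= 1 << HCR_BITS[name]
    return value


def decode_hcr(value):
    """List the known field names that are set in value, lowest bit first."""
    ordered = sorted(HCR_BITS.items(), key=lambda item: item[1])
    return [name for name, bit in ordered if value & (1 << bit)]


initial_hcr = hcr_value(HCR_BITS)  # all eleven fields used in the snippet above
print(hex(initial_hcr))
print(decode_hcr(initial_hcr))
```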

Xvisor instruction emulation
............................

cpu_entry.S initializes the Hyp vector base:

.. code-block:: c

vectors:
ventry	hyp_sync_invalid	/* Synchronous EL1t */
ventry	hyp_irq_invalid		/* IRQ EL1t */
ventry	hyp_fiq_invalid		/* FIQ EL1t */
ventry	hyp_error_invalid	/* Error EL1t */

ventry	hyp_sync		/* Synchronous EL1h */
ventry	hyp_irq			/* IRQ EL1h */
ventry	hyp_fiq_invalid		/* FIQ EL1h */
ventry	hyp_error_invalid	/* Error EL1h */
........

EXCEPTION_HANDLER hyp_sync
    PUSH_REGS
    mov	x1, EXC_HYP_SYNC_SPx
    CALL_EXCEPTION_CFUNC do_sync    
    PULL_REGS

/* 
 *         .macro	PUSH_REGS
 *         sub	sp, sp, #0x20
 *         push	x28, x29
 *         push	x26, x27
 *         push	x24, x25
 *         push	x22, x23
 *        ......
 *         push	x0, x1
 *         add	x21, sp, #0x110
 *         mrs	x22, elr_el2
 *         mrs	x23, spsr_el2
 *         stp	x30, x21, [sp, #0xF0]
 *         stp	x22, x23, [sp, #0x100]
 *
 *
 *  .macro CALL_EXCEPTION_CFUNC cfunc
 *      mov	x0, sp                  x0 holds the arch_regs_t *regs argument passed to the C handler below
 *      bl	\cfunc
 *  .endm
 */

  • ESR_EL2, Exception Syndrome Register (EL2): holds the syndrome information for exceptions taken to EL2

cpu_interrupt.c (compare with the ESR encoding on p. 1905 of the architecture manual)

.. code-block:: c

void do_sync(arch_regs_t *regs, unsigned long mode)
{

.......

    esr = mrs(esr_el2);
    far = mrs(far_el2);
    elr = mrs(elr_el2);

    ec = (esr & ESR_EC_MASK) >> ESR_EC_SHIFT;
    il = (esr & ESR_IL_MASK) >> ESR_IL_SHIFT;
    iss = (esr & ESR_ISS_MASK) >> ESR_ISS_SHIFT;

.......


    switch (ec) {
    case EC_UNKNOWN:
        /* We dont expect to get this trap so error */
        rc = VMM_EFAIL;
        break;
    case EC_TRAP_WFI_WFE:
        /* WFI emulation */
        rc = cpu_vcpu_emulate_wfi_wfe(vcpu, regs, il, iss);
        break;
    case EC_TRAP_MCR_MRC_CP15_A32:
        /* MCR/MRC CP15 emulation */
        rc = cpu_vcpu_emulate_mcr_mrc_cp15(vcpu, regs, il, iss);
        break;
.........
        break;
    case EC_TRAP_HVC_A64:
        /* HVC emulation for A64 guest */
        rc = cpu_vcpu_emulate_hvc64(vcpu, regs, il, iss);
        break;
    case EC_TRAP_MSR_MRS_SYSTEM:
        /* MSR/MRS/SystemRegs emulation */
        rc = cpu_vcpu_emulate_msr_mrs_system(vcpu, regs, il, iss);
        break;
    case EC_TRAP_LWREL_INST_ABORT:
        /* Stage2 instruction abort */
        fipa = (mrs(hpfar_el2) & HPFAR_FIPA_MASK) >> HPFAR_FIPA_SHIFT;
        fipa = fipa << HPFAR_FIPA_PAGE_SHIFT;
        fipa = fipa | (mrs(far_el2) & HPFAR_FIPA_PAGE_MASK);
        rc = cpu_vcpu_inst_abort(vcpu, regs, il, iss, fipa);
        break;
    case EC_TRAP_LWREL_DATA_ABORT:
        /* Stage2 data abort */
        fipa = (mrs(hpfar_el2) & HPFAR_FIPA_MASK) >> HPFAR_FIPA_SHIFT;
        fipa = fipa << HPFAR_FIPA_PAGE_SHIFT;
        fipa = fipa | (mrs(far_el2) & HPFAR_FIPA_PAGE_MASK);
        rc = cpu_vcpu_data_abort(vcpu, regs, il, iss, fipa);
        break;

Finally, the matching function in emulate_arm.c or cpu_vcpu_emulate.c is called to emulate the instruction.
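
The field extraction at the top of `do_sync` and the fault-address computation in the two abort cases can be reproduced in a few lines. This is a sketch, assuming the architectural layouts (ESR_ELx: EC in bits [31:26], IL in bit 25, ISS in bits [24:0]; HPFAR_EL2: FIPA in bits [39:4] holding IPA bits [47:12]). The constants are written out here for illustration and are not taken from Xvisor's headers.

```python
# Assumption: architectural field layouts, not Xvisor's macro definitions.
EC_NAMES = {
    0x01: "trapped WFI/WFE",
    0x03: "trapped MCR/MRC (CP15)",
    0x16: "HVC from AArch64",
    0x18: "trapped MSR/MRS/system instruction",
    0x20: "instruction abort from a lower EL",
    0x24: "data abort from a lower EL",
}


def decode_esr(esr):
    """Split a syndrome value into (exception class, IL bit, ISS)."""
    ec = (esr >> 26) & 0x3F
    il = (esr >> 25) & 0x1
    iss = esr & 0x1FFFFFF
    return ec, il, iss


def fault_ipa(hpfar, far):
    """Rebuild the faulting IPA: page number from HPFAR, offset from FAR."""
    page = (hpfar >> 4) & ((1 << 36) - 1)
    return (page << 12) | (far & 0xFFF)


ec, il, iss = decode_esr(0x92000045)
print(EC_NAMES.get(ec, "unknown"), il, hex(iss))
print(hex(fault_ipa(0x1C0900, 0xFFFF000012345030)))
```

HPFAR supplies only the page number of the faulting IPA, which is why the code above combines it with the low 12 bits of FAR.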

Memory virtualization
.....................

See: armv8 virtual-memory-system-architecture <http://wiki.csie.ncku.edu.tw/embedded/ARMv8#%E8%99%9B%E6%93%AC%E8%A8%98%E6%86%B6%E9%AB%94%E7%B3%BB%E7%B5%B1%E6%9E%B6%E6%A7%8B-virtual-memory-system-architecture>_

  • ARM adds an intermediate physical address (IPA) so that a guest OS cannot access physical addresses directly

  • Two-stage address translation: virtual address => intermediate physical address => physical address

    • Stage 1: virtual address (VA) => intermediate physical address (IPA)
      • Controlled by the guest OS, which believes the IPA is the PA
    • Stage 2: intermediate physical address (IPA) => physical address (PA)
      • Controlled by the hypervisor
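
The two stages can be modeled as two independent lookups. This is a toy sketch, assuming 4 KB pages and flat dictionaries in place of real multi-level tables; `TranslationFault`, `walk`, and `translate` are hypothetical names.

```python
PAGE_SHIFT = 12  # assumption: 4 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class TranslationFault(Exception):
    """Raised when a stage has no mapping for the address."""

    def __init__(self, stage, addr):
        super().__init__(f"stage {stage} fault at {addr:#x}")
        self.stage = stage
        self.addr = addr


def walk(table, addr, stage):
    """Look up the page of addr in table and keep the page offset."""
    page = addr >> PAGE_SHIFT
    if page not in table:
        raise TranslationFault(stage, addr)
    return (table[page] << PAGE_SHIFT) | (addr & PAGE_MASK)


def translate(stage1, stage2, va):
    ipa = walk(stage1, va, stage=1)    # owned by the guest OS
    return walk(stage2, ipa, stage=2)  # owned by the hypervisor


stage1 = {0x400: 0x80000}    # guest: VA page -> IPA page
stage2 = {0x80000: 0x8C000}  # hypervisor: IPA page -> PA page
print(hex(translate(stage1, stage2, 0x400ABC)))
```

A stage 2 fault is invisible to the guest's own tables. In the real architecture it is taken to EL2, which is how the hypervisor gets to emulate a device or allocate memory lazily.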

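Interrupt virtualization
........................

The GIC (Generic Interrupt Controller) is ARM's standard interrupt controller. It consists of an Interrupt Distributor and a CPU Interface for each processor.

  • Interrupt Distributor:
    • Routes each type of interrupt according to the state configured for that type
    • Must be configured at boot time

ARM adds a new hardware component, the Virtual CPU Interface, and delivers virtual interrupts through it. With it, the I/O device no longer has to be emulated for this path, and the guest OS can acknowledge and clear interrupts without trapping into the hypervisor. The hypervisor must still emulate a Virtual Interrupt Distributor, which it does by trapping guest accesses to the distributor.

  • Virtual Interrupt Distributor:
    • Splits interrupts into two classes, which are routed to the vector table of either the hypervisor or the guest OS
  • Virtual CPU interface:
    • Hardware that helps the hypervisor designer implement the Virtual Interrupt Distributor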

I/O device virtualization

  • ARM adds the Virtual Generic Interrupt Controller interface for delivering interrupts

virtio
......

Xvisor

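Boot flow

See AJ NOTE<https://arm4fun.hackpad.com/ARM64-Booting-in-Xvisor-EdnevstgC0t>_

  • Xvisor is built to run in virtual address space, but it has to start executing before the MMU is enabled
  • At assembly time, the image is added to .text with the .incbin directive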
.. image:: /embedded/xvisor_memory.png

  • cpu_entry.S<https://github.com/xvisor/xvisor/blob/master/arch/arm/cpu/arm64/cpu_entry.S>_

.. code-block:: c

_start_mmu_init:
    /* Setup SP as-per load address */
    ldr	x0, __hvc_stack_end
    mov	sp, x0
    sub	sp, sp, x6
    add	sp, sp, x4

    .........

    bl	_setup_initial_ttbl
  • mmu_lpae.c<https://github.com/xvisor/xvisor/blob/6807a137dbcbef5b182c78b95986079a610af81d/arch/arm/cpu/common/mmu_lpae.c>_

.. code-block:: c

u8 __attribute__ ((aligned(TTBL_TABLE_SIZE))) def_ttbl[TTBL_INITIAL_TABLE_SIZE] = { 0 };

Initializing the translation table:

  • mmu_lpae_entry_ttbl.c<https://github.com/xvisor/xvisor/blob/master/arch/arm/cpu/common/mmu_lpae_entry_ttbl.c>_

.. code-block:: c

void __attribute__ ((section(".entry")))
    _setup_initial_ttbl(virtual_addr_t load_start, virtual_addr_t load_end,
                virtual_addr_t exec_start, virtual_addr_t exec_end)
{
    ..........

    lpae_entry.ttbl_base = to_load_pa((virtual_addr_t)&def_ttbl);   /* def_ttbl is later loaded into ttbr0_el2 (Translation Table Base Register) */
    lpae_entry.next_ttbl = (u64 *)lpae_entry.ttbl_base;

    ..........

    /* Map physical = logical
     * Note: This mapping is using at boot time only
     */
    __setup_initial_ttbl(&lpae_entry, load_start, load_end, load_start,
            AINDEX_NORMAL_WB, TRUE);

    /* Map to logical addresses which are
     * covered by read-only linker sections
     * Note: This mapping is used at runtime
     */
    SETUP_RO_SECTION(lpae_entry, text);
    SETUP_RO_SECTION(lpae_entry, init);
    SETUP_RO_SECTION(lpae_entry, cpuinit);
    SETUP_RO_SECTION(lpae_entry, spinlock);
    SETUP_RO_SECTION(lpae_entry, rodata);

    /* Map rest of logical addresses which are
     * not covered by read-only linker sections
     * Note: This mapping is used at runtime
     */
    __setup_initial_ttbl(&lpae_entry, exec_start, exec_end, load_start,
                            AINDEX_NORMAL_WB, TRUE);
}

void __attribute__ ((section(".entry")))
    __setup_initial_ttbl(struct mmu_lpae_entry_ctrl *lpae_entry,
                virtual_addr_t map_start, virtual_addr_t map_end,
                virtual_addr_t pa_start, u32 aindex, bool writeable)
{
    ........
    u64 *ttbl;

    /* align start addresses */
    map_start &= TTBL_L3_MAP_MASK;  /* 0xFFFFFFFFFFFFF000ULL: the low 12 bits (page offset) map straight through to the output */
    pa_start &= TTBL_L3_MAP_MASK;

    page_addr = map_start;

    while (page_addr < map_end) {

            /* Setup level1 table */
            ttbl = (u64 *) lpae_entry->ttbl_base;
            index = (page_addr & TTBL_L1_INDEX_MASK) >> TTBL_L1_INDEX_SHIFT;
            if (ttbl[index] & TTBL_VALID_MASK) {
                /* Find level2 table */
                ttbl =
                (u64 *) (unsigned long)(ttbl[index] &
                                    TTBL_OUTADDR_MASK);
            } else {
                /* Allocate new level2 table */
                if (lpae_entry->ttbl_count == TTBL_INITIAL_TABLE_COUNT) {
                    while (1) ;	/* No initial table available */
                }
                for (i = 0; i < TTBL_TABLE_ENTCNT; i++) {
                    lpae_entry->next_ttbl[i] = 0x0ULL;
                }
                lpae_entry->ttbl_tree[lpae_entry->ttbl_count] =
                    ((virtual_addr_t) ttbl -
                    lpae_entry->ttbl_base) >> TTBL_TABLE_SIZE_SHIFT;
                lpae_entry->ttbl_count++;
                ttbl[index] |=
                    (((virtual_addr_t) lpae_entry->next_ttbl) &
                    TTBL_OUTADDR_MASK);
                ttbl[index] |= (TTBL_TABLE_MASK | TTBL_VALID_MASK);
                ttbl = lpae_entry->next_ttbl;
                lpae_entry->next_ttbl += TTBL_TABLE_ENTCNT;
            }

            /* Setup level2 table */
            index = (page_addr & TTBL_L2_INDEX_MASK) >> TTBL_L2_INDEX_SHIFT;
            if (ttbl[index] & TTBL_VALID_MASK) {
                /* Find level3 table */
                ttbl =
                    (u64 *) (unsigned long)(ttbl[index] &
                        TTBL_OUTADDR_MASK);
            } else {
                /* Allocate new level3 table */
                ......
            }

            /* Setup level3 table */
            index = (page_addr & TTBL_L3_INDEX_MASK) >> TTBL_L3_INDEX_SHIFT;
            if (!(ttbl[index] & TTBL_VALID_MASK)) {
                /* Update level3 table */
                .......
            }

            /* Point to next page */
            page_addr += TTBL_L3_BLOCK_SIZE;
    }
}
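
The three index computations in the loop above can be tried out on concrete addresses. This is a sketch that assumes a 4 KB granule with 9-bit indices per level (level 1 = bits [38:30], level 2 = bits [29:21], level 3 = bits [20:12]), which is consistent with the 12-bit page mask in the listing; the actual `TTBL_L*_INDEX_MASK` and `TTBL_L*_INDEX_SHIFT` values should be confirmed in the Xvisor headers.

```python
PAGE_SIZE = 0x1000  # assumption: 4 KB granule


def split_va(va):
    """Return (level 1 index, level 2 index, level 3 index, page offset)."""
    l1 = (va >> 30) & 0x1FF
    l2 = (va >> 21) & 0x1FF
    l3 = (va >> 12) & 0x1FF
    return l1, l2, l3, va & (PAGE_SIZE - 1)


def pages_mapped(map_start, map_end):
    """Count loop iterations: one level 3 entry per 4 KB page in the range."""
    page_addr = map_start & ~(PAGE_SIZE - 1)  # same alignment step as the C code
    count = 0
    while page_addr < map_end:
        count += 1
        page_addr += PAGE_SIZE
    return count


print(split_va(0x80010000))
print(pages_mapped(0x80010123, 0x80012000))
```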

TCR_EL2, Translation Control Register (EL2)

.. image:: /embedded/vtcr.png

.. code-block:: c

/* Setup Hypervisor Translation Control Register */
ldr	x0, __htcr_set  
msr     tcr_el2, x0

  • __htcr_set: (TCR_T0SZ_VAL(39) | TCR_PS_40BITS | (0x3 << TCR_SH0_SHIFT) | (0x1 << TCR_ORGN0_SHIFT) | (0x1 << TCR_IRGN0_SHIFT))
    • TCR_T0SZ_VAL(39) = (64-39) & 0x3f: the size offset of the memory region addressed by TTBR0_EL2. The region size is 2^(64-T0SZ) bytes, so this sets the region size to 2^39 bytes
      • The initial lookup level is level 1
    • TCR_PS_40BITS = 2 << 16: sets the physical address size to 40 bits (1 TB)
    • TCR_SH0_SHIFT = 12; 0x3: shareability attribute for the associated memory: Inner Shareable
    • TCR_ORGN0_SHIFT = 10; 0x1: outer cacheability attribute for the associated memory: Write-Back Write-Allocate Cacheable
    • TCR_IRGN0_SHIFT = 8; 0x1: inner cacheability attribute for the associated memory: Write-Back Write-Allocate Cacheable
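
The arithmetic behind these fields can be checked directly. The sketch below combines only the five fields listed above, using the shifts and values given in the list; any additional bits that the real `__htcr_set` constant may include (reserved bits, for example) are outside what this page shows and are not modeled.

```python
TCR_PS_40BITS = 2 << 16
TCR_SH0_SHIFT = 12
TCR_ORGN0_SHIFT = 10
TCR_IRGN0_SHIFT = 8


def tcr_t0sz_val(va_bits):
    """T0SZ encodes the region size as 64 minus the number of address bits."""
    return (64 - va_bits) & 0x3F


def region_size(t0sz):
    """Size in bytes of the region translated through TTBR0."""
    return 1 << (64 - t0sz)


htcr = (tcr_t0sz_val(39)
        | TCR_PS_40BITS
        | (0x3 << TCR_SH0_SHIFT)
        | (0x1 << TCR_ORGN0_SHIFT)
        | (0x1 << TCR_IRGN0_SHIFT))

print(hex(htcr))
print(region_size(tcr_t0sz_val(39)) // 2**30, "GiB")
```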

.. code-block:: c

/* Setup Hypervisor Translation Base Register */
ldr	x0, __httbr_set  /* address of def_ttbl */
msr	ttbr0_el2, x0

GIC configuration

Start from the Foundation model manual:

  • DGIC_DIST_BASE=0x2c001000
  • DGIC_CPU_BASE=0x2c002000

.. image:: /embedded/gic_dist_map.png

(from GICv2 Architecture Specification, p. 74)

.. code-block:: c

/* GIC Distributor Interface Init */
    mrs	x4, mpidr_el1
    ldr	x5, __mpidr_mask
    and	x4, x4, x5			/* CPU affinity */
__gic_dist_init:
    ldr	x0, __gic_dist_base		/* Dist GIC base */
    mov	x1, #0				/* non-0 cpus should at least */
    cmp	x4, xzr				/* program IGROUP0 */
    bne	1f
    mov	x1, #3				/* Enable group0 & group1 */
    str	w1, [x0, #0x00]			/* Ctrl Register */

.. image:: /embedded/gic1.png

(from GICv2 Architecture Specification)

.. code-block:: c

ldr	w1, [x0, #0x04]			/* Type Register */
1:  and	x1, x1, #0x1f			/* No. of IGROUPn registers */
    add	x2, x0, #0x080			/* IGROUP0 Register */
    movn	x3, #0				/* All interrupts to group-1 */
2:  str	w3, [x2], #4
    subs	x1, x1, #1
    bge	2b
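
The loop at label `2:` stores once before it decrements and tests, so it writes N+1 group registers for a count of N. A small Python model of the control flow makes this visible. `igroup_writes` is a hypothetical helper and the GICD_TYPER value is passed in, not read from hardware.

```python
IGROUP0_OFFSET = 0x080


def igroup_writes(typer, is_cpu0):
    """Return the distributor offsets written by the IGROUP loop.

    CPU 0 takes the count from GICD_TYPER[4:0] (ITLinesNumber).
    Other CPUs enter at label 1 with x1 = 0, so they program IGROUP0 only.
    """
    count = (typer & 0x1F) if is_cpu0 else 0
    offsets = []
    offset = IGROUP0_OFFSET
    while True:
        offsets.append(offset)  # str w3, [x2], #4
        offset += 4
        count -= 1              # subs x1, x1, #1
        if count < 0:           # bge 2b falls through once the result is negative
            break
    return offsets


def max_interrupts(typer):
    """The GICv2 specification gives the supported lines as 32 * (N + 1)."""
    return 32 * ((typer & 0x1F) + 1)


print([hex(o) for o in igroup_writes(0x4, is_cpu0=True)])
print(max_interrupts(0x4))
```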

.. image:: /embedded/gic_cpu_interface.png

(from GICv2 Architecture Specification P.76)

  • GICC_CTLR

.. image:: /embedded/gic_cpu_reg.png

  • GICC_PMR

.. image:: /embedded/gicc_pmr.png

.. code-block:: c

__gic_cpu_init:
    /* GIC Secured CPU Interface Init */
    ldr	x0, __gic_cpu_base		/* GIC CPU base */
    mov	x1, #0x80
    str	w1, [x0, #0x4]			/* GIC CPU Priority Mask */
    mov	x1, #0x3			/* Enable group0 & group1 */
    str	w1, [x0]			/* GIC CPU Control */

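Virtualization mechanisms on different ARM architectures

Without virtualization extensions

ARM32 instruction virtualization
................................

  • elf2cpatch.py<https://github.com/xvisor/xvisor/blob/6807a137dbcbef5b182c78b95986079a610af81d/arch/arm/cpu/arm32/elf2cpatch.py>_
    • Re-encodes sensitive non-privileged instructions as SVC instructions and gives each encoded instruction a unique inst_id
    • Taking msr (immediate) as an example: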

.. code-block:: python

# MSR (immediate)
#	Syntax:
# 		msr<c> <spec_reg>, #<const>
#	Fields:
#		cond = bits[31:28]
#		R = bits[22:22]
#		mask = bits[19:16]
#		imm12 = bits[11:0]
#	Hypercall Fields:
#		inst_cond[31:28] = cond
#		inst_op[27:24] = 0xf
#		inst_id[23:20] = 0
#		inst_subid[19:17] = 2
#		inst_fields[16:13] = mask
#		inst_fields[12:1] = imm12
#		inst_fields[0:0] = R


def convert_msr_i_inst(hxstr):
	hx = int(hxstr, 16)
	inst_id = 0
	inst_subid = 2
	cond = (hx >> 28) & 0xF
	R = (hx >> 22) & 0x1
	mask = (hx >> 16) & 0xF
	imm12 = (hx >> 0) & 0xFFF
	rethx = 0x0F000000
	rethx = rethx | (cond << 28)
	rethx = rethx | (inst_id << 20)
	rethx = rethx | (inst_subid << 17)
	rethx = rethx | (mask << 13)
	rethx = rethx | (imm12 << 1)
	rethx = rethx | (R << 0)
	return rethx
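
The field layout in the comment above can be verified with a round trip. The sketch below repeats the encoding (taking an integer in place of a hex string) and adds a decoder that is not part of elf2cpatch.py; `encode_msr_i` and `decode_msr_i_hypercall` are hypothetical names. The sample word `0xE321F0D3` is the ARM encoding of `msr cpsr_c, #0xd3`.

```python
def encode_msr_i(hx):
    """Same bit manipulation as convert_msr_i_inst, on an integer."""
    cond = (hx >> 28) & 0xF
    r_bit = (hx >> 22) & 0x1
    mask = (hx >> 16) & 0xF
    imm12 = hx & 0xFFF
    inst_id, inst_subid = 0, 2
    return (0x0F000000 | (cond << 28) | (inst_id << 20) | (inst_subid << 17)
            | (mask << 13) | (imm12 << 1) | r_bit)


def decode_msr_i_hypercall(hcall):
    """Recover the fields a hypervisor would need to emulate the MSR."""
    return {
        "cond": (hcall >> 28) & 0xF,
        "op": (hcall >> 24) & 0xF,          # 0xF marks an SVC encoding
        "inst_id": (hcall >> 20) & 0xF,
        "inst_subid": (hcall >> 17) & 0x7,
        "mask": (hcall >> 13) & 0xF,
        "imm12": (hcall >> 1) & 0xFFF,
        "R": hcall & 0x1,
    }


hypercall = encode_msr_i(0xE321F0D3)
print(hex(hypercall))
print(decode_msr_i_hypercall(hypercall))
```

Every field of the original instruction survives the rewrite, which is what allows the hypervisor to emulate it after the SVC traps.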

  • Instruction emulation

cpu_entry.S<https://github.com/xvisor/xvisor/blob/6807a137dbcbef5b182c78b95986079a610af81d/arch/arm/cpu/arm32/cpu_entry.S>_ registers the exception vectors:

.. code-block:: c

_start_vect:
	ldr	pc, __reset
	ldr	pc, __undefined_instruction
	ldr	pc, __software_interrupt  /* __software_interrupt:	.word _soft_irq  */
	......

cpu_interrupts.c<https://github.com/xvisor/xvisor/blob/6807a137dbcbef5b182c78b95986079a610af81d/arch/arm/cpu/arm32/cpu_interrupts.c>_

.. code-block:: c

void do_soft_irq(arch_regs_t *regs)
{
    ........
    /* If vcpu priviledge is user then generate exception
     * and return without emulating instruction
     */
    if ((arm_priv(vcpu)->cpsr & CPSR_MODE_MASK) == CPSR_MODE_USER) {
        vmm_vcpu_irq_assert(vcpu, CPU_SOFT_IRQ, 0x0);
    } else {
        if (regs->cpsr & CPSR_THUMB_ENABLED) {
            rc = cpu_vcpu_hypercall_thumb(vcpu, regs,
                    *((u32 *)regs->pc));
        } else {
            rc = cpu_vcpu_hypercall_arm(vcpu, regs,
                    *((u32 *)regs->pc));
            }
        }
    ........
}

Taking ARM-mode emulation as an example, in cpu_vcpu_hypercall_arm.c<https://github.com/xvisor/xvisor/blob/6807a137dbcbef5b182c78b95986079a610af81d/arch/arm/cpu/arm32/cpu_vcpu_hypercall_arm.c>_:

.. code-block:: c

int cpu_vcpu_hypercall_arm(struct vmm_vcpu *vcpu,
                           arch_regs_t *regs, u32 inst)
{
    u32 id = ARM_INST_DECODE(inst,
                             ARM_INST_HYPERCALL_ID_MASK,
                             ARM_INST_HYPERCALL_ID_SHIFT);

    return hcall_funcs[id](id, inst, regs, vcpu);
}

This decodes the previously encoded instruction and dispatches on its ID through the following table:

.. code-block:: c

static int (* const hcall_funcs[])(u32 id, u32 inst,
                                   arch_regs_t *regs, struct vmm_vcpu *vcpu) = {
    arm_hypercall_cps_and_co,	/* ARM_HYPERCALL_CPS_ID */
    arm_hypercall_ldm_ue,		/* ARM_HYPERCALL_LDM_UE_ID0 */
    arm_hypercall_ldm_ue,		/* ARM_HYPERCALL_LDM_UE_ID1 */
    arm_hypercall_ldm_ue,		/* ARM_HYPERCALL_LDM_UE_ID2 */
    .......
};

Taking id = 0 as an example (CPS, MRS, MSR, RFE, SRS, WFI, WFE, SEV, SMC, etc.) (Q: how are the instructions grouped?)

.. code-block:: c

static int arm_hypercall_cps_and_co(u32 id, u32 inst,
                                    arch_regs_t *regs, struct vmm_vcpu *vcpu)
{
    u32 subid = ARM_INST_DECODE(inst,
                                ARM_INST_HYPERCALL_SUBID_MASK,
                                ARM_INST_HYPERCALL_SUBID_SHIFT);

    return cps_and_co_funcs[subid](inst, regs, vcpu);
}

A second decode of the sub-ID identifies the exact instruction, and the function that emulates it is called:

.. code-block:: c

static int (* const cps_and_co_funcs[])(u32 inst,
                                        arch_regs_t *regs, struct vmm_vcpu *vcpu) = {
    arm_hypercall_cps,	/* ARM_HYPERCALL_CPS_SUBID */
    arm_hypercall_mrs,	/* ARM_HYPERCALL_MRS_SUBID */
    arm_hypercall_msr_i,	/* ARM_HYPERCALL_MSR_I_SUBID */
    arm_hypercall_msr_r,	/* ARM_HYPERCALL_MSR_R_SUBID */
    arm_hypercall_rfe,	/* ARM_HYPERCALL_RFE_SUBID */
    arm_hypercall_srs,	/* ARM_HYPERCALL_SRS_SUBID */
    arm_hypercall_wfx,	/* ARM_HYPERCALL_WFI_SUBID */
    arm_hypercall_smc	/* ARM_HYPERCALL_SMC_SUBID */
};

Taking mrs as an example:

.. code-block:: c

/* Emulate 'mrs' hypercall */
static int arm_hypercall_mrs(u32 inst,
                             arch_regs_t *regs, struct vmm_vcpu *vcpu)
{
    register u32 Rd;
    Rd = ARM_INST_BITS(inst,
                       ARM_HYPERCALL_MRS_RD_END,
                       ARM_HYPERCALL_MRS_RD_START);
    if (Rd == 15) {
        arm_unpredictable(regs, vcpu, inst, __func__);
        return VMM_EFAIL;
    }
    if (ARM_INST_BIT(inst, ARM_HYPERCALL_MRS_R_START)) {
        cpu_vcpu_reg_write(vcpu, regs, Rd,
                           cpu_vcpu_spsr_retrieve(vcpu));
    } else {
        cpu_vcpu_reg_write(vcpu, regs, Rd,
                           cpu_vcpu_cpsr_retrieve(vcpu, regs));
    }
    regs->pc += 4;
    return VMM_OK;
}
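
The two-level table dispatch shown in the C listings above can be sketched end to end. Assumptions: the ID and sub-ID sit at bits [23:20] and [19:17], as in the elf2cpatch.py field comment; the handlers are stand-ins that return a name where the real ones emulate the instruction; only the table entries visible in the listings are filled in.

```python
ID_SHIFT, ID_MASK = 20, 0xF << 20
SUBID_SHIFT, SUBID_MASK = 17, 0x7 << 17


def inst_decode(inst, mask, shift):
    """Counterpart of ARM_INST_DECODE: isolate a field and shift it down."""
    return (inst & mask) >> shift


# Stand-in handlers, in the same order as cps_and_co_funcs[].
CPS_AND_CO = ["cps", "mrs", "msr_i", "msr_r", "rfe", "srs", "wfx", "smc"]


def hypercall_cps_and_co(inst):
    return CPS_AND_CO[inst_decode(inst, SUBID_MASK, SUBID_SHIFT)]


def hypercall_ldm_ue(inst):
    return "ldm_ue"


# Only the entries shown above; the remaining IDs are elided there too.
HCALL_FUNCS = [hypercall_cps_and_co, hypercall_ldm_ue,
               hypercall_ldm_ue, hypercall_ldm_ue]


def dispatch(inst):
    hcall_id = inst_decode(inst, ID_MASK, ID_SHIFT)
    if hcall_id >= len(HCALL_FUNCS):
        raise ValueError(f"hypercall id {hcall_id} not modeled in this sketch")
    return HCALL_FUNCS[hcall_id](inst)


print(dispatch(0xEF0421A6))  # the encoded msr (immediate) hypercall
```

Indexing a table by ID keeps the dispatch cost constant regardless of how many instructions are patched.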

ARMv5
.....

  • ARM9: with MMU

ARMv7
.....

  • Has the Security Extensions but not the Virtualization Extensions

With virtualization extensions

ARMv7VE
.......

  • Has the Security Extensions, LPAE, and the Virtualization Extensions
  • Cortex-A15 / Cortex-A7 (with big.LITTLE support)

ARMv8
.....

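Design document notes

Original: https://github.com/xvisor/xvisor/blob/master/docs/DesignDoc

Chapter 1: Modeling Virtual Machine

  • Virtual machines (VMs) generally fall into two kinds:
    • system virtual machine: supports the execution of a complete OS
    • process virtual machine: supports a single process
  • Xvisor is virtualization software for hardware systems that runs directly on the host machine, which makes it a native (Type-1) hypervisor/VMM. A system virtual machine is usually called a guest, and a CPU inside a guest is called a VCPU (virtual CPU). There are two kinds of VCPU:
    • A VCPU that belongs to a guest is a Normal VCPU
    • A VCPU that belongs to no guest is an Orphan VCPU (Orphan VCPUs are created for background processes and running management daemons)
  • Today's CPUs have at least two privilege modes:
    • User mode has the lowest privilege and runs Normal VCPUs
    • Supervisor mode has the highest privilege and runs Orphan VCPUs
  • Xvisor uses Orphan VCPUs to run its various background processes and management daemons

The figure below shows Xvisor's system virtual machine model: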

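.. image:: /embedded/xvisor_model.png

Chapter 2: Hypervisor Configuration
-----------------------------------

In the early days a single CPU ran the OS, so configuring an OS for a system was fairly simple. Today there are powerful single-core and multi-core processors that can be configured for many different applications. A multi-core processor can be managed by a single symmetric multiprocessing (SMP) operating system, in which processes can move between processors to balance the load and improve system efficiency. Under asymmetric multiprocessing (AMP), each core can be assigned to a different OS and is treated as an independent processor.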

The figure below compares how a traditional single core, AMP, and SMP operate:

.. image:: /embedded/hackpad_smp.png

(image source: http://www.rtcmagazine.com/articles/view/101663)

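  • Challenges for SMP and AMP:
    • SMP often does not scale well, depending on the workload, because it relies on multiple buses or crossbar switches, which drives power consumption too high
    • With AMP it is hard to configure which OS may access which device. Operating systems assume they have full control over the hardware they detect, which often causes conflicts in AMP configurations

In addition, vSMP (virtual symmetric multiprocessing): the processor has one extra "companion core" (co-processor), usually built on a low-power process, so it mostly handles low-frequency work. This core is invisible to the OS: neither the OS nor applications know it exists, yet they use it automatically, so no new code has to be written to control it from software. vSMP also refers to mapping two or more virtual processors to a single virtual machine, which allows multiple virtual processors to be assigned to a virtual machine on a host with at least two logical processors. Its advantages are cache coherency, higher operating system efficiency, and optimized power consumption.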

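  • Xvisor provides techniques to partition or virtualize processor cores, memory, and devices among multiple OSes

    • It uses a tree structure called the device tree to simplify configuration on single-core and multi-core systems, so a system designer can easily mix AMP, SMP, and core-virtualization configurations when building a system
    • In Linux, on an of_platform architecture (e.g. PowerPC), the kernel at boot expects a DTB file (Device Tree Blob / flattened device tree) handed over by the boot loader. The DTB is the binary produced by compiling a DTS with DTC (the Device Tree Compiler). of_platform probes only the drivers that are compatible with devices described in the DTB.
    • Unlike Linux of_platform, Xvisor does not have to populate its device tree from a DTB. The Xvisor device tree can come from several sources: a DTB, or an ACPI (Advanced Configuration and Power Interface) table, for example. In short, the Xvisor device tree is just a data structure for managing hypervisor configuration. By default Xvisor supports populating the device tree from a DTS, and it includes the DTC from the Linux kernel source code and a lightweight DTB parsing library (libfdt) for that purpose
  • The following rules constrain the Xvisor device tree so that it can be updated or populated: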

    • Node Name: It must have characters from one of the following only,
      • digit: [0-9]
      • lowercase letter: [a-z]
      • uppercase letter: [A-Z]
      • underscore: _
      • dash: -
    • Attribute Name: It must have characters from one of the following only,
      • digit: [0-9]
      • lowercase letter: [a-z]
      • uppercase letter: [A-Z]
      • underscore: _
      • dash: -
      • hash: #
  • Attribute String Value: A string attribute value must end with NULL character (i.e. ‘\0’ or character value 0). For a string list, each string must be separated by exactly one NULL character.

  • Attribute 32-bit unsigned Value: A 32-bit integer value must be represented in big-endian format or little-endian format based on the endianness of host CPU architecture.

  • Attribute 64-bit unsigned Value: A 64-bit integer value must be represented in big-endian format or little-endian format based on the endianness of host CPU architecture.

    • Note: Architecture-specific code must ensure that the above constraints are satisfied while populating the device tree.
    • Note: For standard attributes used by Xvisor, refer to the source code.

Chapter 3: Hypervisor Timer
---------------------------
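  • Like many operating systems, a hypervisor needs a timekeeping subsystem to track elapsed time. Xvisor's timekeeping subsystem is called the hypervisor timer.

    • An OS timekeeping subsystem does two important things:
      • 1. Track elapsed time: the simplest way is to count periodic interrupts, but that is very inaccurate. A more precise way uses a clocksource device (i.e. a free-running, cycle-accurate hardware counter) as the time reference.
      • 2. Schedule upcoming events: the OS keeps a per-CPU list of events ordered by expiry time, and the event that expires earliest is handled first.
  • Timekeeping is hard for a hypervisor, mainly because time is shared between the host and many guests. Guest interrupt delivery and the related clock sources are not perfectly synchronized with real time, which causes problems for anything that depends on real time.

  • Timekeeping routines that track time by counting periodic interrupts are a serious problem for legacy guest OSes. These interrupts may come from the PIT or the RTC, but the host virtualization engine may be unable to deliver them at the proper rate, so guest time can fall behind. The problem is more pronounced at high interrupt rates (e.g. 1000 Hz).

    • Three solutions have been proposed:
      • If the guest has its own time source, so that 'wall clock' or 'real time' needs no adjustment, the problem can be ignored
      • If that is not enough, extra interrupts can be injected into the guest to raise the interrupt rate, but this is used only when host load or guest lag is too large to compensate for otherwise
      • The guest must actively detect lost ticks and compensate for them internally, but this usually works only in theory; in practice many problems have to be overcome
  • Conclusion: in a hypervisor, timekeeping must be tickless and high resolution, and PIT emulators must keep backlogs of pending periodic interrupts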

  • The Xvisor timer is derived from the Linux hrtimer subsystem, so it has the following features:

    • 64-bit timestamp: the timestamp is the time since Xvisor booted, in nanoseconds
    • Timer events: a timer event is created or destroyed using an expiry callback handler and an associated expiry time
    • A timer event stops automatically when it expires
    • To get a periodic timer event, it must be restarted manually from its expiry callback handler
  • To provide these features, the hypervisor needs the following:

    • one global clocksource device for each host CPU
    • one clockevent device for each host CPU

The figure below illustrates tick distortion:

.. image:: /embedded/tickless.png

(image source: https://arm4fun.hackpad.com/Xvisor-ARMv8-Timer-BVc6RIuDGcD)
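
The one-shot timer events described above, including the manual restart needed for a periodic event, can be simulated in a few lines. This is a sketch with hypothetical names (`HypervisorTimerSketch`, `make_periodic`), not Xvisor's timer API. Simulated time advances only when `advance_to` is called, standing in for the clockevent device firing at the earliest expiry.

```python
import heapq


class HypervisorTimerSketch:
    """Tickless timer queue: nothing happens between programmed expiries."""

    def __init__(self):
        self.now_ns = 0   # 64-bit nanosecond timestamp since "boot"
        self._queue = []  # heap of (expiry_ns, sequence, callback)
        self._seq = 0

    def start(self, delay_ns, callback):
        """Arm a one-shot event; it is dropped automatically once it fires."""
        heapq.heappush(self._queue, (self.now_ns + delay_ns, self._seq, callback))
        self._seq += 1

    def next_expiry(self):
        """The value a clockevent device would be programmed with."""
        return self._queue[0][0] if self._queue else None

    def advance_to(self, t_ns):
        """Fire every event whose expiry is at or before t_ns, earliest first."""
        while self._queue and self._queue[0][0] <= t_ns:
            expiry, _, callback = heapq.heappop(self._queue)
            self.now_ns = expiry
            callback(self)
        self.now_ns = t_ns


def make_periodic(period_ns, log):
    """Periodic behavior: the expiry handler re-arms its own event."""
    def on_expiry(timer):
        log.append(timer.now_ns)
        timer.start(period_ns, on_expiry)
    return on_expiry


timer = HypervisorTimerSketch()
ticks = []
timer.start(1_000_000, make_periodic(1_000_000, ticks))
timer.advance_to(3_500_000)
print(ticks, timer.next_expiry())
```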

Testing

  1. Register an ARM account
  2. Download the ARM Foundation v8 model<http://www.arm.com/zh/products/tools/models/fast-models/foundation-model.php>_
  3. Extract it into workspace
  4. Install the toolchain

.. code-block:: prettyprint

sudo apt-get install gcc-aarch64-linux-gnu
sudo apt-get install genext2fs

  5. Download and build Xvisor

.. code-block:: prettyprint

cd workspace
git clone https://github.com/avpatel/xvisor-next.git && cd xvisor-next
export CROSS_COMPILE=aarch64-linux-gnu-
make ARCH=arm generic-v8-defconfig
make
make dtbs

  6. Prepare the guest OS

.. code-block:: prettyprint

make -C tests/arm64/virt-v8/basic
mkdir -p ./build/disk/images/arm64/virt-v8
./build/tools/dtc/dtc -I dts -O dtb -o ./build/disk/images/arm64/virt-v8x2.dtb ./tests/arm64/virt-v8/virt-v8x2.dts
cp -f ./build/tests/arm64/virt-v8/basic/firmware.bin ./build/disk/images/arm64/virt-v8/firmware.bin
cp -f ./tests/arm64/virt-v8/basic/nor_flash.list ./build/disk/images/arm64/virt-v8/nor_flash.list
genext2fs -B 1024 -b 16384 -d ./build/disk ./build/disk.img
./tools/scripts/memimg.py -a 0x80010000 -o ./build/foundation_v8_boot.img ./build/vmm.bin@0x80010000 ./build/disk.img@0x81000000
aarch64-linux-gnu-gcc -nostdlib -nostdinc -e _start -Wl,-Ttext=0x80000000 -DGENTIMER_FREQ=100000000 -DUART_BASE=0x1c090000 -DGIC_DIST_BASE=0x2c001000 -DGIC_CPU_BASE=0x2c002000 -DSPIN_LOOP_ADDR=0x84000000 -DIMAGE=./build/foundation_v8_boot.img -DDTB=./build/arch/arm/board/generic/dts/foundation-v8/one_guest_virt-v8.dtb ./docs/arm/foundation_v8_boot.S -o ./build/foundation_v8_boot.axf

  7. Run the test

.. code-block:: prettyprint

cd tests/arm64/virt-v8/basic
tclsh armv8_test.tcl ~/workspace/Foundation_Platformpkg/models/Linux64_GCC-4.1/Foundation_Platform

(tclsh must be installed first; the path above assumes the Foundation v8 model was extracted into workspace)

reference

  • http://www.slideshare.net/jserv/xvisor
  • https://github.com/xvisor/xvisor/blob/master/docs/DesignDoc
  • https://samlin.hackpad.com/Xvisor–chtGSqPYWG8
  • http://www.slideshare.net/badaindonesia/linux-on-arm-64bit-architecture?related=1
  • ARMv8-A_Architecture_Reference_Manual_(Issue_A.a) (login required)<http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0487a.e/index.html>_
  • A Virtualization Infrastructure that Supports Pervasive Computing.<http://165.193.233.120/files/pdf/partners/academic/vmware-pervasive-computing_LRudolph-en.pdf>_
  • A Choices Hypervisor on the ARM architecture<http://www.russellgreenspan.com/software/A%20Choices%20Hypervisor%20on%20the%20ARM%20Architecture.pdf>_
  • Extensions to the ARMv7-A architecture<http://www.hotchips.org/wp-content/uploads/hc_archives/hc22/HC22.23.220-1-Brash-ARMv7A.pdf>_
  • Advanced Information Technology - Virtualization (1) - Virtualization (V12N). Prof. 薛智文<http://www.slideserve.com/quynn-dickson/cwhsueh-csie-ntu-tw-csie-ntu-tw-cwhsueh>_
  • Advanced Information Technology - Virtualization (2) - Virtualization (V12N). Prof. 薛智文<http://www.slideserve.com/candid/cwhsueh-csie-ntu-edu-tw-http-www-csie-ntu-edu-tw-cwhsueh-100-fall-nov-4-fri-678-dth-104>_
  • An Overview of Microkernel, Hypervisor and Microvisor Virtualization Approaches for Embedded Systems, Asif Iqbal, Nayeema Sadeque and Rafika Ida Mutia, Lund University, Sweden<http://www.eit.lth.se/fileadmin/eit/project/142/virtApproaches.pdf>_
  • Hardware accelerated Virtualization in the ARM Cortex™ Processors<http://www-archive.xenproject.org/files/xensummit_seoul11/nov2/2_XSAsia11_JGoodacre_HW_accelerated_virtualization_in_the_ARM_Cortex_processors.pdf>_
  • Popek and Goldberg virtualization requirements<http://en.wikipedia.org/wiki/Popek_and_Goldberg_virtualization_requirements>_
  • Syllabus for CS5410 Virtualization Techniques (National Tsing Hua University)<http://www.cs.nthu.edu.tw/~ychung/syllabus/Virtualization.htm>_ (includes the course slides)
  • GICv2 Architecture Specification<http://www.cl.cam.ac.uk/research/srg/han/ACS-P35/zynq/arm_gic_architecture_specification.pdf>_