This tutorial will teach you how to build and test a fully working RISC-V Single cycle core at the RTL Level using System Verilog HDL.
We'll use a set of open-source tools to allow everyone to complete this project at home using little to no specific resources.
Before starting, please check out the setup manual (at the root of the tutorial's repo). It also contains important information on the prerequisites for your system.
This tutorial heavily based on DDCA lectures, chapter 7 and on the Digital Design and Computer Architecture, RISC-V Edition Book from Sarah & David Harris. (I'll let you do your own research to get your hands on the PDF).
Why Holy core ? Because I like this name and wanted to a little credit to god. It is also a reference to Terry A. Davis (RIP). But the naming does not really impact this project, at all. My goal with this project was to learn, and make you learn along the way. I plan on improving the design later on. Contributions are welcome
In this tutorial, we will build the following core :
Which aims at implementing all of the RV32I base instruction set. You can find a table here describing each instruction we'll implement.
We will implement these using SystemVerilog at the Register-Transfer Level (RTL), meaning we will focus on the underlying logic of the CPU rather than low-level basics (e.g. full adders, shifts, ... and gate-level design in general).
You can also find some tables for instructions here.
Definitive source code may differ a bit as I changed simulator along the way and also fixed a few minor bugs. This should not stop you from completing it on your own ! Hope you enjoy !
In order to build this core, we'll start by following the DDCA lectures, chapter 7 lectures (available for free on youtube). You can also use their book cited in the intro. This will allow you to have another reference to start this project, which can be extremely useful if there is something you struggle to understand.
The plan now consists in thinking about each type of instruction we can encounter and create the logic blocks necessary to make them happen. As stated in the setup manual, we'll test these blocks individually to make sure each feature works, and we'll them put them together to form our core.
Note that the testbenches and verification techniques can be improved but once again, this learning material for a non-critical usage.
Of course, the first couple of instructions we'll implement will require the most amount of work as we'll start from 0. But once we implement a few, the others will be way easier to implement ! So keep in mind that the first few instructions are the hardest, and once this is done, it only gets more practical.
Let's get started !
Below is a I-type instruction example (I standing for "Immediate") that loads data into reg x6, from the pointer in x9 with an offset of -4 on the address :
lw x6, -4(x9)
We would translate it like this in binary and in hex, as an I-type instruction :
111111111100 01001 010 00110 0000011
0xFFC4A303
here is a quick breakdown :
IMM [11:0] | rs1 | f3 | rd | op | |
---|---|---|---|---|---|
binary | 111111111100 | 01001 | 010 | 00110 | 0000011 |
Value | -4 | 9 (as in x9) | 2 (lw) | 6 (as in x6) | I-type |
Here is an overview of what we'll try to do :
As you can see, before doing any actual hardware digital interpretation of this instruction, we first need to build some basic logic blocks :
- A register file
- An instruction memory
- Some data memory too
- A sign extender
- A basic ALU
- And a decoder/control unit we will improve as time goes on
We'll start by ceazting basic versions of the different building blocks and test their behavior separately.
We'll then assemble them to form our first version of our datapath.
We'll implement some basic piece of memory that can store X amount of words and it will respond in 1 clock cycle (which is way too good to be true, but memory is a pain so we'll conveniently ignore that for now...).
So, let's get to work shall we ? We create a memory.sv file in which we'll write some RTL logic :
// memory.sv
module memory #(
parameter WORDS = 64
) (
input logic clk,
input logic [31:0] address,
input logic [31:0] write_data,
input logic write_enable,
input logic rst_n,
output logic [31:0] read_data
);
/*
* This memory is byte addressed
* But have no support for mis-aligned write nor reads.
*/
reg [31:0] mem [0:WORDS-1]; // Memory array of words (32-bits)
always @(posedge clk) begin
// reset logic
if (rst_n == 1'b0) begin
for (int i = 0; i < WORDS; i++) begin
mem[i] <= 32'b0;
end
end
else if (write_enable) begin
// Ensure the address is aligned to a word boundary
// If not, we ignore the write
if (address[1:0] == 2'b00) begin
//here, address[31:2] is the word index
mem[address[31:2]] <= write_data;
end
end
end
// Read logic
always_comb begin
//here, address[31:2] is the word index
read_data = mem[address[31:2]];
end
endmodule
Note the trick here, we use a byte adressed momory (watch the video if you don't know the difference with word addressed memory). However, the memory stays fairly simple as we do not add support for non-aligned read and writes (we'll add byte and halfword R/W towards then end of the tutorial).
If you are not fully used to verilog/systemVerilog syntax yet, take your time to get your head around it but you can learn along the way. I will also leave "blanks" in this tutorial sometimes, where you will have some room for creativity which should allow you to learn.
Each and everytime we implement something, we also test it, as stated in the setup manual, we will use cocotb and verilator to verify our HDL.
When it comes to verifying memory, we'll simply do some writes while tinkering with the write_enable
flag.
It turns out LLMs are bad at digital design and I don't recommend using it for this course, especially for test benches so please try to write them yourself (written in late 2024).
Here is the testbench :
# test_memory.py
import cocotb
from cocotb.clock import Clock
from cocotb.triggers import RisingEdge, Timer
@cocotb.test()
async def memory_data_test(dut):
# Start a 10 ns clock
cocotb.start_soon(Clock(dut.clk, 1, units="ns").start())
await RisingEdge(dut.clk)
# Reset
dut.rst_n.value = 0
dut.write_enable.value = 0
dut.address.value = 0
dut.write_data.value = 0
await RisingEdge(dut.clk)
dut.rst_n.value = 1
await RisingEdge(dut.clk)
# All is 0 after reset
for address in range(dut.WORDS.value):
dut.address.value = address
await Timer(1, units="ns")
assert dut.read_data.value == "00000000000000000000000000000000"
# Test: Write and read back data
test_data = [
(0, 0xDEADBEEF),
(4, 0xCAFEBABE),
(8, 0x12345678),
(12, 0xA5A5A5A5)
]
for address, data in test_data:
# Write data to memory
dut.address.value = address
dut.write_data.value = data
dut.write_enable.value = 1
await RisingEdge(dut.clk)
# Disable write after one cycle
dut.write_enable.value = 0
await RisingEdge(dut.clk)
# Verify the write by reading back
dut.address.value = address
await RisingEdge(dut.clk)
assert dut.read_data.value == data
# Test: Write to multiple addresses, then read back
for i in range(40,4):
dut.address.value = i
dut.write_data.value = i + 100
dut.write_enable.value = 1
await RisingEdge(dut.clk)
# Disable write, then read back values to check
dut.write_enable.value = 0
for i in range(40,4):
dut.address.value = i
await RisingEdge(dut.clk)
expected_value = i + 100
assert dut.read_data.value == expected_value
Once again, we increment memory by 4 because it is byte addressed.
To run this, check out the setup manual.
For the reg file, it's just 32x32bits registers. We'll implement it like memory except the size is fixed to 32 bits with 5bits addressing.
The I/Os are a bit different though as we have to accommodate all the instruction types. E.g. in R-Types (which operates ONLY on registers) we can write to a register whilst getting data from 2 of them at the same time.
// regfile.sv
module regfile (
// basic signals
input logic clk,
input logic rst_n,
// Reads
input logic [4:0] address1,
input logic [4:0] address2,
output logic [31:0] read_data1,
output logic [31:0] read_data2,
// Writes
input logic write_enable,
input logic [31:0] write_data,
input logic [4:0] address3
);
// 32bits register. 32 of them (addressed with 5 bits)
reg [31:0] registers [0:31];
// Write logic
always @(posedge clk) begin
// reset support, init to 0
if(rst_n == 1'b0) begin
for(int i = 0; i<32; i++) begin
registers[i] <= 32'b0;
end
end
// Write, except on 0, reserved for a zero constant according to RISC-V specs
else if(write_enable == 1'b1 && address3 != 0) begin
registers[address3] <= write_data;
end
end
// Read logic, async
always_comb begin : readLogic
read_data1 = registers[address1];
read_data2 = registers[address2];
end
endmodule
Now to verify this HDL, we'll simply use random write on A3, and read after each write on both addresses. We then compare to a theoretical golden state update in software in the testbench.
We also add small tests at the end to test the 0 constant.
Note that we use small timer delay to test out the async properties of our design.
import cocotb
from cocotb.clock import Clock
from cocotb.triggers import RisingEdge, Timer
import random
import numpy as np
@cocotb.test()
async def random_write_read_test(dut):
# Start a 10 ns clock
cocotb.start_soon(Clock(dut.clk, 10, units="ns").start())
await RisingEdge(dut.clk)
# Init and reset
dut.rst_n.value = 0
dut.write_enable.value = 0
dut.address1.value = 0
dut.address2.value = 0
dut.address3.value = 0
dut.write_data.value = 0
await RisingEdge(dut.clk)
dut.rst_n.value = 1 # realease reset_n
await RisingEdge(dut.clk)
# fill a heorical state of the regs, all 0s for starters
theorical_regs = [0 for _ in range(32)]
# Loop to write and read random values, 1000 test shall be enough
for _ in range(1000):
# Generate a random register address (1 to 31, skip 0)
address1 = random.randint(1, 31)
address2 = random.randint(1, 31)
address3 = random.randint(1, 31)
write_value = random.randint(0, 0xFFFFFFFF)
# perform reads
await Timer(1, units="ns") # wait a ns to test async read
dut.address1.value = address1
dut.address2.value = address2
await Timer(1, units="ns")
assert dut.read_data1.value == theorical_regs[address1]
assert dut.read_data2.value == theorical_regs[address2]
# perform a random write
dut.address3.value = address3
dut.write_enable.value = 1
dut.write_data = write_value
await RisingEdge(dut.clk)
dut.write_enable.value = 0
theorical_regs[address3] = write_value
await Timer(1, units="ns")
# try to write at 0 and check if it's still 0
await Timer(1, units="ns")
dut.address3.value = 0
dut.write_enable.value = 1
dut.write_data = 0xAEAEAEAE
await RisingEdge(dut.clk)
dut.write_enable.value = 0
theorical_regs[address3] = 0
await Timer(1, units="ns") # wait a ns to test async read
dut.address1.value = 0
await Timer(1, units="ns")
print(dut.read_data1.value)
assert int(dut.read_data1.value) == 0
print("Random write/read test completed successfully.")
For the Load Word datapath, we only need to add :
- The content of a source register, containing a target address
- A 12bits immediate / offset
Here is a very basic implementation, note that this design will evolve heavily !.
module alu (
// IN
input logic [2:0] alu_control,
input logic [31:0] src1,
input logic [31:0] src2,
// OUT
output logic [31:0] alu_result,
output logic zero
);
always_comb begin
case (alu_control)
3'b000 : alu_result = src1 + src2;
default: alu_result = 32'b0;
endcase
end
assign zero = alu_result == 32'b0;
endmodule
We also add a alu_control
option, to later select other arithmetic operation. We default the result to 0 if the requested arithmetic isn't implemented and we add a "zero" flag that we'll use in later designs.
Simple design, simple testbench, but this time, the alu being pur combinational logic, we do not use a clock :
import cocotb
from cocotb.triggers import Timer
import random
@cocotb.test()
async def add_test(dut):
await Timer(1, units="ns")
dut.alu_control.value = 0b000
for _ in range(1000):
src1 = random.randint(0,0xFFFFFFFF)
src2 = random.randint(0,0xFFFFFFFF)
dut.src1.value = src1
dut.src2.value = src2
# We mask expected to not take account of overflows
expected = (src1 + src2) & 0xFFFFFFFF
# Await 1 ns for the infos to propagate
await Timer(1, units="ns")
assert int(dut.alu_result.value) == expected
@cocotb.test()
async def default_test(dut):
await Timer(1, units="ns")
dut.alu_control.value = 0b111
src1 = random.randint(0,0xFFFFFFFF)
src2 = random.randint(0,0xFFFFFFFF)
dut.src1.value = src1
dut.src2.value = src2
expected = 0
# Await 1 ns for the infos to propagate
await Timer(1, units="ns")
assert int(dut.alu_result.value) == expected
@cocotb.test()
async def zero_test(dut):
await Timer(1, units="ns")
dut.alu_control.value = 0b000
dut.src1.value = 123
dut.src2.value = -123
await Timer(1, units="ns")
print(int(dut.alu_result.value))
assert int(dut.zero.value) == 1
assert int(dut.alu_result.value) == 0
Here we declare multiple tests, it's exactly the same as making a single block but it improves readability so why not.
In order to manipulate the immediate in other computation block, we need to make it 32bit wide. Also, Immediates can be "scattered" around in the instruction in RISC-V (e.g. Store Word sw
). This means that we'll need to :
- Gather the immediate in the instruction, depending on the op code (ie, include some control inputs)
- Extend the gathered immediate sign to 32bits. Here is a basic implemention for our basic lw only with some preparations for the future :
// signext.sv
module signext (
// IN
input logic [24:0] raw_src,
input logic [1:0] imm_source,
// OUT (immediate)
output logic [31:0] immediate
);
logic [11:0] gathered_imm;
always_comb begin
case (imm_source)
1'b00 : gathered_imm = raw_src[24:13];
default: gathered_imm = 12'b0;
endcase
end
assign immediate = {{20{gathered_imm[11]}}, gathered_imm};
endmodule
Simple enough right ? no magic here, simply an raw application of the DDCA lecture. Now we test this design !
Here is the test benchench, if you are not used to bitwise operations, take a minute to get your head around these :
import cocotb
from cocotb.clock import Clock
from cocotb.triggers import RisingEdge, Timer
import random
import numpy as np
@cocotb.test()
async def random_write_read_test(dut):
# TEST POSITIVE IMM = 123 WITH SOURCE = 0
imm = 0b000001111011 #123
imm <<= 13 # leave "room" for random junk
source = 0b00
# 25 bits sent to sign extend contains data before that will be ignored (rd, f3,..)
# masked to leave room for imm "test payload"
random_junk = 0b000000000000_1010101010101
raw_data = random_junk | imm
await Timer(1, units="ns")
dut.raw_src.value = raw_data
dut.imm_source = source
await Timer(1, units="ns") # let it propagate ...
assert dut.immediate.value == "00000000000000000000000001111011"
assert int(dut.immediate.value) == 123
# TEST Negative IMM = -42 WITH SOURCE = 0
imm = 0b111111010110 #-42
imm <<= 13 # leave "room" for ramdom junk
source = 0b00
# 25 bits sent to sign extend contains data before that will be ignred (rd, f3,..)
# masked to leave room for imm "test payload"
random_junk = 0b000000000000_1010101010101
raw_data = random_junk | imm
await Timer(1, units="ns")
dut.raw_src.value = raw_data
dut.imm_source = source
await Timer(1, units="ns") # let it propagate ...
assert dut.immediate.value == "11111111111111111111111111010110"
# Python interprets int as uint. we sub 1<<32 as int to get corresponding negative value
assert int(dut.immediate.value) - (1 << 32) == -42
Once again, we'll add other features to this a bit later ;)
Below is an image of what we need to do implement for the control unit. Note that the following image contains the logic for the FULL controller, for now, we'll focus on implementing the lw
logic.
You can find the definitives tables for the HOLY CORE in the
Holy_Reference_Tables.pdf
file.
First we lay down the only I/Os we need so far for lw
:
module control (
// IN
input logic [6:0] op,
input logic [2:0] func3,
input logic [6:0] func7,
input logic alu_zero,
// OUT
output logic [2:0] alu_control,
output logic [1:0] imm_source,
output logic mem_write,
output logic reg_write
);
// lorem ipsum...
endmodule
This will help us focus on the important stuff to get a first lw
example working.
As you can see, there is an ALU control as well. This is because a single instruction type can require different kinds of arithmetics (e.g. R-Types that can be add
, sub
, mul
, ...). So we put it here now because it is one of the main purpose of the control unit to assert what arithmetic to use using this signal.
To know what alu_control
to use, the plan is to deduce a general alu_op
depending on the instruction type and then add an alu_decoder
unit will deduce the arithmetic from indicators like func3
and func7
(Also called simply f3 & f7). This will finally assert some alu_control
control signals to tell the ALU what to do, here is another truth table to use that :
You can find the full table for the entire course in the
Holy_Reference_Tables.pdf
file.
This process of seperating alu_op
and alu_control
may seem weird but trust me, it is for the better as we add multiple arithmetic options for each type of instruction :
module control (
// IN
input logic [6:0] op,
input logic [2:0] func3,
input logic [6:0] func7,
input logic alu_zero,
// OUT
output logic [2:0] alu_control,
output logic [1:0] imm_source,
output logic mem_write,
output logic reg_write
);
/**
* MAIN DECODER
*/
logic [1:0] alu_op;
always_comb begin
case (op)
// LW
7'b0000011 : begin
reg_write = 1'b1;
imm_source = 2'b00;
mem_write = 1'b0;
alu_op = 2'b00;
end
// EVERYTHING ELSE
default: begin
reg_write = 1'b0;
imm_source = 2'b00;
mem_write = 1'b0;
alu_op = 2'b00;
end
endcase
end
/**
* ALU DECODER
*/
always_comb begin
case (alu_op)
// LW, SW
1'b00 : alu_control = 3'b000;
// EVERYTHING ELSE
default: alu_control = 3'b111;
endcase
end
endmodule
And everything is ready for the future instruction to be added in control !
The testbench is very straight forward, we emulate ONLY the important signals described in the truth tables for a given instruction (we don't care about the other one being X
or Z
in simulation). And we assert the outputs states :
import cocotb
from cocotb.triggers import Timer
import random
from cocotb.binary import BinaryValue
@cocotb.coroutine
async def set_unknown(dut):
# we'll see what this is in a minute ...
@cocotb.test()
async def control_test(dut):
await set_unknown(dut)
# TEST CONTROL SIGNALS FOR LW
await Timer(1, units="ns")
dut.op.value = 0b0000011 #lw
await Timer(1, units="ns")
assert dut.alu_control.value == "000"
assert dut.imm_source.value == "00"
assert dut.mem_write.value == "0"
assert dut.reg_write.value == "1"
await set_unknown(dut)
is here to set the signals to defaultX
values. You can check the section on tests setup or the source code for more info. If you don't use it, you may have a passing test whereas you are checking on other tests states. It will be up to you to update it (by uncommenting the assignements) as we add input to the control unit. Don't, worry, it will come to mind naturally.
Here is what the set_unknown
function looks like :
@cocotb.coroutine
async def set_unknown(dut):
# Set all input to unknown before each test
await Timer(1, units="ns")
dut.op.value = BinaryValue("XXXXXXX")
#
# Uncomment the following throughout the course when needed
#
# dut.func3.value = BinaryValue("XXX")
# dut.func7.value = BinaryValue("XXXXXXX")
# dut.alu_zero.value = BinaryValue("X")
# dut.alu_last_bit.value = BinaryValue("X")
await Timer(1, units="ns")
For the curious who may ask "so what is the f3 for in the lw
instruction then ?". Great question. We can use F3 to implement different flavors of the load
instruction
"The LW instruction loads a 32-bit value from memory into rd. LH loads a 16-bit value from memory, then sign-extends to 32-bits before storing in rd. LHU loads a 16-bit value from memory but then zero extends to 32-bits before storing in rd. LB and LBU are defined analogously for 8-bit values."
(from the RISC-V Vol1 User-level ISA, Page 19)
Use the tables to check out different f3 values for loads
.
We can now start to edit cpu.sv
and add the pieces together ! From there (a working lw datapath), we'll be able to add functionalities and build more advanced features !
Here is the complete lw
specific datapth :
So we implement it !
module cpu (
input logic clk,
input logic rst_n
);
/**
* PROGRAM COUNTER
*/
reg [31:0] pc;
logic [31:0] pc_next;
always_comb begin : pcSelect
pc_next = pc + 4;
end
always @(posedge clk) begin
if(rst_n == 0) begin
pc <= 32'b0;
end else begin
pc <= pc_next;
end
end
/**
* INSTRUCTION MEMORY
*/
// Acts as a ROM.
wire [31:0] instruction;
memory #(
.mem_init("./test_imemory.hex")
) instruction_memory (
// Memory inputs
.clk(clk),
.address(pc),
.write_data(32'b0),
.write_enable(1'b0),
.rst_n(1'b1),
// Memory outputs
.read_data(instruction)
);
/**
* CONTROL
*/
// Intercepts instructions data, generate control signals accordingly
// in control unit
logic [6:0] op;
assign op = instruction[6:0];
logic [2:0] f3;
assign f3 = instruction[14:12];
wire alu_zero;
// out of control unit
wire [2:0] alu_control;
wire [1:0] imm_source;
wire mem_write;
wire reg_write;
control control_unit(
.op(op),
.func3(f3),
.func7(7'b0),
.alu_zero(alu_zero),
// OUT
.alu_control(alu_control),
.imm_source(imm_source),
.mem_write(mem_write),
.reg_write(reg_write)
);
/**
* REGFILE
*/
logic [4:0] source_reg1;
assign source_reg1 = instruction[19:15];
logic [4:0] source_reg2;
assign source_reg2 = instruction[24:20];
logic [4:0] dest_reg;
assign dest_reg = instruction[11:7];
wire [31:0] read_reg1;
wire [31:0] read_reg2;
logic [31:0] write_back_data;
always_comb begin : wbSelect
write_back_data = mem_read;
end
regfile regfile(
// basic signals
.clk(clk),
.rst_n(rst_n),
// Read In
.address1(source_reg1),
.address2(source_reg2),
// Read out
.read_data1(read_reg1),
.read_data2(read_reg2),
// Write In
.write_enable(reg_write),
.write_data(write_back_data),
.address3(dest_reg)
);
/**
* SIGN EXTEND
*/
logic [24:0] raw_imm;
assign raw_imm = instruction[31:7];
wire [31:0] immediate;
signext sign_extender(
.raw_src(raw_imm),
.imm_source(imm_source),
.immediate(immediate)
);
/**
* ALU
*/
wire [31:0] alu_result;
logic [31:0] alu_src2;
always_comb begin : srcBSelect
alu_src2 = immediate;
end
alu alu_inst(
.alu_control(alu_control),
.src1(read_reg1),
.src2(alu_src2),
.alu_result(alu_result),
.zero(alu_zero)
);
/**
* DATA MEMORY
*/
wire [31:0] mem_read;
memory #(
.mem_init("./test_dmemory.hex")
) data_memory (
// Memory inputs
.clk(clk),
.address(alu_result),
.write_data(32'b0),
.write_enable(1'b0),
.rst_n(1'b1),
// Memory outputs
.read_data(mem_read)
);
endmodule
This one is large but fairly simple, no fancy logic here as we pretty much just assemble legos according to the plan with a bunch of additional wires. Note the "always comb" muxes we add in preparation for further improvements, even though they are pretty useless right now.
Tip : to navigate such HDL files, use the "find" feature of your text editor extensively ! It will be you best friend when it comes to finding out what goes where when these files gets big.
Note that I added some .mem_init("blablabla")
parameters to the memory. This has to do with verification afterward. Here is the updated memory's verilog to accommodate this change :
module memory #(
parameter WORDS = 64,
parameter mem_init = ""
) (
// same I/Os ...
);
reg [31:0] mem [0:WORDS-1]; // Memory array of words (32-bits)
initial begin
$readmemh(mem_init, mem); // Load memory for simulation
end
// same logic as before ....
endmodule
see below verification for explanations...
To test this, we need to instanciate instruction and data memory with some known data. We then check the regfile's states and check if the said states are the one we expected when writting the instructions.
So here is our todo list to lay down the tests :
- Write some basic memories file to init the memory for testing
- Loads these files for simulation
- Write the testbench
Sounds simple enough, but our current project testing setup has some limitations that have to be carefully taken into account. These limitations lead to :
- We will only have 1 memory (it is what it is).
- We have to load the initial "ROMs" memory hexfiles directly via hardcoded verilog. Thus the modifications and limitations described above.
- The cocotb framework is great but when test benches and data get more complex, we have to use these kinds of tricks.
With all of these facts in mind, let's write some test ROMs for our lw datapath !
For the instruction memory to test our data path, we'll use a simple lw test :
lw x18 8(x0) // loads daata from addr 0x00000008 in reg x18 (s2)
Which translates as this in HEX format (comments like //blablabla
are ignored by $readmemh("rom.hex")
):
00802903 //LW TEST START : lw x18 8(x0)
00000013 //NOP
00000013 //NOP
//(Filled the rest with NOPs...)
To translate ASM to HEX, you can use this website (better than doing it on paper like I did haha).
And here is the data we'll try to load :
AEAEAEAE // @ 0x00000000 Useless data
00000000 // @ 0x00000004 Useless data
DEADBEEF // @ 0x00000008 What we'll try to get in x18
00000000
00000000
//(...)
Great ! Here is how we are going to organize our cpu tb(testbench) folder (we put the .hex
file in there as the HDL file are called from here so $readmemh("myrom.hex")
will gets the .hex
files from there) :
tb/
- cpu/
- Makefile
- test_cpu.py
- test_dmemory.hex
- test_imemory.hex
And now we can design a test bench ! First, we design some helper functions that will convert str data from HEX
to BIN
as needed (python tricks to deal with multiple data types expressed as str
in cocotb), we also declare a cocotb.coroutine
that will handle cpu resets :
# test_cpu.py
import cocotb
from cocotb.clock import Clock
from cocotb.triggers import RisingEdge
def binary_to_hex(bin_str):
# Convert binary string to hexadecimal
hex_str = hex(int(str(bin_str), 2))[2:]
hex_str = hex_str.zfill(8)
return hex_str.upper()
def hex_to_bin(hex_str):
# Convert hex str to bin
bin_str = bin(int(str(hex_str), 16))[2:]
bin_str = bin_str.zfill(32)
return bin_str.upper()
@cocotb.coroutine
async def cpu_reset(dut):
# Init and reset
dut.rst_n.value = 0
await RisingEdge(dut.clk) # Wait for a clock edge after reset
dut.rst_n.value = 1 # De-assert reset
await RisingEdge(dut.clk) # Wait for a clock edge after reset
Great ! Now I added a small test to see if memory reads worked on my side, and we also write a test to check if out lw
instruction worked as expected :
# test_cpu.py
import cocotb
from cocotb.clock import Clock
from cocotb.triggers import RisingEdge
def binary_to_hex(bin_str):
...
def hex_to_bin(hex_str):
...
@cocotb.coroutine
async def cpu_reset(dut):
...
@cocotb.test()
async def cpu_init_test(dut):
"""Reset the cpu and check for a good imem read"""
cocotb.start_soon(Clock(dut.clk, 1, units="ns").start())
await RisingEdge(dut.clk)
await cpu_reset(dut)
assert binary_to_hex(dut.pc.value) == "00000000"
# Load the expected instruction memory as binary
# Note that this is loaded in sim directly via the verilog code
# This load is only for expected
imem = []
with open("test_imemory.hex", "r") as file:
for line in file:
# Ignore comments
line_content = line.split("//")[0].strip()
if line_content:
imem.append(hex_to_bin(line_content))
# We limit this inital test to the first couple of instructions
# as we'll later implement branches
for counter in range(5):
expected_instruction = imem[counter]
assert dut.instruction.value == expected_instruction
await RisingEdge(dut.clk)
@cocotb.test()
async def cpu_insrt_test(dut):
"""Runs a lw datapath test"""
cocotb.start_soon(Clock(dut.clk, 1, units="ns").start())
await RisingEdge(dut.clk)
await cpu_reset(dut)
# The first instruction for the test in imem.hex load the data from
# dmem @ adress 0x00000008 that happens to be 0xDEADBEEF into register x18
# Wait a clock cycle for the instruction to execute
await RisingEdge(dut.clk)
print(binary_to_hex(dut.regfile.registers[18].value))
# Check the value of reg x18
assert binary_to_hex(dut.regfile.registers[18].value) == "DEADBEEF"
As you can see, the helper functions does help a lot indeed ! Using them, we can easily compare our expected values by switching between data representations as needed.
Here is what enhancements we need to make to add basic sw (S-type)
support in our CPU :
As you can see, it is simply about adding a wire from RD2 to write data and an other imm_source control signal.
Below is a S-type instruction example (S standing for "Store") that loads data from reg x18 (s1), to the address pointer in x5 (t0) with an offset of C on the address :
sw x18, 0xC(x0)
We would translate it like this in binary and in hex, as an S-type instruction :
0000000 10010 00000 010 01100 0100011
0x01202623
here is a quick breakdown :
IMM [11:5] | rs2 | rs1 | f3 | IMM [4:0] | op | |
---|---|---|---|---|---|---|
binary | 0000000 | 10010 | 00000 | 010 | 01100 | 0100011 |
Value | 0 | x18 (s1) | x0 (0) | 2 (sw) | 0xC | S-type |
Here is a todo list to implement these new changes :
- The immediate is now "scattered" around the instruction, we'll need to:
- Tell the control to select another source for the IMM
- Tell the sign extender unit how to interpret that
- We'll also need to update the control unit to :
- Not enable write for the regs
- Enable write for the memory
So let's get to work shall we ? We'll statrt by updating the sign extender to take into account our new source type
// signext.sv
module signext (
// IN
input logic [24:0] raw_src,
input logic [1:0] imm_source,
// OUT (immediate)
output logic [31:0] immediate
);
logic [11:0] gathered_imm;
always_comb begin
case (imm_source)
// For I-Types
2'b00 : gathered_imm = raw_src[24:13];
// For S-types
2'b01 : gathered_imm = {raw_src[24:18],raw_src[4:0]};
default: gathered_imm = 12'b0;
endcase
end
assign immediate = {{20{gathered_imm[11]}}, gathered_imm};
endmodule
As you can see, just a simple application of the S-Type instruction Imm format.
Now to verify that, we update the test_signext.py
testbench file by adding another, improved test :
# test_signext.py
import cocotb
from cocotb.triggers import Timer
import random
import numpy as np
@cocotb.test()
async def signext_i_type_test(dut):
# Old fully manual test for I_Types instrs
# ...
@cocotb.test()
async def signext_s_type_test(dut):
# 100 randomized tests
for _ in range(100):
# TEST POSITIVE IMM
await Timer(100, units="ns")
imm = random.randint(0,0b01111111111)
imm_11_5 = imm >> 5
imm_4_0 = imm & 0b000000011111
raw_data = (imm_11_5 << 18) | (imm_4_0) # the 25 bits of data
source = 0b01
dut.raw_src.value = raw_data
dut.imm_source = source
await Timer(1, units="ns") # let it propagate ...
assert int(dut.immediate.value) == imm
# TEST Negative IMM
# Get a random 12 bits UINT and gets its base 10 neg value by - (1 << 12)
imm = random.randint(0b100000000000,0b111111111111) - (1 << 12)
imm_11_5 = imm >> 5
imm_4_0 = imm & 0b000000011111
raw_data = (imm_11_5 << 18) | (imm_4_0) # the 25 bits of data
source = 0b01
await Timer(1, units="ns")
dut.raw_src.value = raw_data
dut.imm_source = source
await Timer(1, units="ns") # let it propagate ...
# print(bin(imm),dut.raw_src.value)
# print(int(dut.immediate.value), imm)
assert int(dut.immediate.value) - (1 << 32) == imm
As we can see, we randomized the tests and used more bitwise manipulation for assertions to make the whole testing more robust.
(This also serves as a great biwise operations exercise !)
As you can see in the lecture and as stated before, we need to update the reg_write_enable
and mem_write_enable
signals.
Here is the updated main decode, nothing else changes :
// control.sv
//...
/**
* MAIN DECODER
*/
logic [1:0] alu_op;
always_comb begin
case (op)
// I-type (lw)
7'b0000011 : begin
reg_write = 1'b1;
imm_source = 2'b00;
mem_write = 1'b0;
alu_op = 2'b00;
end
// S-Type (sw)
7'b0100011 : begin
reg_write = 1'b0;
imm_source = 2'b01;
mem_write = 1'b1;
alu_op = 2'b00;
end
// EVERYTHING ELSE
default: begin
reg_write = 1'b0;
imm_source = 2'b00;
mem_write = 1'b0;
alu_op = 2'b00;
end
endcase
end
//...
As you can see it is simple a matter of adding a decoding case.
For the verification, it is also pretty simple :
# test_control.py
import cocotb
from cocotb.triggers import Timer
@cocotb.test()
async def lw_control_test(dut):
# ...
@cocotb.test()
async def sw_control_test(dut):
await set_unknown(dut)
# TEST CONTROL SIGNALS FOR SW
await Timer(10, units="ns")
dut.op.value = 0b0100011 #sw
await Timer(1, units="ns")
assert dut.alu_control.value == "000"
assert dut.imm_source.value == "01"
assert dut.mem_write.value == "1"
assert dut.reg_write.value == "0"
Note that these tests will change, we will later add "flavors" to these I and S types :
lb
,sb
, ... which will have another f3, which will require a bit more decoding and logic, but for now, this will do just fine !
Globally in the datapath, nothing much changes, we just link the signals we previously kept on 0 for the memory write inputs :
// cpu.sv
// non changed logic ...
/**
* DATA MEMORY
*/
wire [31:0] mem_read;
memory #(
.mem_init("./test_dmemory.hex")
) data_memory (
// Memory inputs
.clk(clk),
.address(alu_result),
.write_data(read_reg2),
.write_enable(mem_write),
.rst_n(1'b1),
// Memory outputs
.read_data(mem_read)
);
// non changed logic ...
To verify, once again, we set up the memory files on a scenario that will be easily predictible in testing so we can verify the CPU behavior, whilst keeping of course the previous lw
tests in our memory files :
//test_imemory.hex
00802903 //LW TEST START : lw x18 0x8(x0)
01202623 //SW TEST START : sw x18 0xC(x0)
00000013 //NOP
00000013 //NOP
00000013 //NOP
//...
As you can see, we add a new instruction that will take the value we loaded in x18 and store it @ addr 0x0000000C in memory.
rpeaking of memory, the file did not really change, except I changed the 0xC value to 0xF2F2F2F2
to avoid asserting 00000000 as it is too common of a value :
//test_dmemory.hex
AEAEAEAE
00000000
DEADBEEF
F2F2F2F2
00000000
00000000
//...
And for the testbench, I simply did some assertions based on how the CPU should react to these instructions. We also get rid of the "init" test that tested for init memory state as it executed the instruction to verify PC & memory behavior, which messed up all of the memory state for assertions. Here is the final result :
# test_cpu.py
import cocotb
from cocotb.clock import Clock
from cocotb.triggers import RisingEdge
def binary_to_hex(bin_str):
# ...
def hex_to_bin(hex_str):
# ...
@cocotb.coroutine
async def cpu_reset(dut):
# ...
@cocotb.test()
async def cpu_insrt_test(dut):
"""Runs a lw datapath test"""
cocotb.start_soon(Clock(dut.clk, 1, units="ns").start())
await RisingEdge(dut.clk)
await cpu_reset(dut)
##################
# LOAD WORD TEST
# lw x18 0x8(x0)
##################
print("\n\nTESTING LW\n\n")
# The first instruction for the test in imem.hex load the data from
# dmem @ adress 0x00000008 that happens to be 0xDEADBEEF into register x18
# Wait a clock cycle for the instruction to execute
await RisingEdge(dut.clk)
# Check the value of reg x18
assert binary_to_hex(dut.regfile.registers[18].value) == "DEADBEEF"
##################
# STORE WORD TEST
# sw x18 0xC(x0)
##################
print("\n\nTESTING SW\n\n")
test_address = int(0xC / 4) #mem is byte adressed but is made out of words in the eyes of the software
# The second instruction for the test in imem.hex stores the data from
# x18 (that happens to be 0xDEADBEEF from the previous LW test) @ adress 0x0000000C
# First, let's check the inital value
assert binary_to_hex(dut.data_memory.mem[test_address].value) == "F2F2F2F2"
# Wait a clock cycle for the instruction to execute
await RisingEdge(dut.clk)
# Check the value of mem[0xC]
assert binary_to_hex(dut.data_memory.mem[test_address].value) == "DEADBEEF"
Okay ! It's going great, we implemented a second kind of instruction ! so let's recap what we did so far :
- Created all the basic logic blocks
- Layed down the datapath for
I-Type
instructions and addedlw
support - Layed down the datapath for
S-Type
instructions and addedsw
support - Created a basic control unit accordingly
- Tested everything along the way
We now have a very strong base to build on ! and we can almost say that the CPU is starting to look like a true processing unit ! Let's take a look at what remains to be done :
- Implement
R-Type
format (arithmetic between regitser) - Implement
B-Type & J-Type
formats (Jumps and conditional branches) - Implement
U-Type
Instructions (Just a convenient way to build constants with immediates)
Oh.. That's actually quite a lot ! But do not fret, as most of these mostly uses what we already built. In this section, we'll focus on implementing R-Type
support through the examples of add
, and
& or
instructions.
Here is what we'll try to implement :
What I suggest we do is to first implement the add
instruction (because our ALU already had addition arithmetic) and than we exercice a bit by adding and
& or
before moving on to jumps & branches instructions.
First, we'll add add
support, and there isn't much to do outside of the control unit as we already have the ALU add logic availible. The idea will be to only operate with registers as source for the ALU and use the alu_result directly as write-back data for reg_write.
Here is what an add
instruction look like :
f7 | rs2 | rs1 | f3 | rd | OP |
---|---|---|---|---|---|
0000000 | xxxxx | xxxxx | 000 | xxxxx | 0110011 |
So, to accomodate the new requirements, we add the following signals as outputs to our control unit :
alu_source
which tells our alu not to get its second operand from the immediate, but rather from the second read register.write_back_source
Which tells our registers to get data from the alu for writing back to reg3, instead of the memory_read.
Here is the new HDL code. A new thing is that we take f3 into account, because when we'll implement or
& and
, this will be the factor that will differenciate them.
module control (
// IN
input logic [6:0] op,
input logic [2:0] func3,
input logic [6:0] func7,
input logic alu_zero,
// OUT
output logic [2:0] alu_control,
output logic [1:0] imm_source,
output logic mem_write,
output logic reg_write,
output logic alu_source,
output logic write_back_source
);
/**
* MAIN DECODER
*/
logic [1:0] alu_op;
always_comb begin
case (op)
// I-type
7'b0000011 : begin
reg_write = 1'b1;
imm_source = 2'b00;
mem_write = 1'b0;
alu_op = 2'b00;
alu_source = 1'b1; //imm
write_back_source = 1'b1; //memory_read
end
// S-Type
7'b0100011 : begin
reg_write = 1'b0;
imm_source = 2'b01;
mem_write = 1'b1;
alu_op = 2'b00;
alu_source = 1'b1; //imm
end
// R-Type
7'b0110011 : begin
reg_write = 1'b1;
mem_write = 1'b0;
alu_op = 2'b10;
alu_source = 1'b0; //reg2
write_back_source = 1'b0; //alu_result
end
// EVERYTHING ELSE
default: begin
reg_write = 1'b0;
imm_source = 2'b00;
mem_write = 1'b0;
alu_op = 2'b00;
end
endcase
end
/**
* ALU DECODER
*/
always_comb begin
case (alu_op)
// LW, SW
2'b00 : alu_control = 3'b000;
// R-Types
2'b10 : begin
case (func3)
// ADD (and later SUB with a different F7)
3'b000 : alu_control = 3'b000;
// ALL THE OTHERS
default: alu_control = 3'b111;
endcase
end
// EVERYTHING ELSE
default: alu_control = 3'b111;
endcase
end
endmodule
We also need to update our test bench, to see if the new signals are okay for our previous instructions, and add a new one for add
:
import cocotb
from cocotb.triggers import Timer
@cocotb.test()
async def lw_control_test(dut):
# TEST CONTROL SIGNALS FOR LW
await Timer(1, units="ns")
dut.op.value = 0b0000011 # I-TYPE
await Timer(1, units="ns")
# Logic block controls
assert dut.alu_control.value == "000"
assert dut.imm_source.value == "00"
assert dut.mem_write.value == "0"
assert dut.reg_write.value == "1"
# Datapath mux sources
assert dut.alu_source.value == "1"
assert dut.write_back_source.value == "1"
@cocotb.test()
async def sw_control_test(dut):
# TEST CONTROL SIGNALS FOR SW
await Timer(10, units="ns")
dut.op.value = 0b0100011 # S-TYPE
await Timer(1, units="ns")
assert dut.alu_control.value == "000"
assert dut.imm_source.value == "01"
assert dut.mem_write.value == "1"
assert dut.reg_write.value == "0"
# Datapath mux sources
assert dut.alu_source.value == "1"
@cocotb.test()
async def add_control_test(dut):
# TEST CONTROL SIGNALS FOR ADD
await Timer(10, units="ns")
dut.op.value = 0b0110011 # R-TYPE
# Watch out ! F3 is important here and now !
dut.func3.value = 0b000
await Timer(1, units="ns")
assert dut.alu_control.value == "000"
assert dut.mem_write.value == "0"
assert dut.reg_write.value == "1"
# Datapath mux sources
assert dut.alu_source.value == "0"
assert dut.write_back_source.value == "0"
Note that if a signal is not necessary to the instruction (eg write_back_source
for sw
) we just don't check for it. You can see a reference for the signals values in the Holy_Reference_Tables.pdf
file.(with 'X' values meaning that we don't care about what it is as we don't have a use for it).
To implement R-Type
, we simply follow the guidelines of the schematics you saw at the beggining of Section 3.
Here is a fully updated version of the cpu datapath :
module cpu (
input logic clk,
input logic rst_n
);
/**
* PROGRAM COUNTER
*/
reg [31:0] pc;
logic [31:0] pc_next;
always_comb begin : pc_select
pc_next = pc + 4;
end
always @(posedge clk) begin
if(rst_n == 0) begin
pc <= 32'b0;
end else begin
pc <= pc_next;
end
end
/**
* INSTRUCTION MEMORY
*/
// Acts as a ROM.
wire [31:0] instruction;
memory #(
.mem_init("./test_imemory.hex")
) instruction_memory (
// Memory inputs
.clk(clk),
.address(pc),
.write_data(32'b0),
.write_enable(1'b0),
.rst_n(1'b1),
// Memory outputs
.read_data(instruction)
);
/**
* CONTROL
*/
// Intercepts instructions data, generate control signals accordignly
// in control unit
logic [6:0] op;
assign op = instruction[6:0];
logic [2:0] f3;
assign f3 = instruction[14:12];
wire alu_zero;
// out of control unit
wire [2:0] alu_control;
wire [1:0] imm_source;
wire mem_write;
wire reg_write;
// out muxes wires
wire alu_source;
wire write_back_source;
control control_unit(
.op(op),
.func3(f3),
.func7(7'b0), // we still don't use f7 (YET)
.alu_zero(alu_zero),
// OUT
.alu_control(alu_control),
.imm_source(imm_source),
.mem_write(mem_write),
.reg_write(reg_write),
// muxes out
.alu_source(alu_source),
.write_back_source(write_back_source)
);
/**
* REGFILE
*/
logic [4:0] source_reg1;
assign source_reg1 = instruction[19:15];
logic [4:0] source_reg2;
assign source_reg2 = instruction[24:20];
logic [4:0] dest_reg;
assign dest_reg = instruction[11:7];
wire [31:0] read_reg1;
wire [31:0] read_reg2;
logic [31:0] write_back_data;
always_comb begin : write_back_source_select
case (write_back_source)
1'b1: write_back_data = mem_read;
default: write_back_data = alu_result;
endcase
end
regfile regfile(
// basic signals
.clk(clk),
.rst_n(rst_n),
// Read In
.address1(source_reg1),
.address2(source_reg2),
// Read out
.read_data1(read_reg1),
.read_data2(read_reg2),
// Write In
.write_enable(reg_write),
.write_data(write_back_data),
.address3(dest_reg)
);
/**
* SIGN EXTEND
*/
logic [24:0] raw_imm;
assign raw_imm = instruction[31:7];
wire [31:0] immediate;
signext sign_extender(
.raw_src(raw_imm),
.imm_source(imm_source),
.immediate(immediate)
);
/**
* ALU
*/
wire [31:0] alu_result;
logic [31:0] alu_src2;
always_comb begin : alu_source_select
case (alu_source)
1'b1: alu_src2 = immediate;
default: alu_src2 = read_reg2;
endcase
end
alu alu_inst(
.alu_control(alu_control),
.src1(read_reg1),
.src2(alu_src2),
.alu_result(alu_result),
.zero(alu_zero)
);
/**
* DATA MEMORY
*/
wire [31:0] mem_read;
memory #(
.mem_init("./test_dmemory.hex")
) data_memory (
// Memory inputs
.clk(clk),
.address(alu_result),
.write_data(read_reg2),
.write_enable(mem_write),
.rst_n(1'b1),
// Memory outputs
.read_data(mem_read)
);
endmodule
As you can see, I chose not to add F7 just yet, as we still don't need it for supporting the very few instruction we have so far.
before moving any further, we check that our old tests still works, because they should ! on my side, they do, great ! So let's move on.
As usual, we create a predectible program to test our new datapath. When it come to add
the program is pretty simple :
//test_imemory.hex
00802903 //LW TEST START : lw x18 0x8(x0)
01202623 //SW TEST START : sw x18 0xC(x0)
01002983 //ADD TEST START : lw x19 0x10(x0)
01390A33 // add x20 x18 x19
00000013 //NOP
00000013 //NOP
//...
We simply load some data in a register and then add it. I added the said data to memory for this test (0x0000AAA
for the addition) :
//test_dmemory.hex
AEAEAEAE
00000000
DEADBEEF
F2F2F2F2
00000AAA
00000000
//...
And here is the updated test :
# test_cpu.py
# Rest of the file ...
@cocotb.test()
async def cpu_insrt_test(dut):
# OTHER TESTS ...
##################
# ADD TEST
# lw x19 0x10(x0) (this memory spot contains 0x00000AAA)
# add x20 x18 x19
##################
# Expected result of x18 + x19
expected_result = (0xDEADBEEF + 0x00000AAA) & 0xFFFFFFFF
await RisingEdge(dut.clk) # lw x19 0x10(x0)
assert binary_to_hex(dut.regfile.registers[19].value) == "00000AAA"
await RisingEdge(dut.clk) # add x20 x18 x19
assert binary_to_hex(dut.regfile.registers[20].value) == hex(expected_result)[2:].upper()
And now, ladies and gentlemen, it is time to be true to my words and implement the bitwise or
operation.
In terms of math, it is extremely basic, so let's go over what we need to do :
- Update the control unit to add support for
or
- Add OR operation to our ALU
And that's pretty much it ! the same will go for and
!
BTW, Here are the tables I use (From the Harris' DDCA book just like many other temporary tables) for my control values, which can be whatever as long as it is consistent throughout your design :
You can find the definitives tables for the HOLY CORE in the
Holy_Reference_Tables.pdf
file.
f7 | rs2 | rs1 | f3 | rd | OP |
---|---|---|---|---|---|
0000000 | xxxxx | xxxxx | 111 | xxxxx | 0110011 |
First we update control...
// control.sv
/**
* ALU DECODER
*/
always_comb begin
case (alu_op)
// LW, SW
2'b00 : alu_control = 3'b000;
// R-Types
2'b10 : begin
case (func3)
// ADD (and later SUB with a different F7)
3'b000 : alu_control = 3'b000;
// AND
3'b111 : alu_control = 3'b011; // NEW !
// ALL THE OTHERS
default: alu_control = 3'b111;
endcase
end
// EVERYTHING ELSE
default: alu_control = 3'b111;
endcase
end
Then we add a test case in the testbench and make sure control stills runs smoothly...
# test_control.py
# (...)
# All the other tests ...
@cocotb.test()
async def and_control_test(dut):
await set_unknown(dut)
# TEST CONTROL SIGNALS FOR AND
await Timer(10, units="ns")
dut.op.value = 0b0110011 # R-TYPE
# Watch out ! F3 is important here and now !
dut.func3.value = 0b111
await Timer(1, units="ns")
assert dut.alu_control.value == "010"
assert dut.mem_write.value == "0"
assert dut.reg_write.value == "1"
# Datapath mux sources
assert dut.alu_source.value == "0"
assert dut.write_back_source.value == "0"
We then add some AND
logic to our ALU...
// alu.sv
module alu (
// IN
input logic [2:0] alu_control,
input logic [31:0] src1,
input logic [31:0] src2,
// OUT
output logic [31:0] alu_result,
output logic zero
);
always_comb begin
case (alu_control)
// ADD STUFF
3'b000 : alu_result = src1 + src2;
// AND STUFF
3'b010 : alu_result = src1 & src2;
// NON IMPLEMENTED STUFF
default: alu_result = 32'b0;
endcase
end
assign zero = alu_result == 32'b0;
endmodule
Add a test case for the alu and verify...
# test_alu.py
# (...)
# Other tests ...
@cocotb.test()
async def and_test(dut):
await Timer(1, units="ns")
dut.alu_control.value = 0b010
for _ in range(1000):
src1 = random.randint(0,0xFFFFFFFF)
src2 = random.randint(0,0xFFFFFFFF)
dut.src1.value = src1
dut.src2.value = src2
expected = src1 & src2
# Await 1 ns for the infos to propagate
await Timer(1, units="ns")
assert int(dut.alu_result.value) == expected
# (...)
# Other tests ...
And guess what ? No need to change the datapath ! R-Type
was already implemented ! So we get to work to immediatly write a test case and add it to our testbench hex program !
00802903 //LW TEST START : lw x18 0x8(x0) | x18 <= DEADBEEF
01202623 //SW TEST START : sw x18 0xC(x0) | 0xC <= DEADBEEF
01002983 //ADD TEST START : lw x19 0x10(x0) | x19 <= 00000AAA
01390A33 // add x20 x18 x19 | x20 <= DEADC999
01497AB3 //AND TEST START : and x21 x18 x20 | x21 <= DEAD8889
00000013 //NOP
00000013 //NOP
//...
As you can see, I made the comments a bit better, and we don't even have to touch data memory for this one as we can start to re-use previous results ! After all, we are working wth a CPU ;)
And now for the testing...
# test_cpu.py
# ...
@cocotb.test()
async def cpu_insrt_test(dut):
# ...
##################
# AND TEST
# and x21 x18 x20 (result shall be 0xDEAD8889)
##################
# Use last expected result, as this instr uses last op result register
expected_result = expected_result & 0xDEADBEEF
await RisingEdge(dut.clk) # and x21 x18 x20
assert binary_to_hex(dut.regfile.registers[21].value) == "DEAD8889"
And (it) works !
f7 | rs2 | rs1 | f3 | rd | OP |
---|---|---|---|---|---|
0000000 | xxxxx | xxxxx | 110 | xxxxx | 0110011 |
And we do the exact same thing !
- control
// control.sv
//...
2'b10 : begin // (ALU_OP case)
case (func3)
// ADD (and later SUB with a different F7)
3'b000 : alu_control = 3'b000;
// AND
3'b111 : alu_control = 3'b010;
// OR
3'b110 : alu_control = 3'b011;
// ALL THE OTHERS
default: alu_control = 3'b111;
endcase
end
//...
# test_control.py
@cocotb.test()
async def or_control_test(dut):
await set_unknown(dut)
await Timer(10, units="ns")
dut.op.value = 0b0110011
dut.func3.value = 0b110
await Timer(1, units="ns")
# only thing that changes comp to add / and
assert dut.alu_control.value == "011"
assert dut.mem_write.value == "0"
assert dut.reg_write.value == "1"
assert dut.alu_source.value == "0"
assert dut.write_back_source.value == "0"
- ALU
// alu.sv
// ...
always_comb begin
case (alu_control)
// ADD STUFF
3'b000 : alu_result = src1 + src2;
// AND STUFF
3'b010 : alu_result = src1 & src2;
// OR STUFF
3'b011 : alu_result = src1 | src2;
// NON IMPLEMENTED STUFF
default: alu_result = 32'b0;
endcase
end
// ...
# test_alu.py
# ...
@cocotb.test()
async def or_test(dut):
await Timer(1, units="ns")
dut.alu_control.value = 0b011
for _ in range(1000):
src1 = random.randint(0,0xFFFFFFFF)
src2 = random.randint(0,0xFFFFFFFF)
dut.src1.value = src1
dut.src2.value = src2
expected = src1 | src2
# Await 1 ns for the infos to propagate
await Timer(1, units="ns")
assert int(dut.alu_result.value) == expected
# ...
- CPU main datapath using sample hex programm
00802903 //LW TEST START : lw x18 0x8(x0) | x18 <= DEADBEEF
01202623 //SW TEST START : sw x18 0xC(x0) | 0xC <= DEADBEEF
01002983 //ADD TEST START : lw x19 0x10(x0) | x19 <= 00000AAA
01390A33 // add x20 x18 x19 | x20 <= DEADC999
01497AB3 //AND TEST START : and x21 x18 x20 | x21 <= DEAD8889
01402283 //OR TEST START : lw x5 0x14(x0) | x5 <= 125F552D
01802303 // lw x6 0x18(x0) | x6 <= 7F4FD46A
0062E3B3 // or x7 x5 x6 | x7 <= 7F5FD56F
00000013 //NOP
00000013 //NOP
//...
As you can see, I got myself some new other values than deadbeef to add a bit or "enthropy" in there. Here is the updated data memory :
AEAEAEAE
00000000
DEADBEEF
F2F2F2F2
00000AAA
125F552D
7F4FD46A
00000000
00000000
//...
And testing :
# test_cpu.py
# ...
@cocotb.test()
async def cpu_insrt_test(dut):
# ...
##################
# OR TEST
# For this one, I decider to load some more value to
# change the "0xdead.... theme" ;)
# (Value pre-computed in python)
# lw x5 0x14(x0) | x5 <= 125F552D
# lw x6 0x18(x0) | x6 <= 7F4FD46A
# or x7 x5 x6 | x7 <= 7F5FD56F
##################
print("\n\nTESTING OR\n\n")
await RisingEdge(dut.clk) # lw x5 0x14(x0) | x5 <= 125F552D
assert binary_to_hex(dut.regfile.registers[5].value) == "125F552D"
await RisingEdge(dut.clk) # lw x6 0x18(x0) | x6 <= 7F4FD46A
assert binary_to_hex(dut.regfile.registers[6].value) == "7F4FD46A"
await RisingEdge(dut.clk) # or x7 x5 x6 | x7 <= 7F5FD56F
assert binary_to_hex(dut.regfile.registers[7].value) == "7F5FD56F"
And there we go ! We just added suport for 2 more instructions ! This section was to demonstrate that it is way faster to implement instructions once we already have most of the required logic.
Okay, now we can start to see how to remember data and do basic math ! great, we got ourselves a very dumb calculator. But the very essence of modern computing is Conditional programing.
When you first learn programming, conditions are the first thing you learn Even loops depends on them (We could even argue they are the same thing) ! It allows us to automate computing in ways that leverages the power of digital computing. Without these, we could not have any control flow - just tell the computer to execute a dumb set of math operations, exactly like a calculator would, but in less convinient (even though it is faster).
So, conditions sounds great, and I guess you already know how they work in C, Python and assembly. But how will we implement it here ?
Well, in the low level world, branching is just about changing the pc
(program counter) to whatever we want instead of going to the default pc + 4
we had until now. (thus the pc source selector we addedf at the very beginning).
Here will try to implement beq
that branches (changes the next pc to...) to whatever address we want in the instruction memory if a condition is met. This condition is that the two regs are equal.
IMM[12] + IMM[10:5] | rs2 | rs1 | f3 | IMM[4:1] + IMM[11] | OP |
---|---|---|---|---|---|
XXXXXXX | XXXXX | XXXXX | 000 | XXXXX | 1100011 |
(yes the immediate is weird I know...)
So let's get to work shall we ?
To implement beq
just like everything we did until now, we have to implement the instruction type datapath in the cpu. Here is a little todo list of what awaits us :
- Update the control to be able to change the source of
pc
andimmediate
- Add subtraction arithmetic to the ALU to check for equal
- Update the
pc_next
choice - Add some add arithmetic to compute a
PC_target
Here is a figure from the Harris' DDCA Books, RISC-V Edition alongside a table for the new weird IMM source
What is new here (in terms of datapath) ?
- We added a second adder that take the immediate and adds it to the PC.
- The resulting target PC is then fed in a MUX as a potential source for the nex PC.
- We add a control signal to select that new source.
- We also add a new immediate source
Okay, so in terms of control, we need to
- Take the
B-Type
OP into account - Add a
branch
flag to signal the possibility of branching - Raise a
pc_source
output flag if the branching condition is met
Here, the branching condition is simply
rs1 == rs2
i.e. thealu_zero
being high.
So in the controller code, we need to add the B-Type
and add some logic for pc_source
and a whole bunch of branch
signal for each instruction type :
// control.sv
module control (
// Other I/Os ...
output logic pc_source // NEW !
);
/**
* MAIN DECODER
*/
logic [1:0] alu_op;
logic branch;
always_comb begin
case (op)
// I-type
7'b0000011 : begin
//...
branch = 1'b0; // NEW !
end
// S-Type
7'b0100011 : begin
//...
branch = 1'b0; // NEW !
end
// R-Type
7'b0110011 : begin
//...
branch = 1'b0; // NEW !
end
// B-type
7'b1100011 : begin
// NEW !
reg_write = 1'b0;
imm_source = 2'b10;
alu_source = 1'b0;
mem_write = 1'b0;
alu_op = 2'b01;
branch = 1'b1;
end
// EVERYTHING ELSE
default: begin
// Updated this too, but has nothing to
// do with beq.
reg_write = 1'b0;
mem_write = 1'b0;
end
endcase
end
/**
* ALU DECODER
*/
always_comb begin
case (alu_op)
// LW, SW
2'b00 : alu_control = 3'b000;
// R-Types
2'b10 : begin
// R-type stuf...
end
// BEQ
2'b01 : alu_control = 3'b001; // NEW ! (tell the alu to sub)
// EVERYTHING ELSE
default: alu_control = 3'b111;
endcase
end
/**
* PC_Source (NEW !)
*/
assign pc_source = alu_zero & branch;
endmodule
Great ! Now let's adapt our test cases and create a new one accordingly !
# test_control.py
# ...
@cocotb.test()
async def lw_control_test(dut):
# ...
assert dut.pc_source.value == "0" # NEW !
@cocotb.test()
async def sw_control_test(dut):
# ...
assert dut.pc_source.value == "0" # NEW !
@cocotb.test()
async def add_control_test(dut):
# ...
assert dut.pc_source.value == "0" # NEW !
@cocotb.test()
async def and_control_test(dut):
# ...
assert dut.pc_source.value == "0" # NEW !
@cocotb.test()
async def or_control_test(dut):
# ...
assert dut.pc_source.value == "0" # NEW !
@cocotb.test()
async def beq_control_test(dut):
await set_unknown(dut)
# TEST CONTROL SIGNALS FOR BEQ
await Timer(10, units="ns")
dut.op.value = 0b1100011 # B-TYPE
dut.func3.value = 0b000 # beq
dut.alu_zero.value = 0b0
await Timer(1, units="ns")
assert dut.imm_source.value == "10"
assert dut.alu_control.value == "001"
assert dut.mem_write.value == "0"
assert dut.reg_write.value == "0"
assert dut.alu_source.value == "0"
assert dut.branch.value == "1"
assert dut.pc_source.value == "0"
# Test if branching condition is met
await Timer(3, units="ns")
dut.alu_zero.value = 0b1
await Timer(1, units="ns")
assert dut.pc_source.value == "1"
Note that our beq
testbench is separated in two :
- Test whilst the branching condition is not met (pc_source should stay default)
- Test whilst the branching condition is met (pc_source should change to select new target)
Before going any further, here are some basics that are neverthelesss important (and are pretty important details that ANYONE can get wrong) :
- In RISC-V, when the LU, performs a
sub
, we dosrcA - srcB
and not the contrary.- e.g.
sub rd, rs1, rs2
- rd <= rs1 - rs2 with rs2 being srcB of the ALU
- e.g.
- We do not care about sign interpretaion at this level, we just execute.
- 2's complement :
-srcB = (~srcB+1)
with ~ being a bitwise nor.
With the previous details in mind, for the logic, we simply make an addition with the 2's complement of src2 :
// alu.sv
//...
always_comb begin
case (alu_control)
// ADD STUFF
3'b000 : alu_result = src1 + src2;
// AND STUFF
3'b010 : alu_result = src1 & src2;
// OR STUFF
3'b011 : alu_result = src1 | src2;
// SUB Stuff (rs1 - rs2)
3'b001 : alu_result = src1 + (~src2 + 1'b1);
// NON IMPLEMENTED STUFF
default: alu_result = 32'b0;
endcase
end
//
And as usual, we verify our ALU works before going anyfurther or event thinking about using it in our CPU !
# test_alu.py
# ...
@cocotb.test()
async def sub_test(dut):
await Timer(1, units="ns")
dut.alu_control.value = 0b001
for _ in range(1000):
src1 = random.randint(0,0xFFFFFFFF)
src2 = random.randint(0,0xFFFFFFFF)
# src1 = random.randint(0,0xFF)
# src2 = random.randint(0,0xFF)
# print(bin(src1)[2:].zfill(8))
# print(bin(src2)[2:].zfill(8))
# print(bin(src1 - src2)[2:].zfill(8))
dut.src1.value = src1
dut.src2.value = src2
expected = (src1 - src2) & 0xFFFFFFFF
await Timer(1, units="ns")
assert str(dut.alu_result.value) == bin(expected)[2:].zfill(32)
assert binary_to_hex(dut.alu_result.value) == hex(expected)[2:].zfill(8).upper()
assert int(str(dut.alu_result.value),2) == expected
# ...
As you can see, they are a lot of assertions and comments, as I wanted to make sure this whole 2's complement stuff worked as intended regardless of our sign interpretation. I chose to keep them in the code if you want to experiment with this testbench as well.
And now is the time to tackle the monstruosity of an instruction format :
| IMM[12] + IMM[10:5] | rs2 | rs1 |f3 | IMM[4:1] + IMM[11] | OP |
By updating the sign extender's logic to support this new imm_source = 2'b10
Tip : I Highly suggest you you pen and paper, this is true for many things, but as immediates sources gets weirder, it will be more and more helpful. By doing so, you will easily pick up the patterns and quickly get the "bitwise gymnastics".
Here is the updated sign extender logic :
// signext.sv
module signext (
// IN
input logic [24:0] raw_src,
input logic [1:0] imm_source,
// OUT (immediate)
output logic [31:0] immediate
);
logic [11:0] gathered_imm;
always_comb begin
case (imm_source)
// For I-Types
2'b00 : gathered_imm = raw_src[24:13];
// For S-types
2'b01 : gathered_imm = {raw_src[24:18],raw_src[4:0]};
// For B-types
2'b10 : gathered_imm = {raw_src[0],raw_src[23:18],raw_src[4:1],1'b0};
default: gathered_imm = 12'b0;
endcase
end
assign immediate = (imm_source == 2'b10) ?
{{20{raw_src[24]}}, gathered_imm} :
{{20{gathered_imm[11]}}, gathered_imm};
endmodule
The
value = (condition) ? <value if true> : <value is false>
statement is used here. Take the time to understand it if you are not familiar with this syntax as is allows us to avoid writting hugealways_comb
chuncks for simple "no brainer" cases.
As you can see, the immediate range is 13 bits (we add a single 0 bit at the end), because the data is byte addressed, we can add a single 0 at the end as we will never point on a single byte (8bits) offset to retrieve an instruction. Adding a 0 allows for a better range (from 12bits to 13bits) .
But why a single 0 and not two ? an instruction is 32 bits ! so the theorical minimum offset is 4Bytes not 2Bytes !
Yes and no, This allows the user to point on "half words" on 16 bits. In our case, this is not useful and will automatically be discarded by the way we implemented our memory. BUT it can be useful if the Compressed extension set is implemented, but this is out of the scope of this tutorial. (It's a pain to work with AND it's optional so let's just not do that for now).
Once again, just like the other tests, nothing new except for the bitwise gymnastics to setup the test :
# test_signext.py
# other test ...
@cocotb.test()
async def signext_b_type_test(dut):
# 100 randomized tests
for _ in range(100):
# TEST POSITIVE IMM
await Timer(100, units="ns")
imm = random.randint(0,0b011111111111)
imm <<= 1 # 13 bits signed imm ending with a 0
imm_12 = (imm & 0b1000000000000) >> 12 # 0 for now (positive)
imm_11 = (imm & 0b0100000000000) >> 11
imm_10_5 = (imm & 0b0011111100000) >> 5
imm_4_1 = (imm & 0b0000000011110) >> 1
raw_data = (imm_12 << 24) | (imm_11 << 0) | (imm_10_5 << 18) | (imm_4_1 << 1)
source = 0b10
await Timer(1, units="ns")
dut.raw_src.value = raw_data
dut.imm_source.value = source
await Timer(1, units="ns") # let it propagate ...
assert int(dut.immediate.value) == imm
# TEST NEGATIVE IMM
await Timer(100, units="ns")
imm = random.randint(0b100000000000,0b111111111111)
imm <<= 1 # 13 bits signed imm ending with a 0
imm_12 = (imm & 0b1000000000000) >> 12 # 1 (negative)
imm_11 = (imm & 0b0100000000000) >> 11
imm_10_5 = (imm & 0b0011111100000) >> 5
imm_4_1 = (imm & 0b0000000011110) >> 1
raw_data = (imm_12 << 24) | (imm_11 << 0) | (imm_10_5 << 18) | (imm_4_1 << 1)
source = 0b10
await Timer(1, units="ns")
dut.raw_src.value = raw_data
dut.imm_source.value = source
await Timer(1, units="ns") # let it propagate ...
assert int(dut.immediate.value) - (1 << 32) == imm - (1 << 13)
And just like the other signext tests, we sepearated + and - tests to assert interger value in python. (we could assert binary values, but this a good bitwise operation exercice, so why not ?)
For the datapath, we need to be able to compute a new PC using some basic add arithmetic using :
- Our brand new
immediate
source - The current
pc
To do so we get a pc_source
wire (the one we just created) from the control unit :
// cpu.sv
//...
/**
* CONTROL
*/
// ... Others signals declarations ...
wire pc_source; // NEW !
control control_unit(
.op(op),
.func3(f3),
.func7(7'b0),
.alu_zero(alu_zero),
.alu_control(alu_control),
.imm_source(imm_source),
.mem_write(mem_write),
.reg_write(reg_write),
.alu_source(alu_source),
.write_back_source(write_back_source),
.pc_source(pc_source) // NEW !
);
And use it to select our pc_next either from pc+4
or our new pc+imm
:
// cpu.sv
//..I/Os...
/**
* PROGRAM COUNTER
*/
reg [31:0] pc;
logic [31:0] pc_next;
always_comb begin : pc_select
case (pc_source)
1'b1 : pc_next = pc + immediate; // pc_target
default: pc_next = pc + 4; // pc + 4
endcase
end
always @(posedge clk) begin
if(rst_n == 0) begin
pc <= 32'b0;
end else begin
pc <= pc_next;
end
end
//..other logic...
As we touched the datapath and other logics, now is a good time to see if all testbenches are still okay. I suggest you re-test all the design after each modification by the way.
Let's now build a test program for our new beq
.
Here is the program I came up with, it runs multiple branches :
- The branch in not taken
- The branch is taken, skipping some instructions
- The branch is taken with a negative offset
- The branch is taken forward avoiding the loop
00802903 //LW TEST START : lw x18 0x8(x0) | x18 <= DEADBEEF
01202623 //SW TEST START : sw x18 0xC(x0) | 0xC <= DEADBEEF
01002983 //ADD TEST START : lw x19 0x10(x0) | x19 <= 00000AAA
01390A33 // add x20 x18 x19 | x20 <= DEADC999
01497AB3 //AND TEST START : and x21 x18 x20 | x21 <= DEAD8889
01402283 //OR TEST START : lw x5 0x14(x0) | x5 <= 125F552D
01802303 // lw x6 0x18(x0) | x6 <= 7F4FD46A
0062E3B3 // or x7 x5 x6 | x7 <= 7F5FD56F
00730663 //BEQ TEST START : beq x6 x7 0xC | #1 SHOULD NOT BRANCH
00802B03 // lw x22 0x8(x0) | x22 <= DEADBEEF
01690863 // beq x18 x22 0x10 | #2 SHOULD BRANCH (+ offset)
00000013 // nop | NEVER EXECUTED
00000013 // nop | NEVER EXECUTED
00000663 // beq x0 x0 0xC | #4 SHOULD BRANCH (avoid loop)
00002B03 // lw x22 0x0(x0) | x22 <= AEAEAEAE
FF6B0CE3 // beq x22 x22 -0x8 | #3 SHOULD BRANCH (-offset)
00000013 // nop | FINAL NOP
00000013 //NOP
00000013 //NOP
We don't need to intervene on data memory, let's go right to the test case.
# test_cpu.py
# ...
@cocotb.test()
async def cpu_insrt_test(dut):
# ...
##################
# BEQ TEST
# 00730663 //BEQ TEST START : beq x6 x7 0xC | #1 SHOULD NOT BRANCH
# 00802B03 // lw x22 0x8(x0) | x22 <= DEADBEEF
# 01690863 // beq x18 x22 0x10 | #2 SHOULD BRANCH (+ offset)
# 00000013 // nop | NEVER EXECUTED
# 00000013 // nop | NEVER EXECUTED
# 00000663 // beq x0 x0 0xC | #4 SHOULD BRANCH (avoid loop)
# 00002B03 // lw x22 0x0(x0) | x22 <= AEAEAEAE
# FF6B0CE3 // beq x22 x22 -0x8 | #3 SHOULD BRANCH (-offset)
# 00000013 // nop | FINAL NOP
##################
print("\n\nTESTING BEQ\n\n")
assert binary_to_hex(dut.instruction.value) == "00730663"
await RisingEdge(dut.clk) # beq x6 x7 0xC NOT TAKEN
# Check if the current instruction is the one we expected
assert binary_to_hex(dut.instruction.value) == "00802B03"
await RisingEdge(dut.clk) # lw x22 0x8(x0)
assert binary_to_hex(dut.regfile.registers[22].value) == "DEADBEEF"
await RisingEdge(dut.clk) # beq x18 x22 0x10 TAKEN
# Check if the current instruction is the one we expected
assert binary_to_hex(dut.instruction.value) == "00002B03"
await RisingEdge(dut.clk) # lw x22 0x0(x0)
assert binary_to_hex(dut.regfile.registers[22].value) == "AEAEAEAE"
await RisingEdge(dut.clk) # beq x22 x22 -0x8 TAKEN
# Check if the current instruction is the one we expected
assert binary_to_hex(dut.instruction.value) == "00000663"
await RisingEdge(dut.clk) # beq x0 x0 0xC TAKEN
# Check if the current instruction is the one we expected
assert binary_to_hex(dut.instruction.value) == "00000013"
And the test passes ! (after correcting some typos of course haha).
Still in the "changing the pc" theme, I'd like to introduce the J-Type instructions :
IMM[20] + IMM[10:1] + IMM[11] + IMM[19:12] | rd | OP |
---|---|---|
XXXXXXXXXXXXXXXXXXXX | XXXXX | 1101111 |
According to the RV32 Table, jal
is the only instruction that uses the J-Type for the RV32I ISA.
And yes, the immediate is absolutely terrible.
So ? What does it do ?
The idea behind this instruction is to jump without a condition and store PC+4 in a return register rd
(usually x1
)to come back to it later, often used to call a function for example.
so, we pretty much have nothing to check as this is an unconditional event that will store PC_target
.
into rd
.
Here is what we want to implement :
So,
- We need a new
imm_source
- We need a new
write_back_source
- Update the control unit accordingly
Don't mind the last 2'b11 value for the write_back mux on the scheme, we reserve that one for later ;)
Tip : Just like
beq
(and many other things...) using pen and paper is strongly recommended to write HDL and tests !
As usual, we add this imm source to our HDL Code. I grabbed this opportunity to heavily re-factor the signext code as immediates sizes were now too different from one to another :
// signext.sv
module signext (
// IN
input logic [24:0] raw_src,
input logic [1:0] imm_source,
// OUT (immediate)
output logic [31:0] immediate
);
always_comb begin
case (imm_source)
// For I-Types
2'b00 : immediate = {{20{raw_src[24]}}, raw_src[24:13]};
// For S-types
2'b01 : immediate = {{20{raw_src[24]}},raw_src[24:18],raw_src[4:0]};
// For B-types
2'b10 : immediate = {{20{raw_src[24]}},raw_src[0],raw_src[23:18],raw_src[4:1],1'b0};
// For J-types
2'b11 : immediate = {{12{raw_src[24]}},
aw_src[12:5],
raw_src[13],
raw_src[23:14],
1'b0};
default: immediate = 12'b0;
endcase
end
endmodule
This module is now shorter and makes more sense. Let's move on to verifying it.
First, run the old tests to make sure our HDL code chages did not affect the underlying logic.
And then, to verify our new immediate, we do as usual: using some bitwise gymnastic, a pen and some paper :
# test_signext.sv
# ...
@cocotb.test()
async def signext_j_type_test(dut):
# 100 randomized tests
for _ in range(100):
# TEST POSITIVE IMM
await Timer(100, units="ns")
imm = random.randint(0,0b01111111111111111111)
imm <<= 1 # 21 bits signed imm ending with a 0
imm_20 = (imm & 0b100000000000000000000) >> 20
imm_19_12 = (imm & 0b011111111000000000000) >> 12
imm_11 = (imm & 0b000000000100000000000) >> 11
imm_10_1 = (imm & 0b000000000011111111110) >> 1
raw_data = (imm_20 << 24) | (imm_19_12 << 5) | (imm_11 << 13) | (imm_10_1 << 14)
source = 0b11
await Timer(1, units="ns")
dut.raw_src.value = raw_data
dut.imm_source.value = source
await Timer(1, units="ns") # let it propagate ...
assert int(dut.immediate.value) == imm
# TEST NEGATIVE IMM
await Timer(100, units="ns")
imm = random.randint(0b10000000000000000000,0b11111111111111111111)
imm <<= 1 # 21 bits signed imm ending with a 0
imm_20 = (imm & 0b100000000000000000000) >> 20
imm_19_12 = (imm & 0b011111111000000000000) >> 12
imm_11 = (imm & 0b000000000100000000000) >> 11
imm_10_1 = (imm & 0b000000000011111111110) >> 1
raw_data = (imm_20 << 24) | (imm_19_12 << 5) | (imm_11 << 13) | (imm_10_1 << 14)
source = 0b11
await Timer(1, units="ns")
dut.raw_src.value = raw_data
dut.imm_source.value = source
await Timer(1, units="ns") # let it propagate ...
assert int(dut.immediate.value) - (1 << 32) == imm - (1 << 21)
Here is a recap of what we'll need to do now for the control unit (ignore "I-Type ALU" for now)
(figures from Digital Design and Computer Architecture, RISC-V Edition, and as usual, you can find all the definitive signal in the Holy_Reference_Tables.pdf
file.
Let's :
- add a
jump
signal - modify the
write_back_source
signal - add a whole control logic for
J-type
// constrol.sv
`timescale 1ns/1ps
module control (
// I/Os
);
/**
* MAIN DECODER
*/
logic [1:0] alu_op;
logic branch;
logic jump;
always_comb begin
case (op)
// ... (note that the "write_back_source" was updated for other control cases)
// J-type (NEW !)
7'b1101111 : begin
reg_write = 1'b1;
imm_source = 2'b11;
mem_write = 1'b0;
write_back_source = 2'b10; //pc_+4
branch = 1'b0;
jump = 1'b1;
end
// ...
endcase
end
/**
* ALU DECODER
*/
// ...
/**
* PC_Source
*/
assign pc_source = (alu_zero & branch) | jump; // NEW !
endmodule
To verify this new design, we first need to update our tests cases by replacing :
# test_control.py
assert dut.write_back_source.value == "X"
By :
# test_control.py
assert dut.write_back_source.value == "XX"
And add our jal
test case :
# test_control.py
# ...
@cocotb.test()
async def jal_control_test(dut):
await set_unknown(dut)
# TEST CONTROL SIGNALS FOR JAL
await Timer(10, units="ns")
dut.op.value = 0b1101111 # J-TYPE
await Timer(1, units="ns")
assert dut.imm_source.value == "11"
assert dut.mem_write.value == "0"
assert dut.reg_write.value == "1"
assert dut.branch.value == "0"
assert dut.jump.value == "1"
assert dut.pc_source.value == "1"
assert dut.write_back_source.value == "10"
For the datapath, we shall simply update the write_back_source
wire to support 2bits and also update the result_src mux.
// cpu.sv
// I/Os
/**
* PROGRAM COUNTER
*/
reg [31:0] pc;
logic [31:0] pc_next;
logic [31:0] pc_target;
logic [31:0] pc_plus_four;
assign pc_target = pc + immediate;
assign pc_plus_four = pc + 4;
always_comb begin : pc_select
case (pc_source)
1'b1 : pc_next = pc_target; // pc_target
default: pc_next = pc_plus_four; // pc + 4
endcase
end
always @(posedge clk) begin
if(rst_n == 0) begin
pc <= 32'b0;
end else begin
pc <= pc_next;
end
end
/**
* CONTROL
*/
// ...
wire [1:0] write_back_source; // CHANGED !
// ...
/**
* REGFILE
*/
//...
always_comb begin : write_back_source_select
case (write_back_source)
// CHANGED !
2'b00: write_back_data = alu_result;
2'b01: write_back_data = mem_read;
2'b10: write_back_data = pc_plus_four; // NEW !
endcase
end
//...
As you can see, I also refactored the PC code using logic signals for values instead of hardcoded values.
As usual, let's write a little program to test our new jal
instruction, and add it to our main test program :
00802903 //LW TEST START : lw x18 0x8(x0) | x18 <= DEADBEEF
01202623 //SW TEST START : sw x18 0xC(x0) | 0xC <= DEADBEEF
01002983 //ADD TEST START : lw x19 0x10(x0) | x19 <= 00000AAA
01390A33 // add x20 x18 x19 | x20 <= DEADC999
01497AB3 //AND TEST START : and x21 x18 x20 | x21 <= DEAD8889
01402283 //OR TEST START : lw x5 0x14(x0) | x5 <= 125F552D
01802303 // lw x6 0x18(x0) | x6 <= 7F4FD46A
0062E3B3 // or x7 x5 x6 | x7 <= 7F5FD56F
00730663 //BEQ TEST START : beq x6 x7 0xC | #1 SHOULD NOT BRANCH
00802B03 // lw x22 0x8(x0) | x22 <= DEADBEEF
01690863 // beq x18 x22 0x10 | #2 SHOULD BRANCH
00000013 // nop | NEVER EXECUTED
00000013 // nop | NEVER EXECUTED
00000663 // beq x0 x0 0xC | #4 SHOULD BRANCH
00002B03 // lw x22 0x0(x0) | x22 <= AEAEAEAE
FF6B0CE3 // beq x22 x22 -0x8 | #3 SHOULD BRANCH
00000013 // nop | FINAL NOP
00C000EF //JAL TEST START : jal x1 0xC | #1 jump @PC+0xC
00000013 // nop | NEVER EXECUTED
00C000EF // jal x1 0xC | #2 jump @PC-0x4
FFDFF0EF // jal x1 0x-4 | #2 jump @PC-0x4
00000013 // nop | NEVER EXECUTED
00C02383 // lw x7 0xC(x0) | x7 <= DEADBEEF
00000013 //NOP
00000013 //NOP
// ...
Just like beq
I test out the "forward" (positive immediate) and "backward" (negative immediate) jumps.
Once again, no need to add new data in the data memory hex ROM file.
Our program is starting to look beefy ! I am proud of how far we've come ! Next time I will truncate it ;)
And now let's write a test bench :
# test_cpu.py
# ...
@cocotb.test()
async def cpu_insrt_test(dut):
# ...
##################
# 00C000EF //JAL TEST START : jal x1 0xC | #1 jump @PC+0xC | PC 0x44
# 00000013 // nop | NEVER EXECUTED | PC 0x48
# 00C000EF // jal x1 0xC | #2 jump @PC-0x4 | PC 0x4C
# FFDFF0EF // jal x1 -4 | #2 jump @PC-0x4 | PC 0x50
# 00000013 // nop | NEVER EXECUTED | PC 0x54
# 00C02383 // lw x7 0xC(x0) | x7 <= DEADBEEF | PC 0x58
##################
print("\n\nTESTING JAL\n\n")
# Check test's init state
assert binary_to_hex(dut.instruction.value) == "00C000EF"
assert binary_to_hex(dut.pc.value) == "00000044"
await RisingEdge(dut.clk) # jal x1 0xC
# Check new state & ra (x1) register value
assert binary_to_hex(dut.instruction.value) == "FFDFF0EF"
assert binary_to_hex(dut.pc.value) == "00000050"
assert binary_to_hex(dut.regfile.registers[1].value) == "00000048" # stored old pc + 4
await RisingEdge(dut.clk) # jal x1 -4
# Check new state & ra (x1) register value
assert binary_to_hex(dut.instruction.value) == "00C000EF"
assert binary_to_hex(dut.pc.value) == "0000004C"
assert binary_to_hex(dut.regfile.registers[1].value) == "00000054" # stored old pc + 4
await RisingEdge(dut.clk) # jal x1 0xC
# Check new state & ra (x1) register value
assert binary_to_hex(dut.instruction.value) == "00C02383"
assert binary_to_hex(dut.pc.value) == "00000058"
assert binary_to_hex(dut.regfile.registers[1].value) == "00000050" # stored old pc + 4
await RisingEdge(dut.clk) # lw x7 0xC(x0)
assert binary_to_hex(dut.regfile.registers[7].value) == "DEADBEEF"
And it works ! J-Type
instructions are now supported.
Great ! Our CPU is capable of doing many things ! We only have one instruction type left before being done with all the types. But right before moving on to that (see part 7), we'll implement the addi
instruction.
Why ? This is to show you how convinient to add new instrcution now that we have most of the logic written down.
addi
is an I-Type
instruction that does this :
addi rd, rs, imm # add a 12 bits immediate to rs1 and stores it into rd
And here is how this instruction is structured :
IMM[11:0] | rs1 | f3 | rd | OP |
---|---|---|---|---|
XXXXXXXXXXXX | XXXXX | 000 | XXXXX | 0010011 |
So yes, the op isn't quite the same as lw
but still, it is an I-Type
instruction (the ones that do not interact with memory). So to differenciate these from regular "memory interfering" I-Types
, we'll call these : ALU I-Types
.
Here is the very long list of what we need to do :
- Update control
That's it ! Because when you think about it, we already have everything we need !
- An ALU able to add stuff toghether
- A sign extender able to get it's immediates from
I-Type
- ...
So let's get to work !
Here is what I am basing my signals on (from Harris' DDCA Risc-V edition Book) :
You can find the full list of definitive signals in the
Holy_Reference_Tables.pdf
file.
// control.sv
//...
// ALU I-type
7'b0010011 : begin
reg_write = 1'b1;
imm_source = 2'b00;
alu_source = 1'b1; //imm
mem_write = 1'b0;
alu_op = 2'b10;
write_back_source = 1'b00; //alu_result
branch = 1'b0;
jump = 1'b0;
end
//...
Only the source of the write_back changes ! Which makes sense ! Let's test that without further ado !
# test_control.py
#...
@cocotb.test()
async def addi_control_test(dut):
await set_unknown(dut)
# TEST CONTROL SIGNALS FOR ADDI
await Timer(10, units="ns")
dut.op.value = 0b0010011 # I-TYPE
dut.func3.value = 0b000 # addi
await Timer(1, units="ns")
# Logic block controls
assert dut.alu_control.value == "000" # we seek to add
assert dut.imm_source.value == "00"
assert dut.mem_write.value == "0"
assert dut.reg_write.value == "1"
# Datapath mux sources
assert dut.alu_source.value == "1"
assert dut.write_back_source.value == "00"
assert dut.pc_source.value == "0"
And the control is ready to tell our CPU what to do ! Let's hop on to using it into our test program !
The programm will be only 2 instruction :
- One with a positive immediate
- Another with a negative immediate
As I said before, I will not show the rest of the program here as it is getting pretty chunky :
//...
00000013 // nop | NEVER EXECUTED
00C02383 // lw x7 0xC(x0) | x7 <= DEADBEEF
1AB38D13 //ADDI TEST START : addi x26 x7 0x1AB | x26 <= DEADC09A
F2130C93 // addi x25 x6 0xF21 | x25 <= DEADBE10
00000013 //NOP
//...
And here is the testbench in python to test for these expected results & behavior :
# test_cpu.sv
# ...
##################
# ADDI TEST
# 1AB38D13 // addi x26 x7 0x1AB | x26 <= DEADC09A
# F2130C93 // addi x25 x6 0xF21 | x25 <= DEADBE10
##################
print("\n\nTESTING ADDI\n\n")
# Check test's init state
assert binary_to_hex(dut.instruction.value) == "1AB38D13"
assert not binary_to_hex(dut.regfile.registers[26].value) == "DEADC09A"
await RisingEdge(dut.clk) # addi x26 x7 0x1AB
assert binary_to_hex(dut.instruction.value) == "F2130C93"
assert binary_to_hex(dut.regfile.registers[26].value) == "DEADC09A"
await RisingEdge(dut.clk) # addi x25 x6 0xF21
assert binary_to_hex(dut.regfile.registers[25].value) == "7F4FD38B"
And it works ! GG ! This addi
instruction can also be used as li
aka : Load Immediate because if we take x0 as a source register, the immediate will imediatly get stored in the rd
(destination register).
Instruction like
li
are known as pseudo-instruction as they are just a more convinient name for another instruction, which means that, yes, assembly does have it's lots of abstraction layers ;)
But the immediate here is only 12 bits ! What if I need to load 32 bits in my register, am I doomed to write all my 32 bits data in memory to load it ?
NO ! There are convinient instruction out there to load the upper 20 bits of an immediate, and they are called U-Types
! See you in part 7 to implement them !
Okay ! Our CPU is now capable of doing many things ! And as we implemented most of the instructions types, adding new instructions should be fairly easy.
BUT WAIT ! An instruction type is missing !
And you are right indeed ! If we take a look at the base RV32 I Types table, we see that we still have to implement U-Type
! But what is U-Type
exactly ?
Well U-type only has 2 instrctions under its belt :
lui
(load upper immediate in an rd)auipc
(load upper immediate in an rd + PC)
These instructions allows us to setup the upper 20bits of a register epecially useful for memory stuff. (Take a look here to learn more).
Here is how they look like :
Instruction name | IMM[31:12] | rd | OP |
---|---|---|---|
lui |
XXXXXXXXXXXXXXXXXXXX | XXXXX | 0110111 |
auipc |
XXXXXXXXXXXXXXXXXXXX | XXXXX | 0010111 |
Notice there is a changing bit in the OP codes.
Even though these instructions sounds (and look) simple, implementing them will not be as easy as
addi
!
Well we shall do multiple things this time :
- Add a new immediate source
- Update sign ext
- Add a new result source
- Update data path
- Update the control accordignly
- Add a new source signal for the pc target arithmetic
Here is a scheme of what we'll implement :
The mux before pc target chooses between :
- 32'b0 for lui
- PC for auipc
There also is a new source for the second_adder to add to the decoder. (don't mind the empty mux input for the second add, we keep this one for later)
These modifs will make imm_source
3 bits wide instead of 2. So we'll have to update things accordingly. (This will sometimes happen, remember to always update everything accordingly).
Let's get started with the signext logic. Here is the updated code :
// signext.sv
// IOs ...
always_comb begin
case (imm_source)
// For I-Types
3'b000 : immediate = {{20{raw_src[24]}}, raw_src[24:13]};
// For S-types
3'b001 : immediate = {{20{raw_src[24]}},raw_src[24:18],raw_src[4:0]};
// For B-types
3'b010 : immediate = {{20{raw_src[24]}},raw_src[0],raw_src[23:18],raw_src[4:1],1'b0};
// For J-types
3'b011 : immediate = {{12{raw_src[24]}},
raw_src[12:5],
raw_src[13],
raw_src[23:14],
1'b0};
// For U-Types
3'b100 : immediate = {raw_src[24:5],12'b000000000000};
default: immediate = 12'b0;
endcase
end
//..
And here is a simple test to verify the outputs :
# test_signext.py
# ...
@cocotb.test()
async def signext_u_type_test(dut):
# 100 randomized tests
for _ in range(100):
# TEST POSITIVE NEGATIVE IMM
await Timer(100, units="ns")
imm_31_12 = random.randint(0,0b11111111111111111111)
raw_data = (imm_31_12 << 5)
# add random junk to the raw_data to see if it is indeed discarded
random_junk = random.randint(0,0b11111)
raw_data |= random_junk
source = 0b100 # you'll have to update this in other tests too
await Timer(1, units="ns")
dut.raw_src.value = raw_data
dut.imm_source.value = source
await Timer(1, units="ns") # let it propagate ...
assert int(dut.immediate.value) == imm_31_12 << 12
# ...
So, as stated before, we need to update the imm_source
signals (make them 3 bits) add a new source for the add arithmetic (we'll call it second_add_source
to stay as simple and explicit as possible).
Here is the updated control unit, do not forget to update all the imm_source. The main thing here is the newly added second_add_source
:
// control.sv
`timescale 1ns/1ps
module control (
// I/Os...
output logic second_add_source // NEW !
);
/**
* MAIN DECODER
*/
logic [1:0] alu_op;
logic branch;
logic jump;
always_comb begin
case (op)
// I-type
7'b0000011 : begin
imm_source = 3'b000; // CHANGED
// ...
end
// ALU I-type
7'b0010011 : begin
imm_source = 3'b000; // CHANGED
// ...
end
// S-Type
7'b0100011 : begin
imm_source = 3'b001; // CHANGED
// ...
end
// R-Type
7'b0110011 : begin
// ...
end
// B-type
7'b1100011 : begin
imm_source = 3'b010; // CHANGED
// ...
end
// J-type
7'b1101111 : begin
imm_source = 3'b011; // CHANGED
// ...
end
// U-type NEW !
7'b0110111, 7'b0010111 : begin
imm_source = 3'b100;
mem_write = 1'b0;
reg_write = 1'b1;
write_back_source = 2'b11;
branch = 1'b0;
jump = 1'b0;
case(op[5])
1'b1 : second_add_source = 1'b1; // lui
1'b0 : second_add_source = 1'b0; // auipc
endcase
end
// ...
endcase
end
//...
endmodule
Notice that we used the changing bit between the two OP Codes to discriminate both instructions to determine second_add_source
.
Verification is pretty simple, here is the resulting test bench :
Don't forget to update the
imm_source
assertions to 3 bits or the test will not pass !
# test_control.py
# ...
@cocotb.test()
async def auipc_control_test(dut):
await set_unknown(dut)
# TEST CONTROL SIGNALS FOR AUIPC
await Timer(10, units="ns")
dut.op.value = 0b0010111 # U-TYPE (auipc)
await Timer(1, units="ns")
# Logic block controls
assert dut.imm_source.value == "100"
assert dut.mem_write.value == "0"
assert dut.reg_write.value == "1"
assert dut.write_back_source.value == "11"
assert dut.branch.value == "0"
assert dut.jump.value == "0"
assert dut.second_add_source.value == "0"
Now let's implement and test the datapath ! Refer to the previous scheme (see 7.1).
Here is an overview of the updates we shall make :
// cpu.sv
module cpu (
// I/Os
);
/**
* PROGRAM COUNTER
*/
reg [31:0] pc;
logic [31:0] pc_next;
logic [31:0] pc_plus_second_add;
logic [31:0] pc_plus_four;
assign pc_plus_four = pc + 4;
// NEW LOGIC BELOW !
always_comb begin : pc_select
case (pc_source)
1'b0 : pc_next = pc_plus_four; // pc + 4
1'b1 : pc_next = pc_plus_second_add;
endcase
end
always_comb begin : second_add_select
case (second_add_source)
1'b0 : pc_plus_second_add = pc + immediate;
1'b1 : pc_plus_second_add = immediate;
endcase
end
/**
* INSTRUCTION MEMORY
*/
// ...
/**
* CONTROL
*/
// ...
wire second_add_source; // NEW !
control control_unit(
.second_add_source(second_add_source) // ...
);
/**
* REGFILE
*/
//...
logic [31:0] write_back_data;
always_comb begin : write_back_source_select
case (write_back_source)
// ...
2'b11: write_back_data = pc_plus_second_add; // NEW !
endcase
end
regfile regfile(
// ...
);
/**
* SIGN EXTEND
*/
// ...
/**
* ALU
*/
//...
/**
* DATA MEMORY
*/
// ...
endmodule
As you can see, we also add a MUX case for the write_back_source
signal and renamed pc_target
for pc_plus_second_add
. (Which does't really make sense but we're here to be explicit in our naming so this will have to do !).
By the way, here is a little fun fact : we still don't use F7 at all in our CPU ! We'll use it to discriminate R-Types don't worry ;)
Warning ! You have to make 2
always_comb
block, otherwise it does not work ! Idk if it's about simulation or something, but if you put both second_add_select and pc_select muxes in the samealways_comb
, it just does not update the pc correctly for thejal
and the simulation crashes ! (yes it took me a whole hour to figure this sh*t out haha).
As usual , let's update our main program and make assertions on the results based ont the RISC-V ISA.
//test_imemory.hex
//...
1F1FA297 //AUIPC TEST START : auipc x5 0x1F1FA | x5 <= 1F1FA064 PC 0x64
2F2FA2B7 //LUI TEST START : lui x5 0x2F2FA | x5 <= 2F2FA000
00000013 //NOP
00000013 //NOP
00000013 //NOP
//...
And we add these tests cases to our CPU testbench :
# test_cpu.py
#...
@cocotb.test()
async def cpu_insrt_test(dut):
# Other tests ...
##################
# AUIPC TEST (PC befor is 0x64)
# 1F1FA297 //AUIPC TEST START : auipc x5 0x1F1FA | x5 <= 1F1FA064
##################
print("\n\nTESTING AUIPC\n\n")
# Check test's init state
assert binary_to_hex(dut.instruction.value) == "1F1FA297"
await RisingEdge(dut.clk) # auipc x5 0x1F1FA
assert binary_to_hex(dut.regfile.registers[5].value) == "1F1FA064"
##################
# LUI TEST
# 2F2FA2B7 //LUI TEST START : lui x5 0x2F2FA | x5 <= 2F2FA000
##################
print("\n\nTESTING LUI\n\n")
# Check test's init state
assert binary_to_hex(dut.instruction.value) == "2F2FA2B7"
await RisingEdge(dut.clk) # lui x5 0x2F2FA
assert binary_to_hex(dut.regfile.registers[5].value) == "2F2FA000"
And it works ! two birds with one stone !
This part will contain less details as I assume you now have enough experience to understand problems and fixes more easily.
Okay, we just implemented our last instruction type ! Meaning we now have a pretty robust datapath on which we can implement more instructions (hopefully) without much of a problem !
Now that we enter the "DIY" realm (given DDCA lectures & book stopped covering what we're doing since we did the U-Types
), here is a reminder you can find all the signals encodings, their meanings, etc... in the Holy_Reference_Tables.pdf
file.
In this sub-part, we'll use our datapath to rush the implementation of :
slti
(set less than immediate)sltiu
(set less than immediate unsigned)xori
ori
andi
slli
(shift left logical by a 5bits MAX shamt)srli
(shift right logical by a 5bits MAX shamt)srai
(shift right arithmetic by a 5bits MAX shamt)
what does slti do ?
IMM [11:0] | rs1 | f3 | rd | op |
---|---|---|---|---|
XXXXXXXXXXXX | XXXXX | 010 | XXXXX | 0010011 |
slti rd rs1 0xXXX
According to the RISC-V ISA Manual :
SLTI (set less than immediate) places the value 1 in register rd if register rs1 is less than the sign- extended immediate when both are treated as signed numbers, else 0 is written to rd.
To do that, we need new arithmetic in the ALU : a comparator that will return either 1 or 0 depending on wether or not the srcA (rs1) is smaller than srcB (the immediate because alu_src will the alu_source will be "immediate").
According to the Holy_Reference_Tables.pdf
file, we allocate the value 101
to our ALU_control for this comparison operation.
The logic is pretty straight-forward :
// alu.sv
// ...
// LESS THAN COMPARE STUFF (src1 < src2)
3'b101 : alu_result = {31'b0, $signed(src1) < $signed(src2)};
// ...
Don't forget to specify $signed(XXX)
otherwise you are not implementing the right logic !
As well as the verification :
# test_alu.py
@cocotb.test()
async def slt_test(dut):
await Timer(1, units="ns")
dut.alu_control.value = 0b101
for _ in range(1000):
src1 = random.randint(0,0xFFFFFFFF)
src2 = random.randint(0,0xFFFFFFFF)
dut.src1.value = src1
dut.src2.value = src2
await Timer(1, units="ns")
# if scr1 pos, src2 pos
if src1 >> 31 == 0 and src2 >> 31 == 0:
expected = int(src1 < src2)
# if scr1 pos, src2 neg
elif src1 >> 31 == 0 and src2 >> 31 == 1:
expected = int(src1 < (src2 - (1<<32)))
# if scr1 neg, src2 pos
elif src1 >> 31 == 1 and src2 >> 31 == 0:
expected = int((src1 - (1<<32)) < src2)
# if scr1 neg, src2 neg
elif src1 >> 31 == 1 and src2 >> 31 == 1:
expected = int((src1 - (1<<32)) < (src2 - (1<<32)))
assert int(dut.alu_result.value) == expected
assert dut.alu_result.value == 31*"0" + str(int(dut.alu_result.value))
I also added a raw binary check to check for the raw formating of the information. There are some tests to compare different immediates scenarios because slti
treats both sources as signed.
As usual, we update our control, please refer to the Holy_Reference_Tables.pdf
file.
We use previous muxes that have been layed our for R-Type
and I-Type (alu)
, and we use the f3 of the instruction to figure out the arithmetic (which is what we'll do for (almost) all I-Types
and R-Types
):
// control.sv
// ...
/**
* ALU DECODER
*/
always_comb begin
case (alu_op)
// LW, SW
2'b00 : alu_control = 3'b000;
// R-Types
2'b10 : begin
case (func3)
// ADD (and later SUB with a different F7)
3'b000 : alu_control = 3'b000;
// AND
3'b111 : alu_control = 3'b010;
// OR
3'b110 : alu_control = 3'b011;
// SLTI
3'b010 : alu_control = 3'b101; // NEW !
endcase
end
// BEQ
2'b01 : alu_control = 3'b001;
// EVERYTHING ELSE
default: alu_control = 3'b111;
endcase
end
// ...
Here is the tb. As usual for control : basic assertions
(Almost copy-pasted from our previous addi
test case) :
# test_control.py
# ...
@cocotb.test()
async def slti_control_test(dut):
await set_unknown(dut)
# TEST CONTROL SIGNALS FOR SLTI
await Timer(10, units="ns")
dut.op.value = 0b0010011 # I-TYPE (alu)
dut.func3.value = 0b010 # slti
await Timer(1, units="ns")
# Logic block controls
assert dut.alu_control.value == "101"
assert dut.imm_source.value == "000"
assert dut.mem_write.value == "0"
assert dut.reg_write.value == "1"
# Datapath mux sources
assert dut.alu_source.value == "1"
assert dut.write_back_source.value == "00"
assert dut.pc_source.value == "0"
Here is what I'll add to the test program :
//...
FFF9AB93 //SLTI TEST START : slti x23 x19 0xFFF | x23 <= 00000000
001BAB93 // slti x23 x23 0x001 | x23 <= 00000001
//..
This one may be a bit tricky : If you recall, x19 is set to 0x00000AAA via lw
at the beggining of the program.
Given that slti
uses signed values, 0xFFF is -1 ! which is why x23 <= 00000000
.
After that, I test the more obvious 0 (x23) < 1 (0x001)
for a less entricate test case.
Here here is the updated test_bench :
# test_cpu.py
# ...
@cocotb.test()
async def cpu_insrt_test(dut):
# ...
##################
# FFF9AB93 //SLTI TEST START : slti x23 x19 0xFFF | x23 <= 00000000
# 001BAB93 // slti x23 x23 0x001 | x23 <= 00000001
##################
print("\n\nTESTING SLTI\n\n")
# Check test's init state
assert binary_to_hex(dut.regfile.registers[19].value) == "00000AAA"
assert binary_to_hex(dut.instruction.value) == "FFF9AB93"
await RisingEdge(dut.clk) # slti x23 x19 0xFFF
assert binary_to_hex(dut.regfile.registers[23].value) == "00000000"
await RisingEdge(dut.clk) # slti x23 x23 0x001
assert binary_to_hex(dut.regfile.registers[23].value) == "00000001"
IMM [11:0] | rs1 | f3 | rd | op |
---|---|---|---|---|
XXXXXXXXXXXX | XXXXX | 011 | XXXXX | 0010011 |
sltiu rd rs1 0xXXX
sltiu
does exactly the same thing, except it treats both the source register and the immediate as unsigned values :
SLTI (set less than immediate) places the value 1 in register rd if register rs1 is less than the sign- extended immediate when both are treated as signed numbers, else 0 is written to rd. SLTIU is similar but compares the values as unsigned numbers (i.e., the immediate is first sign-extended to XLEN bits then treated as an unsigned number)
As you can read above (from the RISC-V USER MANUAL), the immediate is first sign extended and then treated as unsigned. It's weird but it also allows us to avoid adding an imm_source
so I'll take it.
Check out the Holy_Reference_Tables.pdf
file for signals values and references.
We simply add this statement in the alu logic :
// alu.sv
// ...
// LESS THAN COMPARE STUFF (src1 < src2) (UNSIGNED VERSION)
3'b111 : alu_result = {31'b0, src1 < src2};
// ...
And we verify using this testbench, slightly simpler than the signed version as every number here is treated as positive :
# test_alu.py
# ...
@cocotb.test()
async def sltu_test(dut):
await Timer(1, units="ns")
dut.alu_control.value = 0b111
for _ in range(1000):
src1 = random.randint(0,0xFFFFFFFF)
src2 = random.randint(0,0xFFFFFFFF)
dut.src1.value = src1
dut.src2.value = src2
await Timer(1, units="ns")
expected = int(src1 < src2)
assert dut.alu_result.value == 31*"0" + str(int(dut.alu_result.value))
Following the Holy_Reference_Tables.pdf
file., here is the updated ALU decoder :
// control.sv
/**
* ALU DECODER
*/
// ...
always_comb begin
case (alu_op)
// LW, SW
2'b00 : alu_control = 3'b000;
// R-Types
2'b10 : begin
case (func3)
// ADD (and later SUB with a different F7)
3'b000 : alu_control = 3'b000;
// AND
3'b111 : alu_control = 3'b010;
// OR
3'b110 : alu_control = 3'b011;
// SLTI
3'b010 : alu_control = 3'b101;
// SLTIU
3'b011 : alu_control = 3'b111; // NEW !
endcase
end
// BEQ
2'b01 : alu_control = 3'b001;
// EVERYTHING ELSE
default: alu_control = 3'b111;
endcase
end
// ...
And here are the assertions for the tb :
# test_control.py
# ...
@cocotb.test()
async def sltiu_control_test(dut):
await set_unknown(dut)
# TEST CONTROL SIGNALS FOR SLTI
await Timer(10, units="ns")
dut.op.value = 0b0010011 # I-TYPE (alu)
dut.func3.value = 0b011 # sltiu
await Timer(1, units="ns")
# Logic block controls
assert dut.alu_control.value == "111"
assert dut.imm_source.value == "000"
assert dut.mem_write.value == "0"
assert dut.reg_write.value == "1"
# Datapath mux sources
assert dut.alu_source.value == "1"
assert dut.write_back_source.value == "00"
assert dut.pc_source.value == "0"
Here is the test program I'll use to make assertions on the CPU behavior :
//...
FFF9BB13 //SLTIU TEST START : sltiu x22 x19 0xFFF | x22 <= 00000001
0019BB13 // sltiu x22 x19 0x001 | x22 <= 00000000
//...
And here is the corresponding test bench snippet :
# test_cpu.py
# ...
@cocotb.test()
async def cpu_insrt_test(dut):
# ...
##################
# FFF9BB13 //SLTIU TEST START : sltiu x22 x19 0xFFF | x22 <= 00000001
# 0019BB13 // sltiu x22 x19 0x001 | x22 <= 00000000
##################
print("\n\nTESTING SLTIU\n\n")
# Check test's init state
assert binary_to_hex(dut.instruction.value) == "FFF9BB13"
await RisingEdge(dut.clk) # sltiu x22 x19 0xFFF
assert binary_to_hex(dut.regfile.registers[22].value) == "00000001"
await RisingEdge(dut.clk) # sltiu x22 x19 0x001
assert binary_to_hex(dut.regfile.registers[22].value) == "00000000"
IMM [11:0] | rs1 | f3 | rd | op |
---|---|---|---|---|
XXXXXXXXXXXX | XXXXX | 100 | XXXXX | 0010011 |
xori rd rs1 0xXXX
The problem here is than we encoded our alu_control
signal on 3 bits until now. And we just ran out of encoding to add another instruction. so we'll add some ! Keep in mind that we have to update the signal width in both :
- The HDL
- The testbench binary assertions
// alu.sv
// ...
// XOR STUFF
4'b1000 : alu_result = src1 ^ src2;
// ...
Don't forget to update all the bit widths in this file !
The testbench technique does not differ from the usual one :
# test_alu.py
# ...
@cocotb.test()
async def xor_test(dut):
await Timer(1, units="ns")
dut.alu_control.value = 0b1000 #xor
for _ in range(1000):
src1 = random.randint(0,0xFFFFFFFF)
src2 = random.randint(0,0xFFFFFFFF)
dut.src1.value = src1
dut.src2.value = src2
await Timer(1, units="ns")
expected = src1 ^ src2
assert int(dut.alu_result.value) == int(expected)
Don't forget to update all the bit widths in this file !
Here is the updated control :
// control.sv
always_comb begin
case (alu_op)
// LW, SW
2'b00 : alu_control = 4'b0000;
// R-Types
2'b10 : begin
case (func3)
// ADD (and later SUB with a different F7)
3'b000 : alu_control = 4'b0000;
// AND
3'b111 : alu_control = 4'b0010;
// OR
3'b110 : alu_control = 4'b0011;
// SLTI
3'b010 : alu_control = 4'b0101;
// SLTIU
3'b011 : alu_control = 4'b0111;
// XOR
3'b100 : alu_control = 4'b1000;
endcase
end
// BEQ
2'b01 : alu_control = 4'b0001;
endcase
end
Don't forget to update all the bit widths for the
alu_control
output as well !
And the testbench, as usual :
# test_control.py
# ...
@cocotb.test()
async def xori_control_test(dut):
await set_unknown(dut)
# TEST CONTROL SIGNALS FOR XORI
await Timer(10, units="ns")
dut.op.value = 0b0010011 # I-TYPE (alu)
dut.func3.value = 0b100 # xori
await Timer(1, units="ns")
# Logic block controls
assert dut.alu_control.value == "1000"
assert dut.imm_source.value == "000"
assert dut.mem_write.value == "0"
assert dut.reg_write.value == "1"
# Datapath mux sources
assert dut.alu_source.value == "1"
assert dut.write_back_source.value == "00"
assert dut.pc_source.value == "0"
Don't forget to update all the bit widths in this file !
As the datapath is starting to get a bit beefy, I won't display it here,just know that the only modif is to update the data width of the alu_control
signal wire.
The test program :
AAA94913 //XORI TEST START : xori x18 x19 0xAAA | x18 <= 21524445 (because sign extend)
00094993 // xori x19 x18 0x000 | x19 <= 21524445
And the tb :
# test_cpu.py
# ...
@cocotb.test()
async def cpu_insrt_test(dut):
# ...
##################
# AAA94913 //XORI TEST START : xori x18 x18 0xAAA | x18 <= 21524445 (because sign ext)
# 00094993 // xori x19 x18 0x000 | x19 <= 21524445
##################
print("\n\nTESTING XORI\n\n")
# Check test's init state
assert binary_to_hex(dut.instruction.value) == "AAA94913"
await RisingEdge(dut.clk) # xori x18 x19 0xAAA
assert binary_to_hex(dut.regfile.registers[18].value) == "21524445"
await RisingEdge(dut.clk) # xori x19 x18 0x000
assert (
binary_to_hex(dut.regfile.registers[19].value) ==
binary_to_hex(dut.regfile.registers[18].value)
)
Okay, you start to know the drill, and this tutorial would not be complete without a part where you actually do stuff yourself ! I will now let you implement them on your own too ! Don't worry, you are not completely on your own !
If you still struggle. Read the next few lines.
Here is a list of the points of attention to keep in mind whilst implementing I_type
:
ori
&andi
: we already have the logic so thiese should be fairly easyslli
,srli
&srai
: The thing about these is that there are multiple ways to go about it hardware-wise. I won't get into the details here, as we focus on the CPU RTL logic rather than precise hardware implementation. so we'll implement these in the alu using system verilog's<<
(sll),>>
(srl) and>>>
(sra) operators.
The difference between
>>
(srl) and>>>
(sra) is the fact thatsra
extends the sign wheresrl
just shifts, effectively dividing the number by 2 by filling the upper bits with 0s.
Please consider the following too :
Don't forget to use a 5 bit shamt for the shift instructions ! The upper 7 bits of the immediate (for shift instructions) acts like a f7 ! so we need to invalidate (or add a default case that does nothing) the instructions that are not conformimg to that ! This translates in a 0 in reg write if "f7" is invalid. (We may add an
illegal_op
flag later in future tutorials if we re-use this core for other projects). We also only use the 5 lower bits ofsrc2
in the alu. So make sure to add f7 to the datapath, and invalidate the non-support f7s for shift in control !
This should give you a bit of a challenge too ! (You can use the final source code if needed, especially for the sra
testbench which can be a bit challenging with python's weird representation of negative numbers ! (You can refer to this stack overflow post for this matter).
Are you still here ? good, hope your implementation went well !
Now that we implemented all I-Types instructions, a logical follow-up would be implementing R-Types instructions.
They do basically the same thing as I-types, execept the alu_source
signal is not set to immediate but rs2.
You start to know the drill for these ones too, so I'll hint you on the first couple of instructions and then let you implement the rest on your own, just like
I-Type
!
Also remember we already have or
, and
& add
supported on the core so far.
We already added sub arithmetic for the beq
instruction. This means we don't have to update the ALU and most likely won't for all of the R-Types
as they all inherit the same arithmetic principles than the I-Type
instructions we just implemented.
Regarding the sub instruction, refering to the instruction table here is what it looks like :
F7 | rs2 | rs1 | f3 | rd | op |
---|---|---|---|---|---|
0100000 | XXXXX | XXXXX | 000 | XXXX | 0110011 |
And here the logic will be the same as the shifts we added before, meaning we will check for a valid F7.
The ALU is already up-to-date. (damn that was hard ! we desrve a pause !)
SIKE ! There is no pause ! Now let's get to work shall we ?
For the control's HDL Logic, we need to keep something in mind :
As stated earlier, the logic will be kind of the same than for shifts. Why ? Well because add
and sub
both share the same f3
That means meaning we have to use f7 to know which is which. And because the addi
also shares the same alu_op
signal than add
and sub
we also have to take this scenario into account ! (because addi
use immediate instead of f7, meaning the value could be anything !)
I added comments in the code below to make that a bit clearer.
I will go once again with some good old high-level logic description, I know for a fact there are loads of more efficient ways to implement the check or f7, but I chose to do it like that because it is convinient :
// control.sv
// ...
/**
* ALU DECODER
*/
always_comb begin
case (alu_op)
// LW, SW
2'b00 : alu_control = 4'b0000;
// R-Types, I-types
2'b10 : begin
case (func3)
// ADD (and later SUB with a different F7)
3'b000 : begin
// 2 scenarios here :
// - R-TYPE : either add or sub and we need to a check for that
// - I-Type : addi -> we use add arithmetic
if(op == 7'b0110011) begin // R-type
alu_control = func7[5] ? 4'b0001 : 4'b0000;
end else begin // I-Type
alu_control = 4'b0000;
end
end
// AND
// ...
To avoid bloating the code, I won't invalidate the non-compliant f7 and only check for the bit #5 of f7 to either chose sub
or add
. That will also be on less feature to test.
Is it right ? no. But I plan on implementing invalid_op
flags once we set the constant and try to pipeline this thing. So let's leave that for later !
-"the compiler knows what it's doing anyway" -famous last words.
In order to verify, first, update the first add
test to include f7 :
# test_control.py
# ...
@cocotb.test()
async def add_control_test(dut):
await set_unknown(dut)
# TEST CONTROL SIGNALS FOR ADD
await Timer(10, units="ns")
dut.op.value = 0b0110011 # R-TYPE
dut.func3.value = 0b000 # add, sub
dut.func7.value = 0b0000000 # add // CHANGED !
await Timer(1, units="ns")
assert dut.alu_control.value == "0000"
assert dut.mem_write.value == "0"
assert dut.reg_write.value == "1"
assert dut.alu_source.value == "0"
assert dut.write_back_source.value == "00"
assert dut.pc_source.value == "0"
# ...
And then, add a test case for sub
:
# test_control.py
# ...
@cocotb.test()
async def sub_control_test(dut):
await set_unknown(dut)
# TEST CONTROL SIGNALS FOR SUB
await Timer(10, units="ns")
dut.op.value = 0b0110011 # R-TYPE
dut.func3.value = 0b000 # add, sub
dut.func7.value = 0b0100000 # sub
await Timer(1, units="ns")
assert dut.alu_control.value == "0001"
assert dut.mem_write.value == "0"
assert dut.reg_write.value == "1"
assert dut.alu_source.value == "0"
assert dut.write_back_source.value == "00"
assert dut.pc_source.value == "0"
Here is the test program i'll use :
412A8933 //SUB TEST START : sub x18 x21 x18 | x18 <= FFFFF8FF
And the associated test bench :
# test_cpu.py
# ...
@cocotb.test()
async def cpu_insrt_test(dut):
# ...
##################
# 412A8933 //SUB TEST START : sub x18 x21 x18 | x18 <= FFFFF8FF
##################
print("\n\nTESTING SUB\n\n")
# Check test's init state
assert binary_to_hex(dut.instruction.value) == "412A8933"
await RisingEdge(dut.clk) # sub x18 x21 x18
assert binary_to_hex(dut.regfile.registers[18].value) == "FFFFF8FF"
Well, we do have all the logic for the rest of the instructions now...
For most of them, we don't even need to do anything ! Think about it :
- The main decoder in the control unit already treats the
R-Type
generic signals. - The ALU decoder already takes all f3 into account !
So this way, we can already verify :
- sll
- slt
- sltu
- xor
without touching anything !
Exercise for you : come up with a test programm and testbench to verify cpu behavior with these instructions ;)
And what about srl
and sra
?
Well, nothing to do either ! (except test it of course) Why ? Well, we did test for the f7-ish immediate before in a scenario specific to our I-Tpye
stuggle to differentiate srl
and sra
because their f3 encoding were the same.
Turns out R-Types uses actual f7, and guess what :
F7 | rs2 | rs1 | f3 | rd | op | |
---|---|---|---|---|---|---|
srl |
0000000 | XXXXX | XXXXX | 101 | XXXX | 0110011 |
sra |
0100000 | XXXXX | XXXXX | 101 | XXXX | 0110011 |
The 7 upper bits encodings (immediate for I-Type
) and f7 for R-Type
are the same.
So our decoder is already able to differentiate them, and given R-Type generic signals have already been implemented a whiiile ago, we just have to test them ! So let's do exactly that !
Here is a summary of what we did, aka only testing
//...
00800393 //SLL TEST START : addi x7 x0 0x8 | x7 <= 00000008
00791933 // sll x18 x18 x7 | x18 <= FFF8FF00
013928B3 //SLT TEST START : slt x17 x22 x23 | x17 <= 00000001 (-459008 < -4368)
013938B3 //SLTU TEST START : sltu x17 x22 x23 | x17 <= 00000001
013948B3 //XOR TEST START : xor x17 x18 x19 | x17 <= 000711F0
0079D433 //SRL TEST START : srl x8 x19 x7 | x8 <= 00FFFFEE
4079D433 //SRA TEST START : sra x8 x19 x7 | x8 <= FFFFFFEE
00000013 //NOP
//...
[I'll let you come up with the test benches ;)]
Okay, now that we have all this fancy slt
arithmetic, we can use it to implement all the B-Type
instructions !
How ? Well, we'll simply add an output flags to our ALU that will serve the exact same role as zero
:
alu_last bit
will be 1 if the lastalu_result
bit is 1.
And to use the correct arithmetic, we'll have to improve our control's ALU_Decoder to choose the correct arithmetic from the ALU. In the Holy_Reference_Tables.pdf
file, there is a dedicated page containing all the data you need for reference.
Here is the recap of the datapath we'll have after blt
, not much is changing except for the new alu_last_bit
signal.
As you can see, we go for a strategy where the alu only gives us the state of its last bit and the branch logic inside the control unit will sort out from there, depending on the instruction, whether we branch or not.
And last but not least, after we got our flags and correct arithmetic, we'll also need to improve the pc_source
selector, by adding a whole branch logic section (The Holy_Reference_Tables.pdf
file as a page for that for reference).
I'll let you implement these instructions yourself ;) Nevertheless, I will hint you on the implementation of blt
in section 10.1 below.
As usual, we'll start by updating our alu. No need to add any arithmetic, only a flag that gets asserted when the last bit is 1 :
// alu.sv
module alu (
// IN
input logic [3:0] alu_control,
input logic [31:0] src1,
input logic [31:0] src2,
// OUT
output logic [31:0] alu_result,
output logic zero,
output logic last_bit // NEW !
);
// ...
assign zero = alu_result == 32'b0;
assign last_bit = alu_result[0]; // NEW !
endmodule
To verify this, we'll simply get a bunch of random numbers, use slt
arithmetic (but it could be any, as the control unit wuill be the one figuring out wheter it is relevant or not), and check if the "last_bit" is behaving as expected :
# test_alu.py
# ...
@cocotb.test()
async def last_bit_test(dut):
# (logic copy-pasted from slt_test function)
await Timer(1, units="ns")
dut.alu_control.value = 0b0101
for _ in range(1000):
src1 = random.randint(0,0xFFFFFFFF)
src2 = random.randint(0,0xFFFFFFFF)
dut.src1.value = src1
dut.src2.value = src2
await Timer(1, units="ns")
if src1 >> 31 == 0 and src2 >> 31 == 0:
expected = int(src1 < src2)
elif src1 >> 31 == 0 and src2 >> 31 == 1:
expected = int(src1 < (src2 - (1<<32)))
elif src1 >> 31 == 1 and src2 >> 31 == 0:
expected = int((src1 - (1<<32)) < src2)
elif src1 >> 31 == 1 and src2 >> 31 == 1:
expected = int((src1 - (1<<32)) < (src2 - (1<<32)))
assert dut.last_bit.value == str(expected)
And now we need to update the branching logic and the ALU_Decoder.
We start by re-arranging our alu decoder
module control (
// I/Os
// ...
input logic alu_last_bit, // NEW !
// ...
);
// (Main decoder) ...
/**
* ALU DECODER
*/
always_comb begin
case (alu_op)
// ...
2'b01 : begin
case (func3)
// BEQ
3'b000 : alu_control = 4'b0001;
// BLT
3'b100 : alu_control = 4'b0101;
endcase
end
endcase
end
// And we now create a whole new always_comp blocks for our new *branch logic* decoder :
/**
* PC_Source
*/
logic assert_branch;
always_comb begin : branch_logic_decode
case (func3)
// BEQ
3'b000 : assert_branch = alu_zero & branch;
// BLT
3'b100 : assert_branch = alu_last_bit & branch;
default : assert_branch = 1'b0;
endcase
end
assign pc_source = assert_branch | jump;
You can see I used an intermediate "assert_branch" signal. There are many ways to implement this logic, some are more efficient if you concatenate the branch logic in the other decoders, but who cares ? (coping again)
And to verify this behavior, we simply copy-paste the beq
test case and adapt the expected signals using the Holy_Reference_Tables.pdf
reference.
# test_control.py
# ...
@cocotb.test()
async def blt_control_test(dut):
await set_unknown(dut)
# TEST CONTROL SIGNALS FOR BLT (underlying logic same as BEQ)
await Timer(10, units="ns")
dut.op.value = 0b1100011 # B-TYPE
dut.func3.value = 0b100 # blt
dut.alu_last_bit.value = 0b0
await Timer(1, units="ns")
assert dut.imm_source.value == "010"
assert dut.alu_control.value == "0101"
assert dut.mem_write.value == "0"
assert dut.reg_write.value == "0"
assert dut.alu_source.value == "0"
assert dut.branch.value == "1"
assert dut.pc_source.value == "0"
assert dut.second_add_source.value == "0"
# Test if branching condition is met
await Timer(3, units="ns")
dut.alu_last_bit.value = 0b1
await Timer(1, units="ns")
assert dut.pc_source.value == "1"
assert dut.second_add_source.value == "0"
For this part, you just need to route the alu_last_bit
flag from the ALU the the Control unit.
We already test branches extensively before for beq
so we'll stick to something simple :
- An instruction with an untaken branch
- An instruction with a taken branch to
PC+0x8
And here is the resulting test program for blt
:
//...
0088C463 //BLT TEST START : blt x17 x8 0x8 | not taken : x8 neg (sign), x17 pos (no sign)
01144463 // blt x8 x17 0x8 | taken : x8 neg (sign), x17 pos (no sign)
00C00413 // addi x8 x0 0xC | NEVER EXECUTED (check value)
//...
So we use that do do some assertions in the testbench, as usual :
# test_cpu.py
# ...
@cocotb.test()
async def cpu_insrt_test(dut):
# ...
##################
# 0088C463 //BLT TEST START : blt x17 x8 0x8 | not taken : x8 neg, x17 pos
# 01144463 // blt x8 x17 0x8 | taken : x8 neg, x17 pos
# 00C00413 // addi x8 x0 0xC | NEVER EXECUTED (check value)
##################
print("\n\nTESTING BLT\n\n")
# Check test's init state
assert binary_to_hex(dut.instruction.value) == "0088C463"
assert binary_to_hex(dut.regfile.registers[17].value) == "000711F0"
assert binary_to_hex(dut.regfile.registers[8].value) == "FFFFFFEE"
# execute, branch should NOT be taken !
await RisingEdge(dut.clk) # blt x17 x8 0x8
assert binary_to_hex(dut.instruction.value) == "01144463"
# execute, branch SHOULD be taken !
await RisingEdge(dut.clk) # blt x8 x17 0x8
assert not binary_to_hex(dut.instruction.value) == "00C00413"
# We verify x8 value was not altered by addi instruction (never fetched)
assert binary_to_hex(dut.regfile.registers[8].value) == "FFFFFFEE"
And there you go ! you now have a strong base to implement all other B-Types
! I'll see you in the next section to implement jalr
;).
Spoiler, if you used 64 words wide memories like me, you mights wanna make it larger to test everything (depending on your tests complexity).
Before moving on to the "partial" memory operations that we yet have to implement, let's get rid of the final operation to have a complete processor (that only works on word aligned memory accesses) : jalr
.
Here is what jalr
looks like :
Imm [11:0] | rs1 | f3 | rd | op |
---|---|---|---|---|
XXXXXXXXXXXX | XXXXX | 000 | XXXXX | 1100111 |
This instruction is formatted as an I-Type and "behaves" as a J-Type (it's a jump instruction after all !)
Its unique OP code makes it easy to distinguish (still similar to jal
's : 1100111).
Here is how we use it in a program :
jalr rd offset(rs1)
So it's exactly the same as jal
excepts it uses a pre-stored address from rs1 to jump. (rd gets the value of pc+4). Here is the recap :
rd
<- PC + 4 (write_back_source = 010)pc_next
<- rs1 + offset (we can't currently do that)
So, we have two choices :
- Use the ALU to compute the target PC and add
alu_result
to our pc sources for pc_next - Use our second and add a second_add source
I'll go with the second choice, as we've always computed PC-related stuff in the second_add logic so far. This is far from ideal, and I know for a fact that we should re-use the ALU to refactor our cpu logic BUT I am a biased human and I mainly try to get this thing working first. Optimisation will come later, don't worry. (more coping).
Here is a recap of our datapath we will implement of jalr
:
So you thought it was all easy from now on ?
WRONG ! We now have more to do on the datapth. Which is not hard by itself but rather long as we have to update signals width in both the logic AND the tests.
Once again, we should add constants to make our job easier on this part, but I also plan on doing this later in this course (which is not now).
So let's get started shall we ?
First of all, we have to start with the control unit by adding our jalr
special case in the main decoder. To do so, we'll reuse J-Type logic and add a check on the op difference to decide which second_add_source
to go for :
// control.sv
// ...
// J-type + JALR weird Hybrib
7'b1101111, 7'b1100111 : begin
reg_write = 1'b1;
imm_source = 3'b011;
mem_write = 1'b0;
write_back_source = 2'b10; //pc_+4
branch = 1'b0;
jump = 1'b1;
if(op[3]) begin// jal
second_add_source = 2'b00;
imm_source = 3'b011;
end
else if (~op[3]) begin // jalr
second_add_source = 2'b10;
imm_source = 3'b000;
end
end
// ...
As you can see, we do not forget about the imm_source
too !
As for the test bench, not much to do if not adding a test case, and update all the checks on second_add_source
to take the new signal width into account :
# test_control.py
# Updated tests (...)
@cocotb.test()
async def jalr_control_test(dut):
await set_unknown(dut)
# TEST CONTROL SIGNALS FOR JALR
await Timer(10, units="ns")
dut.op.value = 0b1100111 # Jump / I-type : jalr
await Timer(1, units="ns")
assert dut.imm_source.value == "000"
assert dut.mem_write.value == "0"
assert dut.reg_write.value == "1"
assert dut.branch.value == "0"
assert dut.jump.value == "1"
assert dut.pc_source.value == "1"
assert dut.write_back_source.value == "10"
assert dut.second_add_source.value == "10"
We can now update our datapath !
To update our datapth, here is a refresher of our new layout :
We quickly identify that there is not much to do, if not updating the second_add_source
width and add a MUX option accordingly :
// cpu.sv
// ...
wire [1:0] second_add_source; // width updated !
// ...
always_comb begin : second_add_select
case (second_add_source)
2'b00 : pc_plus_second_add = pc + immediate; // width updated !
2'b01 : pc_plus_second_add = immediate; // width updated !
2'b10 : pc_plus_second_add = read_reg1 + immediate; // NEW
endcase
end
// ...
And that's it ! now let's verify that what we did is still working great and come up with a new test program:
//...
00000397 //JALR TEST START : auipc x7 0x0 | x7 <= 00000110 PC = 0x10C
01438393 // addi x7 x7 0x10 | x7 <= 00000120 PC = 0x110
FFC380E7 // jalr x1 -4(x7) | x1 <= 00000118, go @PC 0x11C PC = 0x114
00C00413 // addi x8 x0 0xC | NEVER EXECUTED (check value) PC = 0x118
//...
This program loads a target PC using auipc and addi to build a constant, and then, we use jalr
we a negative offset to this target, leading us to a final jump @PC=0x11C.
And the testbench assertions :
# test_cpu.py
# ...
@cocotb.test()
async def cpu_insrt_test(dut):
# ...
##################
# 00000397 //JALR TEST START : auipc x7 0x0 | x7 <= 00000110 PC = 0x10C
# 01438393 // addi x7 x7 0x10 | x7 <= 00000120 PC = 0x110
# FFC380E7 // jalr x1 -4(x7) | x1 <= 00000118, go @PC 0x11C PC = 0x114
# 00C00413 // addi x8 x0 0xC | NEVER EXECUTED (check value) PC = 0x118
##################
print("\n\nTESTING JALR\n\n")
# Check test's init state
assert binary_to_hex(dut.instruction.value) == "00000397"
assert binary_to_hex(dut.pc.value) == "0000010C"
await RisingEdge(dut.clk) # auipc x7 0x00
await RisingEdge(dut.clk) # addi x7 x7 0x10
assert binary_to_hex(dut.regfile.registers[7].value) == "00000120"
await RisingEdge(dut.clk) # jalr x1 -4(x7)
assert binary_to_hex(dut.regfile.registers[1].value) == "00000118"
assert not binary_to_hex(dut.instruction.value) == "00C00413"
assert binary_to_hex(dut.regfile.registers[8].value) == "FFFFFFEE"
assert binary_to_hex(dut.pc.value) == "0000011C"
And it works ! GG to us.
Okay, we've implemented all logical operations ! GG.
But now is time to enter the realms of "the things we said we would do later, don't worry bro".
The first one to take care of is the memory stuff as it will unclock all the load/stores for bytes and halfwords.
But remember our memory module ? Remember the way we built it so we could byte-address it but couldn't really operate on other things than words ? Well that's what we are going to take care of in this section to finally have a fully functionnal RV32I core !
The goal here is to implement the following instructions :
lb
lh
lbu
lhu
sb
sh
Which are non-aligned (aka not modulo 4) version of our already-existing sw
and lw
.
Non aligned ? not really. I will elaborate on that more in section 12.1; Just know that byte-wise memory operations cannot really be mis-aligned technically speaking.
To really understand the behavior of these instruction, I suggest you use a risc-v ISA simulator like this "RISC-V Interpreter" (you can find better one that support unsigned and half words but this one can give you an idea).
For our stores instructions, we need to improve our memory first. To do so, we need to understand what is expected from our core from a programming standpoint.
According to this discussion, bytes aligned loads/stores are not a problem at all as the byte is the lowest possible unit of our memory.
BUT halfwords can be mis-aligned : if the address is odd, that means we are trying to load in-between words :
In this sceneario, it is up to us to decide what to do. And the "what to do" will be : nothing.
In that case, we should also throw an exception but remember : this is something for future improvements. On top of that, mis-aligned lh are, theorically, not supposed to happen as the compiler "knows" what it is doing.
Now, the way we will handle that "nothing" can vary in many ways :
- Do we want the CPU or the memory to handle these mis-alignement ? It depends on how we'll implement memory later on, maybe what controller we'll use, and how this said controller will handle these exceptions... To get the best predictible behavior, it is better to just handle these things ourselves in the cpu core. It is also a more standard way to handle this stuff.
- Where exactly do we want to handle these exceptions in the cpu ? Right now, we have a single cycle CPU. But in a pipelined cpu, every stage has its role. This kind of exception shall idealy be caught in the "decoding" stage, aka at the very beginning of the cpu, where we generate control signals (right after fetching it from memory) but in reality, it is a bit more tricky than that. More on that in the later sections.
We could theorically implement that "yolo" style but it is a bit more important for future improvement that what we were used to procrastinate on until now, so might as well think about these problems early this time ;)
So, now that we have a broad idea on how to handle mis-alignment, we need to tell our memory to write this data, at a specific byte or halfword, and in a "standard" way !
I decided to give another go at chatgpt, maybe this time he can give us some valuable insights :
# Standard Behavior for Memory Writes
When interfacing with standard memory modules on an FPGA, it's common to handle partial writes
(like byte or halfword writes) using write enable masks.
These masks indicate which byte(s) in a word should be updated. Here's a breakdown:
1. SB (Store Byte)
Behavior: Write 1 byte (8 bits) to a specific location.
Implementation:
You provide a write_data signal with the full word, but only 1 byte of it is relevant.
You send a byte-enable mask to the memory, typically a 4-bit signal for 32-bit systems.
Example: To write the least significant byte of a word at address 0x1000,
set write_mask = 0b0001.
For 0x1001, write_mask = 0b0010, etc.
2. SH (Store Halfword)
Behavior: Write 2 bytes (16 bits) to a specific location.
Implementation:
Provide a write_data signal with the relevant 2 bytes filled in.
Send a byte-enable mask indicating which two bytes to write.
Example: For a halfword-aligned store at 0x1002, write_mask = 0b1100 for a
32-bit system.
# Memory Model
Most standard memory modules support the concept of:
Write data bus (e.g., data_in).
Write mask or enable signal (e.g., byte_enable).
When writing:
Align the data: For byte or halfword writes,
shift the relevant data to the proper byte lanes.
Generate mask: Based on the address and size (SB, SH),
create the appropriate byte-enable signal.
Great ! This time it's LLM for the win ! Great explainations Mr GPT, now let's take back the control and actually use our brains to do something with these infos.
It looks like we'll have to enhance our mem_write
signal with something a bit more adapted : a 4bits wide signals telling the memory where to write (1 for each byte) ! We'll call it mem_byte_enable
and it will work with mem_write
to specify which byte to write.
In summary, we need to :
- Update our memory to works with byte masks.
- Implement a new "decoder" to determine :
- mis-alignements
- a write mask
By keeping in mind what we just said in section 12.2, memory never sees non word-aligned addresses : our new "decoder" will translate everything into :
- A nice word-aligned address
- And a
byte_enable
mask along thewrite_enable
signal
But let's leave the decoder for later and focus on the momory for now as this section is about updating our new memory according to our new requierements.
With that in mind, we can improve our memory like so :
// memory.sv
module memory #(
parameter WORDS = 128,
parameter mem_init = ""
) (
input logic clk,
input logic [31:0] address,
input logic [31:0] write_data,
input logic [3:0] byte_enable,
input logic write_enable,
input logic rst_n,
output logic [31:0] read_data
);
// Memory array (32-bit words)
reg [31:0] mem [0:WORDS-1];
initial begin
if (mem_init != "") begin
$readmemh(mem_init, mem);
end
end
// Write operation
always @(posedge clk) begin
if (rst_n == 1'b0) begin
for (int i = 0; i < WORDS; i++) begin
mem[i] <= 32'b0;
end
end else if (write_enable) begin
if (address[1:0] != 2'b00) begin
$display("Misaligned write at address %h", address);
end else begin
// use byte-enable to selectively write bytes
for (int i = 0; i < 4; i++) begin
if (byte_enable[i]) begin
mem[address[31:2]][(i*8)+:8] <= write_data[(i*8)+:8];
end
end
end
end
end
always_comb begin
read_data = mem[address[31:2]];
end
endmodule
To test if this new code did not affect underlying memory logic, we can see if our module still behaves correctly by temporarly replacing if (byte_enable[i]) begin
with if (1) begin
which will enable write on all bytes and hopefully make our previous tests pass.
And it works just fine so let's revert this temporary change and move on !
Time to enhance our old memory tests : we keep the old one by asserting byte_enable = 0b1111
(because we were only dealing with word memory operations)
And then do some new tests with different byte_enable
value using a simple for loop.
Here is the final testbench :
# test_memory.py
import cocotb
from cocotb.clock import Clock
from cocotb.triggers import RisingEdge, Timer
@cocotb.coroutine
async def reset(dut):
await RisingEdge(dut.clk)
dut.rst_n.value = 0
dut.write_enable.value = 0
dut.address.value = 0
dut.write_data.value = 0
await RisingEdge(dut.clk)
dut.rst_n.value = 1
await RisingEdge(dut.clk)
# Assert all is 0 after reset
for address in range(dut.WORDS.value):
dut.address.value = address
await Timer(1, units="ns")
# just 32 zeroes, you can also use int()
assert dut.read_data.value == "00000000000000000000000000000000"
@cocotb.test()
async def memory_data_test(dut):
# INIT MEMORY
cocotb.start_soon(Clock(dut.clk, 1, units="ns").start())
await reset(dut)
# Test: Write and read back data
test_data = [
(0, 0xDEADBEEF),
(4, 0xCAFEBABE),
(8, 0x12345678),
(12, 0xA5A5A5A5)
]
# For the first tests, we deal with word operations
dut.byte_enable.value = 0b1111
# ========================================
# ... OTHER TESTS REMAINS THE SAME ...
# ========================================
# ...
# ===============
# BYTE WRITE TEST
# ===============
dut.write_enable.value = 1
for byte_enable in range(16):
await reset(dut) # we reset memory..
dut.byte_enable.value = byte_enable
# generate mask from byte_enable
mask = 0
for j in range(4):
if (byte_enable >> j) & 1:
mask |= (0xFF << (j * 8))
for address, data in test_data:
dut.address.value = address
dut.write_data.value = data
# write
dut.write_enable.value = 1
await RisingEdge(dut.clk)
dut.write_enable.value = 0
await RisingEdge(dut.clk)
# Verify by reading back
dut.address.value = address
await RisingEdge(dut.clk)
assert dut.read_data.value == data & mask
And it works ! Great !
Now if we run all the tests from the test runner, we obviously get failures everywhere in the cpu test because the memory isn't getting any info on its new byte_enable
signal it is expecting to get.
So, how do we generate valid byte_enable
signals ?
The thing is we don't know what the final read address is until we've reached the ALU.
We could try to figure it out in control by getting info from the regitsers and alu back to the control, but these kind of design choices can lead to severe logic hazards when pipelining a CPU (which is an advanced technique to speed up clock speeds we will cover in another tutorial). The best practice here is simply to wait for the data to be availible and use it there.
Fun fact : Waiting for data to be availible like that leads to a bottleneck in pipined CPU, where branch prediction becomes a thing and we have to wait until the ALU to know if the prediction was right ! We'll have the opportunity to explore these concepts extensively in later tutorials ;)
With that in mind, here is how we'll proceed to know where to write (or not) :
As you can see, we added a whole "load_store_decoder" or "byte_enable_decoder" (you can call it however you want) which will produce the adequate byte_enable
mask depending on the address offset and the instruction being fetched.
This unit also processes the data we send to the memory, and the reason why come from this extract of the RISC-V Specs :
The SW, SH, and SB instructions store 32-bit, 16-bit, and 8-bit values from the low bits of register rs2 to memory
That means : we'll have to apply a shift to these lower bits for sb
and sh
to match the mask ! But more on this later. Note that this does not affect the way our memory byte_enable
's interpretation works at all. (In our design, the memory simply does not care and will execute whatever masked write we send it.)
Also remember our design choice : "Do nothing if halfword or word alignment is not correct". Well, this will be done by setting the mask to 4'b000
, which will avoid any altering in memory, even if write_enable
is asserted.
So we start by creating a new module, and the tb subfolder config that goes with it, for now we only need to output 4'b1111
to make the tests pass, we'll implement real logic later for sb
and sh
:
// load_store_decoder.sv
module load_store_decoder (
input logic [31:0] alu_result_address,
input logic [2:0] f3,
input logic [31:0] reg_read,
output logic [3:0] byte_enable,
output logic [31:0] data
);
assign byte_enable = 4'b1111;
endmodule
And we add it to our datapath, right after the ALU :
// cpu.sv
// alu ...
/**
* LOAD/STORE DECODER
*/
wire [3:0] mem_byte_enable;
load_store_decoder ls_decode(
.alu_result_address(alu_result),
.byte_enable(mem_byte_enable)
);
// ...
memory #(
.mem_init("./test_dmemory.hex")
) data_memory (
.clk(clk),
.address(alu_result),
.write_data(read_reg2),
.write_enable(mem_write),
.byte_enable(mem_byte_enable), // ROUTED HERE !
.rst_n(1'b1),
.read_data(mem_read)
);
For now, we only seek to make the test pass to check if our new memory integrates well in our design so we wont connect more wires for now, you can if you want, it doesn't really matter.
Running all the test does work now.
We will also add a test bench for this sub module for good measure to ensure it is bahving as expected :
# test_load_store_decoder.py
import cocotb
from cocotb.triggers import Timer
@cocotb.test()
async def ls_unit_test(dut):
await Timer(1, units="ns")
assert dut.byte_enable.value == "1111"
We'll put actual assertions in here as time goes on. but for now, this will have to do.
Okay, now everything is set up ! let's implement sb
! First, here is what the instruction looks like :
Imm [11:5] | rs1 | rs1 | f3 | Imm [4:0] | op |
---|---|---|---|---|---|
XXXXXXX | XXXXX | XXXXX | 000 | XXXXX | 0100011 |
So we need to get that f3 into our decoder, let's do exactly that, first, here is a refresher of the revised version of our datapath :
For the datapath on this one, don't forget to discard last two bit of the address you give to the data memory, as our new decoder will handle offset & masking for this matter. And also make the register data go trough or design for the treatment we are about to apply. It should look like something like the code snippet below :
// cpu.sv
// ...
/**
* LOAD/STORE DECODER
*/
wire [3:0] mem_byte_enable;
wire [31:0] mem_write_data;
load_store_decoder ls_decode(
.alu_result_address(alu_result),
.reg_read(read_reg2),
.f3(f3),
.byte_enable(mem_byte_enable),
.data(mem_write_data)
);
/**
* DATA MEMORY
*/
wire [31:0] mem_read;
memory #(
.mem_init("./test_dmemory.hex")
) data_memory (
// Memory inputs
.clk(clk),
.address({alu_result[31:2], 2'b00}),
.write_data(mem_write_data),
.write_enable(mem_write),
.byte_enable(mem_byte_enable),
.rst_n(1'b1),
// Memory outputs
.read_data(mem_read)
);
// ...
After updating the datapath to route signals to our new ls_decoder unit, we can enhance this said unit to implement sb
.
Don't forget to add existing
sw
and make sure it will still be supported !
// load_store_decoder.sv
module load_store_decoder (
input logic [31:0] alu_result_address,
input logic [2:0] f3,
input logic [31:0] reg_read,
output logic [3:0] byte_enable,
output logic [31:0] data
);
logic [1:0] offset;
assign offset = alu_result_address[1:0];
always_comb begin
case (f3)
3'b000: begin // SB
case (offset)
2'b00: begin
byte_enable = 4'b0001;
data = (reg_read & 32'h000000FF);
end
2'b01: begin
byte_enable = 4'b0010;
data = (reg_read & 32'h000000FF) << 8;
end
2'b10: begin
byte_enable = 4'b0100;
data = (reg_read & 32'h000000FF) << 16;
end
2'b11: begin
byte_enable = 4'b1000;
data = (reg_read & 32'h000000FF) << 24;
end
default: byte_enable = 4'b0000;
endcase
end
3'b010: begin // SW
byte_enable = (offset == 2'b00) ? 4'b1111 : 4'b0000;
data = reg_read;
end
default: begin
byte_enable = 4'b0000; // No operation for unsupported types
end
endcase
end
endmodule
This logic summarizes all the requirements we listed until now for sb
.
We can now ditch our old 3 lines long temporary test bench to get a brand new one that tests some actual expected behavior on the signals AND the data being fed to the memory :
# test_load-store_decoder.py
import cocotb
from cocotb.triggers import Timer
import random
@cocotb.test()
async def ls_unit_test(dut):
word = 0x123ABC00
# ====
# SW
# ====
dut.f3.value = 0b010
for _ in range(100):
reg_data = random.randint(0, 0xFFFFFFFF)
dut.reg_read.value = reg_data
for offset in range(4):
dut.alu_result_address.value = word | offset
await Timer(1, units="ns")
assert dut.data.value == reg_data & 0xFFFFFFFF
if offset == 0b00:
assert dut.byte_enable.value == 0b1111
else :
assert dut.byte_enable.value == 0b0000
# ====
# SB
# ====
await Timer(10, units="ns")
dut.f3.value = 0b000
for _ in range(100):
reg_data = random.randint(0, 0xFFFFFFFF)
dut.reg_read.value = reg_data
for offset in range(4):
dut.alu_result_address.value = word | offset
await Timer(1, units="ns")
if offset == 0b00:
assert dut.byte_enable.value == 0b0001
assert dut.data.value == (reg_data & 0x000000FF)
elif offset == 0b01:
assert dut.byte_enable.value == 0b0010
assert dut.data.value == (reg_data & 0x000000FF) << 8
elif offset == 0b10:
assert dut.byte_enable.value == 0b0100
assert dut.data.value == (reg_data & 0x000000FF) << 16
elif offset == 0b11:
assert dut.byte_enable.value == 0b1000
assert dut.data.value == (reg_data & 0x000000FF) << 24
Everything works as expected ? Great ! sb
should be supported now ! Let's check it out !
We need to test 2 things here as a "bare minimum" verification:
- make sure misaligned
sw
don't alter memory. - make sure
sb
only wites 1 byte only
Here is a test programm that does just that :
008020A3 //SB TEST START : sw x8 0x1(x0) | NO WRITE ! (mis-aligned !) PC = 0x11C
00800323 // sb x8 0x6(x0) | mem @ 0x4 <= 00EE0000 PC = 0x120
And here are the associated test bench :
# test_cpu.py
# ...
################
# 008020A3 //SB TEST START : sw x8 0x1(x0) | NO WRITE ! (mis-aligned !)
# 00800323 // sb x8 0x6(x0) | mem @ 0x4 <= 00EE0000
##################
print("\n\nTESTING SB\n\n")
# Check test's init state
assert binary_to_hex(dut.instruction.value) == "008020A3"
await RisingEdge(dut.clk) # sw x8 0x1(x0)
# address is 1 because 0x6 is word @ address 4 and the test bench gets data by word
assert binary_to_hex(dut.data_memory.mem[1].value) == "00000000" # remains UNFAZED
await RisingEdge(dut.clk) # sb x8 0x6(x0)
assert binary_to_hex(dut.data_memory.mem[1].value) == "00EE0000"
Now that we have our underlying masking logic, we can re-use that new decoder to implement sh
.
Here's a little reminder of our requierements for sh
:
- Alignement is mandatory, so we'll check for odd addresses to determine that.
- The final mask produced is either
4'b1100
or4'b0011
- Don't forget to shift the lower bits to the portion we want to write on to match mask.
And here is how sh
looks like :
Imm [11:5] | rs1 | rs1 | f3 | Imm [4:0] | op |
---|---|---|---|---|---|
XXXXXXX | XXXXX | XXXXX | 001 | XXXXX | 0100011 |
So yeah, it's basically the same except f3 is 3'b001
.
Here is the f3 case we add to handle sh
in our decoder :
// load_store_decoder.sv
// ...
3'b001: begin // SH
case (offset)
2'b00: begin
byte_enable = 4'b0011;
data = (reg_read & 32'h0000FFFF);
end
2'b10: begin
byte_enable = 4'b1100;
data = (reg_read & 32'h0000FFFF) << 16;
end
default: byte_enable = 4'b0000;
endcase
end
// ...
And here are the assotiated assertions :
# test_load_store_decoder.py
@cocotb.test()
async def ls_unit_test(dut):
# ...
# ====
# SH
# ====
await Timer(10, units="ns")
dut.f3.value = 0b001
for _ in range(100):
reg_data = random.randint(0, 0xFFFFFFFF)
dut.reg_read.value = reg_data
for offset in range(4):
dut.alu_result_address.value = word | offset
await Timer(1, units="ns")
if offset == 0b00:
assert dut.byte_enable.value == 0b0011
assert dut.data.value == (reg_data & 0x0000FFFF)
elif offset == 0b10:
assert dut.byte_enable.value == 0b1100
assert dut.data.value == (reg_data & 0x0000FFFF) << 16
else:
assert dut.byte_enable.value == 0b0000
Here is a simple test program that tests for bahavior when mis-aligned and then performs a genuine write :
008010A3 //SH TEST START : sh x8 1(x0) | NO WRITE ! (mis-aligned !)
008011A3 // sh x8 3(x0) | NO WRITE ! (mis-aligned !)
00801323 // sh x8 6(x0) | mem @ 0x4 <= FFEE0000
And the assertions :
# test_cpu.py
#################
# 008010A3 //SH TEST START : sh x8 1(x0) | NO WRITE ! (mis-aligned !)
# 008011A3 // sh x8 3(x0) | NO WRITE ! (mis-aligned !)
# 00801323 // sh x8 6(x0) | mem @ 0x4 <= FFEE0000
##################
print("\n\nTESTING SH\n\n")
# Check test's init state
assert binary_to_hex(dut.instruction.value) == "008010A3"
await RisingEdge(dut.clk) # sh x8 1(x0)
assert binary_to_hex(dut.data_memory.mem[1].value) == "00EE0000" # remains UNFAZED
await RisingEdge(dut.clk) # sh x8 3(x0)
assert binary_to_hex(dut.data_memory.mem[1].value) == "00EE0000" # remains UNFAZED
await RisingEdge(dut.clk) # sh x8 6(x0)
# address is 1 because 0x6 is word @ address 4 and the test bench gets data by word
assert binary_to_hex(dut.data_memory.mem[1].value) == "FFEE0000"
Now that we added partial store, time to add partial loads which can either load byte or halfwords; signed and unsigned.
What we'll do is the masks comming from the be_decoder
and add the necessary f3 to support all partials write.
Then, we'll use the mask in a reader
module that will simply read the data from the memory and proc it (i.e. apply mask & apply sign extension if needed).
Here is how this new data path would look like :
You can see the new reader
module which also gets info on what instruction is being fetched from f3. We need this information to know whether or not we should sign extend the data but also how to behave in all aspects of the data processing (data processing will be necessary to make the data ready to be written in a register).
The reader modul also passes a valid
flag to confirm the validity of the write. But more on that later.
Here are how each of these instructions look like :
IMM [11:0] | rs1 | f3 | rd | op | |
---|---|---|---|---|---|
lb |
XXXXXXXXXXXX | XXXXX | 000 (same as sb ) |
XXXXX | 0000011 |
lh |
XXXXXXXXXXXX | XXXXX | 001 (same as sh ) |
XXXXX | 0000011 |
lbu |
XXXXXXXXXXXX | XXXXX | 100 | XXXXX | 0000011 |
lhu |
XXXXXXXXXXXX | XXXXX | 101 | XXXXX | 0000011 |
For this, I chose to add a separate module with its own testbench.
The logic is pretty straight forward. On top of that, we can see that the bit f3[2]
is the only one that changes for sign/unsigned matter :
f3[2] == 0
means we sign extend.f3[2] == 1
means we don't.
With that in mind, we create the module following the setup manual guidelines, prepare all the files, and write some HDL logic :
// reader.sv
module reader (
input logic [31:0] mem_data,
input logic [3:0] be_mask,
input logic [2:0] f3,
output logic [31:0] wb_data,
output logic valid
);
logic sign_extend;
assign sign_extend = ~f3[2];
logic [31:0] masked_data; // just a mask applied
logic [31:0] raw_data; // Data shifted according to instruction
// and then mem_data is the final output with sign extension
always_comb begin : mask_apply
for (int i = 0; i < 4; i++) begin
if (be_mask[i]) begin
masked_data[(i*8)+:8] = mem_data[(i*8)+:8];
end else begin
masked_data[(i*8)+:8] = 8'h00;
end
end
end
always_comb begin : shift_data
case (f3)
3'b010 : raw_data = masked_data; // masked data is full word in that case
3'b000, 3'b100: begin // LB, LBU
case (be_mask)
4'b0001: raw_data = masked_data;
4'b0010: raw_data = masked_data >> 8;
4'b0100: raw_data = masked_data >> 16;
4'b1000: raw_data = masked_data >> 24;
endcase
end
3'b001, 3'b101: begin // LH, LHU
case (be_mask)
4'b0011: raw_data = masked_data;
4'b1100: raw_data = masked_data >> 16;
endcase
end
endcase
end
always_comb begin : sign_extend_logic
case (f3)
// LW
3'b010 : wb_data = raw_data;
// LB, LBU
3'b000, 3'b100: wb_data = sign_extend ? {{24{raw_data[7]}},raw_data[7:0]} : raw_data;
// LH, LHU
3'b001, 3'b101: wb_data = sign_extend ? {{16{raw_data[15]}},raw_data[15:0]} : raw_data;
endcase
valid = |be_mask;
end
end
endmodule
As you can see, I went for a... very descriptive file ! (wait until you see the test bench haha).
You also saw that there is a valid
flag that depends on the mask value. This avoids writing 0 to rd
when the ask is 0. This valid is then carried all the way to our reg_file to validate the write.
Adding valid signals is a technique used extensively in larger cores, especially in pipelined CPUs, under the form of a custom type. Unfortunatly, our simulator "icarus verilog" does not support custom types, meaning we'll have to switch to something like "verilator" for later tutorials.
I guess there are more efficient ways to implement this logic using syntax tricks but I went for something that works and kept it very descriptive by addressing each case one by one.
Let's break down the different comb. logic blocks :
mask_apply
: simply applies the byte_enable mask coming frombe_decoder
shift_data
: put the masked data in the lower bits, ready to be loaded in a registersign_extend_logic
: will apply sign extension to the ready-to-load data if needed.
We can directly apply the mask coming from the
be_decoder
because f3 are consistent inlb & sb
,lw & sw
etc... meaning the masking logic we developped inbe_decoder
will not need to be updated for now (we'll have to slightly re-adjust to take unsigned load into account).
So I implemented all at once ! Here is a chellenge for you (and yet another todo for me) : make this logic more efficient (coding-wise).
You guessed it, now is the time to verify the behavior of our new module.
To check how it behaves, I setted up random test for each intruction, for signed & unsigned and for each mask.
This may be a bit overkill, especially since this makes the testbench kind of long and, ahem not extremely readable.
Anyway, I added a bunch of comment to break it down, but there still is some bitwise arithmetic you need to get your head around.
I'll let you be the judge of that :
# test_reader.py
import cocotb
from cocotb.triggers import Timer
import random
# 100 random test per mask
@cocotb.test()
async def reader_lw_test(dut):
# LW TEST CASE
dut.f3.value = 0b010
await Timer(1, units="ns")
dut.be_mask.value = 0b1111
await Timer(1, units="ns")
for _ in range(100):
mem_data = random.randint(0,0xFFFFFFFF)
dut.mem_data.value = mem_data
await Timer(1, units="ns")
assert dut.wb_data.value == mem_data
@cocotb.test()
async def reader_invalid_test(dut):
dut.f3.value = 0b001
dut.mem_data.value = random.randint(0,0xFFFFFFFF)
for i in range(16):
dut.be_mask.value = i
await Timer(1, units="ns")
if i == 0 :
assert dut.valid.value == 0
else :
assert dut.valid.value == 1
@cocotb.test()
async def reader_lh_test(dut):
# LH TEST CASE
dut.f3.value = 0b001
await Timer(1, units="ns")
dut.be_mask.value = 0b1100
await Timer(1, units="ns")
for _ in range(100):
# UNSIGNED
mem_data = random.randint(0,0x7FFFFFFF)
dut.mem_data.value = mem_data
await Timer(1, units="ns")
assert dut.wb_data.value == (mem_data & 0xFFFF0000) >> 16
assert dut.valid.value == 1
# SIGNED
mem_data = random.randint(0x80000000,0xFFFFFFFF)
dut.mem_data.value = mem_data
expected = ((mem_data & 0xFFFF0000) >> 16) - (1 << 16)
await Timer(1, units="ns")
assert int(dut.wb_data.value) - (1 << 32) == expected
assert dut.valid.value == 1
dut.be_mask.value = 0b0011
await Timer(1, units="ns")
for _ in range(100):
# UNSIGNED
# Add a random AEAE to check if they are ignored
mem_data = random.randint(0,0x00007FFF) | 0xAEAE0000
dut.mem_data.value = mem_data
await Timer(1, units="ns")
assert dut.wb_data.value == (mem_data & 0x0000FFFF)
assert dut.valid.value == 1
# SIGNED
mem_data = random.randint(0x00008000,0x0000FFFF) | 0xAEAE0000
dut.mem_data.value = mem_data
expected = (mem_data & 0x0000FFFF) - (1 << 16)
await Timer(1, units="ns")
assert int(dut.wb_data.value) - (1 << 32) == expected
assert dut.valid.value == 1
# LHU TEST CASE
dut.f3.value = 0b101
await Timer(1, units="ns")
dut.be_mask.value = 0b1100
await Timer(1, units="ns")
for _ in range(100):
mem_data = random.randint(0,0xFFFFFFFF)
dut.mem_data.value = mem_data
await Timer(1, units="ns")
assert dut.wb_data.value == (mem_data & 0xFFFF0000) >> 16
assert dut.valid.value == 1
dut.be_mask.value = 0b0011
await Timer(1, units="ns")
for _ in range(100):
mem_data = random.randint(0,0xFFFFFFFF)
dut.mem_data.value = mem_data
await Timer(1, units="ns")
assert dut.wb_data.value == (mem_data & 0x0000FFFF)
assert dut.valid.value == 1
@cocotb.test()
async def reader_lb_test(dut):
# LB TEST CASE
dut.f3.value = 0b000
await Timer(1, units="ns")
dut.be_mask.value = 0b1000
await Timer(1, units="ns")
for _ in range(100):
# UNSIGNED
mem_data = random.randint(0,0x7FFFFFFF)
dut.mem_data.value = mem_data
await Timer(1, units="ns")
assert dut.wb_data.value == (mem_data & 0xFF000000) >> 24
assert dut.valid.value == 1
# SIGNED
mem_data = random.randint(0x80000000,0xFFFFFFFF)
dut.mem_data.value = mem_data
expected = ((mem_data & 0xFF000000) >> 24) - (1 << 8)
await Timer(1, units="ns")
assert int(dut.wb_data.value) - (1 << 32) == expected
assert dut.valid.value == 1
dut.be_mask.value = 0b0100
await Timer(1, units="ns")
for _ in range(100):
# UNSIGNED
mem_data = random.randint(0,0x007FFFFF) | 0xAE000000
dut.mem_data.value = mem_data
await Timer(1, units="ns")
assert dut.wb_data.value == (mem_data & 0x00FF0000) >> 16
assert dut.valid.value == 1
# SIGNED
mem_data = random.randint(0x00800000,0x00FFFFFF) | 0xAE000000
dut.mem_data.value = mem_data
expected = ((mem_data & 0x00FF0000) >> 16) - (1 << 8)
await Timer(1, units="ns")
assert int(dut.wb_data.value) - (1 << 32) == expected
assert dut.valid.value == 1
dut.be_mask.value = 0b0010
await Timer(1, units="ns")
for _ in range(100):
# UNSIGNED
mem_data = random.randint(0,0x00007FFF) | 0xAEAE0000
dut.mem_data.value = mem_data
await Timer(1, units="ns")
assert dut.wb_data.value == (mem_data & 0x0000FF00) >> 8
assert dut.valid.value == 1
# SIGNED
mem_data = random.randint(0x00008000,0x0000FFFF) | 0xAEAE0000
dut.mem_data.value = mem_data
expected = ((mem_data & 0x0000FF00) >> 8) - (1 << 8)
await Timer(1, units="ns")
assert int(dut.wb_data.value) - (1 << 32) == expected
assert dut.valid.value == 1
dut.be_mask.value = 0b0001
await Timer(1, units="ns")
for _ in range(100):
# UNSIGNED
mem_data = random.randint(0,0x0000007F) | 0xAEAEAE00
dut.mem_data.value = mem_data
await Timer(1, units="ns")
assert dut.wb_data.value == (mem_data & 0x000000FF)
assert dut.valid.value == 1
# SIGNED
mem_data = random.randint(0x00000080,0x000000FF) | 0xAEAEAE00
dut.mem_data.value = mem_data
expected = (mem_data & 0x000000FF) - (1 << 8)
await Timer(1, units="ns")
assert int(dut.wb_data.value) - (1 << 32) == expected
assert dut.valid.value == 1
# LBU TEST CASE
dut.f3.value = 0b100
await Timer(1, units="ns")
dut.be_mask.value = 0b1000
await Timer(1, units="ns")
for _ in range(100):
mem_data = random.randint(0,0xFFFFFFFF)
dut.mem_data.value = mem_data
await Timer(1, units="ns")
assert dut.wb_data.value == (mem_data & 0xFF000000) >> 24
assert dut.valid.value == 1
dut.be_mask.value = 0b0100
await Timer(1, units="ns")
for _ in range(100):
mem_data = random.randint(0,0xFFFFFFFF)
dut.mem_data.value = mem_data
await Timer(1, units="ns")
assert dut.wb_data.value == (mem_data & 0x00FF0000) >> 16
assert dut.valid.value == 1
dut.be_mask.value = 0b0010
await Timer(1, units="ns")
for _ in range(100):
mem_data = random.randint(0,0xFFFFFFFF)
dut.mem_data.value = mem_data
await Timer(1, units="ns")
assert dut.wb_data.value == (mem_data & 0x0000FF00) >> 8
assert dut.valid.value == 1
dut.be_mask.value = 0b0001
await Timer(1, units="ns")
for _ in range(100):
mem_data = random.randint(0,0xFFFFFFFF)
dut.mem_data.value = mem_data
await Timer(1, units="ns")
assert dut.wb_data.value == (mem_data & 0x000000FF)
assert dut.valid.value == 1
And there we go ! Note that you can get in touch with me if needed if you have trouble with this kind of etricate testbenches. But if you made it this far, I'm sure you'll be fine ;)
I told you f3 was consistent accross partial memory operations, but remember our load_store_decoder
has no idea the unsigned lbu
and lhu
exists yet !
So let's update the HDL slightly to take the new f3 into account when decoding the mask :
// load_store_decoder.sv
//...
case (f3)
3'b000, 3'b100 : begin // SB, LB, LBU
case (offset)
// ...
endcase
end
// ...
3'b001, 3'b101 : begin // SH, LH, LHU
// ...
end
// ...
endcase
// ...
Okay, now that we have our reader behaving as expected, what we'll do is simply add it to our datapath.
Here's the idea behind this implementation, (some details are truncated, don't forget to update everything !)
// cpu.sv
// ...
/**
* REGFILE
*/
// ...
logic wb_valid;
logic [31:0] write_back_data;
always_comb begin : write_back_source_select
case (write_back_source)
2'b00: begin
write_back_data = alu_result;
wb_valid = 1'b1;
end
2'b01: begin
write_back_data = mem_read_write_back_data;
wb_valid = mem_read_write_back_valid;
end
2'b10: begin
write_back_data = pc_plus_four;
wb_valid = 1'b1;
end
2'b11: begin
write_back_data = pc_plus_second_add;
wb_valid = 1'b1;
end
endcase
end
regfile regfile(
// ...
.write_enable(reg_write & wb_valid), // AND GATE ADDED
// ...
);
// ...
/**
* DATA MEMORY
*/
wire [31:0] mem_read;
memory #(
.mem_init("./test_dmemory.hex")
) data_memory (
// Memory inputs
.clk(clk),
.address({alu_result[31:2], 2'b00}),
.write_data(mem_write_data),
.write_enable(mem_write),
.byte_enable(mem_byte_enable),
.rst_n(1'b1),
// Memory outputs
.read_data(mem_read)
);
/**
* READER
*/
wire [31:0] mem_read_write_back_data;
wire mem_read_write_back_valid;
reader reader_inst(
.mem_data(mem_read),
.be_mask(mem_byte_enable),
.f3(f3),
.wb_data(mem_read_write_back_data),
.valid(mem_read_write_back_valid)
);
// ...
As you can see, the valid
flag for other logic block are always asserted because there is no reason not to sa far (we always made it so non-valid instructions do not infer with CPU state).
A potential improvement for later would be to de-assert the
valid
if the fetched instruction is not supported. Or even better, add anothernon-supported
flag but given we can not do custom type, this is something we'll keep for future tutorials.
Note that this valid
flag for memory is temporary and this will get thought out a bit better we implementing real memory interfaces.
Now, the support for partial loads should be complete ! Whenever a load instruction is fetched :
reg_write
should be 1 andwrite_back_source
is set on what's coming from our reader.- And our reader, getting its mask from the
load_store_decoder
, should do the necessary to process the data coming from memory to be compliant with what's expected. - If the mask is invalid, the write does not happen !
All that remains to do, as usual, is to come up with a test program :
01000393 //LB TEST START : addi x7 x0 0x10 | x7 <= 00000010 (base address fot test)
FFF3A903 // lw x18 -1(x7) | NO WRITE IN REGISTER ! (not valid)
FFF38903 // lb x18 -1(x7) | x18 <= FFFFFFDE
FFD3C983 //LBU TEST START : lbu x19 -3(x7) | x19 <= 000000BE
FFD39A03 //LH TEST START : lh x20 -3(x7) | NO WRITE IN REGISTER !
FFA39A03 // lh x20 -6(x7) | x20 <= FFFFDEAD
FFD3DA83 //LHU TEST START : lhu x21 -3(x7) | NO WRITE IN REGISTER !
FFA3DA83 // lhu x21 -6(x7) | x21 <= 0000DEAD
And a bunch of assertions :
# test_cpu.py
# ...
#################
# PARTIAL LOADS
# 01000393 //LB TEST START : addi x7 x0 0x10
# FFF3A903 // lw x18 -1(x7)
# FFF38903 // lb x18 -1(x7)
# FFD3C983 //LBU TEST START : lbu x19 -3(x7)
# FFD39A03 //LH TEST START : lh x20 -3(x7)
# FFA39A03 // lh x20 -6(x7)
# FFD3DA83 //LHU TEST START : lhu x21 -3(x7)
# FFA3DA83 // lhu x21 -6(x7)
##################
print("\n\nTESTING LB\n\n")
# Check test's init state
assert binary_to_hex(dut.instruction.value) == "01000393"
await RisingEdge(dut.clk) # addi x7 x0 0x10
assert binary_to_hex(dut.regfile.registers[7].value) == "00000010"
assert binary_to_hex(dut.regfile.registers[18].value) == "FFF8FF00"
await RisingEdge(dut.clk) # lw x18 -1(x7)
assert binary_to_hex(dut.regfile.registers[18].value) == "FFF8FF00"
await RisingEdge(dut.clk) # lb x18 -1(x7)
assert binary_to_hex(dut.regfile.registers[18].value) == "FFFFFFDE"
await RisingEdge(dut.clk) # lbu x19 -3(x7)
assert binary_to_hex(dut.regfile.registers[19].value) == "000000BE"
await RisingEdge(dut.clk) # lh x20 -3(x7)
assert binary_to_hex(dut.regfile.registers[20].value) == "0FFFFEEF"
await RisingEdge(dut.clk) # lh x20 -6(x7)
assert binary_to_hex(dut.regfile.registers[20].value) == "FFFFDEAD"
await RisingEdge(dut.clk) # lhu x21 -3(x7)
assert binary_to_hex(dut.regfile.registers[21].value) == "FFFFFFEE"
await RisingEdge(dut.clk) # lhu x21 -6(x7)
assert binary_to_hex(dut.regfile.registers[21].value) == "0000DEAD"
This time, we test them all at once as tthe principle is the same as what we did until now.
Well... Looks like we implemented all of the RV32I instruction set ! But is it the end ? Well yes and no.
After making this course on my side, I looked into implementing this core on fpga. The thing is that if I wanted to do so, the core needed to be... well... a bit more profesionnal !
This is beacause using external memory etc raises the need to building interfaces, which is easier to do using system verilog fancy notations.
So the first task is to switch to actual systemverilog ! Until now I used icarus verilog as a simulator, meaning I have to switch to verilator to have full system verilog support. (On your side, the setup file told you to use verilator so you shouldn't have to switch.)
Once this is done, we can actually start doing a bit od cleaning to make the core more "pro".
Here is what we'll do :
- Make a config file with
- Standardized signals, e.g. we'll use a op-code using a variable name instead of hardcoded 7 bits
- Standardized write_back signal type (a data with a valid signal)
- and more, the goal is to avoid hardcoded values and group signals than can be grouped toghether
For the hardcoded signals part, I described the main ones in the package :
// holy_core_pkg.sv
`timescale 1ns/1ps
package holy_core__pkg;
// INSTRUCTION OP CODES
typedef enum logic [6:0] {
OPCODE_R_TYPE = 7'b0110011,
OPCODE_I_TYPE_ALU = 7'b0010011,
OPCODE_I_TYPE_LOAD = 7'b0000011,
OPCODE_S_TYPE = 7'b0100011,
OPCODE_B_TYPE = 7'b1100011,
OPCODE_U_TYPE_LUI = 7'b0110111,
OPCODE_U_TYPE_AUIPC = 7'b0010111,
OPCODE_J_TYPE = 7'b1101111,
OPCODE_J_TYPE_JALR = 7'b1100111
} opcode_t;
// ALU OPs for ALU DECODER
typedef enum logic [1:0] {
ALU_OP_LOAD_STORE = 2'b00,
ALU_OP_BRANCHES = 2'b01,
ALU_OP_MATH = 2'b10
} alu_op_t ;
// "MATH" F3 (R&I Types)
typedef enum logic [2:0] {
F3_ADD_SUB = 3'b000,
F3_SLL = 3'b001,
F3_SLT = 3'b010,
F3_SLTU = 3'b011,
F3_XOR = 3'b100,
F3_SRL_SRA = 3'b101,
F3_OR = 3'b110,
F3_AND = 3'b111
} funct3_t;
// BRANCHES F3
typedef enum logic [2:0] {
F3_BEQ = 3'b000,
F3_BNE = 3'b001,
F3_BLT = 3'b100,
F3_BGE = 3'b101,
F3_BLTU = 3'b110,
F3_BGEU = 3'b111
} branch_funct3_t;
// LOAD & STORES F3
typedef enum logic [2:0] {
F3_WORD = 3'b010,
F3_BYTE = 3'b000,
F3_BYTE_U = 3'b100,
F3_HALFWORD = 3'b001,
F3_HALFWORD_U = 3'b101
} load_store_funct3_t;
// F7 for shifts
typedef enum logic [6:0] {
F7_SLL_SRL = 7'b0000000,
F7_SRA = 7'b0100000
} shifts_f7_t;
// F7 for R-Types
typedef enum logic [6:0] {
F7_ADD = 7'b0000000,
F7_SUB = 7'b0100000
} rtype_f7_t;
// ALU control arithmetic
typedef enum logic [3:0] {
ALU_ADD = 4'b0000,
ALU_SUB = 4'b0001,
ALU_AND = 4'b0010,
ALU_OR = 4'b0011,
ALU_SLL = 4'b0100,
ALU_SLT = 4'b0101,
ALU_SRL = 4'b0110,
ALU_SLTU = 4'b0111,
ALU_XOR = 4'b1000,
ALU_SRA = 4'b1001
} alu_control_t;
endpackage
And we them them around the files, check out the setup manual file for directions on howw to include the package and make it work with verilator on cocotb.
Here is a usage example :
// control.sv
// ...
case (alu_op)
// LW, SW
ALU_OP_LOAD_STORE : alu_control = ALU_ADD;
// R-Types, I-types
ALU_OP_MATH : begin
case (func3)
// ADD (and later SUB with a different F7)
F3_ADD_SUB : begin
// 2 scenarios here :
// - R-TYPE : either add or sub and we need to a check for that
// - I-Type : aadi -> we use add arithmetic
if(op == 7'b0110011) begin // R-type
alu_control = (func7 == F7_SUB)? ALU_SUB : ALU_ADD;
end else begin // I-Type
alu_control = ALU_ADD;
end
end
// AND
F3_AND : alu_control = ALU_AND;
// OR
F3_OR : alu_control = ALU_OR;
// SLT, SLTI
F3_SLT: alu_control = ALU_SLT;
// SLTU, SLTIU
F3_SLTU : alu_control = ALU_SLTU;
// XOR
F3_XOR : alu_control = ALU_XOR;
// SLL
F3_SLL : alu_control = ALU_SLL;
// SRL, SRA
F3_SRL_SRA : begin
if(func7 == F7_SLL_SRL) begin
alu_control = ALU_SRL; // srl
end else if (func7 == F7_SRA) begin
alu_control = ALU_SRA; // sra
end
end
endcase
// ...
This is the end of this course ! We made a Core run on simulation, the next step is to actually use it on an FPGA !
See you in the fpga edition of this course !