201 - Coprocessor

All the code that you write has to be executed on the processor. With computers being ever more powerful, it is easy to forget about all the work a processor is doing. Irrespective of how simple a job might seem, it still needs to be executed. The example illustrates this. The only thing the processor needs to do is output 2 values. Nonetheless with the software as seen before, this takes 74 clock cycles.

#include "print.h"

void main(void) {
	print_str(";");
	print_str(".");
}

example

Offloading

When a processor is doing a lot of work, it can become useful if parts of that work could be delegated or offloaded to another processor. Which portions of the work need to be offloaded ? What will the performance gain be ? What is the price (€s, area in silicon, energy, … ) ? These are simple questions, but answering them is not straightforward !!

The first coprocessors saw the light of day in the 1970’s. It became clear that only doing calculations with integer numbers was to restrictive. The first coprocessors were floating-point units (FPUs). These coprocessor were so heavily used that their functionality got integrated in the processor itself.

Multiplication

The PicoRV32 implementation we’ve used so far only supports the RV32I instruction set. This means that only the basic integer operations are supported. Although the instruction set does not contain a multiplication operation, it can be used nonetheless.

#include "print.h"

void main(void) {
    volatile unsigned int value1, value2, product;

    value1 = 208;
    value2 = 3;

    product = value1 * value2;

    print_dec(product);
}


Running this C-code generates an output .dat file. After parsing the output looks like this.

00000000000000000000000000110110 - 054 - 0x36 - 6
00000000000000000000000000110010 - 050 - 0x32 - 2
00000000000000000000000000110100 - 052 - 0x34 - 4
The value 624 is the product of the (hardcoded) values 208 and 3.

The reason that this works without having a mul instruction is because of the compiler jumps. The compiler figures out what needs to be done and comes up with a recipe to achieve what the code prescribes.

00000174 <__mulsi3>:
 174:   00050613        mv  a2,a0
 178:   00000513        li  a0,0
 17c:   0015f693        andi    a3,a1,1
 180:   00068463        beqz    a3,188 <__mulsi3+0x14>
 184:   00c50533        add a0,a0,a2
 188:   0015d593        srli    a1,a1,0x1
 18c:   00161613        slli    a2,a2,0x1
 190:   fe0596e3        bnez    a1,17c <__mulsi3+0x8>
 194:   00008067        ret

build

00000274 <main>:
 274:   fe010113        addi    sp,sp,-32
 278:   0d000793        li  a5,208
 27c:   00f12223        sw  a5,4(sp)
 280:   00300793        li  a5,3
 284:   00f12423        sw  a5,8(sp)
 288:   00412503        lw  a0,4(sp)
 28c:   00812583        lw  a1,8(sp)
 290:   00112e23        sw  ra,28(sp)
 294:   ee1ff0ef        jal ra,174 <__mulsi3>

Changing one letter in the Makefile allows the compiler use the mul instruction.

ARCHITECTURE = rv32i$(subst C,c,$(COMPRESSED_ISA))
to
ARCHITECTURE = rv32im$(subst C,c,$(COMPRESSED_ISA))


00000168 <main>:
 168:   ff010113            addi    sp,sp,-16
 16c:   0d000793            li  a5,208
 170:   00f12223            sw  a5,4(sp)
 174:   00300793            li  a5,3
 178:   00f12423            sw  a5,8(sp)
 17c:   00412783            lw  a5,4(sp)
 180:   00812703            lw  a4,8(sp)
 184:   02e787b3            mul a5,a5,a4
 188:   00f12623            sw  a5,12(sp)
 18c:   00c12503            lw  a0,12(sp)
 190:   01010113            addi    sp,sp,16
 194:   f29ff06f            j   bc <print_dec>