
Commit fc22e07
pulp tutorial adjust sections to have table of contents in gh pages. new doc and pulp tutorial finished.
xiaywang committed Jan 23, 2022
1 parent f225589 commit fc22e07
Showing 2 changed files with 15 additions and 15 deletions.
2 changes: 1 addition & 1 deletion docs/troubles.md
@@ -3,7 +3,7 @@
This page is a collection of possible solutions.


-# GTKWave
+## GTKWave

If you are trying to generate the traces to be visualized with GTKWave on Mr. Wolf and you are getting

28 changes: 14 additions & 14 deletions tutorial/README.md
@@ -1,4 +1,4 @@
-# Introduction
+# PULP Tutorial

In this tutorial you will learn how to use the pulp-dsp library and how to write and optimize code for PULP platforms.

@@ -12,13 +12,13 @@ If the user wants to use a parallel implementation, they can call the glue code

The kernel functions are specific to the underlying ISA, so they can also be used for processors other than Mr. Wolf whose cores implement the same ISA.
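
As a rough illustration of how the glue code is meant to be called (function name and signature assumed from the pulp-dsp naming convention; check the library headers), a 32-bit integer dot product could look like this:

```c
#include "rt/rt_api.h"   // pulp-sdk v1 runtime
#include "plp_math.h"    // pulp-dsp public header (name assumed)

#define LENGTH 80

static int32_t a[LENGTH], b[LENGTH];

void compute_dot_prod(void)
{
    int32_t result;
    /* Glue code: dispatches to the Ibex (RV32IMC) kernel or to the
       cluster (RV32IMCXpulpV2) kernel depending on where it runs. */
    plp_dot_prod_i32(a, b, LENGTH, &result);
}
```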

-# Requirements
+## Requirements

Follow the instructions under `Installation and usage` on the home page to install pulp-sdk and pulp-dsp.

For Mr. Wolf, we will use the [v1 branch](https://github.com/pulp-platform/pulp-sdk/tree/v1) of the pulp-sdk. The 'new' sdk on the [main branch](https://github.com/pulp-platform/pulp-sdk) works similarly. For details, please refer to the respective pulp-sdk documentation.

-# Configurations
+## Configurations

After you have installed the pulp-sdk and pulp-dsp within it, you can start developing your DSP application.

@@ -37,7 +37,7 @@ and activate the configurations:
source sourceme.sh
~~~~~

-# Baseline
+## Baseline

Go to the `baseline` folder. It contains the code for computing the dot product on the FC (fabric controller) of Mr. Wolf, which has an Ibex core featuring the basic RV32IMC ISA.

@@ -49,17 +49,17 @@ make clean all run

The vector length is 80. The values are 32-bit integers. The computation takes around 806 cycles and 431 instructions. Refer to `main.c` to learn how to use the performance counter. Note that on GVSoC you can use as many counters as you want, while only one HW counter exists on the board.
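
For reference, the typical performance-counter pattern with the v1 sdk runtime (pulp-rt) looks roughly like the sketch below; compare it with the `main.c` in the folder, since the exact configuration there may differ:

```c
#include <stdio.h>
#include "rt/rt_api.h"

void measure(void)
{
    rt_perf_t perf;
    rt_perf_init(&perf);
    /* Count cycles and executed instructions. On the board only one HW
       counter exists, so there you may have to measure them in two runs. */
    rt_perf_conf(&perf, (1 << RT_PERF_CYCLES) | (1 << RT_PERF_INSTR));

    rt_perf_reset(&perf);
    rt_perf_start(&perf);
    /* ... code to be measured, e.g. the dot product call ... */
    rt_perf_stop(&perf);

    printf("cycles: %d, instructions: %d\n",
           rt_perf_read(RT_PERF_CYCLES), rt_perf_read(RT_PERF_INSTR));
}
```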

-# Single cluster core
+## Single cluster core

-## Basic version
+### Basic version

Go to the folder `cluster_single_basic`, compile and run the code as before.

If the cluster accesses the data in the L2 memory of the SoC, the cycle count is very high (2165 cycles, 277 instructions), because every access to L2 takes many cycles.

You have to store the data in L1 to exploit the fast TCDM memory, with single-cycle access and a one-cycle load latency that is hidden by the loop unrolling. Toggle the `#define L2DATA` line in `defines.h` so that the data are placed in L1 and run again. You will see that the number of cycles is significantly reduced (394 cycles, 277 instructions).
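
With pulp-rt the placement is controlled by data attributes; a minimal sketch of what the `L2DATA` switch presumably selects (macro names from pulp-rt, to be checked against the example's `defines.h` and `main.c`):

```c
#include "rt/rt_api.h"

#define LENGTH 80

#ifdef L2DATA
/* Data in L2: every access from the cluster crosses the SoC interconnect. */
RT_L2_DATA int32_t a[LENGTH], b[LENGTH];
#else
/* Data in L1 (TCDM): single-cycle access from the cluster cores. */
RT_L1_DATA int32_t a[LENGTH], b[LENGTH];
#endif
```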

-### Generate assembly code
+#### Generate assembly code

Compared to the execution on Ibex, the number of cycles is halved, thanks to the hardware loop (`lp.setup`) and the post-incrementing loads and stores (marked by `!`). You can visualize these instructions by generating the assembly code:

@@ -87,14 +87,14 @@ and visualize the file using a text editor. If you search for `plp_dot_prod_` yo

You can refer to the manuals of the ISA extensions for details about the instructions.

-### IPC and MACs/cycle
+#### IPC and MACs/cycle

Ideally, the instructions per cycle (IPC) would be 1 and the asymptotic throughput would be 2/6 = 0.33 MACs/cycle (MAC = multiply-and-accumulate operation): in the unrolled inner loop, every 6 instructions (four loads and two MACs) perform 2 MACs.

Our execution has IPC=277/394=0.7 and 80/394=0.20 MACs/cycle. We computed the dot product of 80 elements. With more elements, the IPC and the MACs/cycle will increase and approach the ideal case. You can try to calculate the dot product of 800 elements. The IPC will be around 0.95 and the MACs/cycle around 0.31.


-## Transfer the data with DMA
+### Transfer the data with DMA

The L1 memory is much smaller (64kB on Mr. Wolf) than the L2 (512kB). This means that you will need to transfer data back and forth during the computation when your data size is too big.

@@ -103,7 +103,7 @@ You can use the DMA to transfer the data between L1 and L2.
Go to the `cluster_single_dma` folder and check how this is done.
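
A sketch of the pulp-rt DMA calls (buffer names are illustrative; the exact code is in the `cluster_single_dma` example):

```c
#include "rt/rt_api.h"

#define LENGTH 80

RT_L2_DATA int32_t a_l2[LENGTH];   // source data in L2
RT_L1_DATA int32_t a_l1[LENGTH];   // destination buffer in L1 (TCDM)

void fetch_input(void)
{
    rt_dma_copy_t copy;
    /* Start an L2 -> L1 transfer (external to local) and wait for it. */
    rt_dma_memcpy((unsigned int)a_l2, (unsigned int)a_l1, sizeof(a_l2),
                  RT_DMA_DIR_EXT2LOC, 0, &copy);
    rt_dma_wait(&copy);
}
```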


-## SIMD instructions
+### SIMD instructions

The CV32E40P core with the RV32IMCXPulpV2 extensions provides Single Instruction, Multiple Data (SIMD) instructions. If you have 8-bit integers, you can compute the dot product of 4 value pairs in a single cycle with a single instruction.

@@ -119,7 +119,7 @@ Go to `cluster_single_simd` folder and run the code. You will see around 221 cyc
1c008674: b8f71e57 pv.sdotsp.b t3,a4,a5
```
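
Inside the kernels this is typically written with GCC vector types and PULP builtins; the sketch below assumes the `__builtin_pulp_sdotsp4` builtin of the PULP GCC toolchain, which the compiler maps to `pv.sdotsp.b` (the vector length is assumed to be a multiple of 4):

```c
#include <stdint.h>

/* 4 signed 8-bit values packed into one 32-bit register. */
typedef int8_t v4s __attribute__((vector_size(4)));

int32_t dot_prod_i8(const int8_t *a, const int8_t *b, uint32_t n)
{
    int32_t acc = 0;
    for (uint32_t i = 0; i < n / 4; i++) {
        v4s va = *((v4s *)&a[4 * i]);
        v4s vb = *((v4s *)&b[4 * i]);
        /* 4 multiplications plus accumulation in a single pv.sdotsp.b
           instruction (builtin name assumed from the PULP toolchain). */
        acc = __builtin_pulp_sdotsp4(va, vb, acc);
    }
    return acc;
}
```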

-# Parallel computation using multiple cluster cores
+## Parallel computation using multiple cluster cores

You can compute the dot product using all the cores in the cluster. The library also provides functions for parallel computation.
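
For instance, the parallel glue code for the 32-bit dot product takes the number of processing elements as an additional argument (function name and signature assumed from the pulp-dsp naming convention; check `plp_math.h`). It must be called from the cluster side:

```c
#include "rt/rt_api.h"
#include "plp_math.h"    // pulp-dsp public header (name assumed)

#define LENGTH 80

RT_L1_DATA int32_t a[LENGTH], b[LENGTH];

void cluster_entry(void *arg)
{
    int32_t result;
    /* Distribute the dot product over the 8 cluster cores of Mr. Wolf. */
    plp_dot_prod_i32_parallel(a, b, LENGTH, 8, &result);
}
```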

@@ -135,7 +135,7 @@ Note that the L1 memory is limited, hence you will need to use DMA if the datasi
4. at the same time transfer a second chunk;
5. once the second chunk is transferred, do the computation on it while transferring the next chunk (a sketch of this double-buffering loop is shown below).
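
A minimal sketch of such a double-buffering loop, reusing the pulp-rt DMA calls from above (the chunk size and the `compute_chunk` helper are illustrative):

```c
#include "rt/rt_api.h"

#define CHUNK 1024   // elements per chunk; two chunks must fit in L1

RT_L1_DATA int32_t buf[2][CHUNK];   // compute on one buffer while the DMA fills the other

/* Placeholder for the real work, e.g. a (parallel) pulp-dsp call. */
static void compute_chunk(const int32_t *data, uint32_t len) { (void)data; (void)len; }

void process(const int32_t *l2_data, uint32_t n_chunks)
{
    rt_dma_copy_t copy[2];

    /* Fetch the first chunk. */
    rt_dma_memcpy((unsigned int)&l2_data[0], (unsigned int)buf[0],
                  CHUNK * sizeof(int32_t), RT_DMA_DIR_EXT2LOC, 0, &copy[0]);

    for (uint32_t c = 0; c < n_chunks; c++) {
        uint32_t cur = c & 1, nxt = cur ^ 1;

        /* Start fetching the next chunk while the current one is processed. */
        if (c + 1 < n_chunks)
            rt_dma_memcpy((unsigned int)&l2_data[(c + 1) * CHUNK],
                          (unsigned int)buf[nxt], CHUNK * sizeof(int32_t),
                          RT_DMA_DIR_EXT2LOC, 0, &copy[nxt]);

        rt_dma_wait(&copy[cur]);        // wait until the current chunk has arrived
        compute_chunk(buf[cur], CHUNK);
    }
}
```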

-# Visualize traces
+## Visualize traces

The GVSoC virtual platform allows dumping architecture events to help developers debug their applications by showing in detail what is happening in the system. For example, it can show instructions being executed, DMA transfers, events generated, memory accesses and so on. The generated traces can be visualized with GTKWave, which you can install from its official website.

@@ -156,7 +156,7 @@ You can visualize the status of each core, how long each of them spend doing wha
![gtkwave example](/tutorial/gtkwave_screenshot.png)


-# Optimizations
+## Optimizations

In the `Signals` window, select `chip/fc/func`. In the `Wave` window, zoom in and out, find `__rt_init_cluster_data` at the level of `func`, put the cursor on it and click; a vertical red marker appears. You can then jump to the beginning or end of the function with the left and right arrow keys, respectively. Under `overview/soc` you can now read the cycle count.

@@ -170,7 +170,7 @@ Another part which takes long cycles can be seen under `chip/cluster/pe_0` with

In real applications, the optimization methods vary case by case. For example, you might acquire data from some sensors and first store them in L2 memory. If the cluster then has to run compute-intensive tasks on these data, you can declare buffers with `RT_L1_BSS`, transfer the data from L2 to L1 with the DMA (possibly applying the double-buffering technique), and do the computations in parallel on the cluster cores.

-# Contributions
+## Contributions

This library provides many optimized functions. However, many more remain to be optimized, and you are very welcome to contribute by applying the techniques learned in this tutorial. Find more details on how to contribute in the `README.md` on the home page. Have fun!

