
Commit fc22e07
pulp tutorial adjust sections to have table of contents in gh pages. new doc and pulp tutorial finished.
xiaywang committed Jan 23, 2022
1 parent f225589 commit fc22e07
Showing 2 changed files with 15 additions and 15 deletions.
2 changes: 1 addition & 1 deletion docs/troubles.md
@@ -3,7 +3,7 @@
This page is a collection of possible solutions.


-# GTKWave
+## GTKWave

If you are trying to generate the traces to be visualized with GTKWave on Mr. Wolf and you are getting

28 changes: 14 additions & 14 deletions tutorial/README.md
@@ -1,4 +1,4 @@
-# Introduction
+# PULP Tutorial

In this tutorial you will learn how to use the pulp-dsp library and how to write and optimize code for PULP platforms.

@@ -12,13 +12,13 @@ If the user wants to use a parallel implementation, they can call the glue code

The kernel functions are specific to the underlying ISA, so they can also be used for processors other than Mr. Wolf whose cores implement the same ISA.
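
As a rough illustration of how the glue code is meant to be called (function name and signature assumed from the pulp-dsp naming convention; check the library headers), a 32-bit integer dot product could look like this:

```c
#include "rt/rt_api.h"   // pulp-sdk v1 runtime
#include "plp_math.h"    // pulp-dsp public header (name assumed)

#define LENGTH 80

static int32_t a[LENGTH], b[LENGTH];

void compute_dot_prod(void)
{
    int32_t result;
    /* Glue code: dispatches to the Ibex (RV32IMC) kernel or to the
       cluster (RV32IMCXpulpV2) kernel depending on where it runs. */
    plp_dot_prod_i32(a, b, LENGTH, &result);
}
```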

-# Requirements
+## Requirements

Follow the instructions under `Installation and usage` on the home page to install pulp-sdk and pulp-dsp.

For Mr. Wolf, we will use the [v1 branch](https://github.com/pulp-platform/pulp-sdk/tree/v1) of the pulp-sdk. The 'new' sdk on the [main branch](https://github.com/pulp-platform/pulp-sdk) works similarly. For details, please refer to the respective pulp-sdk documentation.

-# Configurations
+## Configurations

After you have installed the pulp-sdk and pulp-dsp within it, you can start developing your DSP application.

@@ -37,7 +37,7 @@ and activate the configurations:
source sourceme.sh
~~~~~

-# Baseline
+## Baseline

Go to the `baseline` folder. It contains the code for computing the dot product on the FC (fabric controller) of Mr. Wolf, which has an Ibex core featuring the basic RV32IMC ISA.

@@ -49,17 +49,17 @@ make clean all run

The vector length is 80. The values are 32-bit integers. The computation takes around 806 cycles and 431 instructions. Refer to `main.c` to learn how to use the performance counter. Note that on GVSoC you can use as many counters as you want, while only one HW counter exists on the board.
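
For reference, the typical performance-counter pattern with the v1 sdk runtime (pulp-rt) looks roughly like the sketch below; compare it with the `main.c` in the folder, since the exact configuration there may differ:

```c
#include <stdio.h>
#include "rt/rt_api.h"

void measure(void)
{
    rt_perf_t perf;
    rt_perf_init(&perf);
    /* Count cycles and executed instructions. On the board only one HW
       counter exists, so there you may have to measure them in two runs. */
    rt_perf_conf(&perf, (1 << RT_PERF_CYCLES) | (1 << RT_PERF_INSTR));

    rt_perf_reset(&perf);
    rt_perf_start(&perf);
    /* ... code to be measured, e.g. the dot product call ... */
    rt_perf_stop(&perf);

    printf("cycles: %d, instructions: %d\n",
           rt_perf_read(RT_PERF_CYCLES), rt_perf_read(RT_PERF_INSTR));
}
```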

-# Single cluster core
+## Single cluster core

-## Basic version
+### Basic version

Go to the folder `cluster_single_basic`, compile and run the code as before.

If the cluster accesses the data in the L2 memory of the SoC, the cycle count is very high (2165 cycles, 277 instructions), because every access to L2 takes many cycles.

You have to store the data in L1 to exploit the fast TCDM memory, with single-cycle access and a one-cycle load latency that is hidden by the loop unrolling. Toggle the `#define L2DATA` line in `defines.h` so that the data are placed in L1 and run again. You will see that the number of cycles is significantly reduced (394 cycles, 277 instructions).
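
With pulp-rt the placement is controlled by data attributes; a minimal sketch of what the `L2DATA` switch presumably selects (macro names from pulp-rt, to be checked against the example's `defines.h` and `main.c`):

```c
#include "rt/rt_api.h"

#define LENGTH 80

#ifdef L2DATA
/* Data in L2: every access from the cluster crosses the SoC interconnect. */
RT_L2_DATA int32_t a[LENGTH], b[LENGTH];
#else
/* Data in L1 (TCDM): single-cycle access from the cluster cores. */
RT_L1_DATA int32_t a[LENGTH], b[LENGTH];
#endif
```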

-### Generate assembly code
+#### Generate assembly code

Compared to the execution on Ibex, the number of cycles is halved, thanks to the hardware loop (`lp.setup`) and the post-incrementing loads and stores (marked by `!`). You can visualize these instructions by generating the assembly code:

@@ -87,14 +87,14 @@ and visualize the file using a text editor. If you search for `plp_dot_prod_` yo

You can refer to the manuals of the ISA extensions for details about the instructions.

-### IPC and MACs/cycle
+#### IPC and MACs/cycle

Ideally, the instructions per cycle (IPC) would be 1 and the asymptotic throughput would be 2/6 = 0.33 MACs/cycle (MAC = multiply-and-accumulate operation): in the unrolled inner loop, every 6 instructions (four loads and two MACs) perform 2 MACs.

Our execution has IPC=277/394=0.7 and 80/394=0.20 MACs/cycle. We computed the dot product of 80 elements. With more elements, the IPC and the MACs/cycle will increase and approach the ideal case. You can try to calculate the dot product of 800 elements. The IPC will be around 0.95 and the MACs/cycle around 0.31.


-## Transfer the data with DMA
+### Transfer the data with DMA

The L1 memory is much smaller (64kB on Mr. Wolf) than the L2 (512kB). This means that you will need to transfer data back and forth during the computation when your data size is too big.

@@ -103,7 +103,7 @@ You can use the DMA to transfer the data between L1 and L2.
Go to the `cluster_single_dma` folder and check how this is done.
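
A sketch of the pulp-rt DMA calls (buffer names are illustrative; the exact code is in the `cluster_single_dma` example):

```c
#include "rt/rt_api.h"

#define LENGTH 80

RT_L2_DATA int32_t a_l2[LENGTH];   // source data in L2
RT_L1_DATA int32_t a_l1[LENGTH];   // destination buffer in L1 (TCDM)

void fetch_input(void)
{
    rt_dma_copy_t copy;
    /* Start an L2 -> L1 transfer (external to local) and wait for it. */
    rt_dma_memcpy((unsigned int)a_l2, (unsigned int)a_l1, sizeof(a_l2),
                  RT_DMA_DIR_EXT2LOC, 0, &copy);
    rt_dma_wait(&copy);
}
```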


-## SIMD instructions
+### SIMD instructions

The CV32E40P core with the RV32IMCXPulpV2 extensions provides Single Instruction, Multiple Data (SIMD) instructions. If you have 8-bit integers, you can compute the dot product of 4 value pairs in a single cycle with a single instruction.

@@ -119,7 +119,7 @@ Go to `cluster_single_simd` folder and run the code. You will see around 221 cyc
1c008674: b8f71e57 pv.sdotsp.b t3,a4,a5
```
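
Inside the kernels this is typically written with GCC vector types and PULP builtins; the sketch below assumes the `__builtin_pulp_sdotsp4` builtin of the PULP GCC toolchain, which the compiler maps to `pv.sdotsp.b` (the vector length is assumed to be a multiple of 4):

```c
#include <stdint.h>

/* 4 signed 8-bit values packed into one 32-bit register. */
typedef int8_t v4s __attribute__((vector_size(4)));

int32_t dot_prod_i8(const int8_t *a, const int8_t *b, uint32_t n)
{
    int32_t acc = 0;
    for (uint32_t i = 0; i < n / 4; i++) {
        v4s va = *((v4s *)&a[4 * i]);
        v4s vb = *((v4s *)&b[4 * i]);
        /* 4 multiplications plus accumulation in a single pv.sdotsp.b
           instruction (builtin name assumed from the PULP toolchain). */
        acc = __builtin_pulp_sdotsp4(va, vb, acc);
    }
    return acc;
}
```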

-# Parallel computation using multiple cluster cores
+## Parallel computation using multiple cluster cores

You can compute the dot product using all the cores in the cluster. The library also provides functions for parallel computation.
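
For instance, the parallel glue code for the 32-bit dot product takes the number of processing elements as an additional argument (function name and signature assumed from the pulp-dsp naming convention; check `plp_math.h`). It must be called from the cluster side:

```c
#include "rt/rt_api.h"
#include "plp_math.h"    // pulp-dsp public header (name assumed)

#define LENGTH 80

RT_L1_DATA int32_t a[LENGTH], b[LENGTH];

void cluster_entry(void *arg)
{
    int32_t result;
    /* Distribute the dot product over the 8 cluster cores of Mr. Wolf. */
    plp_dot_prod_i32_parallel(a, b, LENGTH, 8, &result);
}
```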

@@ -135,7 +135,7 @@ Note that the L1 memory is limited, hence you will need to use DMA if the datasi
4. at the same time transfer a second chunk;
5. once the second chunk is transferred, do the computation on it while transferring the next chunk (a sketch of this double-buffering loop is shown below).
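
A minimal sketch of such a double-buffering loop, reusing the pulp-rt DMA calls from above (the chunk size and the `compute_chunk` helper are illustrative):

```c
#include "rt/rt_api.h"

#define CHUNK 1024   // elements per chunk; two chunks must fit in L1

RT_L1_DATA int32_t buf[2][CHUNK];   // compute on one buffer while the DMA fills the other

/* Placeholder for the real work, e.g. a (parallel) pulp-dsp call. */
static void compute_chunk(const int32_t *data, uint32_t len) { (void)data; (void)len; }

void process(const int32_t *l2_data, uint32_t n_chunks)
{
    rt_dma_copy_t copy[2];

    /* Fetch the first chunk. */
    rt_dma_memcpy((unsigned int)&l2_data[0], (unsigned int)buf[0],
                  CHUNK * sizeof(int32_t), RT_DMA_DIR_EXT2LOC, 0, &copy[0]);

    for (uint32_t c = 0; c < n_chunks; c++) {
        uint32_t cur = c & 1, nxt = cur ^ 1;

        /* Start fetching the next chunk while the current one is processed. */
        if (c + 1 < n_chunks)
            rt_dma_memcpy((unsigned int)&l2_data[(c + 1) * CHUNK],
                          (unsigned int)buf[nxt], CHUNK * sizeof(int32_t),
                          RT_DMA_DIR_EXT2LOC, 0, &copy[nxt]);

        rt_dma_wait(&copy[cur]);        // wait until the current chunk has arrived
        compute_chunk(buf[cur], CHUNK);
    }
}
```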

-# Visualize traces
+## Visualize traces

The GVSoC virtual platform allows dumping architecture events to help developers debug their applications by showing in detail what is happening in the system. For example, it can show instructions being executed, DMA transfers, events generated, memory accesses and so on. The generated traces can be visualized with GTKWave, which you can install from its official website.

@@ -156,7 +156,7 @@ You can visualize the status of each core, how long each of them spend doing wha
![gtkwave example](/tutorial/gtkwave_screenshot.png)


-# Optimizations
+## Optimizations

In the `Signals` window, select `chip/fc/func`. In the `Wave` window, zoom in and out, find `__rt_init_cluster_data` at the level of `func`, put the cursor on it and click; a vertical red marker appears. You can then jump to the beginning or end of the function with the left and right arrow keys, respectively. Under `overview/soc` you can now read the cycle count.

@@ -170,7 +170,7 @@ Another part which takes long cycles can be seen under `chip/cluster/pe_0` with

In real applications, the optimization methods vary case by case. For example, you might acquire data from some sensors and first store them in L2 memory. If the cluster then has to run compute-intensive tasks on these data, you can declare buffers with `RT_L1_BSS`, transfer the data from L2 to L1 with the DMA (possibly applying the double-buffering technique), and do the computations in parallel on the cluster cores.

-# Contributions
+## Contributions

This library provides many optimized functions. However, many more remain to be optimized, and you are very welcome to contribute by applying the techniques learned in this tutorial. Find more details on how to contribute in the `README.md` on the home page. Have fun!

