= NextSilicon Notes
:doctype: book
:toc:
:icons:
:sectlinks:
:source-highlighter: pygments
== Introduction
This document collects scattered information about how to install and configure a NextSilicon card.
It also contains a brief overview of the software stack used in the NextSilicon system.
Section headings are also links to the original documents.
Other important documents not included herein are:
* https://userdocs.nextsilicon.com/en/latest/users/overview/[NextSilicon User Guide]
* https://userdocs.nextsilicon.com/en/latest/UIUX/launcher/[GUI Applications Guide]
* https://userdocs.nextsilicon.com/en/latest/troubleshooting/known-issues/[Troubleshooting]
* https://userdocs.nextsilicon.com/en/latest/reference/glossary/[Glossary]
* https://userdocs.nextsilicon.com/en/latest/release/overview/[Release Notes]
There are four key procedures for setting up the NextSilicon system: xref:hardware_installation[hardware installation],
xref:software_installation[software installation],
xref:installation_verification[installation verification],
and xref:runtime_configuration[runtime configuration].
[[hardware_installation]]
== https://userdocs.nextsilicon.com/en/latest/setup/HWinstall/[Hardware Installation]
Directions on how to install and test the NextSilicon Maverick PCIe card in a single-card-per-server environment.
=== Package Contents
The following list covers everything included in the package delivery. Upon opening the package, verify that all items are present, and contact your NextSilicon support team if the contents differ from this list:
* Maverick NXT10500KV148R64GB PCIe module
* PCIe Gen 5 12VHPWR to EPS-12V power cable adapter-splitter with two 8-pin connectors
=== Installation Prerequisites
Before beginning installation of the NextSilicon Maverick PCIe card, ensure you have the following prerequisites:
* A server with at least 4 cores and 64 GB RAM (see supported servers list below)
* 5 GB of disk space
* Available PCIe Gen 3, Gen 4, or Gen 5 x16-lane slot
* 300 W power delivery per card, and airflow to cool 300 W (at least 48 CFM air flow-through per card for 50 °C operation)
* PCIe 5 12VHPWR cable, or one or two EPS-12V 8-pin connectors and cables
*Note*: Certain applications may demand greater resources, so check your application’s specific requirements. We recommend at least 64 GB RAM per Maverick card on the server for running applications.
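Before moving on, it can be convenient to confirm the minimums above from a shell. The following is a minimal sketch using standard Linux tools; checking `/opt` for free space is an assumption based on the installation path used later in this document:
-----
# Quick check of the stated minimums using standard Linux tools.
nproc                                          # expect 4 or more cores
free -g | awk '/^Mem:/ {print $2 " GB RAM"}'   # expect 64 or more
df -h /opt                                     # expect at least 5 GB free
-----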
=== Pre-Installation Configuration
NextSilicon hardware requires:
* Minimum hardware requirements
** PCIe Gen 4
** 4 CPU cores
** 64 GB memory
* BIOS settings
** xref:above_4g_decoding[Above 4G decoding]
** Resize BAR support
** xref:iommu[IOMMU] support
* Operating system
** Debian 10 or RHEL 8.5
** Enable xref:iommu[IOMMU] in pass-through mode
** Set PCIe card cooling fan control to maximum
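After applying these settings, the IOMMU state can be confirmed from the running system. This is a generic sketch; the exact kernel parameters and `dmesg` strings vary by platform:
-----
# Confirm IOMMU kernel parameters and pass-through mode (output varies by platform).
grep -E 'iommu=pt|intel_iommu=on|amd_iommu=on' /proc/cmdline
sudo dmesg | grep -iE 'iommu|dmar'
-----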
=== Safety Precautions
Before beginning the installation, take these safety precautions:
* Shut down the server or computer and remove its power cable.
* Wait until internal heat dissipates before opening the cabinet.
* Prevent ESD (electrostatic discharge) by touching a grounded surface or wearing a grounded antistatic wrist strap.
=== PCIe Card Installation
Follow this procedure to install the Maverick card:
* Unplug the system’s power cable.
* Open the system cabinet.
* Find an empty double-wide, full-height, full-length PCIe x16 card slot with appropriate airflow.
* If the slot has a locking latch or retaining clip, open it.
* If the slot has a cover or guard plate, remove it.
* Carefully insert the Maverick card's connector into the slot. Press firmly to seat the card, placing your fingers on the card directly over the slot. Do not use excessive force.
* Close the locking latch or retaining clip, if present.
* Attach the system’s power cable to the power socket on the back of the Maverick card. Use the provided EPS-12V Maverick power cable adapter-splitter if needed.
** Connect the #4 connector (with four yellow wires) first. Up to 300 W will be drawn from the single 8-pin header.
** If necessary, connect the #2 connector (with two yellow wires) as well. Up to 200 W will be drawn from the #4 header, and up to 100 W from the #2 header.
* When the installation of the Maverick card is complete, close the cabinet.
*Note*: The cable adapter-splitter has two 8-pin CPU EPS-12V headers.
=== Low-Level Verification
Verify that the server recognizes the Maverick card:
-----
$ lspci -d cdfa:
01:00.0 Processing accelerators: Device cdfa:0007 (rev 01)
-----
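Beyond simple enumeration, the negotiated link speed and width can be inspected with standard `lspci` options; this is generic PCIe diagnostics, not a NextSilicon-specific procedure:
-----
# Show PCIe link capability and negotiated status for the card (vendor ID cdfa).
sudo lspci -vv -d cdfa: | grep -E 'LnkCap|LnkSta'
-----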
=== Installation Troubleshooting
*Warning*: To ensure safer and more stable use of the Maverick card, we recommend setting the default fan speed to 90% on your server. This will help prevent the Maverick card from overheating.
Please refer to the https://userdocs.nextsilicon.com/en/latest/troubleshooting/known-issues/[known issues]
and https://userdocs.nextsilicon.com/en/latest/troubleshooting/faqs/[FAQs] pages within the troubleshooting guide for more details.
== https://userdocs.nextsilicon.com/en/latest/setup/SWinstall/[Software Specifications]
Your guide to installing and testing the NextSilicon software stack and simulator on Debian- or RHEL-based distributions.
=== Supported OS
* Debian 10, kernel version 4.19.0-18 or higher
* RHEL 8.5, kernel version 4.18.0-348.12.2 or higher
=== Installation Prerequisites
* A server with at least 4 cores and 64 GB RAM, with a Maverick card installed according to the Hardware installation guide.
* OS:
** Debian 10 (kernel version 4.19.0-18 or higher)
** RHEL 8.5 (kernel version 4.18.0-348.12.2 or higher)
* Connection to an internal or external Debian 10 or RHEL 8.5 package repository
* If not connected to physical NextSilicon hardware, a virtual machine (VM) instance is supported for use with the NextSilicon simulator
* A Docker container is supported only when using the simulator, not with a Maverick card (see the example after this list). When using Docker:
** Increase the `/dev/shm` size by running Docker with `--shm-size=10G`
** Add the `SYS_PTRACE` capability by running Docker with `--cap-add=SYS_PTRACE`
** `$USER` must be a member of the `docker` group
* `sudo` privilege
* Internet connection (required for the installation only)
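For the Docker-based simulator setup above, the required options can be combined into a single invocation. This is a minimal sketch; `<simulator-image>` is a placeholder for whatever image you use:
-----
# Simulator-only: enlarge /dev/shm and allow ptrace inside the container.
docker run -it --shm-size=10G --cap-add=SYS_PTRACE <simulator-image> /bin/bash
-----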
=== Pre-Installation Configuration
Ensure that this OS option has been configured before software installation:
* VT-d/IOMMU (Input/Output Memory Management Unit), enabled in pass-through mode
[[software_installation]]
== https://userdocs.nextsilicon.com/en/latest/setup/SWinstall_RHEL/[Software Installation on RHEL-Based Distributions]
=== Install NextSilicon Dependencies
For RHEL-based distributions, you must first register your system and then enable the following repos using these commands.
Update the available repositories via:
-----
sudo subscription-manager repos --enable codeready-builder-for-rhel-8-x86_64-rpms
sudo subscription-manager repos --enable rhel-8-for-x86_64-baseos-rpms
-----
Install all required system package dependencies:
-----
sudo yum install -y binutils glibc-devel libuv patchelf libatomic mpfr file graphviz spawn-fcgi fcgi-devel nginx zlib kernel-debug-devel-$(uname -r) dkms kernel-devel-$(uname -r) kernel-headers-$(uname -r) cmake git wget
-----
=== Download
Download the NextSilicon software stack via:
-----
wget --user <NS-USER> --password <NS-PASSWORD> http://repo.nextsilicon.net/release/rhel-8/0.10.0/ns-sw-kit-rhel-8-0.10.0-308.tar.bz2
-----
Extract it with:
-----
tar xvf ns-sw-kit-rhel-8-0.10.0-308.tar.bz2
cd ns-sw-kit-0.10.0/rhel-8
-----
=== Install the DKMS Driver Packages
Install the drivers only when using the Maverick card; they are not needed for the simulator.
Compile, install, and load the `nextsi` and `nextuvm` drivers from the `rhel-8` subdirectory via:
-----
sudo rpm -Uvh nextsi-0.10.0-3383.x86_64.rpm
sudo rpm -Uvh nextuvm-0.10.0-3383.x86_64.rpm
sudo modprobe nextsi
sudo modprobe nextuvm
-----
=== Verify Driver Installation
Verify the driver status with:
-----
echo ">>> dkms status <<<"
sudo dkms status | grep next
echo ">>> modinfo nextsi <<<"
sudo modinfo nextsi
echo ">>> modinfo nextuvm <<<"
sudo modinfo nextuvm
echo ">>> dmesg output <<<"
sudo dmesg | grep next
echo ">>> lspci output <<<"
lspci | grep accelerators
-----
The expected output is:
-----
>>> dkms status <<<
nextsi/0.10.0, 4.18.0-372.9.1.el8.x86_64, x86_64: installed
nextuvm/0.10.0, 4.18.0-372.9.1.el8.x86_64, x86_64: installed
>>> modinfo nextsi <<<
filename: /lib/modules/4.18.0-372.9.1.el8.x86_64/extra/nextsi.ko.xz
softdep: post: nextuvm
version: 0.10.0
description: NextSilicon Driver
license: GPL
rhelversion: 8.6
srcversion: F36FF2837E68F4430575408
alias: pci:v0000CDFAd00000007sv*sd*bc*sc*i*
depends:
name: nextsi
vermagic: 4.18.0-372.9.1.el8.x86_64 SMP mod_unload modversions
parm: mem:int
>>> modinfo nextuvm <<<
filename: /lib/modules/4.18.0-372.9.1.el8.x86_64/extra/nextuvm.ko.xz
version: 0.10.0
description: NextSilicon Driver
license: GPL
rhelversion: 8.6
srcversion: 5B56459F361E13BE4B356C0
depends:
name: nextuvm
vermagic: 4.18.0-372.9.1.el8.x86_64 SMP mod_unload modversions
>>> dmesg output <<<
[ 25.722943] nextsi 0000:01:00.0: PCIE atomic ops is not supported
[ 25.746113] using nextsilicon pci device!
[ 26.634162] nextuvm loaded! api_ver 0x6
[ 26.634165] nextuvm detected support for avx512x4 memcpy offload
>>> lspci output <<<
01:00.0 Processing accelerators: Device cdfa:0007 (rev 01)
-----
*Note*: The module verification error can be ignored.
=== Install NextSilicon Packages
Install the `nextllvm`, `nextruntime` and `nextsilicon-ui-apps` packages via:
-----
sudo rpm -Uvh nextllvm-12.0.1-1101.x86_64.rpm
sudo rpm -Uvh nextruntime-RelWithDebInfo-0.10.0-3383.x86_64.rpm
sudo rpm -Uvh nextsilicon-ui-apps-0.10.0-40.x86_64.rpm
sudo reboot
-----
After the reboot, enter the command:
-----
sudo dmesg | grep next
-----
to see the expected output:
-----
[ 5.552130] systemd[1]: Set hostname to <vm-srv14-rhel-8-03.il.nextsilicon.com>.
[ 6.367713] nextsi: loading out-of-tree module taints kernel.
[ 6.371074] nextsi: module verification failed: signature and/or required key missing - tainting kernel
[ 6.401436] nextsi 0000:01:00.0: PCIE atomic ops is not supported
[ 6.425743] using nextsilicon pci device!
[ 6.811490] nextuvm loaded! api_ver 0x3
-----
=== Set Up the NextSilicon Environment
Set up a NextSilicon environment via:
-----
source /etc/profile.d/nextsilicon.sh
-----
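To confirm the environment took effect, you can check for the variables and tools used in the rest of this document. This sketch assumes the profile script exports `$NEXT_HOME` (referenced in the runtime configuration section below) and puts the NextSilicon utilities on `$PATH`:
-----
# Sanity-check the environment (assumes nextsilicon.sh exports NEXT_HOME).
echo "NEXT_HOME=$NEXT_HOME"
command -v nextcli nextloader
-----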
[[installation_verification]]
== https://userdocs.nextsilicon.com/en/latest/setup/SWinstall_check/[Installation Verification]
NextSilicon provides a script that verifies that the NextSilicon hardware and software have been installed correctly. If they have not been, it generates error messages that help NextSilicon identify the problem.
=== Running the Verification Script
The verification script is saved in a read-only directory. The following commands copy it to a smoketest directory in your home directory, where error messages and any other output can be saved if necessary, and then run the script.
*Warning*: If you don't run the command from the specified directory, it will fail.
The commands are:
-----
cp -r /opt/nextsilicon/share/smoketest ~/smoketest
cd ~/smoketest/
./run_smoketest.sh 2>&1 | tee output.log
-----
=== Expected Results
The script creates two reports, `output.log` and `nextcli.log`. Both are saved to `~/smoketest`, the same directory the script was copied to.
If you see the following trailing output on the screen when the script terminates, the verification was successful and your system is properly set up:
-----
Expected output from smoketest
-----
*Note*: If the smoketest script fails early on, `nextcli.log` might not be generated, in which case `output.log` will be sufficient for troubleshooting.
[[runtime_configuration]]
== https://userdocs.nextsilicon.com/en/latest/setup/config/[Runtime Configuration]
How to set up the basic runtime configuration for the NextSilicon utilities.
=== Static Configuration
The configuration file for the NextSilicon utilities is the YAML file located, by default, at `$NEXT_HOME/etc/next_runtime.conf`. You can update `$NEXT_HOME/etc/next_runtime.conf` directly using sudo, or copy it to a dedicated path. The advantage of editing in the default path is that you will not need to specify `--cfg-file <new-config-file>` on every command execution.
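If you prefer a dedicated copy, the workflow might look like the following sketch, with `nextloader` standing in for any utility that accepts `--cfg-file` and `./my_app` as a placeholder application binary:
-----
# Work on a dedicated copy instead of the default file.
cp "$NEXT_HOME/etc/next_runtime.conf" ~/my_runtime.conf
# ... edit ~/my_runtime.conf ...
nextloader --cfg-file ~/my_runtime.conf ./my_app
-----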
=== Prerequisite Network Communication
Various modules in the NextSilicon runtime – `nextdaemon`, `nextloader`, `nextcli`, `nextprofiler`
(see Introduction to command-line utilities for more information) – use network packets to communicate.
It is mandatory to update the network configuration first.
When setting ports in the configuration (`first-port`, the simulator `port`, and `daemon-port`), make sure these ports are not used by other processes on your machine.
The elrond service binds additional ports starting from `first-port` and incrementing by one. To avoid conflicts with the simulator `port` and `daemon-port`, make sure that `first-port` is the highest port number, as shown here:
-----
daemon:
elrond:
...
first-port: 7003
...
simulator:
- port: 7002
...
system:
daemon-port: 7001
daemon-host: 0.0.0.0
...
-----
Use `0.0.0.0` and not `127.0.0.1` as the socket bind address.
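Before starting the daemon, a standard socket listing can confirm that the chosen ports are free. A sketch, using the example port numbers above:
-----
# List any listeners on the example ports 7001-7003; no output means they are free.
ss -tlnp | grep -E ':700[1-3]\b'
-----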
=== Example Configuration
An example of a configuration file is:
-----
system:
daemon-port: 7001
daemon-host: 0.0.0.0
generation: gen1
daemon:
# debug: false
elrond:
path: elrond
first-port: 7003
# Note that increasing the maximum elrond count would decrease the maximum
# possible VFIDs, which could cause problems for some apps.
count: 1
simulator:
- port: 7002
# Enable eventlog: a trace for system events
# enable-eventlog: true
# Run the software simulator, even if the hardware is present.
# force-software-only: false
# device-init:
# Set to change to override DRAM initialization binary: (default: ../lib/firmware/nextsilicon/sbus_master.hbm2e.0x000c_1032.rom)
# dram-rom-path: <path>
# Set to change default pattern DRAM memory is set to after being initialized (default: 0x00000000)
# dram-default-pattern: <value>
# Uncomment lines below to skip device configuration state.
# Changing these may affect system performance and stability!
# skip-host-ghi: true
# skip-bins: true
# skip-grid-pll: true
# skip-hbm: true
# Set the bin cache line size: (default: 64)
# cache-line-size: default | 64 | 128
# Make the bin scramble the input address: (default: no-scramble)
# mapcont-scramble: no-scramble | xor-tag-bits | xor-mul-17
# RISC settings.
# risc:
# Set to activate RISC complexes: (default: both)
# In SCU-only mode, a complex will be initialized for SCU execution,
# but no cores will be available to elrond, and MNG services will not function.
# mode: default | none | scu-only | east | west | both
# Set to change firmware binary to be loaded at runtime: (default: ../nextrisc/bin/mngfw[p].bin)
# firmware-path: <path>
# Set to readback firmware upon upload and check its contents: (default: false)
# verified-upload: <value>
# Do not use HBM as MNG channel backing memory: (default: false)
# mng-private-mode: <value>
# Use native emulation for uemu blocks (and type of support, default: enabled/x86-simulated [device/direct-mode]):
# native-support: default | none | enabled | x86-simulated
# cores-disable-west-mask: West complex mask value for cores disabled, for cores 0-23
# cores-disable-east-mask: East complex mask value for cores disabled, for cores 0-23
# Optimize the interpreter's slots by liveness when enabled, only relevant for interpreted uCG BBGs
# optimize-bytecode: true
# If true, store RISC thread data on the hbm, otherwise store it on the SRAM (default: false)
# hbm-thread-data: false
# openmp:
# When enabled OpenMP calls will be deployed to the RISC mngfw, instead of being implemented on the Host
# enable: false
# Sets the number of continuously-preallocated threads IDs to be used in RISC OpenMP implementation
# preallocate-threads: 1024
# SCU state polling interval, in seconds.
# scu-state-poll-interval: 5
# SCU (system control unit) settings.
# Changing these may affect system performance and stability!
# scu:
# Set to change SCU mode: (default: protection)
# mode: default | disabled | protection
# Set to change core reserved for SCU: (default: 1)
# core: <value>
# Set to change sampling frequency: (default: 1000.0Hz)
# sample-frequency: <value>
# Set to change protection threshold temperature: (default: 100.0C)
# protection-threshold: <value>
execution:
deploy-scheme:
# Assign mill peripheral BBGs to host/risc (default: risc):
mill-peripheral: default
# Assign mill core BBGs to host/risc/grid (default: grid):
mill-core: default
# Enables or disables the use of nextuvm
# enable-uvm: true
# device-telemetry:
# Telemetry sample interval in milliseconds: (default: 1000)
# sample-interval: <value>
# switch between select set counters and global writer set counters.
# These counters are mutually exclusive for they share HW resources (default: global-ws)
# global-grid-telemetry-mode: default | sls | global-ws
#
# tlb:
# Enable hardware tlb telemetry collection: (default: true)
# Note - the tlb telemetries are very spammy.
# At 1.4GHz clock, a packet will be sent from each tlb every 0.2ms
# enable: true | false
# gmu:
# Each latency bucket will be of size 2^n cycles where n is the value
# of this field. 0 means disable.
# Nonzero values must be between 2 and 10: (default: 0)
# mep-latency-bucket-size: <value>
#
# Enable hardware gmu telemetry collection: (default: true)
# enable: true | false
#
# 0 - MEP: requests, responses, backpressure to GCU, MIU backpressure
# BIN: miss, backpressure to MIU REQ and from MIU RSP,
# dlink backpressure, HIT or FIP
# MMU: hit, miss
# 1 - MEP: split, unaligned
# BIN: fetch, fill, hit, fip, scratchpad, miss_no_alloc, hit or fip
# MMU: hit, miss
# 2 - 16 bins aggregating different rtts. bin size is defined by
# mep-latency-bucket-size.
# Choose the mem counters you want to collect:
# (default: custom. Defined in TELEM_GMU_SRC_DEFAULT_CFG)
# global-telem-mode: The modes below are supported:
# default: A mixed-mode default per counter, "0000000000011000000011"
# '1' at offset #i (right to left) ==>
# Indicates that counter#i is configured to mode_1.
# '0' at offset #i (right to left) ==>
# Indicates that counter#i is configured to mode_0.
# custom: Per user request.
# mode_0 / mode_1 / mode_rtt: Global all counters set to mode_0 or mode_1 or mode_rtt.
# global-telem-mode: default | mode_0 | mode_1 | mode_rtt | custom
# #EXAMPLE: counter-0: mode_0 | mode_1 | default
# telem-mode-custom:
# All counters custom configuration below:
# counter-0: mode_1
# counter-1: mode_1
# ...
# counter-21: mode_0
# Enable telemetries about the state of the osq pointers (head, tail, size/peep) (default: false)
# osq-telem-enable: false | true
# Set to ignore mill thread limits, as configured by the optimizer. (default: false)
# ignore-thread-limits: <value>
# Prepare the change objects but do not apply them onto the domains.
# skip-apply: false
# instance-selection:
# Set XFLD mask that is applied to tid before selecting duplication instance, int value (default: -1, don't apply)
# xfld-mask: <value>
projection:
# duplication:
# Create a boundary box of tiles out of the existing projection, and duplicate it across the remaining space on the grid.
# tiles: false
# Create a boundary box of clusters out of the existing projected tiles, and duplicate it across the remaining space inside the tiles.
# clusters: false
# pre-process:
# Queue redirection depends on `use-next-research`
# enable-queue-redirection: true
# topology-prioritization-factor: 1.0
#
# projection experimental configuration:
# note that fields under experimental:
# 1. are subject to change at any time without notice
# 2. can introduce instability and thus usage is risky
# experimental:
# redirect-queues-to-gsu: false
# redirect-feeder-sets-to-gsu: false
# use projection boundaries to bound the projection. units in clusters
# note that the boundaries are bounded by grid absolute boundaries
# set all to -1 to ignore projection boundaries
# projection-boundaries:
# offset-row: -1
# offset-column: -1
# rows: -1
# columns: -1
# relocate projection bounding box upon the load of projection result by hwcg.
# the units are in clusters. set all to -1 to ignore relocation
# relocate-on-load:
# row: -1
# column: -1
# clusters and tiles to skip when duplicating the projection
# duplication-exclude-clusters:
# tiles we don't want to use in duplications. Should contain a list of numbers between 0 and 7
# 0 being top left, 1 top right etc.
# tiles: []
# Clusters inside each tile we want to skip. Should contain a list of numbers between 0 and 31.
# The skipped clusters will be ignored in all 8 tiles.
# clusters: []
use-next-research: true
# disjoint-windows-plan: false
# Limits the number of process project.py can use
# workers-limit: 16
# Projection mode - (default: regular)
# mode: default | regular | cluster-based
# Projection split mode - (default: pre-projection)
# split-mode: default | disabled | pre-projection | recursive
elrond:
# optimizer: true
simulator:
# To enable jitting while running BBGs not on grid/device, set to true, otherwise set to false.
# This comes into effect for all BBGs when running without hardware and on unlikely BBGs when running with a device
# jit: true
#
# To save jit coredump set the dump path. This will save a BBG if the lifter failed to compile it.
# jit-coredump: $NEXT_HOME/var/log/nextsilicon/
#
# To enable jitted code to be interrupted and report telemetry at loop head,
# set the following value to a non-zero value. After every <value> static jumps,
# the interrupt callback will be called. Note that this will cause a slowdown
# for values closer to zero.
interrupt-threshold: 1000
#
# To set the maximum number of simulator threads that will run at the same
# time, uncomment the next line and set the number:
# threads: <num>
#
# To limit device memory size in the software simulator, set the desired
# devmem size here, in amounts of clusters (256MB). If left unset, a full
# silicon layout will be used, enabling 64GiB of device memory. This value
# cannot be more than 256 (full silicon layout).
# When not running in software simulator, this setting is ignored.
# memory-clusters: <value>
cachesim:
# To enable cache simulator use gen1 or gen2 as the mode.
mode: disabled
#
# To get a human readable cachesim report set its path below
# info-log-path : $NEXT_HOME/cs.log
#
# To get a machine readable cachesim report set its path below
# json-log_path : $NEXT_HOME/cs.json.log
#
# Number of wall clock seconds between each report
# report-interval : <num>
#
# enable eventlog: a trace for system events
# enable-eventlog: true
# enable builtins: Load from ../etc/codegraph_builtins. Enabled by default.
# Uncomment to disable
# enable-builtins: false
#
# In order to disable mem-trap uncomment the line `disable-mem-trap: <value>` and
# replace `<value>` with `true`.
# disable-mem-trap: <value>
#
# In order to disable inclusion of mtrap memory in core dumps, set to false.
# Note that device memory will not be included in core dumps regardless of the
# value supplied here due to technical limitations.
mtrap-in-core-dumps: true
#
# To set source path substitution, uncomment the following lines and apply the
# pairs of paths (first to replace, the second the replacement) separated by a
# colon:
# source-path-sub:
# - /from/sample/dir:/to/sample/dir
# - /another/sample/dir:/to/another/sample/location
# - ...
#
# To stop Elrond on (somewhat) recoverable errors (recommended to be set on
# test environments):
# strict: true
#
# To enable or disable control-plane handoffs, which causes the interception
# and injection of certain functions for internal next control-plane purposes,
# set the following setting. Possible values are true/false
# control-plane-handoffs: true
#
# The memory allocation policy for the memory manager.
# Possible values:
# - `default`: If running with a device use `migrate-one-way`, otherwise use
# `host-only`
# - `host-only`: All memory is allocated on the host. Can be used for
# testing without device memory constraints.
# - `device-only`: All memory is allocated on the device.
# - `migrate-one-way`: Memory can be allocated either on host or on device.
# Host memory is migrated to device on first device access,
# but device memory is never migrated to host.
# mmu-policy: default
#
# Atomic shadow space is a memory region used to catch atomic operations from the host on migrated memory
# disable-atomic-shadow-space: false
#
# The memory migration maximal chunk size in bytes
# Note that actual migration size may be smaller in cases
# where the assorted memory allocation was smaller than the chunk size
# This setting also decides the maximal page size.
# For 2MiB pages, set at least 2097152. For 1GiB pages, set at least 1073741824.
# The default is 2 MiB
# max-migration-chunk: 2097152
#
# Set to true to zero memory used as application stack
zero-stack: false
#
# To set thread stack size (in bytes)
# thread-stack-size: 0x20000
# To change libcall load policy, which controls how libcall implementations
# are loaded into Codegraph from disk, set this value to:
# - `compliant-only` (the default): Allow only IEEE-compliant implementations.
# - `prefer-fast`: Prefer selecting fast versions (no-NaNs, no-Infs,
# flush-to-zero, denormals-are-zero) if available, otherwise fall back to
# IEEE-compliant implementations.
# libcall-load-policy: compliant-only
#
# To disable the MMU, set the following value to true. This will cause all
# memory allocations to be served directly from linear memory.
# disable-mmu: <value>
#
# In order to enable or disable function overlays, which replaces some
# non-libcall functions with ns-optimized implementations of them, set the
# following value to true/false. Defaults to true
# enable-overlay: true
#
# Number of seconds to wait before sending a daemon keepalive request, after
# receiving an answer to the previous query:
# daemon-keepalive-send: 5
# Number of seconds until Elrond considers the daemon unresponsive, and enters
# failed state:
# daemon-keepalive-wait: 45
loader5:
use_ld_so_plugin: false
# Enable eventlog: Log for system events
enable-eventlog: true
codegraph_db:
# remove-unused: true
# To enable or disable the select_set simplification set the following
# value to true/false. The simplification splits select_set nodes into select
# nodes before optimization and reassembles them after.
simplify-select-sets: true
# Libcalls mode can be one of the following values:
# - legacy
# - gen1 (Default if not provided)
# - gen2
libcalls-mode: gen1
feeder-optimization:
feeder-recalculation:
disable: false
depth-limit: 4
added-compute-limit: 20
feeder-spilling:
slots-per-thread: 0x200
feeder-rarely-used:
disable: false
gain-threshold: 10
feeder-used-in-unlikely:
disable: false
gain-threshold: 5
memory-optimization:
disable-eliminate-barriers: false
enable-eliminate-barriers-force: false
loop-pipelining-hack: false
classification:
# Treat OR nodes as ADD nodes in address calculation
# even when we can't prove they're equivalent (see mem_class.cpp)
unsafe-or-as-add: false
reordering:
# reordering can change the order of memory accesses including atomics
disable: false
# whether reordering of two read-only accesses is allowed.
# for safe x86 like behavior non-atomic-only
read-only: all
# whether reordering of two non-overlapping memory accesses is allowed
# for safe x86 like behavior non-atomic-only
non-alias: all
# UNSAFE: treat atomics and non-atomics as non-aliasing
unsafe-deorder-atomics-from-non-atomics: true
# UNSAFE: treat pointers from different stack frames as non-aliasing
unsafe-deorder-distinct-stack-frames: false
coalescing:
disable: false
max-size-bytes: 16
enable-heuristic-alignment-optimization: false
# Experimental heuristic to prevent split memory transactions during
# iteration of struct arrays with sizes that do not align well with the
# bin size (in gen1, 128bits).
enable-struct-array-size-heuristic: false
super-unsafe:
# for research. don't turn on unless you know what you are doing
enable-unsafe-remove-memread-cond: false
# Disable all memory accesses without modifying topology
enable-total-memory-suppression: false
# Attempt to coalesce simple read-mutate-write access patterns
enable-read-mutate-write-coalesce: false
# Original (report_tool)
optimizer:
# blacklist-functions:
# - foo
# - bar
# abs-topo-lim should have different value depending on optimization usage
# Value should be larger than 'topology-min-counter' below
# When optimizing a function: use a number that is <= your number of test iterations.
# 10k is recommended but if the func is slow, you can lower it to 1k
# For full apps: 30k is a good number to start with.
# For fine tuning, use report tool to create CSV file.
# look at 'loops loads' for better estimated value
abs-topo-lim: 30000
call-inline-factor: 0.005
check-load-except-intervals: 100000
collect-counters-duration: 10000000
consolidate-unlikely:
min-lower-lim: 10
prob-lim-loop-edge: 0.8
prob-lim: 0.9
prob-lower-lim: 0.02
fast-mode: true
inline-validity-threshold: 0.0
merge-limit: 0.001
minimal-print: false
new-counters-interval: 10000000
new-counters-validity: 0.0
simple-optimization-stop-ratio: 1000
topo-lim-factor: 0.01
topology-min-counter: 1000
topology-min-duration-factor: 1.5
allow-exit-paths: false
# Modern (elrond_core)
optimizer-pi:
millable-functions: []
# Optimizer will skip mills which are not from those functions, ignore if empty.
# format is Muid_Zuid or debug name
unmillable-functions: []
# Optimizer will skip mills which are from those functions.
# format is Muid_Zuid or debug name
inline-blacklist-functions: []
# - Muid_Zuid
import-blacklist-functions: []
# - Muid_Zuid
blacklist-mills: []
# - "<func_name>: <bbg_id>" (note the quotes, they are a MUST)
# - "main: 33"
# - "step_10: 9"
small-mill-limit: 1
# Will duplicate simple mills (single BBG that underwent feeder spilling) by
# this count, which must be a power of 2.
simple-mill-duplication-count: 0
use-unoptimized: false
# if projection fails on a loop or on an epilogue, downgrade the entire loop
# with all of its epilogues and not just the failed BBG
downgrade-entire-loop-on-failure: true
# if projection fails on a loop and it has a closer parent loop, downgrade
# the parent loop as well
prevent-closer-parent-loops: true
# none - None of the mills created by the optimizer will be drafted
# all - All mills created by the optimizer will be draft mills,
# Allows the user to select the wanted mills and apply them through nextcli/UI
# unstable - Only unstable mills will be drafted
draft-mills: none
loop-flattening:
enabled: false
# total number of MEPs in outer head and epilogue
memory-access-threshold: 0
# total number of LEs in outer head and epilogue
compute-nodes-threshold: 200
discovery:
# Ignore this much time of the application's initial telemetry
noise-skip-ms: 5000
# Start sending for imports only after accumulating at least this much
# application-telemetry-time.
initial-data-ms: 10000
# Minimum time, in milliseconds, for which requested telemetry must be
# collected after the final import request (i.e. no new import requests being
# made since the most recent one's completion) to advance to inline stage
import-stable-ms: 10000
# Any discovery telemetry less than this is completely discarded:
# max(max_importable_func_load / threshold_factor, threshold_minimum)
# This is a noise-cancellation mechanism.
# Both of the following must hold:
# 1. Only import function f if load(f) > import-threshold-minimum.
# 2. Import highest load function f_hi.
# Import any function f if load(f) > load(f_hi) / import-threshold-factor.
import-threshold-minimum: 100000
import-threshold-factor: 4096
plan:
# Both of the following must hold:
# 1. Only keep edges e if load(e) > edge-threshold-minimum.
# 2. Keep highest load edge e_hi.
# Keep edge e only if load(e) > load(e_hi) / edge-threshold-factor.
# The connected subgraphs that are left are mill candidates.
edge-threshold-minimum: 100000
edge-threshold-factor: 4096
# Both of the following must hold:
# 1. Only keep mills m if load(m) > mill-threshold-minimum.
# load(m) on a mill is equal to the edge with the highest load in the mill.
# 2. Keep highest load mill m_hi.
# Keep mill m only if load(m) > load(m_hi) / mill-threshold-factor.
mill-threshold-minimum: 100000
mill-threshold-factor: 4096
# Only keep mills m with mill-iteration-count(m) > mill-iter-threshold.
# Example: nested loops, outer loop has iters = 3, inner loop has iters = 4.
# Then the mill iteration count is 3 x 4 = 12.
mill-iter-threshold: 1000.0
# Only applies to functions with flag NS_MARK_HINT_LIKELY in source code.
# Values between 0 and 1.
# Multiplying thresholds by this factor makes the marked functions more
# likely to have mills.
# The following thresholds are affected by likely-hint-factor:
# mill-iter-threshold; loop-threshold; edge-threshold
# where loop-threshold = max(mill-threshold-minimum, max_edge_load / mill-threshold-factor)
# (max_edge_load is the maximal edge load of the entire application)
likely-hint-factor: 0.5
refine:
unlikely-edge-ratio: 1024
bbg-load-ratio: 100
# Can be used to disable break commutative (uncomment):
# disable-break-commutative: true
# Can be used to disable feeder spilling (uncomment):
# disable-feeder-spilling: true
# Can be used to disable inlining order calls (uncomment):
# disable-inline-order-call: true
# Disable reoptimize mode. False = reoptimize enabled, True = refine enabled.
# Reoptimize mode means that all counters will be cleared on stage2 completion
# and reoptimize will happen periodically (by the following config) or manually (cli)
disable-reoptimize: false
# Time in seconds to trigger reoptimize
reoptimize-period: 0
# Size of reoptimize phase cache
phase-cache-size: 10
exploration:
# This weight will be assigned to parent loop edges by default when calling
# mill from parent loop. The weight also can be set in the command itself.
mill-from-parent-loop-weight: 1024.0
# This config controls merging epilogues backward into predecessors to spare bbgs for projection
merge-epilogue-backward:
enabled: false
# Maximal number of MEPs in epilogue to merge
memory-access-threshold: 0
# Maximal number of LEs in epilogue to merge
compute-nodes-threshold: 200
-----
== NextSilicon Software Guide
=== Command Line Utilities
==== https://userdocs.nextsilicon.com/en/latest/software/command-line-utilities/nextdaemon/[`nextdaemon`]
The `nextdaemon` command is the daemon managing the NextSilicon hardware and software system.
This daemon performs the various aspects of seamless software offloading and acceleration on behalf of the user applications.
It is responsible for various crucial tasks, such as live telemetry collection, optimization, and memory migration, as well as providing data for other tools such as `nextcli` and `webapps-server`. Every application executed through `nextloader` communicates with the daemon from the moment it starts, throughout its entire run time, and finally during its teardown.
==== https://userdocs.nextsilicon.com/en/latest/software/command-line-utilities/nextengined/[`nextengined`]
The `nextengined` command is the daemon used in place of `nextdaemon` when running on the NextSilicon software simulator rather than on hardware (see `nextloader` below).
==== https://userdocs.nextsilicon.com/en/latest/software/command-line-utilities/nextloader/[`nextloader`]
The `nextloader` command is the NextSilicon application loader.
It is used to execute an application binary through the NextSilicon software acceleration stack.
Given an executable application and its command line options, `nextloader` loads and executes the application, provided that a daemon (`nextdaemon` for hardware or `nextengined` for simulator) is running.
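A typical invocation simply prefixes the application command line (a sketch; the binary name and its arguments are placeholders):
-----
# Run an application through the NextSilicon stack (a daemon must already be running).
nextloader ./my_app --input data.txt
-----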
==== https://userdocs.nextsilicon.com/en/latest/software/command-line-utilities/nextmonitor/[`nextmonitor`]
The `nextmonitor` command is the metrics aggregation agent.
It is a performance metrics and eventlog aggregation system. At predefined time intervals, it samples the system metrics and the event log. The sample is written into an SQL database.
==== https://userdocs.nextsilicon.com/en/latest/software/command-line-utilities/nextmonitor_to_json/[`nextmonitor_to_json`]
The `nextmonitor_to_json` utility is used to convert the content of the `nextmonitor` SQLite database into JSON
to be loaded into the Perfetto graphical utility.
==== https://userdocs.nextsilicon.com/en/latest/software/command-line-utilities/webapps-server/[`webapps-server`]
The `webapps-server` program runs NextSilicon's custom-developed visualization tools and allows connecting to them. These tools help developers and researchers understand how their code behaves on the NextSilicon platform.
==== https://userdocs.nextsilicon.com/en/latest/software/command-line-utilities/next_perf_analyzer/[`next_perf_analyzer`]
The `next_perf_analyzer` utility is a text-based performance report utility for single-threaded performance. It is recommended to generate a nextmonitor database via a hardware run with only one thread running on the device, e.g. by setting `OMP_NUM_THREADS=1` in the environment.
This is a textual report whose goal is to help find basic performance bottlenecks. This tool is for users who are comfortable with NextSilicon hardware and understand basic concepts.
==== https://userdocs.nextsilicon.com/en/latest/software/command-line-utilities/nextcli/[`nextcli`]
The `nextcli` utility is the NextSilicon software acceleration stack command-line control interface.
It is used through subcommands, which you can list with `nextcli -h`. Most commands also accept command-line arguments that guide them in controlling specific accelerators or applications being accelerated. Each subcommand is documented in separate, per-command help, which can be displayed with `nextcli <subcommand> --help` or `nextcli <subcommand> -h`. Each subcommand comprises one or more words.
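For example, to list the subcommands and then display per-command help:
-----
# Discover available subcommands, then drill into a specific one.
nextcli -h
nextcli <subcommand> -h
-----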
==== https://userdocs.nextsilicon.com/en/latest/software/command-line-utilities/mpirun/[`mpirun`]
The `mpirun` command executes serial and parallel jobs in OpenMPI. It sends the name of the directory where it was invoked on the local node to each of the remote nodes, and attempts to change to that directory. Note that `mpirun`, `mpiexec`, and `orterun` are all synonyms for each other.
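For example, a standard OpenMPI launch of four ranks (generic OpenMPI usage; the binary name is a placeholder):
-----
# Launch four ranks of a (placeholder) MPI binary on the local node.
mpirun -np 4 ./my_mpi_app
-----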
=== Compiler Wrappers
NextSilicon provides compiler wrappers along with the NextSilicon LLVM-based toolchain. These wrappers serve two purposes:
* Building libraries and executables against the provided sysroot (linking against musl-libc, using it as the runtime dynamic linker, linking against runtime libraries provided in the sysroot).
* Enriching specific binaries for runtime optimization through extraction of dataflow graphs, and bridging the application binary interface (ABI) between offloaded and non-offloaded code.
The wrappers do this by injecting extra parameters to the compiler and linker invocations. They act as intermediaries to the toolchain’s front-end drivers, so both linker and compiler invocations should pass through them.
These wrappers are subdivided into three sets.
==== https://userdocs.nextsilicon.com/en/latest/software/command-line-utilities/flatcc/[`flatcc`, `flatcxx`, `flatfort`]
Minimal compiler drivers for the NextSilicon Clang-based C, C++, and Fortran compilers. Code compiled through one of these compilers does not contain enriched code sections, but is linked with NextSilicon’s sysroot. These are used in creating enriched libraries from which the runtime manager can extract computation graph representations and perform optimizations and ABI bridging. This process involves using the linker’s link time optimization (LTO) capabilities, but without using the actual LTO pipeline.
==== https://userdocs.nextsilicon.com/en/latest/software/command-line-utilities/nextcc/[`nextcc`, `nextcxx`, `nextfort`]
Main compiler drivers for the NextSilicon enriching Clang-based C, C++, and Fortran compilers. Code compiled through one of these compilers contains enriched code sections as well as being linked with NextSilicon’s sysroot. These are used in creating enriched libraries from which the runtime manager can extract computation graph representations and perform optimizations and ABI bridging. This process involves using the linker’s LTO capabilities, but without using the actual LTO optimization pipeline.
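As a minimal usage sketch, assuming the drivers accept the usual Clang-style flags (file names are placeholders):
-----
# Build an enriched executable against the NextSilicon sysroot.
nextcc -O2 -o my_app my_app.c
-----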
==== https://userdocs.nextsilicon.com/en/latest/software/command-line-utilities/mpicc/[`mpicc`, `mpicxx`, `mpifort`]
OpenMPI's compiler wrappers for MPI applications. These are to be used in conjunction with `nextcc`, `nextcxx`, and `nextfort`, respectively.
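One way to combine the two wrapper layers is OpenMPI's stock compiler-override environment variables (`OMPI_CC`, `OMPI_CXX`, `OMPI_FC`), which swap the compiler that `mpicc` and friends invoke underneath; whether your installation is already preconfigured this way may vary. A sketch with placeholder file names:
-----
# Have mpicc drive the enriching NextSilicon C compiler underneath.
OMPI_CC=nextcc mpicc -O2 -o my_mpi_app my_mpi_app.c
-----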
==== https://userdocs.nextsilicon.com/en/latest/software/wrappers/[Enrichment-Enabling Wrappers]
=== https://userdocs.nextsilicon.com/en/latest/software/APIs/NSAPI/Overview/[NSAPI]
NSAPI is a unified API that can be called from application code to query or control NextSilicon-specific runtime properties. This additional form of runtime-level granular control can help applications programmatically harness the full power of NextSilicon hardware and software capabilities.
==== Categories
NSAPI is divided thematically into the following categories:
* xref:handoff[Handoff]: Dealing with the processes of importing and handing off functions
* xref:threading[Threading]: Threading in the NextSilicon hardware context
* xref:function[Function and loop marks]: Using function and loop marks as additional control mechanisms for the optimizer
* xref:memory[Memory]
* xref:devices[Devices]
[[handoff]]
==== https://userdocs.nextsilicon.com/en/latest/software/APIs/NSAPI/Handoff/[Handoff]
These are the NSAPI functions related to the importing and handoff processes. Once a function is handed off, additional information (such as NextSilicon thread capacity) can be queried. In addition, import and handoff of a function can be forced or restricted by “marking” the function accordingly. Use `#include <nsapi/handoff.h>` to use these functions in your application.
[[threading]]
==== https://userdocs.nextsilicon.com/en/latest/software/APIs/NSAPI/Thread/[Threading]
The NSAPI functions related to NextSilicon threading. Threads running on NextSilicon hardware are identified by a unique NextSilicon thread ID and process ID (unrelated to the POSIX TID and PID).
NSAPI provides functions to query the currently running NextSilicon thread and process IDs, as well as to check whether the currently running code is executing in the NextSilicon device context (hardware or software simulator).
Additional APIs are provided to allocate NextSilicon TIDs and start them with a given function.
[[function]]
==== https://userdocs.nextsilicon.com/en/latest/software/APIs/NSAPI/Marks/[Function and Loop Marks]
Function and loop marks act as an additional means to control Optimizer decisions at run time. The mark can act as a manual command or as a hint.