
Discussion idea #582

Open
qwas982 opened this issue Dec 30, 2024 · 3 comments

Comments


qwas982 commented Dec 30, 2024

Discussion: why not write a virtual NPU in C89 and let large models run on this virtual NPU?
Can BitNet use Mamba + MoE?
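To make the proposal concrete, here is a minimal sketch of what a C89 virtual NPU core could look like: a fetch-decode-execute loop over a tiny scalar register file. The opcode names and instruction layout are illustrative assumptions, not a fixed specification.

```c
#include <assert.h>

/* Hypothetical opcodes for a minimal virtual NPU core. */
enum { OP_HALT, OP_ADD, OP_RELU };

/* Execute a program of (opcode, dst, src1, src2) quadruples against
   a scalar register file; returns the value left in r[0] at HALT. */
static int run(int prog[][4], int r[])
{
    int pc = 0;
    for (;;) {
        int *ins = prog[pc];
        pc++;
        switch (ins[0]) {
        case OP_ADD:  r[ins[1]] = r[ins[2]] + r[ins[3]]; break;
        case OP_RELU: r[ins[1]] = r[ins[2]] > 0 ? r[ins[2]] : 0; break;
        case OP_HALT: return r[0];
        }
    }
}
```

A real design would add matrix registers and the tensor instructions discussed below; this only shows the dispatch skeleton.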


qwas982 commented Dec 30, 2024

Matrix Operation Instructions

MMUL: Matrix Multiplication
CONV: Convolution
DOT: Dot Product
ADD: Addition (for matrices or vectors)
SUB: Subtraction (for matrices or vectors)
MUL: Element-wise Multiplication (for matrices or vectors)
DIV: Element-wise Division (for matrices or vectors)

Activation Function Instructions

RELU: Rectified Linear Unit (ReLU) Activation Function
SIGM: Sigmoid Activation Function
TANH: Hyperbolic Tangent Activation Function
SOFT: Softmax Function
GELU: Gaussian Error Linear Unit (GELU) Activation Function

Data Manipulation Instructions

POOL: Pooling Operation (e.g., max pooling or average pooling)
COPY: Data Copy
TRAN: Transpose (for matrices)
RESH: Reshape (for data)
PAD: Padding (for data)

Compression and Decompression Instructions

COMP: Compression (for data)
DECO: Decompression (for data)
QUAN: Quantization Operation
DEQU: Dequantization Operation

Sparse Matrix Support Instructions

SPMM: Sparse Matrix Multiplication
SPAD: Sparse Matrix Addition

Transformer Model-Specific Instructions

ATTN: Attention Mechanism Calculation
GATH: Gather (for data)
SCAT: Scatter (for data)
LNORM: Layer Normalization

Mixed Precision Computing Instructions

FP16: 16-bit Floating Point Calculation (Half Precision)
FP32: 32-bit Floating Point Calculation (Single Precision)
INT8: 8-bit Integer Calculation
INT4: 4-bit Integer Calculation

Synchronization and Parallelism Instructions

SYNC: Hardware Synchronization
BAR: Hardware Barrier
THRD: Multi-threading Operation

Custom Operator Support

CUST: Custom Operator

Memory Management Instructions

PREF: Data Prefetch
CACH: Cache Management
LOAD: Data Load
STOR: Data Store


qwas982 commented Dec 30, 2024

Designing a Virtual NPU Instruction Set Based on RISC Architecture and Register Machine

To design a comprehensive virtual NPU instruction set based on RISC architecture that meets the requirements of modern large models (such as Transformer, CNN, etc.), it is essential to consider core operations including matrix operations, convolution, activation functions, data movement, and parallel computing. Below is a detailed design for such a virtual NPU instruction set:

1. Instruction Set Design Principles

  • RISC Style: Simple, fixed-length instruction set with each instruction performing a single operation.
  • Register-Register Operations: Most instructions operate directly between registers to reduce memory access.
  • Support for Parallel Computing: Provides multi-threading and hardware barrier instructions to support parallel tasks.
  • Mixed Precision Support: Supports various precision calculations including INT4/INT8/FP16/BF16/FP32.
  • Scalability: Reserves space for custom instructions to support future expansion.

2. Register Design

  • General Purpose Registers (R0-R31): 32 32-bit general-purpose registers for storing operands and intermediate results.
  • Matrix Registers (M0-M15): 16 matrix registers, each holding a 128×128 tile of matrix data.
  • Special Registers:
    • PC: Program Counter.
    • SP: Stack Pointer.
    • FLAGS: Status register storing condition flags (e.g., zero flag, overflow flag).
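The register file above could be modeled in C89 roughly as follows. The field widths are illustrative assumptions (the issue does not fix element types), and the matrix registers are flattened into row-major float arrays:

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_GPR 32
#define NUM_MAT 16
#define MAT_DIM 128

/* Illustrative C89 model of the proposed register file. */
typedef struct {
    unsigned int r[NUM_GPR];             /* R0-R31 general purpose */
    float m[NUM_MAT][MAT_DIM * MAT_DIM]; /* M0-M15, 128x128 tiles  */
    unsigned int pc;                     /* program counter        */
    unsigned int sp;                     /* stack pointer          */
    unsigned int flags;                  /* condition flags        */
} NpuState;
```

Note that the matrix registers alone occupy 16 × 128 × 128 × 4 bytes = 1 MiB, so the state is best allocated statically or on the heap rather than on the stack.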

3. Instruction Set Classification

3.1 Data Movement Instructions
  • LOAD R1, [R2]: Loads data from memory address R2 into register R1.
  • STORE [R1], R2: Stores data from register R2 into memory address R1.
  • MOV R1, R2: Copies the value from register R2 to register R1.
  • PUSH R1: Pushes the value of register R1 onto the stack.
  • POP R1: Pops the top value from the stack into register R1.
3.2 Matrix Operation Instructions
  • MMUL M1, M2, M3: Matrix multiplication, M1 = M2 * M3.
  • MADD M1, M2, M3: Matrix addition, M1 = M2 + M3.
  • MSUB M1, M2, M3: Matrix subtraction, M1 = M2 - M3.
  • MCONV M1, M2, M3: Matrix convolution, M1 = M2 ⊗ M3.
  • MTRAN M1, M2: Matrix transpose, M1 = M2^T.
3.3 Vector Operation Instructions
  • VADD R1, R2, R3: Vector addition, R1 = R2 + R3.
  • VMUL R1, R2, R3: Vector multiplication, R1 = R2 * R3.
  • VDOT R1, R2, R3: Vector dot product, R1 = R2 · R3.
3.4 Activation Function Instructions
  • RELU R1, R2: ReLU activation function, R1 = max(0, R2).
  • SIGM R1, R2: Sigmoid activation function, R1 = 1 / (1 + exp(-R2)).
  • TANH R1, R2: Hyperbolic tangent activation function, R1 = tanh(R2).
  • SOFT R1, R2: Softmax function, R1 = exp(R2) / sum(exp(R2)).
3.5 Control Flow Instructions
  • JMP LABEL: Unconditional jump to LABEL.
  • JEQ R1, R2, LABEL: Jump to LABEL if R1 == R2.
  • JNE R1, R2, LABEL: Jump to LABEL if R1 != R2.
  • CALL LABEL: Calls a subroutine.
  • RET: Returns from a subroutine.
3.6 Parallel and Synchronization Instructions
  • BAR: Hardware barrier, waits for all threads to complete.
  • THRD R1: Creates R1 threads.
  • SYNC: Thread synchronization.
3.7 Mixed Precision Instructions
  • FP16 R1, R2: Converts R2 to FP16 format and stores it in R1.
  • INT8 R1, R2: Converts R2 to INT8 format and stores it in R1.
  • QUAN R1, R2, R3: Quantization operation, R1 = quantize(R2, R3).
  • DEQU R1, R2, R3: Dequantization operation, R1 = dequantize(R2, R3).
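One common realization of QUAN/DEQU is symmetric linear quantization. The issue does not fix a scheme, so the sketch below is an assumption: floats are mapped to INT8 with a scale factor, rounding half away from zero and clamping to [-127, 127]:

```c
#include <assert.h>

/* Symmetric linear INT8 quantization: q = round(x / scale),
   clamped to [-127, 127]; dequantization is x ~= q * scale. */
static signed char quan_i8(float x, float scale)
{
    float q = x / scale;
    long r = (long)(q >= 0.0f ? q + 0.5f : q - 0.5f);
    if (r > 127) r = 127;
    if (r < -127) r = -127;
    return (signed char)r;
}

static float dequ_i8(signed char q, float scale)
{
    return (float)q * scale;
}
```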
3.8 Custom Instructions
  • CUST R1, R2, R3: Executes a custom operation, R1 = custom_op(R2, R3).

4. Instruction Encoding Format

Fixed length of 32 bits:

  • Opcode (6 bits): Operation code.
  • Rd (5 bits): Destination register.
  • Rs1 (5 bits): Source register 1.
  • Rs2 (5 bits): Source register 2.
  • Imm (11 bits): Immediate value or offset.
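With these field widths (6 + 5 + 5 + 5 + 11 = 32 bits), encoding and decoding reduce to shifts and masks. The sketch below assumes the opcode occupies the top bits ([31:26] opcode, [25:21] Rd, [20:16] Rs1, [15:11] Rs2, [10:0] imm); the issue fixes only the widths, not the field order:

```c
#include <assert.h>

/* Pack the 32-bit instruction word described above. */
static unsigned long encode(unsigned op, unsigned rd, unsigned rs1,
                            unsigned rs2, unsigned imm)
{
    return ((unsigned long)(op  & 0x3FUL) << 26) |
           ((unsigned long)(rd  & 0x1FUL) << 21) |
           ((unsigned long)(rs1 & 0x1FUL) << 16) |
           ((unsigned long)(rs2 & 0x1FUL) << 11) |
            (unsigned long)(imm & 0x7FFUL);
}

/* Unpack individual fields. */
static unsigned dec_op(unsigned long w)  { return (unsigned)((w >> 26) & 0x3FUL); }
static unsigned dec_rd(unsigned long w)  { return (unsigned)((w >> 21) & 0x1FUL); }
static unsigned dec_imm(unsigned long w) { return (unsigned)(w & 0x7FFUL); }
```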

5. Example Code

// Matrix multiplication
MMUL M1, M2, M3

// ReLU activation function
LOAD R1, [R2]
RELU R3, R1
STORE [R4], R3

// Conditional jump
JEQ R1, R2, LABEL
LABEL:
    ADD R1, R2, R3

6. Summary

This virtual NPU instruction set, designed based on RISC architecture and register machine principles, covers core operations including matrix operations, vector operations, activation functions, control flow, and parallel computing, meeting the requirements of modern large models. Through fixed-length instruction encoding and efficient register-register operations, it achieves high-performance computing with low power consumption.

; --------------------------------------------

Can This Instruction Set Be Translated to LLVM-IR?

Yes, this virtual NPU instruction set based on RISC architecture can be translated to LLVM IR (Intermediate Representation). LLVM IR is a low-level, hardware-independent intermediate representation with strong expressiveness and flexibility, capable of mapping well to various instruction set architectures. Below is a detailed explanation of how to translate this instruction set to LLVM IR:

1. Basic Features of LLVM IR

  • Static Single Assignment (SSA) Form: Each variable is assigned only once.
  • Strong Type System: Each value has a clear type (e.g., i32, float, <4 x float>, etc.).
  • Modular Structure: Composed of modules (Module), functions (Function), basic blocks (Basic Block), and instructions (Instruction).
  • Rich Instruction Set: Supports arithmetic operations, memory operations, control flow, function calls, etc.

2. Translation Rules

2.1 Data Movement Instructions
  • LOAD R1, [R2]:
    %R1 = load i32, i32* %R2
  • STORE [R1], R2:
    store i32 %R2, i32* %R1
  • MOV R1, R2:
    %R1 = add i32 %R2, 0
2.2 Matrix Operation Instructions
  • MMUL M1, M2, M3:
    LLVM IR has no two-dimensional vector type (so <128 x 128 x float> is not valid); matrices can instead be passed as pointers to nested arrays:
    call void @matrix_multiply([128 x [128 x float]]* %M1, [128 x [128 x float]]* %M2, [128 x [128 x float]]* %M3)
    The @matrix_multiply function needs to be defined.
2.3 Vector Operation Instructions
  • VADD R1, R2, R3:
    Assuming vector type is <4 x float> (floating-point vectors use fadd, not the integer add):
    %R1 = fadd <4 x float> %R2, %R3
2.4 Activation Function Instructions
  • RELU R1, R2:
    %R1 = call float @relu(float %R2)
    The @relu function needs to be defined.
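The external helpers the IR declares (here @relu) can be supplied from C and linked in. A minimal sketch of such a definition:

```c
#include <assert.h>

/* Definition for the @relu symbol declared in the LLVM IR. */
float relu(float x)
{
    return x > 0.0f ? x : 0.0f;
}
```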
2.5 Control Flow Instructions
  • JMP LABEL:
    br label %LABEL
  • JEQ R1, R2, LABEL:
    %cmp = icmp eq i32 %R1, %R2
    br i1 %cmp, label %LABEL, label %next
2.6 Parallel and Synchronization Instructions
  • BAR:
    call void @barrier()
    The @barrier function needs to be defined.
  • THRD R1:
    call void @create_threads(i32 %R1)
    The @create_threads function needs to be defined.
2.7 Mixed Precision Instructions
  • FP16 R1, R2:
    %R1 = fptrunc float %R2 to half
  • INT8 R1, R2:
    %R1 = trunc i32 %R2 to i8
2.8 Custom Instructions
  • CUST R1, R2, R3:
    %R1 = call i32 @custom_op(i32 %R2, i32 %R3)
    The @custom_op function needs to be defined.

3. Complete Example

Virtual NPU Assembly Code:

LOAD R1, [R2]
RELU R3, R1
STORE [R4], R3
JEQ R1, R2, LABEL
LABEL:
    ADD R1, R2, R3

Translated to LLVM IR (note that SSA forbids redefining %R1, and the integer/float types must be reconciled explicitly; register values holding addresses are converted with inttoptr):

define void @main(i32 %R2, i32 %R4) {
entry:
  ; LOAD R1, [R2]
  %p2 = inttoptr i32 %R2 to i32*
  %R1 = load i32, i32* %p2
  ; RELU R3, R1 (the integer is converted to float for the @relu helper)
  %R1f = sitofp i32 %R1 to float
  %R3 = call float @relu(float %R1f)
  ; STORE [R4], R3
  %p4 = inttoptr i32 %R4 to float*
  store float %R3, float* %p4
  ; JEQ R1, R2, LABEL
  %cmp = icmp eq i32 %R1, %R2
  br i1 %cmp, label %LABEL, label %next

LABEL:
  ; ADD R1, R2, R3 (R3 is converted back to i32; SSA requires a fresh name)
  %R3i = fptosi float %R3 to i32
  %R1.1 = add i32 %R2, %R3i
  br label %next

next:
  ret void
}

declare float @relu(float)

4. Summary

This virtual NPU instruction set can be well translated to LLVM IR. By defining necessary functions (such as @matrix_multiply, @relu, @barrier, etc.) and utilizing the rich instruction set of LLVM IR, complete expression of virtual NPU instructions can be achieved. This translation provides a solid foundation for subsequent optimization, code generation, and adaptation to target architectures.


qwas982 commented Dec 30, 2024

"Mimic LLVM-IR and design a minimal Turing machine instruction set supporting 2-bit to 128-bit vectors. Convert it into a stack machine-based instruction set. List the complete assembly mnemonic program basic block demonstration. First, use Boolean logic to determine if there are any conflicts, then design the actual specifications."

Designing a Minimal Turing Machine Instruction Set

1. Instruction Set Design Principles

  • Turing Completeness: Supports basic operations such as conditional jumps, loops, and memory read/write.
  • Vector Support: Supports vector operations ranging from 2 bits to 128 bits.
  • Stack Machine Architecture: All operations are based on the stack; operands are popped from the stack top, and results are pushed back onto the stack top.
  • Boolean Logic Conflict Detection: Ensures the instruction set has no conflicts, with clear semantics for each instruction.

2. Instruction Set Classification

2.1 Data Movement Instructions
  • PUSH <value>: Pushes a value onto the stack.
  • POP: Pops the top value from the stack.
  • LOAD <addr>: Pushes the value at memory address <addr> onto the stack.
  • STORE <addr>: Stores the top stack value to memory address <addr>.
2.2 Arithmetic Operation Instructions
  • ADD: Pops the top two values from the stack, adds them, and pushes the result back onto the stack.
  • SUB: Pops the top two values from the stack, subtracts them, and pushes the result back onto the stack.
  • MUL: Pops the top two values from the stack, multiplies them, and pushes the result back onto the stack.
  • DIV: Pops the top two values from the stack, divides them, and pushes the result back onto the stack.
2.3 Logical Operation Instructions
  • AND: Pops the top two values from the stack, performs a bitwise AND operation, and pushes the result back onto the stack.
  • OR: Pops the top two values from the stack, performs a bitwise OR operation, and pushes the result back onto the stack.
  • NOT: Pops the top value from the stack, performs a bitwise NOT operation, and pushes the result back onto the stack.
2.4 Vector Operation Instructions
  • VADD <n>: Pops the top two vectors from the stack, adds them, and pushes the result back onto the stack (<n> is the vector width in bits, e.g., 2 to 128).
  • VMUL <n>: Pops the top two vectors from the stack, multiplies them, and pushes the result back onto the stack.
2.5 Control Flow Instructions
  • JMP <label>: Unconditionally jumps to <label>.
  • JEQ <label>: Pops the top two values from the stack and jumps to <label> if they are equal.
  • JNE <label>: Pops the top two values from the stack and jumps to <label> if they are not equal.
  • CALL <label>: Calls a subroutine.
  • RET: Returns from a subroutine.
2.6 Stack Operation Instructions
  • DUP: Duplicates the top stack value and pushes it back onto the stack.
  • SWAP: Swaps the top two stack values.
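The stack discipline of sections 2.1-2.6 can be demonstrated with a tiny C interpreter covering a representative subset of the opcodes (an illustrative sketch, not the full instruction set; programs are (opcode, operand) pairs):

```c
#include <assert.h>

enum { S_PUSH, S_ADD, S_MUL, S_DUP, S_SWAP, S_HALT };

/* Run a program of (opcode, operand) pairs on a small operand stack
   and return the value on top of the stack at HALT. */
static long stack_run(long prog[][2])
{
    long stk[64];
    long tmp;
    int sp = 0, pc = 0;
    for (;;) {
        long op = prog[pc][0], arg = prog[pc][1];
        pc++;
        switch (op) {
        case S_PUSH: stk[sp] = arg; sp++; break;
        case S_ADD:  sp--; stk[sp - 1] += stk[sp]; break;
        case S_MUL:  sp--; stk[sp - 1] *= stk[sp]; break;
        case S_DUP:  stk[sp] = stk[sp - 1]; sp++; break;
        case S_SWAP: tmp = stk[sp - 1]; stk[sp - 1] = stk[sp - 2];
                     stk[sp - 2] = tmp; break;
        case S_HALT: return stk[sp - 1];
        }
    }
}
```

For example, PUSH 2; PUSH 3; ADD; DUP; MUL; HALT computes (2 + 3)² on the stack.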

3. Boolean Logic Conflict Detection

  • Instruction Uniqueness: Each instruction's mnemonic is unique and unambiguous.
  • Operand Consistency: Instruction operands are consistent with the stack state; for example, ADD requires at least two values on the stack top.
  • Label Uniqueness: Control flow instruction labels are unique to avoid jump conflicts.

4. Stack Machine-Based Instruction Set Specifications

4.1 Instruction Format
  • Fixed Length: Each instruction consists of an opcode and operands.
  • Opcode: 1 byte, representing the instruction type.
  • Operands: Depending on the instruction type, may be immediate values, addresses, or labels.
4.2 Stack State
  • Stack Pointer (SP): Points to the stack top.
  • Stack Size: Dynamically expandable to support any number of operands.

5. Assembly Mnemonic Program Block Demonstration

Example Program: Calculate vector addition and determine if the result is 0.

// Define vectors
PUSH 0x01        // Low-order bit of vector 1
PUSH 0x02        // High-order bit of vector 1
PUSH 0x03        // Low-order bit of vector 2
PUSH 0x04        // High-order bit of vector 2

// Vector addition
VADD 16          // 16-bit vector addition

// Check if the result is 0
DUP              // Duplicate the result so the comparison does not consume it
PUSH 0x00        // Push 0 for comparison
JEQ ZERO_LABEL   // Jump to ZERO_LABEL if the result equals 0

// Result is not 0
PUSH 0x01        // Push 1
STORE 0x100      // Store the result to memory address 0x100
JMP END_LABEL    // Jump to the end label

// Result is 0
ZERO_LABEL:
PUSH 0x00        // Push 0
STORE 0x100      // Store the result to memory address 0x100

// End
END_LABEL:
RET              // Return

6. Summary

This minimal Turing machine instruction set, based on stack machine architecture, supports vector operations ranging from 2 bits to 128 bits and ensures unambiguous instruction semantics through Boolean logic conflict detection. The assembly mnemonic program block demonstration showcases practical usage of the instruction set. This design can be further extended into more complex virtual machines or hardware implementations.
