RVF

Float Register

f0 - f31

fcsr

  • FRCSR reads fcsr by copying it into integer register rd.

  • FSCSR swaps the value in fcsr by copying the original value into integer register rd, and then writing a new value obtained from integer register rs1 into fcsr.

  • FRRM instruction reads the Rounding Mode field frm and copies it into the least-significant three bits of integer register rd, with zero in all other bits.

  • FSRM swaps the value in frm by copying the original value into integer register rd, and then writing a new value obtained from the three least-significant bits of integer register rs1 into frm.

  • FRFLAGS and FSFLAGS are defined analogously for the Accrued Exception Flags field fflags.

  • Bits 31–8 of the fcsr are reserved for other standard extensions, including the “L” standard extension for decimal floating-point. If these extensions are not present, implementations shall ignore writes to these bits and supply a zero value when read. Standard software should preserve the contents of these bits.

  • Floating-point operations use either a static rounding mode encoded in the instruction, or a dynamic rounding mode held in frm.

    Rounding modes are encoded as shown in Table. A value of 111 in the instruction’s rm field selects the dynamic rounding mode held in frm. If frm is set to an invalid value (101–111), any subsequent attempt to execute a floating-point operation with a dynamic rounding mode will raise an illegal instruction exception. Some instructions, including widening conversions, have the rm field but are nevertheless unaffected by the rounding mode; software should set their rm field to RNE (000).

  • The accrued exception flags indicate the exception conditions that have arisen on any floating-point arithmetic instruction since the field was last reset by software, as shown in Table 11.2. The base RISC-V ISA does not support generating a trap on the setting of a floating-point exception flag.
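
The CSR access instructions above can be sketched as a small behavioral model. This is only an illustration: the class name `Fcsr` and its method names are hypothetical, and the field positions (frm in bits 7:5, fflags in bits 4:0) are taken from the RISC-V F-extension specification rather than from these notes.

```python
class Fcsr:
    """Behavioral sketch of fcsr: frm in bits 7:5, fflags in bits 4:0 (assumed layout)."""

    FFLAGS_MASK = 0x1F
    FRM_MASK = 0x7
    FRM_SHIFT = 5

    def __init__(self):
        self.value = 0

    def frcsr(self):
        """FRCSR: return the whole register (value for rd)."""
        return self.value

    def fscsr(self, rs1):
        """FSCSR: swap; bits 31-8 are ignored on write and read as zero."""
        old = self.value
        self.value = rs1 & 0xFF
        return old

    def frrm(self):
        """FRRM: frm in the three least-significant bits, zero elsewhere."""
        return (self.value >> self.FRM_SHIFT) & self.FRM_MASK

    def fsrm(self, rs1):
        """FSRM: swap frm with the three least-significant bits of rs1."""
        old = self.frrm()
        cleared = self.value & ~(self.FRM_MASK << self.FRM_SHIFT)
        self.value = cleared | ((rs1 & self.FRM_MASK) << self.FRM_SHIFT)
        return old

    def frflags(self):
        """FRFLAGS: return the accrued exception flags."""
        return self.value & self.FFLAGS_MASK

    def fsflags(self, rs1):
        """FSFLAGS: swap fflags with the five least-significant bits of rs1."""
        old = self.frflags()
        self.value = (self.value & ~self.FFLAGS_MASK) | (rs1 & self.FFLAGS_MASK)
        return old
```

Each "swap" method returns the old field value, which is what the instruction would place in rd.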

Instruction

000000000001_00000_001_01011_1110011
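
The bit string above can be split into its fields with a short helper. This is a sketch that assumes the standard RISC-V I-type/SYSTEM layout (csr[31:20], rs1[19:15], funct3[14:12], rd[11:7], opcode[6:0]); the function name `decode_system` is hypothetical.

```python
def decode_system(bits: str) -> dict:
    """Split a 32-bit SYSTEM-format instruction, given as a bit string, into fields."""
    word = int(bits.replace("_", ""), 2)
    return {
        "opcode": word & 0x7F,          # bits 6:0
        "rd": (word >> 7) & 0x1F,       # bits 11:7
        "funct3": (word >> 12) & 0x7,   # bits 14:12
        "rs1": (word >> 15) & 0x1F,     # bits 19:15
        "csr": (word >> 20) & 0xFFF,    # bits 31:20
    }


fields = decode_system("000000000001_00000_001_01011_1110011")
print(fields)
```

Under that layout the fields read as opcode 0x73 (SYSTEM), funct3 001 (CSRRW), csr 0x001 (fflags), rs1 = x0 and rd = x11, which would correspond to an FSFLAGS instruction.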

Load and Store Instructions

  • FLW : rd = M[rs1 + imm]
  • FSW : M[rs1 + imm] = rs2

Computational Instructions

  • FADD.S & FSUB.S & FMUL.S & FDIV.S & FSQRT.S

  • FMIN.S & FMAX.S

    For the purposes of these instructions only, the value −0.0 is considered to be less than the value +0.0. If both inputs are NaNs, the result is the canonical NaN. If only one operand is a NaN, the result is the non-NaN operand. Signaling NaN inputs set the invalid operation exception flag, even when the result is not NaN.

  • FMADD.S : rd = (rs1 × rs2) + rs3

  • FMSUB.S : rd = (rs1 × rs2) - rs3

  • FNMSUB.S : rd = -(rs1 × rs2) + rs3

  • FNMADD.S : rd = -(rs1 × rs2) - rs3

  • The 2-bit floating-point format field fmt is encoded as shown in Table. It is set to S (00) for all instructions in the F extension.

  • All floating-point operations that perform rounding can select the rounding mode using the rm field with the encoding shown in Table.
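
The FMIN.S/FMAX.S rules listed in this section can be sketched on raw 32-bit patterns. This is a minimal model, not a reference implementation: the helper names are hypothetical, and the canonical NaN pattern 0x7FC00000 and the quiet bit (bit 22) are taken from the RISC-V specification.

```python
CANONICAL_NAN = 0x7FC00000  # canonical single-precision NaN (from the RISC-V spec)


def is_nan(bits):
    return (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0


def is_signaling_nan(bits):
    # A NaN whose most significant fraction bit (bit 22) is clear is signaling.
    return is_nan(bits) and (bits & 0x00400000) == 0


def order_key(bits):
    """Sort key for non-NaN patterns that places -0.0 below +0.0."""
    magnitude = bits & 0x7FFFFFFF
    return -magnitude - 1 if bits >> 31 else magnitude


def fminmax_s(a, b, want_max=False):
    """Return (result_bits, invalid_flag) for FMIN.S (default) or FMAX.S."""
    invalid = is_signaling_nan(a) or is_signaling_nan(b)
    if is_nan(a) and is_nan(b):
        result = CANONICAL_NAN
    elif is_nan(a):
        result = b
    elif is_nan(b):
        result = a
    else:
        pick = max if want_max else min
        result = pick(a, b, key=order_key)
    return result, invalid
```

For example, `fminmax_s(0x80000000, 0x00000000)` selects −0.0, and a signaling NaN paired with 1.0 returns 1.0 with the invalid flag set.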

Conversion & Move & SignInjection

  • FCVT.S.W : rd(float) = rs1(int)
  • FCVT.W.S : rd(int) = rs1(float)
  • FCVT.S.WU : rd(float) = rs1(uint)
  • FCVT.WU.S : rd(uint) = rs1(float)

FMV instructions are provided to move bit patterns between the floating-point and integer registers.

The bits are not modified in the transfer, and in particular, the payloads of non-canonical NaNs are preserved.

  • FMV.X.W : rd(int) = rs1(float)
  • FMV.W.X : rd(float) = rs1(int)
  • FSGNJ : rd = sgn(rs2) * abs(rs1)
  • FSGNJN : rd = -sgn(rs2) * abs(rs1)
  • FSGNJX : rd = (sgn(rs1) ^ sgn(rs2)) * abs(rs1), or equivalently rd = sgn(rs2) * rs1
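
Because sign injection only touches the sign bit, the three instructions can be sketched as pure bit operations on 32-bit patterns. The function names are hypothetical; `a` stands for rs1 and `b` for rs2.

```python
SIGN_BIT = 0x80000000
MAGNITUDE_MASK = 0x7FFFFFFF


def fsgnj(a, b):
    """Magnitude of a, sign of b."""
    return (a & MAGNITUDE_MASK) | (b & SIGN_BIT)


def fsgnjn(a, b):
    """Magnitude of a, inverted sign of b."""
    return (a & MAGNITUDE_MASK) | (~b & SIGN_BIT)


def fsgnjx(a, b):
    """Magnitude of a, sign of a XOR sign of b."""
    return a ^ (b & SIGN_BIT)
```

With `a = b`, these give the usual idioms: `fsgnj` is a move, `fsgnjn` a negation, and `fsgnjx` an absolute value. NaN payloads pass through unchanged because only bit 31 is modified.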

Compare Instructions

  • FEQ.S : rd = (rs1 == rs2) ? 1 : 0
  • FLT.S : rd = (rs1 < rs2) ? 1 : 0
  • FLE.S : rd = (rs1 <= rs2) ? 1 : 0

Classify Instruction

  • FCLASS.S instruction examines the value in floating-point register rs1 and writes to integer register rd a 10-bit mask that indicates the class of the floating-point number.
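
A sketch of the classification follows. The bit assignment of the 10-bit mask (bit 0 = −∞ up to bit 9 = quiet NaN) is not spelled out in these notes; it is taken from the RISC-V specification, and the function name is hypothetical.

```python
def fclass_s(bits):
    """Return the one-hot 10-bit FCLASS.S mask for a 32-bit pattern.

    Assumed bit positions: 0 -inf, 1 -normal, 2 -subnormal, 3 -0,
    4 +0, 5 +subnormal, 6 +normal, 7 +inf, 8 signaling NaN, 9 quiet NaN.
    """
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0xFF:
        if fraction == 0:
            index = 0 if sign else 7
        else:
            index = 9 if fraction & 0x400000 else 8
    elif exponent == 0:
        if fraction == 0:
            index = 3 if sign else 4
        else:
            index = 2 if sign else 5
    else:
        index = 1 if sign else 6
    return 1 << index
```

Exactly one bit of the result is set for any input.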

Test

FPnew Documentation

FPnew is a parametric floating-point unit which supports standard RISC-V operations as well as transprecision formats, written in SystemVerilog.

Top-Level Interface

The top-level module of the FPU is fpnew_top and its interface is further described in this section.
FPnew uses a synchronous interface using handshaking to transfer data into and out of the FPU.

All array types are packed due to poor support of unpacked arrays in some EDA tools.
SystemVerilog interfaces are not used due to poor support in some EDA tools.

Parameters

The configuration parameters use data types defined in fpnew_pkg which are structs containing multi-dimensional arrays of custom enumeration types.
For more in-depth explanations on how to configure the unit and the layout of the types used, please refer to the Configuration Section.

| Parameter Name | Description |
|----------------|-------------|
| Features | Specifies the features of the FPU, such as the set of supported formats and operations. |
| Implementation | Controls how the above features are implemented, such as the number of pipeline stages and the architecture of subunits. |
| TagType | The SystemVerilog data type of the operation tag. |
| TrueSIMDClass | If enabled, the result of a classify operation in vectorial mode will be RISC-V compliant if each output has at least 10 bits. |
| EnableSIMDMask | Enables masking of the RISC-V floating-point status flags of inactive vectorial lanes. When disabled, simd_mask_i is inactive. |

Ports

Many ports use custom types and enumerations from fpnew_pkg to improve code structure internally (see Data Types).
As the width of some input/output signals is defined by the configuration, it is denoted W in the following table.

| Port Name | Direction | Type | Description |
|-----------|-----------|------|-------------|
| clk_i | in | logic | Clock, synchronous, rising-edge triggered |
| rst_ni | in | logic | Asynchronous reset, active low |
| operands_i | in | logic [2:0][W-1:0] | Operands, henceforth referred to as op[i] |
| rnd_mode_i | in | roundmode_e | Floating-point rounding mode |
| op_i | in | operation_e | Operation select |
| op_mod_i | in | logic | Operation modifier |
| src_fmt_i | in | fp_format_e | Source FP format |
| dst_fmt_i | in | fp_format_e | Destination FP format |
| int_fmt_i | in | int_format_e | Integer format |
| vectorial_op_i | in | logic | Vectorial operation select |
| tag_i | in | TagType | Operation tag input |
| simd_mask_i | in | MaskType | Vector mask input for the status flags |
| in_valid_i | in | logic | Input data valid (see Handshake) |
| in_ready_o | out | logic | Input interface ready (see Handshake) |
| flush_i | in | logic | Synchronous pipeline reset |
| result_o | out | logic [W-1:0] | Result |
| status_o | out | status_t | RISC-V floating-point status flags fflags |
| tag_o | out | TagType | Operation tag output |
| out_valid_o | out | logic | Output data valid (see Handshake) |
| out_ready_i | in | logic | Output interface ready (see Handshake) |
| busy_o | out | logic | FPU operation in flight |

Data Types

The following custom data types and enumerations are used in ports of the FPU and are defined in fpnew_pkg.
Default values from the package are listed.

roundmode_e - FP Rounding Mode

Enumeration of type logic [2:0] holding available rounding modes, encoded for use in RISC-V cores:

| Enumerator | Value | Rounding Mode |
|------------|-------|---------------|
| RNE | 3'b000 | To nearest, tie to even (default) |
| RTZ | 3'b001 | Toward zero |
| RDN | 3'b010 | Toward negative infinity |
| RUP | 3'b011 | Toward positive infinity |
| RMM | 3'b100 | To nearest, tie away from zero |
| ROD | 3'b101 | To odd |
| DYN | 3'b111 | RISC-V Dynamic RM, invalid if passed to operations |

operation_e - FP Operation

Enumeration of type logic [3:0] holding the FP operation.
The operation modifier op_mod_i can change the operation carried out.
Unless noted otherwise, the first operand op[0] is used for the operation.

| Enumerator | Modifier | Operation |
|------------|----------|-----------|
| FMADD | 0 | Fused multiply-add ((op[0] * op[1]) + op[2]) |
| FMADD | 1 | Fused multiply-subtract ((op[0] * op[1]) - op[2]) |
| FNMSUB | 0 | Negated fused multiply-subtract (-(op[0] * op[1]) + op[2]) |
| FNMSUB | 1 | Negated fused multiply-add (-(op[0] * op[1]) - op[2]) |
| ADD | 0 | Addition (op[1] + op[2]); note the operand indices |
| ADD | 1 | Subtraction (op[1] - op[2]); note the operand indices |
| MUL | 0 | Multiplication (op[0] * op[1]) |
| DIV | 0 | Division (op[0] / op[1]) |
| SQRT | 0 | Square root |
| SGNJ | 0 | Sign injection, operation encoded in rounding mode. RNE: op[0] with sign(op[1]); RTZ: op[0] with ~sign(op[1]); RDN: op[0] with sign(op[0]) ^ sign(op[1]); RUP: op[0] (passthrough) |
| SGNJ | 1 | As above, but result is sign-extended instead of NaN-boxed |
| MINMAX | 0 | Minimum / maximum, operation encoded in rounding mode. RNE: minimumNumber(op[0], op[1]); RTZ: maximumNumber(op[0], op[1]) |
| CMP | 0 | Comparison, operation encoded in rounding mode. RNE: op[0] <= op[1]; RTZ: op[0] < op[1]; RDN: op[0] == op[1] |
| CLASSIFY | 0 | Classification, returns RISC-V classification block |
| F2F | 0 | FP to FP cast, formats given by src_fmt_i and dst_fmt_i |
| F2I | 0 | FP to signed integer cast, formats given by src_fmt_i and int_fmt_i |
| F2I | 1 | FP to unsigned integer cast, formats given by src_fmt_i and int_fmt_i |
| I2F | 0 | Signed integer to FP cast, formats given by int_fmt_i and dst_fmt_i |
| I2F | 1 | Unsigned integer to FP cast, formats given by int_fmt_i and dst_fmt_i |
| CPKAB | 0 | Cast-and-pack op[0] and op[1] to entries 0, 1 of vector op[2] |
| CPKAB | 1 | Cast-and-pack op[0] and op[1] to entries 2, 3 of vector op[2] |
| CPKCD | 0 | Cast-and-pack op[0] and op[1] to entries 4, 5 of vector op[2] |
| CPKCD | 1 | Cast-and-pack op[0] and op[1] to entries 6, 7 of vector op[2] |

fp_format_e - FP Formats

Enumeration of type logic [2:0] holding the supported FP formats.

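| Enumerator | Format | Width | Exp. Bits | Man. Bits |
|------------|--------|-------|-----------|-----------|
| FP32 | IEEE binary32 | 32 bit | 8 | 23 |
| FP64 | IEEE binary64 | 64 bit | 11 | 52 |
| FP16 | IEEE binary16 | 16 bit | 5 | 10 |
| FP8 | binary8 | 8 bit | 5 | 2 |
| FP16ALT | binary16alt | 16 bit | 8 | 7 |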

The following global parameters associated with FP formats are set in fpnew_pkg:

localparam int unsigned NUM_FP_FORMATS = 5;
localparam int unsigned FP_FORMAT_BITS = $clog2(NUM_FP_FORMATS);

int_format_e - Integer Formats

Enumeration of type logic [1:0] holding the supported integer formats.

| Enumerator | Width |
|------------|-------|
| INT8 | 8 bit |
| INT16 | 16 bit |
| INT32 | 32 bit |
| INT64 | 64 bit |

The following global parameters associated with integer formats are set in fpnew_pkg:

localparam int unsigned NUM_INT_FORMATS = 4;
localparam int unsigned INT_FORMAT_BITS = $clog2(NUM_INT_FORMATS);

status_t - FP Status Flags

Packed struct containing the five FP status flags as logic in order MSB to LSB:

| Member | Description |
|--------|-------------|
| NV | Invalid operation |
| DZ | Division by zero |
| OF | Overflow |
| UF | Underflow |
| NX | Inexact operation |

NaN-Boxing

RISC-V mandates so-called NaN-boxing of all FP values in formats that are narrower than the widest available format in the system.
This means that all unused high-order bits of narrow formats must be set to '1, otherwise the value is considered invalid (a NaN).

Checks for whether input values are properly NaN-boxed are enabled by default but can be turned off (see Configuration).
Narrow FP output values from the FPU are always NaN-boxed.
Narrow integer output values from the FPU are sign-extended, even if unsigned.

On RV32 with a 32-bit datapath, FP32 is as wide as the datapath itself, so there are no unused high-order bits and the operands are always properly boxed.
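
The boxing rule can be sketched with a few helpers. All names here are hypothetical, register values are assumed to fit in `width` bits, and replacing an improperly boxed FP32 input with the canonical NaN 0x7FC00000 follows the RISC-V specification rather than anything stated in these notes.

```python
def nan_box(value, fmt_width, width=64):
    """Place a narrow FP value in a register, setting all unused upper bits to 1."""
    low_mask = (1 << fmt_width) - 1
    upper_ones = ((1 << width) - 1) ^ low_mask
    return upper_ones | (value & low_mask)


def is_nan_boxed(reg, fmt_width, width=64):
    """True if every bit above the narrow format is 1 (trivially true if none exist)."""
    return (reg >> fmt_width) == (1 << (width - fmt_width)) - 1


def unbox_fp32(reg, width=64):
    """Return the FP32 payload, or the canonical NaN if the value is not boxed."""
    return reg & 0xFFFFFFFF if is_nan_boxed(reg, 32, width) else 0x7FC00000
```

With `width=32` the check always passes for FP32, which matches the RV32 remark above.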

Handshake Interface

Both the input and output side of FPnew feature a valid/ready handshake interface which controls the flow of data into and out of the FPU.
The handshaking protocol is similar to ones used in common protocols such as AXI:

  • An asserted valid signals that data on the corresponding interface is valid and stable.
  • Once valid is asserted, it must not be deasserted until the handshake is complete.
  • An asserted ready signals that the interface is capable of processing data on the following rising clock edge.
  • Once valid and ready are asserted during a rising clock edge, the transaction is complete.
  • After a completed transaction, valid may remain asserted to provide new data for the next transfer.
  • The protocol direction is top-down. ready may depend on valid but valid must not depend on ready.
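
The handshake rules above can be illustrated with a small cycle-by-cycle simulation. This is a simplified sketch, not a model of FPnew itself: the function name `simulate` and the per-cycle `ready_pattern` list are hypothetical, and the producer is assumed to always have its next item available.

```python
def simulate(data, ready_pattern):
    """Return the (cycle, item) pairs at which valid/ready transfers complete.

    The producer asserts valid whenever it has data, independent of ready,
    and holds the same item until the handshake completes.
    """
    transfers = []
    index = 0
    for cycle, ready in enumerate(ready_pattern):
        valid = index < len(data)      # valid never depends on ready
        if valid and ready:            # both high at the clock edge: transfer
            transfers.append((cycle, data[index]))
            index += 1                 # valid may stay high with the next item
        # valid without ready: the item is held stable for the next cycle
    return transfers


print(simulate(["a", "b", "c"], [0, 1, 1, 0, 1]))
```

In this run the first item waits one cycle for ready, the second transfers back-to-back, and the third is stalled for one cycle.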

Operation Tags

Operation tags are metadata accompanying an operation and can be used to link results back to the operations that produced them.
Tags traverse the FPU without being modified, but always stay in sync with the operation they were issued with.

Tags are an optional feature of FPnew and can be controlled by setting the TagType parameter as needed (usually a packed vector of logic, but can be any type).
In order to disable the use of tags, set TagType to logic (the default value), and bind the tag_i port to a static value.
Furthermore ensure that your synthesis tool removes static registers.

Mask for the status flags

This input is meant to be used in vectorial mode. The mask for the status flags is an input vector with NumLanes bits, and each bit can mask the status flags of a different FPU vectorial lane. This keeps status flags raised by inactive lanes from polluting the final output status flags.
If simd_mask_i[n] == 1'b0, the nth FPU lane will be masked for this operation and its resulting status flags will not be propagated to the final output status flag.
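
A minimal sketch of the masking follows. It assumes that the per-lane flags are combined by a bitwise OR, which these notes do not state explicitly, and it uses the NV/DZ/OF/UF/NX bit order of status_t (MSB to LSB); the function name is hypothetical.

```python
NV, DZ, OF, UF, NX = 0x10, 0x08, 0x04, 0x02, 0x01  # status_t order, MSB to LSB


def merge_status(lane_flags, simd_mask):
    """OR together the status flags of all lanes whose simd_mask bit is 1."""
    result = 0
    for lane, flags in enumerate(lane_flags):
        if (simd_mask >> lane) & 1:   # mask bit 0 suppresses this lane's flags
            result |= flags
    return result
```

For example, with lanes 1 and 3 masked out, an invalid-operation flag raised in lane 1 does not reach the final status output.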

Configuration

Main configuration of the FPU is done through parameters on the fpnew_top module.
A default selection of formats and features is defined in the package and can be controlled through these parameters.
Furthermore, the project package fpnew_pkg can be modified to provide even more custom formats to the FPU.

Configuration Parameters

Features - Feature set of the FPU

The Features parameter is used to configure the available formats and special features of the FPU.
It is of type fpu_features_t which is defined as:

typedef struct packed {
  int unsigned Width;
  logic        EnableVectors;
  logic        EnableNanBox;
  fmt_logic_t  FpFmtMask;
  ifmt_logic_t IntFmtMask;
} fpu_features_t;

The fields of this struct behave as follows:

Width - Datapath Width

Specifies the width of the FPU datapath and of the input and output data ports (operands_i/result_o).
It must be greater than or equal to the width of the widest enabled FP and integer format.

Default: 64

EnableVectors - Vectorial Hardware Generation

Controls the generation of packed-SIMD computation units in the FPU.
If set to 1, vectorial execution units will be generated for all FP formats that are narrower than Width in order to fill up the datapath width.
For example, given Width = 64, there will be four execution units for every operation on 16-bit FP formats.

Default: 1'b1

EnableNanBox - NaN-Boxing Check Control

Controls whether input value NaN-boxing is enforced (see NaN-Boxing).
If set to 1, all values of FP formats that are narrower than Width will be considered NaN unless all unused high-order bits are set to '1.
Output FP values are always NaN-boxed, regardless of this setting.

Default: 1'b1

FpFmtMask - Enabled FP Formats

The FpFmtMask parameter is of type fmt_logic_t which is an array holding one logic bit per FP format from fp_format_e, in ascending order.

typedef logic [0:NUM_FP_FORMATS-1] fmt_logic_t; // Logic indexed by FP format

If a bit in FpFmtMask is set, FPU hardware for the corresponding format is generated.
Otherwise, synthesis tools can optimize away any logic associated with this format and operations on the format yield undefined results.

Default: '1 (all enabled)

IntFmtMask - Enabled Integer Formats

The IntFmtMask parameter is of type ifmt_logic_t which is an array holding one logic bit per integer format from int_format_e, in ascending order.

typedef logic [0:NUM_INT_FORMATS-1] ifmt_logic_t; // Logic indexed by integer format

If a bit in IntFmtMask is set, FPU hardware for the corresponding format is generated.
Otherwise, synthesis tools can optimize away any logic associated with this format and operations on the format yield undefined results.

Default: '1 (all enabled)

Implementation - Implementation Options

The FPU is divided into four operation groups, ADDMUL, DIVSQRT, NONCOMP, and CONV (see Architecture: Top-Level).
The Implementation parameter controls the implementation of these operation groups.
It is of type fpu_implementation_t which is defined as:

typedef struct packed {
  opgrp_fmt_unsigned_t   PipeRegs;
  opgrp_fmt_unit_types_t UnitTypes;
  pipe_config_t          PipeConfig;
} fpu_implementation_t;

The fields of this struct behave as follows:

PipeRegs - Number of Pipelining Stages

The PipeRegs parameter is of type opgrp_fmt_unsigned_t which is an array of arrays, holding for each operation group for each format an unsigned value, in ascending order.

typedef logic [0:NUM_FP_FORMATS-1][31:0] fmt_unsigned_t;        // Array of unsigned indexed by FP format
typedef fmt_unsigned_t [0:NUM_OPGROUPS-1] opgrp_fmt_unsigned_t; // Array of format-specific unsigned indexed by operation group

This parameter sets a number of pipeline stages to be inserted into the computational units per operation group, per FP format.
As such, latencies for different operations and different formats can be freely configured.

Default: '{default: 0} (no pipelining - all operations combinatorial)

UnitTypes - HW Unit Implementation

The UnitTypes parameter is of type opgrp_fmt_unit_types_t which is an array of arrays, holding for each operation group for each format an enumeration value, in ascending order.

typedef unit_type_t [0:NUM_FP_FORMATS-1]    fmt_unit_types_t;       // Array of unit types indexed by format
typedef fmt_unit_types_t [0:NUM_OPGROUPS-1] opgrp_fmt_unit_types_t; // Array of format-specific unit types indexed by opgroup

The unit type unit_type_t is an enumeration of type logic [1:0] holding the following implementation options for a particular hardware unit:

| Enumerator | Description |
|------------|-------------|
| DISABLED | No hardware units will be generated for this format |
| PARALLEL | One hardware unit per format will be generated |
| MERGED | One combined multi-format hardware unit will be generated for all formats selecting MERGED |

The UnitTypes parameter controls the resources used by the FPU, either by removing operation units for certain formats and operations, or by merging multiple formats into one unit.
Currently, the following unit types are available for the FPU operation groups:

|          | ADDMUL | DIVSQRT | NONCOMP | CONV |
|----------|--------|---------|---------|------|
| PARALLEL | :heavy_check_mark: | | :heavy_check_mark: | |
| MERGED | :heavy_check_mark: | :heavy_check_mark: | | :heavy_check_mark: |

Default:

'{'{default: PARALLEL}, // ADDMUL
  '{default: MERGED},   // DIVSQRT
  '{default: PARALLEL}, // NONCOMP
  '{default: MERGED}}   // CONV

(all formats within operation group use same type)

PipeConfig - Pipeline Register Placement

The PipeConfig parameter is of type pipe_config_t and controls register placement in operational units.
The requested number of registers is placed to predefined locations within the units according to the PipeConfig parameter.
For best results, we strongly encourage the use of automatic retiming options in synthesis tools to optimize the pre-placed pipeline registers.

The configuration pipe_config_t is an enumeration of type logic [1:0] holding the following implementation options for the pipelines in operational units:

| Enumerator | Description |
|------------|-------------|
| BEFORE | All pipeline registers are inserted at the inputs of the operational unit |
| AFTER | All pipeline registers are inserted at the outputs of the operational unit |
| INSIDE | All registers are inserted at roughly the middle of the operational unit (if not possible, BEFORE) |
| DISTRIBUTED | Registers are evenly distributed to INSIDE, BEFORE, and AFTER (if no INSIDE, all BEFORE) |

Adding Custom Formats

In order to add custom FP or integer formats to the FPU, it is necessary to make small changes to fpnew_pkg.
New formats can easily be added by extending the default list of available formats, and/or by changing or removing the defaults.

Namely, the following parameters and types shall be adapted:

// For FP formats:
localparam int unsigned NUM_FP_FORMATS
typedef enum logic [FP_FORMAT_BITS-1:0] {...} fp_format_e
localparam fp_encoding_t [0:NUM_FP_FORMATS-1] FP_ENCODINGS
localparam fmt_logic_t CPK_FORMATS

// For Int formats:
localparam int unsigned NUM_INT_FORMATS
typedef enum logic [INT_FORMAT_BITS-1:0] {...} int_format_e

Furthermore, the default configuration parameters shall be adjusted to match the dimensions of the modified format list.

No other changes should be necessary to the package or other source files of the FPU.

Architecture

The exact architecture of FPnew depends on the configuration through parameters.
The main architectural traits as well as the effect of some parameters are described henceforth.

The design philosophy behind FPnew is that the “plumbing” of the architecture is quite regular and generic and the actual operations that handle the data are located at the lowest level.
Handshaking is used to pass data through the hierarchy levels.
As such, very fine-grained clock-gating can be applied to silence all parts of the architecture that are not actively contributing to useful work, significantly increasing energy efficiency.

Top-Level

The topmost level of hierarchy in FPnew is host to several operation group blocks as well as an output arbiter.
The operation group is the highest level of grouping within FPnew and signifies a class of operations that can usually be executed on a single hardware unit - such as additions and multiplications usually being mapped to an FMA unit.

There are currently four operation groups in FPnew which are enumerated in opgroup_e as outlined in the following table:

| Enumerator | Description | Associated Operations |
|------------|-------------|-----------------------|
| ADDMUL | Addition and Multiplication | FMADD, FNMSUB, ADD, MUL |
| DIVSQRT | Division and Square Root | DIV, SQRT |
| NONCOMP | Non-Computational Operations like Comparisons | SGNJ, MINMAX, CMP, CLASSIFY |
| CONV | Conversions | F2I, I2F, F2F, CPKAB, CPKCD |

Most architectural decisions for FPnew are made at very fine granularity.
The big exception to this is the generation of vectorial hardware which is decided at top level through the EnableVectors parameter.

Operation Group Blocks

Each operation group is implemented in its own operation group block, each generating slices.
A unit type is selected for each format according to the settings in the Implementation parameter.
Formats can either be implemented in a format-specific PARALLEL slice, or a multi-format MERGED slice.
Both PARALLEL and MERGED slices can co-exist in case a subset of formats is assigned to both of the two options.

Format-Specific Slices (PARALLEL)

In a parallel slice, operational units capable of processing exactly one format are generated.
If EnableVectors is set, operational units are duplicated into vectorial lanes in order to fill up the width of the datapath.
Results from all lanes are collected and assembled at the output of the slice.

Implementing units as parallel slices usually yields best format-specific latency, however costs more in terms of area.

Multi-Format Slices (MERGED)

In a merged slice, operational units capable of processing multiple formats are generated.
If EnableVectors is set, operational units for narrow formats are duplicated into vectorial lanes in order to fill up the width of the datapath.
To facilitate vectorial conversions that update an input vector, the third operand is pipelined along with the operation in the CONV block.
Results from all lanes are collected and assembled at the output of the slice.

Implementing units as merged slices usually yields best total area, however costs more in terms of per-format latency.

When the ADDMUL block is implemented using the MERGED implementation, multi-format FMA (multiplication done in src_format, accumulation in dst_format) is automatically supported among all formats using MERGED.

Pipelining

Pipeline registers are inserted into the operational units directly, according to the settings in the Implementation parameter.
As such, each slice in the system can have a different latency.
Merged slices are bound to have the largest latency of the included formats.

All pipeline registers are inserted as shift registers at predefined locations in the FPU.
For optimal mapping, the retiming functionality of your synthesis tools should be used to balance the paths between registers.

Data traverses the pipeline stages within the operational units using the same handshaking mechanism that is also present at the top-level FPU interface.
An individual pipeline stage is only stalled if its successor stage is stalled and cannot proceed in the following cycle.
In general, different operations can overtake each other in the FPU if their latencies differ or significant backpressure exists in one of the paths.
Hence, the use of operation tags is required to identify the exiting data if more than one operation is allowed to enter the FPU.

Output Arbitration

There are round-robin arbiters located at the outputs of slices as well as the outputs of operation group blocks that resolve contentions for the output port of the FPU.
Arbitration is fair, i.e. a unit cannot write the outputs twice in a row if other units are also contending for the output.
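
The fairness property can be illustrated with a generic round-robin arbiter model. This is a textbook sketch under the assumption of a simple rotating priority; it is not derived from the FPnew arbiter source, and the class name is hypothetical.

```python
class RoundRobinArbiter:
    """Grant one requester per call, starting the search after the last winner."""

    def __init__(self, num_requesters):
        self.num_requesters = num_requesters
        self.last = num_requesters - 1   # so that requester 0 is checked first

    def grant(self, requests):
        """Return the index of the granted requester, or None if nobody requests."""
        for offset in range(1, self.num_requesters + 1):
            index = (self.last + offset) % self.num_requesters
            if requests[index]:
                self.last = index
                return index
        return None


arbiter = RoundRobinArbiter(3)
grants = [arbiter.grant([1, 1, 0]) for _ in range(4)]
print(grants)
```

When units 0 and 1 both keep requesting, the grants alternate between them, so neither writes the output twice in a row.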

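Simulation / Synthesis

A simulator executes the RTL strictly according to Verilog simulation semantics. A synthesis tool, by contrast, only infers the designer's intent from the code and generates the corresponding circuit structure; this inference is somewhat subjective and does not strictly follow Verilog semantics.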

IEEE 754

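Format

  • sign (1 bit): sign bit

  • exponent (8 bits): exponent field

    bias: exponent offset (127 for single precision), which keeps the stored exponent field non-negative

    exponent [1, 254] -- (-bias) --> [-126,127]

    exponent = 0 (all zeros): subnormal number

    exponent = 255 (all ones): special value (not an ordinary number)

  • fraction (23 bits): fraction (mantissa) field

    normal number: 1.fraction (low-order bits padded with 0)

    subnormal number: 0.fraction (low-order bits padded with 0)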

Normal numbers

  • normal number = $(-1)^{sign} \times (1.fraction) \times 2^{(exponent - bias)}$

  • Range: $(-2 \times 2^{127}, -1 \times 2^{-126}] \cup [1 \times 2^{-126}, 2 \times 2^{127})$

    Approximately: $[-3.4 \times 10^{38}, -1.18 \times 10^{-38}] \cup [1.18 \times 10^{-38}, 3.4 \times 10^{38}]$

Subnormal numbers

  • subnormal number = [sign (+/-), exponent (all zeros, treated as 1), fraction]

  • subnormal number = $(-1)^{sign} \times (0.fraction) \times 2^{(1 - bias)}$

  • Range: $(-1 \times 2^{-126}, 1 \times 2^{-126})$

Special values (non-numbers)

  • Infinity = [sign (+/-), exponent (all ones), fraction (all zeros)] ($\pm 2^{128}$)
  • NaN = [sign (+/-), exponent (all ones), fraction (not all zeros)]
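
The three cases above (normal, subnormal, special) can be checked with a small decoder that uses exact rational arithmetic. This is an illustrative sketch; the function name is hypothetical, and special values are returned as strings for simplicity.

```python
from fractions import Fraction

BIAS = 127


def decode_fp32(bits):
    """Decode a 32-bit IEEE 754 pattern into an exact Fraction or a special-value tag."""
    sign = -1 if bits >> 31 else 1
    exponent = (bits >> 23) & 0xFF
    fraction = Fraction(bits & 0x7FFFFF, 1 << 23)    # 0.fraction
    if exponent == 0xFF:                             # all ones: infinity or NaN
        if fraction:
            return "nan"
        return "-inf" if sign < 0 else "+inf"
    if exponent == 0:                                # all zeros: subnormal (or zero)
        return sign * fraction * Fraction(2) ** (1 - BIAS)
    return sign * (1 + fraction) * Fraction(2) ** (exponent - BIAS)
```

For instance, `decode_fp32(0x00000001)` yields the smallest subnormal, $2^{-149}$, and `decode_fp32(0x00800000)` yields the smallest normal number, $2^{-126}$.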

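Floating-Point Addition and Subtraction

Sign-Magnitude Addition and Subtraction

Floating-point significands are stored in sign-magnitude form, so computing $in1 \pm in2$ requires sign-magnitude addition and subtraction.

  • Compare the signs of the two operands

    • Addition: same signs, add the magnitudes; different signs, subtract the magnitudes
    • Subtraction: different signs, add the magnitudes; same signs, subtract the magnitudes
  • Adding the magnitudes

    Add the magnitude bits; a carry out of the most significant bit means the result overflowed. The sign of the result is the sign of the first operand (the augend or minuend).

    • $sign = sign[in1]$
    • $f = f[in1] + f[in2]$; a carry out of the most significant bit means the result overflowed
  • Subtracting the magnitudes

    $f = f[in1] + [f[in2]]_{\text{2c}}$, where $[\cdot]_{\text{2c}}$ denotes the two's complement

    • A carry out of the most significant magnitude bit means the result is positive and the magnitude bits are already correct.
      • $sign = sign[in1]$
    • No carry out of the most significant magnitude bit means the result is negative; the magnitude bits are in two's complement form and must be complemented again.
      • $sign = \overline{sign[in1]}$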
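
The sign-magnitude procedure can be sketched as follows. The function name and the fixed magnitude `width` parameter are hypothetical, signs are 0 for positive and 1 for negative, and the subtraction path adds the two's complement of the second magnitude exactly as in the steps above.

```python
def sign_magnitude_add(sign1, mag1, sign2, mag2, width, subtract=False):
    """Return (sign, magnitude, overflow) of in1 + in2 (or in1 - in2 if subtract)."""
    if subtract:
        sign2 ^= 1                       # in1 - in2 == in1 + (-in2)
    mask = (1 << width) - 1
    if sign1 == sign2:                   # effective addition of magnitudes
        total = mag1 + mag2
        return sign1, total & mask, total > mask
    # Effective subtraction: add the two's complement of mag2.
    total = mag1 + ((~mag2) & mask) + 1
    carry = total >> width
    diff = total & mask
    if carry:                            # result positive, magnitude is correct
        return sign1, diff, False
    # No carry: result negative, re-complement and invert the sign.
    return sign1 ^ 1, (-diff) & mask, False
```

For example, with 4-bit magnitudes, `(+3) + (-5)` takes the no-carry branch and returns sign 1 with magnitude 2.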

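Biased-Code (Excess-Code) Addition and Subtraction

Exponent difference: $[\Delta E]_{\text{2c}} = [E_x]_{\text{bias}} + [-[E_y]_{\text{bias}}]_{\text{2c}}$, where $[\cdot]_{\text{bias}}$ denotes the biased (excess) code and $[\cdot]_{\text{2c}}$ the two's complement code.

Floating-Point Operations

Floating-Point Addition and Subtraction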

$x = M_x * {2} ^ {E_x}, y = M_y * {2} ^ {E_y}$

$x + y = (M_x * {2} ^ {E_x - E_y} + M_y) * {2} ^ {E_y}$

$x - y = (M_x * {2} ^ {E_x - E_y} - M_y) * {2} ^ {E_y}$

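1. Exponent Alignment

The operand with the smaller exponent is aligned to the larger exponent: its significand is shifted right by $|\Delta E|$ bits.

Let $\Delta E = E_x - E_y$:

  • $[\Delta E]_{\text{2c}} = [E_x]_{\text{bias}} + [-[E_y]_{\text{bias}}]_{\text{2c}}$

  • If $[\Delta E]_{\text{2c}} \leq 0$: $E_x \leftarrow E_y$, $M_x \leftarrow M_x \gg (E_y - E_x)$

  • If $[\Delta E]_{\text{2c}} > 0$: $E_y \leftarrow E_x$, $M_y \leftarrow M_y \gg (E_x - E_y)$

Note: the significand is shifted right as a sign-magnitude fraction. The sign bit does not take part in the shift, the implicit leading 1 is shifted into the fraction part, and vacated positions are filled with 0. Bits shifted out at the low end must not be discarded; they are kept and take part in the significand arithmetic.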

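2. Significand Addition and Subtraction

Fixed-point sign-magnitude addition and subtraction: both the hidden bit and the extra low-order bits must be restored and included in the computation.

3. Normalization

The significand after addition or subtraction is not necessarily normalized.

Right normalization (1b.bbb…): shift the significand right by one bit and add 1 to the exponent.

Left normalization (0.00bb…): shift the significand left one bit at a time, decrementing the exponent each time, until the first 1 is to the left of the binary point.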

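4. Rounding

During exponent alignment and right normalization the significand is shifted right; the bits shifted out at the low end must be retained:

  • Guard bit: the bit immediately to the right of the significand.

  • Round bit: the bit to the right of the guard bit; after left normalization, rounding can be decided from its value.

  • Sticky bit: set to 1 if any bit to the right of the round bit is nonzero, otherwise 0.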

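Rounding modes

  • Round to nearest (RNE/RMM): round to the nearest representable value

    If the discarded bits are worth more than half of the least significant bit (LSB), add 1 to the LSB; if they are worth less than half, simply truncate. When they are exactly half, RNE decides by the current LSB: if the LSB is 0, truncate; if the LSB is 1, add 1 to the LSB. RMM instead always rounds a tie away from zero, i.e. adds 1 to the LSB.

  • Round toward 0 (RTZ): keep the required number of bits and discard everything after them

    Simply truncate.

  • Round toward $-\infty$ (RDN): always take the nearest representable value to the left

    For positive numbers, truncate.

    For negative numbers, truncate if the discarded bits are all 0; otherwise add 1 to the LSB.

  • Round toward $+\infty$ (RUP): always take the nearest representable value to the right

    For positive numbers, truncate if the discarded bits are all 0; otherwise add 1 to the LSB.

    For negative numbers, truncate.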
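
The rounding rules can be sketched on an unsigned magnitude whose low `drop` bits are to be removed. The function name and the string mode labels are hypothetical. Comparing the discarded bits with half an LSB is equivalent to inspecting the guard bit together with the OR of the round and sticky bits.

```python
def round_magnitude(value, drop, mode, negative=False):
    """Drop the low `drop` bits of a sign-magnitude value using the given mode."""
    kept = value >> drop
    remainder = value & ((1 << drop) - 1)    # the discarded bits
    if drop == 0 or remainder == 0:
        return kept                          # exact: nothing to round
    half = 1 << (drop - 1)                   # weight of the guard bit
    if mode == "RNE":
        increment = remainder > half or (remainder == half and kept & 1 == 1)
    elif mode == "RMM":
        increment = remainder >= half
    elif mode == "RTZ":
        increment = False
    elif mode == "RDN":
        increment = negative                 # magnitude grows only for negatives
    elif mode == "RUP":
        increment = not negative             # magnitude grows only for positives
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    return kept + 1 if increment else kept
```

A result of `kept + 1` may carry into a new most significant bit, in which case right normalization is needed again.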

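5. Overflow Detection

Exponent overflow: the result exponent exceeds the maximum allowed value (127)

Exponent underflow: the result exponent falls below the minimum allowed value (-126)

Overflow detection circuit:

  • Right normalization and significand rounding
  • Left normalization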

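Dual-Path Floating-Point Addition

The analysis above leads naturally to the dual-path algorithm proposed by M. J. Flynn et al. (Michael J. Flynn, 1990).
Its basic idea is to split the datapath according to the size of the exponent difference between the two operands. When the exponent difference is small, a subtraction may leave many leading zeros in the result, which requires a left shift by a variable and possibly large number of bits during normalization. When the exponent difference is large, a large shift is needed during exponent alignment instead. This dual-datapath partitioning is widely used in floating-point adder designs.

1011110111100111_011_00011100_111_00

2627639c 10011000100111_011_00011100_111_00

The instruction at 8000_0a6a is an flw. The fetched IR is 2627_639c; its two least-significant bits are 00, so it is a 16-bit compressed instruction: 011_00011100_111_00

The core cannot recognize it, so execution jumps to the trap entry at 8000_00c4.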
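
The length check described above can be reproduced with a short sketch. It assumes the standard RISC-V rule that an instruction whose two lowest bits are not 11 is a 16-bit compressed instruction (longer-than-32-bit encodings are ignored here), and the helper names are hypothetical.

```python
def instruction_length(ir):
    """Return 16 for a compressed instruction, 32 otherwise (longer encodings ignored)."""
    return 32 if (ir & 0b11) == 0b11 else 16


def decode_rvc_header(ir):
    """Extract the quadrant (bits 1:0) and funct3 (bits 15:13) of the low halfword."""
    half = ir & 0xFFFF
    return {"quadrant": half & 0b11, "funct3": (half >> 13) & 0b111}


ir = 0x2627639C
print(instruction_length(ir), decode_rvc_header(ir))
```

For the fetched word the low halfword is 0x639C, with quadrant 00 and funct3 011. In the RV32 compressed encoding this combination is C.FLW, which fits the note that the instruction is an flw.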

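Design

  • The ifu pre-decodes rsidx and rsen and reads the operands from the integer and floating-point register files.
  • decode produces rsfloat, which selects between the operands output by the two register files.
  • decode also produces the signals required by the fpu.
  • disp dispatches rs and the fpu-related decoded signals to the alu.
  • The alu forwards rs and the fpu-related signals to the fpu.
  • The fpu computes rd and fflags.
    • rd is written back through wbck; the rdint signal from decode selects whether it goes to the integer or the floating-point register file.
    • fflags is written directly to the csr register.
  • The valid & ready related signals were modified.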

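Simulation

VCS

  • Compile to obtain the machine code of the instructions
  • Load the instruction data into itcm and execute the instructions

$readmemh({"/home/chms/Workbench/riscv/E203FV/riscv-tools/riscv-tests/isa/generated/rv32uf-p-fadd.verilog"}, itcm_mem);

  • The waveforms are correct, and all floating-point instruction tests PASS

Vivado

  • Vivado's SystemVerilog support is incomplete and some statements are not recognized, so the simulation uses an empty fpu and exercises integer instructions only.

  • The integer add instruction works correctly (15 + 11 = 26); result: PASS

  • Post-implementation simulation and the remaining runs all PASS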

Synthesis

Synthesis timing of the unmodified rtl

Synthesis timing with the empty fpu embedded

Synthesis timing with the fpu embedded

The hold-time slack even improved.

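Issues

Unmodified rtl

Running the HelloWorld program reports the error TDO stuck high, but the serial port still outputs the result correctly.

rtl with the fpu embedded

Running the HelloWorld program gives the same error message, and the serial port outputs nothing.

rtl with the empty fpu embedded

Running the HelloWorld program gives the same error message, and the serial port outputs nothing.

  • Should addition, multiplication, and multiply-add share one multiplier and one adder, or should each of the three units have its own? Should they be pipelined?

  • Should operations on subnormal numbers be implemented?

  • Should the adder use dual-path, leading-zero anticipation, and similar techniques, or a single path?

  • How should shifting be implemented: a barrel shifter, a full-custom design, or simply >>?

  • Do all five rounding modes need to be implemented?

  • IEEE 754 significands are in sign-magnitude form; is Booth encoding only applicable to two's complement?

  • Are there any books on arithmetic hardware?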

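1 Microarchitecture Optimization

Task: optimize the microarchitecture implementation of the Hummingbird E203 RISC-V core and measure system performance with benchmarks (Dhrystone, CoreMark, Whetstone). The scores must improve over the original Hummingbird SoC under the same software environment, and the specific optimizations applied to the core microarchitecture must be described in detail in the report.

1.1 Branch Prediction

The Hummingbird E200 uses the simplest form of static prediction.

  • Dynamic prediction

1.2 Pipeline

The Hummingbird E203 uses a two-stage variable-length pipeline.

1.3 Instruction Fetch

1.4 FIFO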

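2 Operator Extension

Task: extend the Hummingbird E203 RISC-V core with computational operators (e.g. encryption/decryption algorithms, floating-point operations, vector operations). The extension may be added through the NICE coprocessor interface, or by directly implementing a RISC-V instruction subset (e.g. the P, F/D, V, B, or K extensions).

  • For extensions implemented through the NICE coprocessor interface, the corresponding software drivers must be added to the Hummingbird software development platform HBird SDK;

  • For RISC-V extension instructions (P, V), the implementation can be used together with the open-source NMSIS library (DSP, NN), and the NMSIS software implementation may also be optimized;

  • For RISC-V extension instructions (K), the function library can be implemented and used with reference to the relevant APIs in the open-source mbedTLS library.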

2.1 RVK

Draft repository

  • Algorithms:
    • AES
    • SHA2
    • SM3/4
  • Entropy Source, Bit manipulation…

References

2.2 RVV/P

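Supports data-level parallelism, with an additional vector register file and ALUs

  • RVV: vector architecture

Draft repository

Data type and length are configured through the vector registers (dynamic typing)

Few instructions, small code size, high flexibility

  • RVP: SIMD

Draft repository

The opcode provides the data width and operation type (fixed)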

NMSIS

API for DSP, NN…

2.3 Face Recognition + AES

SoC design combining face recognition and AES encryption

3 NICE