RVfplib: A Fast and Compact Open-Source Floating-Point Emulation Library for Tiny RISC-V Processors

Small, low-cost IoT devices rely on floating-point (FP) software emulation on 32-bit integer cores when the cost of a full-fledged FPU is not affordable. Thus, the performance and code size of the FP emulation library are decisive for meeting energy and memory-size constraints. We propose RVfplib, the first ISA-optimized open-source library for single and double-precision IEEE 754 FP emulation on RV32IM[C] cores. RVfplib is 59% smaller and 2× faster than the GCC emulation library, on average. On benchmark programs, the code size reduction is 39%, and the performance boost is 1.5×. RVfplib is 5.3% smaller than the leading closed-source RISC-V commercial library.


Introduction
Low-cost Internet of Things (IoT) devices are often subject to tight constraints on their silicon area and memory, which are precious resources in the embedded systems domain and impact cost and energy consumption [8]. At the same time, processing FP workloads is a common requirement for many applications. FP support enables programmers to satisfy the requirements on dynamic range and precision. In addition, deriving the fixed-point variant of an algorithm proven to be safe with floating-point numbers is often time-consuming and, in some cases, very challenging. However, small cores cannot always afford hardware Floating Point Units (FPUs) and rely on software emulation of FP instructions. Consequently, the code to be stored in memory is inflated, inducing performance overhead and increased total energy consumption due to higher execution times and added memory accesses. The code size cost is particularly relevant since FP emulation support can dominate the total code size of small programs, reaching up to 8 kB just for the single and double-precision basic operations. In this scenario, using small and fast FP emulation libraries is necessary to be competitive in the market.
The RISC-V Instruction Set Architecture (ISA) is gaining industrial traction in IoT applications where cost is a major concern. The main challenge for RISC-V low-cost microcontroller units (MCUs) is to reduce code size [3], as current experimental evidence shows that the Arm ISA (ARMv7-M), with its mature compilers and highly size-optimized libraries, generates smaller code on average [12,15]. The code size issue mainly affects applications that require FP arithmetic. In this case, long FP software emulation functions add a remarkable code size overhead, even if only a few FP computations are needed.
In this work, we present the following contributions:
1. RVfplib, the first open-source IEEE 754 FP library for RISC-V, manually optimized for low code size and high performance for both single and double-precision FP. RVfplib is compatible with the RV32IM[C] ISA and implements addition, subtraction, multiplication, division, as well as comparisons and conversions. Double-precision division is optional in RVfplib; it targets low code size and is compatible with cores without an integer divider.
2. RVfplib_nd, the reduced version of RVfplib that considers subnormal inputs/outputs as correctly signed zeroes. RVfplib_nd is compatible with the RV32EM[C] ISA and has smaller code size than RVfplib, making it the perfect candidate for tightly memory-constrained devices.
3. A comparison of the code size and performance of all RVfplib functions with their counterparts provided by libgcc. Moreover, we perform a code size comparison between the functions in RVfplib_nd and the ones available within SEGGER emFloat, the current state-of-the-art closed-source competitor. We also compare the code size of RVfplib with the Arm-optimized libgcc code.
4. An analysis of the real code size and performance impact that RVfplib has on full programs.
The rest of the paper is organized as follows: in Section 3, we describe the structure of RVfplib and the main ideas that led to its development, as well as the techniques used to optimize it and a code comparison with libgcc. In Section 4, we present the experiments used to evaluate RVfplib figures of merit, and we show the corresponding results in Section 5. We close our work with insights about further improvements to RVfplib and the conclusion of the analysis in Section 6 and Section 7.

Related work
Researchers have proposed different solutions to provide FP capabilities to a core when the system area is strictly constrained. When a full-fledged FPU leads to an excessive area increase, designers can integrate a slower but tiny FPU, crafted for tightly constrained IoT cores [2]. Another possibility is to implement hardware/software approaches, in which hardware optimizations in the integer datapath speed up critical operations used in the FP emulation libraries [13]. Nonetheless, both solutions can be adopted only if the system tolerates the related area overhead, and they do not apply to systems that already exist.
Integer-only cores that cannot afford an area increase can execute FP programs only through FP emulation libraries, usually provided by compiler vendors along with their compilation toolchain. For example, the Arm Keil compiler comes with the IEEE 754-1985 compliant fplib [1], and GCC with FP support within libgcc, its low-level runtime library [10]. Since the optimization of these libraries is essential for producing fast code with a low memory footprint, FP emulation libraries can also be manually crafted at the assembly-level to ensure the best code size and performance possible. libgcc provides optimized code for well established ISAs like Arm but lacks customized support for relatively new ISAs like RISC-V, which should rely on compiling the generic high-level FP emulation C functions. The novelty of the RISC-V solution results in sub-optimal code size and performance that makes it less attractive with respect to the Arm-based alternatives.
In addition to what is available in compiler ecosystems, designers have implemented FP libraries optimized for specific processors [5,14], as well as libraries that aim at maximum flexibility and compliance with the IEEE 754 standard, like SoftFloat [19]. However, these solutions are not RISC-V specific.
To the best of our knowledge, the only available assembly-optimized RISC-V FP library is emFloat, designed by SEGGER [17]. However, this library is closed-source and does not support subnormal values, flushing them to correctly signed zeroes instead.

RVfplib design
RVfplib is the first open-source optimized FP emulation library for RISC-V processors, for both single and double-precision FP. Its main goals are low code size and increased performance. These goals also imply lower energy consumption, thanks to the reduced memory bandwidth and execution time.
RVfplib is wholly written in RV32IM assembly. Thanks to the modularity of the RISC-V C extension, it is also compatible with RV32IMC ISA since the compiler can compress all the compressible instructions on request.
Functions in RVfplib adhere to the interface of the corresponding libgcc functions [10] and share their names to ensure compatibility with GCC and fast porting to real programs. Because of this aliasing, GCC automatically links the RVfplib functions, if implemented, instead of the ones from libgcc, without additional modifications to the program. Therefore, there is no need to explicitly call the RVfplib functions.
RVfplib functions have been obtained with ISA-specific assembly-level optimizations starting from the libgcc FP functions, with an approach similar to the one used for Arm [9]. Compliance with the IEEE 754 standard rules for FP encoding and computation presents the same deviations that hold for the libgcc FP support compiled with the default options, namely:
- Exception flags are not supported, and exceptional events only provide their pre-defined output (e.g., an invalid division such as 0/0 results in a NaN).
- All the produced NaNs are quiet, in the form of 0x7FC00000 for single-precision and 0x7FF8000000000000 for double-precision.
- Only the default round to nearest, ties to even rounding mode is supported for the majority of the operations (as in the default libgcc implementation, some of the conversion functions round toward zero).
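As a small illustration of the NaN encoding mentioned above, the following C sketch checks that the two quoted bit patterns are quiet NaNs. It assumes the host uses IEEE 754 binary32/binary64 for float/double and the convention, used by RISC-V, that a set most significant mantissa bit marks a quiet NaN. The helper names (is_quiet_nan_f32, is_quiet_nan_f64, f32_from_bits) are hypothetical and are not part of RVfplib or libgcc.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "binary32 float expected");
static_assert(sizeof(double) == sizeof(uint64_t), "binary64 double expected");

/* Canonical quiet NaN patterns quoted in the text. */
#define QNAN_F32 UINT32_C(0x7FC00000)
#define QNAN_F64 UINT64_C(0x7FF8000000000000)

/* Quiet NaN test: exponent field all ones and the most significant
   mantissa bit (the "quiet" bit) set. The sign bit is ignored. */
static int is_quiet_nan_f32(uint32_t bits) {
    return (bits & UINT32_C(0x7FC00000)) == UINT32_C(0x7FC00000);
}

static int is_quiet_nan_f64(uint64_t bits) {
    return (bits & UINT64_C(0x7FF8000000000000)) == UINT64_C(0x7FF8000000000000);
}

/* Reinterpret a 32-bit pattern as a float without aliasing problems. */
static float f32_from_bits(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Infinity (0x7F800000) and a signaling NaN such as 0x7FA00000 are rejected by the same mask, since the first has a zero mantissa and the second has the quiet bit cleared.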

Structure
RVfplib is a static library that comes in two different variants:
- RVfplib.a: the standard version, which targets low code size and increased performance.
- RVfplib_nd.a: the reduced version, which treats subnormal values as signed zeroes and shows an even smaller code size.
Each variant includes the functions listed in Table 1, in which both the SoftFloat and the libgcc names are reported. The two not-equal functions are aliased with the equal ones, as they have the same behavior. Both libraries can be compiled with particular code that can increase performance in the presence of specific input operands, with an additional code size overhead. For example, the multiplication can include code to deal with power-of-two operands, speeding up the processing of specific patterns while increasing the code size. Choosing between one implementation or the other depends on the system constraints and input workloads.
To further push toward reducing the memory footprint of the library, we also implemented part of the same FP support environment provided by SEGGER emFloat, treating subnormal values as correctly signed zeroes. Thanks to the reduced requirement for registers in its design, RVfplib_nd is compliant with the RV32EM ISA (i.e., with only 16 registers in the register file). The library currently comes with a double-precision division that does not use any integer hardware divider, which cannot be included in such small cores. For this reason, this function is optional and is only included when targeting the smallest code size possible. If performance is a more critical constraint, the standard double-precision division from libgcc is used instead.
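The treatment of subnormal values as correctly signed zeroes can be sketched in C on the raw binary32 encoding. This is only an illustration of the semantics, assuming IEEE 754 binary32; it is not the RVfplib_nd assembly, and the function name flush_subnormal_f32 is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(float) == sizeof(uint32_t), "binary32 float expected");

/* If the exponent field is zero, the value is either a zero or a
   subnormal: keep only the sign bit, producing +0 or -0.
   All other encodings are returned unchanged. */
static uint32_t flush_subnormal_f32(uint32_t bits) {
    uint32_t exponent = bits & UINT32_C(0x7F800000);
    if (exponent == 0) {
        return bits & UINT32_C(0x80000000);
    }
    return bits;
}
```

With this convention, the smallest positive subnormal 0x00000001 becomes +0, the negative subnormal 0x807FFFFF becomes -0, and the smallest normal value 0x00800000 is left untouched.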

Design choices
RVfplib benefits from some essential ideas that, together with the functional algorithmic choices, contribute to crafting optimized RISC-V functions that reduce code size and execution times.
1. Make the common case fast: FP algorithms take different decisions depending on the received inputs and create different control paths within the code. The latency of each function strongly depends on the inputs since different data patterns are processed differently. Optimizing the paths taken by the common input patterns (normal values) is a methodology for reducing the average latency.
2. Avoid memory references: RVfplib minimizes data memory bandwidth by reducing register spilling in function prologues/epilogues. This is accomplished by using only caller-saved registers. libgcc functions, on the contrary, do not limit register usage and have bloated prologues/epilogues.
3. No function calls: Whenever the code makes a call, it must also save the return address and, in general, any other already-used caller-saved registers. This process leads to additional memory operations, stack pointer adjustments, and additional jumps to/from the called function, with a consequent code size increase and degraded performance. RVfplib contains only leaf functions (i.e., functions that do not make other function calls). This property enables RVfplib to be independent of other external libraries, minimizing the extra code linked in the final binary. This is not the case for libgcc, as some of its functions depend upon __clzsi2() calls and the related table __clz_tab.
4. Maximize potential compression: The RISC-V C extension allows for compressing the most common RISC-V instructions when precise register patterns are used. For example, the majority of the instructions can be compressed when using registers from the RVC subset (i.e., registers in the set a0, a1, a2, a3, a4, a5, s0, s1). Since s0 and s1 are callee-saved registers, RVfplib does not use them.
5. Register re-use: Register allocation is optimized at function level to overcome heuristics of the compiler, whose analysis is mainly limited to the boundaries of basic blocks. As a basic rule, an operand is placed in the first free register; when it is no longer used, the register becomes free again.
6. Performance vs. code size tradeoff: Some RVfplib functions use loops to perform iterative processes. For example, the leading zeroes count after a numerical cancellation of an effective subtraction can be reduced to a shift-and-check loop, in which the result is left-shifted until the implicit one returns to its original position. This iterative process is convenient in terms of code size, but it is slow and inefficient. For this reason, it is also possible to use a bisection algorithm to count the leading zeroes, with better performance and increased code size. The choice can be taken at compile time. In general, when the taken-branch penalty is critical, unrolling the loop helps in maximizing the number of non-taken branches.
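The tradeoff in the last item can be sketched in C with two leading-zero counters for a non-zero 32-bit word. This is an illustrative sketch, not the RVfplib assembly: it normalizes to bit 31 of the word rather than to the implicit-one position of a mantissa, and the names clz32_loop and clz32_bisect are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-check loop: very compact, but it needs one iteration
   (and one taken branch) per leading zero, up to 31 of them.
   Precondition: x != 0, otherwise the loop would never terminate. */
static unsigned clz32_loop(uint32_t x) {
    assert(x != 0);
    unsigned n = 0;
    while ((x & UINT32_C(0x80000000)) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Bisection: a fixed sequence of five tests that halves the search
   window each time (16, 8, 4, 2, 1 bits). More code, bounded latency.
   Precondition: x != 0. */
static unsigned clz32_bisect(uint32_t x) {
    assert(x != 0);
    unsigned n = 0;
    if ((x & UINT32_C(0xFFFF0000)) == 0) { n += 16; x <<= 16; }
    if ((x & UINT32_C(0xFF000000)) == 0) { n += 8;  x <<= 8;  }
    if ((x & UINT32_C(0xF0000000)) == 0) { n += 4;  x <<= 4;  }
    if ((x & UINT32_C(0xC0000000)) == 0) { n += 2;  x <<= 2;  }
    if ((x & UINT32_C(0x80000000)) == 0) { n += 1; }
    return n;
}
```

Both functions return the same count for every non-zero input; they differ only in how code size is traded for the number of executed branches.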

Comparison with libgcc
FP functions from libgcc use a complex set of hierarchical C macros to be as flexible and generic as possible. When compiling the library, it is possible to set specific high-level parameters to control how the library will treat exceptions, subnormals, rounding, etc. With the default settings, no exception is raised or handled, subnormal values are not flushed to zero, and the rounding mode is round to nearest, ties to even (RNE). Even with these minimalistic options, the generated code is sub-optimal in terms of size and performance. In List. 1.1, we report the assembly code of __eqsf2() compiled with GCC 10.2.0 and optimized for size (-Os), together with comments and labels that we added to help the reader understand the code. This function, one of the smallest of the library, returns 1 if the inputs are not equal, and 0 if they are equal. The algorithm is straightforward:
1. If at least one input is NaN, return 1.
The libgcc function unpacks both operands into their sign, exponent, and mantissa before starting the comparison. In __eqsf2(), this operation is unnecessary and is probably performed to adopt a common coding standard for the library design. Moreover, separately comparing sign, exponent, and mantissa improves the code readability but discards possible optimizations. We aimed to reach the same optimization level by implementing the algorithm of List. 1.2 in C, and we managed to halve the code size of the libgcc function from 84 B to 42 B, showing the importance of choosing an optimized algorithm. However, the generated code is still 16% larger than the one generated from our assembly.
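To make the idea of comparing packed operands concrete, here is a minimal C sketch of an equality check that works directly on the binary32 bit patterns, without unpacking sign, exponent, and mantissa. It is not the authors' listing: only the NaN step is taken from the text, while the signed-zero and bitwise-compare steps are assumptions that follow the standard IEEE 754 equality rules. The names f32_ne_bits and f32_ne are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "binary32 float expected");

/* Returns 1 if the operands are not equal, 0 if they are equal. */
static int f32_ne_bits(uint32_t a, uint32_t b) {
    const uint32_t abs_mask = UINT32_C(0x7FFFFFFF);
    const uint32_t inf_bits = UINT32_C(0x7F800000);

    /* Step from the text: a NaN has a magnitude above the infinity
       encoding; any NaN operand makes the result "not equal". */
    if ((a & abs_mask) > inf_bits || (b & abs_mask) > inf_bits) {
        return 1;
    }
    /* Assumed step: +0 and -0 compare equal despite different bits. */
    if (((a | b) & abs_mask) == 0) {
        return 0;
    }
    /* Assumed step: every other pair is equal only if bitwise equal. */
    return a != b;
}

/* Convenience wrapper taking float operands. */
static int f32_ne(float x, float y) {
    uint32_t a, b;
    memcpy(&a, &x, sizeof a);
    memcpy(&b, &y, sizeof b);
    return f32_ne_bits(a, b);
}
```

The whole check needs only a handful of integer comparisons on the two input words, which is the kind of saving that unpacking into separate fields gives up.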
Forcing the compiler to reuse precise registers and take branches in a deterministic way is more natural in assembly than in C; during the compilation of our C function, the compiler creates unexpected intermediate operations and register moves, with negative effects on both code size and performance.
The same is true for the more complex functions of the library. Functions from libgcc are safe, generic, flexible, and parametric, but this comes at the expense of possible critical optimizations in key functions, where more precise control over the registers and the branch choices would be preferred. Assembly language helps consider a register as a container for a value, without a precise label and meaning as in C; therefore, a more opportunistic usage of the registers comes more naturally, without the need to force the compiler to behave in a precise way.

Testing
To test RVfplib, we relied on TestFloat [20], which provides an extensive IEEE 754 testing suite for generating test-cases and checking the correctness of custom FP implementations. Internally, TestFloat uses the fully IEEE 754 compliant SoftFloat library [19] as a golden reference. We generated the inputs for each function with the TestFloat engine and compared the function outputs with both SoftFloat and libgcc golden models. Since not all functions in RVfplib have a SoftFloat implementation, we used libgcc as a golden model when it was needed (e.g., for the "greater [or equal] to" functions).

Experimental setup
To analyze the impact of RVfplib, we evaluated its code size and performance metrics in both a synthetic environment and using real programs. In the first set of experiments, we extracted the code size of each function; in the second one, we evaluated the behavior of RVfplib on real benchmarks.

Benchmarks
Since we evaluate an FP library useful for area-constrained embedded devices, we selected all the applications of the Embench benchmark suite [4] that use FP numbers (cubic, minver, nbody, st, ud, wikisort). In addition, we selected three popular algorithms that can be run on small systems at the edge, in both single and double precision: a convolution (conv), a fast Fourier transform (fft), and a discrete wavelet transform (dwt).

Code Size
RVfplib implements most of the FP functions provided by libgcc and all the implicit arithmetic functions available in emFloat. Therefore, we evaluated the code size of the functions of our library and compared them against the two competitors. The code size of the emFloat functions is publicly available for the RV32IMC ISA [17]; thus, we compiled both RVfplib and libgcc functions for the same target using GCC 10.3, with libgcc compiled with the -Os flag enabled, its default setting. The functions were linked to a fixed C program, and their code size was extracted from its disassembly dump. To create realistic conditions for embedded devices and avoid intricate dependencies and code size bloating, we always linked our programs against libc_nano and libm_nano. For a fair analysis, we compared RVfplib with libgcc, since both are compiled for minimum code size and support subnormal values, and RVfplib_nd with emFloat, since both flush subnormal values to zero and target minimum code size as well.
Since libgcc is freely available, we extended our comparison by linking our real benchmarks against RVfplib and RVfplib_nd first, and then libgcc. To measure the code size impact that the libraries have on the read-only memory footprint, we added the size of the .text and the .rodata sections. Since some programs use the FP division, we also measured their code size when linked against RVfplib with fast divisions (the double-precision one belongs to libgcc).
To complete the code size analysis, we measured the code size of the Arm-optimized libgcc FP library and compared it with the code size of both the generic RISC-V libgcc support and RVfplib.

Performance
On the performance side, we performed a full profiling of RVfplib and libgcc, covering both the average latencies of the individual functions and the execution time of the benchmarks. In the following, when referring to a function, the term latency indicates the number of cycles required to execute it.
To evaluate the function latencies, we simulated a synthetic C program on the CV32E40P processor [6] with single-cycle latency memories using Mentor QuestaSim, repeating the experiment for each function of the compared libraries. The C program is composed of a loop that makes an explicit call to the function under test during each iteration and measures the latency of each function execution, including the jump/return to/from function cycles, and then averages the total cycle count on the number of iterations. Each function is fed with 10000 randomly generated values within (0,1), and the overhead of the load/store operations before and after the call is not considered. Using 1-cycle fixed-latency memories is a best-case scenario for libgcc performance, as libgcc accesses the stack inside its FP functions while RVfplib does not, as we avoided in-function memory requests. Additional memory latency/miss penalties negatively affect only the functions from libgcc.
We also compare our results to the average latencies reported by SEGGER emFloat [17]. It is unclear, however, whether this reported performance includes latency overheads from function calls and function returns. These overheads, as well as processor-specific branch and jump penalties, can strongly affect performance, especially for small functions. SEGGER extracted performance metrics using a GigaDevice GD32VF103 [18], which is based on a variable 2-stage pipeline RISC-V core [7]. It is likely that the jump/branch penalties of CV32E40P (from 2 to 4 cycles) [11] are higher. Moreover, SEGGER only reports latency results of their "performance-optimized" emFloat library, which is different from the one used for the code size results. For this reason, we used our fast single-precision division and the double-precision division from libgcc to perform this comparison.
To provide insight into how RVfplib affects the execution time, we simulated our benchmarks with SPIKE, a RISC-V simulator for a simple processor that executes one instruction per cycle, and reported the different instruction counts when linking with libgcc, RVfplib, and RVfplib_nd. Since some benchmarks use the double-precision division, we also reported the execution times of the programs linked with RVfplib with fast divisions (the 64-bit division is taken from libgcc).

Code Size
We show the code size of the single functions of RVfplib, libgcc, and emFloat in Table 1. Comparing the total code size of the libraries, we achieve a net gain of ≈60% by replacing libgcc FP functions with the ones in RVfplib. In absolute terms, the memory savings reach 7.5 kB, which is a significant code size reduction, especially for small programs. The small embedded systems we target are area/memory size constrained and do not have hardware FPUs. Most commonly, they require performing computations on single-precision data. As such, our high code size reduction for the most frequent single-precision FP operations (i.e., addition, subtraction, multiplication), which is around 67% on average, is very significant. libgcc subtraction is automatically re-linked as a function different from the addition, even if their code is shared except for one initial sign change of the second operand. RVfplib subtraction flips the sign and then executes an addition, without any other jump that would cause extra latency.
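The "flip the sign, then add" structure described for the subtraction can be sketched in C as follows. This is only an illustration under assumptions: the host float addition stands in for the emulated addition routine, the function name f32_sub_via_add is hypothetical, and in assembly the sign flip would simply fall through into the addition code instead of calling it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "binary32 float expected");

/* a - b computed as a + (-b): toggle the sign bit of the second
   operand on its raw encoding, then reuse the addition path. */
static float f32_sub_via_add(float a, float b) {
    uint32_t bits;
    memcpy(&bits, &b, sizeof bits);
    bits ^= UINT32_C(0x80000000);   /* flip the sign of b */
    memcpy(&b, &bits, sizeof b);
    return a + b;                   /* stand-in for the emulated addition */
}
```

For example, f32_sub_via_add(5.5f, 2.25f) yields 3.25f, the same value as 5.5f - 2.25f.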
Passing from RVfplib to RVfplib_nd, which flushes subnormal values to correctly signed zeroes, allows saving another 21.6% of the library code size. This significant gain comes for free when supporting subnormal numbers is not a requirement. RVfplib_nd is almost 5.3% smaller than emFloat, even if the double-precision division from emFloat is 30% smaller than the one from RVfplib_nd. The functions that gain the most from removing the subnormal support are multiplication and division, as the addition needs only small adjustments to process the denormalized inputs.
In Fig. 1, we summarize the code size results of our benchmarks. The code size savings with respect to libgcc span from 16% for cubic (with the libgcc double-precision division) to 60% (st with RVfplib_nd) and are relatively high also for large programs like fft64, which goes from almost 13.8 kB to 8.4 kB, a saving of more than 39%. The average code size reductions with respect to libgcc are 39.3%, 36%, and 46.5% for RVfplib, RVfplib with libgcc fast divisions, and RVfplib_nd, respectively.

Performance
Average latencies of each function of RVfplib and libgcc FP support are summarized in Table 2. RVfplib functions are always faster than the ones from libgcc, except for the two small divisions, which are 1.45× and 2× slower for single and double-precision, respectively. This fact underlines the importance of trying to re-implement these operations, changing the core algorithm; nevertheless, RVfplib divisions do not use the hardware integer divider, allowing for more flexibility, and the division operation is not common in simple algorithms used on small embedded systems. The fast 32-bit division in RVfplib is slightly faster than the one from libgcc, and the 64-bit one is the same. The single-precision comparisons and both the multiplications show important speedups (up to 2.57× for the multiplication), and the single-precision addition in RVfplib is faster than the one from libgcc by more than 1.5×. These data are promising, as these operations are ubiquitous in almost every FP algorithm. The conversions from integers to FP numbers are the functions that obtain the highest speed gain, which peaks for converting a 64-bit unsigned integer to a double-precision FP value with more than 4× lower latency. Replacing the whole set of libgcc functions with RVfplib gives an average speedup of 2×.
As already pointed out, making a comparison between RVfplib_nd and emFloat performance using the average latencies reported by SEGGER is not straightforward. We could not reproduce the experiment in the same conditions since they used a device that is likely to show a lower cycle count if compared to the CV32E40P core. Moreover, it is not specified whether the latency of the jumps to/from functions was taken into account. This is especially relevant for the smaller functions, which can be strongly biased by the jump to/from function latency overhead. However, if we focus on the bigger functions, the double-precision addition in emFloat (the subtraction shares the code with the addition) and both divisions are faster than the ones from RVfplib by factors around 2.7× and 1.9×, for single and double-precision, respectively.
When we measure the instruction count of the real benchmarks linked against RVfplib and libgcc, we obtain the data shown in Fig. 2. We chose these benchmarks to have a good mix of realistic examples, and we found for RVfplib and RVfplib_nd an average speedup of 1.5×, even if the benchmarks that use the double-precision division in RVfplib are actually slower than the ones linked against libgcc. In particular, wikisort uses the square root operation, which relies on the double-precision division, and st uses this division as well. In some benchmarks (e.g., ud), RVfplib_nd performance decreases because of its 64-bit addition, which does not have a fast path for equal operands. All the other programs show high speed gains.

Comparison with Arm
In Fig. 3, we show the code size comparison between the generic RISC-V libgcc FP support, RVfplib, and the Arm-optimized libgcc FP support. RVfplib brings the existing FP library code size gap between RISC-V and Arm from 8376 B to 840 B (10× less), reducing the Arm to RISC-V code size inflation from 196.6% to 19.7%. Arm addresses many comparison-function calls to a generic compare, reducing the total number of implemented functions and the library code size. This choice can be implemented in future versions of RVfplib as well.
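The two ways of stating the gap above (absolute bytes and relative inflation) are consistent with each other, as the following worked arithmetic shows. The symbols S_Arm, S_libgcc, and S_RVfplib are notation introduced here for the three library sizes, and the Arm library size of about 4260 B is inferred from the reported figures rather than quoted from the measurements.

```latex
\begin{align*}
S_{\text{libgcc}} - S_{\text{Arm}} &= 8376\,\text{B} = 1.966\, S_{\text{Arm}}
  &&\Rightarrow\; S_{\text{Arm}} \approx \frac{8376}{1.966} \approx 4260\,\text{B},\\
S_{\text{RVfplib}} - S_{\text{Arm}} &= 840\,\text{B}
  &&\Rightarrow\; \frac{840}{4260} \approx 19.7\%,\\
\frac{8376}{840} &\approx 9.97 \approx 10.
\end{align*}
```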

Further improvements
RVfplib will be released as an open-source project under GPL license, and everyone will be allowed to contribute to its enhancement, improving and extending it. SEGGER results unequivocally show that the 64-bit addition in RVfplib can be further improved to decrease its average latency. Both divisions can be improved in code size and performance, possibly with different algorithmic choices and by exploiting the hardware divider. The optimal solution would be to offer both a version that exploits the divider and one independent from it. On the other hand, the library misses important functions, such as the square root and the trigonometric ones, which would make it more versatile and further save precious memory space and cycle counts. As already evaluated in [13], hardware support for Count Leading Zeroes (CLZ) helps in speeding up the FP functions (e.g., addition, truncation) and can also decrease their code size, replacing a block of instructions with only one. Such support is already present in the PULP extension and in the draft of the RISC-V B extension [16]. Another improvement to further save code size would be merging into common functions the repeated code for dealing with subnormals/special cases, especially when such input patterns are uncommon, and merging the various comparisons into one.

Conclusion
In this paper, we presented RVfplib, the first open-source assembly-optimized FP emulation library for RISC-V small integer-only processors. The library implements the primary and most common single and double-precision FP operations, namely addition, subtraction, multiplication, division, comparisons, and conversions, and adopts the same interface as libgcc to be easily linked by GCC against real programs without any source-code modification. The library follows IEEE 754 standard guidelines for encodings and computations, with only minor and easily modifiable differences. RVfplib is smaller than the libgcc FP support by almost 60% and, on average, 2× faster. We showed that, on real benchmarks, RVfplib reduces the code size by 39% and speeds up the execution by 1.5× on average, even when considering benchmarks that heavily use the less optimized functions in RVfplib. If compared to the Arm-optimized libgcc library, RVfplib reduces the Arm to RISC-V code size inflation from 196.6% (vs. RISC-V general libgcc FP support) to 19.7%. We also presented RVfplib_nd, which treats subnormal values as correctly signed zeroes, and showed that its code size is 5.3% smaller than the SEGGER emFloat FP library, the only available RISC-V optimized FP emulation library, which is closed-source and treats subnormal values in the same way.