FPGA Acceleration for Database Read Requests
The results are in, and they are consistent and compelling. Tests under a variety of conditions and in a wide range of environments have shown that FPGA-assisted data engines can process database READs dramatically more efficiently than standard CPU-based database nodes. IT organizations are reaping the benefits by deploying database architectures with FPGA-based front-ends. The front-ends handle the heavy lifting of READ servicing while lightly loaded back-end database nodes handle everything else. The resulting clusters deliver up to 20x higher IO, are smaller, more efficient, and easier to manage, and perform well even at peak loads.
Before one commits to a new piece of hardware, however, it is fair to ask some key questions. Why do FPGA-assisted engines exhibit so much better performance? Is this a transient situation, the new normal, or a breakthrough that innovators can leverage now for immediate gain?
The answer to these questions comes from comparing FPGAs and standard CPUs in two broad areas – raw power and task optimization.
FPGAs vs. CPUs
There are three main points to consider when evaluating FPGAs against CPUs for data or database acceleration:
Raw CPU power is limited
CPU task "tax" is where CPUs spend processing cycles
Networking continues to evolve
Raw CPU Power is Limited
It used to be that if you didn’t like the performance of a CPU-based application, you just waited about 18 months for a new CPU with twice the horsepower, to which you could transparently port your applications. But single-threaded CPU performance improvement has slowed significantly since approximately 2005, largely because of thermal limits (among other effects) as clock frequencies increased. This has driven the move towards multi-core architectures.
Multi-core architectures are well suited to increasing throughput, but once an individual core becomes even moderately busy, latency can suffer badly. High-performance database applications usually require both high capacity and low latency.
Keeping every core at a utilization low enough to maintain fast response times can require a massive number of cores.
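The link between utilization and latency can be illustrated with a textbook queueing model. The sketch below assumes each core behaves like an M/M/1 queue (random arrivals, one server), where mean response time is service time divided by (1 - utilization). The model choice, the function name, and the 10-microsecond service time are illustrative assumptions, not measurements from any real database node.

```python
def mm1_mean_response_time(service_time_s: float, utilization: float) -> float:
    """Mean time in system for an M/M/1 queue: T = S / (1 - rho).

    This is a simplifying model (assumption); real cores and request
    mixes will not follow it exactly.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)


# Assumed value for illustration only: 10 microseconds of CPU work per READ.
SERVICE_TIME_S = 10e-6

if __name__ == "__main__":
    for rho in (0.1, 0.5, 0.8, 0.9, 0.95):
        t_us = mm1_mean_response_time(SERVICE_TIME_S, rho) * 1e6
        print(f"utilization {rho:.0%}: mean response time {t_us:.1f} us")
```

Under this model, mean response time doubles at 50% utilization and grows tenfold at 90%, which is why latency-sensitive deployments tend to keep cores lightly loaded and therefore need many of them.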
CPU Task “Tax” – or Where CPUs Spend Their Processing Cycles
Over the years there have been many architectures and significant infrastructure built around CPUs geared towards maximizing instructions executed per clock cycle (IPC). The techniques used to achieve high IPC (caching of code and data, deep instruction pipelines, register renaming, and branch prediction) all contribute to good performance for general application execution. But READ requests arrive in short-duration TCP connections from a wide variety of users. TCP connections composed of sequences of variable-length packets must be continually created (and then torn down), READ requests must be parsed, data must be looked up and decompressed, and READ responses must be formulated and sent out. All of these are short-duration functions that cannot be sequentially scheduled; they must be multiplexed with similar requests coming from other users. That adds up to a lot of context switching and a lot of state management.
Last but not least, the core CPU architecture is designed to handle 32- or 64-bit data types, as found in computational workloads. However, I/O-specific operations such as network I/O and storage I/O work on data with non-standard bit widths (e.g. network packets) or are byte-oriented (e.g. compression). This results in even more inefficient use of processor cores.
Many of the techniques that CPUs use to achieve high IPC actually get in the way and cause massive performance degradation for data-centric workloads. CPU-based database nodes executing functions such as compression, encryption, TCP/IP processing, hash computation, and READ servicing perform badly in traditional compute-optimized environments, no matter how efficiently coded. As an example, it requires 10 or more dedicated CPU cores to service 1 million READ IOPS arriving over a 10Gbps link, a load that can be easily handled by a single FPGA-based data engine.
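To put the 10-core figure in perspective, the arithmetic below works out what it implies per request. The core count, READ rate, and link speed come from the example above. The 3 GHz clock is an assumption added for illustration, and the function and key names are hypothetical.

```python
CORES = 10                      # from the text: "10 or more dedicated CPU cores"
READS_PER_SECOND = 1_000_000    # from the text: 1 million READ IOPS
LINK_BITS_PER_SECOND = 10e9     # from the text: 10Gbps link
CLOCK_HZ = 3e9                  # ASSUMPTION: 3 GHz core clock, not from the text


def per_request_budget(cores, reads_per_s, clock_hz, link_bps):
    """Return the per-READ time, cycle, and link-capacity budget."""
    reads_per_core_per_s = reads_per_s / cores
    seconds_per_read = cores / reads_per_s          # time one core spends per READ
    cycles_per_read = clock_hz * cores / reads_per_s
    link_bytes_per_read = link_bps / 8 / reads_per_s
    return {
        "reads_per_core_per_s": reads_per_core_per_s,
        "seconds_per_read": seconds_per_read,
        "cycles_per_read": cycles_per_read,
        "link_bytes_per_read": link_bytes_per_read,
    }


if __name__ == "__main__":
    budget = per_request_budget(
        CORES, READS_PER_SECOND, CLOCK_HZ, LINK_BITS_PER_SECOND
    )
    for name, value in budget.items():
        print(f"{name}: {value:,.6g}")
```

Under these assumptions each core handles 100,000 READs per second, which is about 10 microseconds, or roughly 30,000 clock cycles at the assumed 3 GHz, per READ. The link itself allows an average of 1,250 bytes per READ at line rate. If the 10-core figure holds, most of those cycles go to connection handling, parsing, and context switching rather than to the lookup itself.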
Networking Continues to Evolve
Network-intensive loads are similar to database loads: a lot of multiplexed packets from different users going to different destinations. Networking technology was born running on standard CPU environments, but it evolved by adopting specialized processors and hardware assists in order to scale to the massive Internet capacity of today.
FPGAs for Database Acceleration
FPGA designs rely on spatial computing architectures utilizing micro-architectures that leverage custom memory and dedicated interconnection topologies.
The “P” in FPGA stands for Programmable. That means that the raw power of FPGAs can be optimally geared towards executing specific functions, as opposed to CPUs, which are generically designed to support all general compute workloads. FPGAs can be programmed to execute repetitive functions in hardware, such as TCP/IP processing, READ processing, data look-ups, compression, encryption, Flash storage I/O, and other functions that a database needs to do at scale and that a CPU is poorly suited to.
Many such modules can be programmed into a single array and operate simultaneously, which gives FPGAs much more parallelism than CPUs. That means that FPGAs can be operated at much lower clock speeds, which allows FPGAs to avoid the thermal limitations into which CPU technology has smashed headlong. Therefore, unlike CPUs, FPGAs can continue to benefit from transistor density increases due to semiconductor process enhancements.
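The trade-off between parallelism and clock speed comes down to simple arithmetic: for a fixed target throughput, the clock rate each unit needs falls in proportion to the number of units working in parallel. The sketch below uses a hypothetical helper and made-up illustrative numbers (1 million operations per second, 3,000 cycles per operation, 12 parallel modules). None of these are measurements of any real CPU or FPGA.

```python
def required_clock_hz(target_ops_per_s: float,
                      cycles_per_op: float,
                      parallel_units: int) -> float:
    """Clock rate each unit needs so that all units together hit the target.

    Simplifying assumption: work divides evenly across identical units,
    with no coordination overhead.
    """
    if parallel_units < 1:
        raise ValueError("parallel_units must be at least 1")
    return target_ops_per_s * cycles_per_op / parallel_units


if __name__ == "__main__":
    TARGET_OPS = 1_000_000   # illustrative target, operations per second
    CYCLES_PER_OP = 3_000    # illustrative cost per operation, in cycles

    for units in (1, 4, 12):
        clock_mhz = required_clock_hz(TARGET_OPS, CYCLES_PER_OP, units) / 1e6
        print(f"{units:>2} parallel unit(s): {clock_mhz:,.0f} MHz each")
```

With these illustrative numbers a single unit would need a 3 GHz clock, while twelve units working side by side need only 250 MHz each. That lower clock is what eases the thermal pressure described above.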
So it really isn’t all that surprising that FPGA implementations are far superior to CPU-based implementations for many database functions, or that the trend points to this being even more true in the future than it is today.
None of this means you should task your IT people with buying raw FPGAs, coding them up, and porting applications to them. Leveraging the raw power of FPGAs takes significant expertise and experience: determining which functions to execute in FPGAs, how to program the FPGA to maximize efficiency, and how to move data in and out of the FPGA as an integral part of an overall system. Look for pre-built solutions that are based on a deep understanding of the application workload and environment, and that provide a software interface that makes transparent and immediate porting of existing applications possible.
rENIAC is one such transparent solution built with the power of FPGAs that requires no application or software level changes.
We wrote a solution brief that includes a breakdown of CPU vs. FPGA processing cycle consumption and explains how rENIAC helps alleviate CPU consumption. You can download it (without filling out a form) below.