BLOCKBRIDGE SOFTWARE VERSION 6.0 RELEASE NOTES

We’re excited to announce the arrival of NVMe-over-TCP support for the Blockbridge data engine! This project has been a true labor of love for the entire team here at Blockbridge. It promises major gains in performance backed by widespread industry support.

NVMe Over Fabrics (TCP)

We have more than a decade of hard-won experience in iSCSI and SCSI. iSCSI is robust, ubiquitous, and well-supported. And, as we have demonstrated, it can also be really fast. However, many client implementations (i.e., initiators) do not exploit the full potential of the transport protocol and are weighed down by baggage in the client storage stack.

Almost a decade ago, NVMe was introduced into the Linux kernel. With it came the blueprints for an optimized datapath that bypasses the traditional block layer, dramatically improving performance. Years later, NVMe-over-fabrics extended the reach of the NVMe command set and I/O processing model over RDMA networks. In 2019, NVMe/TCP was born, promising NVMe performance without the complexity of RDMA.

We’ve been watching the NVMe-over-fabrics ecosystem eagerly, waiting for the initiator implementations to mature. It’s a far simpler protocol than iSCSI and carries less overhead in command processing. On the target side, NVMe allowed us to unlock significant performance gains in our storage stack by forcing us to rethink how we managed the flow of data inside our data engine.

One of our early customers for NVMe/TCP graciously donated time in their production Blockbridge environment, allowing us to really dial in the performance tuning. We measured consistently achievable average read latencies of 20.5μs for 512-byte reads from the client over a switched 25 Gigabit Ethernet network to the cache on our data engine and back. Perhaps better still, we saw 31.6μs inside guest VMs in Proxmox/QEMU on this same machine. We have a detailed writeup of the tuning and measurements here. However, what really stood out was the dramatic improvement in the IOPS capabilities of the clients and what’s now possible in our data engine.

Release 6.0 ships with NVMe/TCP support for VMware vSphere 7 update 3c and Proxmox 7.2 or newer. In addition to everything you might expect from an NVMe-over-fabrics stack, we support the following notable features:

  • COMPARE,
  • FUSED COMPARE AND WRITE (atomic test-and-set for VMware),
  • WRITE ZEROES,
  • header and data digests, and
  • native multipathing.

DATAPLANE

We designed and built our NVMe/TCP stack from scratch, allowing us to revisit a large chunk of the I/O handling code paths in the dataplane. Dramatic improvements in IOPS, latency, and throughput are benefits of architectural changes that will pay dividends for years to come.

Scalable Packet Processing

One of the major performance gains came from re-architecting the front-end packet processor to scale across high-core-count CPUs. As a result, a single Blockbridge 6.0 data engine can deliver nearly 2x the bandwidth of the 5.x engine, with lower average I/O latency.
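
As a rough mental model of the change (not our actual implementation), the sketch below hashes connections onto per-worker queues so that packet processing for a given connection always runs on the same worker and shares no state with other workers. The worker count, queue types, and names are illustrative.

    import os
    import queue
    import threading

    # Rough mental model: connections are hashed to per-core workers, so the
    # packet processing for a given connection never contends with others.
    NUM_WORKERS = os.cpu_count() or 4
    queues = [queue.Queue() for _ in range(NUM_WORKERS)]
    handled = []  # (worker index, connection id) pairs, for the demo only

    def worker(idx):
        while True:
            item = queues[idx].get()
            if item is None:          # shutdown sentinel
                return
            conn_id, payload = item
            # A real worker would parse the NVMe/TCP PDU and submit the I/O.
            handled.append((idx, conn_id))

    def submit_packet(conn_id, payload):
        # Same connection -> same worker, so per-connection state needs no locks.
        queues[conn_id % NUM_WORKERS].put((conn_id, payload))

    if __name__ == "__main__":
        threads = [threading.Thread(target=worker, args=(i,))
                   for i in range(NUM_WORKERS)]
        for t in threads:
            t.start()
        for conn in range(8):
            submit_packet(conn, b"\x00" * 4096)
        for q in queues:
            q.put(None)
        for t in threads:
            t.join()
        print(sorted(handled))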

NVMe Optimized Coalescing

We invested in a new high-performance write coalescing algorithm for NVMe. It’s built into the packet processing logic, inside the fabrics layer, where it has advance knowledge of the data available in the socket. This technique means we spend less time hanging on to write data unnecessarily. It even handles cases where Linux issues sequential writes in reverse order.
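
To illustrate the idea, here is a minimal Python sketch of offset-based write coalescing that merges adjacent requests into one larger write, including the case where a sequential stream arrives in reverse offset order. The data structures and merge limit are illustrative assumptions, not the data engine's actual logic.

    # Minimal sketch of offset-based write coalescing (illustrative only).
    # Requests are (offset, data) pairs; byte-adjacent requests are merged
    # into one larger write regardless of arrival order.

    def coalesce(requests, max_merge=128 * 1024):
        merged = []
        # Sorting by offset lets reverse-ordered sequential writes line up.
        for offset, data in sorted(requests, key=lambda r: r[0]):
            if merged:
                prev_off, prev_data = merged[-1]
                if (prev_off + len(prev_data) == offset
                        and len(prev_data) + len(data) <= max_merge):
                    merged[-1] = (prev_off, prev_data + data)
                    continue
            merged.append((offset, data))
        return merged

    if __name__ == "__main__":
        # A 16 KiB sequential stream issued in reverse order, 4 KiB at a time.
        stream = [(12288, b"d" * 4096), (8192, b"c" * 4096),
                  (4096, b"b" * 4096), (0, b"a" * 4096)]
        out = coalesce(stream)
        print(len(out), "write of", len(out[0][1]), "bytes")  # 1 write of 16384 bytes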

Small I/O Acceleration

Many typical workloads consist of small 4K or sub-4K I/Os. Every storage system has specific routines that are optimized for handling small I/O workloads. Our 6.0 data engine now has a dedicated end-to-end small I/O handling path, dramatically improving our top-end IOPS performance and latency.
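
The sketch below shows only the general shape of the idea: route requests at or below a size threshold onto a dedicated fast path. The threshold and handler names are assumptions, not our actual code paths.

    # Illustrative dispatch between a small-I/O fast path and the general path.
    # The 4 KiB threshold and the handler names are assumptions for the sketch.

    SMALL_IO_THRESHOLD = 4096  # bytes

    def handle_small_io(offset, data):
        # Fast path: fixed-size buffers, minimal per-request bookkeeping.
        return ("fast", offset, len(data))

    def handle_general_io(offset, data):
        # General path: handles large and vectored I/O.
        return ("general", offset, len(data))

    def dispatch(offset, data):
        if len(data) <= SMALL_IO_THRESHOLD:
            return handle_small_io(offset, data)
        return handle_general_io(offset, data)

    if __name__ == "__main__":
        print(dispatch(0, b"x" * 512))     # ('fast', 0, 512)
        print(dispatch(0, b"x" * 65536))   # ('general', 0, 65536)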

Consistency and Tail Latency

We took a close look at tail latency in this release, making several incremental improvements with:

  • new offloads for housekeeping functions,
  • dynamically adjustable statistics intervals for high-density deployments (i.e., 1000s of disks),
  • cache optimizations to core data structures, and
  • micro-optimizations for Zen3 and Zen4 in critical sections.

Non-blocking Intent Logs

We’ve improved the concurrency of writes and intent logging. In previous software versions, intent logging sat inside a critical section that could force ordering between writes to unrelated regions. Release 6.0 improves this: ordering is now enforced only for data writes to the same region, and only when an intent log is required.
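
Conceptually, the new behavior looks like the sketch below: a write waits only when its own region needs an intent-log update, and writes to unrelated regions proceed independently. The region size, lock striping, and names are illustrative assumptions, not the data engine's code.

    # Illustrative per-region ordering for intent logging (not the actual
    # data engine code). A write serializes only when its own region needs
    # an intent-log update; unrelated regions are unaffected.

    import threading

    REGION_SIZE = 4 * 1024 * 1024   # assumed 4 MiB regions (illustrative)
    LOCK_STRIPES = 64               # one lock per stripe of regions

    class IntentLog:
        def __init__(self):
            self._locks = [threading.Lock() for _ in range(LOCK_STRIPES)]
            self._dirty = set()     # regions already marked dirty

        def write(self, offset, data, backend_write):
            region = offset // REGION_SIZE
            if region not in self._dirty:
                with self._locks[region % LOCK_STRIPES]:
                    if region not in self._dirty:
                        self._log_intent(region)   # persist "region is dirty"
                        self._dirty.add(region)
            return backend_write(offset, data)

        def _log_intent(self, region):
            pass  # stand-in for a durable intent-log write

    if __name__ == "__main__":
        log = IntentLog()
        done = []
        log.write(0, b"a" * 4096, lambda off, buf: done.append(off))
        log.write(64 * 1024 * 1024, b"b" * 4096, lambda off, buf: done.append(off))
        print(done)   # writes to unrelated regions; neither waited on the other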

Optimized Intent Granularity

Traditionally, our system tried to keep the write-intent bitmaps clean to reduce the number of regions that must be re-synchronized after a failure. The downside of this approach is more write-intent logging, and with it, additional latency.

Two of our optimizations tilt the system behavior away from trying to keep the write-intent bitmap zeroed and towards reducing the number of write intents logged; a simplified sketch follows the list.

  • On an initial write to a clean region, we now over-dirty the surrounding regions of the backing volumes. This optimization assumes that a write to a region will be followed by writes to the surrounding regions.
  • We avoid clearing dirty bits for regions that have been recently re-written.
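
The sketch below is a simplified model of these two behaviors; the region span, hold-off time, and data structures are illustrative and not the data engine's actual bitmap code.

    # Simplified model of the two write-intent optimizations (illustrative):
    #   1. a first write to a clean region also dirties its neighbors, and
    #   2. dirty bits are not cleared for regions written recently.

    import time

    NEIGHBOR_SPAN = 1      # dirty +/- 1 region around the written region
    CLEAR_HOLDOFF = 30.0   # seconds a recently written region stays dirty

    class WriteIntentBitmap:
        def __init__(self, nregions):
            self.dirty = [False] * nregions
            self.last_write = [0.0] * nregions

        def on_write(self, region):
            self.last_write[region] = time.monotonic()
            if not self.dirty[region]:
                # Over-dirty: mark surrounding regions too, betting that
                # nearby writes will follow and can skip their own intents.
                lo = max(0, region - NEIGHBOR_SPAN)
                hi = min(len(self.dirty) - 1, region + NEIGHBOR_SPAN)
                for r in range(lo, hi + 1):
                    self.dirty[r] = True

        def maybe_clear(self, region):
            # Skip clearing if the region was re-written recently.
            if time.monotonic() - self.last_write[region] < CLEAR_HOLDOFF:
                return False
            self.dirty[region] = False
            return True

    if __name__ == "__main__":
        bm = WriteIntentBitmap(16)
        bm.on_write(5)
        print([i for i, d in enumerate(bm.dirty) if d])  # [4, 5, 6]
        print(bm.maybe_clear(5))                         # False: written just now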

WEB UI

The tenant side of the web UI sees some important usability improvements in 6.0:

Global Usage Summary

Clicking on the Global entry at the top of the source list now displays a grid of usage metrics similar to those available under infrastructure. These metrics summarize resource utilization for all of the tenant’s storage, including used space, current bandwidth, and IOPS. Don’t worry: the map is still there. It’s on the Locations tab.

Disk Summary

Each disk now has a condensed version of the usage metrics, on its general tab. Click on a disk to easily view its protocol, capacity, storage used, bandwidth, and IOPS.

Disk Identifiers

The SCSI NAA, UUID, and NGUID are now shown in the Disk Overview section.

Compressed Size Reporting

Tenants now see the effect of data compression on their space utilization.

Since the first 4.x release with compression, we have been reporting the uncompressed size to tenants. Our thinking was that data compression was a savings that should be visible only to administrators. Since then, we’ve seen an explosion in the use of Blockbridge as a backend for VMware and Proxmox. In these deployments the admins are the tenants, and it’s proven confusing to see the uncompressed size when looking at disks and services.

Starting with this release, the reported size of disks and virtual storage services is the actual size on disk of the data.

COMMAND-LINE TOOL

The CLI now offers complete NVMe/TCP support for bare-metal and orchestrated workflows.

NVMe Host Attach

You can now use host attach to connect disks with NVMe/TCP. Internally, the integration with nvme-cli was more straightforward than with iscsiadm. As a result, NVMe disks come online quickly. In addition, host attach supports the bulk of the connect-time parameters from nvme-cli, such as tunables for I/O queues, queue sizing, the keep-alive timeout, the reconnect delay, and more.
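
For a sense of what an NVMe/TCP attach involves under the hood, the sketch below assembles an nvme-cli connect command using a few of the tunables mentioned above. The address, NQN, and values are placeholders; this is not the host attach implementation itself.

    # Illustrative wrapper around `nvme connect` (nvme-cli) for NVMe/TCP.
    # The target address, NQN, and tunable values are placeholders.

    import subprocess

    def build_connect_cmd(traddr, nqn, trsvcid="4420", nr_io_queues=8,
                          queue_size=128, keep_alive_tmo=30, reconnect_delay=10):
        return [
            "nvme", "connect",
            "--transport", "tcp",
            "--traddr", traddr,
            "--trsvcid", str(trsvcid),
            "--nqn", nqn,
            "--nr-io-queues", str(nr_io_queues),
            "--queue-size", str(queue_size),
            "--keep-alive-tmo", str(keep_alive_tmo),
            "--reconnect-delay", str(reconnect_delay),
        ]

    def connect(traddr, nqn, **tunables):
        subprocess.run(build_connect_cmd(traddr, nqn, **tunables), check=True)

    if __name__ == "__main__":
        # Print the command for a placeholder target rather than executing it.
        print(" ".join(build_connect_cmd("192.0.2.10",
                                         "nqn.2014-08.com.example:target1")))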

NVMe Native Multipathing

We added support for native NVMe multipathing, with an optional fallback to DM-multipath. Local port binding works for NVMe/TCP disks, too, for the cases where your client has multiple NICs on the same subnet.

NVMe Online Resize

NVMe has a richer asynchronous event delivery mechanism than iSCSI, including notifications for device size changes. We integrated this into the CLI’s online resize capability, so size changes are picked up on the client without manual intervention.
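
Once the kernel processes the size-change event, the new capacity is visible to userspace. As a simplified illustration only (not the CLI's actual mechanism), the current size of an NVMe namespace can be read from sysfs; the device name below is a placeholder.

    # Illustrative size check for an NVMe namespace via sysfs (Linux).
    # /sys/block/<dev>/size reports the device length in 512-byte sectors.

    def block_device_bytes(dev="nvme0n1"):   # placeholder device name
        with open(f"/sys/block/{dev}/size") as f:
            return int(f.read().strip()) * 512

    if __name__ == "__main__":
        print(block_device_bytes())  # re-read after a resize to see the new size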

Multi-Parameter Validation

Lastly, we improved the way the CLI reports API validation errors in cases where a parameter can be used more than once (for example, the --initiator parameter to profile create and update). It’s now easier to see exactly what went wrong.

PLATFORM

AMD Milan and Genoa

We began shipping systems with AMD’s “Milan” Zen 3 parts last year on release 5.2. In this release, we’ve been able to optimize the data engine to take advantage of the chiplet layout of these processors. In particular, the 64-core 7713P and 32-core 7543P parts have highly optimized data engine CPU layouts.

The AMD “Genoa” Zen 4 processors launched in November 2022. With both PCIe Gen 5 and DDR5-4800 support, we’re looking at some truly high-performance solution possibilities.

Drive Validation

It’s no secret we’re fans of Micron. With the 9300s going EOL, we spent extensive time validating the Micron 7400 and 7450 firmware in our lab. Minor tuning was needed to compensate for subtle differences between drives. Support for 9400s is imminent.

Cluster Soft Fence

For 6.0, the cluster “soft fence” is now more tightly integrated with the hardware watchdog, reducing the occurrences of degraded redundancy due to network instability.

PROXMOX

Proxmox NVMe/TCP Support

We’re pleased to offer NVMe/TCP support to Proxmox 7.2 and 7.3 users! NVMe/TCP is a game-changer for Proxmox with better multi-tenant performance, much lower latency, and faster volume attaches.

Guides & Performance Analysis

Towards the end of our development cycle, we did several deep-dives into Proxmox performance with NVMe/TCP. These articles contain a wealth of detailed performance data and useful conclusions about how to best configure your PVE setup.

Driver Improvements

Additional improvements in release 6.0 include:

  • Proxmox driver 2.3.1 uses a new technique for status checks, increasing attach/detach performance significantly.
  • We’re now encoding storage pool names to support characters that are invalid in iSCSI, notably the underscore (see the illustrative sketch after this list).
  • Online volume resize has been considerably hardened in Blockbridge 6.0 and 5.2.7.
  • Fixed a rare issue where the Proxmox driver could be unable to attach a virtual disk when the Blockbridge system has more than a thousand virtual disks in the same account.
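
Regarding the storage pool name encoding mentioned above, the scheme below is purely hypothetical; it only illustrates the idea of mapping characters that are not legal in iSCSI names (such as the underscore) onto a legal representation.

    # Hypothetical escaping of characters that are not legal in iSCSI names
    # (RFC 3720 permits lowercase letters, digits, '.', '-', and ':').
    # This is not the Blockbridge driver's actual scheme.

    IQN_SAFE = set("abcdefghijklmnopqrstuvwxyz0123456789.-:")

    def encode_pool_name(name):
        out = []
        for ch in name:
            if ch in IQN_SAFE:
                out.append(ch)
            else:
                out.append("-x%02x-" % ord(ch))   # e.g. '_' -> '-x5f-'
        return "".join(out)

    if __name__ == "__main__":
        print(encode_pool_name("vm_pool_01"))     # vm-x5f-pool-x5f-01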

VMWARE

ESX NVMe/TCP Support

The Blockbridge 6 data engine now supports NVMe/TCP for VMware vSphere 7 update 3c and newer releases. We’re happy to report that performance is considerably higher than with iSCSI!

NVMe VAAI

VMware clustering and high availability functions depend on specialized storage commands for heart-beating, cluster membership operations, and offloaded zeroing. These commands are often referred to as the VAAI commands.

Blockbridge 6 fully supports NVMe equivalents to VAAI. “Fused Commands” provide native support for COMPARE AND WRITE (i.e., “Atomic Test and Set”), and new dedicated offloads deliver improved WRITE ZEROES performance.
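
To make the atomic test-and-set semantics concrete, here is a simplified model of what a fused COMPARE AND WRITE guarantees: the compare and the conditional write behave as one indivisible operation against the block. The in-memory block store and lock below are illustrative only.

    # Simplified model of fused COMPARE AND WRITE (atomic test-and-set).
    # The compare and the conditional write execute as one indivisible step;
    # the in-memory block store and the lock are stand-ins for the real thing.

    import threading

    class BlockStore:
        def __init__(self, block_size=512):
            self.block_size = block_size
            self.blocks = {}
            self._lock = threading.Lock()

        def compare_and_write(self, lba, expected, new_data):
            """Write new_data at lba only if the block currently equals expected."""
            assert len(expected) == len(new_data) == self.block_size
            with self._lock:   # nothing interleaves between compare and write
                current = self.blocks.get(lba, b"\x00" * self.block_size)
                if current != expected:
                    return False          # compare failed; nothing written
                self.blocks[lba] = new_data
                return True

    if __name__ == "__main__":
        store = BlockStore()
        free = b"\x00" * 512   # lock record unowned
        mine = b"\x01" * 512   # lock record owned by this host
        print(store.compare_and_write(0, free, mine))  # True: lock acquired
        print(store.compare_and_write(0, free, mine))  # False: already held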

Asymmetric Namespace Access

Blockbridge 6 includes support for ANA (Asymmetric Namespace Access). ANA is the NVMe equivalent of ALUA (asymmetric logical unit access). ANA is used to exchange path state information needed for high availability and is required by VMware’s NVMe multipath implementation. The Blockbridge data engine offers symmetric namespace access, as all paths are equal in our system.
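
For intuition, multipath selection driven by ANA state might look like the sketch below. The state names follow the NVMe specification; the path records and selection policy are illustrative.

    # Illustrative path selection using ANA states. State names follow the
    # NVMe specification; the path records and policy are made up.

    ANA_PREFERENCE = {
        "optimized": 0,
        "non-optimized": 1,
        "inaccessible": 2,
        "persistent-loss": 3,
    }

    def pick_path(paths):
        usable = [p for p in paths
                  if ANA_PREFERENCE[p["ana_state"]] <= ANA_PREFERENCE["non-optimized"]]
        if not usable:
            raise RuntimeError("no usable path to namespace")
        return min(usable, key=lambda p: ANA_PREFERENCE[p["ana_state"]])

    if __name__ == "__main__":
        paths = [
            {"name": "tcp-path-a", "ana_state": "optimized"},
            {"name": "tcp-path-b", "ana_state": "optimized"},
        ]
        # With Blockbridge, every path reports the same state, so the
        # selector is free to use any of them.
        print(pick_path(paths)["name"])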

VMware NVMe/TCP Storage Guide

Our VMware NVMe/TCP Storage Guide describes how to use esxcli to attach NVMe/TCP storage to a vSphere host.

INTEGRATIONS

Blockbridge is now qualified to interoperate with the following software versions:

  • Proxmox 7.3
  • VMware vSphere 7 update 3c
  • Grafana 8.4
  • Linux kernel 5.15+ for NVMe/TCP
  • Alerting with Microsoft Teams 1.5