Performance & Reliability focus at Illuminosi
SSDs have brought a whole new focus on performance for storage systems. There are many aspects that need to be considered, and these all vary widely depending upon your particular market segments and customers.
In the data center world, a strong focus has been placed recently on providing consistent access times, and some enterprise customers are starting to specify this as well. Access-time requirements stated out to "five nines" (the 99.999th percentile) are no longer unusual.
In the client world, power becomes very important, and very specific requirements are often given for each power mode. Even the power draw during high-power operations such as FORMAT or SANITIZE can be specified for power-constrained devices such as cell phones or tablets.
The terms below are commonly used (and abused) to describe how SSDs perform. If you come from the HDD world ("Spinning Rust"), you may think you understand what they mean. However, many have been co-opted to suit the needs of SSD marketing folks. Because of the drastic difference in capabilities, SSDs require a whole new vocabulary to describe how they perform.
One of the subtle details of SSD performance is how consistent that performance should be. SSDs resemble HDDs in that both must regularly perform media-maintenance operations to keep stored data reliable. On an HDD, these maintenance operations may run only every 7-10 days; on an SSD, they happen every day. Given the difference in performance, the cost is small on a percentage basis, but until it started happening daily, nobody cared. Now that industry folks are seeing the impact, they want it minimized. This is especially true in the data-center world, because long accesses severely impact web responsiveness, which translates into unseen ads and abandoned shopping carts. Many different ideas are in play to fix this, from I/O Determinism (Facebook), to Open Channel and Denali (Google & Microsoft), and now Zoned Namespaces (WDC).
Bandwidth is the measure of how much data can be pumped out of an SSD in a given time. It is usually expressed in megabytes or gigabytes per second (MB/s or GB/s). In the "good old days" of HDDs, this was often referred to as "transfer rate".
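At its core, bandwidth is just bytes moved divided by elapsed time. A minimal sketch of the arithmetic (the figures below are invented for illustration):

```python
def bandwidth_mb_s(bytes_transferred: int, elapsed_seconds: float) -> float:
    """Sustained bandwidth in decimal megabytes per second (MB/s),
    the unit SSD vendors typically quote."""
    return bytes_transferred / elapsed_seconds / 1_000_000

# Hypothetical run: 2 GB transferred in 4 seconds -> 500 MB/s.
print(bandwidth_mb_s(2_000_000_000, 4.0))  # -> 500.0
```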
Latency is a measure of how quickly data begins to flow from an SSD. It is very important to data center (hyperscale) customers, but their twist is that they specify not only the average latency but also the statistical outliers, out to four or five nines (the 99.99th or 99.999th percentile). Latency is expressed in milliseconds or microseconds.
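Those tail percentiles can be computed from raw completion-time samples with a nearest-rank percentile. A minimal sketch, assuming the samples (in microseconds) come from a benchmark's latency log; the numbers here are invented for illustration:

```python
import math

def latency_percentile(samples_us, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ranked = sorted(samples_us)
    rank = math.ceil(len(ranked) * pct / 100)
    return ranked[min(rank, len(ranked)) - 1]

# Hypothetical run: 9,998 fast I/Os plus two slow outliers.
samples = [100] * 9_998 + [5_000, 20_000]
avg = sum(samples) / len(samples)           # ~102 us: the average looks fine
p9999 = latency_percentile(samples, 99.99)  # 5,000 us: the tail tells the real story
```

This is exactly why hyperscalers specify the outliers: the average hides the handful of slow accesses that a user actually notices.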
QoS or Quality of Service is about sharing the load equally among a number of different I/O queues on the same device. This is of concern to enterprise customers as well as hyperscalers. The typical use case where this matters is for multi-user (aka multi-tenancy) scenarios. The concern here is that one process should not be able to "starve" any other process from receiving I/O in a timely manner.
IOPS or Input/output Operations Per Second is a measure of how many read and write requests an SSD can respond to in a given time. It is normally further broken down to sub-categories based on read or write operations and sequential or random address patterns.
| Metric | Definition |
|---|---|
| Total IOPS | Total number of I/O operations per second (when performing a mix of read and write tests, typically 70% reads) |
| Random Read IOPS | Average number of random read I/O operations per second |
| Random Write IOPS | Average number of random write I/O operations per second |
| Sequential Read IOPS | Average number of sequential read I/O operations per second |
| Sequential Write IOPS | Average number of sequential write I/O operations per second |
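IOPS figures and bandwidth figures are two views of the same measurement, related by the transfer size. A small sketch of the conversion (the block size and IOPS values are hypothetical):

```python
def iops_to_bandwidth_mb_s(iops: float, block_size_bytes: int) -> float:
    """Bandwidth implied by an IOPS figure at a given transfer size,
    in decimal MB/s."""
    return iops * block_size_bytes / 1_000_000

# Hypothetical drive: 100,000 random-read IOPS at 4 KiB blocks.
print(iops_to_bandwidth_mb_s(100_000, 4096))  # -> 409.6 MB/s
```

This is why large sequential transfers show impressive MB/s numbers at modest IOPS, while small random transfers show the reverse.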
Queue Depth is the number of outstanding commands waiting to be completed by the SSD at any point in time. As a device specification, it is the maximum number of commands the drive can accept at once; as a benchmark parameter, it is the number of commands the test keeps in flight, so it also indicates how queue management is done during the benchmark.
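Queue depth, latency, and IOPS are tied together by Little's Law (commands in flight = throughput x latency), so the IOPS achievable at a given queue depth can be estimated from average latency. A rough sketch with invented numbers:

```python
def littles_law_iops(queue_depth: int, avg_latency_us: float) -> float:
    """Little's Law: in-flight commands = IOPS x latency, so
    IOPS = queue_depth / average latency (latency given in microseconds)."""
    return queue_depth * 1_000_000 / avg_latency_us

# Hypothetical: 32 commands in flight, 100 us average completion time.
print(littles_law_iops(32, 100.0))  # -> 320000.0 IOPS
```

The relation also runs the other way: if raising the queue depth stops raising IOPS, the added commands are simply waiting, and average latency climbs instead.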
Media Wear is very much a concern with SSDs. HDDs do not really wear out (other than accumulating total hours in use), but the number of times flash media can be written is surprisingly limited. For this reason, SSDs abstract the physical media away from the host by implementing an FTL, or Flash Translation Layer. A number of details go on underneath the surface, just as they do for HDDs. As flash matures, each generation becomes denser than the one before, but part of the price of that density is degraded data reliability. The issues of media wear become more challenging with each succeeding generation of flash.
Write Amp or WAF (Write Amplification Factor, rhymes with "cough") is the ratio of data physically written to the flash to data written by the host. It matters because each flash cell can be written only a finite number of times; after that, the data returned is no longer reliable. This is not a binary thing: as flash cells approach their lifetime limit, the amount of ECC needed to correct "bad bits" increases. The interactions between all the decisions regarding ECC use can become quite complex in enterprise and data center SSDs, where data integrity carries very important guarantees. Design features like "fast fail" come into play, along with controller vendors whose patents have gone well beyond the limits of Hamming codes.
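WAF itself is just a ratio, typically computed from drive counters of host writes versus physical NAND writes. A minimal sketch (the counter values below are invented for illustration):

```python
def write_amplification_factor(host_bytes_written: int,
                               nand_bytes_written: int) -> float:
    """WAF = bytes physically programmed to flash / bytes the host wrote.
    A WAF of 2.5 means the media wears 2.5x faster than host traffic suggests."""
    return nand_bytes_written / host_bytes_written

# Hypothetical counters: host wrote 1 TB; the drive programmed 2.5 TB of NAND.
print(write_amplification_factor(1_000_000_000_000, 2_500_000_000_000))  # -> 2.5
```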
Read Amp or RAF (Read Amplification Factor) is another limitation of flash. Reading a cell too many times disturbs the stored charge in it and its neighbors, until the data returned is no longer valid. That data must then be rewritten in a background operation, which happens as part of the regular "garbage collection" cycle that SSDs run.
Performance evaluation tools
Here are a few of the common performance tools we use:
Contact us if you have any questions about how performance could impact your products. We will help you create a strategy for engineering and testing that will ensure your drives perform as your customers expect.