Modern servers can run 100 billion operations per second, yet most applications perform only tens or hundreds of thousands of transactions per second. Does it really take a million CPU operations to run a single transaction? What is limiting our applications' performance? What can we improve?
The key challenge is that applications have to interact with system memory, which is roughly 100 times slower than the CPU and its cache. Applications read and store data on persistent media such as disks, and occasionally exchange messages and wait for replies from neighboring systems. Having many CPU cores in the system has only made things worse: every time a CPU accesses a shared resource or memory region, it must lock it. These are referred to as serialization or starvation points, during which the CPU sits idle. In some cases asynchronous applications can use the CPU for other tasks during that wait, but in most cases they pay a penalty of expensive locks, context switches and queueing delays.
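A back-of-the-envelope way to see why serialization points matter so much is Amdahl's law: if even a small fraction of the work is serialized (spent in locks or waiting on a shared resource), adding cores quickly stops helping. A minimal sketch, with illustrative numbers:

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Upper bound on parallel speedup when serial_fraction of the
    work is serialized (locks, shared-resource access, etc.)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

# Even if only 5% of the time is spent at serialization points,
# a big core count buys far less than its nominal parallelism:
for cores in (8, 64, 1024):
    print(f"{cores:4d} cores -> {amdahl_speedup(0.05, cores):5.1f}x max speedup")
```

With a 5% serial fraction, 1024 cores can never deliver more than a ~20x speedup, which is why shaving serialization and wait time often beats adding hardware.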
While CPU performance grows at roughly 60% per year, memory latency improves by only about 7% per year, and the gap keeps widening. The same is true of hard disk access latency, which has stayed in the same ballpark for years. To address the problem, CPUs and compilers use advanced prediction and pre-fetching techniques to load memory into cache ahead of execution. A similar mechanism was developed for disks: modern file systems run dynamic algorithms to pre-fetch and cache data in memory, avoiding the costly disk latency. In some cases the prediction is built into the application. For example, a video streaming client/server can pre-fetch the next frames so it can present them when the time comes, but when the user jumps to a random location in the movie … well, you probably know what happens from your mobile YouTube experience.
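Applications that know their access pattern can also request read-ahead explicitly. As a minimal sketch (Linux/POSIX only; the `prefetch` helper name is my own), `posix_fadvise` with `POSIX_FADV_WILLNEED` asks the kernel to pull a file region into the page cache before the application actually reads it:

```python
import os

def prefetch(path: str, offset: int = 0, length: int = 0) -> None:
    """Hint the kernel to read a file region into the page cache ahead of use.
    length == 0 means "from offset to end of file" per posix_fadvise()."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
```

A later read of that region can then be served from memory instead of waiting on the disk, assuming the kernel honored the hint and the data was still cached.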
SSD, Flash, and NV-RAM technologies provide a great latency improvement, from 5 milliseconds of access time down to hundreds of microseconds or less. But take the vendors’ benchmarks with a grain of salt – they usually tell only half the story. SSDs tend to have fairly high jitter, and under heavy load latency can jump to milliseconds. Those impressive IOPS numbers are measured with queue depths of 256 operations across 8 parallel threads, i.e. the test application keeps roughly 2,000 operations in flight to show off peak performance. In reality, applications can only predict a few steps ahead, and parallel apps issue at most a few dozen IO operations at a time. In such realistic cases IOPS plunges, because the latency of an individual IO operation, plus all the software stack overhead around it, dominates.
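The gap between benchmark and reality follows directly from Little's law: sustained throughput equals outstanding operations divided by per-operation latency. A quick sketch with assumed, illustrative numbers (not measurements of any specific device):

```python
def iops(outstanding_ios: int, latency_seconds: float) -> float:
    """Little's law: sustained throughput = concurrency / per-op latency."""
    return outstanding_ios / latency_seconds

# Vendor-style benchmark: 256 IOs in flight at 100 us each.
print(f"benchmark: {iops(256, 100e-6):12,.0f} IOPS")
# Realistic app: a few dozen IOs in flight, with per-IO latency
# inflated by the software stack and contention (say 400 us end to end).
print(f"realistic: {iops(24, 400e-6):12,.0f} IOPS")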
I/O Wait % measured on a database (source: Diablo Technologies)
A number of studies have demonstrated the impact of lowering storage latency and IO wait % (the time the CPU waits for IOs to complete) on overall solution cost and application responsiveness. IBM analyzed 183 databases and found that the “Majority of applications will benefit greater from lower latency, than increasing CPU or lowering CPU latency”. Netflix tested Cassandra with HDDs and SSDs on Amazon AWS and found that using SSDs cut the overall solution cost in half, since fewer Cassandra instances were required, while delivering a 5X improvement in overall user response time.
Wikibon published a detailed article on the impact of Flash on overall Oracle Database TCO. The essence is that while Flash is more expensive, the reduction in latency means fewer cluster nodes are required for the same workload – and that is before counting the significant time saved in DBA performance tuning.
Storage latency is responsible for only a portion of the overall IO latency. With shared storage (SAN, NAS, object) there is an additional software stack and a network to cross. In many cases we measured the round-trip network latency plus the added software stack overhead and found them greater than the SSD latency itself, with a major impact on IO wait and application efficiency. Messaging latency also tends to climb sharply – often exponentially – with load, so under real load the situation is even worse, because there is more contention, starvation and queueing delay.
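To see how much the stack and transport add, it helps to measure round trips directly rather than trust quoted figures. A minimal sketch using a local socket pair (so it captures only local software stack overhead, not a real network hop – a real SAN/NAS round trip would add far more):

```python
import socket
import statistics
import threading
import time

def measure_rtt(rounds: int = 1000) -> tuple[float, float]:
    """Median and worst-case round-trip time over a local socket pair."""
    client, server = socket.socketpair()

    def echo() -> None:
        for _ in range(rounds):
            server.sendall(server.recv(64))

    t = threading.Thread(target=echo)
    t.start()
    samples = []
    for _ in range(rounds):
        t0 = time.perf_counter()
        client.sendall(b"ping")
        client.recv(64)
        samples.append(time.perf_counter() - t0)
    t.join()
    client.close()
    server.close()
    return statistics.median(samples), max(samples)

median, worst = measure_rtt()
print(f"median RTT: {median * 1e6:.1f} us, worst: {worst * 1e6:.1f} us")
```

Even on a local socket, the worst case is typically many times the median – exactly the jitter and queueing effect described above; across a real network and storage stack the spread grows further.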
Example of IO latency vs. IO throughput (source: http://www.storageperformance.org/)
Note that many database appliances – Oracle Exadata, Teradata, IBM DB2, MS SQL with SMB3, etc. – use RDMA technologies over InfiniBand or Ethernet to lower IO latency and eliminate software stack overhead. The same goes for most All-Flash Array storage systems, which use RDMA as the internal cluster interconnect (enabling very low latency replication and synchronization). RDMA is now also integrated into open source projects such as OpenStack, Ceph and Hadoop HDFS.
We have found the sources of application performance limitations to be memory and IO access latency, along with serialization. In some cases prediction or pre-fetching logic can help, but in many cases lower-latency IO has a major impact on overall application performance while lowering the total cost of the system.
While many storage systems and vendors highlight performance in terms of maximum bandwidth and “ideal” IOPS, the factor that really drives application performance is latency under real-world workloads (quoted minimum latency numbers, measured with only one IO in flight, don’t reflect real latency, which tends to grow sharply with load). Storage and IO latency should be measured in a complete scenario that includes mixed workloads, network latency, software stack latency and starvation, replication, etc.
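When measuring such complete scenarios, report tail latency, not just the average: a handful of multi-millisecond stalls barely move the mean but dominate user-visible delay. A sketch of a nearest-rank percentile helper over collected latency samples, fed here with a toy workload (the numbers are illustrative, not real measurements):

```python
import math
import random

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample >= p% of all samples."""
    ordered = sorted(samples)
    return ordered[max(0, math.ceil(p / 100 * len(ordered)) - 1)]

# Toy workload: ~100 us typical latency plus rare multi-ms stalls
# from queueing/contention (illustrative model only).
random.seed(7)
lat = [100e-6 + random.expovariate(1 / 20e-6) +
       (5e-3 if random.random() < 0.02 else 0.0)
       for _ in range(10_000)]
mean = sum(lat) / len(lat)
print(f"mean={mean*1e6:.0f}us p50={percentile(lat, 50)*1e6:.0f}us "
      f"p99={percentile(lat, 99)*1e6:.0f}us")
```

In this toy model the p99 lands in the multi-millisecond stall region while the mean stays a few hundred microseconds – the same shape a single-IO “minimum latency” quote hides.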
In the previous post we discussed how to mix HDDs and SSDs – still relevant here: some data is accessed infrequently or can be read in advance, and is a good fit for HDDs.