|By Brendan Gregg||
|January 14, 2014 11:00 AM EST||
"This excerpt is from the book, "Systems Performance: Enterprise and the Cloud", authored by Brendan Gregg, published by Prentice Hall Professional, Oct. 2013, ISBN 9780133390094, Copyright © 2014 Pearson Education, Inc. For more info, please visit the publisher site:
CPUs drive all software and are often the first target for systems performance analysis. Modern systems typically have many CPUs, which are shared among all running software by the kernel scheduler. When there is more demand for CPU resources than there are resources available, process threads (or tasks) will queue, waiting their turn. Waiting can add significant latency during the runtime of applications, degrading performance.
The usage of the CPUs can be examined in detail to look for performance improvements, including eliminating unnecessary work. At a high level, CPU usage by process, thread, or task can be examined. At a lower level, the code path within applications and the kernel can be profiled and studied. At the lowest level, CPU instruction execution and cycle behavior can be studied.
This chapter consists of five parts:
- Background introduces CPU-related terminology, basic models of CPUs, and key CPU performance concepts.
- Architecture introduces processor and kernel scheduler architecture.
- Methodology describes performance analysis methodologies, both observa- tional and experimental.
- Analysis describes CPU performance analysis tools on Linux- and Solaris- based systems, including profiling, tracing, and visualizations.
- Tuning includes examples of tunable parameters.
The first three sections provide the basis for CPU analysis, and the last two show its practical application to Linux- and Solaris-based systems.
The effects of memory I/O on CPU performance are covered, including CPU cycles stalled on memory and the performance of CPU caches. Chapter 7, Memory, continues the discussion of memory I/O, including MMU, NUMA/UMA, system interconnects, and memory busses.
For reference, CPU-related terminology used in this chapter includes the following:
- Processor: the physical chip that plugs into a socket on the system or pro- cessor board and contains one or more CPUs implemented as cores or hard- ware threads.
- Core: an independent CPU instance on a multicore processor. The use of cores is a way to scale processors, called chip-level multiprocessing (CMP).
- Hardware thread: a CPU architecture that supports executing multiple threads in parallel on a single core (including Intel's Hyper-Threading Tech- nology), where each thread is an independent CPU instance. One name for this scaling approach is multithreading.
- CPU instruction: a single CPU operation, from its instruction set. There are instructions for arithmetic operations, memory I/O, and control logic.
- Logical CPU: also called a virtual processor,1 an operating system CPU instance (a schedulable CPU entity). This may be implemented by the processor as a hardware thread (in which case it may also be called a virtual core), a core, or a single-core processor.
- Scheduler: the kernel subsystem that assigns threads to run on CPUs.
- Run queue: a queue of runnable threads that are waiting to be serviced by
- CPUs. For Solaris, it is often called a dispatcher queue.
Other terms are introduced throughout this chapter. The Glossary includes basic terminology for reference, including CPU, CPU cycle, and stack. Also see the terminology sections in Chapters 2 and 3.
The following simple models illustrate some basic principles of CPUs and CPU per- formance. Section 6.4, Architecture, digs much deeper and includes implementation- specific details.
Figure 1 shows an example CPU architecture, for a single processor with four cores and eight hardware threads in total. The physical architecture is pictured, along with how it is seen by the operating system.
Figure 1: CPU architecture
Each hardware thread is addressable as a logical CPU, so this processor appears as eight CPUs. The operating system may have some additional knowledge of topology, such as which CPUs are on the same core, to improve its scheduling decisions.
CPU Memory Caches
Processors provide various hardware caches for improving memory I/O perfor- mance. Figure 2 shows the relationship of cache sizes, which become smaller and faster (a trade-off) the closer they are to the CPU.
The caches that are present, and whether they are on the processor (integrated) or external to the processor, depend on the processor type. Earlier processors pro- vided fewer levels of integrated cache.
Figure 2: CPU cache sizes
CPU Run Queues
Figure 3 shows a CPU run queue, which is managed by the kernel scheduler.
Figure 3: CPU run queue
The thread states shown in the figure, ready to run and on-CPU, are covered in Figure 3.7 in Chapter 3, Operating Systems.
The number of software threads that are queued and ready to run is an impor- tant performance metric indicating CPU saturation. In this figure (at this instant) there are four, with an additional thread running on-CPU. The time spent waiting on a CPU run queue is sometimes called run-queue latency or dispatcher-queue latency. In this book, the term scheduler latency is used instead, as it is appropri- ate for all dispatcher types, including those that do not use queues (see the discus- sion of CFS in Section 6.4.2, Software).
For multiprocessor systems, the kernel typically provides a run queue for each CPU and aims to keep threads on the same run queue. This means that threads are more likely to keep running on the same CPUs, where the CPU caches have cached their data. (These caches are described as having cache warmth, and the approach to favor CPUs is called CPU affinity.) On NUMA systems, memory locality may also be improved, which also improves performance (this is described in Chapter 7, Memory).
It also avoids the cost of thread synchronization (mutex locks) for queue operations, which would hurt scalability if the run queue was global and shared among all CPUs.
The following are a selection of important concepts regarding CPU performance, beginning with a summary of processor internals: the CPU clock rate and how instructions are executed. This is background for later performance analysis, particularly for understanding the cycles-per-instruction (CPI) metric.
The clock is a digital signal that drives all processor logic. Each CPU instruction may take one or more cycles of the clock (called CPU cycles) to execute. CPUs exe- cute at a particular clock rate; for example, a 5 GHz CPU performs 5 billion clock cycles per second.
Some processors are able to vary their clock rate, increasing it to improve performance or decreasing it to reduce power consumption. The rate may be varied on request by the operating system, or dynamically by the processor itself. The ker- nel idle thread, for example, can request the CPU to throttle down to save power.
Clock rate is often marketed as the primary feature of the processor, but this can be a little misleading. Even if the CPU in your system appears to be fully utilized (a bottleneck), a faster clock rate may not speed up performance-it depends on what those fast CPU cycles are actually doing. If they are mostly stall cycles while waiting on memory access, executing them more quickly doesn't actually increase the CPU instruction rate or workload throughput.
CPUs execute instructions chosen from their instruction set. An instruction includes the following steps, each processed by a component of the CPU called a functional unit:
- Instruction fetch
- Instruction decode
- Memory access
- Register write-back
The last two steps are optional, depending on the instruction. Many instructions operate on registers only and do not require the memory access step.
Each of these steps takes at least a single clock cycle to be executed. Memory access is often the slowest, as it may take dozens of clock cycles to read or write to main memory, during which instruction execution has stalled (and these cycles while stalled are called stall cycles). This is why CPU caching is important, as described in Section 6.4: it can dramatically reduce the number of cycles needed for memory access.
The instruction pipeline is a CPU architecture that can execute multiple instructions in parallel, by executing different components of different instructions at the same time. It is similar to a factory assembly line, where stages of production can be executed in parallel, increasing throughput.
Consider the instruction steps previously listed. If each were to take a single clock cycle, it would take five cycles to complete the instruction. At each step of this instruction, only one functional unit is active and four are idle. By use of pipe- lining, multiple functional units can be active at the same time, processing differ- ent instructions in the pipeline. Ideally, the processor can then complete one instruction with every clock cycle.
But we can go faster still. Multiple functional units can be included of the same type, so that even more instructions can make forward progress with each clock cycle. This CPU architecture is called superscalar and is typically used with pipe- lining to achieve a high instruction throughput.
The instruction width describes the target number of instructions to process in parallel. Modern processors are 3-wide or 4-wide, meaning they can complete up to three or four instructions per cycle. How this works depends on the processor, as there may be different numbers of functional units for each stage.
Cycles per instruction (CPI) is an important high-level metric for describing where a CPU is spending its clock cycles and for understanding the nature of CPU utilization. This metric may also be expressed as instructions per cycle (IPC), the inverse of CPI.
A high CPI indicates that CPUs are often stalled, typically for memory access. A low CPI indicates that CPUs are often not stalled and have a high instruction throughput. These metrics suggest where performance tuning efforts may be best spent.
Memory-intensive workloads, for example, may be improved by installing faster memory (DRAM), improving memory locality (software configuration), or reducing the amount of memory I/O. Installing CPUs with a higher clock rate may not improve performance to the degree expected, as the CPUs may need to wait the same amount of time for memory I/O to complete. Put differently, a faster CPU may mean more stall cycles but the same rate of completed instructions.
The actual values for high or low CPI are dependent on the processor and processor features and can be determined experimentally by running known work- loads. As an example, you may find that high-CPI workloads run with a CPI at ten or higher, and low CPI workloads run with a CPI at less than one (which is possi- ble due to instruction pipelining and width, described earlier).
It should be noted that CPI shows the efficiency of instruction processing, but not of the instructions themselves. Consider a software change that added an inefficient software loop, which operates mostly on CPU registers (no stall cycles): such a change may result in a lower overall CPI, but higher CPU usage and utilization.
CPU utilization is measured by the time a CPU instance is busy performing work during an interval, expressed as a percentage. It can be measured as the time a CPU is not running the kernel idle thread but is instead running user-level application threads or other kernel threads, or processing interrupts.
High CPU utilization may not necessarily be a problem, but rather a sign that the system is doing work. Some people also consider this an ROI indicator: a highly utilized system is considered to have good ROI, whereas an idle system is considered wasted. Unlike with other resource types (disks), performance does not degrade steeply under high utilization, as the kernel supports priorities, preemption, and time sharing. These together allow the kernel to understand what has higher priority, and to ensure that it runs first.
The measure of CPU utilization spans all clock cycles for eligible activities, including memory stall cycles. It may seem a little counterintuitive, but a CPU may be highly utilized because it is often stalled waiting for memory I/O, not just executing instructions, as described in the previous section.
CPU utilization is often split into separate kernel- and user-time metrics.
Internet of Things (IoT) will be a hybrid ecosystem of diverse devices and sensors collaborating with operational and enterprise systems to create the next big application. In their session at @ThingsExpo, Bramh Gupta, founder and CEO of robomq.io, and Fred Yatzeck, principal architect leading product development at robomq.io, discussed how choosing the right middleware and integration strategy from the get-go will enable IoT solution developers to adapt and grow with the industry, while at the same time reduce Time to Market (TTM) by using plug and play capabilities offered by a robust IoT ...
Oct. 9, 2015 02:00 AM EDT Reads: 2,191
Today’s connected world is moving from devices towards things, what this means is that by using increasingly low cost sensors embedded in devices we can create many new use cases. These span across use cases in cities, vehicles, home, offices, factories, retail environments, worksites, health, logistics, and health. These use cases rely on ubiquitous connectivity and generate massive amounts of data at scale. These technologies enable new business opportunities, ways to optimize and automate, along with new ways to engage with users.
Oct. 9, 2015 02:00 AM EDT Reads: 160
The buzz continues for cloud, data analytics and the Internet of Things (IoT) and their collective impact across all industries. But a new conversation is emerging - how do companies use industry disruption and technology enablers to lead in markets undergoing change, uncertainty and ambiguity? Organizations of all sizes need to evolve and transform, often under massive pressure, as industry lines blur and merge and traditional business models are assaulted and turned upside down. In this new data-driven world, marketplaces reign supreme while interoperability, APIs and applications deliver un...
Oct. 9, 2015 02:00 AM EDT Reads: 278
The Internet of Things (IoT) is growing rapidly by extending current technologies, products and networks. By 2020, Cisco estimates there will be 50 billion connected devices. Gartner has forecast revenues of over $300 billion, just to IoT suppliers. Now is the time to figure out how you’ll make money – not just create innovative products. With hundreds of new products and companies jumping into the IoT fray every month, there’s no shortage of innovation. Despite this, McKinsey/VisionMobile data shows "less than 10 percent of IoT developers are making enough to support a reasonably sized team....
Oct. 9, 2015 02:00 AM EDT Reads: 202
“In the past year we've seen a lot of stabilization of WebRTC. You can now use it in production with a far greater degree of certainty. A lot of the real developments in the past year have been in things like the data channel, which will enable a whole new type of application," explained Peter Dunkley, Technical Director at Acision, in this SYS-CON.tv interview at @ThingsExpo, held Nov 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
Oct. 9, 2015 01:45 AM EDT Reads: 7,008
Through WebRTC, audio and video communications are being embedded more easily than ever into applications, helping carriers, enterprises and independent software vendors deliver greater functionality to their end users. With today’s business world increasingly focused on outcomes, users’ growing calls for ease of use, and businesses craving smarter, tighter integration, what’s the next step in delivering a richer, more immersive experience? That richer, more fully integrated experience comes about through a Communications Platform as a Service which allows for messaging, screen sharing, video...
Oct. 9, 2015 12:00 AM EDT Reads: 1,130
SYS-CON Events announced today that Dyn, the worldwide leader in Internet Performance, will exhibit at SYS-CON's 17th International Cloud Expo®, which will take place on November 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. Dyn is a cloud-based Internet Performance company. Dyn helps companies monitor, control, and optimize online infrastructure for an exceptional end-user experience. Through a world-class network and unrivaled, objective intelligence into Internet conditions, Dyn ensures traffic gets delivered faster, safer, and more reliably than ever.
Oct. 8, 2015 10:00 PM EDT Reads: 590
There are so many tools and techniques for data analytics that even for a data scientist the choices, possible systems, and even the types of data can be daunting. In his session at @ThingsExpo, Chris Harrold, Global CTO for Big Data Solutions for EMC Corporation, will show how to perform a simple, but meaningful analysis of social sentiment data using freely available tools that take only minutes to download and install. Participants will get the download information, scripts, and complete end-to-end walkthrough of the analysis from start to finish. Participants will also be given the pract...
Oct. 8, 2015 09:15 PM EDT Reads: 285
The IoT market is on track to hit $7.1 trillion in 2020. The reality is that only a handful of companies are ready for this massive demand. There are a lot of barriers, paint points, traps, and hidden roadblocks. How can we deal with these issues and challenges? The paradigm has changed. Old-style ad-hoc trial-and-error ways will certainly lead you to the dead end. What is mandatory is an overarching and adaptive approach to effectively handle the rapid changes and exponential growth.
Oct. 8, 2015 09:00 PM EDT Reads: 116
Mobile messaging has been a popular communication channel for more than 20 years. Finnish engineer Matti Makkonen invented the idea for SMS (Short Message Service) in 1984, making his vision a reality on December 3, 1992 by sending the first message ("Happy Christmas") from a PC to a cell phone. Since then, the technology has evolved immensely, from both a technology standpoint, and in our everyday uses for it. Originally used for person-to-person (P2P) communication, i.e., Sally sends a text message to Betty – mobile messaging now offers tremendous value to businesses for customer and empl...
Oct. 8, 2015 05:30 PM EDT Reads: 232
Can call centers hang up the phones for good? Intuitive Solutions did. WebRTC enabled this contact center provider to eliminate antiquated telephony and desktop phone infrastructure with a pure web-based solution, allowing them to expand beyond brick-and-mortar confines to a home-based agent model. It also ensured scalability and better service for customers, including MUY! Companies, one of the country's largest franchise restaurant companies with 232 Pizza Hut locations. This is one example of WebRTC adoption today, but the potential is limitless when powered by IoT.
Oct. 8, 2015 04:30 PM EDT Reads: 7,471
You have your devices and your data, but what about the rest of your Internet of Things story? Two popular classes of technologies that nicely handle the Big Data analytics for Internet of Things are Apache Hadoop and NoSQL. Hadoop is designed for parallelizing analytical work across many servers and is ideal for the massive data volumes you create with IoT devices. NoSQL databases such as Apache HBase are ideal for storing and retrieving IoT data as “time series data.”
Oct. 8, 2015 02:45 PM EDT Reads: 498
Clearly the way forward is to move to cloud be it bare metal, VMs or containers. One aspect of the current public clouds that is slowing this cloud migration is cloud lock-in. Every cloud vendor is trying to make it very difficult to move out once a customer has chosen their cloud. In his session at 17th Cloud Expo, Naveen Nimmu, CEO of Clouber, Inc., will advocate that making the inter-cloud migration as simple as changing airlines would help the entire industry to quickly adopt the cloud without worrying about any lock-in fears. In fact by having standard APIs for IaaS would help PaaS expl...
Oct. 8, 2015 02:30 PM EDT Reads: 652
NHK, Japan Broadcasting, will feature the upcoming @ThingsExpo Silicon Valley in a special 'Internet of Things' and smart technology documentary that will be filmed on the expo floor between November 3 to 5, 2015, in Santa Clara. NHK is the sole public TV network in Japan equivalent to the BBC in the UK and the largest in Asia with many award-winning science and technology programs. Japanese TV is producing a documentary about IoT and Smart technology and will be covering @ThingsExpo Silicon Valley. The program, to be aired during the peak viewership season of the year, will have a major impac...
Oct. 8, 2015 01:00 PM EDT Reads: 261
SYS-CON Events announced today that ProfitBricks, the provider of painless cloud infrastructure, will exhibit at SYS-CON's 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. ProfitBricks is the IaaS provider that offers a painless cloud experience for all IT users, with no learning curve. ProfitBricks boasts flexible cloud servers and networking, an integrated Data Center Designer tool for visual control over the cloud and the best price/performance value available. ProfitBricks was named one of the coolest Clo...
Oct. 8, 2015 01:00 PM EDT Reads: 763
Organizations already struggle with the simple collection of data resulting from the proliferation of IoT, lacking the right infrastructure to manage it. They can't only rely on the cloud to collect and utilize this data because many applications still require dedicated infrastructure for security, redundancy, performance, etc. In his session at 17th Cloud Expo, Emil Sayegh, CEO of Codero Hosting, will discuss how in order to resolve the inherent issues, companies need to combine dedicated and cloud solutions through hybrid hosting – a sustainable solution for the data required to manage I...
Oct. 8, 2015 01:00 PM EDT Reads: 477
Apps and devices shouldn't stop working when there's limited or no network connectivity. Learn how to bring data stored in a cloud database to the edge of the network (and back again) whenever an Internet connection is available. In his session at 17th Cloud Expo, Bradley Holt, Developer Advocate at IBM Cloud Data Services, will demonstrate techniques for replicating cloud databases with devices in order to build offline-first mobile or Internet of Things (IoT) apps that can provide a better, faster user experience, both offline and online. The focus of this talk will be on IBM Cloudant, Apa...
Oct. 8, 2015 12:45 PM EDT Reads: 508
WebRTC is about the data channel as much as about video and audio conferencing. However, basically all commercial WebRTC applications have been built with a focus on audio and video. The handling of “data” has been limited to text chat and file download – all other data sharing seems to end with screensharing. What is holding back a more intensive use of peer-to-peer data? In her session at @ThingsExpo, Dr Silvia Pfeiffer, WebRTC Applications Team Lead at National ICT Australia, will look at different existing uses of peer-to-peer data sharing and how it can become useful in a live session to...
Oct. 8, 2015 12:00 PM EDT Reads: 607
As a company adopts a DevOps approach to software development, what are key things that both the Dev and Ops side of the business must keep in mind to ensure effective continuous delivery? In his session at DevOps Summit, Mark Hydar, Head of DevOps, Ericsson TV Platforms, will share best practices and provide helpful tips for Ops teams to adopt an open line of communication with the development side of the house to ensure success between the two sides.
Oct. 8, 2015 12:00 PM EDT Reads: 576
SYS-CON Events announced today that IBM Cloud Data Services has been named “Bronze Sponsor” of SYS-CON's 17th Cloud Expo, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. IBM Cloud Data Services offers a portfolio of integrated, best-of-breed cloud data services for developers focused on mobile computing and analytics use cases.
Oct. 8, 2015 11:00 AM EDT Reads: 728