|By Brendan Gregg||
|January 14, 2014 11:00 AM EST||
"This excerpt is from the book, "Systems Performance: Enterprise and the Cloud", authored by Brendan Gregg, published by Prentice Hall Professional, Oct. 2013, ISBN 9780133390094, Copyright © 2014 Pearson Education, Inc. For more info, please visit the publisher site:
CPUs drive all software and are often the first target for systems performance analysis. Modern systems typically have many CPUs, which are shared among all running software by the kernel scheduler. When there is more demand for CPU resources than there are resources available, process threads (or tasks) will queue, waiting their turn. Waiting can add significant latency during the runtime of applications, degrading performance.
The usage of the CPUs can be examined in detail to look for performance improvements, including eliminating unnecessary work. At a high level, CPU usage by process, thread, or task can be examined. At a lower level, the code path within applications and the kernel can be profiled and studied. At the lowest level, CPU instruction execution and cycle behavior can be studied.
This chapter consists of five parts:
- Background introduces CPU-related terminology, basic models of CPUs, and key CPU performance concepts.
- Architecture introduces processor and kernel scheduler architecture.
- Methodology describes performance analysis methodologies, both observa- tional and experimental.
- Analysis describes CPU performance analysis tools on Linux- and Solaris- based systems, including profiling, tracing, and visualizations.
- Tuning includes examples of tunable parameters.
The first three sections provide the basis for CPU analysis, and the last two show its practical application to Linux- and Solaris-based systems.
The effects of memory I/O on CPU performance are covered, including CPU cycles stalled on memory and the performance of CPU caches. Chapter 7, Memory, continues the discussion of memory I/O, including MMU, NUMA/UMA, system interconnects, and memory busses.
For reference, CPU-related terminology used in this chapter includes the following:
- Processor: the physical chip that plugs into a socket on the system or pro- cessor board and contains one or more CPUs implemented as cores or hard- ware threads.
- Core: an independent CPU instance on a multicore processor. The use of cores is a way to scale processors, called chip-level multiprocessing (CMP).
- Hardware thread: a CPU architecture that supports executing multiple threads in parallel on a single core (including Intel's Hyper-Threading Tech- nology), where each thread is an independent CPU instance. One name for this scaling approach is multithreading.
- CPU instruction: a single CPU operation, from its instruction set. There are instructions for arithmetic operations, memory I/O, and control logic.
- Logical CPU: also called a virtual processor,1 an operating system CPU instance (a schedulable CPU entity). This may be implemented by the processor as a hardware thread (in which case it may also be called a virtual core), a core, or a single-core processor.
- Scheduler: the kernel subsystem that assigns threads to run on CPUs.
- Run queue: a queue of runnable threads that are waiting to be serviced by
- CPUs. For Solaris, it is often called a dispatcher queue.
Other terms are introduced throughout this chapter. The Glossary includes basic terminology for reference, including CPU, CPU cycle, and stack. Also see the terminology sections in Chapters 2 and 3.
The following simple models illustrate some basic principles of CPUs and CPU per- formance. Section 6.4, Architecture, digs much deeper and includes implementation- specific details.
Figure 1 shows an example CPU architecture, for a single processor with four cores and eight hardware threads in total. The physical architecture is pictured, along with how it is seen by the operating system.
Figure 1: CPU architecture
Each hardware thread is addressable as a logical CPU, so this processor appears as eight CPUs. The operating system may have some additional knowledge of topology, such as which CPUs are on the same core, to improve its scheduling decisions.
CPU Memory Caches
Processors provide various hardware caches for improving memory I/O perfor- mance. Figure 2 shows the relationship of cache sizes, which become smaller and faster (a trade-off) the closer they are to the CPU.
The caches that are present, and whether they are on the processor (integrated) or external to the processor, depend on the processor type. Earlier processors pro- vided fewer levels of integrated cache.
Figure 2: CPU cache sizes
CPU Run Queues
Figure 3 shows a CPU run queue, which is managed by the kernel scheduler.
Figure 3: CPU run queue
The thread states shown in the figure, ready to run and on-CPU, are covered in Figure 3.7 in Chapter 3, Operating Systems.
The number of software threads that are queued and ready to run is an impor- tant performance metric indicating CPU saturation. In this figure (at this instant) there are four, with an additional thread running on-CPU. The time spent waiting on a CPU run queue is sometimes called run-queue latency or dispatcher-queue latency. In this book, the term scheduler latency is used instead, as it is appropri- ate for all dispatcher types, including those that do not use queues (see the discus- sion of CFS in Section 6.4.2, Software).
For multiprocessor systems, the kernel typically provides a run queue for each CPU and aims to keep threads on the same run queue. This means that threads are more likely to keep running on the same CPUs, where the CPU caches have cached their data. (These caches are described as having cache warmth, and the approach to favor CPUs is called CPU affinity.) On NUMA systems, memory locality may also be improved, which also improves performance (this is described in Chapter 7, Memory).
It also avoids the cost of thread synchronization (mutex locks) for queue operations, which would hurt scalability if the run queue was global and shared among all CPUs.
The following are a selection of important concepts regarding CPU performance, beginning with a summary of processor internals: the CPU clock rate and how instructions are executed. This is background for later performance analysis, particularly for understanding the cycles-per-instruction (CPI) metric.
The clock is a digital signal that drives all processor logic. Each CPU instruction may take one or more cycles of the clock (called CPU cycles) to execute. CPUs exe- cute at a particular clock rate; for example, a 5 GHz CPU performs 5 billion clock cycles per second.
Some processors are able to vary their clock rate, increasing it to improve performance or decreasing it to reduce power consumption. The rate may be varied on request by the operating system, or dynamically by the processor itself. The ker- nel idle thread, for example, can request the CPU to throttle down to save power.
Clock rate is often marketed as the primary feature of the processor, but this can be a little misleading. Even if the CPU in your system appears to be fully utilized (a bottleneck), a faster clock rate may not speed up performance-it depends on what those fast CPU cycles are actually doing. If they are mostly stall cycles while waiting on memory access, executing them more quickly doesn't actually increase the CPU instruction rate or workload throughput.
CPUs execute instructions chosen from their instruction set. An instruction includes the following steps, each processed by a component of the CPU called a functional unit:
- Instruction fetch
- Instruction decode
- Memory access
- Register write-back
The last two steps are optional, depending on the instruction. Many instructions operate on registers only and do not require the memory access step.
Each of these steps takes at least a single clock cycle to be executed. Memory access is often the slowest, as it may take dozens of clock cycles to read or write to main memory, during which instruction execution has stalled (and these cycles while stalled are called stall cycles). This is why CPU caching is important, as described in Section 6.4: it can dramatically reduce the number of cycles needed for memory access.
The instruction pipeline is a CPU architecture that can execute multiple instructions in parallel, by executing different components of different instructions at the same time. It is similar to a factory assembly line, where stages of production can be executed in parallel, increasing throughput.
Consider the instruction steps previously listed. If each were to take a single clock cycle, it would take five cycles to complete the instruction. At each step of this instruction, only one functional unit is active and four are idle. By use of pipe- lining, multiple functional units can be active at the same time, processing differ- ent instructions in the pipeline. Ideally, the processor can then complete one instruction with every clock cycle.
But we can go faster still. Multiple functional units can be included of the same type, so that even more instructions can make forward progress with each clock cycle. This CPU architecture is called superscalar and is typically used with pipe- lining to achieve a high instruction throughput.
The instruction width describes the target number of instructions to process in parallel. Modern processors are 3-wide or 4-wide, meaning they can complete up to three or four instructions per cycle. How this works depends on the processor, as there may be different numbers of functional units for each stage.
Cycles per instruction (CPI) is an important high-level metric for describing where a CPU is spending its clock cycles and for understanding the nature of CPU utilization. This metric may also be expressed as instructions per cycle (IPC), the inverse of CPI.
A high CPI indicates that CPUs are often stalled, typically for memory access. A low CPI indicates that CPUs are often not stalled and have a high instruction throughput. These metrics suggest where performance tuning efforts may be best spent.
Memory-intensive workloads, for example, may be improved by installing faster memory (DRAM), improving memory locality (software configuration), or reducing the amount of memory I/O. Installing CPUs with a higher clock rate may not improve performance to the degree expected, as the CPUs may need to wait the same amount of time for memory I/O to complete. Put differently, a faster CPU may mean more stall cycles but the same rate of completed instructions.
The actual values for high or low CPI are dependent on the processor and processor features and can be determined experimentally by running known work- loads. As an example, you may find that high-CPI workloads run with a CPI at ten or higher, and low CPI workloads run with a CPI at less than one (which is possi- ble due to instruction pipelining and width, described earlier).
It should be noted that CPI shows the efficiency of instruction processing, but not of the instructions themselves. Consider a software change that added an inefficient software loop, which operates mostly on CPU registers (no stall cycles): such a change may result in a lower overall CPI, but higher CPU usage and utilization.
CPU utilization is measured by the time a CPU instance is busy performing work during an interval, expressed as a percentage. It can be measured as the time a CPU is not running the kernel idle thread but is instead running user-level application threads or other kernel threads, or processing interrupts.
High CPU utilization may not necessarily be a problem, but rather a sign that the system is doing work. Some people also consider this an ROI indicator: a highly utilized system is considered to have good ROI, whereas an idle system is considered wasted. Unlike with other resource types (disks), performance does not degrade steeply under high utilization, as the kernel supports priorities, preemption, and time sharing. These together allow the kernel to understand what has higher priority, and to ensure that it runs first.
The measure of CPU utilization spans all clock cycles for eligible activities, including memory stall cycles. It may seem a little counterintuitive, but a CPU may be highly utilized because it is often stalled waiting for memory I/O, not just executing instructions, as described in the previous section.
CPU utilization is often split into separate kernel- and user-time metrics.
SYS-CON Events announced today that MangoApps will exhibit at SYS-CON's 18th International Cloud Expo®, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. MangoApps provides modern company intranets and team collaboration software, allowing workers to stay connected and productive from anywhere in the world and from any device. For more information, please visit https://www.mangoapps.com/.
May. 28, 2016 04:30 PM EDT Reads: 892
WebRTC is bringing significant change to the communications landscape that will bridge the worlds of web and telephony, making the Internet the new standard for communications. Cloud9 took the road less traveled and used WebRTC to create a downloadable enterprise-grade communications platform that is changing the communication dynamic in the financial sector. In his session at @ThingsExpo, Leo Papadopoulos, CTO of Cloud9, will discuss the importance of WebRTC and how it enables companies to fo...
May. 28, 2016 03:45 PM EDT Reads: 2,535
The IoT is changing the way enterprises conduct business. In his session at @ThingsExpo, Eric Hoffman, Vice President at EastBanc Technologies, discuss how businesses can gain an edge over competitors by empowering consumers to take control through IoT. We'll cite examples such as a Washington, D.C.-based sports club that leveraged IoT and the cloud to develop a comprehensive booking system. He'll also highlight how IoT can revitalize and restore outdated business models, making them profitable...
May. 28, 2016 02:00 PM EDT Reads: 2,921
In his session at 18th Cloud Expo, Bruce Swann, Senior Product Marketing Manager at Adobe, will discuss how the Adobe Marketing Cloud can help marketers embrace opportunities for personalized, relevant and real-time customer engagement across offline (direct mail, point of sale, call center) and digital (email, website, SMS, mobile apps, social networks, connected objects). Bruce Swann has more than 15 years of experience working with digital marketing disciplines like web analytics, social med...
May. 28, 2016 02:00 PM EDT Reads: 1,358
IoT generates lots of temporal data. But how do you unlock its value? How do you coordinate the diverse moving parts that must come together when developing your IoT product? What are the key challenges addressed by Data as a Service? How does cloud computing underlie and connect the notions of Digital and DevOps What is the impact of the API economy? What is the business imperative for Cognitive Computing? Get all these questions and hundreds more like them answered at the 18th Cloud Expo...
May. 28, 2016 01:00 PM EDT Reads: 2,332
SYS-CON Events announced today that ContentMX, the marketing technology and services company with a singular mission to increase engagement and drive more conversations for enterprise, channel and SMB technology marketers, has been named “Sponsor & Exhibitor Lounge Sponsor” of SYS-CON's 18th Cloud Expo, which will take place on June 7-9, 2016, at the Javits Center in New York City, New York. “CloudExpo is a great opportunity to start a conversation with new prospects, but what happens after the...
May. 28, 2016 11:15 AM EDT Reads: 1,239
SYS-CON Events announced today the How to Create Angular 2 Clients for the Cloud Workshop, being held June 7, 2016, in conjunction with 18th Cloud Expo | @ThingsExpo, at the Javits Center in New York, NY. Angular 2 is a complete re-write of the popular framework AngularJS. Programming in Angular 2 is greatly simplified. Now it’s a component-based well-performing framework. The immersive one-day workshop led by Yakov Fain, a Java Champion and a co-founder of the IT consultancy Farata Systems and...
May. 28, 2016 11:00 AM EDT Reads: 4,051
Customer experience has become a competitive differentiator for companies, and it’s imperative that brands seamlessly connect the customer journey across all platforms. With the continued explosion of IoT, join us for a look at how to build a winning digital foundation in the connected era – today and in the future. In his session at @ThingsExpo, Chris Nguyen, Group Product Marketing Manager at Adobe, will discuss how to successfully leverage mobile, rapidly deploy content, capture real-time d...
May. 28, 2016 10:45 AM EDT Reads: 1,621
SYS-CON Events announced today that BMC Software has been named "Siver Sponsor" of SYS-CON's 18th Cloud Expo, which will take place on June 7-9, 2015 at the Javits Center in New York, New York. BMC is a global leader in innovative software solutions that help businesses transform into digital enterprises for the ultimate competitive advantage. BMC Digital Enterprise Management is a set of innovative IT solutions designed to make digital business fast, seamless, and optimized from mainframe to mo...
May. 28, 2016 09:45 AM EDT Reads: 2,264
What a difference a year makes. Organizations aren’t just talking about IoT possibilities, it is now baked into their core business strategy. With IoT, billions of devices generating data from different companies on different networks around the globe need to interact. From efficiency to better customer insights to completely new business models, IoT will turn traditional business models upside down. In the new customer-centric age, the key to success is delivering critical services and apps wit...
May. 28, 2016 09:15 AM EDT Reads: 1,196
Join us at Cloud Expo | @ThingsExpo 2016 – June 7-9 at the Javits Center in New York City and November 1-3 at the Santa Clara Convention Center in Santa Clara, CA – and deliver your unique message in a way that is striking and unforgettable by taking advantage of SYS-CON's unmatched high-impact, result-driven event / media packages.
May. 28, 2016 09:00 AM EDT Reads: 2,432
In his keynote at 18th Cloud Expo, Andrew Keys, Co-Founder of ConsenSys Enterprise, will provide an overview of the evolution of the Internet and the Database and the future of their combination – the Blockchain. Andrew Keys is Co-Founder of ConsenSys Enterprise. He comes to ConsenSys Enterprise with capital markets, technology and entrepreneurial experience. Previously, he worked for UBS investment bank in equities analysis. Later, he was responsible for the creation and distribution of life ...
May. 28, 2016 08:45 AM EDT Reads: 1,981
SYS-CON Events announced today that MobiDev will exhibit at SYS-CON's 18th International Cloud Expo®, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. MobiDev is a software company that develops and delivers turn-key mobile apps, websites, web services, and complex software systems for startups and enterprises. Since 2009 it has grown from a small group of passionate engineers and business managers to a full-scale mobile software company with over 200 develope...
May. 28, 2016 07:15 AM EDT Reads: 2,687
SoftLayer operates a global cloud infrastructure platform built for Internet scale. With a global footprint of data centers and network points of presence, SoftLayer provides infrastructure as a service to leading-edge customers ranging from Web startups to global enterprises. SoftLayer's modular architecture, full-featured API, and sophisticated automation provide unparalleled performance and control. Its flexible unified platform seamlessly spans physical and virtual devices linked via a world...
May. 28, 2016 06:00 AM EDT Reads: 2,259
Companies can harness IoT and predictive analytics to sustain business continuity; predict and manage site performance during emergencies; minimize expensive reactive maintenance; and forecast equipment and maintenance budgets and expenditures. Providing cost-effective, uninterrupted service is challenging, particularly for organizations with geographically dispersed operations.
May. 28, 2016 05:00 AM EDT Reads: 2,120
SYS-CON Events announced today TechTarget has been named “Media Sponsor” of SYS-CON's 18th International Cloud Expo, which will take place on June 7–9, 2016, at the Javits Center in New York City, NY, and the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. TechTarget is the Web’s leading destination for serious technology buyers researching and making enterprise technology decisions. Its extensive global networ...
May. 28, 2016 05:00 AM EDT Reads: 3,246
SYS-CON Events announced today that Commvault, a global leader in enterprise data protection and information management, has been named “Bronze Sponsor” of SYS-CON's 18th International Cloud Expo, which will take place on June 7–9, 2016, at the Javits Center in New York City, NY, and the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. Commvault is a leading provider of data protection and information management...
May. 28, 2016 04:15 AM EDT Reads: 3,207
SYS-CON Events announced today Object Management Group® has been named “Media Sponsor” of SYS-CON's 18th International Cloud Expo, which will take place on June 7–9, 2016, at the Javits Center in New York City, NY, and the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
May. 28, 2016 03:00 AM EDT Reads: 2,585
As cloud and storage projections continue to rise, the number of organizations moving to the cloud is escalating and it is clear cloud storage is here to stay. However, is it secure? Data is the lifeblood for government entities, countries, cloud service providers and enterprises alike and losing or exposing that data can have disastrous results. There are new concepts for data storage on the horizon that will deliver secure solutions for storing and moving sensitive data around the world. ...
May. 28, 2016 03:00 AM EDT Reads: 1,322
The essence of data analysis involves setting up data pipelines that consist of several operations that are chained together – starting from data collection, data quality checks, data integration, data analysis and data visualization (including the setting up of interaction paths in that visualization). In our opinion, the challenges stem from the technology diversity at each stage of the data pipeline as well as the lack of process around the analysis.
May. 28, 2016 01:30 AM EDT Reads: 1,465