Hyper-Threading Java

In early 2002 Intel became the first chip manufacturer to release a processor incorporating a new technology known as Simultaneous Multithreading, or SMT. Intel's SMT implementation (dubbed Hyper-Threading or HT) has been available in their Xeon processor line for over a year, with little fanfare. In April 2003, Intel announced that HT technology will be added to its desktop-focused Pentium 4 line of processors. With HT enabled on one of these new systems, the BIOS will present a single processor to the operating system as two logical processors.

As Java developers, we should all be excited about this new feature of Intel processors. The java.lang.Thread object was one of the key factors driving Java to the strong position it enjoys in the server-side applications market. Both client and server applications written in Java often make heavy use of threads. Indeed, even if an application does not use threads explicitly, all JVMs will use at least one background thread: the garbage collector. SMT holds the promise of significantly increasing Java's server-side performance by more completely utilizing existing processor cycles in multithreaded applications.

This article attempts to explain the concepts of Simultaneous Multithreading in layman's terms, presents the development of an n-thread benchmarking suite, and uses that suite to produce concrete results of multithreaded benchmarks on HT and non-HT systems. We'll investigate various operation types to determine the factors that affect Java performance enhancements on Hyper-Threaded processors. Finally, a series of conclusions and speculations is derived from the data collected.

Understanding Simultaneous Multithreading on Intel Processors
Intel processors with HT technology carry two copies of the processor's architectural state on the same chip. This second architectural state stores a second thread context. Conceptually, this type of processor architecture splits each physical processor into two or more logical processors. Physical SMT processors present themselves to the operating system as separate logical processors. As we'll see later, it can then become important for the operating system to be aware of and to differentiate between logical and physical processors. Figure 1 illustrates the difference between SMT and non-SMT processors.

What is the benefit of SMT? As it turns out, the more expensive processor resources can find themselves underutilized while an active thread performs long latency operations. A cache miss, for instance, will require the processor to make a request to main memory. The majority of the processor's resources remain idle for this period of time; however, the processor presents itself to the operating system as busy. SMT systems use this slice of time to execute the operations of another on-chip thread context.

SMT processors contain an onboard scheduler to interleave multiple threads operating on the physical processor. If a thread encounters a long latency, the processor will immediately execute the instructions of the second on-chip processor state. For two threads accessing the same processor resources, the onboard scheduler will interleave the threads much the same as a software thread scheduler. This interleaving has a small amount of overhead, which can decrease the efficiency of the processor in certain situations. On an aggregate basis, however, processor performance is increased.

With SMT, performance can increase or decrease depending on the work that each thread is doing on adjacent logical processors. Various papers (see Resources) studying multithreaded performance indicate generally positive results, with some research indicating perceived performance gains as high as 50%.

HT-Enabled Systems
Intel Hyper-Threading requires support from three fundamental components of a system:

  1. The processor
  2. The chipset
  3. The operating system
Processors Supporting HT
Hyper-Threading was incorporated into the Xeon class processors in early 2002. Xeon is not to be confused with Pentium III Xeon. When Intel changed the Xeon's core to P4, it dropped the P4 designation, calling the processor simply Xeon. Recently, HT has found its way to the desktop P4 processor. Not all processors in each of these processor classes are capable of Hyper-Threading, however.

Table 1 indicates which processors support Hyper-Threading. The table also indicates factors that you can use to determine whether a given Intel processor supports HT.

With the release of the 3.06GHz Pentium 4, Intel changed the P4 logo, incorporating the letters H and T to indicate that it's a Hyper-Threading processor.

All recent Xeon processors support Hyper-Threading, but again, be sure to watch out for the 256KB L2 Cache version, which does not.

Chipset Support for HT
Not all chipsets support HT. Check with your chipset manufacturer to ensure that you can enable and disable HT support via the BIOS.

All HT chipsets interleave processor numbering to help less sophisticated thread schedulers make complete use of available physical processors. The chipset will present the logical processors to the OS as follows:

Logical CPU0 = Physical CPU0, Logical CPU0
Logical CPU1 = Physical CPU1, Logical CPU0
Logical CPU2 = Physical CPU0, Logical CPU1
Logical CPU3 = Physical CPU1, Logical CPU1
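
The interleaved numbering above follows a simple pattern that can be sketched in a few lines of Java. This is only an illustration: the class and method names are hypothetical, and the generalization beyond the dual-processor listing assumes exactly two logical processors per physical processor.

```java
// Sketch only: reproduces the interleaved numbering listed above.
// Assumes two logical processors per physical processor (hypothetical helper names).
class LogicalCpuMap {
    /** Physical processor that hosts the given OS-visible CPU number. */
    static int physicalCpu(int osCpu, int physicalCount) {
        return osCpu % physicalCount;
    }

    /** Logical processor (0 or 1) within that physical processor. */
    static int logicalCpu(int osCpu, int physicalCount) {
        return osCpu / physicalCount;
    }

    public static void main(String[] args) {
        int physicalCount = 2; // dual-processor system, as in the listing
        for (int osCpu = 0; osCpu < 2 * physicalCount; osCpu++) {
            System.out.println("Logical CPU" + osCpu + " = Physical CPU"
                    + physicalCpu(osCpu, physicalCount) + ", Logical CPU"
                    + logicalCpu(osCpu, physicalCount));
        }
    }
}
```

With this ordering, a scheduler that simply fills CPUs in numerical order places one thread on each physical processor before doubling up on any of them.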

Operating Systems Supporting HT
Given a processor and chipset that support Hyper-Threading, the operating system must also be HT aware. Table 2 shows the OS support for several currently available operating systems commonly run on Intel-based hardware.

Windows
The Windows 2000 operating systems do not differentiate between logical and physical processors. Therefore a 32-processor HT system will support only 32 logical processors. It will work; however, the additional processor resources will not be utilized.

Windows users should check software licensing agreements to confirm that they recognize logical processors. Generally XP will support licensing on a per physical CPU basis, while Windows 2000 will see logical processors as physical processors for licensing purposes.

Figure 2 shows the Windows XP Pro Task Manager on a dual-processor HT system. Note the four distinct "CPU Usage History" charts depicting the four logical processors.

Linux
The 2.4 kernel began supporting Hyper-Threading on the Intel Xeon processor as of version 2.4.18. The thread scheduler in 2.4, however, like that of the Windows 2000 family of products, does not understand the difference between logical and physical processors, and it lacks many other SMT scheduler optimizations. This can lead to degraded performance in situations where two threads are scheduled concurrently on one physical processor, while the other physical processor is left idle.

As of kernel version 2.5.32, the thread scheduler was updated with advanced features to support Hyper-Threading. The 2.5.x kernel is the development branch that will become the 2.6 kernel. The exact release schedule for 2.6 is unknown, but in a recent interview Linus Torvalds indicated that 2.6 would likely be released in Q4 2003.

Figure 3 shows a Red Hat 7.3 installation running the 2.4.18 kernel with Hyper-Threading enabled on the system. Note the four CPU states indicated as CPU0-CPU3 on top. Also note that CPU0 is running at 100.1% utilization. Wow, Hyper-Threading is cool!

Threaded Benchmarking on HT and Non-HT Systems
Our goal here is to understand the effects of Hyper-Threading processors on the performance of multithreaded Java applications. To do this, we need a test bed that will allow us to execute heavily threaded operations and track performance variations against thread count in HT and non-HT systems.

Thread Bench Design
At a basic level, the test bed should be able to execute multiple operations across n threads, observing the total throughput of operations per unit of time for a run. On a dual-processor system, we should see nearly double the performance on a CPU-intensive operation using two threads instead of one. The performance of CPU-intensive threaded operations on HT systems will vary based on the operations and the level of concurrency possible on a single physical processor.

Our focus here is to explore which types of operations will and will not benefit from HT technology. Given this we need to be able to quickly implement and test multiple types of operations.

There are several Java benchmarking systems available on the market. Many are older and focused on applet performance. Some newer benchmark systems like VolanoMark or SPECjbb2000 test the threaded performance of systems; however, they don't allow us to customize and focus on specific individual operations that could affect performance on an HT system.

These requirements drove the design and coding of an n-thread Java benchmark framework. The framework supports pluggable operation classes and produces plottable results for a range of thread counts from a single test suite execution.

Figure 4 presents a functional/UML diagram for the system design.

The resulting benchmarking framework has the following features:

  • Initialization of operations on the JIT: Modern JIT compilers will optimize "hot spots" in the code. The performance of any given operation will improve over the life of the VM, so the ThreadBench framework gives operations a chance to initialize on the JIT before the tests commence.
  • Operation abstraction: By developing a generic operation interface and using dynamic class loading and initialization of the operation to be tested, we can quickly prototype and test various processor-intensive operations.
  • Test suites: Using test suites, ThreadBench runs a given operation configuration through several iterations of the test with different numbers of threads. This allows a series of tests to be repeatedly run on several machine configurations with minimal effort.
  • Multiple runs: To smooth out anomalies in the test, each data point is created by averaging data from several runs. This is configurable; some tests have a larger standard deviation than others.
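
The features listed above can be illustrated with a minimal sketch. It is not the ThreadBench code from the article: the names BenchOperation, SqrtSumOperation, and MiniThreadBench are hypothetical, a Supplier stands in for the dynamic class loading the article describes, and the code uses java.util.concurrent and lambdas, which postdate the JDK 1.4 used for the original tests.

```java
import java.util.concurrent.CountDownLatch;
import java.util.function.Supplier;

/** Hypothetical operation contract; the article's real interface may differ. */
interface BenchOperation {
    void setup();   // untimed preparation, e.g. building input data
    void execute(); // one timed unit of work
}

/** Placeholder CPU-bound operation used only to exercise the harness. */
class SqrtSumOperation implements BenchOperation {
    private double[] data;
    volatile double sink; // keeps the JIT from discarding the loop result

    public void setup() {
        data = new double[10_000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1.0;
    }

    public void execute() {
        double sum = 0.0;
        for (double d : data) sum += Math.sqrt(d);
        sink = sum;
    }
}

class MiniThreadBench {
    /** Runs opsPerThread executions on each thread; returns operations per second. */
    static double runOnce(Supplier<BenchOperation> factory, int threads, int opsPerThread) {
        Thread[] workers = new Thread[threads];
        CountDownLatch startGate = new CountDownLatch(1);
        for (int i = 0; i < threads; i++) {
            BenchOperation op = factory.get();
            op.setup(); // not part of the timed region
            workers[i] = new Thread(() -> {
                try {
                    startGate.await(); // release all workers together
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int k = 0; k < opsPerThread; k++) op.execute();
            });
            workers[i].start();
        }
        long begin = System.nanoTime();
        startGate.countDown();
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("benchmark interrupted", e);
        }
        long elapsedNanos = Math.max(1L, System.nanoTime() - begin);
        return (double) threads * opsPerThread / (elapsedNanos / 1e9);
    }

    /** Averages several runs to smooth out anomalies. */
    static double average(Supplier<BenchOperation> factory, int threads,
                          int opsPerThread, int runs) {
        double total = 0.0;
        for (int r = 0; r < runs; r++) total += runOnce(factory, threads, opsPerThread);
        return total / runs;
    }

    public static void main(String[] args) {
        Supplier<BenchOperation> factory = SqrtSumOperation::new;
        runOnce(factory, 1, 2_000); // warm-up so the JIT can compile the hot loop
        for (int threads : new int[] {1, 2, 3, 4, 6, 8}) {
            System.out.printf("%d thread(s): %.0f ops/s%n",
                    threads, average(factory, threads, 2_000, 3));
        }
    }
}
```

Each worker gets its own operation instance so that threads share no mutable state, which keeps the measurement focused on the processor rather than on lock contention.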

    The code for this article can be downloaded from the JDJ Web site, www.sys-con.com/java/sourcec.cfm.

    Factors Affecting Performance
    Use of Threads

    This seems obvious; however, it needs to be mentioned: single-threaded applications (often client applications) will see little performance gain. Server-side Java applications make extensive use of threads, making them excellent candidates for performance improvement from SMT.

    Nonthreaded applications may still see some benefit. Java's garbage collection and background JIT compilers operate as daemon threads in the local JVM. In addition, concurrent processes could make use of the additional processor resources.

    The Operating System's Thread Scheduler
    In an HT system, a single physical processor is presented to the OS as two logical processors. This requires the OS to differentiate between physical and logical processors and make intelligent decisions about thread scheduling.

    The thread scheduler on a dual-processor HT system will see four logical processors. A poor thread scheduler could schedule two CPU-intensive threads onto separate logical processors representing the same physical processor. This would result in a perceived performance decrease on an HT-based system.
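
From inside the JVM, the processor count seen by the scheduler can be checked with Runtime.availableProcessors(), which has been part of the platform since Java 1.4. The sketch below assumes, as is generally the case, that the value reflects logical rather than physical processors when HT is enabled; the class name is hypothetical, and the standard library offers no call that separates the two.

```java
// Sketch: report the processor count the JVM sees (hypothetical class name).
class LogicalProcessorCount {
    /** Number of processors the JVM reports as available. */
    static int reported() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        // On a dual-processor system with HT enabled this would be expected to print 4.
        System.out.println("Processors visible to the JVM: " + reported());
    }
}
```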

    CPU Resource Utilization
    Hyper-Threaded processors do not duplicate all available resources. Two threads performing fundamentally similar operations on separate logical processors will likely see little performance gain. For HT to be a benefit, the two threads coexisting on a physical CPU must perform a variety of operations to allow the processor to make better use of latency.

    Performance of Threaded Benchmarks on HT and Non-HT Systems
    Tests were run on two HT-capable dual-processor systems (see Table 3).

    Hyper-Threading requires BIOS support, making it easy to enable and disable the feature in the boot setup program for various runs.

    Each test was run with the Sun JDK 1.4.1_02, using the -server flag on the Linux and XP systems. Tests were also run with the IBM 1.4.0 JVM, with no command-line flags, on the Linux system.

    The tests devised are by no means comprehensive. The goal was to stress the processor, using different processor resources, to try to gain some insight into the effects of SMT processing. The series of tests was run on each of the above systems, with and without HT enabled. Each of the operation algorithms tested is briefly described, followed by results and some discussion and interpretation.

    Note: To save space, the XP and Linux tests are shown on the same plots. The data should not be directly compared, however. The tests were run on different physical hardware; indeed, the processor speeds on the XP machine were higher than on the Linux machine.

    Test 1: Gaussian Elimination, 500x500 matrix (Floating point intensive)
    Gaussian elimination is a very common algorithm used to solve systems of linear equations, a common task in finite element applications, weather simulation, coordinate transformations, and economic modeling among other things. Algorithmic optimizations are often done for sparse/banded matrices; however, the core of the work is fundamentally the same: large numbers of floating point calculations are required.

    To simulate this, a Gaussian elimination algorithm with scaled partial pivoting and back substitution is used (see Figure 5). A full matrix is constructed of random doubles using Math.random(). The population of the matrix is carried out in the setup() method and is not considered part of the operation.

    This operation carries out large numbers of simple floating point operations on doubles. All calculations are done in the Java call stack, though it's highly likely that the code was optimized by the JIT before the tests were run.
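
For readers who want to see the shape of the work, here is a minimal sketch of Gaussian elimination with scaled partial pivoting and back substitution. It is not the article's downloadable code; the class name GaussSketch and the in-place design are assumptions, and a benchmark run would fill a 500x500 matrix with Math.random() values rather than the tiny system used in main.

```java
import java.util.Arrays;

// Sketch only (hypothetical class): solves a*x = b, overwriting a and b.
class GaussSketch {
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        // Scale factor per row: its largest absolute coefficient.
        double[] scale = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) scale[i] = Math.max(scale[i], Math.abs(a[i][j]));
            if (scale[i] == 0.0) throw new ArithmeticException("singular matrix");
        }
        for (int k = 0; k < n - 1; k++) {
            // Scaled partial pivoting: choose the row whose candidate pivot
            // is largest relative to that row's scale factor.
            int p = k;
            for (int i = k + 1; i < n; i++) {
                if (Math.abs(a[i][k]) / scale[i] > Math.abs(a[p][k]) / scale[p]) p = i;
            }
            if (a[p][k] == 0.0) throw new ArithmeticException("singular matrix");
            double[] row = a[p]; a[p] = a[k]; a[k] = row;
            double t = b[p]; b[p] = b[k]; b[k] = t;
            t = scale[p]; scale[p] = scale[k]; scale[k] = t;
            // Eliminate column k from every row below the pivot row.
            for (int i = k + 1; i < n; i++) {
                double m = a[i][k] / a[k][k];
                for (int j = k; j < n; j++) a[i][j] -= m * a[k][j];
                b[i] -= m * b[k];
            }
        }
        if (a[n - 1][n - 1] == 0.0) throw new ArithmeticException("singular matrix");
        // Back substitution on the upper-triangular system.
        double[] x = new double[n];
        for (int i = n - 1; i >= 0; i--) {
            double s = b[i];
            for (int j = i + 1; j < n; j++) s -= a[i][j] * x[j];
            x[i] = s / a[i][i];
        }
        return x;
    }

    public static void main(String[] args) {
        double[][] a = {{2, 1}, {1, 3}}; // 2x + y = 5, x + 3y = 10
        double[] b = {5, 10};
        System.out.println(Arrays.toString(solve(a, b))); // prints [1.0, 3.0]
    }
}
```

The triple-nested elimination loop is where nearly all of the floating point multiplications and subtractions occur, so it dominates the run time for a full matrix.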

    It seems that this operation does not scale well into threads on any JVM. The Sun VM on Windows XP with Hyper-Threading does significantly worse than the Linux JVMs with or without Hyper-Threading. There are no synchronizations in the operation whatsoever. Poor scaling into threads could be due to memory barriers, or contention for a bus or main memory.

    Test 2: Calculation of 2000! (Integer intensive)
    Calculation of factorial (! operator) is used often in probability calculations. It's used as a portion of the formula for combinations and permutations. Factorial is defined as follows:

    N! = 1 x 2 x 3 x 4 x ... x N

    Combinations are an interesting calculation in poker, and illustrate a potential use of the factorial operator. To calculate the number of five-card combinations in a 52-card deck, we use the combinations formula:

    Possible poker hands = 52C5 = 52! / (5! x (52 - 5)!)

    Factorial calculations of even small integers grow rapidly, requiring the use of the java.math.BigInteger class. Calculations of factorials result in a large number of integer multiplications.

    The factorial calculations shown in Figure 6 do show some consistent, limited benefit from Hyper-Threading. Indeed, for four threads the IBM JVM shows a 17% increase in performance using an HT-enabled system.

    Incidentally, there are 2,598,960 five-card combinations in a 52-card deck.
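
The factorial and combinations formulas above translate directly to java.math.BigInteger. The sketch below is illustrative rather than the article's benchmark operation; the class and method names are hypothetical.

```java
import java.math.BigInteger;

// Sketch only (hypothetical class): factorial and combinations with BigInteger.
class FactorialSketch {
    /** N! = 1 x 2 x ... x N, computed with repeated BigInteger multiplication. */
    static BigInteger factorial(int n) {
        BigInteger result = BigInteger.ONE;
        for (int i = 2; i <= n; i++) {
            result = result.multiply(BigInteger.valueOf(i));
        }
        return result;
    }

    /** nCk = n! / (k! x (n - k)!) */
    static BigInteger combinations(int n, int k) {
        return factorial(n).divide(factorial(k).multiply(factorial(n - k)));
    }

    public static void main(String[] args) {
        System.out.println(combinations(52, 5)); // prints 2598960
        // 2000! is far too large for a long; report only its length in decimal digits.
        System.out.println(factorial(2000).toString().length() + " digits in 2000!");
    }
}
```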

    Test 3: 150K calculations of Math.tan() (Floating point, mixed stack)
    This test simply calculates the tangent of an angle 150,000 times in a tight loop (see Figure 7).

    All Java threads have two call stacks: one for Java calls, the other for C calls. The java.lang.Math.tan(double) function is native, calculating an approximation of tangent with a 27th order polynomial. It's likely that the reason this operation scales so well into Hyper-Threading is the constant call stack switching, giving the processor time to utilize its secondary thread context.
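
The tight loop described for Test 3 might look like the sketch below. The names are hypothetical, and two details are assumptions added here rather than taken from the article: the results are summed and returned, and the angle can be stepped on each iteration, both so that a JIT compiler has less opportunity to hoist or discard the call.

```java
// Sketch only (hypothetical class): repeated Math.tan() calls in a tight loop.
class TanLoopSketch {
    /** Sums tan(startAngle + i * step) for i in [0, iterations). Angles are in radians. */
    static double run(double startAngle, double step, int iterations) {
        double sum = 0.0;
        for (int i = 0; i < iterations; i++) {
            sum += Math.tan(startAngle + i * step);
        }
        return sum; // returning the sum keeps the loop from being optimized away
    }

    public static void main(String[] args) {
        // 150,000 calls, matching the count used in the article's test.
        System.out.println(run(0.1, 1e-9, 150_000));
    }
}
```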

    Test 4: Prime number search
    A prime number search operation was created using the BigInteger class and a very simplistic direct search factorization. The poor algorithm is not as important as the type of calculations being performed. This class performs a large number of BigInteger divisions.
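
A direct search of this kind can be sketched as trial division with BigInteger. This is an illustration under assumptions, not the article's class: the names are hypothetical, and stopping once the divisor squared exceeds the candidate is a choice made here that the original may or may not share.

```java
import java.math.BigInteger;

// Sketch only (hypothetical class): naive trial-division prime search.
class PrimeSearchSketch {
    private static final BigInteger TWO = BigInteger.valueOf(2);

    /** Tests n by dividing by every integer d with d*d <= n. */
    static boolean isPrime(BigInteger n) {
        if (n.compareTo(TWO) < 0) return false;
        for (BigInteger d = TWO; d.multiply(d).compareTo(n) <= 0; d = d.add(BigInteger.ONE)) {
            if (n.remainder(d).signum() == 0) return false; // one BigInteger division per step
        }
        return true;
    }

    /** Counts the primes in the inclusive range [from, to]. */
    static int countPrimes(long from, long to) {
        int count = 0;
        for (long c = from; c <= to; c++) {
            if (isPrime(BigInteger.valueOf(c))) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countPrimes(1, 100)); // prints 25
    }
}
```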

    It is difficult to tell what is going on in Figure 8, beyond the fact that the IBM JVM is beating Sun's. The IBM JVM scales well into threading this operation. It does even better when Hyper-Threading is enabled. The Sun VM scales poorly into threads, and it becomes worse with additional thread contexts. You could speculate that this behavior is characteristic of a low-level synchronization contention issue in the Sun JVM.

    Testing Summary
    The plots above give some general idea of how these various operations scale into threads. In most cases, the HT performance gains are modest. The following is a summary of performance differences seen with Hyper-Threading enabled versus disabled for each of the tested JVMs.

    IBM 1.4.0, Linux 2.4.18

    Threads   Gauss      Factorial   Math.tan()   Prime
    1          4.13%      3.92%      -0.10%        3.06%
    2          1.92%      7.39%       1.62%       -2.42%
    3          0.21%     11.45%      34.99%        1.96%
    4         -2.58%     16.98%      75.84%        9.84%
    6         -3.56%     13.33%      60.96%        4.53%
    8         -0.69%     n/a         n/a           2.41%

    Sun 1.4.1, Linux 2.4.18

    Threads   Gauss      Factorial   Math.tan()   Prime
    1          0.99%      0.28%      -0.75%        0.30%
    2         -1.20%      0.35%      -1.76%        6.10%
    3         -2.20%      8.21%      23.76%        6.30%
    4         -3.63%      8.28%      62.74%      -30.08%
    6         -4.13%      7.71%      62.96%      -27.50%
    8         -4.73%     n/a         n/a         -28.28%

    Sun 1.4.1, Windows XP Pro

    Threads   Gauss      Factorial   Math.tan()   Prime
    1         -0.51%      0.93%       0.62%       -1.32%
    2         -1.18%      0.98%      -6.17%       14.07%
    3        -12.90%      3.53%       7.85%       -0.74%
    4        -23.96%      4.61%      11.74%      -24.14%
    6        -23.23%      6.35%      11.79%      -23.46%
    8        -23.66%     n/a         n/a         -23.36%
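
The article does not state how the percentages in these tables were computed. One natural reading, assumed here, is the relative change in throughput with HT enabled compared to the same test with HT disabled. The sketch below shows that calculation; the class name and the throughput figures in main are purely illustrative and are not measurements from the article.

```java
// Sketch only (hypothetical class): relative throughput change, HT enabled vs. disabled.
class HtDelta {
    /** Percentage change of the HT-enabled throughput relative to the HT-disabled run. */
    static double percentChange(double htDisabled, double htEnabled) {
        return (htEnabled - htDisabled) / htDisabled * 100.0;
    }

    public static void main(String[] args) {
        // Illustrative values only: 200 ops/s without HT, 234 ops/s with HT.
        System.out.printf("%.2f%%%n", percentChange(200.0, 234.0));
    }
}
```

Under this reading, a positive value means HT helped at that thread count and a negative value means it hurt.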

    Conclusion
    When I began this project, I fully expected to see marked performance gains using Hyper-Threading over identical hardware not using HT. In the course of testing, I've learned quite a bit about performance differences for Java on various platforms, hardware configurations, and virtual machines. Hyper-Threading is not the boon I had expected. In some situations, performance gains for HT reached the 75% mark, which is considerable. There was little significant performance degradation using HT, so using it seems to be largely on the upside.

    Perhaps the more important finding is that the IBM JVMs perform significantly better than the Sun JVMs. In addition, the IBM JVMs scaled far better with threads than did Sun's offering. If performance is of key concern, and you're not using some of the more esoteric features of the Sun JVM, IBM JVMs deserve serious consideration.

    Most server-side Java applications are not doing computationally intensive tasks. The tasks focus more heavily on socket IO: communicating with databases, clients via HTTP, RMI, Web services, and the like. Processors will be given plenty of socket IO wait time to schedule parallel tasks. For socket-IO-bound applications, be sure to consider the relative skill of your operating system in the IP arena.

    The introduction of Hyper-Threading on desktop P4 systems is also exciting. Java developers often develop on Windows or Linux-based desktop systems and deploy onto larger SMP and potentially SMT systems. HT will allow a desktop developer and user to see some of the benefits of threaded applications long before deployment to the higher-end systems.

    SMT technology is here to stay. Intel's Hyper-Threading implementation is sure to be the first of many. Chip industry watchers speculate that Simultaneous Multithreading and thread-level parallelism will spell the ultimate end of the "megahertz wars." A chip's performance will be tied less to its internal clock speed and more to the bells and whistles it incorporates. Other chip manufacturers are sure to follow suit, and all implementations will improve in quality over time.

    Operating systems are also continually improving their support for Hyper-Threading. It does seem strange that the performance on an XP system, which should be HT optimized, was often less HT friendly than the 2.4.18 Linux kernel, which is HT ignorant. As more sophisticated support for HT is built into operating systems, we should see more significant performance gains using HT in the Java world.

    The combination of Java and Linux in the datacenter is rapidly gaining ground on the Solaris/Java platform. The majority of these new Linux servers are running high-end Intel-based hardware. Hyper-Threading will give this trend a further push in the Linux direction.

    For now, given a piece of hardware that's HT capable, the configuration that offers the best performance under most conditions is the IBM 1.4.0 JVM on Linux with Hyper-Threading enabled.

    Resources

  • Microsoft license clarification for SMT systems: www.microsoft.com/nz/licensing/downloads/hyper_threading_processors_licensing_brief.doc
  • Intel processor specsheet, Xeon: www.intel.com/products/server/processors/server/xeon/index.htm
  • Intel processor specsheet, Xeon DP: www.intel.com/design/xeon/prodbref/index.htm
  • Intel processor specsheet, Xeon MP: www.intel.com/products/server/processors/server/xeon_mp/index.htm
  • Intel processor specsheet, Pentium 4: http://developer.intel.com/design/pentium4/datashts/298643.htm
  • P4 chipset matrix indicating HT support: www.intel.com/design/chipsets/linecard.htm
  • IBM whitepaper on Linux and Hyper-Threading: www-106.ibm.com/developerworks/linux/library/l-htl/?dwzone=linux
  • LinuxWorld article indicating Q4 2003 release of 2.6 kernel: www.linuxworld.com/story/33805.htm

    Glossary

  • Physical processor: A silicon-based hardware processor
  • Logical processor: A hardware/software system making pseudo-parallel use of a single physical processor
  • Simultaneous Multithreading (SMT): The use of logical processors to increase processing throughput on a single physical processor
  • Symmetric Multiprocessing (SMP): The use of multiple physical processors in parallel, each running separate threads of execution
  • Hyper-Threading: Intel's marketing name for its SMT technology on Xeon and Pentium 4 processors
    By Paul Bemowski

    Paul Bemowski is an independent consultant, focusing on Java and
    Linux solutions to enterprise computing problems.
    url: http://www.jetools.com
