
Predictive Failure Analytics with Optimization for Big Data

Applying data mining and advanced statistical methods to analyze, diagnose and improve manufacturing yield

We present a unique case study of applying data mining and advanced statistical methods to analyze, diagnose and improve manufacturing yield, especially for rare failure event prediction. Intra-die process variations in nanometer technology nodes pose significant challenges to robust design practices. Geometric variations along with random dopant fluctuation effects have had a significant impact on memory functionality and yield. Inaccuracies in the models and variabilities in the process are more pronounced, and force us to understand variability effects in a processor chip with higher accuracy and fidelity, considering more physical effects, than ever before. In this case study, we use predictive failure analytics to learn and optimize critical components of the processor, and deal with massive amounts of data using server farms for parallel processing. The technique can handle the large numbers of process and design variables that cause mismatches in transistors, demonstrating high dimensionality, accuracy and efficiency. This increases the confidence level in the functionality and operability of the system-on-chip as a whole. The underlying algorithms are generic and can be applied to big data analysis; in particular, the techniques and framework are very amenable to cloud computing architectures, both for scalability of processing power and data handling, and for enabling such analysis for organizations that would otherwise not have the means.

At present, predicting the success or failure of rare events before they actually occur is a multi-billion dollar proposition. Research, development and productization of rare event prediction technology is ongoing in finance, health, manufacturing, supply chain, business, workforce management and many other areas in which big data analytics are required. This is not just a mathematicians' dream, but a technology that can be applied in real-world applications for reducing cost, improving productivity, optimizing product performance, and optimizing financial outcomes. While the analytics and framework we developed can be applied to many diverse fields, here we disclose a unique case study of the technology as applied to Integrated Circuit (IC) manufacturing and design, one that has been successfully commercialized. By observing how the generic algorithms are applied in a specific field, it is easier to understand how they can be applied to multi-faceted applications in the predictive analytics domain.

With the rapid scaling of CMOS technology, die-to-die and intra-die process variation effects are increasing dramatically (Figure 1). To meet the demand for high density memory, designers use the smallest devices and the most aggressive design topologies & geometries for SRAM cells (static random access memory is the main type of on-chip memory used in microprocessors, networking chips, cell phone chips, etc.). The drive for small-dimension transistors leads to unavoidable manufacturing variations between neighboring transistors of the memory cell, due to variations in transistor physical features & dopants on the atomic scale. Namely, threshold voltage mismatch between neighboring devices can lead to a large number of fails in memory designs and can degrade SRAM performance and yield. When combined with other effects such as narrow width effects, SER, temperature and process variations, and parasitic transistor resistance, the scaling of SRAMs becomes increasingly difficult due to reduced margins [1-4]; end-of-life effects can further aggravate the situation [1]. The same applies to logic design. In fact, designing for the worst case is simply not feasible any more. Statistical timing techniques have been used to achieve full-chip and full-process coverage based on high-level models, and enable robust design practices [6]. Furthermore, statistical techniques have been shown to improve quality in the context of at-speed test. To enable full-chip analysis, however, these models sacrifice accuracy, and deal mainly with 3-sigma estimates.

Figure 1: Classification of variation sources. To ensure chip performance & yield, all of these sources of variation must be considered and controlled.

Accurate modeling and efficient statistical methodologies that can handle a large number of variables form the crux of predictive analytics. Memories in microprocessors occupy 50-60% of the area and are critical for storage. In the past, statistical analysis for logic (latches, decoders, etc.) and SRAM memory was not addressed adequately in the context of circuit design, especially where rare event failure estimation is involved; this is true not only from the performance perspective but also from the functional behavior perspective. Hence, there is a need to capture not only average logic delay distributions, but also possible design fails. As the number of elements (e.g., latches) in a design increases, it is possible that a rare functional fail could occur; this is especially true when we want to guarantee the yield of millions of chips. Furthermore, it is necessary to analyze the yield of the memory design in-situ with the peripheral logic. This raises the need for simultaneous statistical analysis of the memory/logic unit.

In this case study, we apply superfast Monte Carlo-compatible techniques, originally intended for memory analysis, to custom logic applications. We first revisit the methodology and its use as an analysis tool for different designs. We then use concrete examples to demonstrate the methodology for custom logic; namely, we go over case studies of memory interacting with logic in terms of local evaluation circuits undergoing Fast-Read-Before-Write. Finally, we conclude with examples of memory decode logic and hit-logic. Figure 2 provides an overview of the applications under study in terms of the components of a commonly used chip design. Recently published work [6] on an IBM POWER8 microprocessor shows close to 4.2B transistors with 12 cores and L2 and shared L3 SRAM cache memories.

As is the case with state-of-the-art microprocessors, memory units occupy 50-60% of the chip while logic occupies the rest. Hence, prediction of yields through variability analysis is of prime importance, targeting memory elements first and then logic.

Figure 2: Partitioning of logic and memory in state-of-the-art chip design ([7]).

Predictive Analytics for Memory Yield Design And Beyond
Traditionally, the Monte Carlo method has been adopted to estimate yield and fail probabilities of a given design. However, with increasing density and chip-yield demands, stringent requirements on the fail probability, in the range of one-per-million or lower, make it impossible to rely on traditional statistical methods, as illustrated in Figure 3. Moreover, as the number of input variables increases, Monte Carlo suffers an extraordinary slowdown and becomes impractical for rare failure events.
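To make the limitation concrete, the sketch below (illustrative only, not the authors' tool) runs a plain Monte Carlo estimate against a one-per-million fail rate with a realistic simulation budget; the sampler almost never observes a fail, so the estimate collapses to zero.

```python
import random

# Plain Monte Carlo against a rare fail event: with a true fail rate of
# 1e-6 and only 10,000 affordable samples, the expected number of
# observed fails is 0.01, so the estimator almost always returns 0.0.
def monte_carlo_fail_rate(p_fail, n_samples, seed=0):
    rng = random.Random(seed)
    fails = sum(1 for _ in range(n_samples) if rng.random() < p_fail)
    return fails / n_samples

est = monte_carlo_fail_rate(1e-6, 10_000)
```

Driving the relative error down to even 50% at a 1e-6 fail rate with plain Monte Carlo would take on the order of millions of circuit simulations per metric, which is the impracticality Figure 3 illustrates.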

Figure 3: Prior Art: Monte Carlo method and its alternatives can lead to inaccuracies in the yield estimate given the limited number of sample points.

In [4], we proposed mixture importance sampling as a comprehensive and computationally efficient method for purposes of estimating low fail probabilities of SRAM designs. The method relies on adjusting the (natural) Monte Carlo sampling function, to produce more samples in the important region(s) (see Figure 4). It is based on the following fact.

Ep[Q] = Eg[Q(x) · p(x)/g(x)]

where Ep[Q] is the expected value of Q with respect to the natural distribution p(x), and g(x) is the distorted sampling function; each sample drawn from g(x) is re-weighted by the ratio p(x)/g(x). The method is theoretically sound, and with the proper choice of g(x), we are able to obtain accurate results with a relatively small number of simulations. We refer the reader to [4] for more details.
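A minimal, self-contained sketch of this identity follows: it is a standard importance-sampling estimator, with an illustrative one-dimensional Gaussian standing in for real device models. The natural distribution p is a standard normal, the fail event is the rare tail x > 4, and the distorted function g shifts the sampling mean onto the fail boundary; the names and threshold are assumptions for illustration only.

```python
import math
import random

def normal_pdf(x, mu=0.0, sigma=1.0):
    # Gaussian density, used for both p (mu=0) and g (mu=threshold).
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def importance_sample_fail_prob(threshold=4.0, n=100_000, seed=1):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)   # draw from the shifted g(x)
        if x > threshold:               # indicator Q(x) of the fail event
            # re-weight by p(x)/g(x) so the estimate targets Ep[Q]
            total += normal_pdf(x) / normal_pdf(x, mu=threshold)
    return total / n

p_est = importance_sample_fail_prob()
```

The true tail probability P(X > 4) for a standard normal is about 3.17e-5. Plain Monte Carlo would need on the order of a million samples to see even a handful of fails, while the shifted sampler lands roughly half of its samples in the fail region and weights them back to the natural distribution.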

Figure 4: Importance sampling helps improve the rate of sampling in the important regions as opposed to traditional Monte Carlo.

For SRAM cells, important metrics such as read/write margins, stability and performance are subject to process variation, which can degrade yield. Figure 5 illustrates a schematic sketch of a 6-transistor SRAM cell; often, to improve yield, the cell and logic supplies are separated. Here, we allocate Vcs to the cell supply and Vdd to the bitline logic. We also consider two different cases: (1) wordline connected to Vdd, and (2) wordline connected to Vcs. We then rely on our methodology to study the yield under different dual-supply topologies and conditions [8]. Figure 6 illustrates an example of model-to-hardware corroboration for case 1, for combined stability results. Case 2 can be handled similarly. The same technique can be applied to logic.

Figure 5: Schematic of an SRAM cell, and possible dual supply scenarios.

Figure 6: (a) Operating and failure regions through proposed predictive methodology in the Vdd x Vcs space for case 1: wordline connected to Vdd.

Figure 6: (b) Hardware data for the operating/failure region shows close matching.

Predictive Analytics for High Dimensionality
Historically, high-dimension problems pose a special challenge in the field of statistical analysis, so analyses have generally been restricted to a modest number of dimensions, with 100 dimensions being a typical limit. However, many chip failure mechanisms may have contributing effects from hundreds or thousands of transistors, and each transistor may have several parameters of interest, resulting in 30,000 or more dimensions. Furthermore, for ease of use, designers would rather not identify the scope of transistors which may impact performance & failure, but rather let the tool perform all the diagnostics, resulting in even more parameters and consequently higher dimensions. To understand why high dimensions pose a special problem, consider the SRAM case of one dimension, where there is typically a perfect correlation between the probability of model parameter variation and output variation, as shown in Figure 7. However, in very high-dimension spaces, there is no correlation between distance from nominal (the likelihood of a sample point) and the failure probability of the chip. For example, Figure 7 illustrates a sample case with 1,000 dimensions where distance from nominal has no discernible correlation with failure. Real case data is also presented for a 126-dimension case, and it is evident that the pass/fail points overlap relative to their distance from nominal. In other words, the fail points cannot be distinguished from the pass points according to the likelihood of occurrence of the points. This is a fatal problem for methods that order Monte Carlo samples by distance to nominal or attempt to filter the points by such criteria. Rather, more sophisticated techniques are needed that consider the sensitivity of the failure to each dimension, not only at nominal but also in regions where failures are more likely.
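The concentration effect behind this can be reproduced in a few lines (synthetic Gaussian parameters, not the article's case data): in 1,000 dimensions, essentially every random sample lands at nearly the same distance from nominal, so distance carries almost no information about pass versus fail.

```python
import math
import random

# Distance from nominal of a random sample in d independent
# standard-normal dimensions. For large d this concentrates
# tightly around sqrt(d).
def distance_from_nominal(d, rng):
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))

rng = random.Random(42)
d = 1000
dists = [distance_from_nominal(d, rng) for _ in range(200)]

mean_dist = sum(dists) / len(dists)
spread = (max(dists) - min(dists)) / mean_dist
# All 200 samples sit within roughly +/-10% of sqrt(1000) ~ 31.6,
# so likelihood (distance) cannot separate rare fail points from
# the overwhelming mass of pass points.
```

This is why sorting or filtering Monte Carlo samples by distance to nominal breaks down in high dimensions, as the text argues.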

One of the critical aspects of predictive statistical technology is the error control algorithm and the ability to monitor and diagnose the tool's convergence. This is particularly critical and challenging for very high sigma applications such as that illustrated in Figure 8, where 20,000 samples are required to analyze the tail of the distribution out to 8σ. Note, on the right side of Figure 8, the framework provides continuously updated diagnostics about convergence, with upper & lower confidence intervals, versus sample number.
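As a rough sketch of what such convergence diagnostics might look like (the real tool's error-control internals are not described here, and the sample stream below is purely synthetic), a running fail-rate estimate with a normal-approximation confidence interval can be updated after every sample:

```python
import math

def running_confidence(samples, z=1.96):
    """Yield (n, estimate, lower, upper) after each pass/fail sample,
    using a 95% normal-approximation confidence interval."""
    fails = 0
    for n, failed in enumerate(samples, start=1):
        fails += int(failed)
        p = fails / n
        half = z * math.sqrt(max(p * (1 - p), 1e-12) / n)
        yield n, p, max(0.0, p - half), min(1.0, p + half)

# Synthetic stream: 1 fail in 100 samples -> estimate 0.01, with the
# upper/lower bounds tracking how trustworthy that estimate is so far.
stream = [False] * 99 + [True]
n, p, lo, hi = list(running_confidence(stream))[-1]
```

Plotting the upper and lower bounds against sample number, as in the right side of Figure 8, shows at a glance whether the tail estimate has converged or more samples are needed.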

Figure 7: No correlation between distance from nominal (likelihood of a sample point), and failure probability in high dimension (not to scale). Case data with overlapping pass/fail samples.

Figure 8: >7σ high sigma analysis for a 6T SRAM bitcell

More Stories By Rajiv V. Joshi

Dr. Rajiv V. Joshi is a research staff member at the T. J. Watson Research Center, IBM. He received his B.Tech from I.I.T. (Bombay, India), M.S. from M.I.T., and Dr. Eng. Sc. from Columbia University. He developed novel interconnect processes and structures for aluminum, tungsten and copper technologies, which are widely used in IBM across technologies from sub-0.5μm to 14nm. He has successfully led pervasive statistical methodology for yield prediction as well as the technology-driven SRAM effort at the IBM Server Group, and has commercialized these techniques. He has received three Outstanding Technical Achievement awards (OTAs) and three of IBM's highest Corporate Patent Portfolio awards for licensing contributions, holds 54 invention plateaus, and has over 200 US patents and over 350 patents including international filings.

Dr. Joshi has authored and co-authored over 175 papers. He is a recipient of the 2013 IEEE CAS Industrial Pioneer award and the 2013 Mehboob Khan Award from the Semiconductor Research Corporation. He is a Distinguished Lecturer for the IEEE CAS and EDS societies. He is an IEEE and ISQED Fellow and a distinguished alumnus of IIT Bombay. He serves as an Associate Editor of TVLSI. He has served on the committees of ISLPED (Int. Symposium on Low Power Electronic Design), IEEE VLSI Design, IEEE CICC, IEEE Int. SOI Conf., ISQED, and the Advanced Metallization Program. He is an industry liaison for universities as a part of the Semiconductor Research Corporation.

More Stories By Bruce W. McGaughy

Dr. Bruce W. McGaughy is a Distinguished Engineer and Simulation Chief Architect. He received a BS in Electrical Engineering from the University of Illinois at Urbana/Champaign and an MS and PhD in Electrical Engineering and Computer Science from the University of California at Berkeley, in 1994, 1995 and 1997, respectively. He has conducted and published research in the fields of circuit simulation, device physics, reliability, electronic design automation, computer architecture and fault-tolerant computing. Prior to his current assignment, he worked for Integrated Device Technology (IDT), Siemens, Intel, Berkeley Technology Associates, and Celestry. In 2003, Dr. McGaughy was the group director in charge of circuit simulation R&D at Cadence, including Spectre, SpectreRF and UltraSim.

In 2006, Dr. McGaughy became the Distinguished Engineer and Chief Architect for Cadence simulation products, including Spectre, SpectreRF, UltraSim and AMS Designer. In 2008, he joined ProPlus Design Solutions as Senior VP of Engineering and Chief Technology Officer. He is in charge of all of ProPlus' R&D efforts, including the BsimProPlus model extraction platform, the NanoSpice parallel SPICE simulator, and the NanoYield DFY platform.

