Big Data Analytics By @TheEbizWizard | @CloudExpo #BigData

Big Data Analytics Raises the Bar for Data Preparation

Had Mark Twain lived today, we might hear him utter the oath “lies, damn lies, and analytics.” Statistics, to be sure, may still be used to distort the truth – but now, with the sudden explosion of big data, analytics can distort it just as readily.

I’m not talking about intentional distortion here – that’s another story entirely. Rather, the risk of unintentional distortion via data analytics is growing as the sheer quantity of data increases, along with the availability and usability of the analytics tools on the market.

The data scientists themselves aren’t the problem. In fact, the more qualified data scientists we have, the better. But there aren’t enough of these rare professionals to go around.

Furthermore, the ease of use and availability of increasingly mature analytics and other business intelligence (BI) tools are opening up the world of “hands on” analytics to an increasingly broad business audience – few of whom have any particular training in data science.

Are today’s BI tools to blame for this problem? Not really – after all, the tools are unquestionably getting better and better. The root of the problem is data preparation.

After all, the smartest analytics tool in the world can only do so much with poorly organized, incomplete, or incorrect input data – the proverbial garbage-in, garbage-out problem, now compounded by the diversity of data types, levels of structure, and overall context challenges that today’s big data represent.

’Twas not always thus. Back in the good old first-generation data warehouse days, data preparation tasks were more straightforward, and the people responsible for tackling these activities did so for a living.

Now, data preparation is more diverse and challenging, and we’re asking data laypeople to shoehorn big data sets as best they can into their newfangled analytics tools. No wonder the end result can be such a mess.

A Closer Look at Data Preparation
Integrating multiple data sources, either by physically moving them or via data virtualization, typically involves data preparation. Traditional preparation tasks often include:

  • Bringing basic metadata like column names and numeric value types into a consistent state, for example, by renaming columns or converting all numbers to the same numeric type.
  • Rudimentary data transformations, for example, taking a field that contains people’s full names and splitting it into first name and last name fields.
  • Making sure missing values are handled consistently. Is a missing value the same as an empty string, or perhaps the dreaded NULL?
  • Routine aggregation tasks, like counting all the records in a particular ZIP Code and entering the total into a separate field.
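
As a rough illustration, the four tasks above might look like this in plain Python. This is a minimal sketch, not a recommended tool: the sample records, the column names, and the list of NULL-like markers are all assumptions made up for the example.

```python
from collections import Counter

# Hypothetical raw records with inconsistent column names and value types.
RAW_ROWS = [
    {"Full Name": "Ada Lovelace", "ZIP": "02139", "Orders": "3"},
    {"Full Name": "Alan Turing", "ZIP": "02139", "Orders": ""},
    {"Full Name": "Grace Hopper", "ZIP": "10001", "Orders": "NULL"},
]

# Assumed spellings of "missing"; a real data set needs its own list.
MISSING_MARKERS = {"", "null", "n/a", "none"}


def normalize_missing(value):
    """Map empty strings and NULL-like markers to a single representation (None)."""
    if value is None or str(value).strip().lower() in MISSING_MARKERS:
        return None
    return value


def prepare(row):
    """Apply the basic preparation steps to one record."""
    # Consistent column names: lower-case, underscores instead of spaces.
    clean = {key.strip().lower().replace(" ", "_"): normalize_missing(val)
             for key, val in row.items()}
    # Consistent numeric type: every order count becomes an int (or stays None).
    if clean["orders"] is not None:
        clean["orders"] = int(clean["orders"])
    # Rudimentary transformation: split the full name on the first space.
    first, _, last = (clean.pop("full_name") or "").partition(" ")
    clean["first_name"], clean["last_name"] = first or None, last or None
    return clean


prepared = [prepare(row) for row in RAW_ROWS]

# Routine aggregation: count the records per ZIP Code and store the total
# in a separate field on each record.
records_per_zip = Counter(row["zip"] for row in prepared)
for row in prepared:
    row["zip_record_count"] = records_per_zip[row["zip"]]
```

Note that the ZIP Code stays a string so that leading zeros survive, and that a missing order count stays missing rather than quietly becoming zero.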

So far so good – while a data expert will have no problems with these tasks, many an Excel-savvy business analyst can also tackle them without distorting the results.

However, when big data enter the picture, data preparation becomes more complicated, as the variety of data structure and the volume of information increase. Additional data preparation activities may now include:

  • Data wrangling – the manual conversion of data from one raw form to another, especially when the data aren’t in a tabular format. What do you do if your source data contain, say, video files, Word documents, and Twitter streams, all mixed together?
  • Semantic processing – extracting entities from textual data, for example, identifying people and place names. Semantic processing may also include the resolution of ambiguities, for example, recognizing whether “Paris Hilton” is a socialite or a hotel.
  • Mathematical processing – yes, even statistics may be useful here. There are numerous mathematical approaches for identifying clusters or other patterns in information that will help with further analysis.
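
To make the ambiguity problem concrete, here is a deliberately naive sketch of resolving “Paris Hilton” from the surrounding words. The keyword lists and the function name are hypothetical, and real semantic processing tools use far richer models; the point is only that the resolver should decline to answer when the evidence is thin.

```python
import re

# Hypothetical context vocabularies for two senses of the same surface form.
SENSES = {
    "person": {"socialite", "heiress", "celebrity", "she", "her", "starred"},
    "hotel": {"hotel", "booked", "room", "rooms", "stay", "stayed", "lobby"},
}


def disambiguate(text, mention="Paris Hilton"):
    """Return the best-matching sense for `mention`, or None if undecidable."""
    if mention.lower() not in text.lower():
        return None
    words = set(re.findall(r"[a-z]+", text.lower()))
    # Score each sense by how many of its context words appear in the text.
    scores = {sense: len(words & vocab) for sense, vocab in SENSES.items()}
    ranked = sorted(scores.values(), reverse=True)
    # Refuse to guess when there is no evidence or the senses tie.
    if ranked[0] == 0 or ranked[0] == ranked[1]:
        return None
    return max(scores, key=scores.get)


hotel_case = disambiguate("We booked a room at the Paris Hilton for our stay.")
person_case = disambiguate("Paris Hilton, the socialite, said she was delighted.")
unclear_case = disambiguate("Paris Hilton was in the news.")
```

Under these assumptions the first sentence resolves to the hotel, the second to the person, and the third returns None instead of a guess.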

It’s important to note that the challenge with these more advanced data preparation techniques isn’t simply that inexperienced people won’t be able to perform them. The worry is that they will think they are properly preparing the data, when in fact they are doing it wrong. With luck, the end result will be obviously incorrect, but an even more dangerous scenario is when the final analysis seems correct but in reality is not.
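
A small worked example shows how this can happen. The figures below are invented for illustration: a survey in which three of seven respondents skipped a spending question. Treating the skipped answers as zero runs without error and produces a perfectly believable average, which is exactly the problem.

```python
from statistics import mean

# Hypothetical survey export: None marks a respondent who skipped the question.
monthly_spend = [120.0, None, 95.0, None, 110.0, None, 105.0]

# Mistake: treating "missing" as zero. The code runs and yields a plausible number.
as_zero = mean(0.0 if v is None else v for v in monthly_spend)

# One defensible alternative: average only the observed values and report
# how much of the data was missing.
observed = [v for v in monthly_spend if v is not None]
observed_mean = mean(observed)
missing_share = 1 - len(observed) / len(monthly_spend)

print(f"missing treated as 0: {as_zero:.2f}")
print(f"observed only:        {observed_mean:.2f} ({missing_share:.0%} missing)")
```

The first figure comes out at roughly 61 and the second at 107.5. Neither number looks wrong on a dashboard, so nothing prompts the analyst to question the first one.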

Addressing Data Preparation Challenges
A common knee-jerk reaction to the scenarios described above is simply to establish rules to prevent unqualified users from monkeying with data preparation tasks in the first place. However, such draconian data governance measures typically have no place in a modern data-centric business environment.

The better approach is to provide additional data preparation and data integration tooling that data professionals may configure, but business analysts and other business users may use to prepare data for themselves. In other words, establish a governed, self-service model for data preparation.

For example, data professionals can preconfigure the reusable Snaps from SnapLogic so that they can handle the messier details of data preparation, as well as data access and other transformation tasks. The broader audience of users can then assemble data pipelines simply by snapping together the Snaps. See the illustration below for a SnapLogic pipeline that these “citizen integrators” can create to combine data.

[Figure: SnapLogic Pipeline (source: SnapLogic)]

It’s also possible to create nested sub-pipelines, so that business users assemble pre-assembled and preconfigured sub-pipelines as well as Snaps into larger data integration pipelines. Such pipelines can be made up of many levels of nested sub-pipelines, and SnapLogic can guarantee the delivery of data from each sub-pipeline (much the same as traditional queues offer guaranteed delivery, extended to many other types of Snaps).
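
The nesting idea can be sketched generically as function composition. To be clear, this is not SnapLogic’s API: every name below is hypothetical, and the sketch makes no attempt to model guaranteed delivery. It only shows how a pipeline that an expert has preconfigured can be dropped into a larger pipeline as a single step.

```python
from typing import Callable, List

Record = dict
Step = Callable[[List[Record]], List[Record]]


def pipeline(*steps: Step) -> Step:
    """Compose steps into one step, so a pipeline can itself be nested as a step."""
    def run(records: List[Record]) -> List[Record]:
        for step in steps:
            records = step(records)
        return records
    return run


# Steps a data professional might preconfigure (hypothetical names).
def drop_incomplete(records: List[Record]) -> List[Record]:
    """Discard records that contain any missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def tag_source(name: str) -> Step:
    """Build a step that labels every record with its source system."""
    def step(records: List[Record]) -> List[Record]:
        return [{**r, "source": name} for r in records]
    return step


def uppercase_region(records: List[Record]) -> List[Record]:
    """Normalize the region code to upper case."""
    return [{**r, "region": r["region"].upper()} for r in records]


# A preconfigured sub-pipeline, reused as one step inside a larger pipeline.
clean_crm = pipeline(drop_incomplete, tag_source("crm"))
full = pipeline(clean_crm, uppercase_region)

result = full([{"id": 1, "region": "east"}, {"id": 2, "region": None}])
```

Because `clean_crm` has the same shape as any single step, a business user assembling `full` does not need to know what happens inside it.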

SnapLogic also offers a sub-pipeline review, allowing both experts and business users to see the processing steps within each sub-pipeline, as well as relevant data governance capabilities that support this self-service data preparation approach. For example, it offers a lifecycle management feature that allows for the comparison and testing of Snaps and sub-pipelines before business users get their hands on them.

The Intellyx Take
In the case of SnapLogic, it falls to the data integration layer to resolve the challenges with data preparation. In truth, SnapLogic is essentially a data integration tooling vendor – but there is an important lesson here: data preparation is in reality an aspect of data integration, and in fact, data governance is part of the data integration story as well.

As enterprises leverage big data across their organizations, it becomes increasingly important to support the full breadth of personnel who will be working with such information, in order to get the best results from the resulting analysis. Leveraging data preparation capabilities like those found in SnapLogic’s pipelines is a critical enabler of useful, accurate data analysis.

SnapLogic is an Intellyx client, but Intellyx retains full editorial control of this article.

More Stories By Jason Bloomberg

Jason Bloomberg is the leading expert on architecting agility for the enterprise. As president of Intellyx, Mr. Bloomberg brings his years of thought leadership in the areas of Cloud Computing, Enterprise Architecture, and Service-Oriented Architecture to a global clientele of business executives, architects, software vendors, and Cloud service providers looking to achieve technology-enabled business agility across their organizations and for their customers. His latest book, The Agile Architecture Revolution (John Wiley & Sons, 2013), sets the stage for Mr. Bloomberg’s groundbreaking Agile Architecture vision.

Mr. Bloomberg is perhaps best known for his twelve years at ZapThink, where he created and delivered the Licensed ZapThink Architect (LZA) SOA course and associated credential, certifying over 1,700 professionals worldwide. He is one of the original Managing Partners of ZapThink LLC, the leading SOA advisory and analysis firm, which was acquired by Dovel Technologies in 2011. He now runs the successor to the LZA program, the Bloomberg Agile Architecture Course, around the world.

Mr. Bloomberg is a frequent conference speaker and prolific writer. He has published over 500 articles, spoken at over 300 conferences, Webinars, and other events, and has been quoted in the press over 1,400 times as the leading expert on agile approaches to architecture in the enterprise.

Mr. Bloomberg’s previous book, Service Orient or Be Doomed! How Service Orientation Will Change Your Business (John Wiley & Sons, 2006, coauthored with Ron Schmelzer), is recognized as the leading business book on Service Orientation. He also co-authored the books XML and Web Services Unleashed (SAMS Publishing, 2002), and Web Page Scripting Techniques (Hayden Books, 1996).

Prior to ZapThink, Mr. Bloomberg built a diverse background in eBusiness technology management and industry analysis, including serving as a senior analyst in IDC’s eBusiness Advisory group, as well as holding eBusiness management positions at USWeb/CKS (later marchFIRST) and WaveBend Solutions (now Hitachi Consulting).
