Big Data Analytics By @TheEbizWizard | @CloudExpo #BigData

Big Data Analytics Raises the Bar for Data Preparation

Had Mark Twain lived today, we might hear him utter the oath “lies, damned lies, and analytics.” Statistics, to be sure, may still be used to distort the truth – but now, with the sudden explosion of big data, analytics threaten to do the same.

I’m not talking about intentional distortion here – that’s another story entirely. Rather, the risk of unintentional distortion via data analytics is becoming increasingly prevalent as the sheer quantity of data grows, along with the availability and usability of the analytics tools on the market.

The data scientists themselves aren’t the problem. In fact, the more qualified data scientists we have, the better. But there aren’t enough of these rare professionals to go around.

Furthermore, the ease of use and availability of increasingly mature analytics and other business intelligence (BI) tools are opening up the world of “hands on” analytics to an increasingly broad business audience – few of whom have any particular training in data science.

Are today’s BI tools to blame for this problem? Not really – after all, the tools are unquestionably getting better and better. The root of the problem is data preparation.

After all, the smartest analytics tool in the world can only do so much with poorly organized, incomplete, or incorrect input data – the proverbial garbage-in, garbage-out problem, now compounded by the diversity of data types, levels of structure, and overall context challenges that today’s big data represent.

’Twas not always thus. Back in the good old first-generation data warehouse days, data preparation tasks were more straightforward, and the people responsible for tackling these activities did so for a living.

Now, data preparation is more diverse and challenging, and we’re asking data laypeople to shoehorn big data sets as best they can into their newfangled analytics tools. No wonder the end result can be such a mess.

A Closer Look at Data Preparation
Integrating multiple data sources, either by physically moving them or via data virtualization, typically involves data preparation. Traditional preparation tasks often include:

  • Bringing basic metadata like column names and numeric value types into a consistent state, for example, by renaming columns or changing all numbers into the same kind of integer.
  • Rudimentary data transformations, for example, taking a field that contains people’s full names and splitting them into first name and last name fields.
  • Making sure missing values are handled consistently. Is a missing value the same as an empty string, or perhaps the dreaded NULL?
  • Routine aggregation tasks, like counting all the records in a particular ZIP Code and entering the total into a separate field.
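To make the four traditional tasks above concrete, here is a minimal sketch in plain Python. The records, field names, and the list of missing-value markers are all invented for illustration; a real data set would need its own rules.

```python
from collections import Counter

# Hypothetical raw records; field names and values are invented for illustration.
raw_records = [
    {"Full Name": "Ada Lovelace", "ZIP": "02139", "Visits": "3"},
    {"Full Name": "Alan Turing", "ZIP": "02139", "Visits": ""},
    {"Full Name": "Grace Hopper", "ZIP": "10001", "Visits": "NULL"},
]

# Strings this sketch treats as "missing" (an assumption, not a standard).
MISSING_MARKERS = {"", "NULL", "N/A"}


def to_int_or_none(value):
    """Map every missing-value marker to None and everything else to int."""
    if value is None:
        return None
    text = str(value).strip()
    if text in MISSING_MARKERS:
        return None
    return int(text)


def prepare(record):
    """Rename columns, split the full name, and normalize numeric types."""
    # Naive split on the first space; middle names end up in last_name.
    first, _, last = record["Full Name"].partition(" ")
    return {
        "first_name": first,
        "last_name": last,
        "zip_code": record["ZIP"],
        "visits": to_int_or_none(record["Visits"]),
    }


prepared = [prepare(r) for r in raw_records]

# Routine aggregation: count the records in each ZIP Code ...
records_per_zip = Counter(r["zip_code"] for r in prepared)

# ... and enter the total into a separate field on every record.
for row in prepared:
    row["records_in_zip"] = records_per_zip[row["zip_code"]]
    print(row)
```

Note that the empty string and the literal text "NULL" both end up as `None`, so later steps have exactly one representation of "missing" to deal with.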

So far so good – while a data expert will have no problems with these tasks, many an Excel-savvy business analyst can tackle them without distorting results as well.

However, when big data enter the picture, data preparation becomes more complicated, as the variety of data structures and the volume of information increase. Additional data preparation activities may now include:

  • Data wrangling – the manual conversion of data from one raw form to another, especially when the data aren’t in a tabular format. What do you do if your source data contain, say, video files, Word documents, and Twitter streams, all mixed together?
  • Semantic processing – extracting entities from textual data, for example, identifying people and place names. Semantic processing may also include the resolution of ambiguities, for example, recognizing whether “Paris Hilton” is a socialite or a hotel.
  • Mathematical processing – yes, even statistics may be useful here. There are numerous mathematical approaches for identifying clusters or other patterns in information that will help with further analysis.
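As one example of the mathematical processing mentioned above, the sketch below runs a very small k-means clustering (Lloyd's algorithm) over one-dimensional values. The order totals are made up, and the evenly spread starting centers are a simplification chosen to keep the example deterministic; production work would normally use an established library rather than hand-rolled code.

```python
def kmeans_1d(values, k, iterations=20):
    """Tiny k-means for one-dimensional data; returns (centers, clusters)."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of values")
    if iterations < 1:
        raise ValueError("iterations must be at least 1")

    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: take k values spread evenly across the sorted data.
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]

    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)

        # Update step: move each center to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # Converged: assignments can no longer change.
        centers = new_centers

    return centers, clusters


# Hypothetical order totals with two obvious groups.
order_totals = [12, 15, 14, 13, 98, 102, 95, 101, 16]
centers, clusters = kmeans_1d(order_totals, k=2)
print(centers)
print(clusters)
```

Even in this toy form, the result depends on the choice of `k` and on the starting centers, which is the kind of judgment call that can trip up an untrained user.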

It’s important to note that the challenge with these more advanced data preparation techniques isn’t simply that inexperienced people won’t be able to perform them. The worry is that they will think they are properly preparing the data, when in fact they are doing it wrong. The end result will hopefully be obviously incorrect, but an even more dangerous scenario is when the final analysis seems correct but in reality is not.
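A small, hypothetical illustration of an analysis that seems correct but is not: suppose survey scores run from 1 to 10 and some respondents gave no answer. If a preparation step quietly converts those missing answers to zero, the average still looks like a perfectly plausible score.

```python
from statistics import mean

# Hypothetical survey scores; None marks a respondent who gave no answer.
scores = [8, 9, None, 7, None, 8]

# Mistake: silently treating "no answer" as a score of zero.
as_zero = [0 if s is None else s for s in scores]
naive_mean = mean(as_zero)

# More careful: exclude missing answers and report how many were dropped.
answered = [s for s in scores if s is not None]
careful_mean = mean(answered)
missing_count = len(scores) - len(answered)

print(f"naive mean:   {naive_mean:.2f}")
print(f"careful mean: {careful_mean:.2f} ({missing_count} missing)")
```

Nothing in the first number signals a problem, which is why this class of error is harder to catch than a result that is obviously absurd.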

Addressing Data Preparation Challenges
A common knee-jerk reaction to the scenarios described above is simply to establish rules to prevent unqualified users from monkeying with data preparation tasks in the first place. However, such draconian data governance measures typically have no place in a modern data-centric business environment.

The better approach is to provide additional data preparation and data integration tooling that data professionals may configure, but business analysts and other business users may use to prepare data for themselves. In other words, establish a governed, self-service model for data preparation.

For example, data professionals can preconfigure the reusable Snaps from SnapLogic so that they can handle the messier details of data preparation, as well as data access and other transformation tasks. The broader audience of users can then assemble data pipelines simply by snapping together the Snaps. See the illustration below for a SnapLogic pipeline that these “citizen integrators” can create to combine data.


SnapLogic Pipeline (source: SnapLogic)

It’s also possible to create nested sub-pipelines, so that business users can combine preconfigured sub-pipelines as well as Snaps into larger data integration pipelines. Such pipelines can be made up of many levels of nested sub-pipelines, and SnapLogic can guarantee the delivery of data from each sub-pipeline (much the same as traditional queues offer guaranteed delivery, extended to many other types of Snaps).
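The general idea of nesting preconfigured building blocks can be sketched in a few lines of Python. This is not SnapLogic's API or implementation, and it does not model guaranteed delivery; every function name below is hypothetical and exists only to show how a sub-pipeline can be reused as a single step inside a larger pipeline.

```python
from typing import Callable, List

Record = dict
Step = Callable[[List[Record]], List[Record]]


def pipeline(*steps: Step) -> Step:
    """Compose steps into one step, so a pipeline can be nested in another."""
    def run(records: List[Record]) -> List[Record]:
        for step in steps:
            records = step(records)
        return records
    return run


# Steps a data professional might preconfigure (all names hypothetical).
def drop_incomplete(records):
    """Discard records that have no email address."""
    return [r for r in records if r.get("email")]


def normalize_email(records):
    """Trim whitespace and lowercase the email field."""
    return [{**r, "email": r["email"].strip().lower()} for r in records]


def deduplicate(records):
    """Keep only the first record seen for each email address."""
    seen, unique = set(), []
    for r in records:
        if r["email"] not in seen:
            seen.add(r["email"])
            unique.append(r)
    return unique


# A preconfigured sub-pipeline, reusable as a single building block.
clean_contacts = pipeline(drop_incomplete, normalize_email)

# A business user assembles a larger pipeline from the sub-pipeline plus a step.
contact_list = pipeline(clean_contacts, deduplicate)

result = contact_list([
    {"name": "Ada", "email": " Ada@Example.com "},
    {"name": "Ada L.", "email": "ada@example.com"},
    {"name": "Alan", "email": ""},
])
print(result)
```

Because `pipeline` returns something with the same shape as a single step, the person assembling `contact_list` never needs to know what happens inside `clean_contacts`.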

SnapLogic also offers a sub-pipeline review, allowing both experts and business users to see the processing steps within each sub-pipeline, as well as relevant data governance capabilities that support this self-service data preparation approach. For example, it offers a lifecycle management feature that allows for the comparison and testing of Snaps and sub-pipelines before business users get their hands on them.

The Intellyx Take
In the case of SnapLogic, it falls to the data integration layer to resolve the challenges with data preparation. In truth, SnapLogic is essentially a data integration tooling vendor – but there is an important lesson here: data preparation is in reality an aspect of data integration, and in fact, data governance is part of the data integration story as well.

As enterprises leverage big data across their organizations, it becomes increasingly important to support the full breadth of personnel who will be working with such information, in order to get the best results from the resulting analysis. Leveraging data preparation capabilities like those found in SnapLogic’s pipelines is a critical enabler of useful, accurate data analysis.

SnapLogic is an Intellyx client, but Intellyx retains full editorial control of this article.

More Stories By Jason Bloomberg

Jason Bloomberg is a leading IT industry analyst, Forbes contributor, keynote speaker, and globally recognized expert on multiple disruptive trends in enterprise technology and digital transformation. He is ranked #5 on Onalytica’s list of top Digital Transformation influencers for 2018 and #15 on Jax’s list of top DevOps influencers for 2017, the only person to appear on both lists.

As founder and president of Agile Digital Transformation analyst firm Intellyx, he advises, writes, and speaks on a diverse set of topics, including digital transformation, artificial intelligence, cloud computing, devops, big data/analytics, cybersecurity, blockchain/bitcoin/cryptocurrency, no-code/low-code platforms and tools, organizational transformation, internet of things, enterprise architecture, SD-WAN/SDX, mainframes, hybrid IT, and legacy transformation, among other topics.

Mr. Bloomberg’s articles in Forbes are often viewed by more than 100,000 readers. During his career, he has published over 1,200 articles (over 200 for Forbes alone), spoken at over 400 conferences and webinars, and he has been quoted in the press and blogosphere over 2,000 times.

Mr. Bloomberg is the author or coauthor of four books: The Agile Architecture Revolution (Wiley, 2013), Service Orient or Be Doomed! How Service Orientation Will Change Your Business (Wiley, 2006), XML and Web Services Unleashed (SAMS Publishing, 2002), and Web Page Scripting Techniques (Hayden Books, 1996). His next book, Agile Digital Transformation, is due within the next year.

At SOA-focused industry analyst firm ZapThink from 2001 to 2013, Mr. Bloomberg created and delivered the Licensed ZapThink Architect (LZA) Service-Oriented Architecture (SOA) course and associated credential, certifying over 1,700 professionals worldwide. He is one of the original Managing Partners of ZapThink LLC, which was acquired by Dovel Technologies in 2011.

Prior to ZapThink, Mr. Bloomberg built a diverse background in eBusiness technology management and industry analysis, including serving as a senior analyst in IDC’s eBusiness Advisory group, as well as holding eBusiness management positions at USWeb/CKS (later marchFIRST) and WaveBend Solutions (now Hitachi Consulting), and several software and web development positions.
