Apache Spark vs. Hadoop

If you’re running Big Data applications, you’re going to want to look at some kind of distributed processing system. Hadoop is one of the best-known clustering systems, but how are you going to process all your data in a reasonable time frame? Apache Spark offers services that go beyond a standard MapReduce cluster.

A choice of job styles
MapReduce has become a standard, perhaps the standard, for distributed data processing on Hadoop clusters. While it’s a great system already, it’s really geared toward batch use, with jobs queued up and results arriving later. This can severely hamper your flexibility. What if you want to explore some of your data? If it’s going to take all night, forget about it.

With Apache Spark, you can act on your data in whatever way you want. Want to look for interesting tidbits in your data? You can perform some quick queries. Want to run something you know will take a long time? You can use a batch job. Want to process your data streams in real time? You can do that too.

One big advantage of modern programming languages is the interactive shell. Sure, Lisp did that back in the ’60s, but it was a long time before that kind of interactive power became available to the average programmer. Spark ships with interactive shells for both Python and Scala, so you can try out your ideas in real time and develop algorithms iteratively, without the time-consuming write/compile/test/debug cycle.

RDDs
The key to Spark’s flexibility is the Resilient Distributed Dataset, or RDD. An RDD maintains a lineage of everything that’s done to your data: a record of each transformation, such as map or join, that produced it from other datasets. This means that it’s possible to recover from failures by replaying those transformations to rebuild lost data (which is why they’re called Resilient Distributed Datasets).
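The lineage idea can be sketched in a few lines of plain Python. This is a toy, single-machine model and not Spark’s API: the ToyDataset class and its compute method are hypothetical names made up for illustration. It shows only the principle that a dataset can be rebuilt by replaying its recorded transformations against the original source.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ToyDataset:
    """Hypothetical single-machine stand-in for an RDD (not Spark's API)."""
    source: Tuple[int, ...]                          # the original input data
    lineage: Tuple[Tuple[str, Callable], ...] = ()   # ordered (kind, function) steps

    def map(self, fn: Callable) -> "ToyDataset":
        # Return a new dataset with one more lineage step; nothing is mutated.
        return ToyDataset(self.source, self.lineage + (("map", fn),))

    def filter(self, fn: Callable) -> "ToyDataset":
        return ToyDataset(self.source, self.lineage + (("filter", fn),))

    def compute(self) -> List[int]:
        # Rebuild the result from the source by replaying every recorded step.
        data = list(self.source)
        for kind, fn in self.lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


base = ToyDataset(source=(1, 2, 3, 4, 5))
derived = base.map(lambda x: x * 10).filter(lambda x: x > 20)

cached = derived.compute()      # materialized result held "in memory"
cached = None                   # simulate losing that in-memory copy
recovered = derived.compute()   # replay the lineage to rebuild it
print(recovered)                # [30, 40, 50]
```

In a real cluster the data is split into partitions across machines, so only the lost partitions would need to be recomputed; the toy above replays everything for simplicity.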

RDDs also keep data in memory, which is a lot faster than always pulling data off of disks, even with SSDs making their way into data centers. While working on very large in-memory datasets might seem like a recipe for slow performance, Spark uses lazy evaluation, only carrying out transformations when you specifically ask for a result. This is why queries can come back so quickly even on very large datasets.

You might have recognized the term “lazy evaluation” from functional programming languages like Haskell. RDDs are only computed when a specific action asks for some kind of output; for example, writing to a text file. You can define a complex query over your data, but it won’t actually be evaluated until you ask for the result. And the query might only need a specific subset of your data instead of plowing through the whole thing. This lazy evaluation lets you build up complex queries on large datasets without paying for work you never use.
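Python’s own lazy iterators give a rough feel for this behavior. The sketch below is an analogy under stated assumptions, not Spark code: the parse function and the processed list are hypothetical helpers added only to make visible which records were actually touched.

```python
from itertools import islice

processed = []  # records which inputs were actually touched


def parse(record):
    processed.append(record)   # side effect only to make evaluation visible
    return record * 2


records = range(1_000_000)

# Building the pipeline does no work yet: map() and filter() return lazy iterators.
pipeline = filter(lambda x: x % 3 == 0, map(parse, records))
touched_before = len(processed)

# The "action": asking for the first three results pulls only as much input as needed.
first_three = list(islice(pipeline, 3))
touched_after = len(processed)

print(touched_before)  # 0
print(first_three)     # [0, 6, 12]
print(touched_after)   # 7
```

Only seven of the million records are ever parsed, because nothing runs until a result is requested and the request stops as soon as it is satisfied. Spark gets its laziness by a different mechanism, planning a job from the recorded transformations when an action is called, but the effect the paragraph above describes is the same.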

RDDs are also immutable, which leads to greater protection against data loss even though they’re in memory. In case of an error, Spark can go back to the last intact point in an RDD’s lineage and recompute from there rather than relying on a checkpoint-based system on a disk.

Spark and Hadoop, Not as Different as You Think
Speaking of disks, you might be wondering whether Spark replaces a Hadoop cluster. That’s really a false dichotomy. Hadoop and Spark work together. While Spark provides the processing, Hadoop handles the actual storage (HDFS) and resource management (YARN). After all, you can’t keep data in memory forever.

With the combination of Spark and Hadoop in the same cluster, you can cut down on a lot of the overhead of maintaining separate clusters. This combined cluster can scale out with your Big Data operations as you add nodes.

Who’s Using Spark?
When you have your Big Data cluster in place, you’ll be able to do lots of interesting things. Use cases range from genome sequencing analysis to digital advertising to fraud detection: one major credit card company uses Spark to match thousands of transactions at once to flag possible fraud. Cisco does something similar with a cloud-based security product to spot possible hacking before it turns into a major data breach. Geneticists use it to match genes to new medicines.

Conclusion
Apache Spark builds on Hadoop and then goes beyond it by adding stream processing capabilities. The MapR distribution is the only one that offers everything you need right out of the box to enable real-time data processing.

For a more in-depth view into how Spark and Hadoop benefit from each other, read chapter four of the free interactive ebook: Getting Started with Apache Spark: From Inception to Production, by James A. Scott.

More Stories By Jim Scott

Jim has held positions running Operations, Engineering, Architecture and QA teams in the Consumer Packaged Goods, Digital Advertising, Digital Mapping, Chemical and Pharmaceutical industries. Jim has built systems that handle more than 50 billion transactions per day and his work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop.
