Click here to close now.

Welcome!

Java Authors: Pat Romanski, Carmen Gonzalez, Yeshim Deniz, Liz McMillan, Elizabeth White

Related Topics: Cloud Expo, Java, Microservices Journal, Virtualization, Open Web, Apache

Cloud Expo: Blog Post

Cloudera Impala – Closing the Near Real Time Gap Working with Big Data

Building data structures and loading data

By

On October 24, 2012 Cloudera announced the release of Cloudera Impala and the commercial support subscription service of Cloudera Enterprise Real Time Query (RTQ). During the Hadoop World/STRATA Conference in NYC, I was invited over to see a demonstration. Impala is a SQL based Real Time Query/Ad Hoc query engine built on top of HDFS or Hbase. As I watched the demonstration unfold, I wondered if one of the remaining technology gaps in the NOSQL arsenal had been closed.  What gap you ask? Near Real Time Analytics on a NOSQL stack. Working with customers across the Cyber Security customer space, not only do they face the familiar BIGDATA horsemen of the apocalypse: Volume, Velocity and Variety but one more large challenge crept in: Time (V3T).  The Near Real Time Analysis/Near Real Time Analytic capability that Cloudera Impala provides is essential in many high value use cases associated with Cyber Security: comparing current activity with observed historical norms, correlation of many disparate data sources/enrichment and automated threat detection algorithms.

When the demonstration concluded, the Cloudera representatives and I discussed the potential of performing an informal independent evaluation of Cloudera Impala against some of the common Real Time/Near Real Time use cases in Cyber Security. I agreed to step up and perform an independent evaluation as well as developing a demonstration platform for FedCyber 2012 (almost three weeks hence for inquiring minds).  So let us set the field: a new BETA technology, NO prior exposure to the technology or documentation, a vendor making promises, addressing a large technology gap and three weeks to implement, seemed straight forward; no pressure.

The day after I returned from the STRATA Conference, I returned to my office and provisioned four Virtual Machines in order to build the Impala demonstration. As a committer/contributor for SherpaSurfing an open source Cyber Security solution, I have an abundance of data sets, enrichment sources, Hive data structures and services.  Given the amount of time and the audience for FedCyber 2012, I decided to focus on some Intrusion Detection and Netflow related use cases for the demonstration. The data sets for the demonstration included base data sets:  20 million Netflow events, 8 million Intrusion Detection System events and enrichment: Geographic, Blacklist, Whitelist and Protocol related information. Each of the selected uses cases for this demonstration is critical to the Perform Near-Real Time Network Analysis domain in Cyber Security. The name for the demonstration system was decided to be the Impala Mission Demonstration Platform (IMDP).  The IMDP was implemented based on vendor recommendations with no tuning or optimization.

The IMDP effort provided me with my first opportunity to work with Cloudera Manager. Although this post is focused on Cloudera Impala I would be remiss not to mention Cloudera Manager. I have worked with Hadoop since 1.0 and built more than a few clusters over the years. I used the installation and configuration guides provided with Cloudera Impala and followed the recommendations. One of the first recommendations was use of the Cloudera Manager. Using the Cloudera Manager (CDH 4.1), I was able to roll out a four node cluster in two hours.  I was able to discover the hosts, manage services and provision them in accordance with the IMDP deployment plan. The deployment plan consisted of:

  • node 1 – hbase, hdfs, impala,  mapreduce
  • node2 – hbase, hdfs, impala,  mapreduce
  • node3 – hbase(region server, master), hdfs(namenode), impala(impalad, statestore),  mapreduce(job tracker, tasktracker) , hue, oozie and zookeeper
  • node4 – Application Tier, Cloudera Manager

The Cloudera Manager saved at least two days of effort in deploying the cluster, the tight integration with the support portal, comprehensive help and one place to work with all properties of the entire cluster and view space consumption metrics; verdict on Cloudera Manager: Cloudera masterful, bold stroke, thumbs up.

Now that the cluster build-out completed; I shifted attention to deploying and configuring the Cloudera Impala service.  Using Cloudera Manager, I deployed Impala on three nodes: three instances of Impalad and one impala state store, in a matter of minutes. I completed the deployment and configuration of the Hive MetaStore. Keeping in mind this is a BETA; the documentation was complete, but fragmented on deployment and configuration (HIVE MetaStore portion); verdict on impala deployment and configuration: solid for a BETA (needs an example hive-site.xml, configuration guide needs better flow).

At this point all configuration and deployment was completed, attention turned to building data structures and loading data. I took the Data Definition Language (DDL) scripts or data structures for ten data sources and enrichment; ported them over to Hive and tested them in less than four hours. It is worthy of mention that the data sources for this demonstration are large flat tables: netflow and intrusion detection system. Cloudera Impala uses HIVE as an Extract Transform Load (ETL) engine, using Hive I defined all of the data structures in source files which were sourced using hive shell: created a database (Sherpa). Hive was then used to load data into the tables that were just created. Creating data structures in Hive was simple as usual and loading data sets was quick (20 million netflow events in 57 seconds). Logging into impala-shell, issued a refresh of the MetaStore and I was working with data. I performed verification of the data load, all data loaded and no issues were revealed. One area of potential improvement would be more comprehensive messages on load failure. Defining the data structures and loading data using Hive was nothing new; verdict:  really good; easy to use, easy to load, but need to improve failed load messages.

Finally, we moved on to the most interesting stage which is using Cloudera Impala in a series of Real Time Query (RTQ) scenarios that are common across the Cyber Security customer space. The real world scenarios selected come from the perform netflow analysis set of use case(s). In each of these scenarios, the exact same queries were executed on the same cluster using Hive and then Impala against the same data structures (database and tables).  In the Hive approach, we traverse the batch processing stack and with Impala we traverse the Real Time Query (RTQ) stack performing a series of analytics. In the first use case, I ran a five tuple (sip, sport, dip, dport, protocol) summary covering bytes per packet, summing bytes and packets for a 20 million event set resulted in: identical result sets, Hive 82 seconds – Impala 6 seconds.   In the second use case, I performed a summary of destination ports where the source port is 80 which resulted in: identical result sets, Hive 57 seconds, Impala 5 seconds. In the third use case, I performed correlation between netflow and intrusion detection systems, correlating netflow with intrusion detection events for several hours which resulted in: identical result sets, Hive 40 seconds, Impala sub-second.  Finally, for FedCyber 2012, I developed a java based situational awareness dashboard which connected to Cloudera Impala via ODBC and executed analytics performing: correlation of blacklists, Intrusion Detection, Netflow, statistical cubes for ten hours with a refresh of every five seconds without failure or issue.  The ODBC implementation easily provided the ability to export data to desktop tools (using ODBC) and common BI tools as advertised. Developing and Using Cloudera Impala verdict: This is as advertised; easy to use, easy to implement on, very fast, very flexible and more than capable of running real time analytics. The Impala shell is limited but much of the demonstration work was done using result sets so it was not an impediment.

In summation, I have worked for over a decade across the vast BIGDATA technology space covering Legacy Relational Database, Data Warehouse, and NOSQL; Cloudera Impala proved more than capable of running near real time analytics and providing mission relevance to customers with a Near Real Time (NRT) requirement.  Based on my initial review Cloudera Impala appears to be a bold step in closing the gap of near real time analytics on a NOSQL stack. I did encounter some minor problems, but the few problems and limitations that were encountered in this demonstration were documented and published in the known issues document so they will not be shared; none were show stoppers.

The notes, details and all of the lessons learned, data structures and the configuration guide from the demonstration are being published out on Github under SherpaSurfing in the coming days. These documents cover everything in detail and will enable developers to replicate the demonstration platform and get a jump start on Cloudera Impala.  Finally, I would like to thank two contributors: Hanh Le, Robert Webb and Six3 Systems for helping me pull this off.

Read the original blog entry...

More Stories By Bob Gourley

Bob Gourley, former CTO of the Defense Intelligence Agency (DIA), is Founder and CTO of Crucial Point LLC, a technology research and advisory firm providing fact based technology reviews in support of venture capital, private equity and emerging technology firms. He has extensive industry experience in intelligence and security and was awarded an intelligence community meritorious achievement award by AFCEA in 2008, and has also been recognized as an Infoworld Top 25 CTO and as one of the most fascinating communicators in Government IT by GovFresh.

@ThingsExpo Stories
SYS-CON Events announced today that Akana, formerly SOA Software, has been named “Bronze Sponsor” of SYS-CON's 16th International Cloud Expo® New York, which will take place June 9-11, 2015, at the Javits Center in New York City, NY. Akana’s comprehensive suite of API Management, API Security, Integrated SOA Governance, and Cloud Integration solutions helps businesses accelerate digital transformation by securely extending their reach across multiple channels – mobile, cloud and Internet of Things. Akana enables enterprises to share data as APIs, connect and integrate applications, drive part...
SYS-CON Events announced today that CommVault has been named “Bronze Sponsor” of SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY, and the 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. A singular vision – a belief in a better way to address current and future data management needs – guides CommVault in the development of Singular Information Management® solutions for high-performance data protection, universal availability and sim...
SYS-CON Events announced today that SafeLogic has been named “Bag Sponsor” of SYS-CON's 16th International Cloud Expo® New York, which will take place June 9-11, 2015, at the Javits Center in New York City, NY. SafeLogic provides security products for applications in mobile and server/appliance environments. SafeLogic’s flagship product CryptoComply is a FIPS 140-2 validated cryptographic engine designed to secure data on servers, workstations, appliances, mobile devices, and in the Cloud.
SYS-CON Events announced today that StorPool Storage will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. StorPool is distributed storage software that allows service providers, enterprises and other cloud builders to run data storage on standard x86 servers, instead of using expensive and inefficient storage arrays (SAN).
GENBAND has announced that SageNet is leveraging the Nuvia platform to deliver Unified Communications as a Service (UCaaS) to its large base of retail and enterprise customers. Nuvia’s cloud-based solution provides SageNet’s customers with a full suite of business communications and collaboration tools. Two large national SageNet retail customers have recently signed up to deploy the Nuvia platform and the company will continue to sell the service to new and existing customers. Nuvia’s capabilities include HD voice, video, multimedia messaging, mobility, conferencing, Web collaboration, deskt...
SYS-CON Events announced today that Site24x7, the cloud infrastructure monitoring service, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Site24x7 is a cloud infrastructure monitoring service that helps monitor the uptime and performance of websites, online applications, servers, mobile websites and custom APIs. The monitoring is done from 50+ locations across the world and from various wireless carriers, thus providing a global perspective of the end-user experience. Site24x7 supports monitoring H...
Sonus Networks introduced the Sonus WebRTC Services Solution, a virtualized Web Real-Time Communications (WebRTC) offer, purpose-built for the Cloud. The WebRTC Services Solution provides signaling from WebRTC-to-WebRTC applications and interworking from WebRTC-to-Session Initiation Protocol (SIP), delivering advanced real-time communications capabilities on mobile applications and on websites, which are accessible via a browser.
SYS-CON Events announced today that Intelligent Systems Services will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Established in 1994, Intelligent Systems Services Inc. is located near Washington, DC, with representatives and partners nationwide. ISS’s well-established track record is based on the continuous pursuit of excellence in designing, implementing and supporting nationwide clients’ mission-critical systems. ISS has completed many successful projects in Healthcare, Commercial, Manufacturing, ...
SYS-CON Events announced today that B2Cloud, a provider of enterprise resource planning software, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. B2cloud develops the software you need. They have the ideal tools to help you work with your clients. B2Cloud’s main solutions include AGIS – ERP, CLOHC, AGIS – Invoice, and IZUM
SYS-CON Events announced today that Tufin, the market-leading provider of Security Policy Orchestration Solutions, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. As the market leader of Security Policy Orchestration, Tufin automates and accelerates network configuration changes while maintaining security and compliance. Tufin's award-winning Orchestration Suite™ gives IT organizations the power and agility to enforce security policy across complex, multi-vendor enterprise networks. With more than 1...
SYS-CON Events announced today that Cloudian, Inc., the leading provider of hybrid cloud storage solutions, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Cloudian, Inc., is a Foster City, California - based software company specializing in cloud storage software. The main product is Cloudian, an Amazon S3-compliant cloud object storage platform, the bedrock of cloud computing systems, that enables cloud service providers and enterprises to build reliable, affordable and scalable cloud storage solu...
SYS-CON Events announced today that Gridstore™, the leader in hyper-converged infrastructure purpose-built to optimize Microsoft workloads, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Gridstore™ is the leader in hyper-converged infrastructure purpose-built for Microsoft workloads and designed to accelerate applications in virtualized environments. Gridstore’s hyper-converged infrastructure is the industry’s first all flash version of HyperConverged Appliances that include both compute and storag...
Temasys has announced senior management additions to its team. Joining are David Holloway as Vice President of Commercial and Nadine Yap as Vice President of Product. Over the past 12 months Temasys has doubled in size as it adds new customers and expands the development of its Skylink platform. Skylink leads the charge to move WebRTC, traditionally seen as a desktop, browser based technology, to become a ubiquitous web communications technology on web and mobile, as well as Internet of Things compatible devices.
“With easy-to-use SDKs for Atmel’s platforms, IoT developers can now reap the benefits of realtime communication, and bypass the security pitfalls and configuration complexities that put IoT deployments at risk,” said Todd Greene, founder & CEO of PubNub. PubNub will team with Atmel at CES 2015 to launch full SDK support for Atmel’s MCU, MPU, and Wireless SoC platforms. Atmel developers now have access to PubNub’s secure Publish/Subscribe messaging with guaranteed ¼ second latencies across PubNub’s 14 global points-of-presence. PubNub delivers secure communication through firewalls, proxy ser...
SYS-CON Events announced today that IDenticard will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. IDenticard™ is the security division of Brady Corp (NYSE: BRC), a $1.5 billion manufacturer of identification products. We have small-company values with the strength and stability of a major corporation. IDenticard offers local sales, support and service to our customers across the United States and Canada. Our partner network encompasses some 300 of the world's leading systems integrators and security s...
SYS-CON Events announced today that On the Avenue Marketing Group, a sales and marketing firm that utilizes events to market and sell products to consumers, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. On the Avenue Marketing Group (OTA) is a sales and marketing firm that utilizes events to market and sell products to consumers. On behalf of our clients, we attend thousands of fairs, festivals, expos, concerts, conferences, and sporting events annually, helping them reach millions of individuals ...
Containers and microservices have become topics of intense interest throughout the cloud developer and enterprise IT communities. Accordingly, attendees at the upcoming 16th Cloud Expo at the Javits Center in New York June 9-11 will find fresh new content in a new track called PaaS | Containers & Microservices Containers are not being considered for the first time by the cloud community, but a current era of re-consideration has pushed them to the top of the cloud agenda. With the launch of Docker's initial release in March of 2013, interest was revved up several notches. Then late last...
“In the past year we've seen a lot of stabilization of WebRTC. You can now use it in production with a far greater degree of certainty. A lot of the real developments in the past year have been in things like the data channel, which will enable a whole new type of application," explained Peter Dunkley, Technical Director at Acision, in this SYS-CON.tv interview at @ThingsExpo, held Nov 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
SYS-CON Events announced today that Ciqada will exhibit at SYS-CON's @ThingsExpo, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Ciqada™ makes it easy to connect your products to the Internet. By integrating key components - hardware, servers, dashboards, and mobile apps - into an easy-to-use, configurable system, your products can quickly and securely join the internet of things. With remote monitoring, control, and alert messaging capability, you will meet your customers' needs of tomorrow - today! Ciqada. Let your products take flight. For more inform...
Health care systems across the globe are under enormous strain, as facilities reach capacity and costs continue to rise. M2M and the Internet of Things have the potential to transform the industry through connected health solutions that can make care more efficient while reducing costs. In fact, Vodafone's annual M2M Barometer Report forecasts M2M applications rising to 57 percent in health care and life sciences by 2016. Lively is one of Vodafone's health care partners, whose solutions enable older adults to live independent lives while staying connected to loved ones. M2M will continue to gr...