YOUR FEEDBACK
Cloud Computing: Do You Really Want Your Data in the Cloud?
Don Dodge wrote: D Cheng, Of course in-house systems go down. What I am sa...


2007 West
GOLD SPONSORS:
Active Endpoints
Your SOA Needs BPEL for Orchestration
BEA
Virtualized SOA: Adaptive Infrastructure for Demanding Applications
Nexaweb
Overcoming Bandwidth Challenges with Nexaweb
TIBCO
What is Service Virtualization?
SILVER SPONSORS:
WSO2
Using Web Services Technologies and FOSS Solutions
Click For 2007 East
Event Webcasts

2008 East
PLATINUM SPONSORS:
Appcelerator
Think Fast: Accelerate AJAX Development with Appcelerator
GOLD SPONSORS:
DreamFace Interactive
The Ultimate Framework for Creating Personalized Web 2.0 Mashups
ICEsoft
AJAX and Social Computing for the Enterprise
Kaazing
Enterprise Comet: Real–Time, Real–Time, or Real–Time Web 2.0?
Nexaweb
Now Playing: Desktop Apps in the Browser!
Sun
jMaki as an AJAX Mashup Framework
POWER PANELS:
The Business Value
of RIAs
What Lies Beyond AJAX?
KEYNOTES:
Douglas Crockford
Can We Fix the Web?
Anthony Franco
2008: The Year of the RIA
Click For 2007 Event Webcasts
SYS-CON.TV
TOP THREE LINKS YOU MUST CLICK ON


Crunching Big Data with Java
One Team, One Month, One JVM

Digg This!

Page 2 of 4   « previous page   next page »

What type of processing is the data going through? The processing may be some sort of data transformation, data mining analytics, business intelligence, data aggregation, matching, or data cleansing task, but a key factor is that it applies to the whole data set. Multi-core machines provide a great opportunity for streamlining this processing. To take full advantage of these hardware resources, however, software developers need better approaches.

And the amount of data gets larger and larger every day. The ease and reduced cost to capture, transfer, and store information have resulted in huge collections of records, transactions, and events. As capacities for processing (more cores) and storage (faster and higher-capacity disks) continue to increase at dizzying speeds, the processing requirements increase almost as fast. Plus no one ever wants to let go of that data once they have it, because once you have the data, you want to squeeze every possible ounce of value out if it.

Data-Oriented Thinking
Within the disciplines of data mining, data quality, matching, searching and other data-intensive efforts, there exist compute-intensive algorithms for doing specific functions. It seems natural to solve our performance and scalability problems by focusing in and optimizing these algorithms using standard techniques.

While the above approach helps, it's only one part of solving the problem. A more general approach is needed, a different way of attacking the problem. Instead of thinking in terms of the functions to run on the data (the algorithms) or even the objects to use to process the data, I suggest you think in terms of the data itself. Consider the way the data needs to flow to be transformed, processed, mined, cleansed, or matched to reach the end goal. This dataflow way of thinking isn't a step back to the flow chart days of COBOL, but a higher-level data-oriented abstraction. Once you can break a data-intensive, analytic problem into such a form, it can be mapped to a dataflow graph that may be easily implemented in Java. We'll see how in the course of this article.

A dataflow is a graph structure that consists of two basic building blocks: a processing node and a data edge. A processing node (or process) transforms its input data in some way and outputs the transformed data to its output queue. The edges in the graph are dataflow queues. Dataflow queues are blocking queues used to transmit data between the processing nodes. A data processing application is built in a dataflow framework by stitching together dataflow processes to provide the desired data transformations. Please see Figure 1 for depiction of a simple dataflow graph.

If you use command line shells such as sh or bash, you're already familiar with this design methodology. From the shell command line you can run command pipelines such as:

awk -F, '{print $3}' < file.txt | sort | unique -c | sort -rn

(This pipeline reads the third column of the input CSV file, sorts the values, gets a count of each distinct value, and then sorts the distinct values by their frequency in reverse order.) This command pipeline is hooked together using the standard input and standard output data streams of each command. The shell creates a process for each command, hooks the outputs to inputs and then lets the data flow. None of the commands need to know about each other, they just need to honor the contract of writing their output to standard output and reading their input from standard input. Contract-based programming in action!

It's the same with dataflow programming. A dataflow process defines a contract: its input and output dataflow queues. A process may also have properties that can be set that are analogous to command line options of shell commands. Building a dataflow graph is very similar to building a shell pipeline. However, dataflow processes can normally have multiple inputs and multiple outputs. Shells commands can have two outputs: standard output and standard error, but only one input. There are other differences as well, but the analogy is useful to understanding the concepts. Figure 1 depicts the given shell command line example implemented as a dataflow graph. In the dataflow graph, the awk command is replaced with a reader process. The uniq command is replaced with a Group operator that uses a row count aggregator.

It's important to note that dataflow implements pipeline parallelism by its very nature. Each process in a dataflow graph works independently (somewhat) of other processes in the graph. As a process handles its input and does its transformations, it writes the resulting data to its output. Pipelining is a very powerful construct that allows for simple parallelism. It's one of the basic building blocks of parallel algorithm structure.

Let's look more closely at pipelining. Referring to the dataflow graph in Figure 1, the ReadText process begins reading the input file and outputs the third column of data for every input row. The ReadText can write each record to its output as it goes; it doesn't have to process all of its input before producing any output. It's a pipeline-friendly process. The Sort process, however, isn't pipeline-friendly. The Sort process must read all of its input data and process it before producing any output. To do so, Sort may have to create temporary space on local disk for merging. This is required since the Sort process may be asked to handle more input data than it can sort in memory.

This illustrates that attacking huge collections of data with the dataflow approach requires some new rules: any data-oriented framework must be data scalable. In other words, it should be able to handle gigabytes to terabytes of data without relying on in-memory algorithms that fail when billions of rows of data have to be handled.

A dataflow graph can also support the "divide and conquer" technique to enhance scalability. This is another basic parallel algorithm structure. With this technique, the input data is partitioned in some way and the same algorithm is applied to each partition. This allows the data to be processed in parallel. On computers with multi-core hardware, divide and conquer (a k a horizontal partitioning) can provide a huge boost to performance.



Page 2 of 4   « previous page   next page »

About Jim Falgout
Jim Falgout is solutions architect for Pervasive Software, where he applied dataflow principles to help architect Pervasive DataRush. He is active in the Java development community; in May of 2007, he presented a technical paper titled 'Unleashing the Power of Multi-Core Processors: Scalable Data Processing in Java Technology' at JavaOne.

Eman wrote: Funny, Cos, you are pointing out how Java isn't all that "free & open" like its corp. creator claims it is... the beauty of open source + patent law = morass of bear traps Frankly, I haven't seen any Java framework that holds a match to this DataRush thing... download and see for yourself.
read & respond »
Cos wrote: Daah! Check US Patent 7,020,699 Filed: December 19, 2001
read & respond »
LATEST JAVA STORIES & POSTS
Saving Your Investment: Transforming J2EE applications into Web 2.0 using GWT
The pressure is on to keep pace with Web 2.0 entrants into the marketplace. Rewriting is expensive; adding AJAX widgets results in a complex, unmaintainable application. Both require you to hire scarce JavaScript developers. Google Web Toolkit -- the SDK that allows you to write
WSRP Really Works! - Part 2
A standard from OASIS called Web Services for Remote Portlets (WSRP) is used so portlets can be decoupled from a portal. In part one (JDJ, Volume. 13, issue 3) of this article, we introduced the relevant standards and specifications and then demonstrated WSRP's capabilities by co
Adobe's Kevin Lynch and Microsoft's Scott Guthrie to Keynote AJAX World RIA Conference & Expo
Two of the biggest launches in Rich Internet Application history took place in 2007/2008 when Adobe launched AIR 1.0 in February '08 and Microsoft launched Silverlight (September '07). At the 6th International AJAXWorld RIA Conference & Expo in October SYS-CON Events is delighted
Sun Expects Q4 Earnings Above Estimates
On Tuesday evening Sun issued a fourth-quarter guidance range largely above analysts' estimates. The company pre-announced that revenue for its fiscal fourth quarter ended June was $3.725 billion to $3.8 billion, with gross margin in the 44-45% range. Sun expects non-GAAP profits
Virtualization Conference Keynote Webcast Live on SYS-CON.TV
Brian Stevens, the Chief Technology Officer and Vice President of Engineering of Red Hat, delivered his Virtualization Keynote 'The Future of the Virtual Enterprise' at SYS-CON's Virtualization Conference & Expo 2007 West in San Francisco. 'Virtualization is the hottest subject
The Beauty of JavaScript
JavaScript is one of the most interesting and misunderstood programming languages in common use today. Most developers will go their entire careers without realizing its full potential. It's not often that you get a language that supports the feature set that JavaScript does, whi
SUBSCRIBE TO THE WORLD'S MOST POWERFUL NEWSLETTERS
SUBSCRIBE TO OUR RSS FEEDS & GET YOUR SYS-CON NEWS LIVE!
Click to Add our RSS Feeds to the Service of Your Choice:
Google Reader or Homepage Add to My Yahoo! Subscribe with Bloglines Subscribe in NewsGator Online
myFeedster Add to My AOL Subscribe in Rojo Add 'Hugg' to Newsburst from CNET News.com Kinja Digest View Additional SYS-CON Feeds
Publish Your Article! Please send it to editorial(at)sys-con.com!

Advertise on this site! Contact advertising(at)sys-con.com! 201 802-3021

SYS-CON FEATURED WHITEPAPERS

SPONSORED BY INFRAGISTICS
SOA in a JVM: OSGi Service Platform - A Dynamic Component System for Java
There are many forces that influence technological evolution. After a decade of building enterprise
AJAX and Enterprise RIA Tools - JSF, Flex, and JavaFX
2008 is going to be an important year for Rich Internet Applications. Most organizations are deliver
Final Voting Phase on OpenAjax Browser Wishlist
The OpenAjax Alliance is developing an Ajax industry wishlist for future browsers, using a dedicated
AJAX World RIA Conference News - Netflix UI Guru To Present on Crafting Rich Web Interfaces
In every field of design one of the first things students do is learn from the work of others. They
Infragistics Releases CTP UI Components for Microsoft Silverlight Beta 2
Infragistics announced the availability of two Community Technology Preview (CTP) User Interface (UI
Yahoo User Interface 2.5.2 Released
The YUI development team has released version 2.5.2; you can download the new release from SourceFor
ADS BY GOOGLE
BREAKING JAVA NEWS
Domark International, Inc. Completes Its Acquisition of Javaco, Inc.
Domark International, Inc. (OTCBB:DOMK) announced today that it has completed its acqui