Crunching Big Data with Java
One Team, One Month, One JVM
By: Jim Falgout
Mar. 30, 2008 04:00 AM
Page 2 of 4
What type of processing is the data going through? It may be some sort of data transformation, data mining analytics, business intelligence, data aggregation, matching, or data cleansing task, but the key factor is that it applies to the whole data set. Multi-core machines provide a great opportunity for streamlining this processing. To take full advantage of these hardware resources, however, software developers need better approaches.

Meanwhile, the amount of data grows larger every day. The ease and reduced cost of capturing, transferring, and storing information have resulted in huge collections of records, transactions, and events. As capacities for processing (more cores) and storage (faster and higher-capacity disks) continue to increase at dizzying speeds, the processing requirements increase almost as fast. And no one ever wants to let go of data once they have it, because once you have the data, you want to squeeze every possible ounce of value out of it.
Data-Oriented Thinking

While the above approach helps, it's only one part of solving the problem. A more general approach is needed, a different way of attacking the problem. Instead of thinking in terms of the functions to run on the data (the algorithms), or even the objects used to process the data, I suggest you think in terms of the data itself. Consider the way the data needs to flow to be transformed, processed, mined, cleansed, or matched to reach the end goal. This dataflow way of thinking isn't a step back to the flow chart days of COBOL, but a higher-level, data-oriented abstraction. Once you can break a data-intensive, analytic problem into such a form, it can be mapped to a dataflow graph that can be implemented easily in Java. We'll see how in the course of this article.

A dataflow is a graph structure built from two basic building blocks: processing nodes and data edges. A processing node (or process) transforms its input data in some way and writes the transformed data to its output queue. The edges in the graph are dataflow queues: blocking queues used to transmit data between the processing nodes. A data processing application is built in a dataflow framework by stitching together dataflow processes to provide the desired data transformations. See Figure 1 for a depiction of a simple dataflow graph.

If you use command line shells such as sh or bash, you're already familiar with this design methodology. From the shell command line you can run command pipelines such as:

    awk -F, '{print $3}' < file.txt | sort | uniq -c | sort -rn

This pipeline reads the third column of the input CSV file, sorts the values, gets a count of each distinct value, and then sorts the distinct values by their frequency in reverse order. The command pipeline is hooked together using the standard input and standard output data streams of each command.
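The building blocks described above, processing nodes joined by blocking queues, can be sketched in plain Java. This is only a minimal illustration, assuming one thread per node and java.util.concurrent.LinkedBlockingQueue for the edges; it is not the API of any particular dataflow framework. The names DataflowSketch, Node, THIRD_COLUMN, COUNT_RUNS, BY_COUNT_DESC, and the EOF marker are all hypothetical, and the sketch works on in-memory rows rather than a file to stay short. It mirrors the four stages of the shell pipeline.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch only: every name here is hypothetical, not a framework API.
class DataflowSketch {
    // End-of-stream marker, compared by identity so it can never equal real data.
    static final String EOF = new String("EOF");

    // A processing node reads from its input queue and writes to its output queue.
    interface Node {
        void run(BlockingQueue<String> in, BlockingQueue<String> out);
    }

    // Blocking take/put with the checked exception converted to an unchecked one.
    static String take(BlockingQueue<String> q) {
        try { return q.take(); }
        catch (InterruptedException e) { throw new IllegalStateException(e); }
    }

    static void put(BlockingQueue<String> q, String s) {
        try { q.put(s); }
        catch (InterruptedException e) { throw new IllegalStateException(e); }
    }

    // Like awk -F, '{print $3}': emits the third field of each row as it arrives.
    static final Node THIRD_COLUMN = (in, out) -> {
        for (String row = take(in); row != EOF; row = take(in)) {
            String[] fields = row.split(",");
            put(out, fields.length > 2 ? fields[2] : "");
        }
        put(out, EOF);
    };

    // Like sort: must consume its whole input before it can emit anything.
    static Node sort(Comparator<String> order) {
        return (in, out) -> {
            List<String> all = new ArrayList<>();
            for (String s = take(in); s != EOF; s = take(in)) all.add(s);
            all.sort(order);
            for (String s : all) put(out, s);
            put(out, EOF);
        };
    }

    // Like uniq -c: counts runs of adjacent equal values, emitting "count value".
    static final Node COUNT_RUNS = (in, out) -> {
        String current = null;
        int count = 0;
        for (String s = take(in); s != EOF; s = take(in)) {
            if (!s.equals(current)) {
                if (current != null) put(out, count + " " + current);
                current = s;
                count = 0;
            }
            count++;
        }
        if (current != null) put(out, count + " " + current);
        put(out, EOF);
    };

    // Orders "count value" records by count, highest first (roughly sort -rn).
    static final Comparator<String> BY_COUNT_DESC = Comparator
        .comparingInt((String s) -> Integer.parseInt(s.substring(0, s.indexOf(' '))))
        .reversed();

    // Stitches the nodes together with bounded blocking queues, one thread per node.
    static List<String> run(List<String> rows) {
        List<Node> nodes = List.of(
            THIRD_COLUMN, sort(Comparator.naturalOrder()), COUNT_RUNS, sort(BY_COUNT_DESC));
        BlockingQueue<String> source = new LinkedBlockingQueue<>(16);
        BlockingQueue<String> in = source;
        for (Node node : nodes) {
            BlockingQueue<String> nodeIn = in;
            BlockingQueue<String> nodeOut = new LinkedBlockingQueue<>(16);
            new Thread(() -> node.run(nodeIn, nodeOut)).start();
            in = nodeOut;
        }
        for (String row : rows) put(source, row);
        put(source, EOF);
        List<String> result = new ArrayList<>();
        for (String s = take(in); s != EOF; s = take(in)) result.add(s);
        return result;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("1,a,x", "2,b,y", "3,c,z", "4,d,y", "5,e,z", "6,f,y");
        run(rows).forEach(System.out::println); // prints "3 y", "2 z", "1 x", one per line
    }
}
```

Because the queues are bounded, a fast producer blocks until its consumer catches up, which is the same back-pressure a shell pipe provides. No node knows which node sits on the other end of its queues.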
The shell creates a process for each command, hooks the outputs to inputs, and then lets the data flow. None of the commands needs to know about the others; each just needs to honor the contract of reading its input from standard input and writing its output to standard output. Contract-based programming in action!

It's the same with dataflow programming. A dataflow process defines a contract: its input and output dataflow queues. A process may also have settable properties, analogous to the command line options of shell commands. Building a dataflow graph is very similar to building a shell pipeline. However, dataflow processes can normally have multiple inputs and multiple outputs, whereas shell commands can have two outputs (standard output and standard error) but only one input. There are other differences as well, but the analogy is useful for understanding the concepts.

Figure 1 depicts the given shell command line example implemented as a dataflow graph. In the dataflow graph, the awk command is replaced with a reader process. The uniq command is replaced with a Group operator that uses a row count aggregator.

It's important to note that dataflow implements pipeline parallelism by its very nature. Each process in a dataflow graph works somewhat independently of the other processes in the graph. As a process handles its input and does its transformations, it writes the resulting data to its output. Pipelining is a very powerful construct that allows for simple parallelism. It's one of the basic building blocks of parallel algorithm structure.

Let's look more closely at pipelining. Referring to the dataflow graph in Figure 1, the ReadText process begins reading the input file and outputs the third column of data for every input row. The ReadText process can write each record to its output as it goes; it doesn't have to process all of its input before producing any output. It's a pipeline-friendly process. The Sort process, however, isn't pipeline-friendly.
The Sort process must read all of its input data and process it before producing any output. To do so, Sort may have to create temporary space on local disk for merging. This is required because the Sort process may be asked to handle more input data than it can sort in memory. This illustrates that attacking huge collections of data with the dataflow approach requires some new rules: any data-oriented framework must be data scalable. In other words, it should be able to handle gigabytes to terabytes of data without relying on in-memory algorithms that fail when billions of rows of data have to be handled.

A dataflow graph can also support the "divide and conquer" technique to enhance scalability. This is another basic parallel algorithm structure. With this technique, the input data is partitioned in some way and the same algorithm is applied to each partition. This allows the data to be processed in parallel. On computers with multi-core hardware, divide and conquer (a.k.a. horizontal partitioning) can provide a huge boost to performance.
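The divide and conquer technique can be illustrated with a small Java sketch. It assumes hash partitioning on the value and a fixed thread pool from java.util.concurrent, neither of which the article prescribes, and the names PartitionedCount, count, and countInParallel are hypothetical. It also works on an in-memory list purely for brevity; a data-scalable implementation would stream each partition instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only: names and partitioning scheme are assumptions.
class PartitionedCount {
    // The algorithm applied to every partition: count occurrences of each value.
    static Map<String, Integer> count(List<String> partition) {
        Map<String, Integer> counts = new HashMap<>();
        for (String value : partition) {
            counts.merge(value, 1, Integer::sum);
        }
        return counts;
    }

    static Map<String, Integer> countInParallel(List<String> values, int partitions) {
        // Divide: hash partitioning sends equal values to the same partition.
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (String value : values) {
            parts.get(Math.floorMod(value.hashCode(), partitions)).add(value);
        }

        // Conquer: run the same algorithm on each partition, one task per partition.
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        try {
            List<Future<Map<String, Integer>>> futures = new ArrayList<>();
            for (List<String> part : parts) {
                futures.add(pool.submit(() -> count(part)));
            }
            // Combine: merge the partial results into one sorted map.
            Map<String, Integer> total = new TreeMap<>();
            for (Future<Map<String, Integer>> future : futures) {
                future.get().forEach((value, n) -> total.merge(value, n, Integer::sum));
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> values = List.of("x", "y", "z", "y", "z", "y");
        System.out.println(countInParallel(values, 3)); // prints {x=1, y=3, z=2}
    }
}
```

Hash partitioning is a convenient choice for grouping because all rows with the same key land in the same partition, so the merge step never has to reconcile one key across partitions. Other partitioning schemes, such as round-robin, would still work here because the merge sums the partial counts.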