YOUR FEEDBACK
Jeremy Geelan wrote: In response to inquiries and suggestions from readers this lexicon has recently...


2008 East
DIAMOND SPONSOR:
Data Direct
Frontiers in Data Access: The Coming Wave in Data Services
PLATINUM SPONSORS:
Red Hat
The Opening of Virtualization
Intel
Virtualization – Path to Predictive Enterprise
Green Hills
IT Security in a Hostile World
JBoss / freedom oss
Practical SOA Approach
GOLD SPONSORS:
Software AG
The Art & Science of SOA: How Governance Enables Adoption
PlateSpin
Effective Planning for Virtual Infrastructure Growth
Fujitsu
Automated Business Process Discovery & Virtualization Service
Ceedo
Workspace Virtualization
Click For 2007 West
Event Webcasts

2008 East
PLATINUM SPONSORS:
Appcelerator
Think Fast: Accelerate AJAX Development with Appcelerator
GOLD SPONSORS:
DreamFace Interactive
The Ultimate Framework for Creating Personalized Web 2.0 Mashups
ICEsoft
AJAX and Social Computing for the Enterprise
Kaazing
Enterprise Comet: Real–Time, Real–Time, or Real–Time Web 2.0?
Nexaweb
Now Playing: Desktop Apps in the Browser!
Sun
jMaki as an AJAX Mashup Framework
POWER PANELS:
The Business Value
of RIAs
What Lies Beyond AJAX?
KEYNOTES:
Douglas Crockford
Can We Fix the Web?
Anthony Franco
2008: The Year of the RIA
Click For 2007 Event Webcasts
SYS-CON.TV
TOP THREE LINKS YOU MUST CLICK ON


Crunching Big Data with Java
One Team, One Month, One JVM

A Problem of Matching
Recently I was presented with a problem of de-duplicating (a k a fuzzy matching) tens of millions of records. And this huge volume is just the start. Eventually, the number of records may grow into the hundreds of millions. I teamed with two co-workers to meet the project's tight timeline.

A fuzzy matching application can basically be broken into the following modules of functionality:

  • Cleansing and standardization
  • Blocking
  • Field comparison
  • Classification
  • Filtering
The cleansing and standardization phase handles getting the data into the right form for matching. For example, postal addresses may need to be standardized against a U.S. postal database.

The blocking phase attempts to place similar records in the same blocks for candidate record pair generation. This is a key step, since many records can be generated during this phase. During de-dup, the number of records output per block is (N * (N -1) )/ 2, where N is the number of records in the block. One common way of blocking is to use some sort of geographic data, such as parts of a postal address, as blocking keys. Any of the key fields may have encoding, such as Soundex, applied as part of the blocking phase.

Why group into blocks in the first place? Because comparing every record to every other record is normally impossible (or at least would take an enormous amount of time). For only a few thousand records, this is not a problem. But for larger datasets, it's a huge problem. Table 1 shows the numbers of candidate pairs generated if all input rows are compared to all other rows. As you can see, a data explosion happens quickly. Blocking is needed to cut down on the size of groups used to generate candidate record pairs and thereby dramatically reduce the resulting number of comparisons.

The field comparison phase compares fields from the record pairs using specified comparison methods. Some common comparison methods are: Levenshtein Edit Distance, Damerau-Levenshtein, Jaro, Jaro-Winkler, Q-Gram, and exact match. Each pair of fields is compared and generates a field score.

For each record pair, the field scores are then used to classify the pair as either being a good match, a possible match, or not a match. A possible match may need review by a human to discern the quality of the match. Classification results in a record score being produced. Once each record pair is scored, the data can be filtered and output. The filtering is normally used to exclude candidate matches that don't meet the specified criteria (i.e., a record score below a certain threshold).

Now, let's take a data-oriented approach to this problem and see how we can start to break it down. Our first assumption is that the data is already cleansed and standardized, so we'll exclude that phase from our solution. The next phase is blocking. In this phase we optionally apply encoding to blocking key fields and basically join the data against itself. This is done to create candidate pairs of records for the field comparison phase. For the de-dup problem, a standard join won't work; it will create too many redundant candidate record pairs. What is needed is a "group pairs" operator that prevents generating duplicate candidates. For example, if records A, B, and C are in the same blocking group, we want to generate candidate pairs A-B, A-C and B-C. All the other variations (A-A, B-A, B-B, ...) are redundant and shouldn't be generated.

Blocking by key fields implies that we can use hash partitioning to divide and conquer this step. As long as we hash partition on the same keys (optionally encoded) we can run the blocking operation in parallel. Figure 2 depicts this design with a partition count of 4.

For each candidate pair of records, now combined into one record, the field comparisons can be implemented. This is a good place to apply the task parallelism pattern. With task parallelism, we have many different tasks we want to run on our data. Unlike divide and conquer, the tasks are not similar. In this case, each record is independent of all others, so we can actually apply both patterns, divide the data, and then run the set of comparison tasks against it. Figure 3 shows our field comparison design with different comparisons applied to two input fields.

The field comparison results are then used for each candidate pair to determine a record score. This is done using a classification method. As with field comparisons, the pair classifications have no data dependencies on other records and so may be done completely in parallel. We'll take advantage of the task-based parallelism employed for field comparisons and tack on a pipelined record classification that's fed the output of each comparator.

Once a record has been classified, it can be filtered. We can take advantage of pipeline parallelism and tack on the filtering to the record classification.

Now that the problem has been laid out, it's time to move on to implementation and the question asked earlier about the role of Java in all of this.

Dataflow and Java
So far, we've developed a data-oriented design that fits into a dataflow paradigm very nicely. How do we implement this in Java? Well, first, why would I choose to implement this application in Java? A common misconception about Java is that it doesn't perform and doesn't scale. The argument about Java not scaling is mainly blamed on garbage collection. While that may have been true several Java versions ago, it's not so now. Java can perform well and it can scale. Take a look at Figure 4 to see a benchmark test run with a dataflow framework written in Java. The test shows that the JVM scales to use all 32 cores of the machine being tested using the Levenshtein edit distance measure (which we'll use in our matching application).

Granted, I know I may lose some edge of performance using Java. It's true that in C I'd have more control of memory and so could utilize cache line sizes and other tricks, such as processor affinity, to get better performance. But for programmer productivity, it's hard to beat Java when used with the IDEs available today, not to mention the rich libraries that are available. I'm willing to trade a few points of performance to write an application in Java over C given how much more productive I can be in Java. Java is portable, fully object-oriented, easy to code, and used widely. For all of these reasons, the dataflow framework used to implement the matching application discussed in this article is written in Java - otherwise it would have taken my team much longer than a month.


About Jim Falgout
Jim Falgout is solutions architect for Pervasive Software, where he applied dataflow principles to help architect Pervasive DataRush. He is active in the Java development community; in May of 2007, he presented a technical paper titled 'Unleashing the Power of Multi-Core Processors: Scalable Data Processing in Java Technology' at JavaOne.

LATEST JAVA STORIES & POSTS
What's the key to team and individual developer productivity in maintaining and extending a large application? Let’s start by making the following assertions: A developer's knowledge of an application code base is likely the single biggest factor of individual productivity. Cor...
An applet, a Java program that runs in a browser, often has to access the client resources. However, the security manager prevents an applet from accessing client resources. To access client resources, the applet has to have the proper permission. With this permission the applet ...
Three-letter acronyms (TLAs) are hardly new in Information Technology: EAI, ESB, SOA, BPM, BAM, ETL, MDM; the list goes on and on. This article is about yet another three-letter acronym, EDA, which stands for Event-Driven Architecture. EDA is not a brand new technology, but rathe...
Furthering its dedication to providing Java developers productivity with choice, Oracle announced the Oracle Enterprise Pack for Eclipse, a new component of Oracle Fusion Middleware. This release marks the first free Eclipse 3.4 environment to support Oracle WebLogic Server 10g R...
Two of the biggest launches in Rich Internet Application history took place in 2007/2008 when Adobe launched AIR 1.0 in February '08 and Microsoft launched Silverlight (September '07). At the 6th International AJAXWorld RIA Conference & Expo in October SYS-CON Events is delighted...
Red Hat CTO Brian Stevens, Citrix CTO Simon Crosby, Egenera CTO Pete Manca, Allen Stewart, Group Manager, Windows Virtualization at Microsoft, and Brian Duckering, Sr. Director of Products and Alliances at Symantec were the top industry executives who joined Jeremy Geelan in the ...
SUBSCRIBE TO THE WORLD'S MOST POWERFUL NEWSLETTERS
SUBSCRIBE TO OUR RSS FEEDS & GET YOUR SYS-CON NEWS LIVE!
Click to Add our RSS Feeds to the Service of Your Choice:
Google Reader or Homepage Add to My Yahoo! Subscribe with Bloglines Subscribe in NewsGator Online
myFeedster Add to My AOL Subscribe in Rojo Add 'Hugg' to Newsburst from CNET News.com Kinja Digest View Additional SYS-CON Feeds
Publish Your Article! Please send it to editorial(at)sys-con.com!

Advertise on this site! Contact advertising(at)sys-con.com! 201 802-3021


SYS-CON FEATURED WHITEPAPERS

SPONSORED BY INFRAGISTICS
There are many forces that influence technological evolution. After a decade of building enterprise ...
2008 is going to be an important year for Rich Internet Applications. Most organizations are deliver...
The OpenAjax Alliance is developing an Ajax industry wishlist for future browsers, using a dedicated...
In every field of design one of the first things students do is learn from the work of others. They ...
Infragistics announced the availability of two Community Technology Preview (CTP) User Interface (UI...
The YUI development team has released version 2.5.2; you can download the new release from SourceFor...
ADS BY GOOGLE
BREAKING JAVA NEWS

SpringSource, a leading provider of infrastructure software and the company behind ...