By Jim Falgout | March 30, 2008 04:00 AM EDT | Reads: 25,363
A Problem of Matching
Recently I was presented with a problem of de-duplicating (a.k.a. fuzzy matching) tens of millions of records. And this huge volume is just the start. Eventually, the number of records may grow into the hundreds of millions. I teamed with two co-workers to meet the project's tight timeline.
A fuzzy matching application can be broken into the following functional modules:
- Cleansing and standardization
- Blocking
- Field comparison
- Classification
- Filtering
The blocking phase attempts to place similar records in the same blocks for candidate record pair generation. This is a key step, since many candidate pairs can be generated during this phase. During de-dup, the number of candidate pairs output per block is (N * (N - 1)) / 2, where N is the number of records in the block. One common way of blocking is to use some sort of geographic data, such as parts of a postal address, as blocking keys. Any of the key fields may have an encoding, such as Soundex, applied as part of the blocking phase.
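As a minimal sketch of blocking, the code below groups records by a key built from a postal code and the first letter of a surname. The `Blocker` class, the two-field record layout, and the choice of key are assumptions made for illustration; a phonetic encoding such as Soundex could be substituted where the surname initial is used.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class Blocker {
    /** Hypothetical blocking key: postal code plus the upper-cased surname initial. */
    static String blockingKey(String surname, String postalCode) {
        String initial = surname.isEmpty()
                ? ""
                : surname.substring(0, 1).toUpperCase(Locale.ROOT);
        return postalCode + "|" + initial;
    }

    /** Groups records, each a {surname, postalCode} array, into blocks by key. */
    static Map<String, List<String[]>> block(List<String[]> records) {
        Map<String, List<String[]>> blocks = new LinkedHashMap<>();
        for (String[] record : records) {
            String key = blockingKey(record[0], record[1]);
            blocks.computeIfAbsent(key, k -> new ArrayList<>()).add(record);
        }
        return blocks;
    }
}
```

With this key, "Smith" and "smyth" in the same postal code land in one block, while "Jones" goes to another, so only the first two would be paired for comparison.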
Why group into blocks in the first place? Because comparing every record to every other record is normally impossible (or at least would take an enormous amount of time). For only a few thousand records, this is not a problem. But for larger datasets, it's a huge problem. Table 1 shows the numbers of candidate pairs generated if all input rows are compared to all other rows. As you can see, a data explosion happens quickly. Blocking is needed to cut down on the size of groups used to generate candidate record pairs and thereby dramatically reduce the resulting number of comparisons.
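The growth described above follows directly from the N(N - 1)/2 formula. The sketch below computes it, along with the total for an idealized case in which the records split evenly across blocks. The `PairCount` class and the even-split assumption are illustrative only, and the numbers it produces are computed from the formula rather than taken from Table 1.

```java
final class PairCount {
    /** Candidate pairs when every one of n records is compared with every other. */
    static long pairs(long n) {
        return n * (n - 1) / 2;
    }

    /**
     * Total candidate pairs when n records are split evenly into the given
     * number of blocks. Real blocks are rarely this uniform; this is an
     * idealized assumption for the sake of the arithmetic.
     */
    static long pairsWithBlocking(long n, long blocks) {
        return blocks * pairs(n / blocks);
    }
}
```

For 10,000,000 records, `pairs` gives 49,999,995,000,000 candidate pairs. Splitting the same records into 10,000 even blocks of 1,000 gives 4,995,000,000, roughly a ten-thousandfold reduction.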
The field comparison phase compares fields from the record pairs using specified comparison methods. Some common comparison methods are: Levenshtein Edit Distance, Damerau-Levenshtein, Jaro, Jaro-Winkler, Q-Gram, and exact match. Each pair of fields is compared and generates a field score.
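To make one of these comparison methods concrete, here is a sketch of the classic dynamic-programming Levenshtein edit distance, followed by one possible way to turn the distance into a field score between 0 and 1. The `EditDistance` class name and the normalization by the longer string's length are assumptions, not necessarily what the matching application uses.

```java
final class EditDistance {
    /** Minimum number of single-character insertions, deletions, and substitutions. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    /** Assumed normalization: 1.0 for identical strings, 0.0 for entirely different ones. */
    static double score(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / longest;
    }
}
```

For example, `levenshtein("kitten", "sitting")` returns 3: two substitutions and one insertion.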
For each record pair, the field scores are then used to classify the pair as a good match, a possible match, or not a match. A possible match may need review by a human to discern the quality of the match. Classification results in a record score being produced. Once each record pair is scored, the data can be filtered and output. The filtering is normally used to exclude candidate matches that don't meet the specified criteria (e.g., those with a record score below a certain threshold).
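One simple way to picture classification is a weighted average of the field scores compared against two thresholds. This is only a sketch under that assumption: the `Classifier` class, the weighting scheme, and the threshold parameters are hypothetical, and real classifiers may use quite different methods.

```java
final class Classifier {
    enum MatchClass { MATCH, POSSIBLE_MATCH, NON_MATCH }

    /** Record score as a weighted average of field scores (illustrative weighting). */
    static double recordScore(double[] fieldScores, double[] weights) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < fieldScores.length; i++) {
            weightedSum += fieldScores[i] * weights[i];
            totalWeight += weights[i];
        }
        return totalWeight == 0.0 ? 0.0 : weightedSum / totalWeight;
    }

    /** Scores at or above upper are matches; between lower and upper, possible matches. */
    static MatchClass classify(double recordScore, double lower, double upper) {
        if (recordScore >= upper) {
            return MatchClass.MATCH;
        }
        if (recordScore >= lower) {
            return MatchClass.POSSIBLE_MATCH;
        }
        return MatchClass.NON_MATCH;
    }
}
```

Filtering then amounts to dropping every pair classified as `NON_MATCH`, while pairs classified as `POSSIBLE_MATCH` can be routed to a human reviewer.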
Now, let's take a data-oriented approach to this problem and see how we can start to break it down. Our first assumption is that the data is already cleansed and standardized, so we'll exclude that phase from our solution. The next phase is blocking. In this phase we optionally apply encoding to blocking key fields and basically join the data against itself. This is done to create candidate pairs of records for the field comparison phase. For the de-dup problem, a standard join won't work; it will create too many redundant candidate record pairs. What is needed is a "group pairs" operator that prevents generating duplicate candidates. For example, if records A, B, and C are in the same blocking group, we want to generate candidate pairs A-B, A-C and B-C. All the other variations (A-A, B-A, B-B, ...) are redundant and shouldn't be generated.
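The "group pairs" idea can be sketched with two nested loops in which the inner index always starts one past the outer index. The `GroupPairs` class below is a hypothetical stand-in for such an operator, not the framework's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class GroupPairs {
    /**
     * Emits each unordered pair from a block exactly once. Starting j at
     * i + 1 skips self-pairs (A-A) and mirrored pairs (B-A).
     */
    static <T> List<List<T>> pairs(List<T> block) {
        List<List<T>> result = new ArrayList<>();
        for (int i = 0; i < block.size(); i++) {
            for (int j = i + 1; j < block.size(); j++) {
                result.add(Arrays.asList(block.get(i), block.get(j)));
            }
        }
        return result;
    }
}
```

Calling `GroupPairs.pairs(Arrays.asList("A", "B", "C"))` yields the pairs A-B, A-C, and B-C, which is N(N - 1)/2 = 3 pairs for a block of three.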
Blocking by key fields implies that we can use hash partitioning to divide and conquer this step. As long as we hash partition on the same keys (optionally encoded) we can run the blocking operation in parallel. Figure 2 depicts this design with a partition count of 4.
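Hash partitioning on the blocking key can be sketched as follows. Because the partition index depends only on the key, every record with the same key lands in the same partition, so each partition can be blocked independently. The `HashPartitioner` class and the use of `String.hashCode()` are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class HashPartitioner {
    /** Maps a blocking key to a partition index in [0, partitionCount). */
    static int partitionOf(String blockingKey, int partitionCount) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(blockingKey.hashCode(), partitionCount);
    }

    /** Distributes records across partitions according to their blocking key. */
    static <T> List<List<T>> partition(List<T> records,
                                       Function<T, String> keyOf,
                                       int partitionCount) {
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
        for (T record : records) {
            partitions.get(partitionOf(keyOf.apply(record), partitionCount)).add(record);
        }
        return partitions;
    }
}
```

With a partition count of 4, each of the four resulting lists could be handed to its own worker thread for the blocking step.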
For each candidate pair of records, now combined into one record, the field comparisons can be implemented. This is a good place to apply the task parallelism pattern. With task parallelism, we have many different tasks we want to run on our data. Unlike divide and conquer, the tasks are not similar. In this case, each record is independent of all others, so we can actually apply both patterns: divide the data, and then run the set of comparison tasks against each partition. Figure 3 shows our field comparison design with different comparisons applied to two input fields.
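As a rough sketch of task parallelism, the code below runs several different comparators against the same pair of field values as separate asynchronous tasks and collects one score per comparator. The `FieldComparisons` class and the two toy comparators are hypothetical, and `CompletableFuture` on the common pool merely stands in for whatever scheduling a dataflow framework would provide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.ToDoubleBiFunction;

final class FieldComparisons {
    /** Exact-match comparator: 1.0 when the values are equal, otherwise 0.0. */
    static double exact(String a, String b) {
        return a.equals(b) ? 1.0 : 0.0;
    }

    /** Illustrative comparator: shared-prefix length over the longer length. */
    static double commonPrefix(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0;
        }
        int shortest = Math.min(a.length(), b.length());
        int i = 0;
        while (i < shortest && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return (double) i / longest;
    }

    /** Runs each comparator as its own task; scores come back in comparator order. */
    static double[] compareAll(String a, String b,
                               List<ToDoubleBiFunction<String, String>> comparators) {
        List<CompletableFuture<Double>> tasks = new ArrayList<>();
        for (ToDoubleBiFunction<String, String> comparator : comparators) {
            tasks.add(CompletableFuture.supplyAsync(() -> comparator.applyAsDouble(a, b)));
        }
        double[] scores = new double[tasks.size()];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = tasks.get(i).join(); // wait for this comparator's result
        }
        return scores;
    }
}
```

For a single pair of short strings the task overhead would outweigh the work; the point is only that dissimilar tasks can run side by side because they share no state.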
The field comparison results are then used for each candidate pair to determine a record score. This is done using a classification method. As with field comparisons, the pair classifications have no data dependencies on other records and so may be done completely in parallel. We'll take advantage of the task-based parallelism employed for field comparisons and tack on a pipelined record classification that's fed the output of each comparator.
Once a record has been classified, it can be filtered. We can take advantage of pipeline parallelism and tack on the filtering to the record classification.
Now that the problem has been laid out, it's time to move on to implementation and the question asked earlier about the role of Java in all of this.
Dataflow and Java
So far, we've developed a data-oriented design that fits into a dataflow paradigm very nicely. How do we implement this in Java? Well, first, why would I choose to implement this application in Java? A common misconception about Java is that it doesn't perform and doesn't scale. The argument about Java not scaling is mainly blamed on garbage collection. While that may have been true several Java versions ago, it's not so now. Java can perform well and it can scale. Take a look at Figure 4 to see a benchmark test run with a dataflow framework written in Java. The test shows that the JVM scales to use all 32 cores of the machine being tested using the Levenshtein edit distance measure (which we'll use in our matching application).
Granted, I know I may lose some edge of performance using Java. It's true that in C I'd have more control of memory and so could utilize cache line sizes and other tricks, such as processor affinity, to get better performance. But for programmer productivity, it's hard to beat Java when used with the IDEs available today, not to mention the rich libraries that are available. I'm willing to trade a few points of performance to write an application in Java over C given how much more productive I can be in Java. Java is portable, fully object-oriented, easy to code, and used widely. For all of these reasons, the dataflow framework used to implement the matching application discussed in this article is written in Java - otherwise it would have taken my team much longer than a month.
Copyright © 2008 SYS-CON Media, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
More Stories By Jim Falgout
Jim Falgout is solutions architect for Pervasive Software, where he applied dataflow principles to help architect Pervasive DataRush. He is active in the Java development community; in May of 2007, he presented a technical paper titled 'Unleashing the Power of Multi-Core Processors: Scalable Data Processing in Java Technology' at JavaOne.
Eman 04/05/08 10:33:42 AM EDT
Funny, Cos, you are pointing out how Java isn't all that "free & open" like its corp. creator claims it is... the beauty of open source + patent law = morass of bear traps. Frankly, I haven't seen any Java framework that holds a match to this DataRush thing... download and see for yourself.

Cos 03/27/08 08:05:17 PM EDT
Daah! Check US Patent 7,020,699