Crunching Big Data with Java
One Team, One Month, One JVM
By: Jim Falgout
Mar. 30, 2008 04:00 AM
A Problem of Matching
A fuzzy matching application can be broken into the following modules of functionality:
The blocking phase attempts to place similar records in the same blocks for candidate record pair generation. This is a key step, since many candidate pairs can be generated during this phase. During de-dup, the number of candidate record pairs output per block is (N * (N - 1)) / 2, where N is the number of records in the block. One common way of blocking is to use some sort of geographic data, such as parts of a postal address, as blocking keys. Any of the key fields may have an encoding, such as Soundex, applied as part of the blocking phase.

Why group into blocks in the first place? Because comparing every record to every other record is normally impractical (or at least would take an enormous amount of time). For only a few thousand records, this is not a problem. But for larger datasets, it's a huge problem. Table 1 shows the number of candidate pairs generated if all input rows are compared to all other rows. As you can see, a data explosion happens quickly. Blocking is needed to cut down the size of the groups used to generate candidate record pairs and thereby dramatically reduce the resulting number of comparisons.

The field comparison phase compares fields from the record pairs using specified comparison methods. Some common comparison methods are Levenshtein Edit Distance, Damerau-Levenshtein, Jaro, Jaro-Winkler, Q-Gram, and exact match. Each pair of fields is compared and generates a field score.

For each record pair, the field scores are then used to classify the pair as a good match, a possible match, or not a match. A possible match may need review by a human to discern the quality of the match. Classification results in a record score being produced.

Once each record pair is scored, the data can be filtered and output. The filtering is normally used to exclude candidate matches that don't meet the specified criteria (e.g., a record score below a certain threshold).

Now, let's take a data-oriented approach to this problem and see how we can start to break it down.
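To put numbers on the data explosion described above, here is a minimal sketch of the pair-count arithmetic. The class and method names are hypothetical, and the one million records split into 1,000 equal blocks are an assumed example rather than figures from the article.

```java
import java.util.List;

// Illustrative helper only; not part of the framework discussed in the article.
class CandidatePairCount {

    // Pairs produced when each of n records is compared once with every
    // other record in the same group: n * (n - 1) / 2.
    static long pairsInBlock(long n) {
        return n * (n - 1) / 2;
    }

    // Total candidate pairs when the input is first split into blocks
    // of the given sizes and pairs are generated only within a block.
    static long pairsAcrossBlocks(List<Long> blockSizes) {
        long total = 0;
        for (long size : blockSizes) {
            total += pairsInBlock(size);
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed example: one million input records, no blocking.
        System.out.println("No blocking: " + pairsInBlock(1_000_000L));

        // Same records, assumed to fall into 1,000 equal blocks of 1,000.
        long blocked = 1_000 * pairsInBlock(1_000);
        System.out.println("1,000 blocks of 1,000: " + blocked);
    }
}
```

Under these assumptions the unblocked run produces 499,999,500,000 candidate pairs, while the blocked run produces 499,500,000, roughly a thousand times fewer. Real blocks are rarely this evenly sized, so the actual saving depends on how the blocking keys distribute the data.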
Our first assumption is that the data is already cleansed and standardized, so we'll exclude that phase from our solution.

The next phase is blocking. In this phase we optionally apply encoding to blocking key fields and then join the data against itself. This is done to create candidate pairs of records for the field comparison phase. For the de-dup problem, a standard join won't work; it will create too many redundant candidate record pairs. What is needed is a "group pairs" operator that prevents generating duplicate candidates. For example, if records A, B, and C are in the same blocking group, we want to generate candidate pairs A-B, A-C, and B-C. All the other variations (A-A, B-A, B-B, ...) are redundant and shouldn't be generated.

Blocking by key fields implies that we can use hash partitioning to divide and conquer this step. As long as we hash partition on the same keys (optionally encoded), we can run the blocking operation in parallel. Figure 2 depicts this design with a partition count of 4.

For each candidate pair of records, now combined into one record, the field comparisons can be implemented. This is a good place to apply the task parallelism pattern. With task parallelism, we have many different tasks we want to run on our data. Unlike divide and conquer, the tasks are not similar. In this case, each record is independent of all others, so we can apply both patterns: divide the data, then run the set of comparison tasks against each partition. Figure 3 shows our field comparison design with different comparisons applied to two input fields.

The field comparison results are then used for each candidate pair to determine a record score. This is done using a classification method. As with field comparisons, the pair classifications have no data dependencies on other records and so may be done completely in parallel.
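The "group pairs" operator and the hash-partitioned blocking step described above can be sketched in plain Java. This is not the operator from the dataflow framework used in the article, which isn't shown here; the `Rec` type, the method names, and the use of a parallel stream as a stand-in for parallel partitions are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch only; all names here are hypothetical.
class GroupPairs {

    // Minimal record: an identifier plus an (optionally encoded) blocking key.
    static final class Rec {
        final String id;
        final String blockingKey;

        Rec(String id, String blockingKey) {
            this.id = id;
            this.blockingKey = blockingKey;
        }
    }

    // Hash partitioning: records with the same blocking key always
    // land in the same partition.
    static int partitionOf(String blockingKey, int partitionCount) {
        return Math.floorMod(blockingKey.hashCode(), partitionCount);
    }

    // Emit each unordered pair once (i < j), so A-A and B-A never appear.
    static List<String> pairsWithinBlock(List<Rec> block) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < block.size(); i++) {
            for (int j = i + 1; j < block.size(); j++) {
                pairs.add(block.get(i).id + "-" + block.get(j).id);
            }
        }
        return pairs;
    }

    // Within one partition, group by blocking key and pair up each block.
    static List<String> pairsInPartition(List<Rec> partition) {
        Map<String, List<Rec>> blocks = partition.stream()
                .collect(Collectors.groupingBy(r -> r.blockingKey));
        List<String> pairs = new ArrayList<>();
        for (List<Rec> block : blocks.values()) {
            pairs.addAll(pairsWithinBlock(block));
        }
        return pairs;
    }

    // Partition the input by hashed key, then process partitions in parallel.
    static List<String> candidatePairs(List<Rec> records, int partitionCount) {
        Map<Integer, List<Rec>> partitions = records.stream()
                .collect(Collectors.groupingBy(
                        r -> partitionOf(r.blockingKey, partitionCount)));
        return partitions.values().parallelStream()
                .flatMap(p -> pairsInPartition(p).stream())
                .collect(Collectors.toList());
    }
}
```

Given records A, B, and C with the same blocking key and a record D with a different key, `candidatePairs` yields only A-B, A-C, and B-C. Because a block never spans partitions, no coordination between partitions is needed, which is what makes the divide-and-conquer approach work here.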
We'll take advantage of the task-based parallelism employed for field comparisons and tack on a pipelined record classification that's fed the output of each comparator. Once a record has been classified, it can be filtered. We can take advantage of pipeline parallelism and tack on the filtering to the record classification. Now that the problem has been laid out, it's time to move on to implementation and the question asked earlier about the role of Java in all of this.
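As a rough illustration of the compare, classify, and filter stages, the sketch below chains them over a list of candidate pairs. Several things here are assumptions of mine, not the article's design: the field score is a length-normalized Levenshtein distance, the record score is a simple mean of field scores, the 0.90 and 0.70 thresholds are arbitrary, and a parallel stream stands in for the dataflow framework's pipelining.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only; names, scoring, and thresholds are hypothetical.
class MatchPipeline {

    enum MatchClass { MATCH, POSSIBLE_MATCH, NON_MATCH }

    // A candidate pair: the fields of two records, in the same order.
    static final class CandidatePair {
        final String[] left;
        final String[] right;

        CandidatePair(String[] left, String[] right) {
            this.left = left;
            this.right = right;
        }
    }

    // Levenshtein edit distance using two rolling rows of the DP table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Field score in [0, 1]; 1.0 means the two fields are identical.
    static double fieldScore(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / maxLen;
    }

    // Record score: mean of the field scores (an assumed classification rule).
    static double recordScore(CandidatePair pair) {
        double sum = 0;
        for (int i = 0; i < pair.left.length; i++) {
            sum += fieldScore(pair.left[i], pair.right[i]);
        }
        return sum / pair.left.length;
    }

    // Assumed thresholds: >= 0.90 match, >= 0.70 possible match.
    static MatchClass classify(double score) {
        if (score >= 0.90) return MatchClass.MATCH;
        if (score >= 0.70) return MatchClass.POSSIBLE_MATCH;
        return MatchClass.NON_MATCH;
    }

    // Score, classify, and filter each pair independently of the others.
    static List<CandidatePair> keepLikelyMatches(List<CandidatePair> pairs) {
        return pairs.parallelStream()
                .filter(p -> classify(recordScore(p)) != MatchClass.NON_MATCH)
                .collect(Collectors.toList());
    }
}
```

Since each pair is scored without reference to any other pair, the stages can be run in parallel over partitions or chained as a pipeline without changing the result. A production matcher would typically weight fields differently and mix several comparison methods per field.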
Dataflow and Java
Granted, I know I may lose some edge of performance using Java. It's true that in C I'd have more control of memory and so could exploit cache line sizes and other tricks, such as processor affinity, to get better performance. But for programmer productivity, it's hard to beat Java when used with the IDEs available today, not to mention the rich libraries that are available. I'm willing to trade a few points of performance to write an application in Java rather than C, given how much more productive I can be in Java. Java is portable, object-oriented, easy to code, and widely used. For all of these reasons, the dataflow framework used to implement the matching application discussed in this article is written in Java; otherwise it would have taken my team much longer than a month.