By Sunil Venkayala
April 7, 2005 12:00 AM EDT
A confusion matrix is a two-dimensional N x N table, where N is the number of target classes, that records the number of correct and incorrect predictions a classification model made on specific test data. It measures how well the model predicts the outcome and shows where it makes mistakes.
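As a rough illustration of the idea, the following plain-Java sketch tallies a confusion matrix from actual and predicted labels. It does not use the JDM API; the class name, method names, and sample labels are hypothetical, and it assumes every label appears in the list of classes.

```java
import java.util.Arrays;
import java.util.List;

class ConfusionMatrixSketch {
    // Rows are actual classes, columns are predicted classes.
    // Assumes every label in actual/predicted is present in classes.
    static int[][] build(List<String> classes, List<String> actual, List<String> predicted) {
        int n = classes.size();
        int[][] matrix = new int[n][n];
        for (int i = 0; i < actual.size(); i++) {
            int row = classes.indexOf(actual.get(i));
            int col = classes.indexOf(predicted.get(i));
            matrix[row][col]++;
        }
        return matrix;
    }

    // Accuracy is the sum of the diagonal divided by the number of cases.
    static double accuracy(int[][] matrix) {
        int correct = 0;
        int total = 0;
        for (int r = 0; r < matrix.length; r++) {
            for (int c = 0; c < matrix[r].length; c++) {
                total += matrix[r][c];
                if (r == c) {
                    correct += matrix[r][c];
                }
            }
        }
        return (double) correct / total;
    }

    public static void main(String[] args) {
        // Hypothetical test data for a two-class churn problem.
        List<String> classes = Arrays.asList("YES", "NO");
        List<String> actual = Arrays.asList("YES", "YES", "NO", "NO", "NO", "YES");
        List<String> predicted = Arrays.asList("YES", "NO", "NO", "NO", "YES", "YES");
        int[][] matrix = build(classes, actual, predicted);
        System.out.println(Arrays.deepToString(matrix));
        System.out.println("accuracy = " + accuracy(matrix));
    }
}
```

The diagonal cells hold the correct predictions; the off-diagonal cells show which classes the model confuses with each other.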
Lift measures how much better the prediction results are with a model than by chance. To explain lift, we'll use a product campaign example. Say that campaigning to all 100,000 existing customers results in sales of 10,000 products, a response rate of 10%. By using the mining model, say we sell 9,000 products by campaigning to only 30,000 selected customers, a response rate of 30%. The model makes the campaign three times as efficient, so the lift is 3, i.e., (9000/30000)/(10000/100000).
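The arithmetic in the campaign example can be written out as a small plain-Java sketch. The class and method names are hypothetical and only restate the ratio given above.

```java
import java.util.Locale;

class LiftSketch {
    // Response rate = number of sales / number of customers contacted.
    static double responseRate(long sales, long contacted) {
        return (double) sales / contacted;
    }

    // Lift = response rate with the model / response rate without it.
    static double lift(long modelSales, long modelContacted,
                       long baselineSales, long baselineContacted) {
        return responseRate(modelSales, modelContacted)
                / responseRate(baselineSales, baselineContacted);
    }

    public static void main(String[] args) {
        // Figures from the campaign example: 9,000 of 30,000 vs. 10,000 of 100,000.
        double value = lift(9_000, 30_000, 10_000, 100_000);
        System.out.printf(Locale.ROOT, "lift = %.2f%n", value); // prints "lift = 3.00"
    }
}
```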
Listing 6 illustrates how to test the churn model by executing the classification test task on "CHURN_TEST_TABLE". Once the task completes successfully, a classification test metrics object is created in the DME. It can be retrieved from the DME to explore the test metrics. (Listings 6-8 can be downloaded from www.sys-con.com/java/sourcec.cfm.)
Apply the Mining Model
Once the model has been evaluated, it is ready to be deployed to make predictions. JDM provides an ApplySettings interface that encapsulates the settings related to the apply operation. The apply operation produces an output table with the predictions for each case. Apply settings can be configured to produce different contents in the output table. For more details on apply settings, refer to the JDM API documentation.
In this example, we use the top prediction apply setting to produce the top prediction for each case. JDM supports two apply tasks: RecordApplyTask computes the prediction for a single record and is useful for real-time predictions, while DataSetApplyTask applies the model to a whole dataset. Here we use the DataSetApplyTask to do a batch apply that makes predictions for all the records in the "CHURN_APPLY_TABLE".
Listing 7 illustrates how to apply the "CHURN_MODEL" on "CHURN_APPLY_TABLE" to produce an output table "CHURN_APPLY_RESULTS" that will have the predicted churn value "YES or NO" for each customer.
After the apply task completes, a "CHURN_APPLY_RESULTS" table will be created with two columns, "CUSTOMER_ID" and "PREDICTED_CHURN". The probability associated with each prediction can be obtained by specifying it in the ApplySettings.
Here the mapTopPrediction method is used to map the top prediction value to the column name. The source destination map is used to carry over some of the columns from the input table to the apply-output table along with the prediction columns. In this case, the "CUSTOMER_ID" column is carried over from the apply-input table to the output table. JDM specifies many other output formats so applications can generate the apply-output table in the required format. A discussion of all the available options is beyond the scope of this article.
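To make the notion of a "top prediction" concrete, here is a minimal plain-Java sketch that picks the most probable class for each case and carries the case id over to the output. It is not JDM code and does not reproduce Listing 7; the class name, method names, customer ids, and probabilities are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TopPredictionSketch {
    // Returns the class label with the highest probability for one case.
    static String topPrediction(Map<String, Double> classProbabilities) {
        String best = null;
        double bestProbability = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : classProbabilities.entrySet()) {
            if (entry.getValue() > bestProbability) {
                best = entry.getKey();
                bestProbability = entry.getValue();
            }
        }
        return best;
    }

    // Builds one output row per case: case id -> predicted class.
    static Map<String, String> applyAll(Map<String, Map<String, Double>> scoredCases) {
        Map<String, String> output = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Double>> scored : scoredCases.entrySet()) {
            output.put(scored.getKey(), topPrediction(scored.getValue()));
        }
        return output;
    }

    // Convenience helper for a two-class churn problem.
    static Map<String, Double> probabilities(double yes, double no) {
        Map<String, Double> p = new LinkedHashMap<>();
        p.put("YES", yes);
        p.put("NO", no);
        return p;
    }

    public static void main(String[] args) {
        // Hypothetical scored cases keyed by customer id.
        Map<String, Map<String, Double>> scored = new LinkedHashMap<>();
        scored.put("C001", probabilities(0.82, 0.18));
        scored.put("C002", probabilities(0.10, 0.90));
        System.out.println("CUSTOMER_ID PREDICTED_CHURN");
        for (Map.Entry<String, String> row : applyAll(scored).entrySet()) {
            System.out.println(row.getKey() + " " + row.getValue());
        }
    }
}
```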
Figure 2 summarizes the JDM data mining process flow followed in this example.
Market Basket Analysis Example
To explain the use of unsupervised data mining in a practical scenario, we'll use one of the most popular data mining problems: market basket analysis.
The purpose of market basket analysis is to determine what products customers buy together. Knowing what products people buy together can be helpful to traditional retailers and web stores like Amazon.
The information can be used to design store layouts, web page designs, and catalog designs by keeping all cross-sell and up-sell products together. It can also be used in product promotions like discounts for cross-sell or up-sell products. Direct marketers can use basket analysis results to decide what new products to offer their prior customers.
To do market basket analysis, it's necessary to list the transactions customers made. Sometimes customer demographics and promotion/discount details are used to infer rules related to demographics and promotions. Here we use five transactions at a pizza store. For simplicity's sake, we'll ignore the demographics and promotion/discount details.
Transaction 1: Pepperoni Pizza, Diet Coke, Buffalo wings
Transaction 2: Buffalo wings, Diet Coke
Transaction 3: Pepperoni Pizza, Diet Coke
Transaction 4: Diet Coke, French Fries
Transaction 5: Diet Coke, Buffalo wings
The first step is to transform the transaction data above into a transactional format, i.e., a table with transaction id and product name columns. The table will look like Table 2. Only the items purchased are listed.
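The transformation into a transactional format can be sketched in plain Java as follows. The class name, the Row type, and its field names are hypothetical; the sketch only flattens the five baskets above into (transaction id, product name) rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TransactionalFormatSketch {
    // One row of the transactional table: (transaction id, product name).
    static final class Row {
        final int transactionId;
        final String product;

        Row(int transactionId, String product) {
            this.transactionId = transactionId;
            this.product = product;
        }
    }

    // Emits one row per purchased item; items not purchased produce no row.
    static List<Row> toRows(Map<Integer, List<String>> baskets) {
        List<Row> rows = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> basket : baskets.entrySet()) {
            for (String product : basket.getValue()) {
                rows.add(new Row(basket.getKey(), product));
            }
        }
        return rows;
    }

    // The five pizza store transactions from the example.
    static Map<Integer, List<String>> pizzaStoreBaskets() {
        Map<Integer, List<String>> baskets = new LinkedHashMap<>();
        baskets.put(1, Arrays.asList("Pepperoni Pizza", "Diet Coke", "Buffalo wings"));
        baskets.put(2, Arrays.asList("Buffalo wings", "Diet Coke"));
        baskets.put(3, Arrays.asList("Pepperoni Pizza", "Diet Coke"));
        baskets.put(4, Arrays.asList("Diet Coke", "French Fries"));
        baskets.put(5, Arrays.asList("Diet Coke", "Buffalo wings"));
        return baskets;
    }

    public static void main(String[] args) {
        for (Row row : toRows(pizzaStoreBaskets())) {
            System.out.println(row.transactionId + "\t" + row.product);
        }
    }
}
```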
An association function is used for market basket analysis. An association model extracts rules and reports the support and confidence of each rule. The user can specify the minimum support, minimum confidence, and maximum rule length as build settings before building the model.
Since we have only five transactions, we'll build a model to extract all the possible rules by specifying minimum support as 0.1, minimum confidence as 0.51, and no maximum limit for the rule length. This model produces five rules (see Table 3).
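For a dataset this small, the rules can be enumerated by brute force. The plain-Java sketch below is not how a JDM implementation builds an association model, and its class and method names are hypothetical; it simply tries every itemset and every split into condition and result, keeping the rules that meet the two thresholds. With minimum support 0.1 and minimum confidence 0.51 it finds five rules, which agrees with the count reported above, although it makes no attempt to reproduce the rule numbering of Table 3.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class RuleEnumerationSketch {
    static Set<String> basket(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // The five pizza store transactions from the example.
    static final List<Set<String>> TRANSACTIONS = Arrays.asList(
            basket("Pepperoni Pizza", "Diet Coke", "Buffalo wings"),
            basket("Buffalo wings", "Diet Coke"),
            basket("Pepperoni Pizza", "Diet Coke"),
            basket("Diet Coke", "French Fries"),
            basket("Diet Coke", "Buffalo wings"));

    // Number of transactions that contain every item in the given set.
    static int count(Set<String> items) {
        int n = 0;
        for (Set<String> transaction : TRANSACTIONS) {
            if (transaction.containsAll(items)) {
                n++;
            }
        }
        return n;
    }

    // Selects the items whose bit is set in the mask.
    static Set<String> pick(List<String> items, int mask) {
        Set<String> picked = new TreeSet<>();
        for (int i = 0; i < items.size(); i++) {
            if ((mask & (1 << i)) != 0) {
                picked.add(items.get(i));
            }
        }
        return picked;
    }

    static List<String> rules(double minSupport, double minConfidence) {
        Set<String> distinct = new TreeSet<>();
        for (Set<String> transaction : TRANSACTIONS) {
            distinct.addAll(transaction);
        }
        List<String> items = new ArrayList<>(distinct);
        List<String> found = new ArrayList<>();
        int total = TRANSACTIONS.size();
        for (int itemset = 1; itemset < (1 << items.size()); itemset++) {
            int both = count(pick(items, itemset));
            if (both == 0 || (double) both / total < minSupport) {
                continue;
            }
            // Each non-empty proper subset of the itemset is a candidate condition.
            for (int cond = (itemset - 1) & itemset; cond > 0; cond = (cond - 1) & itemset) {
                double confidence = (double) both / count(pick(items, cond));
                if (confidence >= minConfidence) {
                    found.add(pick(items, cond) + " -> " + pick(items, itemset & ~cond));
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        for (String rule : rules(0.1, 0.51)) {
            System.out.println(rule);
        }
    }
}
```

Brute force is exponential in the number of distinct items, so it is only workable for toy data like this.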
In a typical scenario, you may have millions of transactions with thousands of products, so understanding the support and confidence measures and how these are calculated provides good insight into which rules need to be selected for a business problem.
Support is the fraction of records that contain the item combination, relative to the total number of records. For example, take Rule 1, which says, "If Buffalo wings are purchased then Diet Coke will also be purchased." To calculate the support for this rule, we need to know how many of the five transactions conform to the rule. Three transactions, i.e., 1, 2, and 5, conform to it, so the support for this rule is 3/5 = 0.6.
Confidence of an association rule is the support for the combination divided by the support for the condition. For Rule 1, Buffalo wings appear in three of the five transactions and all three also contain Diet Coke, so the confidence is 0.6/0.6 = 1.0. Support alone gives an incomplete measure of the quality of an association rule. If you compare Rule 1 with Rule 5, both of them have the same support, i.e., 0.6, because support is not directional. Confidence is directional, and that makes Rule 1 a better rule than Rule 5.
Maximum rule length limits the number of items in a rule. When there are thousands of items/products and millions of transactions, rules get complex and lengthy, so this setting is used to keep the rules in a model short.
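The two measures can be checked with a short plain-Java sketch over the five transactions. The class and method names are hypothetical. Because Table 3 is not reproduced here, the sketch assumes that Rule 5 is the reverse of Rule 1, i.e., "If Diet Coke is purchased then Buffalo wings will also be purchased."

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SupportConfidenceSketch {
    static Set<String> items(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    // The five pizza store transactions from the example.
    static final List<Set<String>> TRANSACTIONS = Arrays.asList(
            items("Pepperoni Pizza", "Diet Coke", "Buffalo wings"),
            items("Buffalo wings", "Diet Coke"),
            items("Pepperoni Pizza", "Diet Coke"),
            items("Diet Coke", "French Fries"),
            items("Diet Coke", "Buffalo wings"));

    // Fraction of transactions that contain every item in the combination.
    static double support(Set<String> combination) {
        int matches = 0;
        for (Set<String> transaction : TRANSACTIONS) {
            if (transaction.containsAll(combination)) {
                matches++;
            }
        }
        return (double) matches / TRANSACTIONS.size();
    }

    // Support of (condition + result) divided by support of the condition.
    static double confidence(Set<String> condition, Set<String> result) {
        Set<String> combination = new HashSet<>(condition);
        combination.addAll(result);
        return support(combination) / support(condition);
    }

    public static void main(String[] args) {
        Set<String> wings = items("Buffalo wings");
        Set<String> coke = items("Diet Coke");
        // Support is the same in both directions.
        System.out.println("support = " + support(items("Buffalo wings", "Diet Coke")));
        // Confidence depends on which side is the condition.
        System.out.println("wings -> coke confidence = " + confidence(wings, coke));
        System.out.println("coke -> wings confidence = " + confidence(coke, wings));
    }
}
```

Under that assumption, the wings-to-coke rule has confidence 3/3 while the reverse rule has confidence 3/5, which is why the first is the stronger rule despite identical support.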
Using JDM to Solve the Market Basket Problem
So how does one use the JDM API to build an association rules model and extract the appropriate rules from the model?
Typically, data for association rules will be in a transactional format. A transactional format table has three columns: "case id", "attribute name", and "attribute value".
In JDM, the transactional format data can be described by using the PhysicalAttributeRole enumeration. The AssociationSettings interface is used to specify the build settings for association rules. It has minimum support, minimum confidence, and maximum rule-length settings that can be used to control the size of the association rules model.
Listing 8 illustrates building a market-basket analysis model using the JDM association function and exploring the rules from the model using rule filters.
Conclusion
The use of data mining to solve business problems is on the upswing. JDM provides a standard Java interface for developing vendor-neutral data-mining applications. JDM supports common data-mining operations, as well as the creation, persistence, access, and maintenance of the metadata supporting mining activities. Oracle has initiated a new JSR, JSR-247, to work on new features for a future version of the JDM standard.
Copyright © 2005 SYS-CON Media, Inc. — All Rights Reserved.
More Stories By Sunil Venkayala
Sunil Venkayala is a J2EE and XML group leader and principal member of technical staff at Oracle Data Mining Technologies group. He is an expert group member of the Java Data Mining (JDM) standard developed under JSR-73. Sunil has more than five years of experience in developing applications using predictive technologies available in the Oracle Database. He has more than seven years of experience working with Java and Internet technologies.