Simplify Pattern Matching
Use java.util.regex

Pattern matching with regular expressions can automate many text-processing operations, such as search and replace, input validation, text conversion, and filtering. Work that would otherwise require a significant amount of code can be done in a few lines, because the underlying regular expression engine does the heavy lifting. Programming languages such as Perl and operating system utilities such as grep have supported regular expressions for many years. Before J2SE 1.4, however, the Java SDK didn't, and developers had to rely on external packages such as Jakarta Regexp or IBM's commercial package (com.ibm.regex). That changed with the introduction of the java.util.regex package, which provides a standard implementation for specifying and handling regular expressions. This article shows how to use it to implement pattern-based search features. It starts by reviewing some important regular expression fundamentals and then dives into the details of the package. The embedded examples demonstrate the important constructs through simple use cases.

What's a Regular Expression and Why It's Important
If you've used regular expressions in other languages, the following sections will introduce you to the Java flavor and help uncover some of the new features. If you're not familiar with regular expressions, you'll soon discover how to use them effectively to handle text processing in ways you never thought possible before.

A regular expression is a mechanism for specifying a textual pattern and detecting the presence of that pattern in a given character sequence. In other words, it's a pattern language. A regular expression pattern is typically a combination of two types of characters: literals and meta-characters. Literals are normal text characters (a, b, c, 1, 2), while meta-characters (for example, * and $) convey a special meaning to the regular expression engine and are discussed in the next few sections. The regular expression engine understands the pattern language: it interprets the regular expression, performs the pattern match, and processes the results. Together, the language and the engine make regular expressions a powerful tool that simplifies pattern matching. Implementations such as java.util.regex and JRegex also provide query and utility functions (replace, split, etc.) that are useful for modifying the target text. For details about other Java implementations and implementations available in other languages, please consult the references section.
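To make the literal/meta-character distinction and the utility functions concrete, here is a minimal sketch using java.util.regex. The class name RegexBasicsDemo, its helper methods, and the sample strings are illustrative assumptions, not part of the article's listings.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample text are illustrative only.
class RegexBasicsDemo {

    // True if the pattern occurs anywhere in the text.
    static boolean containsPattern(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    // Replace every run of whitespace with a single space.
    static String collapseWhitespace(String text) {
        return Pattern.compile("\\s+").matcher(text).replaceAll(" ");
    }

    // Split on commas, ignoring any whitespace around them.
    static String[] splitOnCommas(String text) {
        return Pattern.compile("\\s*,\\s*").split(text);
    }

    public static void main(String[] args) {
        // 'a' and 'c' are literals; '*' is a meta-character (zero or more 'b').
        System.out.println(containsPattern("ab*c", "xxabbbcxx"));
        System.out.println(collapseWhitespace("follow   the \t white  rabbit"));
        System.out.println(Arrays.toString(splitOnCommas("Neo , Trinity,Morpheus")));
    }
}
```

Compiling a Pattern once and reusing it is usually preferable when the same expression is applied to many inputs; the helpers above compile on each call only to keep the sketch short.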

Meta-Characters
Meta-characters give regular expressions their expressive power. I will discuss a frequently used subset of the meta-characters that Java supports. For a complete list, please consult Sun's API documentation (class java.util.regex.Pattern). A number of examples that use these meta-characters immediately follow this discussion.

Anchors
An anchor matches a predefined position in the target text rather than a character. Anchors act as reference points and are used to determine the relative positions of other elements in the regular expression. They are typically used to match the boundary positions of a string, line, or word, although they can also match any other position using the special "Lookaround" constructs shown in Listing 1. The Lookaround constructs match a position based on a given condition. A positive lookahead (?=Neo) matches a position that's immediately followed by the text 'Neo', whereas a negative lookahead (?!Neo) matches a position that is not immediately followed by the text 'Neo'. Lookbehind constructs (positive (?<=...), negative (?<!...)) work the same way in the opposite direction: they test the text immediately preceding the position.
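The lookaround behavior described above can be sketched as follows. The class LookaroundDemo, the helper matchStarts, and the sample sentence are assumptions made for illustration; only the Pattern and Matcher calls come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample text are illustrative only.
class LookaroundDemo {

    // Returns the start index of every match of regex in text.
    static List<Integer> matchStarts(String regex, String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "Mr. Anderson, Mr. Neo";

        // Positive lookahead: "Mr. " only where it is immediately followed by "Neo".
        System.out.println(matchStarts("Mr\\. (?=Neo)", text));   // [14]

        // Negative lookahead: "Mr. " only where it is NOT immediately followed by "Neo".
        System.out.println(matchStarts("Mr\\. (?!Neo)", text));   // [0]

        // Positive lookbehind: "Neo" only where it is immediately preceded by "Mr. ".
        System.out.println(matchStarts("(?<=Mr\\. )Neo", text));  // [18]

        // Negative lookbehind: "Neo" only where it is NOT preceded by "Mr. ".
        System.out.println(matchStarts("(?<!Mr\\. )Neo", text));  // []
    }
}
```

Note that the lookaround itself consumes no characters: in the first two cases the match is just "Mr. ", and the text inside the lookahead only decides whether that position qualifies.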

Character Classes, Class Shorthands and Alternation
A character class construct [...] specifies a list of characters, any one of which may appear at that position in the match, while the construct [^...] specifies a list of characters that must not appear there. In the case of [...], a match is successful if any one of the listed characters is found. For example, the regular expression [cw]ould matches the words 'could' and 'would'. The class notation implies a logical OR condition between its elements. The more general form of this OR condition is "Alternation", written (x|y), where matching either x or y is considered a success. The earlier regular expression could therefore also be written as (c|w)ould.

Inside a class, the special meta-character - specifies a range of values, so the class [a-z] matches any letter from a through z. A class shorthand is a compact notation for a commonly used class, such as digit (\d), word character (\w), and whitespace (\s). A list of the class shorthands available in Java is shown in Listing 1.
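A short sketch of character classes, negated classes, alternation, and class shorthands follows. The class CharacterClassDemo and the helper fullMatch are hypothetical names; Pattern.matches() is the standard static method that tests whether the entire input matches the expression.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample words are illustrative only.
class CharacterClassDemo {

    // True only if the ENTIRE input matches the regular expression.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Character class: 'c' or 'w' followed by "ould".
        System.out.println(fullMatch("[cw]ould", "could"));    // true
        System.out.println(fullMatch("[cw]ould", "would"));    // true
        System.out.println(fullMatch("[cw]ould", "should"));   // false

        // Alternation expresses the same condition.
        System.out.println(fullMatch("(c|w)ould", "would"));   // true

        // Negated class: any single character except 'c' or 'w'.
        System.out.println(fullMatch("[^cw]ould", "mould"));   // true
        System.out.println(fullMatch("[^cw]ould", "would"));   // false

        // Range and shorthands: two digits, one whitespace, one or more word characters.
        System.out.println(fullMatch("[a-z]+", "zion"));               // true
        System.out.println(fullMatch("\\d\\d\\s\\w+", "42 apples"));   // true
    }
}
```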

Quantifiers
Quantifiers indicate how many instances of the element they are applied to are required for a successful match. Java supports three quantifier types: greedy, reluctant, and possessive. Greedy quantifiers try to match as much as possible, while their reluctant counterparts (with ? appended) try to match the least required to fulfill a match. In practice, a greedy quantifier first consumes as much of the input as it can and then backtracks, giving characters back one at a time, until the rest of the pattern matches. That backtracking can turn into real performance overhead when the target text is big. Reluctant (or lazy) quantifiers start with the minimum and extend the match only as far as needed, so they stop at the first point where the overall match succeeds. Possessive quantifiers (with + appended) consume as much as possible, like greedy quantifiers, but never give anything back. Because they don't keep the prior match states around, they are useful for optimizing match operations, although they can cause a match to fail where a greedy quantifier would succeed. Listing 1 details all three types of quantifiers.
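The difference between the three quantifier types is easiest to see on the same input. The sketch below is an illustration under assumed names (QuantifierDemo, firstMatch) and an assumed sample string; it is not one of the article's listings.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample text are illustrative only.
class QuantifierDemo {

    // Returns the first substring matched by regex, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "<b>one</b> and <b>two</b>";

        // Greedy: .* runs to the end of the input, then backtracks to the LAST </b>.
        System.out.println(firstMatch("<b>.*</b>", text));   // <b>one</b> and <b>two</b>

        // Reluctant: .*? grows one character at a time and stops at the FIRST </b>.
        System.out.println(firstMatch("<b>.*?</b>", text));  // <b>one</b>

        // Possessive: .*+ runs to the end and never gives anything back,
        // so the trailing </b> can never match.
        System.out.println(firstMatch("<b>.*+</b>", text));  // null

        // A possessive quantifier still succeeds when no backtracking is needed.
        System.out.println(firstMatch("\\d++[a-z]", "123abc")); // 123a
    }
}
```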

Mode Modifiers
Mode modifiers are special constructs that turn certain powerful regex features 'on' or 'off.' These features are 'off' by default since they involve additional overhead when doing a match. Using (?i) in a regular expression, for example, turns on case-insensitive matching. Java also supports specifying the modes when the pattern is compiled, by passing the static final flag constants of the class java.util.regex.Pattern (such as Pattern.CASE_INSENSITIVE) to Pattern.compile(). The Pattern class is discussed below in the java.util.regex section.
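The two ways of switching on a mode can be compared side by side. This is a minimal sketch; the class ModeModifierDemo, the helper found, and the sample text are assumptions for illustration, while Pattern.CASE_INSENSITIVE and Pattern.COMMENTS are standard flag constants.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample text are illustrative only.
class ModeModifierDemo {

    // True if the compiled pattern occurs anywhere in the text.
    static boolean found(Pattern pattern, String text) {
        return pattern.matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Wake up, NEO";

        // Default mode: matching is case-sensitive.
        System.out.println(found(Pattern.compile("neo"), text));        // false

        // Inline mode modifier inside the expression.
        System.out.println(found(Pattern.compile("(?i)neo"), text));    // true

        // Equivalent flag passed to Pattern.compile().
        System.out.println(found(Pattern.compile("neo", Pattern.CASE_INSENSITIVE), text)); // true

        // Flags combine with |. COMMENTS mode ignores whitespace in the
        // expression and treats # to end of line as a comment.
        Pattern commented = Pattern.compile(
                "neo   # the name to look for",
                Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);
        System.out.println(found(commented, text));                     // true
    }
}
```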

Example 1: Input Validation
Let's now review an example that uses the meta-characters discussed so far to address the password validation needs at Zion. The security standards set at Zion Corporation require that passwords contain only alphanumeric characters, include at least one digit, and be between six and 32 characters long.

Listings 2 and 3 show two possible solutions to the same problem. The first approach (Listing 2) uses the built-in regular expression support inside the java.lang.String matches() method. The second approach (Listing 3) uses the classes provided by the java.util.regex package. The underlying mechanics are the same in either case and are discussed next. I'll leave the API specifics to the next section.

Let's see how the solution meets the specified requirements. The regular expression pattern on line 3 (Listing 2) is the same as the Pattern pContent (line 5, Listing 3). The pattern uses a combination of meta-characters, namely the character class [a-z], a class shorthand (\d, shorthand for the character class [0-9]), and greedy quantifiers (*, +). In the context of the solution, the pattern "\\b(?i)([a-z]*\\d+[a-z]*)\\b" succeeds if, between the word boundaries, there are zero or more letters followed by one or more digits followed by zero or more letters. The mode modifier (?i) indicates that the search is case-insensitive. Notice that there are a couple of differences between the regular expressions in the two listings. The obvious one is the use of comments in Listing 3. The other difference is more subtle but important. Did you find it? Check out the next section (Capturing, Grouping) to verify the answer.

The pattern on line 4 (Listing 2) addresses the password-length requirement, using the {min,max} quantifier, which imposes minimum and maximum limits on the number of successful matches. In this case the pattern "\\b(?i)([a-z0-9]){6,32}\\b" succeeds if there are between six and 32 alphanumeric characters between the word boundaries. Notice that in Listing 3 the case-insensitive option is specified using a final variable of the class Pattern, which makes the expression more readable. The variables are discussed further in the following sections.
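The two approaches can be sketched as follows. This is not a reproduction of Listings 2 and 3; it is a minimal illustration that reuses the two pattern strings quoted above, and the class name PasswordValidator, its method names, and the sample passwords are assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; not the article's Listing 2 or Listing 3.
class PasswordValidator {

    // Pattern strings as quoted in the text.
    private static final String CONTENT = "\\b(?i)([a-z]*\\d+[a-z]*)\\b";
    private static final String LENGTH = "\\b(?i)([a-z0-9]){6,32}\\b";

    // Same rules, precompiled once, with the mode given as a flag
    // instead of the inline (?i) modifier.
    private static final Pattern P_CONTENT =
            Pattern.compile("\\b([a-z]*\\d+[a-z]*)\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern P_LENGTH =
            Pattern.compile("\\b([a-z0-9]){6,32}\\b", Pattern.CASE_INSENSITIVE);

    // Approach 1: String.matches(), which tests the whole string
    // and recompiles the expression on every call.
    static boolean isValidWithString(String password) {
        return password.matches(CONTENT) && password.matches(LENGTH);
    }

    // Approach 2: java.util.regex classes with reusable compiled patterns.
    static boolean isValidWithPattern(String password) {
        return P_CONTENT.matcher(password).matches()
                && P_LENGTH.matcher(password).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"Trinity99", "abc12", "abcdefgh", "neo_1234", "a1b2c3d4"};
        for (String s : samples) {
            System.out.println(s + " -> " + isValidWithString(s)
                    + " / " + isValidWithPattern(s));
        }
    }
}
```

One consequence of the content pattern as written is worth knowing: it accepts the digits only as a single contiguous run, so a password such as "a1b2c3d4" is rejected even though it is alphanumeric and contains digits.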

About Anant Athale
Anant Athale is a senior software engineer at Motorola Labs. He specializes in enterprise Java technologies and is an active participant in the Java Community Process (JSR 262, 260). He is Sun certified and has a master's degree from Arizona State University.
