By Ken Blackwell
September 14, 2000 12:00 AM EDT
In my last article (XML-J, Vol. 1, issue 3) I made the case for using custom classes derived from XML Schemas to represent XML documents in C++ applications. That article focused primarily on the problems of generating XML documents from program objects, and explained how custom classes have significant advantages over standards like DOM and SAX in terms of performance, object orientation and maintainability of source code.
Here I'll describe a unique methodology for parsing XML data into C++ classes that provides all the object-oriented benefits detailed in the first article, with increased performance (compared to traditional generic XML parsers).
The Problem with Conventional Parsers
C++ programmers have been dealing with parsing technologies for years. Most of you remember writing simple language parsers in school, and you probably built the basic syntax parser with tools like Lex and Yacc. So, for C++ developers, the idea of a syntax parser isn't especially intimidating.
The basic grammar for XML is pretty simple compared to that of a programming language like C++ or Java, but one problem unique to XML parsing is daunting: unlike conventional programming languages, XML doesn't have a fixed set of tags (i.e., keywords). Imagine trying to develop a general-purpose grammar for a programming language with a user-defined set of keywords!
To solve the general problem of XML parsing, it's necessary to build a parser that can be dynamically fed a list of tags and rules for the specific dialect of XML to be parsed. In the terminology of XML standards, that means specifying an XML Schema file to a DOM parser so that it knows how to parse and validate the specific dialect of the input XML file.
If an application reads and writes a variety of dialects of XML documents, the DOM model is appropriate because it doesn't require source code changes for incremental support for a new dialect of XML. This is typically the case for integration broker applications, as described in my last article, in which the broker is reading, transforming and forwarding all kinds of XML documents within and between organizations.
However, as I also described, there's a large class of applications in which only a few types of XML are spoken and these don't often change. For these, the overhead of DOM and the lack of application-specific object orientation is a major drawback.
Static Parsers Derived from XML Schemas
Just as it's beneficial in some environments to derive C++ classes from XML Schemas for writing XML documents, it can also be beneficial to derive classes to read XML documents from schemas.
The typical process for creating a language parser in C++ is to hand-code the Lex rules and Yacc grammar, then generate the lexer and parser from these XML dialect-specific input files (see Figure 1).
This process is tedious, however, and must be redone for each dialect of XML that your application needs to parse. It's doable, but the same logic that you'd hand-code in the rules and grammar is already encapsulated in the XML Schema file. A more efficient approach is to develop a translation program that converts the XML Schema file into the equivalent Lex rules and Yacc grammar for the XML dialect (see Figure 2).
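To make the translation step concrete, here is a minimal sketch of what such a generator might look like. It is not the tool used for the example project: the ElementDecl structure, the token naming scheme and the function names are all assumptions made for illustration, and it ignores attributes, whitespace inside tags, optional or repeated children, and everything else a real schema can express.

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, simplified view of one element declaration from a schema:
// an element name and the ordered child elements it must contain.
struct ElementDecl {
    std::string name;
    std::vector<std::string> children;  // empty means text-only content
};

// Upper-case an element name to form a token prefix, e.g. "cpu" -> "CPU".
std::string TokenName(const std::string& name) {
    assert(!name.empty());  // an element declaration must have a name
    std::string out;
    for (unsigned char c : name) out += static_cast<char>(std::toupper(c));
    return out;
}

// Emit one Lex rule for each start tag and each end tag.
std::string EmitLexRules(const std::vector<ElementDecl>& decls) {
    std::ostringstream os;
    for (const auto& d : decls) {
        os << "\"<" << d.name << ">\"  { return " << TokenName(d.name) << "_START; }\n";
        os << "\"</" << d.name << ">\" { return " << TokenName(d.name) << "_END; }\n";
    }
    return os.str();
}

// Emit one Yacc production per element: start token, content, end token.
// TEXT is an assumed token for character data.
std::string EmitYaccRules(const std::vector<ElementDecl>& decls) {
    std::ostringstream os;
    for (const auto& d : decls) {
        os << d.name << " : " << TokenName(d.name) << "_START";
        if (d.children.empty()) os << " TEXT";
        for (const auto& c : d.children) os << ' ' << c;
        os << ' ' << TokenName(d.name) << "_END ;\n";
    }
    return os.str();
}
```

Given a declaration for an element pc with a single child cpu, EmitYaccRules() would produce the production "pc : PC_START cpu PC_END ;". The point of the sketch is only that the tag names and nesting rules come straight from the schema, so nothing has to be typed by hand when the dialect changes.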
The example project in Listing 1 shows a generated grammar for a sample XML DTD file called acmepc.dtd. You'll see the generated Yacc input in acmepcxml_parser.y and the Lex input in acmepcxml_lexer.l. All the classes and parser for this project are contained in the C++ namespace acmepcxml.
Using the generated custom parser is simple. Just create an instance of the acmepcxml::XMLImporter class, initialize it with its Initialize() member and import the XML data into the schema-derived classes with the ImportFromFile() member. The importer exposes a base class root node of the class tree via the GetXObject() member. This base class is then dynamically cast back to the acmepc class that contains the context of the specific XML dialect defined by the acmepc.dtd schema (see Listing 1).
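The calling pattern just described can be sketched as follows. Only the names acmepcxml, XMLImporter, Initialize(), ImportFromFile(), GetXObject() and acmepc come from the text. The signatures, the XObject base class name, the sourceName member and the stub bodies are assumptions added so the sketch compiles on its own; the generated code in Listing 1 may well differ.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

namespace acmepcxml {

// Hypothetical base class for all schema-derived classes.
class XObject {
public:
    virtual ~XObject() = default;
};

// Stand-in for the class generated for the root element of acmepc.dtd.
class acmepc : public XObject {
public:
    std::string sourceName;  // illustrative member only
};

// Stand-in for the generated importer. The signatures and bodies are assumed.
class XMLImporter {
public:
    bool Initialize() { initialized_ = true; return true; }

    // The real member would run the Lex/Yacc-generated parser over the file.
    // This stub only creates a root object so the calling pattern can be shown.
    bool ImportFromFile(const std::string& path) {
        if (!initialized_) return false;
        std::unique_ptr<acmepc> root(new acmepc);
        root->sourceName = path;
        root_ = std::move(root);
        return true;
    }

    // Exposes the root of the class tree as a base class pointer.
    XObject* GetXObject() const { return root_.get(); }

private:
    bool initialized_ = false;
    std::unique_ptr<XObject> root_;
};

}  // namespace acmepcxml

// Initialize, import, then downcast the root node to the dialect's class.
acmepcxml::acmepc* LoadAcmePc(acmepcxml::XMLImporter& importer,
                              const std::string& path) {
    if (!importer.Initialize()) return nullptr;
    if (!importer.ImportFromFile(path)) return nullptr;
    // dynamic_cast yields nullptr if the root is not an acmepc instance.
    return dynamic_cast<acmepcxml::acmepc*>(importer.GetXObject());
}

// Example use; the file name is a placeholder.
void LoadAcmePcExample() {
    acmepcxml::XMLImporter importer;
    acmepcxml::acmepc* pc = LoadAcmePc(importer, "acmepc.xml");
    assert(pc != nullptr);
}
```

Checking the result of the dynamic_cast matters in practice: if the document's root element belongs to a different dialect, the cast returns a null pointer rather than a usable object.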
Advantages of Custom Parser Approach
There are four primary advantages to creating a custom parser rather than using a generic DOM parser.
- First and foremost, it's fast. I've run benchmarks that show the custom parser to be up to three times faster than the fastest DOM parser I could find, while also having a smaller in-memory footprint. The primary reason it's so much faster than DOM seems to be that it doesn't have to do dynamic validation of the XML input. Instead, validation is enforced by the automata generated by Yacc from the input files, which are derived from the XML Schema.
- The generated parser can integrate tightly with the derived classes described in my previous article. There is no two-step process of parsing into the DOM hierarchy, then populating classes from the DOM data structures. The custom parser creates the schema-derived classes directly, without the need for the intermediate step. The generated parser can also integrate tightly with framework technologies you might be using, such as STL and MFC class libraries.
- You get all the source code to the components that link into your application. By using the freely available Flex and Bison tools, the output source code will run on virtually every operating system imaginable. I've been very successful, for example, in running Flex and Bison on Windows NT and using the output C/C++ code on a variety of platforms with no source code changes necessary.
- The final advantage, and the coolest of all, is that using Lex and Yacc enables you to handle those pesky XML entities more easily. I use this feature to automatically expand entities on input so my program doesn't have to worry about them. XML entities can be preprocessed just as a macro is preprocessed by a compiler when parsing a C input file. The class instances created by the custom parser contain data with entity references fully expanded. I can't stress enough the number of headaches this little feature can save you when dealing with documents with lots of entities.
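The entity expansion mentioned in the last point can be illustrated with a small standalone function. In the approach described above the expansion happens inside the generated lexer; the sketch below is not that code. It is a simplified stand-in that makes one pass over a string and replaces references found in a lookup table. The function names and the vendor entity are hypothetical, and the sketch does not handle numeric character references or entities whose replacement text contains further references.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

using EntityTable = std::map<std::string, std::string>;

// The five entities predefined by XML; DTD-declared ones can be added later.
EntityTable DefaultEntities() {
    return {{"lt", "<"}, {"gt", ">"}, {"amp", "&"}, {"apos", "'"}, {"quot", "\""}};
}

// Replace each &name; whose name is in the table. Anything else that starts
// with '&' is copied through unchanged.
std::string ExpandEntities(const std::string& text, const EntityTable& entities) {
    std::string out;
    std::size_t pos = 0;
    while (pos < text.size()) {
        const std::size_t amp = text.find('&', pos);
        if (amp == std::string::npos) {
            out.append(text, pos, std::string::npos);  // no more references
            break;
        }
        out.append(text, pos, amp - pos);  // copy the text before the '&'

        // Scan a candidate name: letters, digits, '_', '-', '.' and ':'.
        std::size_t end = amp + 1;
        while (end < text.size() &&
               (std::isalnum(static_cast<unsigned char>(text[end])) ||
                text[end] == '_' || text[end] == '-' ||
                text[end] == '.' || text[end] == ':')) {
            ++end;
        }
        const std::string name = text.substr(amp + 1, end - amp - 1);
        const auto it = entities.find(name);
        if (end < text.size() && text[end] == ';' && it != entities.end()) {
            out += it->second;  // known entity: emit its replacement text
            pos = end + 1;
        } else {
            out += '&';         // not a known reference: keep the '&' as is
            pos = amp + 1;
        }
    }
    return out;
}

// Example: a DTD-declared entity (hypothetical name) plus predefined ones.
void ExpandEntitiesExample() {
    EntityTable table = DefaultEntities();
    table["vendor"] = "Acme PC Corp.";
    assert(ExpandEntities("&vendor; &lt;sales&gt;", table) == "Acme PC Corp. <sales>");
}
```

Because the expansion happens before the data reaches the application classes, the rest of the program only ever sees the replacement text, which is the benefit the last bullet describes.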
While XML processing may be new to the C++ community, the skills and technologies that have matured over the last decade in this community can still be very useful in handling XML data formats. In my last article I described the benefits of deriving C++ class definitions from XML Schemas. Here, I've gone a bit further to show how to derive parser grammars for XML dialects from the XML Schema.
As the XML Schema standard nears acceptance, there will be many other opportunities to reuse the work of schema designers to automatically derive programming source code, relational database schemas and other artifacts that otherwise would have to be coded by hand. C++ developers should look for these opportunities as ways to reduce the amount of repetitive work required to add or update support for specific XML dialects.
Copyright © 2000 SYS-CON Media, Inc. — All Rights Reserved.
More Stories By Ken Blackwell
Ken Blackwell is the chief technical officer of Bristol Technology, Inc., where he oversees product architecture and research in XML, middleware and transaction analysis technologies.