Welcome!

Java Authors: Elizabeth White, Torben Andersen, Liz McMillan, Patrick Carey, Srinivasan Sundara Rajan

Related Topics: Java

Java: Article

StAX: Java's XML Pull Parser Specification

An overview

Until recently Java programmers have had three options when wishing to access the XML infoset: they could use DOM, JDOM, or SAX. With the release of the JSR 173 StAX specification, Java programmers now have a fourth option, which gives them the efficiency of SAX with a convenient and extendable programming model. This article explores the rationale behind StAX's pull parsing model and describes how you can use the API to more cleanly create Java code to extract the information you need from your XML document. The article describes StAX's two API flavors, "cursor" and "event," and provides some of the reasons why the specification ended up containing two sets of reading and writing APIs. Example usage of each of the reading APIs and each of the writing APIs will be provided.

Document Streaming vs Document Object Model
When creating code that processes XML documents there are two approaches to dealing with the XML infoset data: object model and streaming. With object model, you first create an in-memory object model tree that holds the complete infoset state for an XML document; once in memory you can freely navigate around the tree and even evaluate arbitrary XPath expressions against the tree. This flexibility comes at a price - the complete details of the XML document must be held in memory as objects for the entire duration of the document processing. The creation of the document object graph requires considerable processor resources and takes up a large memory footprint. This may not be a problem for small XML documents, but for large XML ones memory may become a bottleneck to application performance.

With the streaming approach the XML infoset is processed in a serial manner (as a depth first traversal of the XML infoset tree); once an element has been seen its state is discarded and may be garbage collected. Only the infoset state at the current point of the document is available at any one time, which clearly limits the types of processing that can be done and implies that you need to know what processing you are going to perform before reading in the XML document. If you don't think you will ever need to use XML streaming, try loading a 100 megabyte XML file into your latest application and watch what happens.

Streaming Pull Parsing vs Streaming Push Parsing
If you are creating an application that's memory limited, either because you are running on a constrained device (as in a phone) or your application is simultaneously processing several requests (as in an application server), you need to read your XML documents using a streaming model. In the past you were restricted to using the Simple API for XML (SAX). SAX was the first widely available API for reading XML in Java and provides a very low-level, efficient API that deals directly with the character data in the XML document. SAX uses a push processing model in which the SAX library reads the XML document and calls methods on your application objects as it encounters elements and text within the XML document. Although the SAX API is simple, the code application developers need to create to use SAX is not.

A "pull" parsing alternative to SAX's "push" parsing has lurked in the background for some time, but no longer: the recently ratified StAX specification now standardizes a pull parser for Java. StAX provides an alternative processing model where you call methods on the parser at your leisure and move the processing along at your command. The key difference here is that with SAX you don't have control of the application thread and can only accept invocations from the parser. In contrast, with pull parsing you own the application thread and you control when and where you call the XML parser.

This control over "when and where" leads to increased freedoms in application design, allowing you to either collect all your parsing code together or alternatively place your parsing code within the objects that understand that particular type of information.

Pull parsing has the following advantages over push parsing:

  • Parsing simple documents can now be done with simple code.
  • It's much simpler to write recursive descent parsing code for more complex documents.
  • More than one document can be read by an application at one time with just a single thread. This can be useful when part way through reading one document you need to read a second document.
  • The parser can be told to skip parts of the XML document that are not relevant to the application, which simplifies your code and may reduce processing time and memory churn.
  • You can create streaming pipelines in an object-oriented way that are efficient and simple to use.
XML Information Model
The StAX specification models an XML document as a set of events, and these events are pulled by the application and supplied in the order in which they are encountered in the XML document. The StAX specification defines the following types of events: Attribute, Characters, Comment, StartDocument, EndDocument, StartElement, EndElement, Namespace, DTD, EntityDeclaration, EntityReference, NotationDeclaration, and ProcessingInstruction.

The last five event types are only seen if your document contains a DTD. Each event has different properties associated with it depending on the type of event.

A StAX parser is an engine that reads a Unicode character stream and converts the textual data into events. Below is an example XML document:


<?xml version="1.0" encoding="UTF-8"?>
<pre1:foo xmlns:pre1="http://www.example.com/foons">
	<pre1:sometext>
			the text <![CDATA[<notallowed> as normal text]]> other text
	</pre1:sometext>
	<pre2:bar ratio="5.5" xmlns:pre2="http://www.example.com/barns"/>
</pre1:foo>

The above document would be parsed into the events shown in Figure 1.

As you can see, this small document would be converted (by default) into a stream of 14 primary events. Each colored circle in the figure is an event object. The large circles are the primary events that are seen by the application, while the small circles are secondary events that are generally accessed from some primary event.

Salient points about the event stream to note are:

  • Every StartElement event has a matching EndElement event, even for empty events like <eg/>.
  • Attributes, although events, are not (normally) seen in the event stream but are instead accessible from their StartElement event.
  • Namespaces events are not (normally) seen in the event stream but instead appear twice, first accessible from a StartElement and, second, accessible from the corresponding EndElement.
  • XML Character data may be split over more than one event and crop up where you might not expect.
While parsing an XML document the StAX parsing engine maintains a namespace stack. The namespace stack holds details of all the XML namespaces defined for the current element and its ancestors. This namespace stack is accessible though the interface javax.xml. namespace.NamespaceContext, which includes methods for looking up a namespace URI given a prefix and looking up a prefix given a namespace URI.

Cursor vs Iterator
The Story of Two APIs

When programming in Java, object creation has traditionally been seen as the enemy of performance. One of the SAX API's main advantages is that very few superfluous objects are created during the parsing of an XML document. SAX even gives the application direct access into the parser's internal character buffer in order to read text content! This lean and mean approach leads to very efficient parsing of XML. One of the design goals for StAX was for it to be at least as fast as SAX.

Very early on during the process of creating the StAX specification two alternative API styles were proposed by the expert group. One, which I'll call "cursor," followed SAX's lean and mean approach; the other, which I'll call "iterator," was a modern object-oriented API utilizing immutable objects. The expert group looked long and hard at which API style we should go with, including doing a performance analysis of the two styles. What we discovered was that we were trying to support three different end-user developers:

  1. Library and infrastructure developers: The group who creates app servers, JAXM, JAXB, JAXRpc, implementations, etc., and needs a very low-level and close-to-the-metal API with little overhead and few requirements for extensibility.
  2. J2ME developers: This group wants an XML pull parsing library that's small and simple and has a tiny footprint. They have little need for being able to extend the parser or modify the event stream.
  3. J2EE and J2SE developers: This large group generally wants a simple, efficient pull parser that naturally produces elegant bug-free code while allowing for more complex features such as stream modification and introducing new application event types.
We looked at many inventive ideas on how to create a single API that would allow us to "have our cake and eat it," but each of these ideas was rejected as they left us with a bad taste in our mouth and invariably produced a less-predictable API. In the end our desire to support these three developer types and our requirement that we be just as fast as SAX led the expert group to support both API styles.

Cursor API
The cursor API contains a central parser interface called XMLStreamReader that includes accessor methods for all the possible information you could retrieve from the XML Information model. The parser interface contains methods for accessing the document encoding, element names, attributes, namespaces, text content, processing instructions, etc. Methods are provided to allow access into the internal character buffer just as in SAX. The cursor approach is like a mirror image of SAX and provides direct access to string and character information while exposing methods with integer indexes for accessing attribute and namespace information just as in SAX. Thus it's possible to access all of the Information Model via a set of methods that return strings so object allocation is kept to a bare minimum.

Listing 1 provides some example code that uses an XMLStreamReader instance called "sr", which reads the example document listed in Figure 1. The code uses "sr" to walk over the document and retrieve the text content of the element <pre1:sometext> and the value of the ratio attribute of the element <pre2:bar>.

Listing 1 assumes the XMLStreamReader sr has just been created, i.e., StartDocument will be the first event.

Iterator API
The iterator API presents the event stream as an ordered list of immutable event objects. The StAX API defines a common base interface called XMLEvent and a subinterface for each of the event types listed in the XML information model. The Iterator API contains a central parser interface called XMLEventReader with just five methods in it, the most important being nextEvent(), which returns the next event in the stream. The interface XMLEventReader implements java.util.Iterator so it can be passed into routines that can handle the standard Java Iterator.

The common super interface XMLEvent contains methods for finding the actual event type and downcasting to the event subtypes. Listing 2 shows the equivalent code for reading our example XML document but using the XMLEventReader "er".

Clearly in a real application the three QName objects would be held in static final fields but are left inline here to aid in comparing the code examples.

What can I do with the iterator API that I can't do with the cursor API?

As the XMLEvent subclasses are immutable objects, you can place them in arrays, lists, and maps and pass them through your application as you desire, even after the parser has moved on to later events.

You can create your own subtypes of XMLEvent that are either completely new information items or extensions of existing events but with extra methods.

You can modify an event stream by adding or removing events in a way that's much simpler to code than with the cursor API.

Which API Should I Use?
This decision can only be made by the individual developer depending on the specific situation; if one API fitted all needs we would not have ended up with two.

My personal rules of thumb:

  1. If you're programming on J2ME, use the cursor API.
  2. If you're creating low-level infrastructure or libraries and you need to achieve the best possible performance, use the cursor API.
  3. If you want to create pipelines of XML processing, use the iterator API.
  4. If you want to modify the event stream, use the iterator API.
  5. If you want to future proof your app by enabling pluggable processing of the event stream, use the iterator API.
  6. If in doubt, use the iterator API.
Performance: What's the Beef?
During the expert group design discussions for StAX, the author created a prototype pull parser that provided both types of API styles, and a series of performance tests were performed using a wide range of XML test data. Due to the fact that the internal parser implementation was common for all the tests, the results showed the performance impact of using the iterator API style versus the cursor API style when accessing the parser.

The performance differences between the API styles arise mostly because of the extra objects that are created and later need to be garbage collected. The cursor and event API need to create the objects shown in Figure 2.

The cursor API does not need to create string objects for XML character data as it provides direct access to the internal character buffer. In addition, the iterator API needs to create the immutable event objects. The items colored yellow are string objects that may be cached to improve performance at the expense of some cache management complexity.

Figure 3 shows the number of bytes of XML processed per millisecond averaged over different sets of XML test files. The test was run on JDK 122 and JDK 142 for both API styles and with and without string caching enabled. The test files are loaded into memory so there is no I/O during the test runs.

Clearly there has been a massive (3-6 times) performance improvement across the board from JDK 122 to JDK 142. The iterator API without any string caching is 6.5 times faster on JDK 142.

String caching makes a big difference on JDK 122, running up to 50% faster, but only a small difference on JDK 142, showing less than a 5% improvement with perfect caching. Clearly we can now take Joshua Bloch at his word when he says that object pooling of lightweight objects is unnecessary with modern JVMs.

The overhead from using the iterator API style instead of the cursor API is around 25-30%. This sounds like a lot but remember this test program is doing 90% XML parsing and 10% application logic, whereas your typical application would probably be the other way round - 90% app logic and 10% parsing - which would drive the overhead down to about 3%, which is in the noise for most applications.

Of course your mileage may vary depending on your particular usage scenario. With the first generation StAX parser coming out soon, we'll see if this level of performance difference is also reflected in the real StAX parsers.

StAX Input Factories
How do I create an instance of XMLStreamReader or XMLEventReader?

StAX parsers that implement either of these two APIs are created by the javax.xml.stream.XMLInputFactory, which follows the standard factory pattern. When we get JAXP support you'll be able to create instances via the JAXP APIs. Create a new instance of XMLInputFactory by calling newInstance(); this method will search for a StAX implementation using the standard techniques.

The input factory has a set of configuration parameters that control the features of the parser that will be created by the factory. Once you have an instance of the factory you can override the default configuration parameter values.

Some of the more interesting configuration parameters are:

  • javax.xml.stream.isCoalescing: Defaults to false but when set to true will request a parser that coalesces all contagious text into a single character event.
  • javax.xml.stream.supportDTD: Defaults to true, but when set to false will request a parser that does not support DTDs in XML documents.
  • javax.xml.stream.resolver: Can be used to set an implementation of XMLResolver, which is used during parsing to resolve external entities.
Below is the code to create an instance of the XMLStreamReader on a File f.


        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader sr = factory.createXMLStreamReader(new FileInputStream(f));
And to create an XMLEventReader
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLEventReader sr = factory.createXMLEventReader(new FileInputStream(f));

If we wanted a differently featured parser, we would set some configuration parameters before creating the parser as follows:


        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty("javax.xml.stream.isCoalescing",Boolean.TRUE);
      factory.setProperty("javax.xml.stream.supportDTD",Boolean.FALSE);
        XMLEventReader sr = factory.createXMLEventReader(new FileInputStream(f));

Some of the standard configuration parameters are optional, meaning that some implementations may choose not to support the feature. You can check to see if a standard (or nonstandard) configuration parameter is supported by calling isPropertySupported() on the factory instance. Here's an example of using an optional feature:


        XMLInputFactory factory = XMLInputFactory.newInstance();
        if(factory.isPropertySupported("javax.xml.stream.isValidating")){
            factory.setProperty("javax.xml.stream.isValidating",Boolean.TRUE);
            factory.setProperty("javax.xml.stream.reporter",this);
        }
        XMLStreamReader sr = factory.createXMLStreamReader(new FileInputStream(f));

The above code instantiates a DTD validating parser if the implementation supports DTD validation.

Once you have created either an XMLStreamReader or an XMLEventReader parser you can find out what its configuration is by calling getProperty() on the parser.

One important design feature of StAX is that you must specify the properties of the parser before you create the parser and you cannot change the property value for an existing parser nor set a new data source into the parser. The rationale behind these restrictions is that we wanted to enable optimized and modular implementations.

A StAX implementation could use a special parsing engine with its own optimized byte to Unicode decoding when reading from a java.io.InputStream; alternatively it could use a different parsing engine when a java.io.Reader is the data source. Another example is the javax.xml.stream.supportDTD property that when "false" could trigger an implementation to use a simpler, faster parsing engine. If we allowed any of the configuration parameters to be modifiable after the parser was created, these types of optimization would be considerably harder to implement.

What About Writing XML?
StAX is a truly bidirectional API and can be effectively used to generate XML, either from scratch or as the result of a StAX Event pipeline. The StAX writers are intelligent in that they maintain namespace stacks and can automatically generate namespace prefixes (if you don't care what they look like). The writers can also close elements using the correct prefix and localname.

Listings 3 and 4 show some code to generate our example XML file using someText and ratio variables. If you want to use the cursor API, the code appears as in Listing 3. If you want to use the iterator API, the code should look like Listing 4.

The writers automatically escape any illegal Unicode characters such as < or & that are found in character content or attribute values.

Summary
After a lot of hard work the StAX API specification has finally seen the light of day and Java application programmers now have a standard pull parser interface for XML. The StAX API has many advantages over the SAX API for developers, including a simpler programming model and the ability to modify the event stream data and extend the information model to allow the introduction of application-specific additions. J2ME programmers now have an XML API that matches their resource-constrained environments, while library developers now have a standard API to use that fits in with their client applications' threading models and performance expectations.

Acknowledgment
I would like to thank the members of the JSR 173 expert group for an interesting and stimulating set of discussions over the past two years.

More Stories By David Stephenson

David has worked in the computing industry for over ten years working in areas of distributed systems, middleware and IT solutions. David has a wide range of experience in computing having worked for Hewlett Packard laboratories and for the HPs middleware divisions creating e-services technology. Lately David has worked in the area of web services and on both the Java and .net platforms and has been HPís contributing expert in the JCP for both JSR31 and JSR173.

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


@ThingsExpo Stories
SYS-CON Events announced today that IDenticard will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. IDenticard™ is the security division of Brady Corp (NYSE: BRC), a $1.5 billion manufacturer of identification products. We have small-company values with the strength and stability of a major corporation. IDenticard offers local sales, support and service to our customers across the United States and Canada. Our partner network encompasses some 300 of the world's leading systems integrators and security s...
The BPM world is going through some evolution or changes where traditional business process management solutions really have nowhere to go in terms of development of the road map. In this demo at 15th Cloud Expo, Kyle Hansen, Director of Professional Services at AgilePoint, shows AgilePoint’s unique approach to dealing with this market circumstance by developing a rapid application composition or development framework.
“In the past year we've seen a lot of stabilization of WebRTC. You can now use it in production with a far greater degree of certainty. A lot of the real developments in the past year have been in things like the data channel, which will enable a whole new type of application," explained Peter Dunkley, Technical Director at Acision, in this SYS-CON.tv interview at @ThingsExpo, held Nov 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
The major cloud platforms defy a simple, side-by-side analysis. Each of the major IaaS public-cloud platforms offers their own unique strengths and functionality. Options for on-site private cloud are diverse as well, and must be designed and deployed while taking existing legacy architecture and infrastructure into account. Then the reality is that most enterprises are embarking on a hybrid cloud strategy and programs. In this Power Panel at 15th Cloud Expo (http://www.CloudComputingExpo.com), moderated by Ashar Baig, Research Director, Cloud, at Gigaom Research, Nate Gordon, Director of T...
"BSQUARE is in the business of selling software solutions for smart connected devices. It's obvious that IoT has moved from being a technology to being a fundamental part of business, and in the last 18 months people have said let's figure out how to do it and let's put some focus on it, " explained Dave Wagstaff, VP & Chief Architect, at BSQUARE Corporation, in this SYS-CON.tv interview at @ThingsExpo, held Nov 4-6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
SYS-CON Events announced today that Windstream, a leading provider of advanced network and cloud communications, has been named “Silver Sponsor” of SYS-CON's 16th International Cloud Expo®, which will take place on June 9–11, 2015, at the Javits Center in New York, NY. Windstream (Nasdaq: WIN), a FORTUNE 500 and S&P 500 company, is a leading provider of advanced network communications, including cloud computing and managed services, to businesses nationwide. The company also offers broadband, phone and digital TV services to consumers primarily in rural areas.
The Internet of Things is not new. Historically, smart businesses have used its basic concept of leveraging data to drive better decision making and have capitalized on those insights to realize additional revenue opportunities. So, what has changed to make the Internet of Things one of the hottest topics in tech? In his session at @ThingsExpo, Chris Gray, Director, Embedded and Internet of Things, discussed the underlying factors that are driving the economics of intelligent systems. Discover how hardware commoditization, the ubiquitous nature of connectivity, and the emergence of Big Data a...

ARMONK, N.Y., Nov. 20, 2014 /PRNewswire/ --  IBM (NYSE: IBM) today announced that it is bringing a greater level of control, security and flexibility to cloud-based application development and delivery with a single-tenant version of Bluemix, IBM's platform-as-a-service. The new platform enables developers to build ap...

DevOps Summit 2015 New York, co-located with the 16th International Cloud Expo - to be held June 9-11, 2015, at the Javits Center in New York City, NY - announces that it is now accepting Keynote Proposals. The widespread success of cloud computing is driving the DevOps revolution in enterprise IT. Now as never before, development teams must communicate and collaborate in a dynamic, 24/7/365 environment. There is no time to wait for long development cycles that produce software that is obsolete at launch. DevOps may be disruptive, but it is essential.
"People are a lot more knowledgeable about APIs now. There are two types of people who work with APIs - IT people who want to use APIs for something internal and the product managers who want to do something outside APIs for people to connect to them," explained Roberto Medrano, Executive Vice President at SOA Software, in this SYS-CON.tv interview at Cloud Expo, held Nov 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
Nigeria has the largest economy in Africa, at more than US$500 billion, and ranks 23rd in the world. A recent re-evaluation of Nigeria's true economic size doubled the previous estimate, and brought it well ahead of South Africa, which is a member (unlike Nigeria) of the G20 club for political as well as economic reasons. Nigeria's economy can be said to be quite diverse from one point of view, but heavily dependent on oil and gas at the same time. Oil and natural gas account for about 15% of Nigera's overall economy, but traditionally represent more than 90% of the country's exports and as...
The Internet of Things is a misnomer. That implies that everything is on the Internet, and that simply should not be - especially for things that are blurring the line between medical devices that stimulate like a pacemaker and quantified self-sensors like a pedometer or pulse tracker. The mesh of things that we manage must be segmented into zones of trust for sensing data, transmitting data, receiving command and control administrative changes, and peer-to-peer mesh messaging. In his session at @ThingsExpo, Ryan Bagnulo, Solution Architect / Software Engineer at SOA Software, focused on desi...
"At our booth we are showing how to provide trust in the Internet of Things. Trust is where everything starts to become secure and trustworthy. Now with the scaling of the Internet of Things it becomes an interesting question – I've heard numbers from 200 billion devices next year up to a trillion in the next 10 to 15 years," explained Johannes Lintzen, Vice President of Sales at Utimaco, in this SYS-CON.tv interview at @ThingsExpo, held Nov 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
"For over 25 years we have been working with a lot of enterprise customers and we have seen how companies create applications. And now that we have moved to cloud computing, mobile, social and the Internet of Things, we see that the market needs a new way of creating applications," stated Jesse Shiah, CEO, President and Co-Founder of AgilePoint Inc., in this SYS-CON.tv interview at 15th Cloud Expo, held Nov 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
SYS-CON Events announced today that Gridstore™, the leader in hyper-converged infrastructure purpose-built to optimize Microsoft workloads, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Gridstore™ is the leader in hyper-converged infrastructure purpose-built for Microsoft workloads and designed to accelerate applications in virtualized environments. Gridstore’s hyper-converged infrastructure is the industry’s first all flash version of HyperConverged Appliances that include both compute and storag...
Today’s enterprise is being driven by disruptive competitive and human capital requirements to provide enterprise application access through not only desktops, but also mobile devices. To retrofit existing programs across all these devices using traditional programming methods is very costly and time consuming – often prohibitively so. In his session at @ThingsExpo, Jesse Shiah, CEO, President, and Co-Founder of AgilePoint Inc., discussed how you can create applications that run on all mobile devices as well as laptops and desktops using a visual drag-and-drop application – and eForms-buildi...
We certainly live in interesting technological times. And no more interesting than the current competing IoT standards for connectivity. Various standards bodies, approaches, and ecosystems are vying for mindshare and positioning for a competitive edge. It is clear that when the dust settles, we will have new protocols, evolved protocols, that will change the way we interact with devices and infrastructure. We will also have evolved web protocols, like HTTP/2, that will be changing the very core of our infrastructures. At the same time, we have old approaches made new again like micro-services...
Code Halos - aka "digital fingerprints" - are the key organizing principle to understand a) how dumb things become smart and b) how to monetize this dynamic. In his session at @ThingsExpo, Robert Brown, AVP, Center for the Future of Work at Cognizant Technology Solutions, outlined research, analysis and recommendations from his recently published book on this phenomena on the way leading edge organizations like GE and Disney are unlocking the Internet of Things opportunity and what steps your organization should be taking to position itself for the next platform of digital competition.
The 3rd International Internet of @ThingsExpo, co-located with the 16th International Cloud Expo - to be held June 9-11, 2015, at the Javits Center in New York City, NY - announces that its Call for Papers is now open. The Internet of Things (IoT) is the biggest idea since the creation of the Worldwide Web more than 20 years ago.
As the Internet of Things unfolds, mobile and wearable devices are blurring the line between physical and digital, integrating ever more closely with our interests, our routines, our daily lives. Contextual computing and smart, sensor-equipped spaces bring the potential to walk through a world that recognizes us and responds accordingly. We become continuous transmitters and receivers of data. In his session at @ThingsExpo, Andrew Bolwell, Director of Innovation for HP's Printing and Personal Systems Group, discussed how key attributes of mobile technology – touch input, sensors, social, and ...