By Jimmy Zhang
February 20, 2008 02:15 PM EST
VTD+XML in 30 Seconds
The key to the example above is the index file "po.vxl," which conforms
to the VTD+XML spec and allows XML parsing to be decoupled from
application logic. What is VTD+XML? Since VTD-XML's internal
representation of the XML infoset is inherently persistent, VTD+XML, as
the name suggests, is simply the binary packaging format that combines
VTD records, LC (location cache) entries, and the XML text into a single
file. The detailed technical spec can be found at http://vtd-xml.sourceforge.net/persistence.html.
A Simple Example
This section gets down to the
nitty-gritty of the specification by manually composing, byte by byte,
a VTD+XML index. For the sake of simplicity, this example indexes a
simple XML document containing a single childless root element
whose parsed representation has no location cache entries. The
example also assumes a big-endian byte order (as in Java) and UTF-8
document encoding (the default character set). Namespace awareness
is set to false.
<root/>
The first four-byte word of the corresponding index file is 0x0102A000, containing:
- The VTD+XML version number (0x01) in the first byte
- The character encoding format (0x02) in the second byte
- The namespace awareness, the word length of LC entries in the last level, the byte endianness of the platform, and the VTD version, encoded as bit fields in the third byte (0xA0)
- The document depth (0x00, as the root element has no child) in the fourth byte
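The byte breakdown above can be checked with a few shifts and masks. The following is a minimal sketch, not code from the VTD-XML library; the class name `HeaderWord` and the method `byteAt` are hypothetical names chosen for illustration. It deliberately leaves the third byte (0xA0) as an opaque flags byte rather than guessing at the exact bit positions.

```java
// Sketch: split the first 32-bit header word of the example index into its
// four bytes. HeaderWord and byteAt are illustrative names, not library API.
class HeaderWord {
    // Returns the byte at the given position, where position 0 is the
    // most significant (first) byte of the big-endian word.
    static int byteAt(int word, int position) {
        int shift = 8 * (3 - position);
        return (word >>> shift) & 0xFF;
    }

    public static void main(String[] args) {
        int word = 0x0102A000;
        System.out.printf("version=0x%02X encoding=0x%02X flags=0x%02X depth=0x%02X%n",
                byteAt(word, 0), byteAt(word, 1), byteAt(word, 2), byteAt(word, 3));
    }
}
```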
The second four-byte word has the value 0x00040001, containing:
- The number of LC levels supported by the VTD-XML implementation in the upper 16 bits (0x0004 in big endian)
- The root element index value in the lower 16 bits (0x0001 in big endian)
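Since the example assumes big-endian byte order, the two 16-bit fields can also be read directly from the raw bytes with `java.nio.ByteBuffer`. This is only a sketch under that assumption; `SecondWord` and `fields` are hypothetical names, not part of VTD-XML.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: read the two 16-bit fields of the second header word from raw bytes.
// SecondWord and fields are illustrative names, not library API.
class SecondWord {
    // Returns { number of LC levels, root element index } as unsigned values.
    static int[] fields(byte[] fourBytes) {
        ByteBuffer buf = ByteBuffer.wrap(fourBytes).order(ByteOrder.BIG_ENDIAN);
        int lcLevels = buf.getShort() & 0xFFFF;   // upper 16 bits
        int rootIndex = buf.getShort() & 0xFFFF;  // lower 16 bits
        return new int[] { lcLevels, rootIndex };
    }

    public static void main(String[] args) {
        int[] f = fields(new byte[] { 0x00, 0x04, 0x00, 0x01 });
        System.out.println("LC levels = " + f[0] + ", root index = " + f[1]);
    }
}
```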
The remaining part of the VTD+XML index consists of multiple adjacent segments, each containing an eight-byte count word (the number of VTD records or LC entries in the segment) followed by the actual content of those VTD records or LC entries. The first count word (0x0000000000000002) indicates that there are two VTD records, 0xDFF0000000000000 and 0x0000000400000001.
The remaining three eight-byte words all have the value zero, indicating that the location caches at levels one, two, and three have no entries in the VTD+XML index.
As the final output, the VTD+XML index for "<root/>" is 88 bytes long and looks like the following in hex:
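To see what the two VTD records say, they can be unpacked field by field. The sketch below assumes the commonly documented 64-bit VTD record layout (4-bit token type, 8-bit depth, 20-bit length, 2 reserved bits, 30-bit offset, from most to least significant); that layout is an assumption here and should be checked against the spec. `VtdRecord` and its methods are hypothetical names.

```java
// Sketch: unpack a 64-bit VTD record, ASSUMING the layout
// [type:4][depth:8][length:20][reserved:2][offset:30], most significant first.
// VtdRecord is an illustrative name, not library API.
class VtdRecord {
    static int tokenType(long r) { return (int) ((r >>> 60) & 0xF); }
    static int depth(long r)     { return (int) ((r >>> 52) & 0xFF); }
    static int length(long r)    { return (int) ((r >>> 32) & 0xFFFFF); }
    static int offset(long r)    { return (int) (r & 0x3FFFFFFF); }

    public static void main(String[] args) {
        String xml = "<root/>";
        long[] records = { 0xDFF0000000000000L, 0x0000000400000001L };
        for (long r : records) {
            System.out.printf("type=%d depth=%d length=%d offset=%d%n",
                    tokenType(r), depth(r), length(r), offset(r));
        }
        // Under the assumed layout, the second record points at the element name.
        long second = records[1];
        System.out.println(xml.substring(offset(second), offset(second) + length(second)));
    }
}
```

Under this assumed layout, the second record has offset 1 and length 4, which spans the element name "root" in the document, while the first record has length 0 and offset 0.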
0x0102A00000040001 0x0000000000000000
0x0000000000000000 0x0000000000000007
0x3C726F6F742F3E00 0x0000000000000002
0xDFF0000000000000 0x0000000400000001
0x0000000000000000 0x0000000000000000
0x0000000000000000
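The dump can be reassembled in a few lines to confirm the 88-byte total and to pull the XML text back out. This is a sketch only: reading the fourth word (0x7) as the XML byte length and the fifth word as the zero-padded XML text is an inference from the dump itself (0x3C726F6F742F3E is "<root/>" in UTF-8), and the two zero words before them are treated as opaque. `RootIndex` is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch: rebuild the example index from its eleven 8-byte words.
// RootIndex is an illustrative name, not library API.
class RootIndex {
    static final long[] WORDS = {
        0x0102A00000040001L, 0x0000000000000000L,
        0x0000000000000000L, 0x0000000000000007L,
        0x3C726F6F742F3E00L, 0x0000000000000002L,
        0xDFF0000000000000L, 0x0000000400000001L,
        0x0000000000000000L, 0x0000000000000000L,
        0x0000000000000000L
    };

    // Serializes the words in big-endian order (ByteBuffer's default).
    static byte[] toBytes() {
        ByteBuffer buf = ByteBuffer.allocate(WORDS.length * 8);
        for (long w : WORDS) {
            buf.putLong(w);
        }
        return buf.array();
    }

    // ASSUMPTION: byte offset 24 holds the XML length, and the XML starts at 32.
    static String embeddedXml(byte[] index) {
        int xmlLength = (int) ByteBuffer.wrap(index).getLong(24);
        return new String(index, 32, xmlLength, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] index = toBytes();
        System.out.println(index.length + " bytes, XML = " + embeddedXml(index));
    }
}
```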
Benefits and Limitations
Because VTD+XML
straightforwardly combines VTD and XML, it inherits all the benefits of
VTD-XML parsing. Compared with existing XML indices (e.g., various
pure-binary XML indices that model labeled, ordered trees), VTD+XML
offers several distinctive technical benefits:
• General Purpose - Before
VTD+XML, most native XML indices optimized only specific types (e.g.,
a particular axis) of XPath lookups. If an input query differs slightly
from the index type, the query execution still has to resort to expensive
parsing. Because of this limitation, many native XML databases today
require users to create multiple indices, one for each input query type,
to benefit from indexing at all. The problem is that XML
database applications usually serve many types of queries that are
unpredictable and complex in nature, often rendering the benefits of
indexing insignificant. In comparison, VTD+XML is the first index that
completely eliminates the cost of XML parsing and predictably speeds up
any type of XPath query. It also works exceptionally well with
namespaces.
• Human Readable - VTD+XML is
also the first human-readable XML index. You can actually open it in a
text editor to examine the XML text. Figure 1 is what "po.vxl" looks
like in "vim." More than just a nice property, VTD+XML's
human readability offers distinct advantages over pure binary indexing
schemes. Everything else being equal, keeping XML in its original
format avoids the processing cost of converting to and from any binary
format. Moreover, what if your application just wants to modify the
XML payload, such as by inserting a chunk of XML text extracted
from another SOAP message? What's the point of converting XML to
a binary format? In a service-oriented heterogeneous environment,
maintaining XML in its original format automatically retains its
openness and interoperability. It seems to me that the only
lossless equivalent of XML is XML itself.
Copyright © 2008 SYS-CON Media, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
More Stories By Jimmy Zhang
Jimmy Zhang is a cofounder of XimpleWare, a provider of high performance XML processing solutions. He has working experience in the fields of electronic design automation and Voice over IP for a number of Silicon Valley high-tech companies. He holds both a BS and MS from the department of EECS from U.C. Berkeley.


































