By Chas Emerick | October 1, 2006 06:45 PM EDT
The PDF file format has become the gold standard of document distribution and archiving. It's therefore virtually certain that data critical to your organization is sitting quietly in PDF documents somewhere. This means you have to get serious about integrating PDF content into your applications. Taking shortcuts in this area, and failing to find or leverage that mission-critical data, can lead to millions of dollars in lost sales, comparable increases in costs, compliance difficulties, or liability entanglements. It's time to use a high-performance PDF component that yields accurate text and metadata extracts suitable for use with your existing search, content management, text analysis/mining, CRM, or other systems.
However, the PDF file format is complex and wasn't designed for content extraction. So, any PDF library that doesn't specialize in content extraction is likely to exhibit various undesirable traits:
- Poor performance or performance degradation in high-volume environments
- Poor text extract accuracy
- Incomplete PDF file format support
- Lack of or limited support for extracting Unicode text, including Chinese, Japanese, and Korean text
- A complicated API that requires knowing the PDF file format
- Lack of any tools for identifying and converting unstructured data
PDFTextStream avoids these problems. A pure Java library (also available for .NET and Python), PDFTextStream specializes in extracting text and metadata from PDF documents. Because of this focus, PDFTextStream has none of the downsides that are all too common when a general-purpose PDF library is used for content extraction.
This introduction will cover just a few of the use cases where PDFTextStream's focus on content extraction yields significant value.
Simple/Powerful Text Extraction
PDF documents specify their text content a character at a time without any indication of each page's physical layout (such as lines, paragraphs, columns, tables, etc.). Thankfully, PDFTextStream automatically derives these structures for every page it extracts using state-of-the-art page segmentation and read-ordering processes - similar to how an OCR application derives the structure of a scanned document. And thankfully again, this accuracy doesn't come at the expense of speed or ease of use.
Now for some code. As you can see, extracting text using PDFTextStream is super-simple:
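// Collect the extracted text in memory; 1024 is only the initial capacity.
StringBuffer pdfText = new StringBuffer(1024);
com.snowtide.pdf.OutputTarget tgt = new com.snowtide.pdf.OutputTarget(pdfText);
PDFTextStream stream = new PDFTextStream(pdfFile);
try {
    // Push every page's text through the OutputTarget into pdfText.
    stream.pipe(tgt);
} finally {
    // Always release the underlying file, even if extraction fails.
    stream.close();
}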
The full text of the PDF file is now available in the pdfText StringBuffer.
OutputTarget is the default implementation of the com.snowtide.pdf.OutputHandler interface, which can be thought of as a SAX interface for PDFTextStream document model events. These events are generated any time an OutputHandler is passed into a pipe(OutputHandler) function, which is available on many document model objects as well (com.snowtide.pdf.Page, com.snowtide.pdf.layout.Block, and com.snowtide.pdf.layout.Line).
OutputTarget's primary purpose is to provide a straightforward way to direct extracted text to a StringBuffer or a java.io.Writer. Further, OutputTarget passes through PDFTextStream's default text layout: content is in the proper semantic order, columns of text are separated, and rotated text is normalized and grouped in reasonable ways. This is really important if the PDF text you're extracting is going to be used as input to a semantically sensitive process, such as text mining or search engine indexing.
There are many OutputHandler implementations included with PDFTextStream, each of which interprets and processes PDF text events differently. If none of them meet your application's needs, you can easily write your own.
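To make the SAX-style idea concrete, here is a minimal, self-contained sketch of an event-driven handler. Everything in it is an assumption made for illustration: the TextEventHandler interface, its three methods, and the pipe helper are hypothetical stand-ins, not the actual OutputHandler API, whose method names and signatures should be taken from the PDFTextStream documentation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical event interface for illustration; NOT the real OutputHandler API.
interface TextEventHandler {
    void startPage(int pageNumber);
    void line(String text);
    void endPage(int pageNumber);
}

// A custom handler that counts pages and words instead of collecting text.
final class WordCountingHandler implements TextEventHandler {
    int pages = 0;
    int words = 0;

    public void startPage(int pageNumber) { pages++; }

    public void line(String text) {
        String trimmed = text.trim();
        if (!trimmed.isEmpty()) {
            words += trimmed.split("\\s+").length;
        }
    }

    public void endPage(int pageNumber) { /* nothing to flush in this sketch */ }
}

final class EventPipeSketch {
    // Stand-in for a pipe(...) call: walks the document and fires events in order.
    static void pipe(List<List<String>> pages, TextEventHandler handler) {
        for (int i = 0; i < pages.size(); i++) {
            handler.startPage(i + 1);
            for (String line : pages.get(i)) {
                handler.line(line);
            }
            handler.endPage(i + 1);
        }
    }

    public static void main(String[] args) {
        List<List<String>> doc = Arrays.asList(
            Arrays.asList("Hello PDF world", "second line"),
            Arrays.asList("page two"));
        WordCountingHandler counter = new WordCountingHandler();
        pipe(doc, counter);
        System.out.println(counter.pages + " pages, " + counter.words + " words");
    }
}
```

The point of the pattern is that the handler never sees the whole document at once: it reacts to events as they arrive, so a handler can index, count, or filter text without buffering it first.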
Unicode Text Extraction
Today's global economy demands that your application be world-ready, in any major language. Thankfully, PDFTextStream always extracts text from PDF documents as Unicode (a perfect match for Java's consistent and thorough Unicode support). Further, PDFTextStream extracts Chinese, Japanese, and Korean (CJK) text from PDF documents without any performance penalties.
Nothing special needs to be done to enable these capabilities - they're always on, so you can use the simplest code and always get Unicode and CJK text out of your source PDF documents.
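One thing that does remain your responsibility is the encoding used when Unicode text leaves your application. The sketch below uses only the Java standard library and assumes the extracted text is already in hand as a CharSequence; the class and method names are hypothetical. It writes the text through a java.io.Writer with an explicit UTF-8 charset, rather than relying on the platform default, which may not be able to represent CJK characters.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class Utf8WriterSketch {
    // Encodes already-extracted text as UTF-8 bytes via a Writer.
    static byte[] writeUtf8(CharSequence extractedText) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try {
            // Name the charset explicitly; the platform default may not cover CJK.
            Writer out = new OutputStreamWriter(bytes, StandardCharsets.UTF_8);
            try {
                out.append(extractedText);
            } finally {
                out.close(); // flushes the encoder into the byte stream
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Three Japanese characters, written as escapes to keep the source ASCII-only.
        byte[] encoded = writeUtf8("\u65E5\u672C\u8A9E");
        System.out.println(encoded.length + " bytes");
    }
}
```

The same Writer could just as well wrap a FileOutputStream; the explicit charset is the part that matters.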
Search Engine Integration
PDFTextStream was designed to be easily integrated into other applications, including content management systems, text mining processes, and, of course, search engines. A great example is its Lucene integration module, which produces Lucene documents using the content extracted from PDF files. Building a Lucene document that contains all of the text in a PDF file requires one line of code:
Document luceneDoc = com.snowtide.pdf.lucene.PDFDocumentFactory.buildPDFDocument(pdfFile);
The contents of the Lucene document can be customized via com.snowtide.pdf.lucene.DocumentFactoryConfig. This includes whether PDF document attributes (such as author's name, title, creation date, etc.) are included, as well as the Lucene document's indexing, tokenizing, and storage parameters.
Also of interest to those who work with search engines, PDFTextStream enables Web crawlers to source new URLs to retrieve from PDF documents - see Enabling PDF Web Crawling below.
Metadata, Metadata Everywhere
Utilizing the metadata embedded in many PDF documents can add a great deal of value to your applications. PDFTextStream gives you easy access to the full world of PDF metadata:
- Document attributes (as a key/value Map or in Adobe XMP XML format)
- Document outline/bookmarks
- AcroForm data (interactive form data)
- PDF annotations (text notes, embedded URL links, etc.)
The Bulk Metadata Import
Consider a scenario where you need to load PDF documents into a content management system. A common requirement would be for each document's author, title, and creation date to be imported as well. Let's retrieve those attributes:
PDFTextStream stream = new PDFTextStream(pdfFile);
// Attribute values are Objects because PDF allows non-string values.
Object author = stream.getAttribute(PDFTextStream.ATTR_AUTHOR);
Object title = stream.getAttribute(PDFTextStream.ATTR_TITLE);
Object createDtStr = stream.getAttribute(PDFTextStream.ATTR_CREATION_DATE);
stream.close();
// instanceof is false for null, so no separate null check is needed.
Date createDt = null;
if (createDtStr instanceof String) {
    createDt = PDFDateParser.parseDateString((String) createDtStr);
}
From here, you could easily add the metadata associated with each PDF document to the CMS. This code is straightforward, but there are some points worth noting:
- The PDFTextStream class provides a set of attribute name constants, making standard attribute lookups easy.
- The getAttribute(String) function returns an Object, not a String - this is because PDF files can technically specify attribute values of various types.
- PDF date strings have a standard format; the com.snowtide.pdf.PDFDateParser.parseDateString(String) function can be used to convert PDF date Strings into java.util.Date objects.
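For readers curious about what that standard format looks like, PDF date strings take the form D:YYYYMMDDHHmmSSOHH'mm', where every field after the year is optional and O is +, -, or Z. The following is a rough sketch of parsing such a string with the Java 8 java.time classes. It is not how PDFDateParser is implemented, it returns an OffsetDateTime rather than a java.util.Date, and the class name is hypothetical. It also assumes UTC when no offset is given, whereas the PDF format treats that case as unknown.

```java
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PdfDateSketch {
    // D:YYYYMMDDHHmmSSOHH'mm' ; every field after the year is optional.
    private static final Pattern PDF_DATE = Pattern.compile(
        "^(?:D:)?(\\d{4})(\\d{2})?(\\d{2})?(\\d{2})?(\\d{2})?(\\d{2})?"
        + "(?:(Z)|([+-])(\\d{2})'?(\\d{2})?'?)?$");

    // Returns the parsed value, or the default when the optional field is absent.
    private static int field(String value, int defaultValue) {
        return value == null ? defaultValue : Integer.parseInt(value);
    }

    static OffsetDateTime parse(String pdfDate) {
        Matcher m = PDF_DATE.matcher(pdfDate.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a PDF date string: " + pdfDate);
        }
        // Assumption of this sketch: a missing offset is treated as UTC.
        ZoneOffset offset = ZoneOffset.UTC;
        if (m.group(8) != null) {
            int sign = m.group(8).equals("-") ? -1 : 1;
            offset = ZoneOffset.ofHoursMinutes(
                sign * Integer.parseInt(m.group(9)),
                sign * field(m.group(10), 0));
        }
        return OffsetDateTime.of(
            Integer.parseInt(m.group(1)),
            field(m.group(2), 1),   // month defaults to January
            field(m.group(3), 1),   // day defaults to the 1st
            field(m.group(4), 0),
            field(m.group(5), 0),
            field(m.group(6), 0),
            0,
            offset);
    }

    public static void main(String[] args) {
        System.out.println(parse("D:20061001184500-04'00'"));
    }
}
```

In a real import job you would simply call parseDateString as shown above; the sketch is only meant to show why a dedicated parser is needed instead of a stock date format.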
Enabling PDF Web Crawling
PDF documents can contain Internet URLs, but many Web crawlers don't look for or follow such links. Here, we'll extract the PDF annotations that contain URL links, so that a Web crawler can then retrieve those URLs.
PDFTextStream stream = new PDFTextStream(pdfFile);
List<Annotation> annots = stream.getAllAnnotations();
ArrayList<String> uriList = new ArrayList<String>();
for (Annotation annot : annots) {
    // Only link annotations can carry a URL.
    if (annot instanceof com.snowtide.pdf.annot.LinkAnnotation) {
        LinkAnnotation link = (LinkAnnotation) annot;
        // Constant-first equals() avoids a NullPointerException if the action name is null.
        if ("URI".equals(link.getLinkActionName())) {
            uriList.add(link.getURI());
        }
    }
}
stream.close();
This example adds all of the available URLs in the PDF document to the uriList ArrayList. The process is very simple: find all of the PDF annotations of type com.snowtide.pdf.annot.LinkAnnotation, and ignore any LinkAnnotations that do not have an "action name" of URI. There are a variety of link action names, each of which behaves differently in a PDF viewer. Only URI LinkAnnotations contain a URL, which is retrieved using the getURI() function.
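Before handing such a list to a crawler, it is usually worth cleaning it up. The sketch below is one possible approach using only java.net.URI from the standard library; the class and method names are hypothetical, and the policy it applies (keep only well-formed http and https links that name a host, drop duplicates, preserve the order in which links were found) is an assumption about what a crawler wants, not something PDFTextStream does for you.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class CrawlQueueSketch {
    // Filters raw link strings down to unique, crawlable http(s) URIs.
    static List<URI> toCrawlTargets(List<String> rawUris) {
        Set<URI> seen = new LinkedHashSet<URI>(); // de-duplicates, keeps first-seen order
        for (String raw : rawUris) {
            if (raw == null) {
                continue;
            }
            try {
                URI uri = new URI(raw.trim()).normalize();
                String scheme = uri.getScheme();
                boolean web = scheme != null
                    && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"));
                if (web && uri.getHost() != null) {
                    seen.add(uri);
                }
            } catch (URISyntaxException e) {
                // Malformed link text in the PDF; skip it rather than fail the whole document.
            }
        }
        return new ArrayList<URI>(seen);
    }

    public static void main(String[] args) {
        List<String> uriList = new ArrayList<String>();
        uriList.add("http://snowtide.com");
        uriList.add("mailto:info@example.com");
        uriList.add("http://snowtide.com");
        System.out.println(toCrawlTargets(uriList));
    }
}
```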
Identifying and Converting Unstructured Data
Coping with "unstructured" data is a popular topic these days, mostly because:
- It's being recognized that unstructured data represents most of the data generated and received by most organizations
- Significant operational advantages can be achieved only if organizations can identify, convert, and harness the available unstructured data
PDFTextStream helps here in two ways. First, it provides a table API (com.snowtide.pdf.layout.Table) that represents the data of any table that PDFTextStream can detect while processing a PDF document. This API can be used as the basis of a process that converts tabular data found in PDF documents into CSV or Excel files, or directly into database records.
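As a rough illustration of the CSV half of such a process, the sketch below turns rows of cell text into CSV using only the standard library. It deliberately does not use the Table API: the rows are assumed to have already been read out of a detected table into plain lists of strings, and the class and method names are hypothetical. Quoting follows the common CSV convention of wrapping cells that contain commas, quotes, or line breaks and doubling any embedded quotes.

```java
import java.util.Arrays;
import java.util.List;

final class TableToCsvSketch {
    // Quotes a cell only when it contains a character that would break the CSV row.
    static String escape(String cell) {
        if (cell == null) {
            return "";
        }
        boolean needsQuotes = cell.contains(",") || cell.contains("\"")
            || cell.contains("\n") || cell.contains("\r");
        String escaped = cell.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Renders each row as one CSV line terminated by CRLF.
    static String toCsv(List<List<String>> rows) {
        StringBuilder csv = new StringBuilder();
        for (List<String> row : rows) {
            for (int i = 0; i < row.size(); i++) {
                if (i > 0) {
                    csv.append(',');
                }
                csv.append(escape(row.get(i)));
            }
            csv.append("\r\n");
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("Region", "Q1 Sales"),
            Arrays.asList("North, East", "1,200"));
        System.out.print(toCsv(rows));
    }
}
```

Financial tables are a good reason to get the quoting right: figures such as 1,200 contain the delimiter itself.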
Secondly, for broader unstructured data conversion purposes (or for tabular data that can't be detected automatically through its table API), PDFTextStream provides VisualOutputTarget, an OutputHandler implementation that renders PDF text to a StringBuffer or java.io.Writer while maintaining the visual layout of each page of text. This maintains the visual alignment of table columns and other textual elements, which makes text extracts retrieved using VisualOutputTarget ideal for input into downstream text analysis and mining tools.
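To show why preserved alignment is useful downstream, here is a small sketch that splits layout-preserving text back into columns. It assumes the page text is already available as a String and that columns are separated by runs of two or more spaces, which is a simple heuristic chosen for this example rather than a guarantee made by VisualOutputTarget. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class AlignedTextSketch {
    // Splits each non-blank line into cells wherever two or more whitespace characters occur.
    static List<String[]> splitColumns(String pageText) {
        List<String[]> rows = new ArrayList<String[]>();
        for (String line : pageText.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines between rows
            }
            // A single space stays inside a cell ("Widget A"); wider gaps separate cells.
            rows.add(trimmed.split("\\s{2,}"));
        }
        return rows;
    }

    public static void main(String[] args) {
        String page =
              "Item          Qty   Price\n"
            + "Widget A        2    9.99\n"
            + "\n"
            + "Gear box       10   24.50\n";
        for (String[] row : splitColumns(page)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

A heuristic like this breaks down when a cell is empty or a column gap shrinks to one space, which is where splitting on fixed character positions, or the table API, becomes the better choice.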
Conclusion: Enterprise-Class, Indeed
The term "enterprise-class" typically means that a component is robust - that it can take a beating and still keep going, while maintaining high performance levels.
That describes PDFTextStream quite well. It's feature-rich, it has a high degree of PDF file format support, and it's just plain fast: in extensive benchmarking (conducted by Snowtide Informatics and posted for review and verification on its Web site), PDFTextStream is shown to be 223% to 1,141% faster than all other Java PDF libraries that are capable of text extraction. Even better, PDFTextStream clocks in as 13% faster than pdftotext, the popular native C/C++ PDF text extraction utility that's part of the Xpdf project.
There's a right tool for every job, and in general, it's better to use a tool that is designed for the specific job at hand. Accurately extracting text and metadata from PDF documents with high levels of performance is a surprisingly difficult job that presents a complex set of problems. Given the importance of finding and accessing critical data available only in PDF documents, it makes sense to use a PDF content extraction library designed from the ground up to solve these problems expertly and without compromises. Doing so will ensure that your application and your users receive the greatest benefits of enterprise-class PDF content integration.
References
- Snowtide Informatics, publisher of PDFTextStream: http://snowtide.com
- PDFTextStream developer resources: http://snowtide.com/Support
- PDF text extraction benchmarks: http://snowtide.com/Performance
- Adobe Extensible Metadata Platform (XMP): www.adobe.com/products/xmp/main.html
- Apache Lucene project: http://lucene.apache.org
- Xpdf project (home of pdftotext): http://foolabs.com/xpdf/
Copyright © 2006 SYS-CON Media, Inc. — All Rights Reserved.