By Ben Litchfield
March 24, 2005 12:00 AM EST
Since Adobe released the first public PDF Reference in 1993, a number of PDF utilities and libraries, supporting all kinds of languages and platforms, have been made available to users and developers alike. However, support for Adobe's technology has lagged in Java application development. This is curious, because PDF documents are a popular way to store and interchange information in enterprise information systems, an application domain to which Java technology is particularly well suited. Yet until recently, mature, capable PDF support wasn't readily available to Java application developers.
PDFBox (an Open Source project released under the BSD license) is a pure Java library that lets developers read and create PDF documents. It has features such as:
- Extracting text, including Unicode characters
- Easy integration with text search engines like Jakarta Lucene
- Encryption/Decryption of PDF documents
- Importing/Exporting of form data in FDF and XFDF formats
- Appending to existing PDF documents
- Splitting a single PDF into multiple documents
- Overlaying one PDF document on top of another
PDFBox has been designed to represent PDF documents using familiar object-oriented paradigms. The data contained in a PDF document is a collection of basic object types: arrays, booleans, dictionaries, numbers, strings and binary streams. PDFBox captures these basic object types in the org.pdfbox.cos package (the COS Model). While it's possible to create any desired interactions with a PDF document using only these objects, it requires an intimate knowledge of the internals of PDF documents and the techniques used to represent higher-level concepts. For example, objects such as pages and fonts are represented as dictionaries with specialized attributes; deciphering all these various attributes and their types requires tedious consultation of the PDF Reference.
For this reason, the org.pdfbox.pdmodel package (the PD Model) sits on top of the COS Model and provides a high-level API that accesses PDF document objects in a more familiar manner (see Figure 1). Objects such as PDPage and PDFont live in this package and encapsulate their lower-level COS Model counterparts.
A word of caution to developers: the PD Model offers many nice features but is still a work in progress. In some instances, use of the COS Model may be required to access a particular piece of PDF functionality. Consequently, all PD Model objects can retrieve the corresponding COS Model object that they represent, so it's always possible to start with the PD Model and drop down to the COS Model when the required piece of functionality is found to be missing.
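The layering described above can be sketched in plain Java. This is only an illustration of the wrapper idea, not PDFBox's code: SketchCosDictionary, SketchPage and getCosObject() are hypothetical names invented for this sketch, and the only PDF detail assumed is that a page dictionary may carry a Rotate entry that defaults to 0.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a low-level dictionary object; not PDFBox's class.
class SketchCosDictionary {
    private final Map<String, Object> entries = new LinkedHashMap<>();

    void set(String key, Object value) {
        entries.put(key, value);
    }

    Object get(String key) {
        return entries.get(key);
    }
}

// Hypothetical high-level wrapper in the spirit of a PD Model page object.
class SketchPage {
    private final SketchCosDictionary dictionary;

    SketchPage(SketchCosDictionary dictionary) {
        this.dictionary = dictionary;
    }

    // Typed accessor: hides the "Rotate" key and its default value of 0.
    int getRotation() {
        Object value = dictionary.get("Rotate");
        return value instanceof Integer ? (Integer) value : 0;
    }

    void setRotation(int degrees) {
        dictionary.set("Rotate", degrees);
    }

    // Escape hatch: hands back the underlying low-level object.
    SketchCosDictionary getCosObject() {
        return dictionary;
    }
}
```

A caller would normally stay with the typed accessors and reach for getCosObject() only when the wrapper has no method for the attribute it needs.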
Now that the general capabilities of PDFBox have been discussed, a few examples of its use are appropriate. We will start by reading an existing PDF document:
import org.pdfbox.pdmodel.PDDocument;

// Parse the file and build the in-memory document model.
PDDocument document = PDDocument.load( "./test.pdf" );
This operation parses the PDF file and creates an in-memory representation of the document. To handle large documents efficiently, PDFBox stores only the document structure in memory; objects such as images, embedded fonts and page content are cached in a temporary file.
Note: When finished using a PDDocument object, care should be taken to invoke the close() method on the document object to release resources used during its creation.
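One way to make sure close() always runs is a try/finally block. The sketch below uses a hypothetical SketchDocument class as a stand-in for a document that keeps bulky data in a temporary file; it is not PDFBox's implementation, and the class and method names are assumptions made for illustration.

```java
import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a document that caches bulky data in a temp file.
class SketchDocument implements Closeable {
    private final File scratch;

    SketchDocument() {
        try {
            scratch = File.createTempFile("sketch-doc", ".tmp");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    File getScratchFile() {
        return scratch;
    }

    // Releases the temporary file; calling it more than once is harmless.
    @Override
    public void close() {
        scratch.delete();
    }
}

class CloseExample {
    // Opens a document, uses it, and closes it even if the work fails.
    static File useAndClose() {
        SketchDocument document = new SketchDocument();
        try {
            // ... work with the document here ...
            return document.getScratchFile();
        } finally {
            document.close();
        }
    }
}
```

The same shape applies to a real document object: load it, do the work inside try, and call close() in finally.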
Text Extraction and Lucene Integration
In an information retrieval age when applications are expected to have searching and indexing capabilities regardless of the medium, the ability to organize and catalog information into a searchable format is critical. This is simple for text and HTML documents, but PDF documents have more structure and meta-information, which make it difficult to extract the underlying text. The PDF language is similar to PostScript in that objects are drawn as vectors on the page at certain positions. For example:
/Helv 12 Tf
0 13.0847 Td
(Hello World) Tj
This set of instructions changes the font to Helvetica size 12, moves the text position to the start of the next line and renders the string "Hello World". These command streams are usually compressed, and the order in which the glyphs are displayed on the screen is not necessarily the order in which the characters appear in the file, so it isn't always possible to simply extract text strings directly from the raw PDF document. However, PDFBox has a sophisticated text-extraction algorithm that deals with this and other complexities, letting a developer get the text of the document as if reading off its rendered form.
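To see why file order and reading order can differ, consider a minimal sketch that rebuilds reading order from positions alone. It is not PDFBox's algorithm, which handles far more cases; SketchFragment and ReadingOrder are hypothetical names, y is assumed to grow upward as in default PDF user space, and fragments are treated as being on the same line only when their y values match exactly.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical positioned text fragment; y grows upward from the page bottom.
class SketchFragment {
    final double x;
    final double y;
    final String text;

    SketchFragment(double x, double y, String text) {
        this.x = x;
        this.y = y;
        this.text = text;
    }
}

class ReadingOrder {
    // Orders fragments top-to-bottom, then left-to-right, and joins them.
    static String toText(List<SketchFragment> fragments) {
        List<SketchFragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator
                .comparingDouble((SketchFragment f) -> f.y).reversed()
                .thenComparingDouble(f -> f.x));

        StringBuilder out = new StringBuilder();
        SketchFragment previous = null;
        for (SketchFragment f : sorted) {
            if (previous != null) {
                // Same baseline: separate with a space; otherwise start a new line.
                out.append(previous.y == f.y ? " " : "\n");
            }
            out.append(f.text);
            previous = f;
        }
        return out.toString();
    }
}
```

A production extractor would also need a tolerance for nearly equal baselines, font metrics to decide where spaces belong, and handling for rotated or multi-column text.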
Lucene, which is part of the Apache Jakarta project, is a popular Open Source search engine library. Lucene lets developers create an index and do complex searches on a large volume of textual content based on that index. Since Lucene has adopted text as the common denominator for content, it's the developer's responsibility to convert the data contained in other desired file formats to text to use Lucene. For example, file formats such as Microsoft Word and StarOffice documents have to be converted to text before they can be added to a Lucene index.
PDF files are no exception, but PDFBox makes it easy to include a PDF document in a Lucene index by supplying a special object that does the integration. A basic PDF document can be converted to a Lucene document with a single statement:
Document doc = LucenePDFDocument.getDocument( file );
This operation parses the PDF document, extracts the text and creates a Lucene document object that can then be added to the index. As mentioned above, PDF documents also contain metadata such as author information and keywords that are important to track when indexing PDF documents. Table 1 shows the fields that PDFBox will populate while creating the Lucene document.
This integration makes it easy for developers to support simple searching and indexing of PDF documents with Lucene. Of course, some applications require more sophisticated text-extraction methods. In that case, the PDFTextStripper class can be used directly, or extended to handle these complex requirements.
By extending this class and overriding the showCharacter() method, many aspects of text extraction can be controlled. For instance, an implementation of this method can use the x, y positioning information to restrict which blocks of text are included in the extraction. One use might be to exclude all of the text above a certain y-coordinate value, effectively dropping an unwanted document header.
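The header-exclusion idea reduces to a position test on each character. The sketch below shows only that test in isolation; HeaderFilter and Glyph are hypothetical names rather than PDFBox types, and y is assumed to be measured upward from the bottom of the page, so a header has the largest y values.

```java
import java.util.List;

class HeaderFilter {
    // Hypothetical glyph record: one character and the y position of its baseline.
    static final class Glyph {
        final char character;
        final double y;

        Glyph(char character, double y) {
            this.character = character;
            this.y = y;
        }
    }

    // Keeps only glyphs at or below maxY; anything higher is treated as header.
    static String textBelow(List<Glyph> glyphs, double maxY) {
        StringBuilder kept = new StringBuilder();
        for (Glyph g : glyphs) {
            if (g.y <= maxY) {
                kept.append(g.character);
            }
        }
        return kept.toString();
    }
}
```

In a subclass of the text stripper, the same comparison would sit inside the overridden method and decide whether to pass the character on or discard it.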
Another example: a group of PDF documents may have been created from forms whose source data are no longer available. The documents all have interesting text at similar locations on the page, such as a collection of cover letters with the name and address in the same place, but the form data used to fill them out are gone. In this case, an extension of the PDFTextStripper class can be used as a sort of screen-scraping device to extract the desired fields.
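The screen-scraping approach can be sketched as a lookup of positioned words against named page regions. Everything here is an assumption made for illustration: FieldScraper, Region and Word are hypothetical names, the coordinates are invented, and a real extension would take its positions from the text stripper instead of a prepared list.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FieldScraper {
    // Hypothetical axis-aligned page region that holds one field.
    static final class Region {
        final String name;
        final double minX, minY, maxX, maxY;

        Region(String name, double minX, double minY, double maxX, double maxY) {
            this.name = name;
            this.minX = minX;
            this.minY = minY;
            this.maxX = maxX;
            this.maxY = maxY;
        }

        boolean contains(double x, double y) {
            return x >= minX && x <= maxX && y >= minY && y <= maxY;
        }
    }

    // Hypothetical positioned word, as an extraction pass might report it.
    static final class Word {
        final double x, y;
        final String text;

        Word(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Collects the words inside each named region, keeping input order.
    static Map<String, String> scrape(List<Word> words, List<Region> regions) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Region region : regions) {
            StringBuilder value = new StringBuilder();
            for (Word word : words) {
                if (region.contains(word.x, word.y)) {
                    if (value.length() > 0) {
                        value.append(' ');
                    }
                    value.append(word.text);
                }
            }
            fields.put(region.name, value.toString());
        }
        return fields;
    }
}
```

Because the regions are data, one set of rectangles can be reused across every document that shares the same layout.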
Lucious 03/23/05 02:27:55 PM EST
I can't believe I found this!! I was searching for tools to update pdf files in my java programs. All I could find was commercial tools that would charge for single/multiple CPU and development licenses and then charge MORE for deployment!! I actually gave up and downloaded the pdf specs (over 1200 pages) to develop tools of my own. I can't wait to start using these tools!
Maulik 03/23/05 08:09:43 AM EST
Great easy-to-follow article for someone who knows next to nothing about integrating Java and PDF. Good job.
Richard Bouchard 03/12/05 09:15:57 PM EST
Excellent article and has great utility.