By Tilak Mitra | September 28, 2004 12:00 AM EDT
Portal applications usually require a search capability. Portal designers often turn to search engine products such as Lotus Extended Search, or to open source implementations such as Lucene, to satisfy their search requirements. Although these search engines provide sophisticated search capabilities, the Native Search feature of WebSphere Portal Server 5.x (hereafter called WebSphere Portal) has a fairly rich set of search engine capabilities that could satisfy the search requirements of many portal-based applications.
In this two-part series I will introduce you to the basic capabilities of WebSphere Portal's native search. While Part I will take a very simple search scenario and walk the reader through its implementation and configuration, Part II will introduce the reader to more advanced search capabilities.
Example Prerequisites
To understand the scenario in this first article, you need to have only a very basic understanding of WebSphere Portal's configuration options - primarily the fundamental administration options. That said, this article is targeted not only towards the portal administrator, but also to others who want a better understanding of WebSphere Portal's native searching capabilities.
I used IBM WebSphere Portal Enable for Multiplatforms v5.0.2 in a Windows environment for this example. If you want to run through the example yourself, you need WebSphere Portal installed on a machine with at least 1 GB of RAM. You also need basic Internet connectivity because the example involves accessing the external IBM Web site (www.ibm.com).
The Portal Search Engine
WebSphere Portal v5.0.2 provides a search engine that can crawl Web sites (external and internal), index aggregated content, and categorize documents. Categorization can be based either on a predefined set of categories or on a custom, user-defined categorization process. The categorization facility of WebSphere Portal Native Search includes an extensive list of categories that are grouped into high-level business industry areas (e.g., finance and transportation). You can use these predefined categories for your portal applications and the Categorization Engine will automatically categorize content for you. Alternatively, user-defined categories provide the flexibility of creating custom category trees that may be used to categorize the incoming search results.
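To make the idea of a category tree concrete, here is a minimal Python sketch of keyword-based categorization. It is an illustration only: the product's categorization algorithms are internal, and the category names, keyword lists, and the categorize function below are all hypothetical. The fallback mirrors the "Uncategorized" category that appears in the search results later in this article.

```python
# Hypothetical categories and keywords; NOT the product's real taxonomy.
PREDEFINED_CATEGORIES = {
    "Finance": {"bank", "loan", "equity"},
    "Transportation": {"airline", "freight", "railway"},
}


def categorize(text, categories=PREDEFINED_CATEGORIES):
    """Return the category whose keywords overlap the text the most.

    Documents that match no keyword at all fall back to "Uncategorized".
    """
    words = set(text.lower().split())
    best_name, best_hits = "Uncategorized", 0
    for name, keywords in categories.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_name, best_hits = name, hits
    return best_name
```

Passing a different dictionary as the second argument loosely corresponds to supplying a user-defined set of categories instead of the predefined one.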
The Portal Search engine collects documents from multiple sites into a single collection. The engine can be configured to automate the process of crawling the sites periodically and updating the search content. You can also manually trigger the collection update process.
When a collection is defined and activated, one of the main functions performed by the WebSphere Portal Native Search runtime is the creation of indexes. An index is formatted data that the search engine stores, reads, and matches queries against. Indexes provide a more efficient way of searching content. These indexes are stored in the file system in a location that should be accessible to the portal runtime. With this architecture, extending the portal's native search capabilities to a clustered environment is a simple matter: all that has to be set up is a mounted file system (where the indexes are stored) that is accessible to each of the portal nodes in the cluster.
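For readers who have not worked with search indexes before, the following Python sketch shows one common way an index can speed up matching: an inverted index that maps each word to the documents containing it. This is an assumption made for illustration; the article does not describe the on-disk format that WebSphere Portal uses, and the function names and sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents.
docs = {
    "doc1": "WebSphere Portal search engine",
    "doc2": "Portal administration guide",
}
index = build_index(docs)
```

A query is answered by looking up each word in the index and intersecting the resulting sets, so the documents themselves never have to be rescanned.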
WebSphere Portal includes a Document Search portlet among the portlets that are installed by default. The Document Search portlet can crawl and index Web content sources and attachments. It can also create, schedule, and maintain search indexes, thus providing search functionality that is comparable to other Web search engines. This portlet can be used in production-ready portal applications, and we will be using it in this example.
These are only some of the features of the Portal Search Engine. For a complete description, see the Content search section of the WebSphere Portal Infocenter or WebSphere Portal Administration Guide.
The remainder of this article walks you through a simple search scenario in which you:
- Set up a collection, using a predefined static taxonomy (a.k.a. category).
- Install the Document Search portlet onto a page.
- Run the search to see the result.
Now let's start the scenario. First you create a collection of document URLs from a single Web site. These documents are the ones you want indexed and searched.
To create the collection:
- Log on to the portal as an administrator (e.g., wpsadmin).
- Go to Administration->Portal Settings->Search Administration. Select Create Collection (see Figure 1).
- Supply the following information in the "New Collection" page:
- Location of Collection: "IBMCrawlerPredefined". (Note that there is nothing special in this name.)
- Specify Collection Language: "English"
- Specify Categorizer: "Pre-Defined". (We are using the predefined categories provided by Native Search in this example.)
- Select Summarizer: "Automatic". (Summarizer is a feature used by the Portal Search Engine to form a summary of the entire document based on the most important sentences of the original document. Setting it to Automatic implies that we are going to use the summarizer functionality that is provided out of the box.)
- Check the checkbox for "Remove common words from queries...". (This removes common words like "on", "and", etc.) Click OK. The page should look like Figure 2.
- An empty collection is created. We need to add one or more sites that will be a part of the newly created collection. In this example we will use the external IBM site (www.ibm.com). Note that multiple sites can be added to this collection, although for our example we are going to use a single site. Also note that a folder by the same name (IBMCrawlerPredefined) is created in the file system under the InstallDir\WebSphere\AppServer directory (where InstallDir is the directory where WebSphere Portal is installed on your machine). WebSphere Portal stores the indexes for this collection in this folder (see Figure 4).
Click the Add Site link.
- Choose the options shown in Figure 4 for the new site that is being added to the collection. A brief explanation of some of the attributes shown in Figure 4 is given here:
- The "Collect documents linked for this URLS" attribute is set to www.ibm.com. This denotes the site from which documents are to be collected.
- The "Levels of Linked documents to collect" attribute is set to 2. This implies that links up to two levels away from the main URL will be followed and searched for content.
- The "Number of linked documents to collect" attribute is set to 100. This implies that at most 100 documents (that match the search criteria) may be collected from the main URL.
- The "Number of parallel processes" attribute is set to 5. This denotes the number of parallel crawlers that go out to the specified site to fetch Web pages. The larger the site, the more parallel crawlers can be used to retrieve all of its pages faster.
- The "Always use default character encoding" attribute is not checked. When it is not checked, the encoding information provided with the HTML content is used. When it is checked, it gives the administrator a means of overriding the encoding of the incoming HTML content.
- The "Add all documents to collection automatically" checkbox is checked. This implies that the incoming documents (from the URL) do not need any manual editing and may be added directly to the collection by the portal search engine.
- The "Obey Robot.txt" checkbox is checked. The robots.txt file is a means for webmasters to give directives to crawlers (robots) about which pages a robot may or may not include in its scans. The file is placed at the root of the Web site being crawled.
- Click the Create button. Now you are ready to start collecting links off the site that you just defined.
- The actual collection (of documents from the site) may be initiated by clicking the Start Collecting link in the "Sites in Collection:IBMCrawlerPredefined" panel (see Figure 5).
- Click the twistie on the Site Status section to watch the progress of the document collection process. This operation takes a few minutes to complete. It is advisable to do nothing else until the status is updated with a completion timestamp.
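The way the site attributes above interact is easier to see in code. The following Python sketch walks a small in-memory link graph breadth first, stopping at a maximum link depth and a maximum document count, and skipping URLs that a robots.txt rule disallows. It is only my reading of what those settings mean, not the portal crawler itself: the crawl function, the example.com URLs, and the sample robots.txt rules are hypothetical, the link graph stands in for real page fetches, and parallel crawlers are not modeled.

```python
from collections import deque
from urllib.robotparser import RobotFileParser


def crawl(start_url, links, robots_lines, max_depth=2, max_docs=100):
    """Breadth-first crawl over an in-memory link graph.

    links maps a URL to the URLs it links to (a stand-in for fetching pages).
    max_depth and max_docs play the role of the "levels" and "number of
    linked documents" settings; this mapping is an assumption.
    """
    robots = RobotFileParser()
    robots.parse(robots_lines)  # parse the rules without any network access
    collected = []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue and len(collected) < max_docs:
        url, depth = queue.popleft()
        if not robots.can_fetch("*", url):
            continue  # excluded by robots.txt
        collected.append(url)
        if depth < max_depth:
            for target in links.get(url, []):
                if target not in seen:
                    seen.add(target)
                    queue.append((target, depth + 1))
    return collected


# Hypothetical site structure and robots.txt rules.
sample_links = {
    "http://example.com/": ["http://example.com/a", "http://example.com/private/x"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": ["http://example.com/c"],
}
sample_robots = ["User-agent: *", "Disallow: /private/"]
pages = crawl("http://example.com/", sample_links, sample_robots)
```

With a depth limit of 2, the page three links away from the start URL is never visited, and the page under /private/ is skipped because of the robots.txt rule.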
Configuring the Search Portlet
With the collection configured we now need to configure and subsequently use the Document Search Portlet. WebSphere Portal comes with a set of installed portlets out of the box ready to be customized and used. The Document Search Portlet is one such portlet. This portlet can be customized and used in a production-ready portal application.
1. From the Administration tab (in the top right-hand corner) we need to get to the list of installed portlets. Click the "Portlets->Manage Portlets" link in the left navigation bar. This displays the list of portlets. Find the Document Search portlet and make a copy of it (see Figure 6).
2. A copy of the original Document Search Portlet is created. Note that in most portal installs, the initial status of this newly created (copied) portlet is set to Inactive. Highlight the cloned portlet and click the Modify Parameters link. Set the IndexName attribute to InstallDir\WebSphere\AppServer\IBMCrawlerPredefined (where InstallDir is the directory where WebSphere Application Server is installed and IBMCrawlerPredefined is the folder under which WebSphere Portal stores the index and other metadata for the collection). Click the Save button (not shown in Figure 7; scroll down to locate it). The IndexName attribute value is saved.
3. On the same page the Language attribute can also be set. In our example we chose the default language (English). We want to change the title of our portlet to give it a better name. Click the link labeled Set title for selected locale. In the subsequent Manage Portlets page we can change the title. For this example I changed it to "Document Search on IBM Internet". Click OK (see Figure 8).
Pressing OK brings us back to the portlet's property modification page. Once back there, click the Save button again (as in step 2 earlier). This will save all the parameters that we modified for our portlet. Click on the Cancel button immediately to the right of the Save button. This will take us out of the parameter modification page altogether.
4. The portlet (renamed) will be highlighted. If the portlet is in Inactive mode we need to activate it. This is required since only activated portlets can be used to customize a page in the portal (see Figure 9).
5. All set! All that is left to do is to add the new customized Document Search portlet to one of the top-level portal pages.
Adding Our Custom Portlet
The portal is made up of top-level pages, which can be thought of as the tabs found in most common UIs. The top-level pages are meant to separate out content and organize it according to a logical grouping of functionality. Quite a few top-level pages are provided out of the box (e.g., Content Publishing, Welcome, Documents, and My Work). For this example, I looked for the page with the least amount of content, where our customized portlet would be conspicuous by its presence (a feeling that we achieved something!). I selected the My Work page as the place to add our portlet.
We are now going to add this new portlet to the existing real estate on this page. To do this, click the Edit Page link in the top right-hand corner of the page (see Figure 10).
6. In the Edit Layout page (see Figure 11), click the Add Portlets button.
7. A search screen appears in which we need to find our portlet among the ones that are already installed. Key in the search keywords "Document Search on IBM" and hit the Search button. You will see that a match was returned with our portlet (see Figure 12). Click the checkbox by our portlet and click OK.
This brings us back to the page from which we started to add our portlet to the My Work top-level page. At this point you can play around with the page and place the portlet vertically, horizontally, in a T layout, and so on. I leave it up to the more artistic of you to come up with the best look and feel. As for me, I am just going to accept the default layout and click the Done button (see Figure 13).
8. This completes the customization and setup process. We're ready to run!
Running the Search
Back in the "My Work" page we can see that our portlet has arrived, ready to be executed. In the search text field, I keyed in the word websphere and hit the Search button to the right. An excerpt of the search results is shown in Figure 14.
Notice that each search result is categorized using the Portal Search Engine's built-in, predefined categories. There are two interesting things to notice:
1. Every search result is followed by a summary. This summary is created by the Portal Search Engine's Summarizer function. The Summarizer creates a summary for each of the documents that have a certain narrative quality.
2. Each search result is designated as part of a category. The categories used here are the predefined categories that come with the WebSphere Portal's Native Search. The incoming documents are categorized by the native search engine's Categorizer function. The Categorizer places the documents in a category based on certain internal algorithms. Pages that do not qualify for any of the provided categories are dropped into the "Uncategorized" category. The top-level node in the predefined category tree is called "root". The first search result is categorized under a main category, called "Business & Commerce", followed by several levels of sub-categorization. The second search result is one of those that the Categorizer could not qualify under one of the predefined categories. Hence, it is left as "Uncategorized" under the top-level ("root") node.
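As a rough illustration of how a summary can be formed from "the most important sentences" of a document, the Python sketch below scores each sentence by the average frequency of its words across the whole document and keeps the top-scoring sentences in their original order. This is a generic extractive technique chosen for illustration; the article does not say how the Summarizer actually ranks sentences, and the summarize function and sample text are hypothetical.

```python
import re
from collections import Counter


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z0-9]+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        # Average word frequency, so long sentences are not favored by length.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


# Hypothetical sample document.
sample = ("Portal search indexes documents. "
          "Portal search categorizes documents. "
          "The weather was pleasant.")
summary = summarize(sample, max_sentences=2)
```

In the sample text, the sentence about the weather shares no words with the rest of the document, so it scores lowest and is left out of the summary.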
Conclusion
In this article we saw how quickly we can set up a basic search portlet using the out-of-the-box Document Search Portlet, and how we can make some simple customizations to the existing portlet.
While in this article I demonstrated a search function using the predefined categories within WebSphere Portal, in my next article I will take you through a more flexible and advanced feature in which users can define their own custom category hierarchy for incoming documents to be managed and presented to the user through the portal.
Copyright © 2004 SYS-CON Media, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
More Stories By Tilak Mitra
Tilak Mitra is a Certified Senior IT Architect at IBM. He specializes in mid- to large-range enterprise and application architectures based on J2EE, MQ, and other EAI technologies. You can reach him at tmitra@us.ibm.com.