Using the Search Capabilities of WebSphere Portal V5 - Part I

Find your portlets quickly and easily

Most portal applications need a search capability. To meet that need, portal designers often turn to search engine products such as Lotus Extended Search or to open source implementations such as Lucene. Although these search engines provide sophisticated search capabilities, the Native Search feature of WebSphere Portal Server 5.x (hereafter called WebSphere Portal) has a fairly rich set of search engine capabilities that could satisfy the search requirements of many portal-based applications.

In this two-part series I will introduce you to the basic capabilities of WebSphere Portal's native search. While Part I will take a very simple search scenario and walk the reader through its implementation and configuration, Part II will introduce the reader to more advanced search capabilities.

Example Prerequisites
To understand the scenario in this first article, you need to have only a very basic understanding of WebSphere Portal's configuration options - primarily the fundamental administration options. That said, this article is targeted not only towards the portal administrator, but also to others who want a better understanding of WebSphere Portal's native searching capabilities.

I used IBM WebSphere Portal Enable for Multiplatforms v5.0.2 in a Windows environment for this example. If you want to run through this example yourself, you need WebSphere Portal installed on a machine with at least 1 GB of RAM. You also need basic Internet connectivity because the example involves accessing the external IBM Web site (www.ibm.com).

The Portal Search Engine
WebSphere Portal v5.0.2 provides a search engine that can crawl Web sites (external and internal), index aggregated content, and categorize documents. Categorization can use either a predefined set of categories or a user-defined custom categorization process. The categorization facility of WebSphere Portal Native Search includes an extensive list of categories that are grouped into high-level business industry areas (e.g., finance, transportation, etc.). You can use these predefined categories for your portal applications and the Categorization Engine will automatically categorize content for you. Alternatively, user-defined categories provide the flexibility of creating custom category trees that may be used to categorize the incoming search results.

The Portal Search engine collects documents from multiple sites into a single collection. The engine can be configured to automate the process of crawling the sites periodically and updating the search content. You can also manually trigger the collection update process.

When a collection is defined and activated, one of the main functions performed by the WebSphere Portal Native Search runtime is the creation of indexes. An index is formatted data that the search engine uses to store, read, and match queries against. Indexes make searching content more efficient. These indexes are stored in the file system in a location that should be accessible to the portal runtime. With this architecture, extending the portal's native search capabilities to a clustered environment is simple: all that has to be set up is a mounted file system (where the indexes are stored) that is made accessible to each of the portal nodes in the cluster.
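
To make the idea of an index concrete, here is a minimal sketch of an inverted index in Python. It is not a description of how WebSphere Portal formats or stores its indexes (this article does not cover that); the function names and the two sample documents are hypothetical and serve only to show why matching a query against an index is cheaper than rescanning every document.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document URLs that contain it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query, sorted."""
    words = tokenize(query)
    if not words:
        return []
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return sorted(matches)


# Hypothetical sample documents, keyed by URL.
documents = {
    "http://example.com/a": "WebSphere Portal search overview",
    "http://example.com/b": "Portal administration guide",
}
index = build_index(documents)
print(search(index, "portal"))
print(search(index, "websphere portal"))
```

A query is answered by looking up each word and intersecting the resulting sets, so the documents themselves are only read once, when the index is built.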

WebSphere Portal includes a Document Search portlet among the portlets that are installed by default. The Document Search portlet can crawl and index Web content sources and attachments. It can also create, schedule, and maintain search indexes, thus providing search functionality that is comparable to other Web search engines. This portlet can be used in production-ready portal applications, and we will use it in this example.

These are only some of the features of the Portal Search Engine. For a complete description, see the Content search section of the WebSphere Portal Infocenter or WebSphere Portal Administration Guide.

The remainder of this article walks you through a simple search scenario in which you:

  1. Set up a collection, using a predefined static taxonomy (a.k.a. category).
  2. Install the Document Search portlet onto a page.
  3. Run the search to see the result.

Setting Up the Collection
Now let's start the scenario. First you create a collection of document URLs from a single Web site. These documents are the ones you want indexed and searched.

To create the collection:

  1. Log on to the portal as an administrator (e.g., wpsadmin).
  2. Go to Administration->Portal Settings->Search Administration. Select Create Collection (see Figure 1).
  3. Supply the following information in the "New Collection" page:
    - Location of Collection: "IBMCrawlerPredefined". (Note that there is nothing special in this name.)
    - Specify Collection Language: "English"
    - Specify Categorizer: "Pre-Defined". (We are using the predefined categories provided by Native Search in this example.)
    - Select Summarizer: "Automatic". (Summarizer is a feature used by the Portal Search Engine to form a summary of the entire document based on the most important sentences of the original document. Setting it to Automatic implies that we are going to use the summarizer functionality that is provided out of the box.)
    - Check the checkbox for "Remove common words from queries...". (This removes common words like "on", "and", etc.)

    Click OK. The page should look like Figure 2.

  4. An empty collection is created. We need to add one or more sites to the newly created collection. In this example we will use the external IBM site (www.ibm.com). Multiple sites can be added to this collection, although our example uses a single site. Also note that a folder with the same name (IBMCrawlerPredefined) is created in the file system under the InstallDir\WebSphere\AppServer directory (where InstallDir is the directory where WebSphere Portal is installed on your machine). WebSphere Portal stores the indexes for this collection in this folder (see Figure 3).

    Click the Add Site link.

  5. Choose the options shown in Figure 4 for the new site that is being added to the collection. A brief explanation of some of the attributes shown in Figure 4 is given here:
  • The "Collect documents linked for this URLS" attribute is set to www.ibm.com. This denotes the site from which documents are to be collected.
  • The "Levels of Linked documents to collect" attribute is set to 2. This implies that links up to two levels away from the main URL will be followed and the linked pages collected.
  • The "Number of linked documents to collect" attribute is set to 100. This implies that at most 100 documents (that match the search criteria) may be collected from the main URL.
  • The "Number of parallel processes" attribute is set to 5. This denotes the number of parallel crawlers that go out to the specified site to fetch Web pages. The larger the site, the more parallel crawlers can be used to retrieve all of its pages faster.
  • The "Always use default character encoding" attribute is not checked. If it is not checked, the encoding information provided with the HTML content is used. If it is checked, the administrator can override the encoding of the incoming HTML content.
  • The "Add all documents to collection automatically" checkbox is checked. This implies that the incoming documents (from the URL) do not need any manual editing and may be added directly to the collection by the portal search engine.
  • The "Obey Robot.txt" checkbox is checked. A robots.txt file is a means for webmasters to tell crawlers (robots) which pages they may or may not include in their scans. The file is located at the root of the Web site being crawled.
For an explanation of the other attributes, please refer to the Content Search section of the WebSphere Portal Server InfoCenter (see Resources).
  6. Click the Create button. Now you are ready to start collecting links off the site that you just defined.
  7. Start the actual collection of documents from the site by clicking the Start Collecting link in the "Sites in Collection:IBMCrawlerPredefined" panel (see Figure 5).
  8. Click the twistie on the Site Status section to watch the progress of the document collection process. This operation takes a few minutes to complete. Do not do anything else until the status is updated with a completion timestamp.
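
For readers who want a feel for what the site attributes above control, the following Python sketch shows a breadth-first crawl bounded by a link depth and a document count, with a robots.txt check before each page is collected. It is only an illustration under stated assumptions: the portal's crawler is not implemented this way as far as this article can confirm, the `crawl` function, its parameter names, and the in-memory `site` map are hypothetical, and parallel crawlers are not modeled. The robots.txt handling uses Python's standard `urllib.robotparser` module.

```python
from collections import deque
from urllib.robotparser import RobotFileParser


def crawl(start_url, get_links, robots, max_levels=2, max_documents=100):
    """Breadth-first crawl limited by link depth and document count.

    get_links(url) returns the URLs linked from a page; it stands in
    for fetching and parsing the page over HTTP.
    """
    collected = []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue and len(collected) < max_documents:
        url, level = queue.popleft()
        if not robots.can_fetch("*", url):
            continue  # the site's robots.txt excludes this page
        collected.append(url)
        if level == max_levels:
            continue  # do not follow links beyond the configured depth
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return collected


# Hypothetical in-memory site: each URL maps to the links on that page.
site = {
    "http://example.com/": [
        "http://example.com/products",
        "http://example.com/private/report",
    ],
    "http://example.com/products": ["http://example.com/products/portal"],
    "http://example.com/products/portal": [
        "http://example.com/products/portal/search"
    ],
    "http://example.com/products/portal/search": [],
}

# Hypothetical robots.txt rules, parsed from lines instead of fetched.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

pages = crawl("http://example.com/", lambda url: site.get(url, []), robots)
print(pages)
```

In this sketch the page under /private/ is skipped because of the robots.txt rule, and the page three links away from the start URL is never reached because the depth limit is 2.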
Now that you have created and configured the collection, it's time to prepare to test the search. WebSphere Portal provides a portlet, Document Search, that you can configure to use the collection you just created.

Configuring the Search Portlet
With the collection configured we now need to configure and subsequently use the Document Search Portlet. WebSphere Portal comes with a set of installed portlets out of the box ready to be customized and used. The Document Search Portlet is one such portlet. This portlet can be customized and used in a production-ready portal application.

1.  From the Administration tab (in the top right-hand corner) we need to get to the list of installed portlets. Click the "Portlets->Manage Portlets" link in the left navigation bar. This displays the list of portlets. Find the Document Search portlet and make a copy of it (see Figure 6).

2.  A copy of the original Document Search Portlet is created. Note that in most portal installs, the initial status of this newly created (copied) portlet is Inactive. Highlight the cloned portlet and click the Modify Parameters link. Set the IndexName attribute to InstallDir\WebSphere\AppServer\IBMCrawlerPredefined (where InstallDir is the directory where WebSphere Application Server is installed and IBMCrawlerPredefined is the folder in which WebSphere Portal stores the index and other metadata for the collection). Click the Save button (not shown in Figure 7; scroll down to locate it). The IndexName attribute value is saved.

3.  In the same page the Language attribute can also be set. In our example we chose the default language (English). We want to change the Title of our portlet to give it a better name. Click on the link labeled Set title for selected locale. In the subsequent Manage Portlets page we can change the title. For this example I changed it to "Document Search on IBM Internet". Click OK (see Figure 8).

Pressing OK brings us back to the portlet's property modification page. Once back there, click the Save button again (as in step 2 earlier). This will save all the parameters that we modified for our portlet. Click on the Cancel button immediately to the right of the Save button. This will take us out of the parameter modification page altogether.

4.  The portlet (renamed) will be highlighted. If the portlet is in Inactive mode we need to activate it. This is required since only activated portlets can be used to customize a page in the portal (see Figure 9).

5.  All set! All that is left to do is to add the new customized Document Search portlet to one of the top-level portal pages.

Adding Our Custom Portlet
The portal is made up of top-level pages, which can be thought of as the tabs found in most common UIs. The top-level pages separate content, organizing it according to a logical grouping of functionality. Quite a few top-level pages are provided out of the box (e.g., Content Publishing, Welcome, Documents, My Work, etc.). For this example, I looked for the page with the least content, where our customized portlet would be conspicuous by its presence (and give us the feeling that we achieved something!). I selected the My Work page as the place to add our portlet.

We are now going to add this new portlet to the existing real estate on this page. To do this, click the Edit Page link in the top right-hand corner of the page (see Figure 10).

6.  In the Edit Layout page (see Figure 11), click the Add Portlets button.

7.  A search screen appears, where we need to find our portlet among the ones that are already installed. Key in "Document Search on IBM" as the search keywords and hit the Search button. You will see that a match is returned for our portlet (see Figure 12). Check the checkbox next to our portlet and click OK.

This brings us back to the original page from which we started adding our portlet to the My Work top-level page. At this point you can play around with the page and place the portlet vertically, horizontally, in a T layout, etc. I leave it up to the more artistic of you to come up with the best look and feel. As for me, I am just going to accept the default layout and click the Done button (see Figure 13).

8.  This completes the customization and setup process. We're ready to run!

Running the Search
Back in the "My Work" page we can see that our portlet has arrived, ready to be used. In the search text field, I keyed in the word websphere and hit the Search button to the right. An excerpt of the search results is shown in Figure 14.

Notice that each search result is categorized using the Portal Search Engine's built-in, predefined categories. There are two interesting things to notice there:

1.  Every search result is followed by a summary. This summary is created by the Portal Search Engine's Summarizer function. The Summarizer creates a summary for each of the documents that have a certain narrative quality.

2.  Each search result is designated as part of a category. The categories used here are the predefined categories that come with the WebSphere Portal's Native Search. The incoming documents are categorized by the native search engine's Categorizer function. The Categorizer places the documents in a category based on certain internal algorithms. Pages that do not qualify for any of the provided categories are dropped into the "Uncategorized" category. The top-level node in the predefined category tree is called "root". The first search result is categorized under a main category, called "Business & Commerce", followed by several levels of sub-categorization. The second search result is one of those that the Categorizer could not qualify under one of the predefined categories. Hence, it is left as "Uncategorized" under the top-level ("root") node.
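
The Categorizer's internal algorithms are not documented in this article, but the observable behavior (a document either lands in a predefined category or falls back to "Uncategorized" under "root") can be sketched with a simple keyword-overlap scheme. Everything in the Python sketch below is an assumption made for illustration: the keyword lists, the `min_hits` threshold, and the "Computers & Internet" category are hypothetical; only "root", "Business & Commerce", and "Uncategorized" come from the search results described above.

```python
import re

# Hypothetical keyword lists; the real predefined taxonomy and the
# Categorizer's internal algorithms are not described in the article.
CATEGORY_KEYWORDS = {
    "Business & Commerce": {"business", "commerce", "sales", "enterprise"},
    "Computers & Internet": {"software", "server", "internet", "portal"},
}


def categorize(text, min_hits=2):
    """Return a category path for the text, starting at the root node."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    best_category, best_hits = None, 0
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_category, best_hits = category, hits
    if best_hits < min_hits:
        # Not enough evidence for any predefined category.
        return ("root", "Uncategorized")
    return ("root", best_category)


print(categorize("Enterprise software for business sales teams"))
print(categorize("Contact us"))
```

The first call matches three "Business & Commerce" keywords and is filed there; the second matches none and falls back to "Uncategorized", mirroring the two search results discussed above.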

Conclusion
In this article we saw how quickly we can set up a basic search portlet using the out-of-the-box Document Search Portlet, and how we can make some simple customizations to the existing portlet.

While in this article I demonstrated a search function using the predefined categories within WebSphere Portal, in my next article I will take you through a more flexible and advanced feature in which users can define their own custom category hierarchy for incoming documents to be managed and presented to the user through the portal.

Resources

  • The InfoCenter is the most comprehensive place to find information on WebSphere Portal Server V5.x: http://publib.boulder.ibm.com/pvc/wp/502/ent/en/InfoCenter/index.html
  • IBM Redbook, IBM WebSphere Portal V5 A Guide for Portlet Application Development: http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts/sg246076.html?Open
  • WebSphere Portal Zone: www-106.ibm.com/developerworks/websphere/zones/portal/
  • Web crawler robots: www.robotstxt.org/wc/robots.html
  • One-Click Portal Install: www-306.ibm.com/software/info/websphere/partners6/demohelp.html

Tilak Mitra is a Certified Senior IT Architect at IBM. He specializes in mid- to large-range enterprise and application architectures based on J2EE, MQ, and other EAI technologies. You can reach him at tmitra@us.ibm.com.
