We have an insurance customer with two FileNet P8 collections that were crawled by WEXCA. They wanted improved faceted search and full-text search capabilities. One collection was a single document class with 17 million documents. The other was also a single document class, with 8 million documents. The current implementation of the FileNet P8 crawler requires a full crawl and has no restart capability. The initial processing of 17 million documents could take up to a month of 24-hour processing, given the resources allocated. During that time we may need to increase the heap size, or stop the crawler for a FileNet maintenance outage. If the crawler stops, restarting it forces a full crawl from the beginning, even when the start-crawling option is set to update only. There is no way to stop and restart the crawler so that it picks up where it stopped. Because it was unreasonable to expect that we could crawl for multiple weeks without a restart, we were forced to split the collection across 10 crawlers, so that a stop would cost us only a few days of processing. This was still very costly, both in the time needed to crawl the collection and in setting up the crawlers (many metadata mappings were lost when the crawlers were cloned). This is also a one-time event, because the documents do not change.
The enhancement request is for the FileNet crawler to act more like the JDBC crawler, which picks up crawling where it left off. Given that FileNet P8 is a flagship content repository, it is very painful to crawl with big collections. An enhancement that allows a crawl to be restarted without reprocessing documents that have already been crawled is greatly needed.
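To make the requested behavior concrete, here is a minimal sketch of a checkpointed crawl loop. It is not the WEXCA or FileNet API: the names `resumable_crawl`, `load_checkpoint`, `save_checkpoint`, the JSON checkpoint file, and the assumption that documents can be enumerated in a stable, ascending ID order are all hypothetical and used only for illustration.

```python
import json
import os
import tempfile
from typing import Callable, Iterable, Optional


def load_checkpoint(path: str) -> Optional[str]:
    """Return the last committed document ID, or None on a first run."""
    if not os.path.exists(path):
        return None
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)["last_id"]


def save_checkpoint(path: str, last_id: str) -> None:
    """Write the checkpoint via a temp file and rename, so a crash
    cannot leave a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"last_id": last_id}, fh)
    os.replace(tmp, path)


def resumable_crawl(
    doc_ids: Iterable[str],
    process: Callable[[str], None],
    checkpoint_path: str,
    batch_size: int = 1000,
) -> int:
    """Process documents in ascending ID order, skipping everything at or
    below the checkpoint. Returns the number processed in this run.

    Assumption (hypothetical): doc_ids is sorted and stable across runs.
    """
    last_id = load_checkpoint(checkpoint_path)
    processed = 0
    current = None
    for doc_id in doc_ids:
        if last_id is not None and doc_id <= last_id:
            continue  # already crawled in an earlier run
        process(doc_id)
        processed += 1
        current = doc_id
        if processed % batch_size == 0:
            save_checkpoint(checkpoint_path, doc_id)
    if current is not None:
        save_checkpoint(checkpoint_path, current)  # commit the final partial batch
    return processed
```

With this approach, a stop for a heap-size change or a maintenance outage costs at most one batch of rework, since only documents after the last saved checkpoint are processed again. A real crawler would more likely push the "greater than last ID" filter into the repository query than skip IDs client-side as this sketch does.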
|Role Summary||WEX-CA Administrator and Solution Architect|