IBM Analytics Ideas

Watson Explorer Content Analytics FileNet P8 Crawler Enhancement - Checkpoint Restart Crawling

We have an insurance customer with two FileNet P8 collections that were crawled by WEXCA. They wanted improved faceted search and full-text search capabilities. One collection was a single document class with 17 million documents. The other was also a single document class, with 8 million documents. The current implementation of the FileNet P8 crawler requires a full crawl and has no restart capability. The initial processing of the 17 million documents could take up to a month of 24-hour processing given the resources allocated. During that time we may need to increase the heap size or stop the crawler for a FileNet maintenance outage. If the crawler stops, then on restart it forces a full crawl from the beginning, even with the start crawling option set to update only. There is no way to stop the crawler and restart it so that it picks up where it stopped. Given that it was totally unreasonable to expect we could crawl for multiple weeks without a restart, we were forced to split the collection across 10 crawlers so that if we had to stop we would only lose a few days of processing. This was still very costly in time, both to get the collection crawled and to set up the crawlers (lots of metadata mappings were lost when we cloned the crawlers). Also, this is a one-time event because the documents do not change.
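To put rough numbers on the scenario above, here is a back-of-the-envelope calculation. It assumes that "up to a month" means 30 days of continuous crawling, that throughput is constant, and that the 10 crawlers each cover an equal share of the collection. None of these assumptions come from measured data, and the function names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def docs_per_second(total_docs: int, days: float) -> float:
    """Average throughput needed to finish total_docs in the given number of days."""
    return total_docs / (days * SECONDS_PER_DAY)


def worst_case_days_lost(total_days: float, crawler_count: int) -> float:
    """Work thrown away if one crawler is stopped just before it finishes.

    Assumes the collection is split evenly and a stopped crawler restarts
    from the beginning of its own share only.
    """
    return total_days / crawler_count


rate = docs_per_second(17_000_000, 30)          # assumed: 30 days for 17 million documents
print(f"{rate:.2f} documents per second")
print(f"{worst_case_days_lost(30, 1):.1f} days at risk with a single crawler")
print(f"{worst_case_days_lost(30, 10):.1f} days at risk with 10 crawlers")
```

Under these assumptions a single crawler puts the whole month at risk on every restart, while the 10-crawler workaround caps the loss at about three days, which matches the "few days" figure above.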

The enhancement request is for the FileNet crawler to act more like the JDBC crawler, which picks up crawling where it left off. Given that FileNet P8 is a flagship content repository, the crawler is very painful to use with big collections. An enhancement that allows crawling to restart without reprocessing documents that have already been crawled is greatly needed.
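The requested behavior could look something like the following minimal sketch of checkpoint-restart crawling. This is not how the WEXCA FileNet P8 or JDBC crawlers are implemented. The functions `load_checkpoint`, `save_checkpoint`, and `crawl`, the JSON checkpoint file, and the idea of walking document IDs in sorted order are all assumptions made for illustration. A real crawler would query the repository in some stable order instead of sorting IDs in memory.

```python
import json
import os
import tempfile
from typing import Callable, Iterable, Optional


def load_checkpoint(path: str) -> Optional[str]:
    """Return the last document ID recorded in the checkpoint file, if any."""
    if not os.path.exists(path):
        return None
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh).get("last_id")


def save_checkpoint(path: str, last_id: str) -> None:
    """Record the last fully processed document ID via write-then-rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"last_id": last_id}, fh)
    os.replace(tmp, path)  # rename so a crash never leaves a half-written file


def crawl(doc_ids: Iterable[str], process: Callable[[str], None],
          checkpoint_path: str, checkpoint_every: int = 1000) -> int:
    """Process IDs in sorted order, skipping everything at or before the checkpoint.

    Returns the number of documents processed in this run.
    """
    last_id = load_checkpoint(checkpoint_path)
    processed = 0
    current = None
    for doc_id in sorted(doc_ids):
        if last_id is not None and doc_id <= last_id:
            continue  # already crawled in an earlier run
        process(doc_id)
        current = doc_id
        processed += 1
        if processed % checkpoint_every == 0:
            save_checkpoint(checkpoint_path, current)
    if current is not None:
        save_checkpoint(checkpoint_path, current)  # final checkpoint for a clean finish
    return processed
```

With this design, a crawler stopped for a heap-size change or a maintenance outage reprocesses at most `checkpoint_every` documents on restart instead of the whole collection. Documents processed after the last saved checkpoint are handled a second time, so `process` should be safe to repeat.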

  • Randall Wilcox
  • Oct 16 2017
  • Planned
Role Summary: WEX-CA Administrator and Solution Architect
Idea Priority: High
  • Randall Wilcox commented
    October 16, 2017 20:56

    This enhancement is related to the following PMRs: 38151,756,000 and 38158,756,000.
