New actions can be created to extend the capabilities of the Data Processing Service. You may want to integrate a new capability to translate text from one language to another, or you may want to use the existing metadata of the document to look up additional information about the document.
The Document Worker Framework is the perfect solution to quickly create new actions for the Data Processing Service.
A document enters a document processing workflow with metadata, and each action in the workflow can add, remove and modify this metadata. The Document Worker standardises this process by formalising the input and output of an action based on it. The input to such a worker is defined as the metadata fields that already exist on the document. The output is a series of modifications to the document's metadata.
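The input and output contract described above can be pictured with a short Java sketch. The `FieldChange`, `MetadataAction` and `MetadataWorkflow` types are hypothetical names invented for this illustration and are not part of the Document Worker API; a modification is modelled here as a removal followed by an addition.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: these types are NOT part of the Document Worker API.
final class FieldChange {
    enum Kind { ADD, REMOVE }

    final Kind kind;
    final String field;
    final String value; // ignored for REMOVE

    FieldChange(Kind kind, String field, String value) {
        this.kind = kind;
        this.field = field;
        this.value = value;
    }
}

interface MetadataAction {
    // Input: the fields already on the document. Output: the changes to apply.
    List<FieldChange> process(Map<String, List<String>> metadata);
}

final class MetadataWorkflow {
    // Applies one action's changes to a copy of the metadata and returns the copy.
    static Map<String, List<String>> apply(Map<String, List<String>> metadata,
                                           MetadataAction action) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
            result.put(entry.getKey(), new ArrayList<>(entry.getValue()));
        }
        for (FieldChange change : action.process(metadata)) {
            if (change.kind == FieldChange.Kind.ADD) {
                result.computeIfAbsent(change.field, k -> new ArrayList<>()).add(change.value);
            } else {
                result.remove(change.field);
            }
        }
        return result;
    }
}
```

Because the action only reports changes, the workflow stays in control of how and when they are applied to the document.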
Creating an action using the Document Worker Framework is achieved by using the Document Worker Archetype. The archetype scaffolds a simple example data processing action that can be used as the basis for a new data processing action.
Figure: Document Worker in Data Processing
Documentation on how to create a Document Worker from the Document Worker Archetype can be found here.
A Document Worker created from the archetype will contain an example implementation. This implementation looks for a field named ‘REFERENCE’ on the document and uses its value to retrieve data from a table. The retrieved data is then set as the value of the field ‘UNIQUE_ID’.
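The lookup logic described above might look roughly like the following. This is a simplified stand-in and not the code generated by the archetype: the document is modelled as a plain map, the table as an in-memory map, and the `ReferenceLookup` class name is hypothetical.

```java
import java.util.Map;

// Simplified stand-in for the example logic. Modelling the document and the
// lookup table as plain maps is an assumption made for this sketch.
final class ReferenceLookup {
    private final Map<String, String> table;

    ReferenceLookup(Map<String, String> table) {
        this.table = table;
    }

    // Reads REFERENCE, looks it up in the table, and writes the result to UNIQUE_ID.
    // Documents with no REFERENCE field, or with an unknown reference, are left unchanged.
    void processDocument(Map<String, String> document) {
        String reference = document.get("REFERENCE");
        if (reference == null) {
            return;
        }
        String uniqueId = table.get(reference);
        if (uniqueId != null) {
            document.put("UNIQUE_ID", uniqueId);
        }
    }
}
```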
The example implementation implements the ‘BulkDocumentWorker’ interface so that documents can be processed in batches. When using the BulkDocumentWorker interface, be sure to implement the ‘processDocument’ method as well as the ‘processDocuments’ method. If an error occurs while a batch is being processed, each document in the batch is retried one at a time through the ‘processDocument’ method.
If there is no efficiency to be gained by processing documents together then the ‘DocumentWorker’ interface should be implemented instead of the ‘BulkDocumentWorker’ interface, and the method ‘processDocuments’ should be removed. The example implementation implements BulkDocumentWorker for demonstration purposes only.
The Document Worker defines the following control options:
| Name | Description |
| --- | --- |
| maxBatchSize | Property of type 'int'. Specifies the maximum number of documents to include in a batch. |
| maxBatchTime | Property of type 'long'. Specifies the maximum length of time (in milliseconds) to spend building up a batch. The time starts as soon as the first document is received. |
| closeBatch() | Method that draws the current batch of documents to a close. After it is called, no new documents are added to the batch and 'currentSize' remains the same. This could be used to allow for some dynamic handling of documents. After calling this method, it is considered good practice to continue iterating over the documents in the batch until there are none left to be processed. |
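How these three controls interact can be sketched in a few lines of Java. `BatchBuilder` is a hypothetical model written for this illustration, not the framework's batching code; the injectable clock is an assumption that lets the timing rule be exercised without sleeping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

// Hypothetical model of batch building; the framework's internals may differ.
final class BatchBuilder {
    private final int maxBatchSize;
    private final long maxBatchTime;   // milliseconds
    private final LongSupplier clock;  // current time in milliseconds
    private final List<String> documents = new ArrayList<>();
    private long firstDocumentAt;
    private boolean closed;

    BatchBuilder(int maxBatchSize, long maxBatchTime, LongSupplier clock) {
        this.maxBatchSize = maxBatchSize;
        this.maxBatchTime = maxBatchTime;
        this.clock = clock;
    }

    // Returns true if the document was accepted into the batch.
    boolean offer(String document) {
        if (!isOpen()) {
            return false;
        }
        if (documents.isEmpty()) {
            firstDocumentAt = clock.getAsLong(); // the timer starts with the first document
        }
        documents.add(document);
        return true;
    }

    boolean isOpen() {
        if (closed || documents.size() >= maxBatchSize) {
            return false;
        }
        return documents.isEmpty() || clock.getAsLong() - firstDocumentAt < maxBatchTime;
    }

    // No further documents are accepted; the current size is left untouched.
    void closeBatch() {
        closed = true;
    }

    int currentSize() {
        return documents.size();
    }
}
```

In this model the batch stops growing as soon as any one of the three conditions is met, whichever comes first.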
Document workers have the following advantages:
ChainedActionType
When the data processing action supports batching, the Bulk Document Worker is advantageous. For example, a Document Worker can be used to look up and retrieve information from a database. Suppose field values on a document contain project IDs that are to be replaced with their corresponding project names. A worker could take a batch of documents and make a single call to the database to perform a bulk lookup of the project IDs, returning each project ID together with its project name.
These documents will be processed in a single interaction with the database. If you attempted to process each of the documents individually, latency would be greatly increased because a separate round trip would be made to the database for every document.
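The project name example can be sketched as follows. Everything here is an assumption made for illustration: `ProjectDirectory` stands in for a database client and simply counts round trips, the field name `PROJECT` is invented, and documents are modelled as plain maps.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a database client; it counts round trips for illustration.
final class ProjectDirectory {
    private final Map<String, String> namesById;
    int roundTrips = 0;

    ProjectDirectory(Map<String, String> namesById) {
        this.namesById = namesById;
    }

    // One call resolves many IDs, much like a single query with an IN (...) clause.
    Map<String, String> lookupNames(Set<String> ids) {
        roundTrips++;
        Map<String, String> found = new HashMap<>();
        for (String id : ids) {
            if (namesById.containsKey(id)) {
                found.put(id, namesById.get(id));
            }
        }
        return found;
    }
}

final class ProjectNameResolver {
    // Replaces each document's project ID with the project name using one bulk lookup.
    static void processDocuments(List<Map<String, String>> batch, ProjectDirectory directory) {
        Set<String> ids = new HashSet<>();
        for (Map<String, String> document : batch) {
            String id = document.get("PROJECT");
            if (id != null) {
                ids.add(id);
            }
        }
        Map<String, String> names = directory.lookupNames(ids); // single round trip
        for (Map<String, String> document : batch) {
            String name = names.get(document.get("PROJECT"));
            if (name != null) {
                document.put("PROJECT", name);
            }
        }
    }
}
```

Collecting the IDs into a set first also means that an ID shared by several documents is only looked up once.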
Utilising the BulkDocumentWorker interface allows users to make use of the ‘closeBatch()’ method. Calling this method will result in no more documents being added to the batch. This method can be used to allow for dynamic handling of documents. For example, if the accumulated size of the documents were to exceed a limit, the closeBatch() method can prevent any more documents from being added to the batch, regardless of how ‘maxBatchSize’ and ‘maxBatchTime’ are configured, in order to reduce the time spent retrieving information from the database.
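A size-based cut-off of this kind could be written along the following lines. `LazyBatch` is a minimal stand-in that mimics only the behaviour described here (a batch that grows while it is iterated and stops growing once closed); it is not the framework's own batch type, and documents are reduced to their text content.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Queue;

// Minimal stand-in for a lazily built batch; not the framework's own type.
final class LazyBatch implements Iterable<String> {
    private final Queue<String> incoming;                    // documents still waiting
    private final List<String> accepted = new ArrayList<>(); // documents in the batch
    private boolean closed;

    LazyBatch(List<String> incoming) {
        this.incoming = new ArrayDeque<>(incoming);
    }

    void closeBatch() {
        closed = true;
    }

    int currentSize() {
        return accepted.size();
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                // Pull another document into the batch only while it is still open.
                if (next == accepted.size() && !closed && !incoming.isEmpty()) {
                    accepted.add(incoming.remove());
                }
                return next < accepted.size();
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return accepted.get(next++);
            }
        };
    }
}

final class SizeLimitedWorker {
    // Closes the batch once the accumulated content length reaches the limit, then
    // keeps iterating so that every document already in the batch is still handled.
    static List<String> processDocuments(LazyBatch documents, long limit) {
        List<String> processed = new ArrayList<>();
        long total = 0;
        for (String content : documents) {
            total += content.length();
            if (total >= limit) {
                documents.closeBatch();
            }
            processed.add(content);
        }
        return processed;
    }
}
```

Note that the loop does not break after closing the batch; it carries on until the iterator is exhausted.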
Additionally, if an implementation of BulkDocumentWorker encounters a problem while processing a batch, the worker will stop processing the documents as a batch and begin processing every document individually. This is very useful because it means that one troublesome document in a batch will not stop other documents from being processed.
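The fallback from batch to individual processing can be modelled as below. This is a sketch of the behaviour described above and not the framework's implementation: `SimpleBulkWorker` is a hypothetical interface that borrows only the two method names, and documents are reduced to strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mirrors only the two method names used in the text.
interface SimpleBulkWorker {
    void processDocument(String document) throws Exception;

    void processDocuments(List<String> documents) throws Exception;
}

final class BatchRunner {
    // Tries the whole batch first. On any error, each document is retried on its own,
    // and the documents that still fail individually are returned.
    static List<String> run(SimpleBulkWorker worker, List<String> batch) {
        List<String> failed = new ArrayList<>();
        try {
            worker.processDocuments(batch);
        } catch (Exception batchError) {
            for (String document : batch) {
                try {
                    worker.processDocument(document);
                } catch (Exception documentError) {
                    failed.add(document);
                }
            }
        }
        return failed;
    }
}
```

This is why ‘processDocument’ must do the same work as ‘processDocuments’: it is the path every document takes once a batch has failed.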
New actions can be created to extend the capabilities of the data processing service. Creating an action using the Document Worker Framework is achieved by using the Document Worker Archetype. A Document Worker can be used to alter the field names and values contained within the metadata of a document supplied to it. Documents can be processed as a batch.
Explicit failures should be added to documents as they are processed. Failures are recorded on the document by using the addFailure method on the Document object. The addFailure method takes two String arguments: a failure ID and a failure message. The failure ID should be a non-localizable identifier related to the failure. The failure message should be a human-readable message relating to the failure.
An example failure ID: “KVERR_FormatNotSupported”.
An example failure message: “The file format ‘IPG’ is not recognized”.
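Recording such a failure might look like this. `SketchDocument` is a stand-in reduced to the two-argument addFailure method described above so that the sketch compiles on its own; the `FormatCheck` class and its list of supported formats are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for the Document object, reduced to the one method described in the text.
final class SketchDocument {
    final String fileFormat;
    final List<String[]> failures = new ArrayList<>(); // each entry is {failureId, failureMessage}

    SketchDocument(String fileFormat) {
        this.fileFormat = fileFormat;
    }

    void addFailure(String failureId, String failureMessage) {
        failures.add(new String[] { failureId, failureMessage });
    }
}

final class FormatCheck {
    // Assumed list of supported formats, for illustration only.
    private static final List<String> SUPPORTED = Arrays.asList("PDF", "TXT");

    static void processDocument(SketchDocument document) {
        if (!SUPPORTED.contains(document.fileFormat)) {
            // The ID is a stable, non-localizable identifier; the message is for people to read.
            document.addFailure(
                "KVERR_FormatNotSupported",
                "The file format '" + document.fileFormat + "' is not recognized");
        }
    }
}
```

Keeping the ID stable lets downstream actions match on it, while the message can be reworded or translated freely.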
When utilising BulkDocumentWorker, if an error is encountered whilst processing a batch of documents, the worker will stop processing the documents as a batch and begin processing each document individually. This occurs so that a troublesome document in a batch will not stop other documents from being processed.
A Document Worker can throw a DocumentWorkerTransientException. This should be thrown when a transient failure has occurred, such as a brief disconnection from a resource such as a database. The operation might succeed if it is retried at a later time, so the task will be pushed back onto the queue. The ‘retryLimit’ is set in the configuration file for RabbitWorkerQueueConfiguration.
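The retry behaviour can be pictured with a small in-memory model. This is only a sketch: `SketchTransientException` stands in for DocumentWorkerTransientException so that the code compiles on its own, `RetryableTask` and `RetryingQueue` are hypothetical names, and the real retries are handled by the worker queue rather than by a loop like this one.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Stand-in for DocumentWorkerTransientException so the sketch compiles on its own.
class SketchTransientException extends Exception {
    SketchTransientException(String message) {
        super(message);
    }
}

interface RetryableTask {
    void run() throws SketchTransientException;
}

final class RetryingQueue {
    // Runs the task. On a transient failure the task is put back on the queue, until it
    // has been retried retryLimit times. Returns true if an attempt eventually succeeds.
    static boolean submit(RetryableTask task, int retryLimit) {
        Deque<RetryableTask> queue = new ArrayDeque<>();
        queue.add(task);
        int retries = 0;
        while (!queue.isEmpty()) {
            RetryableTask next = queue.remove();
            try {
                next.run();
                return true;
            } catch (SketchTransientException e) {
                if (retries++ < retryLimit) {
                    queue.add(next); // push back for a later attempt
                }
            }
        }
        return false;
    }
}
```

In this model a task is attempted at most once plus the retry limit, after which it is given up on.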
The Worker Framework supports error handling for non-parsable input messages and for catastrophic errors that are not recoverable. This information, along with information on poison messages and retry counts, is documented in Worker Framework.
JavaDocs for Document Worker Interface can be found here.