Curation Data Model

This document outlines the various actions that occur inside the Curation Manager for ReDBox and Mint. It is not intended to be read alone, so please see the other documentation relating to Curation for the broader context.

Each of the actions (or tasks) below is designed to be a single discrete step in a loose workflow that could potentially be distributed across multiple systems and records. This document outlines how all of the tasks fit together, but from a technical perspective the individual tasks are not designed with the big picture in mind; they are simply aware of the parts of the workflow around them and operate in isolation.

Because each step is also implemented as a message in a message queue, there is no guarantee that (for example) five messages in a row will relate to the one object. Each task must perform its allotted work and update the system state before it is finished, with no awareness of when (or even if) the next message in the process will arrive. The technical documentation will go into further detail on this issue and its implications, but from an administrative perspective it is worth keeping this in mind. For example, if you were monitoring log files for this process, they would not always show a single linear curation process from start to finish. You may find two or more independent 'streams' working through this process, and you would need to look at the specifics of each message to interpret them.
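To illustrate the interleaving described above, here is a minimal sketch that separates a mixed sequence of messages into per-object 'streams'. It assumes each message carries an object identifier under a key called 'oid'; that key name and the sample data are assumptions made for illustration, only the 'task' key comes from this document.

```python
from collections import defaultdict

def group_by_object(messages):
    """Split an interleaved message sequence into one task list per object."""
    streams = defaultdict(list)
    for message in messages:
        # 'oid' is an assumed key name for the object identifier.
        streams[message["oid"]].append(message["task"])
    return dict(streams)

# Two independent curation 'streams' as they might appear interleaved in a log.
log = [
    {"oid": "obj-1", "task": "curation"},
    {"oid": "obj-2", "task": "curation-request"},
    {"oid": "obj-1", "task": "curation-confirm"},
]
streams = group_by_object(log)
```

Read in log order the three messages look unrelated; grouped by object, 'obj-1' shows 'curation' followed by 'curation-confirm', while 'obj-2' is only just starting.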

On an operational level the Curation Manager looks for a particular key in each message, called 'task'. This tells the Curation Manager the purpose of the message and which part of the process it concerns. The various 'task's expected by the Curation Manager are outlined in the sections below.
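The routing on the 'task' key could be sketched roughly as follows. This is not the Curation Manager's actual code: the handler functions, the JSON encoding of the message and the 'oid' key are all assumptions made for illustration.

```python
import json

# Hypothetical handlers; the real tasks do far more work than this.
def reharvest(message):
    return "reharvest " + message["oid"]

def curation_request(message):
    return "curation-request " + message["oid"]

# Map each value of the 'task' key to the handler for that step.
HANDLERS = {
    "reharvest": reharvest,
    "curation-request": curation_request,
}

def dispatch(raw_message):
    """Parse one queue message and route it according to its 'task' key."""
    message = json.loads(raw_message)
    task = message.get("task")
    handler = HANDLERS.get(task)
    if handler is None:
        raise ValueError("Unknown or missing task: %r" % (task,))
    return handler(message)
```

Each handler sees only its own message, which matches the point made earlier that tasks operate in isolation.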

Tool Chain Tasks

Because we used the Curation Manager to replace the standard Fascinator tool chain, we need to make sure that the typical tool chain jobs are taken care of before we can start to consider curation. For this reason there are two 'task's related to these administrative jobs. Both ReDBox and Mint have these tasks and they function identically on both systems:

  • 'reharvest': This is the core of the traditional tool chain's role. It will run any transformers it is configured to, and then ensure that the Solr index gets updated. It will follow this by sending a 'clear-render-flag' back to the Curation Manager.
  • 'clear-render-flag': The web UI looks for a property against each object (which has come to be called a 'render flag') that indicates an object is currently in the tool chain being worked on. The Fascinator core sets this flag on the way in to the tool chain, but any replacement for the tool chain needs to remember to unset the render flag as it finishes, and that's what this task does.
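The hand-off between the two tool chain tasks can be pictured with a small in-memory sketch. The queue, the flag store and the 'oid' key are stand-ins invented for illustration; the transformer and Solr steps are only indicated by a comment.

```python
from collections import deque

queue = deque()      # stand-in for the message queue
render_flags = {}    # stand-in for the per-object 'render flag' property

def handle(message):
    oid = message["oid"]  # 'oid' is an assumed key name
    if message["task"] == "reharvest":
        # Configured transformers and the Solr index update would run here.
        queue.append({"task": "clear-render-flag", "oid": oid})
    elif message["task"] == "clear-render-flag":
        render_flags[oid] = False

def run():
    """Process messages until the queue is empty."""
    while queue:
        handle(queue.popleft())

# The core sets the flag on the way in; the tasks above unset it on the way out.
render_flags["obj-1"] = True
queue.append({"task": "reharvest", "oid": "obj-1"})
run()
```

After run() returns, the flag for 'obj-1' has been unset by the follow-up 'clear-render-flag' message rather than by 'reharvest' itself.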

Basic Curation

These tasks implement the core functionality of curation. Both ReDBox and Mint have these tasks, although there are some subtle differences in specific tasks, as noted below.

  • 'curation-request': This task is used to instigate the curation process for a record. It should typically define a relationship with the requester as well as a method of responding once curation has completed, and both of these should be stored for later use. Even if this record has been previously curated, the incoming relationships/responses may be new, so they are always stored (aside from some de-duplication). Unless configured to have a staff member verify this process, this task will typically generate a 'curation' task immediately and add it to the end of the queue. If verification is required, an email will be generated to administrators and processing will halt. It is the responsibility of the verification process to create the 'curation' task and enqueue it. See Note 1 for ReDBox/Mint differences.
  • 'curation-query': This is kind of like the 'light' version of a 'curation-request' task. There are times that the system doesn't necessarily want to trigger a curation workflow or declare a relationship, but would like to know if the object is already curated and has an identifier. If the target is not currently curated this record will be registered for a response when curation occurs, otherwise it receives an immediate response.
  • 'curation': This is where the actual curation of this record occurs. The Curation Manager will send the record through all configured curation Transformers and follow this up by generating a new 'curation-confirm' task on the end of the queue.
  • 'curation-confirm': This task has two roles:
  1. Confirm that curation actually occurred; meaning that the 'pidProperty' value set in configuration should contain a value. If this value is not set an error email will be generated to an administrator (see Note 2 for ReDBox) and the process halts, otherwise we continue on to...
  2. Check whether all related objects have gone through curation as well. For any that the current record is not aware of, a 'curation-request' will be sent out to the appropriate location. This may be sent to either ReDBox or Mint's Curation Manager depending on configuration for each given relationship. The outgoing request will specify what reverse relationship the other record should have with this one, and ask the other Curation Manager to respond with a 'curation-pending' task when the other object completes curation. If there are no related objects, then a 'curation-pending' task will be generated automatically and added to the message queue.
  • 'curation-pending': This task is used to keep the record in a holding pattern if it is waiting on other records to curate, and this is why it is specified as the response task for each outgoing 'curation-request'. Whenever this task is received we again check through all related objects to see if they have been curated, although this time we won't send out requests; we just want to know if this record is ready to go. If the 'curation-pending' task was sent as a response to one of our requests, then the relationship metadata is of course updated with the response details before this check is performed. If there are no related objects, or if all objects have completed curation, a 'curation-response' task is generated on the end of the message queue.
  • 'curation-response': By this stage we know not only that this record has been curated, but the entire network of linked relations has completed curation as well. The Curation Manager starts by generating a response message to any records that have sent incoming 'curation-request's or 'curation-query's to this record (this is why we store them all as they come in). Typically these will be 'curation-pending' tasks, but the requester could theoretically ask for anything. Following this a metadata flag is set against this object indicating that it is ready to publish. See Note 3 for ReDBox.
  • 'publish': Publishing is actually very simple, but we do need to do a few things to make sure it goes smoothly:
  1. We are going to set a metadata flag 'published' to 'true'. Our indexing rules files are watching for this value to index the record as a published object, thus exposing it in the 'published' web portal we want the ANDS harvester to use.
  2. We send the record through our configured curation Transformers again, just like in the 'curation' task. This allows for integration with external publication processes. See Note 4 as this relates to ReDBox and VITAL.
  3. If we have related objects we will in turn send 'publish' tasks out to them as well. We don't expect any responses from these however.
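The chain from 'curation' through to 'curation-response' described above can be summarised as a function that, given a task and a simplified record, returns the messages that task would enqueue. This is a rough sketch only: the record layout ('oid', 'pid', 'related', 'curated') and the request fields ('requester', 'respond_with') are hypothetical, and the case where every related object is already known to be curated at 'curation-confirm' time is not spelled out in the text, so it is treated here like the no-related-objects case.

```python
def next_messages(task, record):
    """Return the messages this task would add to the queue (sketch only)."""
    oid = record["oid"]
    related = record.get("related", [])

    if task == "curation":
        # Curation transformers would run here, then confirmation is queued.
        return [{"task": "curation-confirm", "oid": oid}]

    if task == "curation-confirm":
        if not record.get("pid"):
            # No identifier: an error email is sent and the process halts.
            return []
        outstanding = [r for r in related if not r.get("curated")]
        if not outstanding:
            # Assumption: nothing left to ask for, so move straight on.
            return [{"task": "curation-pending", "oid": oid}]
        return [
            {"task": "curation-request", "oid": r["oid"],
             "requester": oid, "respond_with": "curation-pending"}
            for r in outstanding
        ]

    if task == "curation-pending":
        # Holding pattern: only move on once every related object is curated.
        if all(r.get("curated") for r in related):
            return [{"task": "curation-response", "oid": oid}]
        return []

    raise ValueError("Task not covered by this sketch: %r" % (task,))
```

Note that 'curation-pending' never sends requests of its own; it either releases the record or leaves it waiting for the next response to arrive.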

Notes:

  1. Mint has a link on the record's details screen which will generate a 'curation' task and send it to the Curation Manager. It makes no sense to require manual verification by a staff member on ReDBox, since a 'curation-request' is triggered when a staff member has already been required to save an object to the 'Published' step of the workflow. For this reason there is no link on the details screen in ReDBox. If you wanted to turn this on for some reason, copy the link (and associated JavaScript) from Mint.
  2. When ReDBox is configured to integrate with VITAL, 'curation-confirm' is expected to fail the first time, because VITAL allocates Handles from its background indexer, which only runs periodically. The housekeeping script polls VITAL until a Handle is found, and a new 'workflow' task (see below) begins the process again for this record. This pass should no longer halt in 'curation-confirm', allowing the rest of the curation process to execute. The error email is still sent, but the text is different and specifically suggests that VITAL integration is the most likely cause. Administrators would only need to act on these emails if they receive more than one per record.
  3. The 'curation-response' task in ReDBox has the additional job of beginning the publication process. Given that the Collection is essentially at the 'middle' of our linked data network, once a positive response has come back from all related records it is time to publish, so a 'publish' task is generated for the Collection (and only the Collection; the 'publish' task is responsible for making this propagate through the network, not the 'curation-response' task) on the end of the message queue.
  4. Just like above in Note 2, the need to integrate with VITAL requires some additional consideration here. Once the object has been published our VITAL Transformer needs to copy all the datastreams out to VITAL and activate the object. Up until now it will have been an inactive object in the Fedora repository underneath VITAL. Activating the object is enough to publish the record in VITAL, so if you are harvesting VITAL from the ANDS harvester this is all you need.

ReDBox Specifics

These tasks are specific to ReDBox and have no equivalent in Mint. They relate to the standard ReDBox form based workflow and are used to integrate it with the tool chain and the curation process.

  • 'workflow': Prior to v1.2 ReDBox had a particularly inelegant process that occurred in response to a user saving new data from the workflow forms. This was an attempt to account for the asynchronous activities in the tool chain and the requirement to use the old VITAL Subscriber (it is now a Transformer) at some unknown point after the tool chain finishes processing. This new task replaces all of this functionality in the web portal, and re-implements it in the Curation Manager in a greatly streamlined fashion. For legacy support any of the older Subscriber messages now route to here to ensure that audit logging continues as it used to. It will also generate a new 'workflow-curation' task if a 'ReIndex' event occurred (this is the form data requesting an update to metadata templates etc. because the data has changed).
  • 'workflow-curation': This task will assess the workflow form data and decide if curation is required (i.e. is the object in the 'Published' workflow step) and then traverse the form data looking for any relationships that have been entered by data entry staff. This all occurs in line with the system configuration. Following this a basic 'curation-request' will be sent along with all the relationship data (although no response will be required).
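The decision made by 'workflow-curation' might look something like the sketch below. The form data layout ('workflow_step', 'relationships') and the 'oid' key are hypothetical; in the real system the traversal of form data is driven by configuration rather than fixed key names.

```python
def workflow_curation(oid, form_data):
    """Return a 'curation-request' message, or None if curation is not required."""
    # Curation is only required once the object reaches the 'Published' step.
    if form_data.get("workflow_step") != "Published":
        return None
    # Relationships entered by data entry staff (assumed to be a simple list).
    relationships = list(form_data.get("relationships", []))
    # No response task is named, since no response is required.
    return {
        "task": "curation-request",
        "oid": oid,
        "relationships": relationships,
    }
```

For example, an object still at an earlier (made-up) step such as 'Draft' produces no message at all, while one at 'Published' produces a request carrying whatever relationships were found.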