Loading Terminology in SHARE: What's the DIFFerence?

We load the CDISC Controlled Terminology (CT) published by the NCI Enterprise Vocabulary Services (EVS) into the SHARE metadata repository (MDR). Once we've caught up with our backlog we'll have more CT in SHARE than any other type of metadata, and we'll publish the quarterly CT content to the SHARE Exports page on the CDISC web site. One of the main differences in how we load the CT is that we load only the incremental changes: we generate diff files that contain only the changes between the packages published in the current quarter and those published in the previous quarter. The diff files are available for download, or they can be generated using the NCI's Diff program, a Java application that is simple to set up on most machines.

The Diff application is easy to use and can generate diff files for any two versions of a delimited text CT package. The two packages used to generate the diff file do not need to be consecutive, which is useful if you skipped a quarter or two. For example, we could generate a diff file between SDTM Terminology 2015-12-18.txt and SDTM Terminology 2016-12-16.txt that contains all the terminology changes implemented in 2016. The SHARE team has been using it to generate quarterly diff files for older terminology packages (2014-06-27 and earlier). We drive the quarterly CT load process using these diff files.
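For anyone curious what the diff amounts to, here's a minimal Python sketch of the idea. It assumes the tab-delimited layout of the NCI packages with "Codelist Code" and "Code" columns; it's illustrative only and not a substitute for the NCI Diff program, which remains the authoritative tool.

```python
# Minimal sketch of diffing two tab-delimited CT packages.
# Column names are assumptions based on the NCI EVS text layout.
import csv

def read_package(path, key_cols=("Codelist Code", "Code")):
    """Return {key: row} for a tab-delimited CT package."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = csv.DictReader(f, delimiter="\t")
        return {tuple(r.get(c, "") for c in key_cols): r for r in rows}

def diff_packages(old_path, new_path):
    """Classify rows as added, removed, or updated between two packages."""
    old, new = read_package(old_path), read_package(new_path)
    added   = [new[k] for k in new.keys() - old.keys()]
    removed = [old[k] for k in old.keys() - new.keys()]
    updated = [new[k] for k in new.keys() & old.keys() if new[k] != old[k]]
    return added, removed, updated

if __name__ == "__main__":
    added, removed, updated = diff_packages(
        "SDTM Terminology 2015-12-18.txt", "SDTM Terminology 2016-12-16.txt")
    print(len(added), "added;", len(removed), "removed;", len(updated), "updated")
```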

We use the diff files and the full CT packages published in ODM-XML format to generate the SHARE load files. The CT content changes are added to an extended version of ODM, and this content is used to generate tab-delimited text files that can be batch imported into SHARE. Before we generate the load files we run a number of quality checks to confirm our assumptions about the CT packages are sound, so we don't run into problems during the load that would take additional time to back out of SHARE. Here's a basic list of the QC checks we run on the CT (a sketch of two of these checks follows the list):
Check for and replace characters that are not valid UTF-8 (e.g., stray 0x92 and 0x85 bytes)
Generate a SEND CT content file without the SDTM content
Combine the individual packages into one terminology file
Check to ensure all known subset items exist in the parent code list
Check for new terminology subsets. Add new subsets to the list of known subsets.
Check for duplicate code list names
Check for duplicate code list submission values
Check for duplicate code list c-codes
Check for terms within a code list that lack a unique submission value, CDISC synonyms, a CDISC definition, or an NCI preferred term
Check for duplicate terms based on code list c-code and term c-code
Generate or download the diff file for each package
Generate the load files using the diffs and compare the expected counts for each file
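To give a flavor of these checks, here's a rough Python sketch of two of them: duplicate code list names and invalid UTF-8 bytes. The column names, and the assumption that code list header rows have an empty "Codelist Code" value, reflect the NCI text layout as we understand it; treat this as illustrative only.

```python
# Sketch of two of the QC checks, assuming a tab-delimited package with
# "Code", "Codelist Code", and "Codelist Name" columns.
import csv

def check_duplicate_codelist_names(path):
    """Flag code list names that appear under more than one codelist c-code."""
    names = {}
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if not row.get("Codelist Code"):      # assumed: blank on header rows
                names.setdefault(row["Codelist Name"], set()).add(row["Code"])
    return {name: codes for name, codes in names.items() if len(codes) > 1}

def check_non_utf8_bytes(path):
    """Report lines containing bytes that are not valid UTF-8 (e.g. 0x92, 0x85)."""
    bad_lines = []
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as e:
                bad_lines.append((lineno, hex(raw[e.start])))
    return bad_lines
```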

The process of generating incremental load files is largely one of model transformation. SHARE's metamodel for representing the CT is quite different from what is published by the NCI EVS. It's important to note that what the NCI EVS publishes follows CDISC specifications established some time ago. The SHARE metamodel is roughly aligned with ISO 11179 and consists of versioned Value Domains, Conceptual Domains, Concepts, and Domain Values. You can see a high-level diagram of the current metamodel on the SHARE API documentation page on the CDISC web site. In SHARE, the ISO 11179 model is implemented on top of the OMG's Reusable Asset Specification (RAS), which means everything that gets loaded into SHARE is an asset.
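As a purely illustrative sketch (not the actual SHARE metamodel), the four asset types might be pictured as simple versioned classes along these lines:

```python
# Illustrative only: simplified, versioned asset types roughly in the spirit
# of ISO 11179. The real SHARE metamodel, built on OMG RAS assets, is
# documented on the SHARE API documentation page.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Asset:
    name: str
    version: str                      # everything loaded into SHARE is versioned

@dataclass
class Concept(Asset):
    c_code: str = ""                  # NCI c-code identifying the concept

@dataclass
class DomainValue(Asset):
    submission_value: str = ""
    concept: Optional[Concept] = None

@dataclass
class ConceptualDomain(Asset):
    concepts: List[Concept] = field(default_factory=list)

@dataclass
class ValueDomain(Asset):
    # roughly corresponds to a published code list and its terms
    conceptual_domain: Optional[ConceptualDomain] = None
    values: List[DomainValue] = field(default_factory=list)
```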

It takes some application logic to map the published CT into the SHARE model. A significant part of that logic exists as a set of rules that detail when the various SHARE terminology assets can be reused, and what needs to be up-versioned or created de novo. The application uses an API to find the current versions of the SHARE assets. Additional quality checks highlight error or warning conditions, such as multiple terms that are identical except for case. Once the updates in the diff files are processed, including the removal of terms or retirement of code lists, the application generates the tab-delimited load files that will be imported into SHARE to update the controlled terminology.
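The reuse / up-version / create decision might be pictured roughly like the sketch below. The function and field names are hypothetical; the real application implements a much larger rule set.

```python
# Hypothetical sketch of the reuse / up-version / create decision for one term.
# The api object, find_current_term, and matches are illustrative names only.
def resolve_term(diff_row, api):
    """Decide whether an incoming term reuses, up-versions, or creates an asset."""
    current = api.find_current_term(
        codelist_ccode=diff_row["Codelist Code"],
        term_ccode=diff_row["Code"])
    if current is None:
        return ("create", diff_row)            # brand-new term: create de novo
    if current.matches(diff_row):
        return ("reuse", current)              # unchanged: reuse the current version
    return ("up-version", current, diff_row)   # changed attributes: new version
```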

We recently encountered an example of a model transformation issue: multiple terms were updated at different times to use an identical c-code and submission value. This creates a scenario where one newer term has multiple predecessor terms, which is not directly supported in the SHARE metamodel. For the rare instances where this occurs, we developed a workaround that resolves the issue.

After the CT load files have been imported into SHARE, we check the loaded content against the CT published in the NCI spreadsheets. This amounts to yet another model transformation, where the metadata are extracted from SHARE and reconstituted into the same columnar arrangement as the NCI spreadsheets. The content extracted from SHARE is sorted identically to the content published by the NCI, so when a test fails the offending content is easy to identify. Tracing the root cause is a bit more challenging given the need to traverse the various asset types in the SHARE metamodel.
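Conceptually, that comparison works something like the following sketch, assuming both sides have been written out as tab-delimited files with the same columns and the same sort order:

```python
# Sketch of the post-load check: compare the NCI publication with the content
# reconstituted from SHARE, row by row and column by column.
import csv

def compare_exports(nci_path, share_path):
    """Yield (line number, column, NCI value, SHARE value) for each mismatch."""
    with open(nci_path, newline="", encoding="utf-8") as a, \
         open(share_path, newline="", encoding="utf-8") as b:
        nci_rows = csv.DictReader(a, delimiter="\t")
        share_rows = csv.DictReader(b, delimiter="\t")
        for lineno, (n, s) in enumerate(zip(nci_rows, share_rows), start=2):
            for col in n:
                if n[col] != s.get(col):
                    yield (lineno, col, n[col], s.get(col))
```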

Once the updated content is available in SHARE it can be exported or accessed via the API in XML, JSON, or RDF. Given that new CT packages are published quarterly, we anticipate that loading CT into a sponsor's repository will be a popular SHARE API use case. If representing the CT in the SHARE metamodel is a topic of interest, check out a recording of the CDISC webinar on SHARE 2.0 for a look at the expanded metamodel (more closely aligned with ISO 11179) that we'll be using in the future.
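As an illustration of that use case, a sponsor-side script might pull a code list from the API along these lines. The endpoint path, parameters, and authentication scheme here are placeholders, not the documented SHARE API:

```python
# Hypothetical sketch of pulling updated CT from the SHARE API as JSON.
# The URL pattern and bearer-token header are placeholders only.
import json
import urllib.request

def fetch_codelist(base_url, codelist_ccode, api_key):
    """Fetch one code list as JSON from a hypothetical SHARE API endpoint."""
    url = f"{base_url}/terminology/codelists/{codelist_ccode}?format=json"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```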
