A patch to enable specifying text encoding of controlled terminology files

m Masafumi

on April 24, 2014

I'm testing OpenCDISC 1.5 with some datasets that contain Japanese characters.

OpenCDISC 1.5 handles Japanese values correctly when they are passed in Dataset-XML format (the .xpt format does not seem to be supported yet).

However, there is still a problem when validating datasets with non-ASCII characters. Controlled terminology is checked by parsing text files in the config/data/CDISC/SDTM/yyyy-mm-dd/ folder, but there is no way to specify the character encoding of these terminology files. This becomes a barrier when users want to check their data against localized terminologies.

So I wrote a small patch that makes it possible to specify the text encoding of terminology files.
I could not attach the patch file to this forum post, so please download it from this link: https://www.dropbox.com/s/ps5w8spk1p8no06/OpenCDISC_CTEncoding.patch

With this patch applied, users can specify the text encoding of terminology files by setting the Engine.ControlledTerminology.FileEncoding property.

For example, if a localized terminology file is encoded in EUC-JP, users should append the following line to the lib/settings/settings.properties file.

Engine.ControlledTerminology.FileEncoding = EUC-JP
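
To illustrate what such a setting implies, here is a minimal Java sketch of how a property like this could be turned into a charset, falling back to the platform default when it is not set. This is not the code from the patch: the class name TerminologyEncodingSetting and the helper resolveCharset are hypothetical, and only the property name comes from this post.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.Charset;
import java.util.Properties;

public class TerminologyEncodingSetting {
    // Property name taken from this post.
    public static final String KEY = "Engine.ControlledTerminology.FileEncoding";

    // Hypothetical helper (not from the patch): returns the configured charset,
    // or the platform default when the property is missing or blank.
    // Charset.forName throws an unchecked exception for unknown charset names.
    public static Charset resolveCharset(Properties settings) {
        String name = settings.getProperty(KEY);
        if (name == null || name.trim().isEmpty()) {
            return Charset.defaultCharset();
        }
        return Charset.forName(name.trim());
    }

    public static void main(String[] args) throws IOException {
        // Simulate the line appended to settings.properties.
        Properties settings = new Properties();
        settings.load(new StringReader(KEY + " = EUC-JP\n"));

        System.out.println("configured: " + resolveCharset(settings));
        System.out.println("unset:      " + resolveCharset(new Properties()));
    }
}
```

Falling back to the default charset in this sketch keeps the behavior unchanged for users who never set the property.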

I will provide some test datasets in my next post.

Forums: Enhancements and Feature Requests

m Masafumi

on April 24, 2014

Test files for the patch

I prepared two test files to confirm that this patch works.

First, a tiny dataset in Dataset-XML format: https://www.dropbox.com/s/80zim8y349xlsal/dm.xml
This data contains Japanese characters in DM.RACE and DM.ETHNIC.

Of course, validating this data with OpenCDISC 1.5 generates errors for these two variables: "Value for ETHNIC not found in (ETHNIC) CT codelist" and "Value for RACE not found in (RACE) CT codelist".

So let's modify "SDTM Terminology.txt" (that is bad practice, I think, but it is fine for testing purposes) so that the Japanese data passes. This is a modified version: https://www.dropbox.com/s/8iq8ywvsvvnq20e/SDTM%20Terminology.txt

This modified "SDTM Terminology.txt" is encoded in EUC-JP. However, OpenCDISC 1.5 interprets delimited text files using the platform's default text encoding. EUC-JP is not the default on either Windows or Mac in a Japanese locale, so OpenCDISC 1.5 still reports the same errors, even though the terminology file now contains these values.
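
The mismatch can be reproduced without OpenCDISC. The Java sketch below encodes a term in one charset and decodes it in another, which is roughly what happens when a file is read with the wrong encoding. The class EncodingMismatchDemo, the helper survives, and the sample term are all hypothetical and are not taken from the test files. It also assumes the JDK in use ships the EUC-JP charset, which standard full JDK distributions do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    // Returns true if a term written in fileEncoding is unchanged
    // after being read back using readerEncoding.
    public static boolean survives(String term, Charset fileEncoding, Charset readerEncoding) {
        byte[] onDisk = term.getBytes(fileEncoding);        // bytes as stored in the file
        String asRead = new String(onDisk, readerEncoding); // what the reader decodes
        return term.equals(asRead);
    }

    public static void main(String[] args) {
        // Hypothetical codelist value, written with Unicode escapes so the
        // source file itself does not depend on any particular encoding.
        String term = "\u65E5\u672C\u4EBA";
        Charset eucJp = Charset.forName("EUC-JP");

        System.out.println("read as EUC-JP: " + survives(term, eucJp, eucJp));
        System.out.println("read as UTF-8:  " + survives(term, eucJp, StandardCharsets.UTF_8));
    }
}
```

When the reader uses a charset other than the one the file was written in, the decoded text no longer equals the value in the dataset, so a codelist lookup fails even though the term is present in the file.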

Now try applying my patch above and specify Engine.ControlledTerminology.FileEncoding = EUC-JP in the settings.properties file. After that, these errors will disappear.
