m Masafumi
on

 

I'm testing OpenCDISC 1.5 with some datasets that contain Japanese characters.

OpenCDISC 1.5 handles Japanese values correctly when they are passed in Dataset-XML format (the .xpt format does not seem to be supported for such values yet).

But there is still a problem when validating datasets with non-ASCII characters. Controlled terminology checks are done by parsing text files in the config/data/CDISC/SDTM/yyyy-mm-dd/ folder, but there is no way to specify the character encoding of these terminology files. This will be a barrier when users want to check their data against localized terminologies.

So I wrote a small patch that makes the text encoding of terminology files configurable.
As I failed to attach the patch file to this forum post, please download it from this link: https://www.dropbox.com/s/ps5w8spk1p8no06/OpenCDISC_CTEncoding.patch

When this patch is applied, users can specify the text encoding of terminology files by setting the Engine.ControlledTerminology.FileEncoding property.

For example, if a localized terminology file is encoded in EUC-JP, users should append the following line to the lib/settings/settings.properties file.

Engine.ControlledTerminology.FileEncoding = EUC-JP
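
For readers who cannot open the patch, here is a minimal Java sketch of the idea: look up the property and decode the terminology file with that charset, falling back to the platform default when the property is not set. Only the property name comes from this post. The class and method names are hypothetical, and this is not the patch itself or OpenCDISC's actual code.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.Properties;

public class TerminologyEncodingSketch {
    // Property name as given in the post; everything else is illustrative.
    static final String ENCODING_KEY = "Engine.ControlledTerminology.FileEncoding";

    /** Returns the configured charset, or the platform default if the property is absent or blank. */
    static Charset resolveCharset(Properties settings) {
        String name = settings.getProperty(ENCODING_KEY);
        if (name == null || name.trim().isEmpty()) {
            return Charset.defaultCharset();
        }
        // Throws UnsupportedCharsetException if the JVM does not know the name.
        return Charset.forName(name.trim());
    }

    /** Wraps a raw byte stream in a reader that decodes with the configured charset. */
    static BufferedReader openTerminology(InputStream in, Properties settings) {
        return new BufferedReader(new InputStreamReader(in, resolveCharset(settings)));
    }
}
```

Falling back to the default charset keeps existing installations working unchanged when the property is not present.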

I will present some test datasets in my next post.

 

Forums: Enhancements and Feature Requests

m Masafumi
on April 24, 2014

I prepared two test files to confirm that this patch works.

First, a tiny dataset in Dataset-XML format: https://www.dropbox.com/s/80zim8y349xlsal/dm.xml
This data contains Japanese characters in DM.RACE and DM.ETHNIC.

Of course, validating this data with OpenCDISC 1.5 generates errors for these two values: "Value for ETHNIC not found in (ETHNIC) CT codelist" and "Value for RACE not found in (RACE) CT codelist".

So let's modify "SDTM Terminology.txt" (that is bad practice, I think, but it serves for testing) so that the Japanese data passes. This is a modified version: https://www.dropbox.com/s/8iq8ywvsvvnq20e/SDTM%20Terminology.txt

This modified terminology file is encoded in EUC-JP. But OpenCDISC 1.5 interprets delimited text files using the platform's default text encoding. EUC-JP is not the default on either Windows or Mac in a Japanese locale, so OpenCDISC 1.5 still reports the same errors, even though the terminology file now contains the values.
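
The mismatch can be reproduced outside OpenCDISC with a few lines of Java. This is only an illustration: the sample string is a hypothetical localized value, and Shift_JIS and UTF-8 stand in for whatever the platform default happens to be.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    /** Encodes text with one charset and decodes the resulting bytes with another. */
    static String roundTrip(String text, Charset writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(writtenAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        Charset eucJp = Charset.forName("EUC-JP");
        Charset shiftJis = Charset.forName("Shift_JIS");
        String value = "\u65e5\u672c\u4eba"; // hypothetical localized term

        // Read with the encoding it was written in: the value survives.
        System.out.println(value.equals(roundTrip(value, eucJp, eucJp)));                  // prints true
        // Read with a different encoding: the value is garbled and no longer matches.
        System.out.println(value.equals(roundTrip(value, eucJp, shiftJis)));               // prints false
        System.out.println(value.equals(roundTrip(value, eucJp, StandardCharsets.UTF_8))); // prints false
    }
}
```

Because the garbled string differs from the value in the dataset, the codelist lookup fails even though the term is present in the file.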

Try applying my patch above and specify Engine.ControlledTerminology.FileEncoding = EUC-JP in the settings.properties file. After that, these errors will disappear.

 

m Masafumi
on April 24, 2014

In addition to the CT encoding patch, I'm trying to write a tiny patch that allows specifying the text encoding for .xpt datasets.

Currently, a one-line change seems to work: https://www.dropbox.com/s/5uvye3ekpdjvdm1/OpenCDISC_XPTEncoding.patch

I checked it with this xpt file:
 https://www.dropbox.com/s/baztwebsdbcf6fn/dm.xpt

But this xpt file was generated by R, not SAS. I will continue testing with more datasets.
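
To show what kind of change is involved, here is a rough Java sketch of decoding one character field from a record with an explicit charset instead of the platform default. It assumes fixed-width, blank-padded character fields, as in the SAS transport layout. The class and method names are hypothetical, and this is not the content of the patch or OpenCDISC's actual reader.

```java
import java.nio.charset.Charset;

public class XptFieldDecodeSketch {
    /**
     * Decodes one fixed-width character field with the given charset,
     * then strips the trailing blanks used as padding.
     */
    static String decodeField(byte[] record, int offset, int length, Charset charset) {
        // The explicit charset argument is the important part: without it,
        // new String(bytes) would use the platform default encoding.
        String decoded = new String(record, offset, length, charset);
        int end = decoded.length();
        while (end > 0 && decoded.charAt(end - 1) == ' ') {
            end--;
        }
        return decoded.substring(0, end);
    }
}
```

Stripping the padding after decoding avoids touching bytes that might belong to a multi-byte character.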

