While working with define.xml, usually we are focused on its creation for submission to regulatory agencies. The key advantage towards adopting this burdensome standard is that it is machine readable. Well, recently I tested the machine readability of define.xml and am sharing my experience here.
I needed to import existing define.xml into some sort of relational database or tabular structure which could somehow fit in csv, excel or SAS datasets. I started looking out for available methods that other people may have been using. I didn’t notice any paper making use of the machine-readable aspect of define.xml, and found only one paper discussing how to import define.xml. Pinnacle 21 community software provides the facility to import define.xml. I tested these approaches and was surprised to see that none of these options provide a robust approach to read define.xml. Ultimately, I had to come up with my own way which I will discuss at the end. But first let’s look in the limitations of the available options.
SAS XML mapper
In the paper here, SAS XML mapper is introduced as an option to import the define.xml files into relational database like structure within SAS. First one needs to create a XMLMAP using either the define.xml file itself or it’s schema file. I wasn’t able to create the XMLMAP from the schema file. The issue was selection of root of the map, which I couldn’t get to work to get the proper map. Whereas define.xml can be easily used to generate the XMLMAP along with the SAS code to import the define.xml. The problem with this approach is that the datasets created by automatic XMLMAP aren’t same across different define.xml files. This is since the automatic mapper parses a define.xml file and create XMLMAP specific to that define.xml file. Another problem is that 40-50 datasets are created by automatic XMLMAP which are very hard to use. E.g. for one study this process extracted 49 metadata datasets, whereas for the other, it extracted 41 metadata datasets.
A robust approach will be create custom XMLMAP manually to extract define.xml. Such XMLMAP file can work across studies if created properly. Again this is time consuming task to create a custom XMLMAP file and it is not clear if entire define.xml can be successfully extracted via this approach.
Copy-Paste define.xml into excel
Direct copying/pasting seemed an attractive option to directly pull the rendered define.xml (while using the stylesheet provided by CDISC). It is pretty easy, just select the entire define.xml file already open in the browser and paste the contents into a blank excel file. To get better formatting, the define.xml file can be directly opened in excel (while selecting the associated stylesheet file). In excel, the define.xml looks very similar to how it looks in the browser. But after careful examination, this is not a viable option. The default stylesheet doesn’t display all the metadata attributes present in the define.xml. Often the metadata values are split across cells due to way the controlled terminology, comments, and where-clause are displayed within variable and value-level metadata. Finally formatted information, displayed within the ‘controlled terminology’ and ‘where’ columns, replace the original reference names stored within the define.xml. The formatted values are harder to read programmatically and in some scenarios, do not uniquely identify the correct controlled terminology. Via this approach the copied metadata is incomplete, harder to use and not robust.
Import define.xml using the Pinnacle 21
Pinnacle 21 community software comes along with the option to import existing define.xml into an excel specification file. The tool can also create excel specification file from the data. These excel specification files can be used to create the define.xml via Pinnacle 21 community software. This tool is able to read, the sample define.xml files provided by CDISC. But it failed to read other define.xml files available with me. These define.xml files are syntactically correct and structurally compliant as per the validation tool within the Pinnacle 21 community software.
My way – modify the CDISC stylesheet
Since, none of the above options worked for me, I tried a more creative way. It is possible to modify the existing XSL stylesheet provided by CDISC and modify it to display define.xml in any desired format. I modified the stylesheet to display all the dataset, variable level, value level and controlled terminology attributes into corresponding four consolidated 2-dimentional tables. Also, in the stylesheet, I suppressed splitting of cells and formatting of values. This approach worked. Using a single stylesheet file, now I can robustly extract all the needed metadata from define.xml into excel. The same modified stylesheet can work for any V2.0 compliant define.xml file.
It is true that define.xml is machine readable, but there is no widely available tool or any other straightforward way to import define.xml thoroughly and robustly.
Have you tried to import define.xml? What approaches have you tried and what has/hasn’t worked? Has your team benefitted due to machine readability of define.xml? I would love to hear back.