Saturday, February 20, 2016

Making your computational chemistry data available as supplementary material

It is fairly normal (and good practice!) to make atomic coordinates available as supplementary material when submitting a paper.  From what I can tell, this is typically done by copying the coordinates into a text editor and making a PDF file.  This can be a tedious, time-consuming, and error-prone process, and the irony is that the intended user will have to reverse the process to actually use the data. There is a better way.

Use an online digital repository
tl;dr: upload your files to an online digital repository, such as Figshare, and simply provide the link to the data in the supplementary material.

To use Figshare you simply make an account, click "Create a new project", and choose "File set". Then you can upload the files you want to share and describe what you are sharing.  Once you have everything the way you want it, you can make it public (you even get a DOI). Here I describe what I did for my latest paper (supplementary material on Figshare here).

Sharing coordinates
Most people will only be interested in the coordinates so I decided to make them available in xyz format, since this format can be read by almost any molecular visualization program.  So I copied the output files that contained the coordinates I wanted to share into a single folder and used OpenBabel to extract the coordinates and convert them into xyz format.

For this I used a bash command.  If you use another shell you can temporarily switch to bash by typing /bin/bash (to get out of bash later, type exit).  First make a new folder called "coordinates" (mkdir coordinates). Then convert the Gaussian log files to xyz files and place them in the new folder (all one line)

for i in *.out; do babel -ig09 "$i" -oxyz coordinates/"${i%.out}".xyz; done

If you want you can add "opt" to the xyz files to indicate that they are optimized by (all one line)

for i in *.out; do babel -ig09 "$i" -oxyz coordinates/"${i%.out}"opt.xyz; done
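
Since the whole point is to avoid an error-prone process, it may be worth checking the converted files before zipping them. The following is a minimal sketch, assuming the usual xyz layout (atom count on line 1, a comment on line 2, then one line per atom). The function name check_xyz is hypothetical and not part of OpenBabel.

```shell
# check_xyz FILE: succeed only if the atom count on line 1 matches the
# number of non-blank lines after the comment line. Hypothetical helper.
check_xyz() {
  awk '
    NR == 1          { n = $1 + 0 }   # declared atom count
    NR > 2 && NF > 0 { atoms++ }      # non-blank lines after the comment
    END              { exit !(n > 0 && atoms == n) }
  ' "$1"
}

# Report every file in the coordinates folder that fails the check.
for f in coordinates/*.xyz; do
  [ -e "$f" ] || continue             # skip if the glob matched nothing
  check_xyz "$f" || echo "suspicious xyz file: $f"
done
```

If the loop prints nothing, every file at least has a consistent atom count. It does not check that the geometry is the one you meant to share.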

When doing the calculations I used a different naming scheme for the systems so I added a text file called README to the folder where I describe the change. Then I created a zip file of the folder

zip -r coordinates_for_supmat.zip coordinates

which I then uploaded to Figshare and made public to get the DOI.

Sharing everything else
One or two people may be interested in more than coordinates.  So I also made a zip file of the entire project folder and uploaded that to the same Figshare fileset.  Actually, the calculations were done by three different people, so I made a separate zip file for each of their project folders.  This only takes a few minutes, not counting compression and upload time.  These folders also contain calculations that did not make it into the manuscript, but I am not willing to spend much time weeding them out because it is quite likely that no one will ever look at them.  Anyway, anyone with questions can always ask me. I also used a Google sheet for the data analysis (don't judge me!) so I included a link to that on Figshare.  If you used Excel you could just upload that file.

Sharing Gaussian output
Some of the calculations were done with Gaussian09, so I checked with Gaussian about whether they allow sharing output files.  They wrote that "yes, you may include the relevant parts of the output in the manuscript. We just ask that you not include any timing information".  I had a look at the output files and concluded that if I delete lines with the word "cpu" I should be OK.
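
One quick way to have that look is to list every line that mentions "cpu" before deleting anything. This is only a sketch. The function name show_matches is hypothetical, and the match is case-sensitive, so a line containing "CPU" would not show up.

```shell
# show_matches PATTERN FILE: print, with line numbers, the lines that
# contain PATTERN, i.e. the lines that "grep -v PATTERN" would drop.
# Prints nothing when no line matches. Hypothetical helper.
show_matches() {
  grep -n -e "$1" "$2" || true
}

# Usage: show_matches cpu file.out
```

If anything other than timing information shows up in the list, a narrower pattern is probably needed.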

You can do this for one file (here called "file.out") by

grep -v "cpu" file.out > temp && mv temp file.out

and all the files in a project (here located in folder "project") using bash (all one line)

for i in project/*.out project/**/*.out project/**/**/*.out; do grep -v "cpu" "$i" > temp && mv temp "$i"; done

This command goes three folders deep.  If you have more layers, add "project/**/.../**/*.out" as needed.  There are also many other ways of doing this.  Just to be safe I also deleted lines containing "terminated" because they contain a time and date stamp.
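
One of those other ways is to let find walk the folders, which removes the need to spell out each level. The sketch below is an alternative to the loop above, not what I actually ran. It drops both "cpu" and "terminated" lines in one pass; the function name strip_timing is hypothetical, and it assumes file names without newlines.

```shell
# strip_timing DIR: delete lines containing "cpu" or "terminated" from
# every .out file under DIR, at any depth. Files are changed in place,
# so work on a copy of the project folder. Hypothetical helper.
strip_timing() {
  find "$1" -type f -name '*.out' | while IFS= read -r f; do
    tmp=$(mktemp) || break
    # grep exits 0 (some lines kept) or 1 (no lines kept); 2+ is an error
    grep -v -e 'cpu' -e 'terminated' "$f" > "$tmp"
    if [ "$?" -le 1 ]; then cat "$tmp" > "$f"; fi
    rm -f "$tmp"
  done
}

# Usage: strip_timing project
```

Writing the filtered text back with cat, rather than moving the temporary file over the original, keeps the original file's permissions.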



This work is licensed under a Creative Commons Attribution 4.0 license.
