In this post we will go through what it takes to write a simple EMD file in Python. If you want to learn how to read an EMD file in Python, take a look here.
We will write an EMD file containing a 512x512x100 datacube filled with random numbers. The finished Python script can be found here. Please note that the EMD file created by this script is about 200 MB in size.
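As a rough check of that figure, the size of the raw data can be estimated from the shape and the datatype. The sketch below assumes uncompressed storage and ignores the small overhead HDF5 adds for its own bookkeeping; the variable names are only illustrative.

```python
import numpy as np

shape = (512, 512, 100)

# dtype='float' corresponds to 64-bit floats in NumPy, i.e. 8 bytes per value
bytes_per_value = np.dtype('float').itemsize

# total number of values in the datacube
n_values = 1
for length in shape:
    n_values *= length

n_bytes = n_values * bytes_per_value
print(n_bytes / 2**20, 'MiB')  # prints "200.0 MiB"
```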
    import h5py
    import numpy as np
    import datetime
EMD files are based on the HDF5 format. Therefore we import the h5py package, which provides the Python interface to the HDF5 library. Further details on this package and how to install it can be found on the official website. h5py uses numpy arrays to handle the data contained in an HDF5 file, so the numpy package is imported as well. We will also use it to create our random test data. The datetime module is imported to create a timestamp for the comments metadata.
    # create file
    f = h5py.File('test.emd', 'w')
To create our EMD file, we let h5py create a new HDF5 file named test.emd. The 'w' parameter opens the file in write mode, which replaces any existing file of that name. We save the reference to the root group in the variable f.
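h5py file objects can also be used as context managers, which closes the file automatically, even if an error occurs while writing. The following is a minimal sketch of that alternative, not the approach used in the rest of this post; the scratch directory and the name sketch.emd are placeholders.

```python
import os
import tempfile

import h5py

# placeholder location, so that the sketch does not overwrite test.emd
path = os.path.join(tempfile.mkdtemp(), 'sketch.emd')

with h5py.File(path, 'w') as f:
    # inside the block, f is the root group of the new file
    root_name = f.name

# leaving the block closes the file; a closed File object evaluates to False
print(root_name, bool(f))
```

With this pattern the explicit f.close() call at the end of the script is no longer needed.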
    # set version information
    f.attrs['version_major'] = 0
    f.attrs['version_minor'] = 2
To let the reader know which specification the data contained in this file follows, we set the version information as attributes of the root group. Note that every group and dataset in an HDF5 file accessed through h5py has a property called attrs, which exposes the HDF5 attributes of this group or dataset through a dict-like interface.
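To illustrate the dict-like behaviour of attrs, here is a small sketch that sets the two version attributes and reads them back. It assumes an in-memory file (h5py's core driver without a backing store) so that nothing is written to disk; the file name attrs_sketch.emd is a placeholder.

```python
import h5py

# in-memory file (core driver, no backing store): nothing is written to disk
with h5py.File('attrs_sketch.emd', 'w', driver='core', backing_store=False) as f:
    f.attrs['version_major'] = 0
    f.attrs['version_minor'] = 2

    # membership tests and iteration work as they do on a dict
    has_major = 'version_major' in f.attrs
    version = {key: int(value) for key, value in f.attrs.items()}

print(has_major, version)
```

The values come back as numpy scalars, which is why they are converted with int() here.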
    # add a group
    grp_exp = f.create_group('data')
Next we add a data group as a container for the datasets we are going to write to this EMD file. This is especially useful if you want to put multiple datasets into a single EMD file.
    # add an emd type subgroup for the dataset
    grp_dst = grp_exp.create_group('dataset_1')
    grp_dst.attrs['emd_group_type'] = 1
Our dataset itself will be contained in another subgroup, which holds the actual data and the dimension vectors. This group is given a meaningful identifier that best describes the dataset (dataset_1 in our case). For this group to be recognized as an EMD-type dataset, we add the attribute emd_group_type and set it to the integer value 1.
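The emd_group_type attribute is what allows a program to locate the datasets in a file without knowing their names. As a sketch of how this might look, the hypothetical helper find_emd_groups below walks an open file with h5py's visititems method and collects every group tagged this way. The helper, the in-memory file and the untagged group called scratch are assumptions made for illustration.

```python
import h5py


def find_emd_groups(h5file):
    """Return the paths of all groups with emd_group_type equal to 1."""
    found = []

    def visitor(name, obj):
        # returning None tells visititems to keep walking the file
        if isinstance(obj, h5py.Group) and obj.attrs.get('emd_group_type') == 1:
            found.append(name)

    h5file.visititems(visitor)
    return sorted(found)


# in-memory file (core driver, no backing store): nothing is written to disk
with h5py.File('find_sketch.emd', 'w', driver='core', backing_store=False) as f:
    tagged = f.create_group('data/dataset_1')
    tagged.attrs['emd_group_type'] = 1
    f.create_group('data/scratch')  # not tagged, so it is ignored

    paths = find_emd_groups(f)

print(paths)
```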
    # create a 3D dataset with random floats
    data = grp_dst.create_dataset('data', (512, 512, 100), dtype='float')
    data[:, :, :] = np.random.rand(512, 512, 100)
To this group we add the actual dataset using the create_dataset method. Its parameters are the label of the dataset, which has to be data in the EMD specification, the shape of the dataset, and its datatype. Here we create a 512x512x100 three-dimensional datacube of float values. To write data to this dataset, we use numpy indexing on the returned handle. In this example we generate the random floats with numpy's random.rand() function and assign them to our EMD dataset.
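If the array already exists in memory, create_dataset can also take it through its data keyword, in which case the shape and datatype are taken from the array. The sketch below shows this variant on a deliberately small 8x8x4 array in an in-memory file; both are stand-ins chosen to keep the example light, not part of the script described in this post.

```python
import h5py
import numpy as np

# small stand-in for the 512x512x100 datacube
values = np.random.rand(8, 8, 4)

# in-memory file (core driver, no backing store): nothing is written to disk
with h5py.File('data_sketch.emd', 'w', driver='core', backing_store=False) as f:
    grp = f.create_group('data/dataset_1')
    grp.attrs['emd_group_type'] = 1

    # shape and dtype are inferred from the array passed as data
    dset = grp.create_dataset('data', data=values)

    shape, dtype = dset.shape, dset.dtype
    roundtrip_ok = np.array_equal(dset[...], values)

print(shape, dtype, roundtrip_ok)
```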
    # add dimension vectors
    dim1 = grp_dst.create_dataset('dim1', (512, 1), dtype='int')
    dim1[:, 0] = np.array(range(512))
    dim1.attrs['name'] = np.string_('x')
    dim1.attrs['units'] = np.string_('[px]')
In addition to the actual data we need to supply the dimensions for each axis in dim# datasets within the same group. Each of these contains the values along its axis, one for each element in that direction. The datasets are created in the EMD file in the same way we created the datacube. The first dimension in this example is filled with integers indicating the nth pixel in the x direction. Attributes are used to indicate the label for this axis (name) and the units used for the values. See also the recommendation for consistent unit description in the specifications. To save strings using the HDF5 library, np.string_() is used to convert them to fixed-width byte strings. Note that np.string_ is an alias of np.bytes_ that was removed in NumPy 2.0, so with recent NumPy versions np.bytes_() has to be used in its place.
    dim2 = grp_dst.create_dataset('dim2', (512, 1), dtype='int')
    dim2[:, 0] = np.array(range(512))
    dim2.attrs['name'] = np.string_('y')
    dim2.attrs['units'] = np.string_('[px]')
    dim3 = grp_dst.create_dataset('dim3', (100, 1), dtype='float')
    dim3[:, 0] = np.linspace(0.0, 3.14, num=100)
    dim3.attrs['name'] = np.string_('angle')
    dim3.attrs['units'] = np.string_('[rad]')
Dimension vectors have to be provided for each dimension of the original dataset. The above code creates the dim2 and dim3 datasets analogously to the dim1 dataset. The dim2 dataset contains integers indicating the nth pixel in the y direction, similar to dim1. dim3 contains float values describing a fictional angular dimension running from 0 to 3.14, i.e. approximately pi.
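Since the three dimension vectors follow the same pattern, they could be written by a small helper that also checks that each vector matches the length of its axis. The function add_dim below is a hypothetical sketch, not part of the EMD specification or of h5py. It uses np.bytes_, the current name of np.string_, and is demonstrated on a small 8x8x4 stand-in dataset in an in-memory file.

```python
import h5py
import numpy as np


def add_dim(group, axis, values, name, units):
    """Add the dataset dim<axis> (1-based) to an EMD dataset group."""
    values = np.asarray(values)
    expected = group['data'].shape[axis - 1]
    if values.shape != (expected,):
        raise ValueError('dim%d needs %d values, got shape %s'
                         % (axis, expected, values.shape))
    dim = group.create_dataset('dim%d' % axis, (expected, 1), dtype=values.dtype)
    dim[:, 0] = values
    # np.bytes_ is the current name of np.string_
    dim.attrs['name'] = np.bytes_(name)
    dim.attrs['units'] = np.bytes_(units)
    return dim


# in-memory file (core driver, no backing store): nothing is written to disk
with h5py.File('dims_sketch.emd', 'w', driver='core', backing_store=False) as f:
    grp = f.create_group('data/dataset_1')
    grp.attrs['emd_group_type'] = 1
    grp.create_dataset('data', (8, 8, 4), dtype='float')

    add_dim(grp, 1, np.arange(8), 'x', '[px]')
    add_dim(grp, 2, np.arange(8), 'y', '[px]')
    dim3 = add_dim(grp, 3, np.linspace(0.0, 3.14, num=4), 'angle', '[rad]')
    dim3_shape = dim3.shape

    # a vector of the wrong length is rejected before anything is written
    try:
        add_dim(grp, 3, np.arange(5), 'angle', '[rad]')
        mismatch_rejected = False
    except ValueError:
        mismatch_rejected = True

print(dim3_shape, mismatch_rejected)
```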
The following groups and attributes are not necessary to create a valid EMD file. However, it is good practice to use them to supply metadata in a standardized way, which facilitates the exchange of scientific datasets. After all, this is what the EMD file format is all about.
    # create microscope group for metadata
    grp_mic = f.create_group('microscope')
    grp_mic.attrs['magnification'] = 10
The microscope subgroup is the recommended place to store the experimental settings of the microscope used to acquire the saved dataset. The metadata is stored as individual attributes of this group, shown here for a fictional magnification of 10x.
    # create user group for user info
    grp_usr = f.create_group('user')
    grp_usr.attrs['operator'] = np.string_('me')
    grp_usr.attrs['email'] = np.string_('me@mine')
The user subgroup is meant to contain information about the operator of the microscope. It should include contact information for the person to ask about the experiment or simulation whose results are provided in the EMD file.
    # create sample group for information on sample
    grp_spl = f.create_group('sample')
    grp_spl.attrs['material'] = np.string_('random')
The sample group should contain information about the sample, e.g. a unique identifier and information on the material and the preparation method.
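The microscope, user and sample groups are all filled in the same way, so the metadata could equally be kept in plain dicts and written in a loop. The helper write_metadata below is a hypothetical sketch of that idea. It assumes that all string values are ASCII, since np.bytes_ (the current name of np.string_) stores them as byte strings.

```python
import h5py
import numpy as np


def write_metadata(parent, group_name, metadata):
    """Create a group and store every dict entry as one attribute."""
    grp = parent.create_group(group_name)
    for key, value in metadata.items():
        if isinstance(value, str):
            # store strings as fixed-width byte strings
            value = np.bytes_(value)
        grp.attrs[key] = value
    return grp


# in-memory file (core driver, no backing store): nothing is written to disk
with h5py.File('meta_sketch.emd', 'w', driver='core', backing_store=False) as f:
    write_metadata(f, 'microscope', {'magnification': 10})
    write_metadata(f, 'user', {'operator': 'me', 'email': 'me@mine'})
    write_metadata(f, 'sample', {'material': 'random'})

    groups = sorted(f.keys())
    magnification = int(f['microscope'].attrs['magnification'])
    n_user_attrs = len(f['user'].attrs)

print(groups, magnification, n_user_attrs)
```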
    # create comments group for log
    grp_com = f.create_group('comments')
    # add a comment on file creation with the current timestamp
    timestamp = datetime.datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S (UTC)')
    grp_com.attrs[timestamp] = np.string_('file created, filled with random numbers')
The comments group is the recommended place for timestamped information on the history of the EMD file. An attribute should be added for every change made to the file. Here we retrieve the current timestamp using the datetime module and use it as the attribute name for a comment on the creation of this file.
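Because a comment should be added on every change, it may be convenient to wrap this step in a function. The helper add_comment below is a hypothetical sketch. It uses a timezone-aware datetime.datetime.now(datetime.timezone.utc), since utcnow() is deprecated as of Python 3.12, and np.bytes_, the current name of np.string_. As the timestamp has a resolution of one second, two comments written within the same second would share an attribute name and the later one would replace the earlier one.

```python
import datetime

import h5py
import numpy as np


def add_comment(h5file, text):
    """Store text in the comments group under the current UTC timestamp."""
    # require_group returns the group, creating it if it does not exist yet
    grp = h5file.require_group('comments')
    now = datetime.datetime.now(datetime.timezone.utc)
    key = now.strftime('%Y-%m-%d %H:%M:%S (UTC)')
    grp.attrs[key] = np.bytes_(text)
    return key


# in-memory file (core driver, no backing store): nothing is written to disk
with h5py.File('log_sketch.emd', 'w', driver='core', backing_store=False) as f:
    key = add_comment(f, 'file created, filled with random numbers')
    stored = key in f['comments'].attrs

print(key, stored)
```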
    # close the file
    f.close()
Finally, we close the EMD file. Congratulations, you have written your first EMD file using Python!
Make sure to play around with this file using the EMD Viewer, or try to read it using this post. A number of common pitfalls encountered when working with EMD files using the h5py package in Python have been compiled here.