[Python] Write EMD

In this post we will go through what it takes to write a simple EMD file in python. If you want to learn how to read an EMD file in python, take a look here.

We will write an EMD file containing a 512x512x100 datacube filled with random numbers. The finished python script can be found here. Please note that the EMD file created by this script is about 200 MB in size.
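The quoted file size can be checked with a little arithmetic before anything is written. The sketch below assumes the datacube is stored uncompressed and that the 'float' datatype maps to a 64-bit float, which is the numpy default.

```python
import numpy as np

shape = (512, 512, 100)

# numpy maps the 'float' dtype to a 64-bit float, i.e. 8 bytes per value
itemsize = np.dtype('float').itemsize

n_values = shape[0] * shape[1] * shape[2]
n_bytes = n_values * itemsize

print('%d values, %d bytes, %.1f MiB' % (n_values, n_bytes, n_bytes / 2**20))
```

The dimension vectors and metadata add only a few kilobytes on top of this.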

import h5py
import numpy as np
import datetime

EMD files are based on the HDF5 format, so we import the h5py package, which provides the python interface to the HDF5 library. Further details on this package and how to install it can be found on the official website. h5py uses numpy arrays to handle the data contained in an HDF5 file, so the numpy package is imported as well. We will also use it to create our random test data. The datetime package is imported to create a timestamp for the comments metadata.

# create file
f = h5py.File('test.emd', 'w')

To create our EMD file, we let h5py create a new HDF5 file named test.emd. The 'w' parameter opens the file in write mode, which overwrites any existing file of the same name. We save the reference to the root group in the variable f.
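As an alternative to keeping the file handle open until an explicit close, h5py file objects can be used as context managers. The following is a minimal sketch of that pattern, not the way the finished script is organized; the scratch directory is an assumption made so that nothing in the working directory is overwritten.

```python
import os
import tempfile

import h5py

# hypothetical location: a scratch directory created just for this example
path = os.path.join(tempfile.mkdtemp(), 'test.emd')

with h5py.File(path, 'w') as f:
    # inside the block, f is the root group of the new file
    root_name = f.name

# at this point the file has been closed, even if the block had raised an error
print(root_name, h5py.is_hdf5(path))
```

With this pattern the file is flushed and closed automatically when the block ends, so a crash halfway through the script is less likely to leave a half-written file open.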

# set version information
f.attrs['version_major'] = 0
f.attrs['version_minor'] = 2

To let the reader know which specification the data contained in this file follows, we set the version information as attributes of the root group. Note that every group and dataset in an HDF5 file interfaced by h5py has a property called attrs, which exposes the HDF5 attributes of this group or dataset through a dict-like interface.
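To illustrate the dict-like behaviour of attrs, here is a small sketch that writes the version attributes and reads them back. It assumes an in-memory file (h5py's 'core' driver with backing_store=False) so that no file is left on disk; the file name attrs_demo.emd is only a placeholder.

```python
import h5py

# in-memory file: nothing is written to disk when backing_store is False
with h5py.File('attrs_demo.emd', 'w', driver='core', backing_store=False) as f:
    f.attrs['version_major'] = 0
    f.attrs['version_minor'] = 2

    # attrs supports the usual mapping operations
    keys = sorted(f.attrs.keys())
    has_major = 'version_major' in f.attrs
    version = (int(f.attrs['version_major']), int(f.attrs['version_minor']))

print(keys, has_major, version)
```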

# add a group
grp_exp = f.create_group('data')

Next we add a data group as a container for the datasets we are going to write to this EMD file. This is especially useful if you want to put multiple datasets in a single EMD file.

# add an emd type subgroup for the dataset
grp_dst = grp_exp.create_group('dataset_1')
grp_dst.attrs['emd_group_type'] = 1

Our dataset itself will be contained in another subgroup comprising the actual data and the dimension vectors. This group is given a meaningful identifier best describing the dataset (dataset_1 in our case). To make this group be recognized as an emd-type dataset, we add the attribute emd_group_type and set it to the integer value 1.
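The emd_group_type attribute is what lets a reader locate datasets without knowing their names in advance. As a rough sketch of how that could work, the hypothetical helper find_emd_groups below walks a file with h5py's visititems and collects every group carrying the attribute. The helper and the in-memory demo file are assumptions for illustration, not part of the EMD tools.

```python
import h5py


def find_emd_groups(h5file):
    """Return the paths of all groups tagged with emd_group_type == 1."""
    found = []

    def visitor(name, obj):
        # visititems calls this for every group and dataset below the root
        if isinstance(obj, h5py.Group) and obj.attrs.get('emd_group_type') == 1:
            found.append(name)

    h5file.visititems(visitor)
    return found


# small in-memory demo file mirroring the layout used in this post
with h5py.File('find_demo.emd', 'w', driver='core', backing_store=False) as f:
    grp_dst = f.create_group('data').create_group('dataset_1')
    grp_dst.attrs['emd_group_type'] = 1
    f.create_group('microscope')  # carries no emd_group_type, so it is skipped
    emd_groups = find_emd_groups(f)

print(emd_groups)
```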

# create a 3D dataset with random floats
data = grp_dst.create_dataset('data', (512,512,100), dtype='float')
data[:,:,:] = np.random.rand(512,512,100)

To this group we add the actual dataset using the create_dataset method. Its parameters are the label of the dataset, which has to be data according to the EMD specification, the shape of the dataset and its datatype. Here we create a three-dimensional 512x512x100 datacube of float values. To write to this dataset, we use numpy-style indexing on the returned handle. In this example we generate the random floats with the random.rand() function from numpy and assign them to our EMD dataset.
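When the array already exists in memory, create_dataset can also take it through its data parameter, in which case shape and datatype are inferred from the array. The sketch below shows this variant on a deliberately small 8x8x4 cube in an in-memory file; both choices are assumptions made to keep the example light.

```python
import numpy as np
import h5py

# small stand-in for the 512x512x100 datacube
values = np.random.rand(8, 8, 4)

with h5py.File('data_demo.emd', 'w', driver='core', backing_store=False) as f:
    grp_dst = f.create_group('data').create_group('dataset_1')
    grp_dst.attrs['emd_group_type'] = 1

    # shape and dtype are taken from the numpy array
    data = grp_dst.create_dataset('data', data=values)

    stored_shape = data.shape
    stored_dtype = data.dtype
    roundtrip_ok = np.array_equal(data[...], values)

print(stored_shape, stored_dtype, roundtrip_ok)
```

Creating the empty dataset first, as done in this post, is the more flexible route when the data arrives piece by piece and is written slice by slice.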

# add dimension vectors
dim1 = grp_dst.create_dataset('dim1', (512,1), dtype='int')
dim1[:,0] = np.arange(512)
dim1.attrs['name'] = np.string_('x')
dim1.attrs['units'] = np.string_('[px]')

In addition to the actual data we need to supply the dimensions for each axis in dim# datasets within the same group. Each of these holds the coordinate value of every element along its axis. The datasets are created in the EMD file in the same way we created the datacube. The first dimension in this example is filled with integers indicating the nth pixel in the x direction. Attributes are used to indicate the label for this axis (name) and the units used for the values. See also the recommendation for consistent unit description in the specifications. To save strings using the HDF5 library, np.string_() is used to convert each string to a fixed-width byte string. Note that the np.string_ alias was removed in NumPy 2.0; np.bytes_() is the equivalent and works in older versions as well.
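To make the string handling concrete, here is a short sketch that stores a units attribute as a fixed-width byte string and decodes it again after reading. It uses np.bytes_ rather than np.string_, since the latter no longer exists in NumPy 2.0 and later. The four-element dim1 dataset and the in-memory file are placeholders.

```python
import numpy as np
import h5py

with h5py.File('string_demo.emd', 'w', driver='core', backing_store=False) as f:
    dim1 = f.create_dataset('dim1', (4, 1), dtype='int')

    # store the text as a fixed-width byte string
    dim1.attrs['units'] = np.bytes_('[px]')

    # the attribute comes back as bytes and has to be decoded for display
    raw = dim1.attrs['units']
    units = raw.decode('ascii')

print(repr(raw), units)
```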

dim2 = grp_dst.create_dataset('dim2', (512,1), dtype='int')
dim2[:,0] = np.arange(512)
dim2.attrs['name'] = np.string_('y')
dim2.attrs['units'] = np.string_('[px]')

dim3 = grp_dst.create_dataset('dim3', (100,1), dtype='float')
dim3[:,0] = np.linspace(0.0, 3.14, num=100)
dim3.attrs['name'] = np.string_('angle')
dim3.attrs['units'] = np.string_('[rad]')

Dimension vectors have to be provided for each dimension of the original dataset. The above code creates the dim2 and dim3 datasets analogously to the dim1 dataset. The dim2 dataset contains integers indicating the nth pixel in the y direction, similar to dim1. dim3 contains float values describing a fictional angular dimension running from 0 to 3.14, i.e. approximately pi.
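Since the three dimension vectors are built the same way, the repeated lines can be folded into a small function. The helper add_dim below is hypothetical and only restates the steps from this post; it uses np.bytes_ for the strings and is demonstrated on a small in-memory file.

```python
import numpy as np
import h5py


def add_dim(group, index, values, name, units):
    """Write one dimension vector as dim<index> with name and units attributes."""
    values = np.asarray(values)
    # same (n, 1) layout as the dim# datasets created in this post
    dim = group.create_dataset('dim%d' % index, (values.size, 1), dtype=values.dtype)
    dim[:, 0] = values
    dim.attrs['name'] = np.bytes_(name)
    dim.attrs['units'] = np.bytes_(units)
    return dim


with h5py.File('dims_demo.emd', 'w', driver='core', backing_store=False) as f:
    grp_dst = f.create_group('data').create_group('dataset_1')
    add_dim(grp_dst, 1, np.arange(6), 'x', '[px]')
    add_dim(grp_dst, 2, np.arange(4), 'y', '[px]')
    add_dim(grp_dst, 3, np.linspace(0.0, np.pi, num=3), 'angle', '[rad]')

    dim_names = sorted(grp_dst.keys())
    dim3_values = grp_dst['dim3'][:, 0]
    dim3_name = grp_dst['dim3'].attrs['name']

print(dim_names, dim3_values, dim3_name)
```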

The following groups and attributes are not required for a valid EMD file. However, it is good practice to use them to supply metadata in a standardized way, which facilitates the exchange of scientific datasets. After all, this is what the EMD file format is all about.

# create microscope group for metadata
grp_mic = f.create_group('microscope')
grp_mic.attrs['magnification'] = 10

The microscope group is the recommended place to store the experimental settings of the microscope used to acquire the saved dataset. Each setting is stored as a separate attribute of this group, shown here for a fictional magnification of 10x.

# create user group for user info
grp_usr = f.create_group('user')
grp_usr.attrs['operator'] = np.string_('me')
grp_usr.attrs['email'] = np.string_('me@mine')

The user group is meant to contain information about the operator of the microscope. It should include contact details of the person to ask about the experiment or simulation whose results are provided in the EMD file.

# create sample group for information on sample
grp_spl = f.create_group('sample')
grp_spl.attrs['material'] = np.string_('random')

The sample group should contain information about the sample, e.g. a unique identifier and details on the material and the preparation method.
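The microscope, user and sample groups all follow the same pattern: one group, one attribute per piece of metadata. As a sketch of how this could be driven from plain dictionaries, the hypothetical helper write_metadata below creates the group if needed and converts text values to byte strings. It assumes ASCII-only strings; the values are the fictional ones from this post.

```python
import numpy as np
import h5py


def write_metadata(h5file, group_name, metadata):
    """Store each dictionary entry as an attribute of the named group."""
    grp = h5file.require_group(group_name)  # creates the group if it is missing
    for key, value in metadata.items():
        if isinstance(value, str):
            # assumption: text is ASCII and stored as a fixed-width byte string
            value = np.bytes_(value)
        grp.attrs[key] = value
    return grp


with h5py.File('meta_demo.emd', 'w', driver='core', backing_store=False) as f:
    write_metadata(f, 'microscope', {'magnification': 10})
    write_metadata(f, 'user', {'operator': 'me', 'email': 'me@mine'})
    write_metadata(f, 'sample', {'material': 'random'})

    group_names = sorted(f.keys())
    operator = f['user'].attrs['operator']
    magnification = int(f['microscope'].attrs['magnification'])

print(group_names, operator, magnification)
```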

# create comments group for log
grp_com = f.create_group('comments')

# add a comment on file creation with the current timestamp
timestamp = datetime.datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S (UTC)')
grp_com.attrs[timestamp] = np.string_('file created, filled with random numbers')

The comments group is recommended to hold timestamped information on the history of the EMD file. An attribute should be added on every change made to the file. Here we retrieve the current timestamp using the datetime package and use it to add a comment on the creation of this file.
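Because a comment should be added on every change, it can be convenient to wrap this step in a function. The helper add_comment below is a hypothetical sketch. It builds the timestamp from a timezone-aware datetime, since datetime.utcnow() is deprecated as of Python 3.12, and it uses np.bytes_ for the text.

```python
import datetime

import numpy as np
import h5py


def add_comment(h5file, text):
    """Add a UTC-timestamped comment to the comments group and return the key."""
    grp_com = h5file.require_group('comments')  # created on first use
    now = datetime.datetime.now(datetime.timezone.utc)
    timestamp = now.strftime('%Y-%m-%d %H:%M:%S (UTC)')
    grp_com.attrs[timestamp] = np.bytes_(text)
    return timestamp


with h5py.File('comment_demo.emd', 'w', driver='core', backing_store=False) as f:
    stamp = add_comment(f, 'file created, filled with random numbers')
    n_comments = len(f['comments'].attrs)
    comment = f['comments'].attrs[stamp]

print(stamp, comment)
```

Note that the timestamp has a resolution of one second, so two comments written within the same second would share a key and the later one would overwrite the earlier one.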

# close the file
f.close()

Finally we close the EMD file. Congratulations, you have written your first EMD file using python!
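Before sharing a file, a quick structural check can catch a forgotten dimension vector. The sketch below tests only the rules described in this post (the emd_group_type attribute, a data dataset, and one dim# dataset per axis with matching length); the full specification may require more. The function check_emd_group is hypothetical and is demonstrated on two small in-memory groups.

```python
import numpy as np
import h5py


def check_emd_group(group):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if group.attrs.get('emd_group_type') != 1:
        problems.append('emd_group_type is not 1')
    if 'data' not in group:
        problems.append('missing data')
        return problems
    # every axis of the data needs a dim# dataset of matching length
    for axis, length in enumerate(group['data'].shape, start=1):
        name = 'dim%d' % axis
        if name not in group:
            problems.append('missing ' + name)
        elif group[name].shape[0] != length:
            problems.append(name + ' has the wrong length')
    return problems


with h5py.File('check_demo.emd', 'w', driver='core', backing_store=False) as f:
    good = f.create_group('data/good')
    good.attrs['emd_group_type'] = 1
    good.create_dataset('data', data=np.zeros((3, 2)))
    good.create_dataset('dim1', data=np.arange(3).reshape(3, 1))
    good.create_dataset('dim2', data=np.arange(2).reshape(2, 1))

    bad = f.create_group('data/bad')
    bad.attrs['emd_group_type'] = 1
    bad.create_dataset('data', data=np.zeros((3, 2)))
    bad.create_dataset('dim1', data=np.arange(3).reshape(3, 1))

    problems_good = check_emd_group(good)
    problems_bad = check_emd_group(bad)

print(problems_good, problems_bad)
```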

Make sure to play around with this file using the EMD Viewer, or try to read it using this post. A number of common pitfalls encountered when working with EMD files using the h5py package in python have been compiled here.