File Format

1. Background

To date, most electron microscopy data is stored in either raw data formats (binary, bitmap images, tiff, etc.) or proprietary formats developed by vendors (dm3, emispec, etc.). We aim to develop a completely open file format flexible enough to store any possible type of electron microscopy data, while also allowing metadata of any type to be included.

2. HDF5 Format

HDF stands for Hierarchical Data Format. It was originally introduced in 1988 by the National Center for Supercomputing Applications at the University of Illinois, USA. The task force responsible for developing the format was spun off in 2006 into a non-profit corporation called the HDF Group, which currently maintains and develops the format specification from the University of Illinois Research Park in Champaign, Illinois. For more information about the HDF Group and the history of HDF, see this page.

The HDF Group has released several APIs for reading and writing HDF files. These include both high- and low-level APIs for workhorse programming languages such as Fortran, C and C++, as well as for analysis platforms such as MATLAB and IDL. Binaries, source code, and documentation for the current version of HDF are all available on the HDF5 website.

3. HDF5 File Contents

Each HDF5 file contains three different component types: groups, datasets and attributes. The highest-level group is called the “root group.” A group can contain other groups as well as two types of members: attributes and datasets. The primary difference between attributes and datasets is the size of the stored information. Generally, attributes should only be used to store small pieces of information such as a single number or string. Datasets may contain any amount of information and are optimized for larger amounts of data. Note that attributes may also belong directly to datasets.

All diagrams shown in this document will be drawn in a tree view starting from the root. Each element type has a unique colour and child groups or attributes will be drawn directly attached to their parent elements. The diagram to the right shows a typical layout of an HDF file containing multiple groups, datasets and attributes.
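The three component types can be illustrated with a short Python sketch. It assumes the third-party h5py and NumPy packages, which this document does not otherwise mention; the file, group, dataset and attribute names are hypothetical, and the file lives only in memory.

```python
import h5py
import numpy as np

# In-memory HDF5 file: driver="core" with backing_store=False writes nothing to disk.
with h5py.File("example.h5", "w", driver="core", backing_store=False) as f:
    # The File object doubles as the root group "/".
    experiment = f.create_group("experiment")            # a group inside the root group
    image = experiment.create_dataset(
        "image", data=np.zeros((4, 4))                   # a dataset: bulk numerical data
    )
    experiment.attrs["operator"] = "jane"                # an attribute attached to a group
    image.attrs["exposure_s"] = 0.5                      # an attribute attached to a dataset

    # Read a few things back while the file is still open.
    root_members = list(f.keys())
    image_shape = image.shape
    exposure = float(image.attrs["exposure_s"])
```

Here the attributes hold single small values, while the array goes into a dataset, matching the guideline above.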

4. EMD File Contents – v0.2

EMD files are always valid HDF5 files. However, scientific data in an EMD file must be stored in a specific manner. Each N-dimensional dataset is stored in an EMD data group, which is identified by an attribute named “emd_group_type” with an integer value of 1. There are no restrictions on the name of this group. Any number of EMD data groups may be placed anywhere in the EMD file. However, we do not recommend placing EMD data groups directly in the root group of an EMD file; rather, they should be placed inside a group with a descriptive name.

Each valid EMD data group typically contains multiple datasets. The first (and only absolutely required) is the data itself, which is stored in a dataset named “data” with N dimensions. Inside the EMD data group, each dimension must also have an associated dataset named dim#, where # ranges from 1 to N. The values of these datasets correspond to the coordinates along their corresponding dimension. For example, the x dimension of a 1024 × 1024 pixel micrograph with a pixel size of 0.02 nm would have dim1 values of [0, 0.02, 0.04, … 20.46].

In addition to the coordinates, each dim# dataset should also contain two attributes: name and units. Without these attributes, the viewer will still function, but data processing routines may fail or produce incorrect results. These attributes correspond to the name of the dimension vector and the units it is specified in. For example, name = x and units = [n_m] are common values for dim1. You may also specify a “name” and “units” for the data group that stores the scientific dataset. The viewer will display these values below the histogram / colorbar panel.

If the dim# vectors have a constant step size (linearly spaced coordinates), you may remove all entries except for the first two values. To be consistent with the above description, these values will therefore be equal to [offset, offset+step]. The EMD viewer will assume linear steps and extrapolate.
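The extrapolation from a two-value dim# vector can be sketched in plain Python. This is only an illustration of the rule described above, not the EMD viewer's actual code; the function name expand_dim is hypothetical.

```python
def expand_dim(stored, length):
    """Return all `length` coordinates for one dimension.

    `stored` is either the full coordinate vector or the shortened
    two-value form [offset, offset + step] for linearly spaced axes.
    """
    if len(stored) == length:
        return list(stored)                      # already stored in full
    if len(stored) != 2:
        raise ValueError("expected all coordinates or [offset, offset + step]")
    offset = stored[0]
    step = stored[1] - stored[0]
    return [offset + i * step for i in range(length)]


# The 1024-pixel axis with a 0.02 nm pixel size from the example above.
dim1 = expand_dim([0.0, 0.02], 1024)
```

The last coordinate is offset + 1023 × step, which reproduces the final value of 20.46 quoted earlier (up to floating-point rounding).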

The only other requirement of the EMD specification is to attach the “version_major” and “version_minor” attributes to the root group. The version attributes allow a processing routine to check which features of EMD are expected to be implemented. A minimal EMD file for the current specification version (0.2) with a single N-dimensional dataset therefore looks like the diagram drawn to the right. This minimal example contains a single EMD group, consisting of an N-dimensional dataset.

Without the “name” and “units” attributes, the EMD viewer program can still parse the file, and these fields will default to numerical dimensions and pixels respectively. The “version_major” and “version_minor” attributes are important because they let programs validate which version of the EMD specification a file uses before processing the data.
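The requirements of this section can be pulled together in a short sketch that writes a minimal version 0.2 file and then locates its EMD data groups. It assumes the third-party h5py and NumPy packages; the helper names, the group names "experiment" and "micrograph", and the in-memory file are all hypothetical choices for illustration.

```python
import h5py
import numpy as np


def write_minimal_emd(f, data, pixel_size_nm):
    """Write one 2-D EMD data group into an open h5py file (sketch only)."""
    f.attrs["version_major"] = 0                 # required on the root group
    f.attrs["version_minor"] = 2
    parent = f.create_group("experiment")        # descriptive parent, not the root
    grp = parent.create_group("micrograph")      # the group name itself is unrestricted
    grp.attrs["emd_group_type"] = 1              # marks this as an EMD data group
    grp.create_dataset("data", data=data)        # the N-dimensional data itself
    for axis, name in enumerate(("x", "y"), start=1):
        # Shortened two-value form [offset, offset + step] for linear axes.
        dim = grp.create_dataset(f"dim{axis}", data=np.array([0.0, pixel_size_nm]))
        dim.attrs["name"] = name
        dim.attrs["units"] = "[n_m]"
    return grp


def find_emd_groups(f):
    """Return the paths of all groups carrying emd_group_type == 1."""
    found = []

    def visit(name, obj):
        if isinstance(obj, h5py.Group) and obj.attrs.get("emd_group_type") == 1:
            found.append(name)

    f.visititems(visit)                          # walks every object below the root
    return found


with h5py.File("minimal.emd", "w", driver="core", backing_store=False) as f:
    write_minimal_emd(f, np.zeros((8, 8), dtype=np.uint16), 0.02)
    emd_groups = find_emd_groups(f)
    version = (int(f.attrs["version_major"]), int(f.attrs["version_minor"]))
    dim1_units = f["experiment/micrograph/dim1"].attrs["units"]
```

A reader would typically check the version tuple first and only then interpret the EMD data groups it finds.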

5. Metadata Tags and Naming Conventions

One of the primary benefits of HDF5 is its ability to include an arbitrary amount of metadata, of arbitrary types. Large arrays of values should be stored in datasets, and single values or small arrays can be stored in attributes. We do not place any restrictions on including additional data in EMD files. Extra datasets and attributes can be freely added at any level, even to EMD data groups.

However, we do make recommendations on how to store experimental metadata.  These guidelines are intended to make it easier for different applications that use the EMD format to communicate necessary values, such as microscope parameters.  More standards may be added in the future as data processing needs evolve. Metadata in the EMD format is typically divided into various top-level groups. Each group can contain any number of associated attributes, as well as additional sub-groups. The currently recommended top-level metadata groups for electron microscopy data stored in EMD format are:

A. Microscope

This group contains all of the relevant microscope (and therefore imaging) settings. These include parameters such as accelerating voltage, pixel size / magnification, beam current, and many others. Long lists of closely related parameters can be further lumped into subgroups. For example, the hardware corrector of an aberration-corrected TEM records many aberration parameters. Each of the (many) aberration parameters it can measure has a magnitude, angle and unit. Rather than clutter up the microscope group, we recommend placing these parameters in a subgroup aberrations.

B. Sample

We have found it very useful to include metadata tags related to the sample being imaged. These tags could include the material being imaged, sample processing information, or even qualitative comments related to the sample quality. Recording this information during the experiment greatly aids quality assessments at a future date.

C. User

All information related to the microscope operator (or the person who simulated the data) can be stored here. Information such as the user’s name, institution, department and contact email is very useful for tracing the provenance of experimental data. It also aids in database applications for large volumes of microscopy data.

D. Comments

This is a special group that contains time-stamped information on the EMD dataset. In particular we recommend that whenever the data is modified, this information be appended to the comments group. This will prevent potential mixups between raw and filtered data.

Each of these groups is optional. However, using standard conventions for metadata information will make it much easier to develop software and algorithms that make use of the EMD format.
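The four recommended groups could be populated along the following lines. This is a sketch under several assumptions: it uses the third-party h5py package, every attribute name and value is made up for illustration, and storing each comment as an attribute named by its UTC timestamp is just one possible way to keep time-stamped entries, not something this document prescribes.

```python
from datetime import datetime, timezone

import h5py


def add_comment(f, text):
    """Append a time-stamped note to the top-level comments group (sketch)."""
    comments = f.require_group("comments")       # created on first use
    stamp = datetime.now(timezone.utc).isoformat()
    comments.attrs[stamp] = text                 # attribute name = timestamp (assumption)
    return stamp


with h5py.File("metadata.emd", "w", driver="core", backing_store=False) as f:
    microscope = f.create_group("microscope")
    microscope.attrs["voltage"] = 300.0          # illustrative value
    microscope.attrs["voltage_units"] = "[k_V]"

    # Closely related parameters go in a subgroup instead of cluttering "microscope".
    aberrations = microscope.create_group("aberrations")
    aberrations.attrs["c3_magnitude"] = 1.2      # hypothetical parameter names
    aberrations.attrs["c3_angle"] = 0.0
    aberrations.attrs["c3_units"] = "[m_m]"

    sample = f.create_group("sample")
    sample.attrs["material"] = "SrTiO3"

    user = f.create_group("user")
    user.attrs["name"] = "Jane Doe"
    user.attrs["institution"] = "Example University"

    add_comment(f, "Applied Gaussian filter to data")

    top_level = sorted(f.keys())
    has_aberrations = "aberrations" in f["microscope"]
    n_comments = len(f["comments"].attrs)
```

All names follow the lowercase, underscore-separated convention, so that different programs can look up the same keys.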

We also have some general recommendations for EMD file contents and naming conventions. All group and attribute names are lowercase and, if possible, a single word. If multiple-word names are needed, the words are separated by underscore characters. Each unit is enclosed in its own pair of square brackets, with any prefix placed in front of the unit and separated from it by an underscore. Raising a unit to a power other than 1 is specified with the ^ character; inverse units are written with a negative exponent. These conventions make units easily machine readable. Some examples include:

Unit Type                       Unit Name                     EMD Unit
Length                          nanometers                    [n_m]
Electric potential              kilovolts                     [k_V]
Electron interaction parameter  radians per square nanometer  [rad][n_m^-2]
Force                           newtons                       [N] or [k_g][m][s^-2]
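Because the unit strings follow a fixed pattern, they can be parsed with a short regular expression. The sketch below uses only the Python standard library; the function name parse_units and the restriction to ASCII letters for prefixes and units are assumptions made for this illustration, not part of the EMD specification.

```python
import re

# One bracketed term: optional "prefix_", a unit, and an optional "^power".
_TERM = re.compile(r"\[(?:([A-Za-z]+)_)?([A-Za-z]+)(?:\^(-?\d+))?\]")


def parse_units(text):
    """Split an EMD unit string into (prefix, unit, power) tuples."""
    terms = []
    pos = 0
    for match in _TERM.finditer(text):
        if match.start() != pos:                 # stray characters between terms
            raise ValueError(f"malformed unit string: {text!r}")
        prefix, unit, power = match.groups()
        terms.append((prefix or "", unit, int(power) if power else 1))
        pos = match.end()
    if not terms or pos != len(text):            # nothing parsed, or trailing junk
        raise ValueError(f"malformed unit string: {text!r}")
    return terms


length = parse_units("[n_m]")
interaction = parse_units("[rad][n_m^-2]")
force = parse_units("[k_g][m][s^-2]")
```

For example, "[rad][n_m^-2]" yields the terms ("", "rad", 1) and ("n", "m", -2), while a string without brackets raises ValueError.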