1. Background
To date, most electron microscopy data is stored in either raw data formats (binary, bitmap images, tiff, etc.) or proprietary formats developed by vendors (dm3, emispec, etc.). We aim to develop a completely open file format flexible enough to store any possible type of electron microscopy data, while also allowing metadata of any type to be included.
2. HDF5 Format
HDF stands for hierarchical data format. It was originally introduced in 1988 by the National Center for Supercomputing Applications, at the University of Illinois, USA. The task force responsible for developing the HDF format was spun off to a non-profit corporation in 2004, called the HDF Group. They currently maintain and develop the format specification from the University of Illinois Research Park in Champaign, Illinois. For more information about the HDF Group and the history of HDF, see this page.
The HDF Group has released several APIs to read and write HDF files. These include both high- and low-level APIs for workhorse programming languages such as FORTRAN, C and C++, as well as analysis platforms such as MATLAB and IDL. Binaries, source code, and documentation for the current version of HDF are all available on the extensive HDF5 website.
3. HDF5 File Contents
Each HDF5 file contains three different component types: groups, datasets and attributes. The highest level group is called the “root group.” Groups can contain additional groups or two types of members: attributes and datasets. The primary difference between attributes and datasets is the length of the stored information. Generally, attributes should only be used to store small pieces of information such as a single number or string. Datasets may contain any amount of information and are optimized towards larger amounts of data. Note that attributes may also belong directly to datasets.
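As a minimal sketch of these three component types, the following uses h5py to build a small in-memory file. The group, dataset, and attribute names ("experiment", "image", "operator", "exposure_s") are hypothetical and chosen only for illustration.

```python
import h5py
import numpy as np

def build_demo_file():
    """Create an in-memory HDF5 file with one group, one dataset, and two attributes."""
    # driver="core" with backing_store=False keeps the file in memory only
    f = h5py.File("demo.h5", "w", driver="core", backing_store=False)
    grp = f.create_group("experiment")                   # a group under the root group
    dset = grp.create_dataset("image", data=np.zeros((4, 4)))  # a dataset for bulk data
    grp.attrs["operator"] = "example user"               # small attribute on a group
    dset.attrs["exposure_s"] = 0.5                       # small attribute on a dataset
    return f

f = build_demo_file()
print(f.name)                          # name of the root group
print(list(f["experiment"].keys()))    # members of the "experiment" group
f.close()
```

Attributes are accessed through the `.attrs` mapping of whichever group or dataset they belong to, which reflects the fact that attributes may be attached to either.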
In this document we’ll diagram HDF5 files as shown below, with a unique color for each of the HDF5 file components, and with lines connecting parent groups and datasets to their contained child groups, datasets and attributes.

4. EMD v0.1 (original 2012 format)
EMD files are always valid HDF5 files. However, scientific data in an EMD file must be stored in a specific manner. EMD v0.1 defines a specification for storing any number of N-dimensional arrays, each with a set of N 1D vectors which calibrate its axes.
Each N-dimensional dataset is stored in an EMD data group, which is marked as an EMD data group by having an attribute with the name “emd_group_type” that has an integer value of 1. There are no restrictions on the name of this group. Any number of EMD data groups may be placed anywhere in the EMD file. However, we do not recommend placing EMD data groups directly inside the root group of an EMD file; rather they should be placed inside a group with a descriptive name.
Each valid EMD data group contains multiple datasets. The first is the data itself, corresponding to a dataset named “data” and containing an N dimensional array. The EMD data group additionally contains N datasets named “dim#”, where # ranges from 1 to N. The values of these datasets correspond to the coordinates along their corresponding dimension. For example the x dimension of a 1024^2 pixel micrograph with a pixel size of 0.02 nm would have dim1 values of [0, 0.02, 0.04, … 20.46]. The dim# datasets should each contain two attributes, name and units, which should be UTF-8 encoded strings, and correspond to the name of the dimension calibrated by this vector and the units it is specified in. For example name = “x” and units = “n_m” are common values for dim1. You may also specify a “name” and “units” for the data group that stores the scientific dataset.

Linear dim vectors may be compressed to their first two entries: if the dim# vectors have a constant step size, you may remove all entries except for the first two values. To be consistent with the above description, these values will therefore be equal to [offset offset+step].
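The compression rule above can be sketched with two small numpy helpers. The function names `compress_dim` and `expand_dim` and the tolerance used to decide whether a step is constant are assumptions of this sketch, not part of the specification.

```python
import numpy as np

def compress_dim(dim, rtol=1e-9):
    """Return the first two entries if the vector has a constant step, else the full vector."""
    dim = np.asarray(dim, dtype=float)
    if dim.size > 2:
        steps = np.diff(dim)
        # the relative tolerance absorbs floating point rounding in the steps
        if np.allclose(steps, steps[0], rtol=rtol, atol=0.0):
            return dim[:2].copy()
    return dim

def expand_dim(dim, length):
    """Rebuild a full calibration vector from a two-entry [offset, offset + step] vector."""
    dim = np.asarray(dim, dtype=float)
    if dim.size == length:
        return dim                      # already a full vector
    if dim.size != 2:
        raise ValueError("expected a full vector or a two-entry vector")
    offset, step = dim[0], dim[1] - dim[0]
    return offset + step * np.arange(length)

full = 0.02 * np.arange(1024)           # the 1024 pixel, 0.02 nm example
short = compress_dim(full)              # two entries: offset and offset + step
restored = expand_dim(short, 1024)
```

A reader needs the length of the corresponding axis of the “data” array to expand a two-entry vector, which is why `expand_dim` takes `length` as an argument.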
The only other requirement of the EMD v0.1 specification is to attach the “version_major” and “version_minor” attributes to the root group of the HDF5 file. A minimal EMD version 0.1 file with a single N-dimensional dataset therefore looks like the diagram drawn to the right.
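A minimal EMD v0.1 file of this kind might be written with h5py roughly as follows. This is a sketch under several assumptions: the group path "data/micrograph" is hypothetical, the version values 0 and 1 are inferred from the "v0.1" label, and the mapping of dim1 and dim2 to array axes 0 and 1 is chosen for illustration.

```python
import h5py
import numpy as np

def write_emd_v01(f, image, pixel_size_nm):
    """Write one 2D dataset as an EMD v0.1 data group inside an open h5py.File."""
    # version attributes on the root group (values assumed from the "v0.1" label)
    f.attrs["version_major"] = 0
    f.attrs["version_minor"] = 1
    # a descriptively named parent group, then the EMD data group itself
    grp = f.create_group("data/micrograph")      # hypothetical names
    grp.attrs["emd_group_type"] = 1               # marks this group as an EMD data group
    grp.create_dataset("data", data=image)
    for number, label in enumerate(("x", "y"), start=1):
        n = image.shape[number - 1]
        dim = grp.create_dataset(f"dim{number}", data=pixel_size_nm * np.arange(n))
        dim.attrs["name"] = label
        dim.attrs["units"] = "n_m"
    return grp

f = h5py.File("emd_v01.h5", "w", driver="core", backing_store=False)  # in memory only
group = write_emd_v01(f, np.zeros((8, 16)), pixel_size_nm=0.02)
print(sorted(group.keys()))
f.close()
```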
5. EMD v1.0 (2023 iteration)
The EMD 1.0 specification builds on the initial EMD 0.1 spec, described above. The image below summarizes the updated format. This diagram depicts one abstraction level higher than the HDF5 component diagrams shown in the previous two sections: each of the rounded boxes in the diagram represents some collection of HDF5 groups, datasets and attributes. These components can be mixed and matched in simple or complex combinations to make EMD 1.0 files.

An EMD 1.0 file has a header and any number of EMD trees.
An EMD tree is a directory-like tree with a root and any number of nodes.
A node contains any number of metadata groups and one data group.
Metadata groups hold any number of string named blocks of data of a variety of types.
Data groups hold one block of data, plus some self-descriptive metadata.
In the sections that follow, we detail how each of these abstractions – the EMD 1.0 header, tree, root, node, metadata, and data – map to HDF5 components.
Header
The header corresponds to the HDF5 file root and a set of attached attributes. There are three required attributes and three optional attributes. The required attributes are
- attr: “emd_group_type” = “file”
- attr: “version_major” = 1
- attr: “version_minor” = 0
where ‘attr:’ specifies an HDF5 attribute, with its name on the left and its value on the right. The optional attributes are
- attr: “UUID”
- attr: “authoring_user”
- attr: “authoring_program”
Under the header are any number of EMD trees.
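The header attributes listed above could be written and checked with h5py along the following lines. The helper names `write_header` and `has_emd_1_header` are hypothetical, and the use of a UUID4 string for the “UUID” attribute is an assumption, since the value format is not stated here.

```python
import uuid
import h5py

def write_header(f, user=None, program=None):
    """Attach EMD 1.0 header attributes to the root group of an open h5py.File."""
    # required attributes
    f.attrs["emd_group_type"] = "file"
    f.attrs["version_major"] = 1
    f.attrs["version_minor"] = 0
    # optional attributes; the UUID string format is an assumption of this sketch
    f.attrs["UUID"] = str(uuid.uuid4())
    if user is not None:
        f.attrs["authoring_user"] = user
    if program is not None:
        f.attrs["authoring_program"] = program

def has_emd_1_header(f):
    """Return True if the three required attributes are present with the expected values."""
    attrs = f.attrs
    return bool(
        attrs.get("emd_group_type") == "file"
        and attrs.get("version_major") == 1
        and attrs.get("version_minor") == 0
    )

f = h5py.File("header.h5", "w", driver="core", backing_store=False)  # in memory only
write_header(f, user="example user", program="example script")
print(has_emd_1_header(f))
f.close()
```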

EMD Tree
An EMD tree is a directory-like tree: it consists of a root node, a set of downstream nodes, and a set of paths connecting them.

Root
An EMD root is an HDF5 group with any name which must have
- attr: “emd_group_type” = “root”
and which optionally may have
- any number of HDF5 groups corresponding to EMD nodes
- an HDF5 group named “metadatabundle” which contains any number of HDF5 groups corresponding to EMD metadata groups.
An EMD root must always be, and can only ever be, at the root position of an EMD tree. Roots live directly under the header.
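As an illustrative sketch, roots can be created and then located by scanning the groups directly under the file root for the “root” marker attribute. The helper names `add_root` and `find_roots` and the group names used below are hypothetical.

```python
import h5py

def add_root(f, name):
    """Create an EMD root group directly under the file root (the header)."""
    root = f.create_group(name)
    root.attrs["emd_group_type"] = "root"
    root.create_group("metadatabundle")     # optional container for metadata groups
    return root

def find_roots(f):
    """Return the names of all top-level groups marked as EMD roots."""
    return [
        name
        for name, obj in f.items()
        if isinstance(obj, h5py.Group)
        and obj.attrs.get("emd_group_type") == "root"
    ]

f = h5py.File("roots.h5", "w", driver="core", backing_store=False)  # in memory only
add_root(f, "scan_a")
add_root(f, "scan_b")
f.create_group("unrelated")                 # a plain HDF5 group, not an EMD root
print(find_roots(f))
f.close()
```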
Node
An EMD node is an HDF5 group with any name which must have
- attr: “emd_group_type”
- an EMD data group, consisting of attributes, datasets, and groups defined by the specifications below and corresponding to the node’s “emd_group_type” value
and which may optionally have
- any number of HDF5 groups corresponding to EMD nodes
- an HDF5 group named “metadatabundle” which contains any number of HDF5 groups corresponding to EMD metadata groups.
- attr: “python_class”
The generic EMD node structure is shown below.

Valid “emd_group_type” values for EMD tree nodes are “root”, “node”, “array”, “pointlist”, “pointlistarray”, and “custom”. Not all EMD tree nodes contain a data group.
The group types which include a data group are “array”, “pointlist”, “pointlistarray”, and “custom”: a node with any of these “emd_group_type” values has a data group.
The group types which do not include a data group are “node” and “root”: a node with either of these “emd_group_type” values contains no EMD data group. Note that the term “node” may refer either to the generic structure above, where “emd_group_type” is any of the valid values, or to the specific EMD group type “emd_group_type” = “node”, which has no data group and is not a root group.
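The distinction between group types with and without a data group can be captured in a small lookup. The function name `has_data_group` and the choice to raise on unknown values are assumptions of this sketch.

```python
# group types that carry an EMD data group
DATA_TYPES = frozenset({"array", "pointlist", "pointlistarray", "custom"})
# group types that carry no data group
STRUCTURAL_TYPES = frozenset({"root", "node"})

def has_data_group(emd_group_type):
    """Return True if a node with this "emd_group_type" value includes a data group."""
    if emd_group_type in DATA_TYPES:
        return True
    if emd_group_type in STRUCTURAL_TYPES:
        return False
    raise ValueError(f"not a valid EMD tree node type: {emd_group_type!r}")

for group_type in ("root", "node", "array", "pointlistarray"):
    print(group_type, has_data_group(group_type))
```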

This image summarizes the contents of metadata and data groups by their EMD group types. Metadata groups are like dictionaries. Node and root groups are the basic elements of EMD trees, and may each carry any number of metadata groups. Arrays, pointlists, pointlistarrays, and custom EMD groups are like node groups with some additional data block attached.
Array
An array data block contains array-like data. It must have
- attr: “emd_group_type” = “array”
- dset: “data” = N dimensional array
where dset: “data” has an attr: “units” equal to a string. For N dimensional data, it must also have
- dset: “dim#”
for each # in (1, …, N), where the data itself are 1D vectors as described in section 4 above, and with
- attr: “dim_units”
- attr: “dim_name”
attached to each dim# dataset.
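An array node following the list above might be written with h5py as in this sketch. The helper name `write_array_node` and its argument layout are hypothetical; the attribute names on the dim# datasets are taken from the list above as written.

```python
import h5py
import numpy as np

def write_array_node(parent, name, data, units, dims):
    """Write an EMD array node.

    dims is a list of (vector, dim_name, dim_units) tuples, one per axis of data.
    """
    data = np.asarray(data)
    if len(dims) != data.ndim:
        raise ValueError("one dimension vector is required per array axis")
    grp = parent.create_group(name)
    grp.attrs["emd_group_type"] = "array"
    dset = grp.create_dataset("data", data=data)
    dset.attrs["units"] = units                    # units of the array values
    for number, (vector, dim_name, dim_units) in enumerate(dims, start=1):
        dim = grp.create_dataset(f"dim{number}", data=np.asarray(vector))
        dim.attrs["dim_name"] = dim_name
        dim.attrs["dim_units"] = dim_units
    return grp

f = h5py.File("array.h5", "w", driver="core", backing_store=False)  # in memory only
node = write_array_node(
    f, "image", np.zeros((4, 6)), "counts",
    [(0.02 * np.arange(4), "x", "nm"), (0.02 * np.arange(6), "y", "nm")],
)
print(sorted(node.keys()))
f.close()
```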

This specification describes the standard array group, in which an N dimensional array is calibrated by its N dimension vectors. EMD additionally allows a slight variant on this specification, in which an N+1 dimensional array has its first N dimensions calibrated by N dimension vectors, and its final dimension is indexed by string labels. This is meant to represent multiple named N-dimensional arrays which all share a single set of dimension vectors. In this case, the structure above is modified slightly: the final dimension vector must have its
- attr: “name” = “_labels_”
and values corresponding to the array string labels. In this case the final dimension vector does not have an attr: “units”. Arrays of this variant are referred to as stack arrays.

Pointlist
A pointlist data block contains a set of N points in some M dimensional space, with string names associated with each of the M dimensions. It must have
- attr: “emd_group_type” = “pointlist”
- dset: “data” = N dimensional array
and for each # in (1,…,M), a dataset
- dset: (field # name) = 1D array
which has an attr: “dtype” equal to a string specifying a valid numpy datatype.

PointListArray
A pointlistarray data block holds an N-D grid with pointlist-like data at each grid point. The pointlists all share a single data type. They may vary in length from one grid point to the next; accordingly, data structures of this sort are called ragged arrays. An (N+1) dimensional pointlistarray refers to a pointlistarray on an N-D grid, with the +1 indicating the final variable length dimension. The current version of the `emdfile` module reads and writes (2+1)D pointlistarrays, and this case will be used in the specification below. However, N may be any positive integer. If N≠2, the (X,Y) shape tuples found in the specification below should be appropriately modified.
Pointlistarrays use HDF5’s variable length datatypes, which enable ragged arrays. A pointlistarray must have a dataset
- dset: “data”
with a shape of (X,Y) for X,Y ∈ ℕ and dtype of any valid numpy datatype. For example using h5py, if `group` is an `h5py.Group` instance then the Python code
    dataset = group.create_dataset(
        "some_name",
        shape=(X, Y),
        dtype=h5py.special_dtype(vlen='uint16'),
    )
makes an (X,Y) shaped 2+1 dimensional ragged array of 16-bit unsigned integers.
A structured dtype of the form [(‘field#name’, field#dtype) for # in M] will make a pointlistarray in which the ragged axis represents points in M named fields, as in a pointlist. The Python code
    dataset = group.create_dataset(
        "some_name",
        shape=(X, Y),
        dtype=h5py.special_dtype(
            vlen=[('x', '<f8'), ('y', '<f8'), ('z', '<f8')]
        ),
    )
creates a dataset in which the ragged axis corresponds to points in a 3D space (‘x’, ‘y’, ‘z’), with the three coordinates each specified by a 64-bit float.
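A round trip through a ragged dataset of this kind might look like the following sketch, which uses the simpler 16-bit unsigned integer case. The helper name `write_ragged`, the group name, and the filler values are hypothetical; only `create_dataset` and `h5py.special_dtype(vlen=...)` come from the text above.

```python
import h5py
import numpy as np

def write_ragged(group, counts):
    """Write a (2+1)D ragged "data" dataset whose entry (i, j) holds counts[i][j] values."""
    counts = np.asarray(counts)
    dset = group.create_dataset(
        "data",
        shape=counts.shape,
        dtype=h5py.special_dtype(vlen=np.dtype("uint16")),
    )
    for i in range(counts.shape[0]):
        for j in range(counts.shape[1]):
            # each grid point receives its own variable-length 1D array
            dset[i, j] = np.arange(counts[i, j], dtype="uint16")
    return dset

f = h5py.File("ragged.h5", "w", driver="core", backing_store=False)  # in memory only
grp = f.create_group("points")                    # hypothetical node name
grp.attrs["emd_group_type"] = "pointlistarray"
dset = write_ragged(grp, [[1, 3], [2, 5]])
lengths = [[len(dset[i, j]) for j in range(2)] for i in range(2)]
print(lengths)
f.close()
```

Reading a single grid point with `dset[i, j]` returns a 1D numpy array whose length can differ from one grid point to the next.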

Metadata
Nodes of every EMD group type may contain any number of metadata groups. Nodes containing metadata must have a group called “metadatabundle”, which may contain any number of EMD metadata groups. EMD metadata groups are HDF5 groups that must contain
- attr: “emd_group_type” = “metadata”
and may contain an
- attr: “python_class”
and contains any number of pieces of metadata. A piece of metadata may be of either type I or type II. Type I pieces of metadata represent a single item, and type II pieces of metadata represent a set of N items which all share a datatype. Type I pieces of metadata are a
- dset: any name, data = value
with an
- attr: “type” = value
where the value parameter for both the data and type have several options, specified in the table that follows. Type II pieces of metadata are a
- grp: any name
with
- attr: “type” = value
- attr: “length” = N
and for each # in (1…N), a
- dset: “#” = value
where the value parameter for the type and datasets are specified in the table below.

Type I metadata table
“type” | dtype of dataset value |
“number” | number |
“bool” | boolean |
“string” | string |
“array” | numpy ndarray |
“None” | “_None”* |
“tuple” | tuple of numbers |
“list” | list of numbers |
Type II metadata table
“type” | dtype of dataset values |
“tuple_of_tuples” | tuples |
“tuple_of_arrays” | numpy ndarrays |
“tuple_of_strings” | strings |
“list_of_arrays” | numpy ndarrays |
“list_of_strings” | strings |
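The two kinds of metadata items can be sketched with h5py as follows, covering only a few rows of the tables above ("number", "bool", "array", and "tuple_of_arrays"). The helper name `write_metadata`, the dispatch on Python types, and the example keys are assumptions of this sketch and not the behavior of any particular library.

```python
import h5py
import numpy as np

def write_metadata(bundle, name, items):
    """Write a dict of items as an EMD metadata group inside a "metadatabundle" group."""
    grp = bundle.create_group(name)
    grp.attrs["emd_group_type"] = "metadata"
    for key, value in items.items():
        if isinstance(value, bool):                 # test bool first: bool subclasses int
            grp.create_dataset(key, data=value).attrs["type"] = "bool"
        elif isinstance(value, (int, float)):
            grp.create_dataset(key, data=value).attrs["type"] = "number"
        elif isinstance(value, np.ndarray):
            grp.create_dataset(key, data=value).attrs["type"] = "array"
        elif (isinstance(value, tuple) and value
              and all(isinstance(v, np.ndarray) for v in value)):
            # type II: a group holding N datasets named "1" ... "N"
            sub = grp.create_group(key)
            sub.attrs["type"] = "tuple_of_arrays"
            sub.attrs["length"] = len(value)
            for number, v in enumerate(value, start=1):
                sub.create_dataset(str(number), data=v)
        else:
            raise TypeError(f"unsupported metadata value for {key!r}")
    return grp

f = h5py.File("metadata.h5", "w", driver="core", backing_store=False)  # in memory only
bundle = f.create_group("metadatabundle")
meta = write_metadata(
    bundle, "calibration",
    {"voltage_kV": 300.0, "aligned": True, "shifts": (np.zeros(2), np.ones(3))},
)
print(sorted(meta.keys()))
f.close()
```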
Custom
The purpose of the Custom EMD group type is to enable composition of the other group types into a single object. A custom group must have
- attr: “emd_group_type” = “custom”
and may have any number of subgroups which have an
- attr: “emd_group_type” = “custom_*”
where * is in (“node”, “array”, “pointlist”, “pointlistarray”, “custom”). These subgroups contain the elements normally associated with their corresponding EMD group type, but are not considered independent nodes themselves; instead they are treated as components of the data block of their parent custom group. Note that “custom_custom” is a valid option, which enables nesting such that a single custom group may itself contain an arbitrary tree of object groups. Note also that the leading “custom_” strings do not stack: if a custom group contains some “custom_custom” subgroup, then children of this latter group in its data block should also be named “custom_*”, e.g. “custom_array” and not “custom_custom_array”.
Groups nested underneath a custom group with an attr: “emd_group_type” beginning with “custom_” are part of the custom node’s data block; groups nested under a custom group with an attr: “emd_group_type” which does not begin with “custom_” represent new, distinct nodes underneath the custom node. Because groups with attr: “emd_group_type” = “custom_*” are elements of the custom group’s data block and are not themselves nodes, they may not contain new nodes, i.e. they may not contain groups with an attr: “emd_group_type” which does not begin with “custom_”.
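These nesting rules can be checked mechanically. The sketch below works on a hypothetical in-memory representation in which each group is a dict with a "type" entry (its “emd_group_type” value) and an optional "children" dict; both the representation and the function name `find_violations` are assumptions made for illustration.

```python
BLOCK_TYPES = {
    "custom_" + t for t in ("node", "array", "pointlist", "pointlistarray", "custom")
}

def find_violations(group, path="", inside_block=False):
    """Return a list of messages for children that break the custom nesting rules."""
    errors = []
    for name, child in group.get("children", {}).items():
        child_path = f"{path}/{name}"
        child_type = child["type"]
        is_block = child_type.startswith("custom_")
        if is_block and child_type not in BLOCK_TYPES:
            # e.g. "custom_custom_array": the "custom_" prefix does not stack
            errors.append(f"{child_path}: invalid group type {child_type!r}")
        if inside_block and not is_block:
            # data block elements are not nodes, so they may not contain new nodes
            errors.append(f"{child_path}: new node inside a data block")
        errors.extend(find_violations(child, child_path, inside_block or is_block))
    return errors

custom = {
    "type": "custom",
    "children": {
        "signal": {"type": "custom_array"},
        "parts": {
            "type": "custom_custom",
            "children": {"mask": {"type": "custom_array"}},
        },
        "derived": {"type": "array"},   # a distinct node under the custom node
    },
}
print(find_violations(custom))
```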

From the perspective of a Python runtime environment, the custom group type enables classes which contain multiple blocks of data which are each instances of the other EMD groups. For instance, a single logical container might hold several arrays. This is useful in the context of code which defines its own data-containing classes which need to bundle several distinct pieces of data together. Classes inheriting from the `emdfile.Custom` class may define attributes which are assigned to other EMD group types, and these attributes will then be written and read all together with the custom object.
Python class
Any group with an “emd_group_type” attribute may optionally contain an
- attr: “python_class”
with some string value. The purpose of this attribute is to provide a hook for the `emdfile` reader to identify and load subclasses which inherit from the base classes.
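One way a reader could use such a hook is a name-to-class registry with a fallback to the base class. The following is a generic sketch of that idea only: the classes, the registry, and `resolve_class` are all hypothetical stand-ins and do not describe how the emdfile reader is actually implemented.

```python
class Array:
    """Stand-in for a base class tied to the "array" group type (hypothetical)."""

class DiffractionImage(Array):
    """Stand-in for a user-defined subclass of the base class (hypothetical)."""

# hypothetical lookup tables a reader might maintain
BASE_CLASSES = {"array": Array}
REGISTRY = {"DiffractionImage": DiffractionImage}

def resolve_class(emd_group_type, python_class=None):
    """Pick the class to instantiate for a group, falling back to the base class."""
    base = BASE_CLASSES[emd_group_type]
    candidate = REGISTRY.get(python_class)
    # only accept registered classes that really inherit from the base class
    if candidate is not None and issubclass(candidate, base):
        return candidate
    return base

print(resolve_class("array", "DiffractionImage").__name__)
print(resolve_class("array", "UnknownClass").__name__)
```

Falling back to the base class means a file written with an unrecognized “python_class” value can still be read as its underlying EMD group type.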
6. The `emdfile` package
The `emdfile` Python package contains read/write functions and a set of classes that mirror the EMD group types in the form of Python runtime objects. `emdfile` also mirrors EMD tree structures as relationships between those runtime objects: any object in a runtime tree may be accessed from any other object in the same tree. The syntax is summarized below.

More on the `emdfile` package can be found at https://github.com/py4dstem/emdfile.