File Format

1. Background

To date, most electron microscopy data is stored in either raw data formats (binary, bitmap images, tiff, etc.) or proprietary formats developed by vendors (dm3, emispec, etc.). We aim to develop a completely open file format flexible enough to store any possible type of electron microscopy data, while also allowing metadata of any type to be included.


2. HDF5 Format

HDF stands for Hierarchical Data Format. It was originally introduced in 1988 by the National Center for Supercomputing Applications (NCSA) at the University of Illinois, USA. The task force responsible for developing HDF was spun off in 2006 into a non-profit corporation called the HDF Group, which currently maintains and develops the format specification from the University of Illinois Research Park in Champaign, Illinois. For more information about the HDF Group and the history of HDF, see this page.

The HDF Group has released several APIs to read and write HDF files. These include both high- and low-level APIs for workhorse programming languages such as Fortran, C and C++, as well as analysis platforms such as MATLAB and IDL. Binaries, source code, and documentation for the current version of HDF are all available on the extensive HDF5 website.


3. HDF5 File Contents

[Figure: hdf5_example-01, a typical layout of an HDF file containing multiple groups, datasets and attributes]

Each HDF5 file contains three different component types: groups, datasets and attributes. The highest-level group is called the “root group.” Groups can contain additional groups or two types of members: attributes and datasets. The primary difference between attributes and datasets is the length of the stored information. Generally, attributes should only be used to store small pieces of information such as a single number or string. Datasets may contain any amount of information and are optimized for larger amounts of data. Note that attributes may also belong directly to datasets.

All diagrams shown in this document will be drawn in a tree view starting from the root. Each element type has a unique colour and child groups or attributes will be drawn directly attached to their parent elements. The diagram to the right shows a typical layout of an HDF file containing multiple groups, datasets and attributes.
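
As a minimal sketch of the three component types described above, the following Python snippet builds a small HDF5 tree with the h5py library. The use of h5py, the in-memory "core" driver, and every group, dataset and attribute name here are assumptions made for illustration; none of them are prescribed by HDF5 or EMD.

```python
import h5py
import numpy as np

# In-memory file: with backing_store=False nothing is written to disk.
# The file name is only a placeholder.
with h5py.File("example.h5", "w", driver="core", backing_store=False) as f:
    # The File object itself acts as the root group "/".
    f.attrs["description"] = "toy file"       # attribute on the root group

    grp = f.create_group("experiment")        # group inside the root group
    sub = grp.create_group("images")          # groups may be nested

    # Datasets hold the larger blocks of data.
    dset = sub.create_dataset("frame", data=np.zeros((4, 4), dtype=np.float32))

    # Attributes hold small values and may belong to groups or to datasets.
    grp.attrs["operator"] = "hypothetical user"
    dset.attrs["exposure_s"] = 0.5

    # Walk the tree from the root and record the type of every member.
    tree = {}
    f.visititems(lambda name, obj: tree.update({name: type(obj).__name__}))

    frame_shape = dset.shape
    exposure = float(dset.attrs["exposure_s"])
```

After the block runs, tree maps each path below the root (for example "experiment/images/frame") to either "Group" or "Dataset", which mirrors the tree view used in the diagrams.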


4. EMD File Contents – v0.7 (⚠️Section Under Construction⚠️)

High-level overview of the py4DSTEM EMD v0.7 tree structure. The structure of each data group is detailed below.

The py4DSTEM project, a suite of tools for processing four-dimensional scanning transmission electron microscopy (4D-STEM) data, has continued the development of the EMD file format to add several new data structures, including utilities for reading and saving related collections of images, multidimensional data with labeled slices, lists of electron counts, and other rich lists of positions and quantities. py4DSTEM-compatible EMD files have a special group structure, which is documented in the py4DSTEM repo in this document. py4DSTEM has also implemented a consistent system for reading and writing metadata to accompany the overall experiment as well as each individual data object in the EMD tree.


In py4DSTEM EMD files, a group at the top level with emd_group_type of 2 indicates a collection of 4D-STEM data. This group may have any name: files saved by py4DSTEM name this group “4DSTEM_experiment” while files produced by Prismatic name it “4DSTEM_simulation.” The version_major and version_minor attributes are used by py4DSTEM to determine the correct reading routine, as the details of the structure of each type of dataset have evolved. Inside the “4DSTEM_experiment” group are three groups: “data,” “log,” and “metadata.” The “data” group contains all raw and processed data. The “log” group is used to keep a detailed history of the reading, writing, and analysis performed in py4DSTEM. The “metadata” group stores all metadata for the file. In the following sections, the structure of each data type is described in detail.
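
The lookup described above can be sketched with h5py as follows. This is not the py4DSTEM reader: the helper find_4dstem_groups is hypothetical, and storing version_major and version_minor on the top-level group itself is an assumption made for this toy file.

```python
import h5py

def find_4dstem_groups(h5file):
    """Return the names of top-level groups whose emd_group_type equals 2."""
    names = []
    for name, obj in h5file.items():
        if isinstance(obj, h5py.Group) and int(obj.attrs.get("emd_group_type", -1)) == 2:
            names.append(name)
    return names

# Build a toy file in memory (placeholder file name, nothing written to disk).
with h5py.File("toy_4dstem.h5", "w", driver="core", backing_store=False) as f:
    top = f.create_group("4DSTEM_experiment")
    top.attrs["emd_group_type"] = 2
    # Assumption for this sketch: the version attributes sit on the top-level group.
    top.attrs["version_major"] = 0
    top.attrs["version_minor"] = 7
    for child in ("data", "log", "metadata"):
        top.create_group(child)
    f.create_group("unrelated")               # no emd_group_type, so it is ignored

    found = find_4dstem_groups(f)
    version = (int(top.attrs["version_major"]), int(top.attrs["version_minor"]))
    children = sorted(f[found[0]].keys())
```

Because the search keys on the emd_group_type attribute and not on the group name, the same helper would find a group named “4DSTEM_simulation” as well.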

counted_datacubes

Inside the “counted_datacubes” group are any number of groups with any name. The group names correspond to the names that will be displayed by the py4DSTEM FileBrowser, and unless specified by the user are of the format datacube_N. Each datacube group has the following attributes:

  • emd_group_type: Integer value of 1, indicating this is a data group.
  • dimensions: Integer value of 1 or 2, specifying whether the list of electron events is stored in (raveled) 1D format or (unraveled) 2D format.
  • coordinates: This attribute is optional; it is written by the py4DSTEM exporter but not used in reading. Contains a string dump of the numpy datatype of the electron event lists.
  • metadata: Integer flag indicating whether there is a Metadata entry associated with this DataObject.

The group also contains 6 datasets:

  • data: A 2D variable-length array, with each entry in the array containing either (a) a 1D integer array where each value corresponds to an electron strike at the raveled (1D) position on the detector given by that value; or (b) a 1D or 2D numpy structured array written by h5py, with each row containing the 1D or 2D (raveled or unraveled) position of an electron strike. If using numpy structured arrays, the index_coords dataset is required (for plain integer arrays, it is not needed).
  • dim1, dim2, dim3, dim4: Datasets containing the coordinate arrays along each dimension, where 1 and 2 are real-space scan coordinates and 3 and 4 are detector coordinates. The lengths of dim3 and dim4 encode the dimensions of the detector used in reconstructing the diffraction patterns, while the values in all 4 arrays are only used for later calibration and so may be filled arbitrarily or with calibrated values.
  • index_coords: String array used when data is a numpy structured array written by py4DSTEM, containing the names of the structured array fields that contain the (1 or 2) indices. This is here mainly because behind the scenes py4DSTEM uses the same routines to save counted data as PointListArrays, which contain arbitrary data fields.
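
To illustrate how one entry of the data array relates to a diffraction pattern, here is a rough sketch that turns a list of flattened 1D detector indices into a 2D count image. It assumes row-major (C-order) flattening and a detector shape taken from the lengths of dim3 and dim4; the function name and the toy values are hypothetical and this is not the py4DSTEM implementation.

```python
import numpy as np

def pattern_from_flat_indices(events, detector_shape):
    """Histogram flattened (1D) electron strike positions into a 2D pattern.

    Assumes row-major (C-order) flattening of the detector pixels.
    """
    n_pixels = detector_shape[0] * detector_shape[1]
    counts = np.bincount(np.asarray(events, dtype=np.int64), minlength=n_pixels)
    return counts.reshape(detector_shape)

# Toy detector of 3 x 4 pixels, i.e. len(dim3) == 3 and len(dim4) == 4.
detector_shape = (3, 4)
events = [0, 5, 5, 11]                        # one scan position, four strikes

pattern = pattern_from_flat_indices(events, detector_shape)

# The equivalent 2D positions of the same strikes.
rows, cols = np.unravel_index(events, detector_shape)
```

Repeating this for every entry of the 2D data array would rebuild the full 4D datacube, one diffraction pattern per scan position.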

datacubes

Details of DCs

diffractionslices

Details of DSs

realslices

Details of RSs

pointlistarrays

Details of PLAs

pointlists

Details of PLs


5. EMD File Contents – v0.3-0.6

EMD versions 0.3-0.6 were developed by the py4DSTEM project. Legacy documentation for these formats is available in this document in the py4DSTEM GitHub repo. Legacy readers are also provided through the py4DSTEM.file.io.read(...) function. The primary differences between these versions and v0.7 are that v0.7 and later: (i) support electron-counted datasets natively; (ii) store PointListArrays in a more compact format; and (iii) support memory mapping and compressed data for 4D datasets.


6. EMD File Contents – v0.2

EMD files are always valid HDF5 files. However, scientific data in an EMD file must be stored in a specific manner. Each N-dimensional dataset is stored in an EMD data group, which is marked as such by an attribute named “emd_group_type” with an integer value of 1. There are no restrictions on the name of this group. Any number of EMD data groups may be placed anywhere in the EMD file. However, we do not recommend placing EMD data groups directly inside the root group of an EMD file; rather, they should be placed inside a group with a descriptive name.

Each valid EMD data group typically contains multiple datasets. The first (and only absolutely required) is the data itself, which is stored in a dataset named “data” with N dimensions. Inside the EMD data group, each dimension must also have an associated dataset named dim#, where # ranges from 1 to N. The values of these datasets correspond to the coordinates along their corresponding dimension. For example, the x dimension of a 1024 × 1024 pixel micrograph with a pixel size of 0.02 nm would have dim1 values of [0, 0.02, 0.04, … 20.46].

In addition to the coordinates, each dim# dataset should also contain two attributes: name and units. Without these attributes, the viewer will still function but data processing routines may fail or produce incorrect results. These attributes correspond to the name of the dimension vector and the units it is specified in. For example, name = x and units = [n_m] are common values for dim1. You may also specify a “name” and “units” for the data group that stores the scientific dataset. The viewer will display these values below the histogram / colorbar panel.

[Figure: emd_minimal-01, a minimal EMD v0.2 file containing a single EMD data group]

If the dim# vectors have a constant step size (linearly spaced coordinates), you may remove all entries except for the first two values. To be consistent with the above description, these values will therefore be equal to [offset, offset+step]. The EMD viewer will assume linear steps and extrapolate.
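
The extrapolation from a two-entry dim# vector can be sketched as follows. The helper expand_dim is hypothetical and is not taken from the EMD viewer; it simply applies the [offset, offset+step] convention to the 1024-pixel, 0.02 nm example used earlier.

```python
import numpy as np

def expand_dim(dim_values, length):
    """Return a full coordinate vector of the given length.

    A vector that already has `length` entries is returned unchanged; a
    two-entry vector is read as [offset, offset + step] and extrapolated.
    """
    dim_values = np.asarray(dim_values, dtype=float)
    if dim_values.size == length:
        return dim_values
    if dim_values.size == 2:
        offset = dim_values[0]
        step = dim_values[1] - dim_values[0]
        return offset + step * np.arange(length)
    raise ValueError("dim vector must have either 2 or `length` entries")

# 1024 pixels with a pixel size of 0.02 nm, stored as only two values.
dim1 = expand_dim([0.0, 0.02], 1024)
```

Up to floating-point rounding, the last entry of dim1 is 20.46, which matches the fully written-out vector in the example above.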

The only other requirement of the EMD specification is to attach the “version_major” and “version_minor” attributes to the root group. The version attributes allow a processing routine to check which features of EMD are expected to be implemented. A minimal EMD file for the current specification version (0.2) with a single N-dimensional dataset therefore looks like the accompanying diagram. This minimal example contains a single EMD group, consisting of an N-dimensional dataset.

Without the “name” and “units” attributes, the EMD viewer program can still parse the file, and these fields will default to numerical dimensions and pixels respectively. The “version” attribute is important for programs to validate which version of the EMD specification they are using before processing the data.
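
Putting the v0.2 rules together, a minimal file could be written and scanned with h5py roughly as follows. This is a sketch under stated assumptions: the group names "micrographs" and "image_001", the image size, the pixel size, and the helper find_emd_data_groups are all hypothetical, and the file is kept in memory only.

```python
import h5py
import numpy as np

def find_emd_data_groups(h5file):
    """Return the paths of all groups marked with emd_group_type == 1."""
    found = []

    def visit(name, obj):
        if isinstance(obj, h5py.Group) and int(obj.attrs.get("emd_group_type", 0)) == 1:
            found.append(name)

    h5file.visititems(visit)
    return found

image = np.random.default_rng(0).random((8, 16))   # stand-in for a micrograph

with h5py.File("minimal.emd", "w", driver="core", backing_store=False) as f:
    # Version attributes live on the root group.
    f.attrs["version_major"] = 0
    f.attrs["version_minor"] = 2

    # A descriptively named parent group, so the data group is not in the root.
    parent = f.create_group("micrographs")
    grp = parent.create_group("image_001")
    grp.attrs["emd_group_type"] = 1                 # marks an EMD data group
    grp.create_dataset("data", data=image)

    # One dim# dataset per dimension, each with name and units attributes.
    for axis, (name, step) in enumerate((("x", 0.02), ("y", 0.02)), start=1):
        coords = step * np.arange(image.shape[axis - 1])
        dim = grp.create_dataset(f"dim{axis}", data=coords)
        dim.attrs["name"] = name
        dim.attrs["units"] = "[n_m]"

    emd_groups = find_emd_data_groups(f)
    dim_lengths = (grp["dim1"].shape[0], grp["dim2"].shape[0])
```

Since EMD data groups may sit anywhere in the file, the search walks the whole tree and does not assume a fixed path.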


7. Metadata Tags and Naming Conventions

One of the primary benefits of HDF5 is its ability to include an arbitrary amount of metadata, of arbitrary types. Large arrays of values should be stored in datasets, and single values or small arrays can be stored in attributes. We do not place any restrictions on including additional data in EMD files. Extra datasets and attributes can be freely added, even to EMD data groups, at any level.

However, we do make recommendations on how to store experimental metadata.  These guidelines are intended to make it easier for different applications that use the EMD format to communicate necessary values, such as microscope parameters.  More standards may be added in the future as data processing needs evolve. Metadata in the EMD format is typically divided into various top-level groups. Each group can contain any number of associated attributes, as well as additional sub-groups. The currently recommended top-level metadata groups for electron microscopy data stored in EMD format are:

A. Microscope

This group contains all of the relevant microscope (and therefore imaging) settings. These include parameters such as accelerating voltage, pixel size / magnification, beam current, and many others. Long lists of closely related parameters can be further lumped into subgroups. For example, the hardware corrector of an aberration-corrected TEM records many aberration parameters. Each of the (many) aberration parameters it can measure has a magnitude, angle and unit. Rather than clutter up the microscope group, we recommend placing these parameters in a subgroup aberrations.

B. Sample

We have found it very useful to include metadata tags related to the sample being imaged. These tags could include the material being imaged, sample processing information, or even qualitative comments related to the sample quality. Recording this information during the experiment greatly aids quality assessments at a future date.

C. User

All information related to the microscope operator (or the person who simulated the data) can be stored here. Information such as the user’s name, institution, department and contact email is very useful for tracing the provenance of experimental data. It also aids in database applications for large volumes of microscopy data.

D. Comments

This is a special group that contains time-stamped information on the EMD dataset. In particular, we recommend that whenever the data is modified, a note describing the change be appended to the comments group. This will prevent potential mix-ups between raw and filtered data.

Each of these groups is optional. However, using standard conventions for metadata information will make it much easier to develop software and algorithms that make use of the EMD format.
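
As one possible layout for the four recommended metadata groups, consider the h5py sketch below. Every attribute name and value is a hypothetical placeholder, and storing each comment as an attribute keyed by its timestamp is an assumption of this sketch, not a rule stated in this document.

```python
from datetime import datetime, timezone

import h5py

# In-memory file with a placeholder name; nothing is written to disk.
with h5py.File("metadata.emd", "w", driver="core", backing_store=False) as f:
    # A. Microscope settings, with closely related parameters in a subgroup.
    microscope = f.create_group("microscope")
    microscope.attrs["voltage"] = 300.0
    microscope.attrs["voltage_units"] = "[k_V]"
    aberrations = microscope.create_group("aberrations")
    aberrations.attrs["c3_magnitude"] = 1.2        # hypothetical parameter name
    aberrations.attrs["c3_angle"] = 0.0
    aberrations.attrs["c3_units"] = "[u_m]"

    # B. Sample information.
    sample = f.create_group("sample")
    sample.attrs["material"] = "hypothetical material"

    # C. Operator information.
    user = f.create_group("user")
    user.attrs["name"] = "hypothetical user"
    user.attrs["institution"] = "hypothetical institution"

    # D. Time-stamped comments (assumption: one attribute per timestamp).
    comments = f.create_group("comments")
    stamp = datetime.now(timezone.utc).isoformat()
    comments.attrs[stamp] = "raw data, no filtering applied"

    top_level = sorted(f.keys())
    has_aberrations = "aberrations" in microscope
    n_comments = len(comments.attrs)
```

All names follow the lowercase, underscore-separated convention, so a reader can locate them without case handling.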

We also have some general recommendations for EMD file contents and naming conventions. All group and attribute names are lowercase and single word if possible. If multiple-word names are needed, the words are separated by underscore characters. Units are specified inside repeated square brackets, one pair of brackets per unit, with any prefix placed in front of the unit and separated from it by an underscore. Raising a unit to a power other than 1 is specified by the ^ character; inverse units are written with a negative exponent. These conventions make units easily machine readable. Some examples include:

Unit Type                        Unit Name                     EMD Unit

Length                           nanometers                    [n_m]
Electric potential               kilovolts                     [k_V]
Electron interaction parameter   rads per nanometer squared    [rad][n_m^-2]
Force                            Newtons                       [N] or [k_g][m][s^-2]
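
To show that the unit convention is machine readable, here is a small sketch of a parser for EMD unit strings. The function parse_emd_units and its regular expression are hypothetical and are not part of any EMD library; the sketch assumes ASCII prefixes and unit symbols and integer exponents.

```python
import re

# One bracketed factor: optional "prefix_", a unit symbol, optional "^power".
_FACTOR = re.compile(
    r"\[(?:(?P<prefix>[A-Za-z]+)_)?(?P<unit>[A-Za-z]+)(?:\^(?P<power>-?\d+))?\]"
)

def parse_emd_units(text):
    """Split an EMD unit string into (prefix, unit, power) tuples."""
    factors = []
    pos = 0
    while pos < len(text):
        match = _FACTOR.match(text, pos)
        if match is None:
            raise ValueError(f"cannot parse unit string at position {pos}: {text!r}")
        power = int(match.group("power") or 1)      # a missing exponent means 1
        factors.append((match.group("prefix") or "", match.group("unit"), power))
        pos = match.end()
    return factors

length = parse_emd_units("[n_m]")
interaction = parse_emd_units("[rad][n_m^-2]")
force = parse_emd_units("[k_g][m][s^-2]")
```

Here length is [("n", "m", 1)], interaction is [("", "rad", 1), ("n", "m", -2)], and force is [("k", "g", 1), ("", "m", 1), ("", "s", -2)].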