Single-molecule Dataset (SMD) Format

A file format for publication and exchange of data from experiments in single-molecule biophysics

The single-molecule dataset (SMD) format has been jointly developed in the groups of Dan Herschlag (Stanford) and Ruben Gonzalez (Columbia) to facilitate publication and exchange of data and analysis results obtained in single-molecule studies. This repository contains Matlab utility functions for creating, validating, saving and loading SMD structures.

The smd-python repository contains similar Python utility functions for creating, validating, saving and loading SMD structures.

Format Description

An SMD structure is represented in Matlab as follows:

  • dataset : struct
    • .id : string
      Unique identifier for collection of traces (e.g. a hash)
    • .desc : string
      Human-readable descriptor for dataset
    • .types : struct
      • .index : "bool" | "float" | "double" | "int" | "long" | "string"
        Data type for index
      • .values : struct
        Data types for column values. Each field .column_name contains a format string as in .index
    • .attr : struct
      Dataset level features (e.g. descriptors of experimental conditions)
    • .data : 1 x N struct
      • .id : string
        Unique identifier for trace (e.g. a hash)
      • .attr : struct
        Any trace-specific features that are not series
      • .index : 1 x T vector
        Row index for trace data (e.g. acquisition times)
      • .values : struct
        Column values. Each field .column_name holds a 1 x T vector
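
As an illustration of the layout above, the following sketch builds a one-trace SMD-style struct by hand in plain Matlab and checks that every column has the same length as the index. All field contents (the identifiers, the fret column, the buffer attribute) are made up for illustration, and the check is only a rough consistency test, not a substitute for smd.isvalid.

```matlab
% Build one trace by hand: an index of T points plus one column per observable.
T = 5;
single_trace = struct();
single_trace.id = 'trace-001';                     % hypothetical identifier
single_trace.attr = struct('molecule', 1);         % trace-level, non-series features
single_trace.index = (1:T) * 0.1;                  % 1 x T acquisition times (made up)
single_trace.values = struct('fret', rand(1, T));  % hypothetical column, 1 x T

% Wrap the trace in a dataset-level struct.
dataset = struct();
dataset.id = 'example-dataset';                    % hypothetical identifier
dataset.desc = 'hand-built example';
dataset.types = struct('index', 'float', 'values', struct('fret', 'float'));
dataset.attr = struct('buffer', 'made-up condition');
dataset.data = single_trace;                       % 1 x N struct array, here N = 1

% Rough consistency check: every column must match the index length.
columns = fieldnames(dataset.data(1).values);
lengths_ok = true;
for k = 1:numel(columns)
    column = dataset.data(1).values.(columns{k});
    lengths_ok = lengths_ok && (numel(column) == numel(dataset.data(1).index));
end
disp(lengths_ok)
```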

Installation

  1. Download this repository from
    https://github.com/smdata/smd-matlab/archive/master.zip

  2. Unzip master.zip to some location (e.g. c:\path\)

  3. Add the unpacked smd-matlab directory to the Matlab path by typing

    addpath(genpath('c:\path\smd-matlab\'))
    

    where c:\path\ is the directory where master.zip was unpacked.

Functions

smd.create(data, types, varargin): Creates an SMD structure from supplied data.

smd.write_json(filename, dataset): Saves an SMD structure as JSON (.json) or compressed JSON (.json.gz).

smd.read_json(filename): Loads an SMD structure from JSON (.json) or compressed JSON (.json.gz).

smd.isvalid(dataset): Checks if supplied struct is a valid SMD instance.

smd.filter(dataset, ...): Returns a filtered dataset, selecting traces by matching id and attr values or by applying a custom function with boolean output to each trace.

smd.merge(data1, data2, ...): Returns a merged dataset containing all traces in multiple datasets.

Example Usage

Generate some fake data: a mixture of 3 Gaussian distributions

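state_mean = [0.1, 0.5, 0.7];      % mean signal of each state
state_noise = [0.05, 0.10, 0.05];  % noise standard deviation of each state
num_traces = 10;
max_length = 100;
data = cell(1, num_traces);        % one T x 2 matrix per trace
for n = 1:num_traces
    % draw a random trace length T between 1 and max_length
    T = ceil(max_length * rand());
    % assign each time point to one of the states with equal probability
    states = ceil(length(state_mean) * rand(T,1));
    % observation = state mean plus state-dependent Gaussian noise
    observations = state_mean(states)' + state_noise(states)' .* randn(T,1);
    data{n} = [states, observations];
end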

Create an SMD structure

% initialize smd structure
dataset = smd.create(data, {'state', 'int', 'observation', 'float'});
% add global attributes
dataset.attr.description = 'example data: mixture of 3 gaussians with equal occupancy';
dataset.attr.state_mean = state_mean;
dataset.attr.state_noise = state_noise;
dataset.attr.max_length = max_length;

Save data to disk

% save as Matlab data
save('example.mat', '-struct', 'dataset');
% save as plain text JSON (uncompressed)
smd.write_json('example.json', dataset);
% save as plain text JSON (with gzip compression)
smd.write_json('example.json.gz', dataset);
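
To get a feel for how a nested Matlab struct maps onto JSON text, the sketch below encodes a tiny hand-built trace with Matlab's built-in jsonencode and decodes it again with jsondecode (both available since R2016b). This is an illustration only: it does not use smd.write_json, whose exact output may differ, and the trace contents are made up.

```matlab
% Tiny hand-built trace following the field layout of the format description.
tiny = struct();
tiny.id = 'trace-001';                                  % hypothetical identifier
tiny.index = [1, 2, 3];                                 % 1 x T row index
tiny.values = struct('observation', [0.1, 0.5, 0.7]);   % made-up column

% Encode to JSON text, then decode back into a struct.
json_text = jsonencode(tiny);
decoded = jsondecode(json_text);

disp(json_text)

% jsondecode returns numeric arrays as column vectors, so reshape to 1 x T
% before comparing with the original row vectors.
same_index = isequal(decoded.index(:)', tiny.index);
same_values = isequal(decoded.values.observation(:)', tiny.values.observation);
```

The reshape step matters when comparing round-tripped data against the 1 x T vectors described in the format: the decoded arrays hold the same numbers but in column orientation.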

Load data from disk

% read Matlab data
example = load('example.mat');
% read plain text JSON (uncompressed)
example = smd.read_json('example.json');
% read plain text JSON (with gzip compression)
example = smd.read_json('example.json.gz');

Filter data

% keep only traces with more than 50 data points
% (the length of the row index equals the number of data points)
filtered = smd.filter(example, 'func', @(d) numel(d.index) > 50);
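
Conceptually, a 'func' filter keeps the traces for which the supplied function returns true. A minimal plain-Matlab sketch of that idea, using arrayfun on a hand-built stand-in for the .data array, might look as follows. It illustrates the selection logic under the format described above and is not the actual implementation of smd.filter; the trace ids and lengths are made up.

```matlab
% Hand-built stand-in for dataset.data: three traces of different lengths.
trace_lengths = [20, 60, 80];
traces = struct('id', {}, 'index', {}, 'values', {});   % empty struct array
for n = 1:numel(trace_lengths)
    T = trace_lengths(n);
    traces(n).id = sprintf('trace-%d', n);               % hypothetical ids
    traces(n).index = 1:T;                               % 1 x T row index
    traces(n).values = struct('observation', randn(1, T));
end

% Predicate with boolean output, applied to each trace.
is_long = @(d) numel(d.index) > 50;

% Evaluate the predicate per trace and keep the matching ones.
keep = arrayfun(is_long, traces);
long_traces = traces(keep);

disp({long_traces.id})
```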