
Configuring Metadata Sources

Developer Documentation

This page is intended for visualization authors configuring metadata sources in samples.

Overview

Metadata sources make sample metadata configurable as explicit sources instead of one monolithic metadata table. This enables:

  • eager loading for regular tabular metadata (backend: "data")
  • lazy loading for large matrix-like metadata (backend: "zarr")
  • per-source defaults for imported columns

Legacy compatibility

The legacy samples.data and samples.attributes configuration remains supported for backward compatibility, but new configurations should use samples.metadataSources.

Quick example

{
  "samples": {
    "identity": {
      "data": { "url": "samples.tsv" },
      "idField": "sample",
      "displayNameField": "displayName"
    },
    "metadataSources": [
      {
        "id": "clinical",
        "name": "Clinical",
        "initialLoad": "*",
        "excludeColumns": ["sample", "displayName"],
        "backend": {
          "backend": "data",
          "data": { "url": "samples.tsv" },
          "sampleIdField": "sample"
        }
      },
      {
        "id": "expression",
        "name": "Expression",
        "initialLoad": false,
        "groupPath": "Expression",
        "attributes": {
          "TP53": {
            "type": "quantitative",
            "scale": { "scheme": "redblue", "domainMid": 0 }
          }
        },
        "backend": {
          "backend": "zarr",
          "url": "data/expr.zarr"
        }
      }
    ]
  }
}

In this example, identity reads the canonical sample ids and display names from samples.tsv. The first metadata source (clinical) uses the same TSV as an eager table source and loads all non-excluded columns at startup. The second source (expression) points to a Zarr matrix and is not loaded at startup (initialLoad: false), so its columns are imported on demand. It also defines quantitative styling for selected expression columns (here TP53), which are placed under the Expression group.

If you omit initialLoad, backend defaults apply. For backend: "data", that means load all columns by default; use excludeColumns to keep helper fields such as sample and displayName out of metadata attributes.
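
The startup rules above can be modeled in a few lines of Python. This is a minimal sketch of the documented behavior, not GenomeSpy's code: the function name, its parameters, and the column names age and stage are hypothetical, and the default for backend: "zarr" (no startup load) is an assumption.

```python
from typing import List, Sequence, Union

InitialLoad = Union[bool, str, Sequence[str], None]


def columns_to_load_at_startup(
    backend: str,
    available: Sequence[str],
    initial_load: InitialLoad = None,
    exclude: Sequence[str] = (),
) -> List[str]:
    """Model of the documented initialLoad rules (hypothetical helper)."""
    if initial_load is None:
        # Documented: "data" loads all columns by default.
        # Assumption: "zarr" loads nothing at startup by default.
        initial_load = "*" if backend == "data" else False
    if initial_load is True:
        raise ValueError("this page does not describe initialLoad: true")
    if initial_load is False:
        return []
    if initial_load == "*":
        wanted = list(available)
    else:
        # Listed columns only; unknown names are simply skipped in this model.
        wanted = [c for c in initial_load if c in available]
    # excludeColumns always wins: excluded columns are never imported.
    excluded = set(exclude)
    return [c for c in wanted if c not in excluded]


columns = ["sample", "displayName", "age", "stage"]
print(columns_to_load_at_startup("data", columns, None, ["sample", "displayName"]))
```

With the defaults of the "data" backend, only the helper columns named in excludeColumns are left out.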

Splitting configuration into files

When source definitions become long, you can keep samples.metadataSources compact by importing each source from a separate JSON file.

Example in the main spec:

{
  "samples": {
    "identity": {
      "data": { "url": "samples.tsv" },
      "idField": "sample",
      "displayNameField": "displayName"
    },
    "metadataSources": [
      { "import": { "url": "metadata-sources/clinical-source.json" } },
      { "import": { "url": "metadata-sources/expression-source.json" } }
    ]
  },
  ...
}

Example imported source file (metadata-sources/clinical-source.json):

{
  "id": "clinical",
  "name": "Clinical",
  "initialLoad": "*",
  "excludeColumns": ["sample", "displayName"],
  "backend": {
    "backend": "data",
    "data": { "url": "../samples.tsv" },
    "sampleIdField": "sample"
  }
}

Import behavior:

  • each imported file must define exactly one metadata source object
  • nested imports are not supported
  • relative paths are resolved using GenomeSpy base-url rules
  • backend URLs inside an imported source are resolved relative to that imported file
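
The last rule follows ordinary relative-URL resolution. The sketch below uses Python's urllib.parse.urljoin to show where ../samples.tsv ends up. The spec location https://example.org/viz/spec.json is hypothetical, and using the spec URL as the base for the import is an assumption about the base-url rules.

```python
from urllib.parse import urljoin

# Hypothetical location of the main spec.
SPEC_URL = "https://example.org/viz/spec.json"

# The import URL is resolved against the spec location (assumed base).
import_url = urljoin(SPEC_URL, "metadata-sources/clinical-source.json")

# The backend URL inside the imported file is resolved against that file.
data_url = urljoin(import_url, "../samples.tsv")

print(import_url)  # https://example.org/viz/metadata-sources/clinical-source.json
print(data_url)    # https://example.org/viz/samples.tsv
```

This is why the imported file refers to ../samples.tsv while the main spec refers to samples.tsv: both point to the same file.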

Configuring attribute types and scales

Attribute configuration can be attached directly to a source:

  • attributes applies to specific columns
  • attributes[""] sets a source-level default for all imported columns

Example:

{
  "id": "clinical",
  "name": "Clinical",
  "groupPath": "Clinical",
  "initialLoad": "*",
  "excludeColumns": ["sample", "displayName"],
  "attributes": {
    "purity": {
      "type": "quantitative",
      "scale": {
        "domain": [0, 1],
        "scheme": "yellowgreenblue"
      }
    },
    "ploidy": {
      "type": "quantitative",
      "scale": {
        "domain": [1.5, 6],
        "scheme": "blues"
      }
    },
    "treatment": {
      "title": "Treatment",
      "visible": true
    }
  },
  "backend": {
    "backend": "data",
    "data": { "url": "samples.tsv" },
    "sampleIdField": "sample"
  }
}

In this example, purity and ploidy are configured as quantitative with custom scales, and treatment gets a custom title and is explicitly visible. Imported columns without explicit definitions still work: GenomeSpy infers their types from the values. When column names are hierarchical (see attributeGroupSeparator), attributes keys can also target group nodes to set shared defaults. See Grouping and hierarchy.

If you need an explicit source-wide default (instead of inference), define attributes[""] and then override selected columns with specific keys.

Grouping and hierarchy

Grouping helps when metadata has many attributes: users can collapse and expand groups in the hierarchy, and authors can configure shared defaults once at group level (for example type and scale) instead of repeating them for every child column.

Metadata organization is controlled by two related properties:

  • groupPath: where imported columns are placed
  • attributeGroupSeparator: how path-like column names are split into groups

Placement with groupPath

Without groupPath, imported columns are added at the root. With groupPath, imported columns are prefixed under that path.

Example:

{
  "id": "expression",
  "groupPath": "Expression",
  "backend": {
    "backend": "zarr",
    "url": "data/expr.zarr"
  }
}

Importing column TP53 from this source creates attribute path Expression/TP53.

Hierarchy with attributeGroupSeparator

attributeGroupSeparator lets grouped column names define hierarchy levels. It also enables group-level definitions in attributes.

Suppose you have columns such as:

  • patientId
  • clinical.PFI
  • clinical.OS
  • signature.HRD
  • signature.APOBEC

With attributeGroupSeparator: ".", the clinical.* and signature.* columns are grouped under clinical and signature.

Inheritance rules are straightforward: child columns inherit type and scale from the nearest parent group unless overridden by a more specific key. visible and title apply to the group node itself (for example clinical) rather than to all child columns.

Example configuration:

{
  "id": "clinical",
  "name": "Clinical",
  "attributeGroupSeparator": ".",
  "attributes": {
    "patientId": {
      "type": "nominal"
    },
    "clinical": {
      "type": "quantitative",
      "scale": { "scheme": "blues" }
    },
    "clinical.OS": {
      "visible": false
    },
    "signature": {
      "type": "quantitative",
      "scale": { "scheme": "yelloworangered" },
      "visible": false
    }
  },
  "backend": {
    "backend": "data",
    "data": { "url": "samples.tsv" },
    "sampleIdField": "sample"
  }
}

In this configuration, clinical.PFI inherits the quantitative type and the blues scale from clinical. clinical.OS inherits the same defaults and adds its own setting (visible: false). Without attributeGroupSeparator, no path splitting is applied: column names and attributes keys are treated as flat ids.
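
The inheritance rules can be expressed as a small lookup. The following Python sketch models them as described on this page. It is not GenomeSpy's implementation, the function name effective_def is hypothetical, and the source-level "" default is left out for brevity.

```python
from typing import Any, Dict, Optional

# Only these properties are inherited from parent groups;
# visible and title stay on the node that defines them.
INHERITED = ("type", "scale")

ATTRIBUTES: Dict[str, Dict[str, Any]] = {
    "patientId": {"type": "nominal"},
    "clinical": {"type": "quantitative", "scale": {"scheme": "blues"}},
    "clinical.OS": {"visible": False},
    "signature": {
        "type": "quantitative",
        "scale": {"scheme": "yelloworangered"},
        "visible": False,
    },
}


def effective_def(
    attributes: Dict[str, Dict[str, Any]],
    column: str,
    separator: Optional[str] = None,
) -> Dict[str, Any]:
    """Resolve the definition of one column (hypothetical helper)."""
    result: Dict[str, Any] = {}
    if separator:
        parts = column.split(separator)
        # Walk ancestor groups from outermost to nearest, so that the
        # nearest parent group overwrites more distant ones.
        for depth in range(1, len(parts)):
            group = attributes.get(separator.join(parts[:depth]), {})
            for prop in INHERITED:
                if prop in group:
                    result[prop] = group[prop]
    # The column's own definition overrides anything inherited.
    result.update(attributes.get(column, {}))
    return result


print(effective_def(ATTRIBUTES, "clinical.PFI", "."))
print(effective_def(ATTRIBUTES, "clinical.OS", "."))
```

Without a separator the function returns only what is defined under the exact column key, which mirrors the flat-id behavior.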

Using both together

When both are set, groupPath places imported attributes under a destination group and attributeGroupSeparator defines how grouped names are interpreted.

attributeGroupSeparator also affects how groupPath itself is parsed:

  • with attributeGroupSeparator: ".", groupPath: "Expression.RNA" becomes Expression/RNA
  • without attributeGroupSeparator, groupPath is not split. The whole value is treated as one group name (for example "Expression/RNA" stays one group id).
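
The interaction of the two properties can be sketched as follows. This Python model only restates the rules above: the function name attribute_path is hypothetical, and the resulting path is shown as a list of hierarchy levels.

```python
from typing import List, Optional


def attribute_path(
    column: str,
    group_path: Optional[str] = None,
    separator: Optional[str] = None,
) -> List[str]:
    """Return the hierarchy levels for an imported column (hypothetical helper)."""

    def split(name: str) -> List[str]:
        # Splitting happens only when a separator is defined for the source.
        return name.split(separator) if separator else [name]

    prefix = split(group_path) if group_path else []
    return prefix + split(column)


print(attribute_path("TP53", "Expression"))           # ['Expression', 'TP53']
print(attribute_path("TP53", "Expression.RNA", "."))  # ['Expression', 'RNA', 'TP53']
print(attribute_path("TP53", "Expression/RNA"))       # ['Expression/RNA', 'TP53']
```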

Schema reference

samples entry points

identity

Type: SampleIdentityDef

Optional explicit sample identity definition.

metadataSources

Type: array

Metadata source definitions used for startup and on-demand imports.

Source order is significant for startup loading: eager startup imports are applied in declaration order.

Metadata source definitions

Type: MetadataSourceDef | object

attributeGroupSeparator

Type: string

Separator used by source-side attribute names to express hierarchy.

Example: if separator is ".", column clinical.OS is interpreted as group clinical and attribute OS.

attributes

Type: object

Attribute definitions keyed by attribute/column id (and optionally by group path).

Special key "" defines source-level defaults for all imported columns. Path splitting is applied only when attributeGroupSeparator is defined.

backend (required)

Type: DataBackendDef | ZarrBackendDef | ParquetBackendDef | ArrowBackendDef

Backend-specific source configuration.

description

Type: string

Optional short description of what this source contains.

It can be shown in the UI and can help automated agents choose the correct source.

excludeColumns

Type: array

Column ids that must never be imported from this source.

Useful for excluding identity/helper columns such as sample and displayName.

groupPath

Type: string

Default destination group path for imported attributes.

Imported column names are placed under this path, which effectively creates (or reuses) a metadata hierarchy node.

This value is parsed as a path using attributeGroupSeparator when that separator is defined for the source. Without an explicit separator, the whole value is treated as one group name (including any / characters).

Users can override this per import in the dialog.

id

Type: string

Stable source identifier used in actions, provenance, and configuration.

Should remain stable across spec revisions if bookmarks/provenance replay must keep working.

initialLoad

Type: boolean | "*" | string[]

Startup loading behavior.

  • false: do not load at startup
  • "*": load all columns allowed by this source
  • string[]: resolve and load only the listed columns

If initialLoad is omitted, the backend's default applies.

name

Type: string

Optional user-facing label shown in menus and dialogs.

If omitted, the UI falls back to id.

Backends

data backend

data (required)

Type: UrlData | InlineData

Eager tabular metadata source using the standard data contract.

Supports UrlData and InlineData.

sampleIdField

Type: string

Field name in the table that matches sample ids in the view.

Default value: "sample"

zarr backend

Example with optional lookup helpers and matrix path overrides:

{
  "id": "expression",
  "name": "Expression (Zarr)",
  "description": "Normalized expression matrix with identifier lookup.",
  "initialLoad": false,
  "groupPath": "Expression",
  "attributes": {
    "": {
      "type": "quantitative",
      "scale": { "scheme": "redblue", "domainMid": 0 }
    }
  },
  "backend": {
    "backend": "zarr",
    "url": "data/expr.zarr",
    "matrix": {
      "valuesPath": "X",
      "rowIdsPath": "obs_names",
      "columnIdsPath": "var_names"
    },
    "identifiers": [
      {
        "name": "symbol",
        "path": "var/symbol",
        "primary": true,
        "caseInsensitive": true
      },
      {
        "name": "ensembl",
        "path": "var/ensembl_id",
        "stripVersionSuffix": true
      }
    ]
  }
}

If your store uses the default matrix paths (X, obs_names, var_names), you can omit the entire matrix block. Identifier helpers are optional too: if omitted, only primary column ids are used for lookup. For a minimal setup, see the simpler Zarr example near the top of this page.

AnnData context and current scope

GenomeSpy currently supports an AnnData-compatible matrix subset for metadata import.

This is aimed at expression-style workflows where users import selected genes as metadata attributes. It is not full AnnData object support. Zarr metadata sources are primarily useful for large matrices with selective loading; for small tabular metadata, backend: "data" is usually simpler.

AnnData compatibility checklist

  • matrix values array (valuesPath, default X) with shape (n_samples, n_features)
  • sample id array (rowIdsPath, default obs_names) with length n_samples
  • feature id array (columnIdsPath, default var_names) with length n_features
  • optional lookup arrays via identifiers (for example var/symbol, var/ensembl_id) with length n_features
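
The checklist boils down to a few shape comparisons. The sketch below checks them in Python, given a mapping from array path to shape. The function name check_layout and the shapes mapping are hypothetical: you would collect the shapes from your store with whatever Zarr reader you use.

```python
from typing import Dict, List, Sequence, Tuple

Shape = Tuple[int, ...]


def check_layout(
    shapes: Dict[str, Shape],
    values_path: str = "X",
    row_ids_path: str = "obs_names",
    column_ids_path: str = "var_names",
    identifier_paths: Sequence[str] = (),
) -> List[str]:
    """Return a list of problems; an empty list means the layout fits."""
    values = shapes.get(values_path)
    if values is None or len(values) != 2:
        return [f"{values_path}: expected a 2-D array (n_samples, n_features)"]

    n_samples, n_features = values
    # Every id or lookup array must be 1-D with the matching length.
    expected = {row_ids_path: n_samples, column_ids_path: n_features}
    for path in identifier_paths:
        expected[path] = n_features

    problems = []
    for path, length in expected.items():
        found = shapes.get(path)
        if found != (length,):
            problems.append(f"{path}: expected shape ({length},), found {found}")
    return problems


store = {"X": (3, 2), "obs_names": (3,), "var_names": (2,), "var/symbol": (2,)}
print(check_layout(store, identifier_paths=["var/symbol"]))  # []
```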

Current limitations:

  • sparse matrix handling is not supported in this metadata-source path
  • AnnData layers are not exposed as selectable alternatives to X
  • obs, var, obsm, varm, obsp, varp, uns are not ingested directly

Implementation note: browser-side Zarr access is done with zarrita, fetching only requested arrays/chunks over HTTP.

Zarr backend schema

identifiers

Type: array

Optional identifier arrays used to resolve user queries to columns.

If omitted, only primary column ids are used for lookup.

matrix

Type: ZarrMatrixLayoutDef

Optional path overrides for the expression-style matrix layout.

url (required)

Type: string

URL to the root of the Zarr store.

Zarr layout details

These definitions describe where matrix content lives inside the Zarr store. Use these path overrides for expression-style sample-by-feature arrays.

columnIdsPath

Type: string

Path to matrix column identifiers.

Default value: "var_names"

rowIdsPath

Type: string

Path to matrix row identifiers (sample ids).

Default value: "obs_names"

valuesPath

Type: string

Path to matrix values (sample rows × metadata columns).

Default value: "X"

Zarr identifier helpers

These optional definitions improve column lookup from user-entered terms. Use identifiers for aligned identifier arrays (for example symbol and Ensembl).

caseInsensitive

Type: boolean

Enables case-insensitive matching for this identifier field.

name (required)

Type: string

Logical identifier name shown in UI and diagnostics.

Example values: "symbol", "ensembl", "entrez".

path (required)

Type: string

Backend path that provides identifier values aligned to matrix columns.

The array length must equal the number of columns in the matrix.

primary

Type: boolean

Marks this identifier as the primary, canonical identifier.

stripVersionSuffix

Type: boolean

Removes version suffixes during matching (for example, the trailing .12 in ENSG....12).

Useful for identifiers such as Ensembl ids that may contain version suffixes in some datasets but not in user queries.
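
The effect of caseInsensitive and stripVersionSuffix can be illustrated with a small matching routine. This Python sketch is a model, not GenomeSpy's code: the function names are hypothetical, a version suffix is assumed to be a dot followed by digits at the end of the id, and the version numbers in the example ids are illustrative.

```python
import re
from typing import Optional, Sequence

# Assumption: a version suffix is "." followed by digits at the end.
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def normalize(
    value: str,
    case_insensitive: bool = False,
    strip_version_suffix: bool = False,
) -> str:
    """Apply the configured normalization to a query or a stored identifier."""
    if strip_version_suffix:
        value = _VERSION_SUFFIX.sub("", value)
    if case_insensitive:
        value = value.casefold()
    return value


def find_column(
    query: str,
    identifiers: Sequence[str],
    case_insensitive: bool = False,
    strip_version_suffix: bool = False,
) -> Optional[int]:
    """Return the matrix column index of the first match, or None."""
    wanted = normalize(query, case_insensitive, strip_version_suffix)
    for index, candidate in enumerate(identifiers):
        if normalize(candidate, case_insensitive, strip_version_suffix) == wanted:
            return index
    return None


symbols = ["BRCA1", "TP53"]
ensembl = ["ENSG00000012048.23", "ENSG00000141510.18"]
print(find_column("tp53", symbols, case_insensitive=True))                  # 1
print(find_column("ENSG00000141510", ensembl, strip_version_suffix=True))   # 1
```

Because identifier arrays are aligned to the matrix columns, the returned index identifies the column to import.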

Background references