Skip to content

Eager Data Sources

Eager data sources load and process all available data during the initialization stage. They are suitable for small data sets as they do not support partial loading or loading in response to user interactions. However, eager data sources are often more flexible and straightforward than lazy ones.

The data property of the view specification describes an eager data source. GenomeSpy supports four eager source forms directly in a view specification: loading data from url, embedding it inline with values, binding it by name, and generating numeric rows with sequence. This model is based on Vega-Lite and is mostly compatible, but the sections below describe GenomeSpy's behavior directly.

GenomeSpy can read eager data as "csv", "tsv", "dsv", and "json", as well as additional URL-based formats such as "bed", "bedpe", "fasta", and "parquet". Whatever the source, the data is treated as a table of records that can be further processed with transforms.

Overview

The following eager source forms are available:

Form Purpose Typical use
values Embed data directly in the specification Small examples, constants, test data
url Load data from one or more files CSV/TSV/JSON and other eager file formats
name Bind a dataset by name Root-level datasets or runtime-provided data
sequence Generate a numeric sequence Derived coordinates, bins, synthetic data

Inline Data

Use values when the data is already available in the specification. The most common form is an array of objects, where each object becomes one data record.

Example: Inline object array
{
  "data": {
    "values": [
      { "category": "A", "value": 5 },
      { "category": "B", "value": 8 }
    ]
  }
}

A single object is treated as a one-row dataset:

Example: Inline single object
{
  "data": {
    "values": { "x": 1, "y": 2 }
  }
}

Arrays of primitive values are also allowed. In that case, GenomeSpy wraps each value into an object with a data field.

Example: Inline scalar array
{
  "data": {
    "values": [1, 2, 3]
  }
}

This produces rows equivalent to:

[{ "data": 1 }, { "data": 2 }, { "data": 3 }]

Finally, values may also be a string. String values are parsed according to data.format, so the format type must be specified explicitly.

Example: Parsing inline CSV text
{
  "data": {
    "values": "x,y\n1,2\n3,4",
    "format": { "type": "csv" }
  }
}

Type inference

GenomeSpy uses vega-loader to parse tabular data and infer its data types. Vega-loader is sometimes overly eager to interpret strings as dates. In such cases, the field types need to be specified explicitly. On the other hand, explicit type specification also gives a significant performance boost to parsing performance.

Handling empty (NA) values

Empty or missing values must be presented as empty strings instead of NA that R writes by default. Otherwise type inference fails for numeric fields.

URL Data

Use url to load eager data from files. The URL may be absolute or relative to the page where GenomeSpy is embedded.

Relative URLs are resolved against the current view's effective baseUrl. Base URLs are inherited through the view hierarchy and can be overridden in nested views. If a nested view defines a relative baseUrl, it is resolved against the parent view's baseUrl.

Example: Loading a single file
{
  "data": {
    "url": "data.tsv",
    "format": { "type": "tsv" }
  }
}

GenomeSpy can also load multiple files and concatenate their rows. This is useful when later transforms reshape each file into a common form before the results are concatenated. Each input file is processed as its own batch, so batch-sensitive transforms can reset their internal state between files. For example, this works well when each patient has a separate mutation file with different sample-specific VAF columns that are folded into a compatible long format with "regexFold".

Example: Loading multiple files
{
  "data": {
    "url": ["part1.tsv", "part2.tsv"],
    "format": { "type": "tsv" }
  }
}

If you have many files, you can place the file list in a separate file with urlsFromFile.

Example: Loading file URLs from a manifest
{
  "data": {
    "url": { "urlsFromFile": "variant-file-list.tsv", "type": "tsv" },
    "format": { "type": "tsv" }
  }
}

The manifest file must be either a JSON array of strings or a CSV/TSV file with a single column named url.

Relative URLs inside a urlsFromFile manifest are resolved relative to the manifest file itself, not the surrounding view specification.

GenomeSpy also supports gzip-compressed eager files transparently. URLs ending in .gz, .bgz, or .bgzf are decompressed automatically, and format inference uses the uncompressed extension. For example, variants.tsv.gz is treated as a TSV file.

Tabular Formats

GenomeSpy supports the following general-purpose eager formats:

  • "csv" for comma-separated text
  • "tsv" for tab-separated text
  • "dsv" for delimited text with a custom single-character delimiter
  • "json" for JSON arrays and nested JSON documents

In many cases, the format can be inferred from the file extension. For example, data.csv, data.tsv, and data.json are recognized automatically. If the extension is ambiguous or does not reflect the real contents, specify data.format.type explicitly.

In bioinformatics, files with a .csv extension are often actually tab-delimited. In such cases, use "tsv" explicitly instead of relying on the filename.

For "dsv", you must also specify the delimiter:

Example: Custom-delimited text
{
  "data": {
    "url": "data.txt",
    "format": {
      "type": "dsv",
      "delimiter": "|"
    }
  }
}

Headerless delimited files can provide field names explicitly with format.columns. When columns is present, GenomeSpy treats the first row as data instead of a header row.

Example: Headerless TSV with explicit columns
{
  "data": {
    "url": "samples.tsv",
    "format": {
      "type": "tsv",
      "columns": ["sample", "value"]
    }
  }
}

For "json", you can optionally use format.property to read a nested array from the loaded JSON document:

Example: Reading nested JSON data
{
  "data": {
    "url": "data.json",
    "format": {
      "type": "json",
      "property": "values.items"
    }
  }
}

Parsing Field Types

The format.parse property controls how field values are converted from text to typed values.

For "csv", "tsv", and "dsv", GenomeSpy enables automatic type inference by default. Numeric fields are converted to numbers, booleans to booleans, and so on. If you want full control over parsing, provide an explicit parse mapping. This is also faster, because GenomeSpy does not need to scan the data to infer field types. For large delimited files, explicit format.parse mappings are strongly recommended.

Example: Explicit parse mapping
{
  "data": {
    "url": "samples.tsv",
    "format": {
      "type": "tsv",
      "parse": {
        "sample_id": "string",
        "value": "number"
      }
    }
  }
}

When parse is an object, only the listed fields are converted. Other fields remain strings.

If parse is omitted for "csv", "tsv", and "dsv", GenomeSpy uses automatic type inference. If parse is set to null, type inference is disabled and delimited fields remain strings.

Named Data

Use data.name when the data should be supplied separately from the view specification.

Example: Named data source
{
  "data": {
    "name": "myResults"
  }
}

One way to provide named data is the root-level datasets property, which embeds reusable datasets in the root specification and exposes them by name.

Example: Root-level datasets
{
  "datasets": {
    "myResults": [
      { "x": 1, "y": 2 },
      { "x": 2, "y": 3 }
    ]
  },
  "data": {
    "name": "myResults"
  }
}

Another way is the JavaScript API, which can provide or update named data at runtime. Named datasets must be arrays. If a named dataset is not found, GenomeSpy treats it as empty data.

Sequence Generator

Use sequence to generate a numeric sequence instead of loading rows from an external source. This is useful for synthetic coordinates, repeated structures, and examples.

Example: Sequence generator
{
  "data": {
    "sequence": {
      "start": 0,
      "stop": 5,
      "step": 1,
      "as": "x"
    }
  }
}

This generates the rows:

[{ "x": 0 }, { "x": 1 }, { "x": 2 }, { "x": 3 }, { "x": 4 }]

start and stop are required. The stop value is exclusive, step defaults to 1, and the output field name defaults to "data". Sequence parameters may also use expression references.

Additional Formats

The following additional formats are supported by the eager url data source. Most bioinformatic data formats are supported through lazy data.

BED

BED parsing is based on @gmod/bed and supports BED3-BED12 fields. Output uses standard BED field names such as chrom, chromStart, chromEnd, and optional BED/BED12 fields.

Behavior details:

  • Leading browser, track, and # lines are skipped before the first data row.
  • BED12 rows use BED12 field names (for example thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts).
  • strand is normalized to numeric codes: "+" -> 1, "-" -> -1, and any other value -> 0.
  • blockSizes and blockStarts are parsed from comma-separated text into numeric arrays.
  • Non-BED12 extra columns are preserved as positional fallback fields (fieldN).

Example

{
  "description": "Browser Extendible Data (BED) example.",

  "assembly": "hg38",

  "data": {
    "url": "data/bedRegions.bed",
    "format": { "type": "bed" }
  },

  "mark": "rect",

  "encoding": {
    "x": { "chrom": "chrom", "pos": "chromStart", "type": "locus" },
    "x2": { "chrom": "chrom", "pos": "chromEnd" },
    "y": { "field": "name", "type": "nominal" },
    "color": { "field": "strand", "type": "nominal" }
  }
}

For larger BED files, consider using the lazy BigBed data source instead of the eager url source.

BEDPE

BEDPE is parsed as tab-delimited text with positional columns. It is commonly used for paired loci, such as structural variant breakpoints and link/arc annotations between two regions. The required leading fields are chrom1, start1, end1, chrom2, start2, end2; common optional fields are name, score, strand1, strand2. BEDPE files are typically headerless.

Behavior details:

  • Leading browser, track, and # lines are skipped before the first data row.
  • If a header row is present, it is detected when the first six columns match the required BEDPE prefix; headerless files use the default BEDPE column order or explicit names from format.columns.
  • Sentinel normalization: . -> null for string fields (chrom*, name); -1 and "" -> null for coordinates; "" -> null for score.
  • strand1 and strand2 are normalized to numeric codes: "+" -> 1, "-" -> -1, and any other value -> 0.
  • Extra trailing columns are preserved as positional fallback fields (fieldN, with 1-based indexing).
  • Rows with fewer than six columns are rejected with a parse error.

Example

{
  "description": "BEDPE example using link marks between paired breakpoints.",

  "assembly": "hg38",

  "data": {
    "url": "data/events.bedpe",
    "format": { "type": "bedpe" }
  },

  "mark": {
    "type": "link",
    "linkShape": "dome",
    "orient": "vertical",
    "clampApex": true
  },

  "encoding": {
    "x": { "chrom": "chrom1", "pos": "start1", "type": "locus" },
    "x2": { "chrom": "chrom2", "pos": "start2" },
    "y": { "field": "score", "type": "quantitative" },
    "size": { "value": 2 },
    "color": { "field": "name", "type": "nominal" }
  }
}

FASTA

The type of FASTA format is "fasta" as shown in the example below:

{
  "data": {
    "url": "16SRNA_Deino_87seq_copy.aln",
    "format": {
      "type": "fasta"
    }
  },
  ...
}

The FASTA loader produces data objects with two fields: identifier and sequence. With the "flattenSequence" transform you can split the sequences into individual bases (one object per base) for easier visualization.

Parquet

Apache Parquet is a column-oriented binary storage format designed for efficient analytics on large tables. Compared to row-oriented text formats, it usually provides better compression and faster column scans. For larger datasets, Parquet is strongly recommended over delimited text formats because it can be processed much more quickly. In GenomeSpy, Parquet is decoded into row objects similarly to "csv" and "tsv" formats supported by the url data source.

The type of Parquet format is "parquet":

{
  "data": {
    "url": "data.parquet",
    "format": {
      "type": "parquet"
    }
  }
}

Current constraints:

  • Only Snappy-compressed Parquet files are supported.
  • 64-bit integer values that require JavaScript BigInt are not supported.

Writing compatible Parquet files in R

The following example shows how to write compatible Parquet files in R using the arrow package:

library(arrow)
library(dplyr)

data.frame(x = 123L, y = 1.23) |>
  arrow_table() |>
  mutate(
    across(where(is.integer), ~ arrow::cast(., int32())),
    # Using float32 instead of double is optional but can significantly reduce file size
    across(where(is.double),  ~ arrow::cast(., float32()))
  ) |>
  write_parquet("data.parquet")

The implementation is based on hyparquet.