How to Add RO-Crate Metadata
Store provenance information about simulation input/output data using
Research Object Crates (RO-Crate). Torc lets you attach
JSON-LD metadata entities to a workflow and export them as a valid ro-crate-metadata.json
document.
Automatic Entity Generation
The easiest way to add RO-Crate metadata is to enable automatic generation. Set
enable_ro_crate: true in your workflow specification:
name: my_workflow
user: researcher
enable_ro_crate: true
files:
- name: input_data
path: data/input.csv # Must exist on disk when workflow is created
- name: output_data
path: data/output.csv # Will be created by the job
jobs:
- name: process
command: python process.py
input_files: [input_data]
output_files: [output_data]
Torc automatically detects input vs output files by checking if each file exists on disk when the workflow is created. Files that exist get their modification time recorded.
When automatic generation is enabled:
- Input files (files that exist on disk) get File entities created during workflow initialization
- Output files get File entities with provenance (
wasGeneratedBy) created when jobs complete - Jobs get CreateAction entities linking to their output files
After running the workflow, export the metadata:
torc ro-crate export 123 -o ro-crate-metadata.json
The exported document includes complete provenance:
{
"@id": "data/output.csv",
"@type": "File",
"name": "output.csv",
"encodingFormat": "text/csv",
"wasGeneratedBy": { "@id": "#job-1-attempt-1" }
}
Manual Entity Creation
For additional metadata (external software, custom properties), use manual commands.
Quick Start
# Add an entity describing an output file
torc ro-crate create 123 \
--entity-id "data/output.parquet" \
--type File \
--metadata '{"name": "Simulation Output", "encodingFormat": "application/x-parquet"}'
# Export all entities as an RO-Crate metadata document
torc ro-crate export 123 -o ro-crate-metadata.json
Core Concepts
Each RO-Crate entity has:
| Field | Description |
|---|---|
entity_id | The JSON-LD @id (e.g., "data/output.parquet", a URL) |
type | The Schema.org @type (e.g., "File", "Dataset", "SoftwareApplication") |
metadata | A JSON string containing additional JSON-LD properties |
file_id | Optional link to a Torc file record |
Entities are stored per-workflow. The export command assembles them into a complete RO-Crate
document with the required metadata descriptor and root dataset.
Creating Entities
File entity
Describe a single output file:
torc ro-crate create 123 \
--entity-id "results/summary.csv" \
--type File \
--metadata '{"name": "Summary", "encodingFormat": "text/csv"}'
Directory entity (Hive-partitioned data)
For directories with many files (like hive-partitioned Parquet datasets), use the add-dataset
command instead of creating entities manually. This automatically computes file count, total size,
and an integrity hash:
torc ro-crate add-dataset 123 \
--name partitioned_table \
--path data/partitioned_table/ \
--hash-mode manifest \
--encoding-format "application/vnd.apache.parquet"
See Adding Dataset Entities below for full details.
External software entity
Record which software produced the data (no --file-id needed):
torc ro-crate create 123 \
--entity-id "https://example.com/simulation/v2.1" \
--type SoftwareApplication \
--metadata '{"name": "My Simulation", "version": "2.1.0"}'
Link to a Torc file record
If the entity corresponds to a Torc file, link them with --file-id:
torc ro-crate create 123 \
--entity-id "output.csv" \
--type File \
--file-id 42 \
--metadata '{"name": "Output CSV"}'
Read metadata from stdin
For large metadata objects, pipe from a file:
torc ro-crate create 123 \
--entity-id "data/model.h5" \
--type File \
--metadata - < metadata.json
Adding Dataset Entities
For directory-based outputs (like hive-partitioned Parquet datasets), the add-dataset command
creates a Dataset entity with computed statistics and integrity hash.
Basic Usage
torc ro-crate add-dataset 123 \
--name training_output \
--path output/training.parquet/
This walks the directory, counts files, sums sizes, computes a manifest hash, and creates:
{
"@id": "output/training.parquet/",
"@type": "Dataset",
"name": "training_output",
"contentSize": 15032385536,
"fileCount": 2847,
"sha256": "7cbcd407fae0631505a1fe289356ee07c8825e41e9441fafca44c001bd6ce75d",
"hashMode": "manifest"
}
Hash Modes
Choose a hash mode based on your needs:
# Manifest hash (default) - fast, detects structural changes
torc ro-crate add-dataset 123 --name output --path data/ --hash-mode manifest
# Content hash - thorough but slow, detects any content change
torc ro-crate add-dataset 123 --name output --path data/ --hash-mode content
# No hash - fastest, only counts files and sizes
torc ro-crate add-dataset 123 --name output --path data/ --hash-mode none
| Mode | What it hashes | When to use |
|---|---|---|
manifest | Sorted list of (path, size, mtime) | Large datasets, structural integrity |
content | All file contents (Merkle tree) | Small datasets, content verification |
none | Nothing | Very large datasets, stats only |
Parallel Processing
For large directories, use multiple threads to speed up content hashing:
# Use 8 threads for content hashing
torc ro-crate add-dataset 123 \
--name training_output \
--path output/training.parquet/ \
--hash-mode content \
--threads 8
By default, the command uses all available CPU cores. The --threads option is most useful for
content mode where file I/O is the bottleneck.
Full Example
torc ro-crate add-dataset 123 \
--name simulation_results \
--path output/results.parquet/ \
--hash-mode manifest \
--description "Hive-partitioned simulation output with 100 partitions" \
--encoding-format "application/vnd.apache.parquet"
Output:
Computing dataset statistics for: output/results.parquet/ (using 8 threads)
Files: 2847, Size: 15032385536 bytes
Hash (manifest): 7cbcd407fae0631505a1fe289356ee07c8825e41e9441fafca44c001bd6ce75d
Created RO-Crate Dataset entity with ID: 42
When to Use add-dataset vs create
| Scenario | Command |
|---|---|
| Directory with many files | add-dataset |
| Need file count and total size | add-dataset |
| Need integrity hash | add-dataset |
| Single file | create |
| External URL or software | create |
| Custom metadata only | create |
Listing and Viewing Entities
# List all entities for a workflow
torc ro-crate list 123
# Get a specific entity with full metadata
torc ro-crate get 1
# JSON output for scripting
torc -f json ro-crate list 123
Updating Entities
Update individual fields of an existing entity:
# Change the type
torc ro-crate update 1 --type Dataset
# Update metadata
torc ro-crate update 1 --metadata '{"name": "Updated Name"}'
# Unlink from a file (set file_id to 0)
torc ro-crate update 1 --file-id 0
Deleting Entities
# Delete a single entity
torc ro-crate delete 1
Entities are also automatically deleted when their parent workflow is deleted (cascade delete).
Exporting an RO-Crate Document
The export command assembles all entities into a valid
RO-Crate 1.1 metadata document:
# Write to file
torc ro-crate export 123 -o ro-crate-metadata.json
# Write to stdout
torc ro-crate export 123
The exported document has this structure:
{
"@context": "https://w3id.org/ro/crate/1.1/context",
"@graph": [
{
"@id": "ro-crate-metadata.json",
"@type": "CreativeWork",
"about": {"@id": "./"},
"conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"}
},
{
"@id": "./",
"@type": "Dataset",
"name": "my_workflow",
"hasPart": [
{"@id": "data/output.parquet"},
{"@id": "https://example.com/simulation/v2.1"}
]
},
{
"@id": "data/output.parquet",
"@type": "File",
"name": "Simulation Output",
"encodingFormat": "application/x-parquet"
},
{
"@id": "https://example.com/simulation/v2.1",
"@type": "SoftwareApplication",
"name": "My Simulation",
"version": "2.1.0"
}
]
}
The @id and @type fields are always set from the entity record, overriding any values in the
metadata JSON.
Workflow Export/Import
RO-Crate entities are included in workflow exports (torc workflows export) and restored during
imports (torc workflows import). File ID links are remapped automatically.
See Also
- RO-Crate Provenance Tracking — Concept overview
- RO-Crate Specification — Official documentation