RO-Crate Provenance Tracking
Torc supports Research Object Crate (RO-Crate), a community standard for packaging research data with machine-readable metadata. This enables tracking of data provenance—knowing which jobs produced which outputs, what inputs they consumed, and when the data was created.
What is RO-Crate?
RO-Crate is a lightweight approach to packaging research data with JSON-LD metadata. It provides:
- Standardized metadata format — Compatible with Schema.org and linked data tools
- Provenance tracking — Records how data was produced and transformed
- Interoperability — Works with repositories, archives, and other research tools
How Torc Uses RO-Crate
Torc stores RO-Crate entities per workflow. Each entity describes a file, dataset, software, or other research object with JSON-LD properties. Entities can be:
- Created always — SoftwareApplication entities for the torc binaries are recorded during
workflow initialization, regardless of the
enable_ro_cratesetting. This ensures every workflow has basic software provenance. - Created automatically when
enable_ro_crate: trueis set on a workflow — file and job provenance entities - Created manually using the
torc ro-crate createcommand - Exported as a standard
ro-crate-metadata.jsondocument
Automatic Entity Generation
Always recorded (all workflows)
During workflow initialization, Torc creates SoftwareApplication entities for the torc binaries (server, job runner, etc.) that processed the workflow. These record the software name and version, providing a baseline provenance record even when full RO-Crate tracking is not enabled.
When enable_ro_crate: true
When you enable RO-Crate on a workflow, Torc additionally creates file and job provenance entities:
During workflow initialization:
- File entities are created for all input files (files that exist on disk)
- Entities include MIME type inference, file size, and modification date
When jobs complete successfully:
- File entities are created for all output files
- CreateAction entities are created for each job (provenance)
- Output files are linked to their producing job via
wasGeneratedBy
This creates a complete provenance graph linking inputs → jobs → outputs.
Entity Structure
Automatically generated File entities include:
{
"@id": "data/output.csv",
"@type": "File",
"name": "output.csv",
"encodingFormat": "text/csv",
"contentSize": 1024,
"sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
"dateModified": "2024-01-01T00:00:00Z",
"wasGeneratedBy": { "@id": "#job-42-attempt-1" }
}
File entities include SHA256 hashes for integrity verification when the file is readable.
Job provenance is captured as CreateAction entities:
{
"@id": "#job-42-attempt-1",
"@type": "CreateAction",
"name": "process_data",
"instrument": { "@id": "#workflow-123" },
"result": [{ "@id": "data/output.csv" }]
}
Enabling Automatic RO-Crate
Add enable_ro_crate: true to your workflow specification:
name: my_workflow
user: researcher
enable_ro_crate: true
files:
- name: input_data
path: data/input.csv # Must exist on disk when workflow is created
- name: output_data
path: data/output.csv # Will be created by the job
jobs:
- name: process
command: python process.py
input_files: [input_data]
output_files: [output_data]
Torc automatically detects input vs output files by checking if each file exists on disk when the workflow is created. Files that exist are marked as inputs; files that don't exist are outputs.
After running this workflow:
input_datawill have an RO-Crate File entity (created during initialization)output_datawill have an RO-Crate File entity withwasGeneratedBylinking to the job- A CreateAction entity will describe the
processjob execution
Dataset Entities for Directories
Many workflows produce directory-based outputs rather than single files—for example, hive-partitioned Parquet datasets with thousands of files. For these, use Dataset entities instead of File entities.
Why Datasets?
- Efficiency — One metadata record instead of thousands of File entities
- Appropriate granularity — The directory is the meaningful unit, not individual partition files
- Integrity verification — Manifest-based hashing detects changes without reading all file contents
Dataset Structure
Dataset entities include file count, total size, and an optional hash:
{
"@id": "output/training.parquet/",
"@type": "Dataset",
"name": "training_output",
"description": "Hive-partitioned training results",
"contentSize": 15032385536,
"fileCount": 2847,
"sha256": "a1b2c3...",
"hashMode": "manifest",
"encodingFormat": "application/vnd.apache.parquet"
}
Hash Modes
Torc supports three hash modes for datasets:
| Mode | Description | Speed | Detects |
|---|---|---|---|
manifest | Hash of sorted (path, size, mtime) list | Fast | File additions, deletions, renames |
content | SHA256 of all file contents (Merkle tree) | Slow | Any content change |
none | No hash, only file count and size | Fastest | Nothing (stats only) |
For large datasets, manifest mode provides a good balance—it detects structural changes without
the I/O cost of reading terabytes of data.
Creating Dataset Entities
Use the add-dataset command to create a Dataset entity for a directory:
torc ro-crate add-dataset \
--workflow-id 123 \
--name training_output \
--path output/training.parquet/ \
--hash-mode manifest
See How to Add RO-Crate Metadata for detailed usage.
When to Use RO-Crate
RO-Crate is valuable when you need to:
- Track data lineage — Know which jobs produced each output
- Archive workflows — Export metadata with your results for long-term storage
- Share reproducible research — Provide machine-readable provenance to collaborators
- Meet compliance requirements — Document data processing for audits or regulations
Comparison: Automatic vs Manual
| Feature | Automatic (enable_ro_crate) | Manual (torc ro-crate create) |
|---|---|---|
| Input files | Created on initialization | Must create manually |
| Output files | Created on job completion | Must create manually |
| Job provenance | CreateAction entities | Must create manually |
| Custom metadata | Basic (name, type, size, date) | Full control over properties |
| External entities | Not created | Can add software, datasets, etc |
For most workflows, enable automatic generation and add manual entities only for external references (software versions, related datasets, etc.).
See Also
- How to Add RO-Crate Metadata — Step-by-step guide
- RO-Crate Specification — Official documentation