RO-Crate Generation Design
This page describes how Torc creates and updates automatic RO-Crate provenance entities in the current branch.
Current Model
The important identity rules are:
- Workflow plan entity: one per workflow, #torc-workflow
- Workflow run entity: one per run, #torc-run-id-{run_id}
- Torc software entities: one per run, #software-{binary_name}-run-id-{run_id}
- Job execution entities: one per job attempt, #job-{job_id}-attempt-{attempt_id}
- File entities: one per file record/path, updated in place across runs
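The identifier patterns in these rules can be sketched as small formatting helpers. This is an illustration rather than Torc's actual code: the function names and the integer id types are assumptions, and only the id patterns themselves come from this page.

```rust
// Hypothetical helpers; only the id patterns are taken from this page.

/// One workflow plan entity per workflow.
const WORKFLOW_ENTITY_ID: &str = "#torc-workflow";

/// One workflow run entity per run.
fn run_entity_id(run_id: i64) -> String {
    format!("#torc-run-id-{run_id}")
}

/// One software entity per binary per run.
fn software_entity_id(binary_name: &str, run_id: i64) -> String {
    format!("#software-{binary_name}-run-id-{run_id}")
}

/// One job execution entity per job attempt.
fn job_entity_id(job_id: i64, attempt_id: i64) -> String {
    format!("#job-{job_id}-attempt-{attempt_id}")
}

/// File entities are identified by path only: no run_id appears in the id.
fn file_entity_id(path: &str) -> String {
    path.to_string()
}
```

Only the file helper has no run or attempt parameter, which is the property the rest of this page relies on.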
That last point is why build_file_entity() does not take run_id. Plain file entities are not
modeled as run-scoped records. Run-scoped provenance is attached through relationships:
- Output files link to the workflow run with prov:wasAttributedTo
- Output files link to the producing job with prov:wasGeneratedBy
- Job CreateAction entities link to the run with isPartOf
- Job CreateAction entities link to software agents with instrument and prov:wasAssociatedWith
If run_id were written directly into the base file entity metadata again, it would mix a stable
file identity with run-specific state. The current code instead keeps file identity stable and
updates the same file entity as a file moves from "input known at initialization" to "output with
provenance after job completion".
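A minimal sketch of that transition, assuming a simplified in-memory model: the struct, its field names, and both functions are hypothetical stand-ins (the real build_file_entity() signature is not shown on this page). The point is only that the identity stays fixed while the run-scoped relationships are overwritten.

```rust
// Illustrative model only; the type, field, and function names are assumptions.

#[derive(Debug, Clone, PartialEq)]
struct FileEntity {
    /// Stable identity: the file path. No run_id is stored here.
    entity_id: String,
    /// prov:wasGeneratedBy -> job CreateAction id (set for outputs only).
    was_generated_by: Option<String>,
    /// prov:wasAttributedTo -> workflow run id (set for outputs only).
    was_attributed_to: Option<String>,
    /// prov:wasDerivedFrom -> input file paths (set for outputs only).
    was_derived_from: Vec<String>,
}

/// Build the base entity. The signature deliberately has no run_id parameter.
fn base_file_entity(path: &str) -> FileEntity {
    FileEntity {
        entity_id: path.to_string(),
        was_generated_by: None,
        was_attributed_to: None,
        was_derived_from: Vec::new(),
    }
}

/// Attach run-scoped provenance as relationships on the same entity.
fn attach_output_provenance(
    entity: &mut FileEntity,
    run_id: i64,
    job_id: i64,
    attempt_id: i64,
    inputs: &[&str],
) {
    entity.was_generated_by = Some(format!("#job-{job_id}-attempt-{attempt_id}"));
    entity.was_attributed_to = Some(format!("#torc-run-id-{run_id}"));
    entity.was_derived_from = inputs.iter().map(|p| p.to_string()).collect();
}
```

Calling attach_output_provenance a second time for a later run changes the relationships but leaves entity_id untouched.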
This design is also consistent with the multi-run behavior covered by
test_auto_ro_crate_second_run_replaces_entities: file entities are replaced in place, while
software and job execution entities accumulate across runs and attempts.
Entity Creation Flow
flowchart TD
A[Workflow initialize_jobs] --> B{enable_ro_crate?}
A --> C[Server creates<br/>#software-torc-server-run-id-N]
A --> D[Client attempts to create<br/>#software-torc-run-id-N<br/>and optional<br/>#software-torc-slurm-job-runner-run-id-N]
B -->|yes| E[Server upserts input File entities<br/>from DB rows with st_mtime]
B -->|yes| F[Client creates or updates<br/>#torc-workflow and #torc-run-id-N]
B -->|yes| G[Client creates or updates<br/>input File entities]
B -->|no| H[No automatic file provenance]
G --> I[Workflow execution]
E --> I
F --> I
C --> I
D --> I
I --> J[Job completes successfully]
J --> J2{Job has output files?}
J2 -->|yes| K[Client refreshes<br/>#torc-workflow and #torc-run-id-N]
J2 -->|yes| L[Client creates<br/>#job-job_id-attempt-attempt_id]
J2 -->|yes| M[Client creates or updates<br/>output File entity]
J2 -->|no| P[No additional automatic<br/>RO-Crate entities for this job]
L --> N[Job CreateAction metadata]
N --> N1[prov:hadPlan -> #torc-workflow]
N --> N2[isPartOf -> #torc-run-id-N]
N --> N3[instrument -> #software-torc-run-id-N]
N --> N4[prov:used -> input file paths]
N --> N5[result -> output file paths]
M --> O[Output File metadata]
O --> O1[prov:wasGeneratedBy -> job CreateAction]
O --> O2[prov:wasAttributedTo -> #torc-run-id-N]
O --> O3[prov:wasDerivedFrom -> input file paths]
classDef init fill:#dbeafe,stroke:#1d4ed8,color:#0f172a,stroke-width:2px;
classDef software fill:#dcfce7,stroke:#15803d,color:#0f172a,stroke-width:2px;
classDef input fill:#fef3c7,stroke:#b45309,color:#0f172a,stroke-width:2px;
classDef run fill:#ede9fe,stroke:#6d28d9,color:#0f172a,stroke-width:2px;
classDef job fill:#fee2e2,stroke:#b91c1c,color:#0f172a,stroke-width:2px;
classDef output fill:#cffafe,stroke:#0f766e,color:#0f172a,stroke-width:2px;
classDef disabled fill:#e5e7eb,stroke:#4b5563,color:#111827,stroke-dasharray: 5 3;
class A,I,J init;
class C,D software;
class E,G input;
class F,K run;
class L,N,N1,N2,N3,N4,N5 job;
class M,O,O1,O2,O3 output;
class H,P disabled;
What Gets Created
Torc binaries
- The server always creates #software-torc-server-run-id-{run_id} during initialize_jobs()
- The client attempts to create run-scoped software entities for torc and, on Linux, torc-slurm-job-runner
- Client-side software entities are skipped when the corresponding binary cannot be found next to the current executable or on PATH
- These are SoftwareApplication plus prov:SoftwareAgent
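The client-side lookup rule can be sketched with the standard library alone. The function names are hypothetical, and checking the executable's directory before PATH is an assumption; the page only states that both locations are searched. The existence check is passed in as a closure so the search logic can be exercised without touching the filesystem.

```rust
use std::path::{Path, PathBuf};

/// Look for `binary_name` next to the current executable first, then in
/// each PATH directory. Returns None when it is not found, in which case
/// the caller would skip creating the software entity.
fn find_binary(
    binary_name: &str,
    exe_dir: Option<&Path>,
    path_dirs: &[PathBuf],
    exists: impl Fn(&Path) -> bool,
) -> Option<PathBuf> {
    exe_dir
        .into_iter()
        .map(Path::to_path_buf)
        .chain(path_dirs.iter().cloned())
        .map(|dir| dir.join(binary_name))
        .find(|candidate| exists(candidate.as_path()))
}

/// Wire the lookup to the real process environment.
fn find_binary_in_env(binary_name: &str) -> Option<PathBuf> {
    let exe = std::env::current_exe().ok();
    let exe_dir = exe.as_deref().and_then(Path::parent);
    let path_dirs: Vec<PathBuf> = std::env::var_os("PATH")
        .map(|p| std::env::split_paths(&p).collect())
        .unwrap_or_default();
    find_binary(binary_name, exe_dir, &path_dirs, |p| p.is_file())
}
```

A caller would create the software entity only when the lookup returns Some, and would attempt torc-slurm-job-runner only on Linux.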
Jobs
- The client creates one CreateAction per successful job completion that has at least one output file
- The entity id is #job-{job_id}-attempt-{attempt_id}
- Jobs with no output files currently do not emit an automatic CreateAction
- When present, the job entity is the main join point between inputs, outputs, workflow run, and software agents
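The job rules above, combined with the relationships shown in the flowchart, can be sketched as a constructor that returns nothing when a job has no outputs. The struct and function names are assumptions for illustration, and the real entity is RO-Crate JSON rather than a Rust struct.

```rust
// Illustrative only: the struct and function names are assumptions.

#[derive(Debug, PartialEq)]
struct JobCreateAction {
    entity_id: String,   // #job-{job_id}-attempt-{attempt_id}
    had_plan: String,    // prov:hadPlan -> workflow plan entity
    is_part_of: String,  // isPartOf -> workflow run entity
    instrument: String,  // instrument -> client software entity
    used: Vec<String>,   // prov:used -> input file paths
    result: Vec<String>, // result -> output file paths
}

/// Build the CreateAction for a successful job attempt.
/// Returns None when the job has no output files, mirroring the rule
/// that such jobs do not emit an automatic CreateAction.
fn build_job_action(
    job_id: i64,
    attempt_id: i64,
    run_id: i64,
    inputs: &[String],
    outputs: &[String],
) -> Option<JobCreateAction> {
    if outputs.is_empty() {
        return None;
    }
    Some(JobCreateAction {
        entity_id: format!("#job-{job_id}-attempt-{attempt_id}"),
        had_plan: "#torc-workflow".to_string(),
        is_part_of: format!("#torc-run-id-{run_id}"),
        instrument: format!("#software-torc-run-id-{run_id}"),
        used: inputs.to_vec(),
        result: outputs.to_vec(),
    })
}
```

Returning Option makes the "no outputs, no entity" case explicit at the call site instead of producing an empty action.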
Input files
- Input files are detected by st_mtime IS NOT NULL
- During initialization, both the server and the client currently upsert the same input file entity
- The entity is keyed by workflow and file_id, with entity_id = file.path
- Input file entities are expected to exist before jobs run, but the code does not rely on them being create-only; it is intentionally upsert-based
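Because both the server and the client write the same input entity, the write has to be idempotent with respect to row count. A minimal in-memory sketch of that upsert, assuming a map keyed by (workflow_id, file_id); the store type, its fields, and the method name are hypothetical and stand in for the database table.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct EntityRow {
    entity_id: String, // equals the file path
    metadata: String,  // placeholder for the JSON metadata
}

/// Toy stand-in for the entity table, keyed by (workflow_id, file_id).
#[derive(Default)]
struct EntityStore {
    rows: HashMap<(i64, i64), EntityRow>,
}

impl EntityStore {
    /// Insert or replace the row for (workflow_id, file_id).
    /// Returns true when a new row was created, false when one was replaced.
    fn upsert_file_entity(
        &mut self,
        workflow_id: i64,
        file_id: i64,
        path: &str,
        metadata: &str,
    ) -> bool {
        let row = EntityRow {
            entity_id: path.to_string(),
            metadata: metadata.to_string(),
        };
        self.rows.insert((workflow_id, file_id), row).is_none()
    }
}
```

In this model the second writer replaces the first writer's row, so the order of the server and client upserts does not change how many entities exist.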
Output files
- Output file entities are created or replaced after a job succeeds and the file record has been refreshed
- If a file already had an entity from initialization or a prior run, the same DB row is updated rather than creating a new file entity for each run
- Run-specific provenance is recorded in the metadata relationships, not by giving the file entity a run-specific identity
- The same successful-job path also refreshes #torc-workflow, refreshes #torc-run-id-{run_id}, and creates the job CreateAction, but only when there is at least one output file to process
Important Asymmetries
- Software entities are run-scoped and accumulate across runs
- Job CreateAction entities are attempt-scoped and accumulate across attempts
- File entities are file-scoped and are replaced in place across runs
These asymmetries are intentional and match tests/test_auto_ro_crate.rs, especially
test_auto_ro_crate_second_run_replaces_entities, which expects:
- file entity count to stay stable across runs
- software entity count to grow across runs
- output file provenance to point at the newer #torc-run-id-{run_id}
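These three expectations follow directly from how the entity ids are formed. The toy model below is not the actual test: it maps each entity id to the run it currently points at, and the type alias and function names are assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// Toy model: entity id -> run id the entity currently points at.
type Entities = BTreeMap<String, i64>;

/// Record the entities one run would write for a single output file.
fn simulate_run(entities: &mut Entities, run_id: i64, output_path: &str) {
    // Software entity: the id contains run_id, so each run adds a new key.
    entities.insert(format!("#software-torc-run-id-{run_id}"), run_id);
    // File entity: the id is the path, so the same key is overwritten.
    entities.insert(output_path.to_string(), run_id);
}

/// Count software entities by their id prefix.
fn count_software(entities: &Entities) -> usize {
    entities.keys().filter(|id| id.starts_with("#software-")).count()
}

/// Count file entities: everything that is not a '#'-prefixed id.
fn count_files(entities: &Entities) -> usize {
    entities.keys().filter(|id| !id.starts_with('#')).count()
}
```

After two simulated runs against the same output path, the software count is two, the file count is one, and the file entry points at the second run.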