SQR-112
Docverse documentation hosting platform design#
Abstract
For a decade, Rubin Observatory has hosted its documentation sites with its LSST the Docs service. That service provided excellent capabilities and performance, but its implementation is now out of step with our current application design practices, and some long-standing bugs and missing features have been difficult to fix or retrofit. Docverse is the next iteration of LSST the Docs that retains the qualities of hosting versioned, static web documentation, but resolves the feature gaps, performance issues, and bugs we experienced with LSST the Docs.
Introduction#
Rubin Observatory’s technical documentation has long followed the docs-like-code model where documentation is authored in Git repositories, often alongside and in conjunction with code. The LSST the Docs (LTD) application (SQR-006) has played a critical role in this ecosystem by providing a platform for hosting versioned documentation sites. For users, LTD’s design ensures documentation is served directly through a CDN and object store hosted in the public cloud, so that performance and reliability are excellent even under high load and aren’t affected by application-level issues. For project staff, LTD provides a seamless experience for integrating documentation hosting with their development and deployment workflows. When new branches or tags are pushed to a project, LTD creates documentation editions for those corresponding versions automatically.
Lessons learned from LTD#
After a decade of operating LTD, though, we have identified a number of areas where we either wish to improve the platform or need to resolve issues and limitations in the existing implementation:
The LTD codebase is a Flask (synchronous) Python application, whereas we now build applications with FastAPI and use asyncio throughout our codebases.
Edition updates need to be faster, near instant, and never produce 404s for users. With LTD, an edition update for a large documentation site could take nearly 15 minutes.
We need greater flexibility in how LTD is configured and operates. For example, publishing projects as subpaths rather than subdomains.
We need to be able to update and refine the project edition dashboards more easily.
Build uploads are slow for large documentation sites.
We need to be able to purge outdated draft editions and help projects ensure that their readers are using the default edition unless they explicitly choose a different one.
From LTD to Docverse#
Around 2022 we began work on a second version of LTD that filled some of these gaps. However, in the scope of that work we wanted to retain the existing Flask codebase and maintain compatibility with the existing API, with only a limited set of new API endpoints. At this point, we've realized that a more comprehensive reimplementation is needed to fully address the design goals. Rather than maintain compatibility with LTD, we will migrate existing documentation sites and projects to the new platform.
Key Docverse features and changes from LTD#
Implementation with FastAPI and Safir
Queue system built on a backend-agnostic abstraction layer, with Arq (via Safir) and Redis as the initial implementation, replacing the Celery system in LTD. The abstraction enables future evaluation of alternative queue backends without disrupting application logic.
Works with Gafaelfawr tokens for authentication and group membership, replacing the custom token system in LTD
Organization models to support multiple organizations with separate documentation domains and configurations hosted from the same Docverse instance
Support for Cloudflare to provide instant edition updates.
Improved build upload performance by allowing for multipart tarball uploads rather than requiring clients to upload individual files.
Support for deleting draft editions, including automation with GitHub webhooks to delete draft editions for deleted branches.
Edition dashboards are built from templates that are synchronized from GitHub so that the edition dashboards are easier to update and customizable by organization members.
Support for the `versions.json` files used by pydata-sphinx-theme and other documentation themes to power client-side edition switching.
Organization of this technote#
This technote dives into the design of Docverse at a fairly technical level to provide a record of design decisions and a reference for implementation.
Organizations discusses the organization model, including how organization configuration is supported by the API and database schemas.
Authentication and authorization describes how Docverse uses Gafaelfawr both to protect API endpoints for different roles, and to provide user and group information for fine-grained access control.
Projects, editions, and builds describes the core data model that Docverse carries over from LTD, but with substantial improvements to capabilities and performance.
Documentation hosting explores CDN and edge compute architectures for serving documentation, explains how pointer mode eliminates the S3 copy-on-publish bottleneck, and how organizations configure hosting infrastructure.
Dashboard templating system describes the new dashboard templating system that allows edition dashboards to be built from templates stored in GitHub repositories.
Code architecture describes Docverse’s layered architecture, factory pattern for multi-tenant client construction, protocol-based abstractions for object stores and CDN providers, and the client-server monorepo structure including the Python client library and CLI.
Queue system describes the queue system for processing edition updates and build uploads, built on a backend-agnostic abstraction with Arq as the initial implementation.
REST API design describes the API design and schema definitions for Docverse.
GitHub Actions action describes the native JavaScript GitHub Action for uploading documentation builds from GitHub Actions workflows.
Migration from LSST the Docs covers the data and client migration plan for moving existing LTD deployments to Docverse, including migration tooling, phased rollout, and risk mitigation.
Organizations design#
With organizations, a single Docverse deployment can host multiple documentation domains. Doing so can enable a single institution (like AURA or NOIRLab) to host documentation for multiple missions with completely separate white-labelled presentations. It can also enable SQuaRE to provide Docverse-as-a-service for partner institutions.
The Organization is the sole infrastructure configuration boundary — all projects within an org share the same object store, CDN, root domain, URL scheme, and default dashboard templates.
Docverse follows a “bring your own infrastructure” strategy: each organization provisions and owns its cloud resources (object store buckets, CDN services, DNS zones) rather than Docverse providing centralized, shared infrastructure. This design is motivated by three concerns. First, cost allocation — organizations pay for and control their own cloud spend directly, avoiding the need for Docverse to meter usage or redistribute costs. Second, data ownership — organizations retain full ownership of their stored documentation artifacts; Docverse itself only stores connection metadata and encrypted credentials, never the data at rest. Third, regulatory compliance — some organizations face restrictions such as ITAR export controls that dictate where documentation can be hosted and who can access the underlying storage, requirements that are simplest to satisfy when the organization controls its own accounts.
Organization configuration#
Each organization owns:
Object store: bucket name, credentials, provider (AWS S3, GCS, generic S3-compatible, or Cloudflare R2). The bucket is provisioned externally by the org admin; Docverse stores connection details.

Staging store (optional): a separate object store bucket used for build tarball uploads, configured with the same shape as the publishing object store (provider, bucket, credentials). When configured, presigned upload URLs point to the staging bucket and the Docverse worker reads tarballs from it. When not configured, staging uses a `__staging/` prefix in the publishing bucket. The staging store optimization is useful when the publishing store is on a different network than the Docverse compute cluster – for example, a GCS staging bucket in the same region as a GKE cluster paired with a Cloudflare R2 publishing bucket. The staging bucket only needs transient storage; tarballs are deleted after processing.

CDN: provider choice (Fastly, Cloudflare Workers, Google Cloud CDN), service ID, API keys for cache purging. The CDN is provisioned externally; Docverse only interacts at runtime for cache invalidation and edge data store updates.

DNS: for subdomain-based layouts, Docverse registers subdomains via DNS APIs (e.g., Route 53, Cloudflare DNS). When using Cloudflare, wildcard subdomains are supported on all plans via a proxied `*.domain` DNS record with free wildcard SSL.

URL scheme (per-org setting, one of):

Subdomain: each project gets `project.base-domain` (e.g., `sqr-006.lsst.io`).

Path-prefix: all projects under a root path (e.g., `example.com/documentation/project`).

Base domain (e.g., `lsst.io`) and root path prefix (for path-prefix mode).

Dashboard templates: a GitHub repo containing Jinja templates and assets, configured at the org level with optional per-project overrides. See the Dashboard templating System section for full details.
Edition slug rewrite rules: an ordered list of rules that transform git refs into edition slugs. Configured at the org level with optional per-project overrides. See the Edition slug rewrite rules section for the full rule format.
Default edition lifecycle rules (Projects, editions and builds).
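To make the two URL schemes concrete, here is a minimal sketch of how a project's root URL could be computed. The function and parameter names are illustrative, not part of the Docverse API.

```python
def project_root_url(
    scheme: str,
    base_domain: str,
    project: str,
    root_path: str = "",
) -> str:
    """Compute a project's root URL under either URL scheme.

    "subdomain" serves each project at project.base-domain, while
    "path_prefix" nests projects under a root path on the base
    domain. This is a sketch, assuming root_path starts with "/".
    """
    if scheme == "subdomain":
        return f"https://{project}.{base_domain}/"
    return f"https://{base_domain}{root_path}/{project}/"
```

For example, the subdomain scheme yields `https://sqr-006.lsst.io/` for project `sqr-006` on base domain `lsst.io`.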
Credential storage#
Docverse encrypts organization credentials at rest using Fernet symmetric encryption from the cryptography library. Fernet provides AES-128-CBC encryption with HMAC-SHA256 authentication — ciphertext is tamper-evident and self-describing (the token embeds a timestamp and version byte). Encryption and decryption are in-process CPU-bound operations (sub-millisecond), requiring no external service calls or network round-trips. A single Fernet key is stored as a Kubernetes secret, never in the database, so database backups alone cannot decrypt credentials.
This approach avoids the operational complexity of Vault Transit (running a Vault instance, configuring Kubernetes auth, managing Vault policies, network round-trips for every encrypt/decrypt) for what amounts to encrypting a small number of short API tokens and keys. The cryptography library is already a transitive dependency via Safir.
Key provisioning#
The Fernet encryption key is provisioned through Phalanx’s standard secrets management. In the application’s secrets.yaml:
```yaml
credential-encryption-key:
  description: >-
    Fernet key for encrypting organization credentials at rest.
  generate:
    type: fernet-key
```
Phalanx auto-generates the key, stores it in 1Password, and syncs it to a Kubernetes Secret. The key never appears in the database.
Key loading#
At startup, the application loads the encryption key from environment variables sourced from the Kubernetes Secret:
`DOCVERSE_CREDENTIAL_ENCRYPTION_KEY` — the current primary Fernet key.

`DOCVERSE_CREDENTIAL_ENCRYPTION_KEY_RETIRED` (optional) — a retired key, present only during rotation periods.
When both keys are present, Docverse constructs a MultiFernet([Fernet(primary), Fernet(retired)]). MultiFernet tries decryption with each key in order, so credentials encrypted under either key are readable, while new encryptions always use the primary key. When only the primary key is present, Docverse still wraps it in MultiFernet([Fernet(primary)]) to provide a uniform interface.
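A minimal sketch of this key-loading logic, using the environment variable names above (the helper function name is illustrative):

```python
import os

from cryptography.fernet import Fernet, MultiFernet


def load_encryptor_keys() -> MultiFernet:
    """Build a MultiFernet from the primary and optional retired key.

    MultiFernet encrypts with the first key and tries each key in
    order for decryption, so credentials encrypted under either key
    remain readable during a rotation period.
    """
    primary = os.environ["DOCVERSE_CREDENTIAL_ENCRYPTION_KEY"]
    retired = os.environ.get("DOCVERSE_CREDENTIAL_ENCRYPTION_KEY_RETIRED")
    keys = [Fernet(primary)]
    if retired:
        keys.append(Fernet(retired))
    return MultiFernet(keys)
```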
Database schema#
Fernet tokens are self-describing (they embed a version byte, timestamp, IV, and HMAC), so the database schema needs no separate columns for nonces, key versions, or algorithm metadata:
CREATE TABLE organization_credentials (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
organization_id UUID NOT NULL REFERENCES organizations(id),
label TEXT NOT NULL,
service_type TEXT NOT NULL,
encrypted_credential TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE(organization_id, label)
);
The label is a human-friendly name (e.g., “Cloudflare R2 production”). The service_type identifies the provider (e.g., cloudflare, aws_s3, fastly). Credentials are write-only through the API — the GET response returns metadata (label, service type, timestamps) but never the decrypted value. See Database models for all database tables and organization_credentials for the column reference.
Python integration#
CredentialEncryptor is a thin wrapper around MultiFernet that handles str↔bytes encoding:
```python
from cryptography.fernet import Fernet, MultiFernet


class CredentialEncryptor:
    """Encrypt and decrypt organization credentials using Fernet."""

    def __init__(
        self,
        primary_key: str,
        retired_key: str | None = None,
    ) -> None:
        keys = [Fernet(primary_key)]
        if retired_key:
            keys.append(Fernet(retired_key))
        self._fernet = MultiFernet(keys)

    def encrypt(self, plaintext: str) -> str:
        """Encrypt a credential, returning a Fernet token."""
        return self._fernet.encrypt(plaintext.encode()).decode()

    def decrypt(self, token: str) -> str:
        """Decrypt a Fernet token to recover the credential."""
        return self._fernet.decrypt(token.encode()).decode()

    def rotate(self, token: str) -> str:
        """Re-encrypt a token under the current primary key.

        If the token is already encrypted under the primary key,
        the result is a fresh token (new IV and timestamp) under
        the same key. MultiFernet.rotate() is idempotent in the
        sense that calling it repeatedly always produces a valid
        token under the primary key.
        """
        return self._fernet.rotate(token.encode()).decode()
```
All methods are synchronous — Fernet operations are sub-millisecond CPU-bound work, so no async/await is needed. The service layer calls decrypt when constructing an org-specific storage or CDN client; the plaintext is held only in memory for the duration of client construction.
In the factory pattern, CredentialEncryptor is a process-level singleton in ProcessContext. Since it holds no network connections or file handles, no shutdown cleanup is needed.
For testing, construct a CredentialEncryptor with Fernet.generate_key() — no mocking or external services required.
Key rotation#
Key rotation uses MultiFernet to provide a zero-downtime transition:
Generate a new Fernet key in 1Password (or let Phalanx regenerate).

Deploy with both keys: set the new key as `DOCVERSE_CREDENTIAL_ENCRYPTION_KEY` and the old key as `DOCVERSE_CREDENTIAL_ENCRYPTION_KEY_RETIRED`. Restart pods. At this point, `MultiFernet` decrypts credentials under either key; new encryptions use the new key.

Run the `credential_reencrypt` job (scheduled periodically; see Periodic job scheduling). The job iterates over all `organization_credentials` rows and calls `CredentialEncryptor.rotate()`, which re-encrypts each token under the current primary key. Unlike Vault's `vault:vN:` prefix, Fernet tokens don't indicate which key encrypted them, so the job processes all rows unconditionally. `MultiFernet.rotate()` is idempotent — re-encrypting an already-migrated token simply produces a new token under the same primary key.

Remove the retired key: once the re-encryption job completes, remove `DOCVERSE_CREDENTIAL_ENCRYPTION_KEY_RETIRED` and restart pods.
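The zero-downtime property can be demonstrated with `MultiFernet` directly: a token encrypted under the retired key decrypts during the transition, and after `rotate()` it is readable with the primary key alone. A self-contained sketch with freshly generated keys standing in for the deployed secrets:

```python
from cryptography.fernet import Fernet, MultiFernet

# Hypothetical keys standing in for the primary and retired
# DOCVERSE_CREDENTIAL_ENCRYPTION_KEY values.
new_key = Fernet.generate_key()
old_key = Fernet.generate_key()

# During rotation both keys are loaded; decryption tries each in order.
transitional = MultiFernet([Fernet(new_key), Fernet(old_key)])
old_token = Fernet(old_key).encrypt(b"r2-api-token")
assert transitional.decrypt(old_token) == b"r2-api-token"

# The credential_reencrypt job calls rotate(), which re-encrypts
# under the primary (first) key.
rotated = transitional.rotate(old_token)

# After the retired key is removed, the rotated token still decrypts.
primary_only = MultiFernet([Fernet(new_key)])
assert primary_only.decrypt(rotated) == b"r2-api-token"
```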
Organization management#
Organizations are created and configured via the Docverse API (not statically in Helm/Phalanx config). This keeps orgs in the same database as projects for consistency. The API has two tiers of admin endpoints:
Docverse superadmin APIs: create/delete/list organizations (scoped via Gafaelfawr token scope)
Org admin APIs: configure the org’s settings, manage projects, manage edition rules
Dashboard rendering#
Docverse subsumes the role of LTD Dasher. Dashboard pages (/v/index.html) and custom 404 pages are rendered server-side using Jinja templates and project/edition metadata from the database, then uploaded as self-contained HTML files to the object store and served through the CDN. See the Dashboard templating system section for the full design.
Re-rendering is triggered by:
Template repo changes (via GitHub webhook)
Project metadata changes
Edition updates (new build published, edition created/deleted)
Dashboard rendering is handled asynchronously via the task queue.
Relationship to LTD Keeper v2#
The LTD Keeper v2 Organization model (in keeper/models.py) established much of this design: the OrganizationLayoutMode enum (subdomain vs path), Fernet-encrypted credentials, org-scoped Tags and DashboardTemplates, and the Product→Organization foreign key. Key changes in Docverse:
Cloud-agnostic storage (not AWS-only)
Configurable CDN provider (not Fastly-only)
GitHub-repo-based dashboard templates (not S3-bucket-stored)
Gafaelfawr auth replacing the User/Permission model
Infrastructure configuration at the org level, not the project level (see Projects, editions and builds)
Projects, editions and builds#
The key domain entities from LTD are retained in Docverse.
Project: a documentation site with a stable URL and multiple versions (editions). Projects are owned by organizations and have metadata like name, description, and configuration.
Edition: a published version of a project, representing a specific build's content at a stable URL. Editions have tracking modes that determine which builds they follow (e.g., the `main` edition tracks the default branch, the `DM-12345` edition tracks branches matching `tickets/DM-12345`).

Build: a discrete upload of documentation content for a project. Builds are conceptually immutable and carry metadata about their origin (git ref, uploader identity, etc.). Builds are identified with a Crockford Base32 ID.
Project model simplification#
In the original LTD API, projects were called “products” to borrow terminology from EUPS. With time, we realized that EUPS wasn’t relevant to LTD, and we shifted the terminology to “project.”
With Docverse, we further improve projects by removing all infrastructure configuration from the project model and moving it to the organization level. This simplifies the project model and reflects the fact that infrastructure configuration (e.g., object store settings, CDN settings) is typically shared across all projects within an organization. See the Organizations design section for details on the new organization model and how infrastructure configuration is handled there.
Improved build uploads#
An issue with the LTD API was that build uploads were slow for large documentation sites, because the client uploaded each file individually using presigned URLs.
Docverse uses a tarball-to-object-store upload model. The client compresses the built documentation into a tarball, uploads it directly to the object store via a presigned URL, then signals Docverse to process it. This avoids the performance problems of per-file presigned URLs (HTTP overhead per file, no compression, thousands of separate uploads for large Sphinx sites) and keeps the API server thin by never routing large request bodies through the API. See Client-server monorepo for the Python client library and GitHub Actions action (docverse-upload) for the GitHub Action that implement this upload flow.
End-to-end flow#
This sequence diagram illustrates the end-to-end flow of a build upload and processing:
```mermaid
sequenceDiagram
    participant Client
    participant API as Docverse API
    participant Store as Object Store
    participant Worker as Docverse Worker
    Client->>API: Authenticate (Gafaelfawr token)
    Client->>API: POST create build (git ref, content hash)
    API-->>Client: Presigned upload URL
    Client->>Store: Upload tarball (PUT via presigned URL)
    Client->>API: PATCH signal upload complete
    API-->>Client: queue_url for tracking
    API->>Worker: Enqueue build processing job
    Worker->>Store: Download tarball from staging
    Worker->>Store: Stream-unpack & upload files to build prefix
    Worker->>Store: Delete staging tarball
    Worker->>Worker: Inventory objects in Postgres, evaluate tracking rules
    Worker->>Store: Update editions (pointer or copy mode)
    Worker->>Worker: Render project dashboard
```
Client authenticates with a Gafaelfawr token (org uploader role).

Client creates a build record via the API (`POST /orgs/:org/projects/:project/builds`). The request includes the git ref and a content hash of the tarball. The response provides a single presigned upload URL pointing to the staging location. See the REST API design section for request/response details.

Client uploads the tarball directly to the object store using the presigned URL. The tarball is a gzipped tar archive (`.tar.gz`) of the built documentation directory. The client implementation is straightforward – `tar czf` piped to an HTTP PUT. The object store handles the bandwidth; the API server is not involved. Multipart uploads are supported where the object store provider allows (S3, GCS), enabling resumable uploads for large sites.

Client signals upload complete (`PATCH /orgs/:org/projects/:project/builds/:build` with status update). The response includes a `queue_url` for tracking the background processing.

Background processing – a single background job executes the build processing pipeline:

Download tarball from the staging location (staging bucket or `__staging/` prefix in the publishing bucket).

Stream-unpack and upload: extract entries from the tar stream and upload individual files to the build's permanent prefix (`__builds/{build_id}/`) in the publishing bucket. Uploads are parallelized via an `asyncio.Semaphore`-bounded pool of concurrent uploads. For a 5,000-file Sphinx site, this processes in well under a minute.

Delete staging tarball after successful extraction.

Inventory the build's objects in Postgres.

Evaluate tracking rules to determine affected editions.

Update editions in parallel via `asyncio.gather()`.

Render project dashboard and metadata JSON once after all edition updates.
The API handler for step 4 is thin: it validates the request, updates the build status to processing, enqueues the background job, and returns the queue_url.
Staging location#
The tarball is uploaded to a staging location that is separate from the build’s permanent prefix. The staging path is __staging/{build_id}.tar.gz. Where this lives depends on whether the org has a dedicated staging store configured:
With staging store: the presigned URL points to the staging bucket. The worker reads from the staging bucket (fast, intra-region) and writes extracted files to the publishing bucket.
Without staging store: the presigned URL points to the `__staging/` prefix in the publishing bucket. The worker reads and writes within the same bucket.
The staging store optimization matters when the publishing bucket is on a different network than the Docverse compute cluster. For example, in Rubin Observatory’s deployment:
Staging bucket: GCS in
us-central1(same region as the GKE cluster). The CI runner uploads the tarball to GCS. The Docverse worker downloads it over Google’s internal network – fast, with zero egress cost.Publishing bucket: Cloudflare R2. The worker uploads extracted files to R2, which is optimized for CDN serving with zero egress cost to readers.
Without the split, the worker would download the tarball from R2 over the public internet, unpack it, then upload thousands of individual files back to R2 over the public internet. With the split, only the final extracted files cross the network boundary, and the tarball round-trip stays within GCP.
For orgs where the publishing store is in the same region as the cluster (e.g., GCS publishing via Google Cloud CDN, or S3 publishing via CloudFront in the same AWS region), the staging store adds no benefit and can be left unconfigured.
Stream unpacking#
The worker unpacks the tarball using Python’s tarfile module in streaming mode, uploading each file to the publishing bucket as it’s extracted. This keeps memory usage bounded – the worker never needs to hold the entire unpacked site in memory or on disk:
```python
import asyncio
import tarfile


async def unpack_and_upload(
    self,
    staging_store: ObjectStore,
    publishing_store: ObjectStore,
    build_id: str,
    semaphore: asyncio.Semaphore,
) -> list[str]:
    """Stream-unpack a tarball and upload files to the build prefix."""
    tarball_stream = await staging_store.get_object_stream(
        f"__staging/{build_id}.tar.gz"
    )
    uploaded_keys: list[str] = []
    upload_tasks: list[asyncio.Task] = []

    async def _upload(key: str, data: bytes) -> None:
        async with semaphore:
            await publishing_store.upload_object(
                key=key,
                data=data,
                content_type=guess_content_type(key),
            )

    # "r|gz" reads the archive as a non-seekable stream, so each
    # member's content must be read before advancing to the next.
    with tarfile.open(fileobj=tarball_stream, mode="r|gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            file_obj = tar.extractfile(member)
            if file_obj is None:
                continue
            content = file_obj.read()
            key = f"__builds/{build_id}/{member.name}"
            upload_tasks.append(asyncio.create_task(_upload(key, content)))
            uploaded_keys.append(key)

    await asyncio.gather(*upload_tasks)
    await staging_store.delete_object(f"__staging/{build_id}.tar.gz")
    return uploaded_keys
```
The semaphore bounds concurrency (e.g., 50 concurrent uploads) to avoid overwhelming the object store API. Content types are inferred from file extensions.
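The `guess_content_type` helper referenced in the worker code could be a thin wrapper over the standard library's `mimetypes` module. A sketch, assuming `application/octet-stream` as the fallback for unrecognized extensions:

```python
import mimetypes


def guess_content_type(key: str) -> str:
    """Infer a Content-Type from the object key's file extension."""
    content_type, _ = mimetypes.guess_type(key)
    return content_type or "application/octet-stream"
```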
The choice of tar.gz over ZIP is deliberate. Tar archives are designed for sequential streaming (originally tape I/O), and each entry header includes the file size upfront, so the worker can stream each file directly into an object store upload without buffering. Gzip compresses the archive as a single stream, which yields better compression ratios for documentation sites whose HTML, CSS, and JavaScript files share significant redundancy — ZIP, by contrast, compresses each file independently and cannot exploit cross-file similarity. ZIP’s main advantage is random access via its central directory, but that is irrelevant here since the worker extracts every file sequentially.
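The streaming property is easy to demonstrate with the standard library: in tarfile's streaming mode (`"r|gz"`), each member's size is available from its header before the file body is consumed, which is what lets a worker start an object store upload without buffering. A self-contained sketch:

```python
import io
import tarfile


def build_tarball(files: dict[str, bytes]) -> bytes:
    """Create an in-memory .tar.gz from a name -> content mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


def stream_entry_sizes(tar_gz: bytes) -> list[tuple[str, int]]:
    """Read entry names and sizes sequentially in streaming mode."""
    sizes: list[tuple[str, int]] = []
    # "r|gz" treats the input as a non-seekable stream, as a worker
    # reading from an object store would.
    with tarfile.open(fileobj=io.BytesIO(tar_gz), mode="r|gz") as tar:
        for member in tar:
            if member.isfile():
                # member.size comes from the header, before the
                # file body is read.
                sizes.append((member.name, member.size))
    return sizes
```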
Build object inventory table#
Each build has an associated set of object records in Postgres: key (object path), content hash (ETag or SHA-256), content type, and size. This is populated during the inventory phase of the build processing job. The inventory enables:
Fast diff computation for edition updates (no object store listing calls)
Orphan detection for build cleanup rules
Metadata for dashboards (e.g., build size)
Note that this inventory table is motivated by the original S3- and Fastly-based architecture, where editions are updated by copying the build objects. With the Cloudflare-based architecture, where editions are updated by pointing to the build prefix, the inventory is less critical for edition updates but is still valuable for orphan detection and dashboard metadata. See BuildObject for the column definition.
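For copy-mode edition updates, the diff the inventory enables can be sketched as a set computation over key-to-hash mappings. This is a simplification: real inventory rows also carry content type and size.

```python
def diff_inventories(
    old: dict[str, str], new: dict[str, str]
) -> tuple[set[str], set[str]]:
    """Diff two build inventories (object key -> content hash).

    Returns (changed_or_added, deleted) key sets — the information a
    copy-mode edition update needs, computed from Postgres rows
    rather than object store listing calls.
    """
    changed = {k for k, h in new.items() if old.get(k) != h}
    deleted = set(old) - set(new)
    return changed, deleted
```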
Edition overview#
An Edition is a named, published view of a project's documentation at a stable URL. Editions are pointers — they represent a specific build's content served at an edition-specific URL path (e.g., /v/main/, /v/DM-12345/, /v/2.x/).
Two concepts govern edition behavior:
Tracking mode: determines which builds the edition follows (the algorithm for auto-updating).
Edition kind: classifies the edition for dashboard display and lifecycle rule targeting.
Edition slugs#
Editions are identified by URL-safe slugs. The slug system has three layers:
Reserved slugs: `__main` is the sole reserved slug, representing the default edition that serves at the project root (no `/v/` prefix in the URL). It does not correspond to a git ref and uses a double-underscore prefix to avoid collisions with any git branch or tag name.

Org-configurable rewrite rules: organizations can configure pattern-based transforms from git ref → edition slug. These are an ordered list of rules where the first match wins, with three rule types: `prefix_strip`, `regex`, and `ignore`. For example, Rubin's convention uses a `prefix_strip` rule to rewrite `tickets/DM-12345` → slug `DM-12345`. See the Edition slug rewrite rules section for the full rule format, evaluation algorithm, and examples.

Default behavior: slashes in git refs become dashes (e.g., `feature/dark-mode` → `feature-dark-mode`), keeping all edition slugs as single URL path segments.
An edition tracks a slug, not a single canonical git ref. Multiple git refs can map to the same slug through rewrite rules and all contribute builds to that edition. For example, if both tickets/DM-12345 and DM-12345 exist as branches and both rewrite to slug DM-12345, builds from either ref update the same edition. This fixes a bug in LTD Keeper where only one ref could contribute to a special-case edition.
Edition slug rewrite rules#
When a new build arrives and its git ref doesn’t match any existing edition, Docverse uses slug rewrite rules to determine how to transform the git ref into an edition slug. Rules are evaluated in order; the first match wins.
Rule types#
Three rule types cover the practical use cases:
prefix_strip — the workhorse. Matches refs starting with a literal prefix, strips it, and uses the remainder as the slug (with any remaining slashes replaced by a configurable character, default -).
{
"type": "prefix_strip",
"prefix": "tickets/",
"edition_kind": "draft"
}
tickets/DM-12345 → slug DM-12345. tickets/foo/bar → slug foo-bar.
regex — for patterns that prefix stripping can’t express. Uses a Python regex with a named capture group slug to extract the edition slug. Remaining slashes in the captured group are still replaced by slash_replacement by default, protecting against invalid slugs.
{
"type": "regex",
"pattern": "^release/(?P<slug>v\\d+\\.\\d+)$",
"edition_kind": "release"
}
release/v2.3 → slug v2.3 with kind release. release/experimental → no match.
ignore — suppresses edition auto-creation for matching refs. Uses glob patterns (Python fnmatch semantics with ** for recursive matching). Useful for filtering noise from dependency bot branches, CI scratch branches, and similar refs that should never produce editions.
{
"type": "ignore",
"glob": "dependabot/**"
}
dependabot/npm/lodash-4.17.21 → no edition created.
Rule fields#
| Field | Type | Applies to | Default | Description |
|---|---|---|---|---|
| `type` | enum | all | — | One of `prefix_strip`, `regex`, `ignore` |
| `edition_kind` | str | all | `draft` | Kind assigned to auto-created editions |
| `prefix` | str | `prefix_strip` | — | Literal prefix to match and strip |
| `pattern` | str | `regex` | — | Python regex with named group `slug` |
| `glob` | str | `ignore` | — | Glob pattern for ref matching |
| `slash_replacement` | str | `prefix_strip`, `regex` | `-` | Character replacing remaining slashes in extracted slug. Must be one of … |
Storage and scoping#
Rules are stored as a JSONB array on the Organization and Project tables:
`Organization.slug_rewrite_rules` (JSONB): ordered rule list applied to all projects in the org.

`Project.slug_rewrite_rules` (JSONB, nullable): when set, completely replaces the org-level rules for that project. No merging or inheritance — if a project needs one different rule, it copies the org rules and modifies. This avoids the complexity of ordered-list merge semantics.
Setting the project-level rules to null (or omitting the field) restores inheritance from the org.
Evaluation algorithm#
1. rules = project.slug_rewrite_rules ?? org.slug_rewrite_rules ?? []
2. For each rule in order:
a. If type=ignore and glob matches git_ref → return None (suppress)
b. If type=prefix_strip and git_ref starts with prefix →
remainder = git_ref[len(prefix):]
slug = remainder.replace("/", rule.slash_replacement)
return (slug, rule.edition_kind)
c. If type=regex and pattern matches git_ref →
slug = match.group("slug")
slug = slug.replace("/", rule.slash_replacement)
return (slug, rule.edition_kind)
3. No rule matched (default fallback):
slug = git_ref.replace("/", "-")
return (slug, "draft")
The default fallback (step 3) always applies, so every non-ignored ref produces a valid slug even with zero rules configured. This preserves LTD Keeper’s existing behavior for orgs that don’t configure any rewrite rules.
Slug validation#
After a slug is produced (by rule or default), it is validated:
- Must be non-empty.
- Must contain only URL-safe characters: lowercase alphanumeric, hyphens, underscores, dots. Uppercase characters are lowercased.
- Must not start with `__` (reserved prefix for system slugs like `__main`).
- Must not exceed 128 characters.
If validation fails, the build is processed but no edition is auto-created. The build record’s status reflects that slug generation failed, and the issue is logged for operator attention.
Example: Rubin Observatory configuration#
[
{ "type": "ignore", "glob": "dependabot/**" },
{ "type": "ignore", "glob": "renovate/**" },
{ "type": "prefix_strip", "prefix": "tickets/", "edition_kind": "draft" },
{
"type": "regex",
"pattern": "^v?(?P<slug>\\d+\\.\\d+\\.\\d+)$",
"edition_kind": "release"
}
]
Evaluation for various git refs:
| Git ref | Matched rule | Slug | Kind |
|---|---|---|---|
| `dependabot/npm/lodash-4.17.21` | `ignore` | — (suppressed) | — |
| `renovate/pin-dependencies` | `ignore` | — (suppressed) | — |
| `tickets/DM-12345` | `prefix_strip` | `DM-12345` | `draft` |
| `tickets/DM-12345/take-2` | `prefix_strip` | `DM-12345-take-2` | `draft` |
| `v2.1.0` | `regex` | `2.1.0` | `release` |
| `2.1.0` | `regex` | `2.1.0` | `release` |
| `main` | default fallback | `main` | `draft` |
| `feature/dark-mode` | default fallback | `feature-dark-mode` | `draft` |
Note: for `main`, the default fallback produces slug `main` with kind `draft`, but in practice the `__main` edition already exists and matches via its tracking mode, so auto-creation is not triggered.
Dry-run endpoint#
A preview endpoint allows org admins to test their rewrite rules against a git ref without creating any resources:
POST /orgs/:org/slug-preview → preview slug resolution (admin)
Request:
{
"git_ref": "tickets/DM-12345",
"project": "pipelines"
}
The optional project field causes the endpoint to use project-level rule overrides if they exist. Without it, the org-level rules are used.
Response:
{
"git_ref": "tickets/DM-12345",
"edition_slug": "DM-12345",
"edition_kind": "draft",
"matched_rule": {
"type": "prefix_strip",
"prefix": "tickets/",
"index": 0
},
"rule_source": "org"
}
For an ignored ref:
{
"git_ref": "dependabot/npm/lodash-4.17.21",
"edition_slug": null,
"edition_kind": null,
"matched_rule": {
"type": "ignore",
"glob": "dependabot/**",
"index": 0
},
"rule_source": "org"
}
For a ref that hits the default fallback:
{
"git_ref": "feature/dark-mode",
"edition_slug": "feature-dark-mode",
"edition_kind": "draft",
"matched_rule": null,
"rule_source": "default"
}
The rule_source field indicates whether the rules came from "org", "project", or "default" (no rules configured, using built-in fallback).
Compound slug derivation for alternate-scoped builds#
When a build includes an alternate_name (see REST API design), slug derivation adds a scoping prefix to keep alternate-specific editions separate from generic ones:
1. Apply normal slug rewrite rules to `git_ref` → base slug + edition kind.
2. Prepend the `alternate_name` with a `--` separator: `{alternate_name}--{base_slug}`.
3. Set the edition’s tracking mode to `alternate_git_ref` with `tracking_params: {"git_ref": "<git_ref>", "alternate_name": "<alternate_name>"}`.
Example: a build with git_ref: "tickets/DM-12345" and alternate_name: "usdf-dev":
- Rewrite rules produce base slug `DM-12345`, kind `draft`.
- Final slug: `usdf-dev--DM-12345`.
- Tracking mode: `alternate_git_ref`, tracking params: `{"git_ref": "tickets/DM-12345", "alternate_name": "usdf-dev"}`.
For builds without alternate_name, slug derivation is unchanged.
The `--` separator is chosen because, although double hyphens are legal in git branch names, they are rare in practice, making `--` a reliable delimiter between the alternate name and the base slug. Slug validation still applies to the full compound slug.
Tracking modes#
The full set of tracking modes in Docverse:
| Mode | Behavior | Carried from |
|---|---|---|
| `git_ref` | Track a specific branch or tag | LTD Keeper v1 (was `git_refs`) |
| `lsst_doc` | Track latest LSST document version tag (`vN.M`) | LTD Keeper v1 |
| `eups_major_release` | Track latest EUPS major release tag | LTD Keeper v1 |
| `eups_weekly_release` | Track latest EUPS weekly release tag | LTD Keeper v1 |
| `eups_daily_release` | Track latest EUPS daily release tag | LTD Keeper v1 |
| `semver` | Track the latest semver release, excluding pre-releases (alpha, beta, rc) | New |
| `semver_major` | Track the latest release within a major version stream (e.g., latest `v2.x.x`) | New |
| `semver_minor` | Track the latest release within a minor version stream (e.g., latest `v2.1.x`) | New |
| `alternate_git_ref` | Track a specific branch or tag scoped to an alternate name. Parameterized by `git_ref` and `alternate_name` | New |
Semver tracking supports tags both with and without a v prefix (e.g., v2.1.0 and 2.1.0).
The alternate_git_ref tracking mode#
The alternate_git_ref mode is the dedicated tracking mode for deployment-scoped editions. It matches builds by both a specific git_ref and an alternate_name, parameterized via tracking_params. For example, edition usdf-dev--DM-12345 uses alternate_git_ref with tracking_params: {"git_ref": "tickets/DM-12345", "alternate_name": "usdf-dev"} — it updates only when a build arrives carrying that exact git ref and alternate name pair.
Builds with alternate_name set are invisible to editions that do not use alternate_git_ref (or otherwise filter on alternate_name). This prevents a deployment-specific build from accidentally updating __main or a generic draft edition. Conversely, builds without alternate_name are invisible to alternate_git_ref editions. The alternate name acts as a namespace partition within a project’s build stream.
Edition kinds#
Edition kinds classify editions for display and lifecycle purposes:
| Kind | Description | Typical tracking mode |
|---|---|---|
| `main` | The default edition | `git_ref` |
| `release` | A stable release | `semver` |
| `draft` | A draft/development edition | `git_ref` |
| `major` | Tracks a major version stream | `semver_major` |
| `minor` | Tracks a minor version stream | `semver_minor` |
| `alternate` | An alternative product variant or deployment | `alternate_git_ref` |
When an edition is auto-created, its kind is assigned based on the tracking mode. The kind does not constrain tracking behavior — it provides context for dashboards and lifecycle rules.
`alternate` editions are exempt from `draft_inactivity` lifecycle rules by default — they represent long-lived deployment targets, not transient branches. Alternate editions can be created manually via the API, or auto-created when builds carry an `alternate_name` (see below). Slug rewrite rules can also assign `edition_kind: "alternate"` to control the kind assigned during auto-creation.
Auto-creation of editions#
When a new build arrives and matches no existing edition’s tracking criteria, Docverse can auto-create editions:
- `git_ref` editions: auto-created for new branches/tags (as in LTD Keeper). Classified as `draft` kind by default.
- `semver_major` editions: auto-created when a build introduces a new major version stream (e.g., first `v3.x.x` tag). Classified as `major` kind.
- `semver_minor` editions: auto-created when a build introduces a new minor version stream (e.g., first `v3.1.x` tag). Classified as `minor` kind.
- `alternate_git_ref` editions: auto-created when a build carries an `alternate_name` and its compound slug (`{alternate_name}--{base_slug}`) does not match an existing edition. The edition is created with `alternate_git_ref` tracking mode and `tracking_params` containing both the `git_ref` and `alternate_name`. The edition kind comes from slug rewrite rules applied to the base git ref (typically `draft` for ticket branches).

Note: because auto-creation derives the edition kind from slug rewrite rules, an alternate edition tracking the default branch (e.g., `usdf-dev--main`) gets kind `draft` via the default fallback — not kind `alternate`. This means it would be excluded from the version switcher (which includes kind `alternate` but not `draft`) and would be subject to `draft_inactivity` lifecycle cleanup. To avoid this, pre-create long-lived deployment editions with kind `alternate` via the `POST /orgs/:org/projects/:project/editions` endpoint before the first build arrives. See the REST API design section for the edition creation request body.
Auto-creation for `semver_major`, `semver_minor`, and `alternate_git_ref` modes can be disabled at both the org and project level.
Build annotations#
The upload client can optionally annotate a build with metadata about its nature (e.g., “this is a release”, “this is a draft/PR build”). This supplements the pattern-based classification from tracking rules. Two complementary mechanisms:
Client annotations: optional metadata on the build record provided at upload time.
Project/org pattern rules: configurable rules that classify builds based on git ref patterns (e.g., “tags matching `v*.*.*` are releases”, “branches matching `tickets/*` are drafts”).
The tracking system uses both signals, with pattern-based rules as the primary classifier.
Edition-build history#
Docverse maintains an explicit log of every build that an edition has pointed to, stored in an EditionBuildHistory table with the edition ID, build ID, timestamp, and ordering position. This replaces the implicit relationship in LTD Keeper (where you’d have to reconstruct history from build timestamps). The history enables:
- Rollback API: an org admin can roll an edition back to any previous build in its history with a single API call (PATCH the edition with a `build` field pointing to the desired build).
- Orphan build detection: lifecycle rules can reference history position (e.g., “a build that is 5+ versions back and older than 30 days is an orphan”).
See EditionBuildHistory for the column definition.
Edition update strategy#
When an edition is updated to point to a new build, the strategy depends on the CDN’s declared capabilities — pointer mode or copy mode.
Pointer mode (instant switchover)#
For CDNs with edge compute and an edge data store (see CDN provider comparison), the edition is a metadata pointer and no object copying is required. The update process:
Write new mapping: update the edition→build mapping in the edge data store (e.g., Cloudflare Workers KV, Fastly KV Store).
Purge CDN cache: invalidate cached content for the edition so new requests resolve the updated mapping.
Update database: record the new edition→build association and log to `EditionBuildHistory`.
The edge Worker/Compute function intercepts each request, looks up the current build for the requested edition in the edge data store, and fetches the content directly from the build’s object store prefix. Edition updates are effectively instant — a KV write + cache purge, no bulk object operations.
Copy mode (ordered in-place update)#
For CDNs without edge compute, Docverse performs an ordered in-place update (the fallback from the LTD Keeper approach of delete-then-copy, which caused temporary 404s):
Diff: compare the new build’s object inventory against the edition’s current inventory using the Postgres object tables.
Copy new/changed assets first: images, CSS, JS — so that HTML pages referencing new assets won’t break.
Copy HTML pages: starting from the deepest directory levels, working up to the homepage last. This minimizes the window where a user might see a partially-updated site.
Move orphaned objects to purgatory: objects in the edition that don’t exist in the new build are moved to a purgatory key prefix rather than deleted immediately.
Purge CDN cache: invalidate cached content for the edition.
This ordering ensures that at no point during the update will a user get a 404 for content that should exist.
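The copy ordering in steps 2–3 can be sketched as a sort over the new build’s object keys (illustrative; classifying assets vs. HTML by file extension is an assumption):

```python
def copy_order(keys: list[str]) -> list[str]:
    """Order object keys for an in-place edition update.

    Assets come first, then HTML pages from deepest directory to
    shallowest, with the root homepage (index.html) last, so a partially
    updated site never references assets that have not been copied yet.
    """
    def sort_key(key: str) -> tuple[int, int]:
        is_html = key.endswith((".html", ".htm"))
        depth = key.count("/")
        if not is_html:
            return (0, 0)           # assets first, in any order
        if key == "index.html":
            return (2, 0)           # root homepage copied last
        return (1, -depth)          # deeper HTML before shallower HTML

    return sorted(keys, key=sort_key)
```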
Mode selection#
The edition update service checks the CDN’s declared supports_pointer_mode capability at runtime to select the appropriate code path. This keeps the service logic clean and makes it straightforward to add new CDN providers without modifying orchestration logic. Orgs using Cloudflare Workers + R2 or Fastly Compute get instant switchovers; orgs using Google Cloud CDN or other providers without edge compute get the ordered copy strategy.
Deletion and lifecycle rules#
Soft delete and purgatory#
Docverse uses a two-layer soft delete approach:
- Database: objects are soft-deleted (marked with a `date_ended` or `date_deleted` timestamp) rather than immediately removed.
- Object store: files are moved to a purgatory key prefix rather than deleted. A background job hard-deletes purgatory objects after a configurable retention period.
This provides reversibility at both layers. The purgatory timeout is configurable at the org level with per-project overrides.
Lifecycle rules#
Lifecycle rules are stored as JSONB in Postgres — a list of rule objects, each with a type discriminator and type-specific parameters. Rules are configured at the org level as defaults and can be overridden per-project. Individual editions can be marked as exempt from all lifecycle rules (a “never delete” flag).
Rule types:
| Rule type | Parameters | Behavior |
|---|---|---|
| `draft_inactivity` | `max_days_inactive` | Delete draft editions with no new builds for N days |
| `ref_deleted` | `enabled` | Delete editions whose tracked Git ref no longer exists on GitHub |
| `build_history_orphan` | `min_position`, `min_age_days` | Delete builds that are N+ positions back in an edition’s history and older than M days |
Example rule configuration (JSONB):
[
{ "type": "draft_inactivity", "max_days_inactive": 30 },
{ "type": "ref_deleted", "enabled": true },
{ "type": "build_history_orphan", "min_position": 5, "min_age_days": 30 }
]
Different edition kinds can have different default lifecycle rules. For example, draft editions might default to draft_inactivity with 30 days, while release editions have no auto-deletion rules.
GitHub event-driven deletion#
Edition deletion triggered by Git ref deletion is event-driven via GitHub webhooks. Docverse supports two models for receiving GitHub events:
Direct webhooks: Docverse receives webhook events directly as a GitHub App, using the Safir GitHub App framework.
Kafka via Squarebot: Squarebot acts as a GitHub App events gateway, receiving webhooks and republishing them internally via Kafka. Docverse consumes events from Kafka. This allows Docverse to share a GitHub App installation with other internal tools.
Both models are supported; the deployment can use either or both.
A periodic audit job supplements the event-driven approach by verifying that Git refs referenced by Docverse editions and projects still exist on GitHub. This catches cases where webhook delivery failed or events were missed.
Documentation hosting#
Like LTD before it, Docverse decouples documentation hosting from the API service. The documentation host uses a simple and reliable cloud-based CDN and object store so that documentation is served with low latency and high availability around the world, even if the API service itself is down. In LTD, we used Fastly as the CDN and AWS S3 for object storage. With Docverse, we wish to support multiple hosting stacks, but also to add a new hosting architecture using Cloudflare Workers + R2 that eliminates the S3 copy-on-publish bottleneck and reduces costs by an order of magnitude.
The S3 copy-on-publish bottleneck#
The original LSST the Docs platform serves documentation projects as wildcard subdomains (pipelines.lsst.io, dmtn-139.lsst.io) through Fastly’s CDN.
LTD Keeper, a Flask-based REST API, manages three core entities: products (documentation projects), builds (individual CI-produced documentation snapshots), and editions (named pointers like main or v1.0 that track specific builds).
Each entity has a corresponding path prefix in a shared S3 bucket.
Fastly VCL intercepts every request and performs regex-based URL rewriting to map the requested URL to an S3 object path.
For example, pipelines.lsst.io/v/main/page.html becomes s3://bucket/pipelines/editions/main/page.html.
This is elegant but rigid: it can only serve what’s physically at the edition’s S3 path.
So when the main edition is updated from build b41 to b42, LTD Keeper must copy every object from pipelines/builds/b42/ to pipelines/editions/main/.
The system then purges Fastly’s cache using surrogate keys — which works well at ~150ms global propagation — but the S3 copy itself can take minutes for large documentation sets.
This copy-on-publish approach has the fundamental issue that edition updates are slow and can also be inconsistent for users during the copy window. The original LTD implementation had the additional bug that an edition’s objects would be deleted before the new build’s objects were copied into place, causing 404s for pages that weren’t in the Fastly cache.
The proposed solution replaces this with edge-side dynamic resolution: the CDN intercepts the request, extracts the project name and edition from the URL, consults an edge data store to determine which build the edition currently points to, and fetches the correct object directly from the build’s storage path. Edition updates become a metadata change (updating a key-value mapping) rather than a bulk data operation.
The Cloudflare stack#
Cloudflare can provide this edge-side dynamic edition resolution through a combination of its Workers edge compute platform, R2 object storage, and Workers KV key-value store. Surprisingly, this platform is also substantially more cost-effective than the current Fastly + S3 architecture, even at large scale, due to R2’s zero egress fees, free wildcard TLS certificates, and Workers’ efficient edge execution model.
Architecture overview#
With Cloudflare, a request is handled at the edge by a Worker script that parses the URL to determine the project and edition. With Workers KV, the Worker looks up which build the edition currently points to, constructs the R2 key for the requested object, and fetches it directly from R2 using the native R2 bindings. The Worker then returns the object with appropriate caching headers:
graph TD
DNS["*.lsst.io<br/>(wildcard DNS → Cloudflare)"]
DNS --> W1
subgraph Worker["Cloudflare Worker"]
W1["1. Parse Host header<br/>→ extract project"]
W2["2. Parse URL path<br/>→ extract edition"]
W3["3. KV lookup<br/>edition → build ID"]
W4["4. Construct R2 key<br/>{project}/builds/{build}/{path}"]
W5["5. Fetch from R2"]
W6["6. Return with cache headers"]
W1 --> W2 --> W3 --> W4 --> W5 --> W6
end
W3 -.-> KV[("Workers KV<br/>edition→build mappings")]
W5 -.-> R2[("R2 Bucket<br/>Stores all builds once")]
API["Docverse API"] -->|REST API writes| KV
With this architecture, builds are only stored in R2 in one place ({project}/builds/{build_id}/), and editions are just pointers to builds in Workers KV.
When an edition is updated, only the KV mapping changes — an instant metadata operation instead of a bulk S3 copy.
Worker request flow#
The Worker implementation is approximately 100–200 lines of TypeScript. The request handling flow:
1. Parse Host header to extract the project name from the subdomain (e.g., `pipelines.lsst.io` → `pipelines`).
2. Parse URL path to determine the edition and file path:
   - `/v/{edition}/page.html` → named edition
   - `/page.html` or `/` → default edition (configurable, typically `main`)
   - `/builds/{build_id}/page.html` → direct build access (bypasses KV lookup)
3. KV lookup: read the edition→build mapping from Workers KV using key `{project}/{edition}`. KV reads are cached at the edge with a `cacheTtl` (e.g., 30 seconds) to reduce KV costs and latency on hot paths.
4. Construct R2 key: `{project}/builds/{build_id}/{file_path}`.
5. Fetch from R2 via the native R2 binding (no HTTP overhead).
6. Return response with appropriate `Content-Type`, `Cache-Control`, and `Cache-Tag` headers.
The Worker also handles routing to the project dashboard page (and other Docverse metadata files) and returns appropriate 404 responses. See dashboard templating system for how dashboard pages are served.
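For illustration, the routing logic might look like the following, written in Python rather than the Worker’s TypeScript (the `parse_request` helper and its return shape are hypothetical):

```python
def parse_request(
    host: str, path: str, default_edition: str = "__main"
) -> tuple[str, tuple[str, str], str]:
    """Resolve (project, target, file_path) from a request.

    Mirrors the Worker flow: project from the subdomain, then either a
    named edition, a direct build ID, or the default edition from the path.
    """
    project = host.split(".", 1)[0]  # pipelines.lsst.io -> pipelines
    parts = path.lstrip("/").split("/")
    if parts[0] == "v" and len(parts) >= 2:
        file_path = "/".join(parts[2:]) or "index.html"
        return project, ("edition", parts[1]), file_path
    if parts[0] == "builds" and len(parts) >= 2:
        file_path = "/".join(parts[2:]) or "index.html"
        return project, ("build", parts[1]), file_path  # bypasses KV lookup
    return project, ("edition", default_edition), path.lstrip("/") or "index.html"
```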
Cloudflare configuration#
The Worker is configured via a wrangler.toml file:
name = "lsst-io-router"
main = "src/index.ts"
compatibility_date = "2025-01-01"
# Route: catch all subdomains of lsst.io
routes = [
{ pattern = "*.lsst.io/*", zone_name = "lsst.io" }
]
# KV Namespace for edition → build mappings
[[kv_namespaces]]
binding = "EDITIONS"
id = "<KV_NAMESPACE_ID>"
preview_id = "<KV_PREVIEW_NAMESPACE_ID>"
# R2 Bucket for documentation builds
[[r2_buckets]]
binding = "DOCS_BUCKET"
bucket_name = "lsst-io-docs"
# Environment variables
[vars]
DEFAULT_EDITION = "__main"
DNS setup: a proxied wildcard CNAME record (*.lsst.io) combined with an HTTP route pattern (*.lsst.io/*) directs all subdomain traffic through a single Worker.
Cloudflare’s Universal SSL automatically provisions and renews a wildcard certificate for *.lsst.io at no additional cost.
The underlying A record content is irrelevant (it is never reached) because the Worker intercepts all requests.
Infrastructure provisioning: the KV namespace and R2 bucket are created via the Wrangler CLI:
# Create the KV namespace
npx wrangler kv namespace create "EDITIONS"
npx wrangler kv namespace create "EDITIONS" --preview
# Create the R2 bucket
npx wrangler r2 bucket create lsst-io-docs
Two-tier caching strategy#
The caching architecture uses two layers:
- Workers KV as the global source of truth for edition→build mappings, with an API for external writes that Docverse calls when an edition is updated. KV propagates updates globally within approximately 60 seconds — acceptable for documentation that doesn’t require sub-second freshness.
- Per-PoP Cache API as a hot local cache in front of KV reads (`cacheTtl` on KV read operations), eliminating KV costs and latency for frequently accessed mappings.
For the content itself, the Worker sets Cache-Control headers to layer browser and edge caching:
- Browser cache: short TTL (5 minutes, `max-age=300`) so users see updates relatively quickly.
- Edge cache: longer TTL (1 hour, `s-maxage=3600`) to reduce R2 read operations.
R2’s built-in Tiered Read Cache automatically caches hot objects closer to users, providing an additional optimization layer.
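A sketch of the response-header assembly (the helper is hypothetical; the `Cache-Tag` value format follows the `edition:pipelines/main` convention used for surgical purges):

```python
def content_headers(
    content_type: str,
    edition_key: str,
    browser_ttl: int = 300,   # 5-minute browser cache
    edge_ttl: int = 3600,     # 1-hour edge cache
) -> dict[str, str]:
    """Build layered caching headers: short browser TTL, longer edge TTL."""
    return {
        "Content-Type": content_type,
        "Cache-Control": f"public, max-age={browser_ttl}, s-maxage={edge_ttl}",
        # Cache-Tag enables targeted invalidation on Enterprise plans.
        "Cache-Tag": f"edition:{edition_key}",
    }
```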
Cache invalidation#
When an edition is re-pointed to a new build, two things happen:
1. KV is updated — the new edition→build mapping takes effect within ~60 seconds globally.
2. Edge cache is purged — so users don’t keep seeing the old build for the duration of the `s-maxage`.
Three cache purge strategies are available, depending on the Cloudflare plan:
- Purge by hostname (all plans): purge all cached content for a subdomain (e.g., `pipelines.lsst.io`). Slightly broader than necessary but fast and simple.
- Purge by prefix (all plans): purge by URL prefix (e.g., `pipelines.lsst.io/v/v23.0/`) for more targeted invalidation.
- Purge by Cache-Tag (Enterprise plan): surgical invalidation using `Cache-Tag` headers set by the Worker (e.g., `edition:pipelines/main`).
The complete “re-point edition” operation — a KV write plus a cache purge — replaces the S3 directory copy plus Fastly surrogate-key purge in the current LTD architecture. Total time: under 2 seconds, versus minutes for the S3 copy approach.
Wildcard subdomains and SSL#
Wildcard subdomains work on all Cloudflare plans, including the free tier.
A proxied wildcard DNS record (*.lsst.io) combined with an HTTP route pattern (*.lsst.io/*) directs all subdomain traffic through a single Worker.
Cloudflare’s Universal SSL automatically provisions and renews a wildcard certificate at no additional cost.
KV management#
The KV namespace is the source of truth for which build each edition points to. Docverse writes to it via the Cloudflare REST API whenever an edition is created, updated, or deleted.
Key format: {project}/{edition} (e.g., pipelines/main)
Value format: JSON with build metadata:
{
"build_id": "b42",
"updated_at": "2025-06-15T10:30:00Z",
"git_ref": "main",
"title": "Latest (main)"
}
Single edition write — the API call Docverse makes on every edition update:
curl -X PUT \
"https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/storage/kv/namespaces/${KV_NAMESPACE_ID}/values/pipelines%2Fmain" \
-H "Authorization: Bearer ${CF_API_TOKEN}" \
-H "Content-Type: application/json" \
--data-raw '{"build_id":"b42","updated_at":"2025-06-15T10:30:00Z","git_ref":"main","title":"Latest (main)"}'
Bulk write — for migration or seeding all edition mappings at once (supports up to 10,000 key-value pairs per call):
curl -X PUT \
"https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/storage/kv/namespaces/${KV_NAMESPACE_ID}/bulk" \
-H "Authorization: Bearer ${CF_API_TOKEN}" \
-H "Content-Type: application/json" \
--data-raw '[
{
"key": "pipelines/main",
"value": "{\"build_id\":\"b42\",\"updated_at\":\"2025-06-15T10:30:00Z\",\"git_ref\":\"main\"}"
},
{
"key": "pipelines/v23.0",
"value": "{\"build_id\":\"b38\",\"updated_at\":\"2025-05-01T00:00:00Z\",\"git_ref\":\"v23.0\"}"
}
]'
Cost model#
R2’s zero egress fees are the defining cost feature. At 10 million requests per month, the estimated total cost is approximately **$6/month**. Even at 100 million requests per month, the cost rises to roughly **$110/month** — still dramatically less than equivalent AWS infrastructure.
| Component | 10M req/mo | 100M req/mo |
|---|---|---|
| Workers base + requests | $5.00 | $32.00 |
| Workers KV reads | $0.00 (within free tier) | $45.00 |
| Workers KV writes | ~$0.05 | ~$0.05 |
| R2 storage (50 GB) | $0.60 | $0.60 |
| R2 Class B ops (reads) | $0.00 (within free tier) | $32.40 |
| Bandwidth (egress) | $0.00 | $0.00 |
| Total | ~$6 | ~$110 |
Other hosting options#
Through configuration, Docverse can support multiple hosting stacks, allowing different organizations to choose their preferred CDN provider and architecture. We can certainly continue to support the existing Fastly VCL and S3 architecture that we have used thus far. This section surveys the other hosting options evaluated, including Fastly Compute, CloudFront + Lambda@Edge, and Google Cloud CDN.
Fastly Compute#
Since LSST the Docs already runs on Fastly, migrating from VCL to Fastly Compute avoids changing CDN providers entirely. Compute replaces VCL’s domain-specific language with WebAssembly modules compiled from Rust, JavaScript, or Go, while retaining access to the same cache infrastructure and Fastly’s ~150ms global surrogate-key purge — the fastest in the industry.
Key characteristics:
Dynamic Backends (GA since April 2023) allow the Wasm module to construct S3 URLs at runtime rather than pre-configuring every possible origin.
The Fastly KV Store provides an edge-local key-value store for edition→build mappings, readable from any PoP with low latency.
Wasm cold starts are 35 microseconds, matching VCL’s near-instant request processing.
Existing surrogate key infrastructure carries over directly, preserving the current cache purging workflow.
One structural constraint: VCL and Compute cannot coexist on the same Fastly service. Fastly recommends service chaining during transition — placing a Compute service in front of the existing VCL service, then gradually moving logic to Compute until the VCL service can be retired.
Cost: Fastly requires contacting sales for Compute pricing, with a **$50/month minimum** for paid accounts. Bandwidth starts at **$0.12/GB** in North America (versus Cloudflare’s $0.00/GB for R2 egress), making it roughly 5–30x more expensive than Cloudflare Workers + R2 at equivalent scale.
CloudFront + Lambda@Edge#
CloudFront + Lambda@Edge supports dynamic routing through Lambda functions on the origin-request trigger (cache-miss only). The CloudFront KeyValueStore (introduced in late 2024) offers a hybrid approach with sub-millisecond reads from a 5 MB key-value store accessible from CloudFront Functions. However, Lambda@Edge functions must be deployed in us-east-1, logs are scattered across regional CloudWatch instances, cold starts range from 100–520ms, and cache invalidation takes ~2 minutes. At approximately $443/month for 10 million requests, it is the most expensive option evaluated.
Google Cloud CDN#
Google Cloud CDN provides no edge compute capability, restricting it to copy mode only.
CDN provider comparison#
| Criterion | Cloudflare Workers + R2 | Fastly Compute | CloudFront + Lambda@Edge |
|---|---|---|---|
| Edge API calls | ✅ `fetch()`, 1000/req | ✅ Dynamic backends | ✅ Lambda@Edge only |
| Wildcard subdomains | ✅ All plans, free SSL | ✅ Paid, wildcard cert | ✅ ACM wildcard |
| Edge data store | KV (~60s propagation) | KV Store (eventually consistent) | KeyValueStore (5 MB max) |
| Cache purge speed | Seconds (URL purge) | ~150ms (surrogate key) | ~2 minutes |
| Cold starts | None (V8 isolates) | None (35μs Wasm) | 100–520ms (Lambda) |
| Global PoPs | 330+ | 80+ | 600+ (but Lambda runs at ~13 regional edges) |
| Egress cost | $0.00 | $0.12/GB | $0.085/GB |
| Monthly cost (10M req) | ~$6 | ~$50–200 | ~$443 |
| Implementation effort | Low (~100–200 LOC TS) | Medium (~200–400 LOC) | High (multi-service orchestration) |
| Migration friction | CDN provider change | Same provider, new runtime | CDN provider change |
Dashboard templating system#
Overview#
Docverse renders dashboard pages (/v/index.html) and custom 404 pages server-side using Jinja templates, then uploads the rendered HTML to the object store as self-contained files with all assets inlined. This subsumes the role of LTD Dasher and gives organizations full control over the look and feel of their documentation portals.
Three tiers of templates, resolved in priority order:
Project-level override: a project can specify its own template source, allowing it to host its template in its own documentation repo or a dedicated repo.
Org-level template: each organization can configure a GitHub repo containing templates and assets shared across all projects in the org.
Docverse built-in default: a minimal, functional template bundled with the Docverse application, used when no org or project template is configured.
Template source configuration#
A template source is defined by four fields:
| Field | Type | Default | Description |
|---|---|---|---|
| `owner` | str | — | GitHub organization or user |
| `repo` | str | — | Repository name |
| `path` | str | `""` (repository root) | Path within the repo to the template directory |
| `ref` | str | repo’s default branch | Branch or tag to track |
This structure is flexible enough to support several layouts:
- Dedicated template repo: `lsst-sqre/docverse-templates`, path `""`, ref `main` — an org maintains a standalone repo for their templates.
- Monorepo with multiple templates: `lsst-sqre/docverse-templates`, path `rubin/` — multiple orgs share a repo but use different subdirectories.
- Template in a project repo: `lsst/pipelines_lsst_io`, path `.docverse/template/` — a project hosts its own template alongside its documentation source.
The same four fields appear on both the org-level and project-level template configuration. When a project has its own template source, it completely replaces the org template (no inheritance/merging of individual files).
Template directory structure#
At the root of the template directory (as specified by path), a template.toml file declares the template’s contents:
[dashboard]
template = "dashboard.html.jinja"
[dashboard.assets]
css = ["style.css"]
js = ["filter.js"]
images = ["logo.svg", "favicon.png"]
[error_404]
template = "404.html.jinja"
[error_404.assets]
css = ["style.css"]
images = ["logo.svg"]
A minimal template directory might look like:
template.toml
dashboard.html.jinja
404.html.jinja ← optional
style.css
logo.svg
filter.js
The [error_404] section is optional. If omitted, Docverse uses a built-in default 404 page. The Cloudflare Worker (or equivalent edge function) serves the 404 page when no object matches the requested path within a project.
Asset inlining#
All assets referenced in template.toml are inlined into the rendered HTML at render time, producing a single self-contained HTML file per page type (dashboard and 404). This eliminates the need for relative asset paths, simplifies CDN cache management (one URL to purge per page), and avoids asset/HTML version skew during updates.
The inlining strategy by file type:
- CSS (`.css`): inlined into `<style>` tags. Multiple CSS files are concatenated.
- JavaScript (`.js`): inlined into `<script>` tags. Multiple JS files are concatenated in declared order.
- SVG (`.svg`): inlined as raw SVG markup (preserving DOM interactivity and CSS styling).
- Raster images (`.png`, `.jpg`, `.gif`, `.webp`): base64-encoded as data URIs.
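A sketch of per-file inlining using only the standard library (illustrative; the MIME-type mapping by extension and the `inline_asset` helper are assumptions):

```python
import base64
from pathlib import Path


def inline_asset(path: Path) -> str:
    """Return the inlined form of one template asset.

    CSS and JS are returned as raw text for concatenation (the template
    wraps them in <style>/<script> tags); SVG is returned as raw markup;
    raster images become base64 data URIs.
    """
    suffix = path.suffix.lower()
    if suffix in {".css", ".js", ".svg"}:
        return path.read_text()
    if suffix in {".png", ".jpg", ".gif", ".webp"}:
        mime = "image/jpeg" if suffix == ".jpg" else f"image/{suffix[1:]}"
        data = base64.b64encode(path.read_bytes()).decode("ascii")
        return f"data:{mime};base64,{data}"
    raise ValueError(f"unsupported asset type: {suffix}")
```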
The renderer makes inlined assets available to the Jinja template as context variables. CSS and JS are provided as concatenated strings; images are provided as a dict keyed by filename (with dots and hyphens converted to underscores for template-friendly access):
<head>
<style>{{ assets.css }}</style>
</head>
<body>
<header>{{ assets.images.logo_svg }}</header>
<!-- ... -->
<script>{{ assets.js }}</script>
</body>
For a typical dashboard page (CSS + a small JS filter script + an SVG logo + a favicon), the inlined HTML should be 30–80KB — well within reason for a metadata page that users visit occasionally.
Jinja template context#
The template receives a maximalist context with all available project and edition metadata. Since rendering happens in a background job (not on the request path), there is no performance concern with assembling a rich context.
Context structure#
@dataclass
class DashboardContext:
org: OrgContext
project: ProjectContext
editions: EditionsContext
assets: AssetsContext
docverse: DocverseContext
rendered_at: datetime
@dataclass
class OrgContext:
slug: str
title: str
base_domain: str
url_scheme: str # "subdomain" or "path_prefix"
published_base_url: str # e.g. "https://lsst.io"
@dataclass
class ProjectContext:
slug: str
title: str
doc_repo: str # GitHub repo URL
published_url: str # root URL, e.g. "https://pipelines.lsst.io"
surrogate_key: str
date_created: datetime
@dataclass
class EditionsContext:
all: list[EditionContext]
main: EditionContext | None # the __main edition
releases: list[EditionContext] # kind=release, sorted semver descending
drafts: list[EditionContext] # kind=draft, sorted date_updated descending
major: list[EditionContext] # kind=major, sorted version descending
minor: list[EditionContext] # kind=minor, sorted version descending
alternates: list[EditionContext] # kind=alternate, sorted by title ascending
@dataclass
class EditionContext:
slug: str
title: str
kind: str
tracking_mode: str
tracking_params: dict # e.g. {"major_version": 2}
published_url: str
build: BuildContext | None
date_created: datetime
date_updated: datetime
lifecycle_exempt: bool
alternate_name: str | None # deployment/variant name, if scoped
@dataclass
class BuildContext:
id: str # Crockford Base32
git_ref: str
annotations: dict # client-provided metadata
uploader: str
object_count: int
total_size_bytes: int
date_created: datetime
date_uploaded: datetime
alternate_name: str | None # deployment/variant name, if any
@dataclass
class AssetsContext:
css: str # concatenated CSS
js: str # concatenated JS
images: dict[str, str] # filename_with_underscores → inlined content
@dataclass
class DocverseContext:
api_url: str
version: str
Pre-grouped editions#
The EditionsContext provides both the flat all list and pre-grouped convenience lists so template authors can iterate by category without writing filter logic. Each group is pre-sorted in the most natural order for display:
- `releases`: semver descending (newest release first)
- `drafts`: `date_updated` descending (most recently active first)
- `major` / `minor`: version descending
- `alternates`: `title` ascending
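The grouping and sort orders can be sketched as below, using plain dicts as stand-ins for `EditionContext`. The version parsing is deliberately simplified; the real implementation presumably uses a proper semver comparison (note that lexical sorting would put `v2.2` ahead of `v2.10`, which is why a numeric key is needed):

```python
def semver_key(slug: str) -> tuple[int, ...]:
    """Parse 'v2.3' / '2.3.1' into a numerically sortable tuple (sketch)."""
    return tuple(int(part) for part in slug.lstrip("v").split(".") if part.isdigit())


def group_editions(editions: list[dict]) -> dict[str, list[dict]]:
    """Pre-group and pre-sort editions by kind, as the context builder does."""
    releases = [e for e in editions if e["kind"] == "release"]
    drafts = [e for e in editions if e["kind"] == "draft"]
    alternates = [e for e in editions if e["kind"] == "alternate"]
    return {
        "releases": sorted(releases, key=lambda e: semver_key(e["slug"]), reverse=True),
        "drafts": sorted(drafts, key=lambda e: e["date_updated"], reverse=True),
        "alternates": sorted(alternates, key=lambda e: e["title"]),
    }
```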
Custom Jinja filters#
The rendering environment registers utility filters:
- `timesince`: relative time display (e.g., “3 hours ago”, “2 days ago”)
- `filesizeformat`: human-readable byte sizes (e.g., “1.2 MB”)
- `isoformat`: ISO 8601 datetime formatting
- `semver_sort`: sort a list of editions by semantic version
Template example#
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>{{ project.title }} — Documentation Editions</title>
<style>{{ assets.css }}</style>
</head>
<body>
<header>
{{ assets.images.logo_svg }}
<h1><a href="{{ project.published_url }}">{{ project.title }}</a></h1>
</header>
{% if editions.main and editions.main.build %}
<section class="current">
<h2>Current</h2>
<a href="{{ editions.main.published_url }}">
Latest ({{ editions.main.build.git_ref }})
</a>
<span class="meta">
Updated {{ editions.main.date_updated | timesince }}
· {{ editions.main.build.total_size_bytes | filesizeformat }}
</span>
</section>
{% endif %}
{% if editions.releases %}
<section class="releases">
<h2>Releases</h2>
{% for edition in editions.releases %}
<div class="edition-row">
<a href="{{ edition.published_url }}">{{ edition.slug }}</a>
<span class="ref">{{ edition.build.git_ref }}</span>
<span class="date">{{ edition.build.date_uploaded | isoformat }}</span>
</div>
{% endfor %}
</section>
{% endif %}
{% if editions.alternates %}
<section class="alternates">
<h2>Deployments</h2>
{% for edition in editions.alternates %}
<div class="edition-row">
<a href="{{ edition.published_url }}">{{ edition.title }}</a>
<span class="date">{{ edition.date_updated | timesince }}</span>
</div>
{% endfor %}
</section>
{% endif %}
{% if editions.drafts %}
<section class="drafts">
<details>
<summary>{{ editions.drafts | length }} draft(s)</summary>
{% for edition in editions.drafts %}
<div class="edition-row">
<a href="{{ edition.published_url }}">{{ edition.slug }}</a>
<span class="date">{{ edition.date_updated | timesince }}</span>
</div>
{% endfor %}
</details>
</section>
{% endif %}
<footer>
Generated by <a href="{{ docverse.api_url }}">Docverse {{ docverse.version }}</a>
at {{ rendered_at | isoformat }}
</footer>
<script>{{ assets.js }}</script>
</body>
</html>
Deployment-scoped draft editions carry an alternate_name field, which templates can use for filtering:
{# Drafts for a specific deployment #}
{% for edition in editions.drafts if edition.alternate_name == "usdf-dev" %}
<div class="edition-row">
<a href="{{ edition.published_url }}">{{ edition.slug }}</a>
<span class="badge">usdf-dev</span>
</div>
{% endfor %}
DashboardTemplate table#
| Column | Type | Description |
|---|---|---|
| `id` | int | PK |
| `org_id` | FK → Organization | Owning org |
| `project_id` | FK → Project (nullable) | If set, this is a project-level override |
| `github_owner` | str | GitHub org/user |
| `github_repo` | str | Repository name |
| `path` | str | Path within repo (default …) |
| `git_ref` | str | Branch/tag to track |
| `store_prefix` | str (nullable) | Object store prefix for current synced files |
| `sync_id` | str (nullable) | Current sync version identifier (timestamp-based) |
| `date_synced` | datetime (nullable) | Last successful sync |
Unique constraint on (org_id, project_id) — at most one template configuration per org (where project_id is null) and one per project. See DashboardTemplate in the database schema section for the column reference within the full schema.
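One implementation nuance: a plain composite `UNIQUE (org_id, project_id)` constraint in Postgres would not enforce the “one org-level template” rule, because rows with NULL `project_id` never compare equal. A pair of partial unique indexes (or, on Postgres 15+, `UNIQUE NULLS NOT DISTINCT`) is one way to realize it. The sketch below demonstrates the idea with SQLite, which supports the same partial-index syntax, using an abbreviated column set:

```python
import sqlite3

# Why "one template per org" needs a partial unique index: NULLs are
# distinct under a plain UNIQUE constraint. Column set abbreviated.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dashboard_template (
        id INTEGER PRIMARY KEY,
        org_id INTEGER NOT NULL,
        project_id INTEGER
    );
    -- At most one org-level template (project_id IS NULL) per org.
    CREATE UNIQUE INDEX uq_template_org
        ON dashboard_template (org_id) WHERE project_id IS NULL;
    -- At most one project-level override per project.
    CREATE UNIQUE INDEX uq_template_project
        ON dashboard_template (project_id) WHERE project_id IS NOT NULL;
    """
)
conn.execute("INSERT INTO dashboard_template (org_id, project_id) VALUES (1, NULL)")
try:
    conn.execute("INSERT INTO dashboard_template (org_id, project_id) VALUES (1, NULL)")
except sqlite3.IntegrityError:
    duplicate_rejected = True  # second org-level template for org 1 is refused
```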
Template sync#
When a GitHub push webhook fires on a tracked template repo/ref, Docverse enqueues a dashboard_sync job. The sync process:
1. Match webhook to templates: query the `DashboardTemplate` table for rows matching the `github_owner`/`github_repo` and where the push ref matches the tracked `git_ref`. Check whether any changed files fall within the tracked `path` prefix. If no rows match, the webhook is ignored.
2. Fetch template files: use the GitHub Contents API (via the Safir GitHub App client or a Gafaelfawr-delegated token) to download the `template.toml` and all referenced template/asset files from the matched directory at the pushed commit SHA.
3. Write to object store: upload the fetched files to `__templates/{org_slug}/{sync_id}/` in the org’s object store bucket, where `sync_id` is a timestamp-based identifier (e.g., `20260208T120000`). This creates a versioned snapshot alongside any previous syncs.
4. Update database: update the `DashboardTemplate` row with the new `store_prefix`, `sync_id`, and `date_synced`.
5. Re-render dashboards: for an org-level template, re-render dashboards for all projects in the org (parallelized via `asyncio.gather()`). For a project-level template, re-render only that project’s dashboard. Each render reads the template files from the object store snapshot, assembles the Jinja context from the database, inlines assets, renders, and uploads the output HTML.
6. Clean up previous sync: the previous sync directory in the object store is marked for purgatory cleanup after a retention period. This provides rollback capability — if a new template is broken, an operator can revert the `store_prefix` in the database to the previous `sync_id` and re-render.
Docverse receives webhooks through two supported channels:
Direct GitHub App webhooks: Docverse acts as a GitHub App using the Safir GitHub App framework, receiving push events directly.
Kafka via Squarebot: Squarebot receives GitHub webhooks and publishes them to Kafka topics. Docverse consumes the relevant topics via FastStream. This allows sharing a GitHub App installation across multiple internal services.
Both channels feed into the same dashboard_sync job logic.
Rendered output storage#
The dashboard rendering pipeline produces several static files stored at well-known paths in the project’s object store area:
- Dashboard: `{project_slug}/__dashboard.html`
- 404 page: `{project_slug}/__404.html`
- Version switcher: `{project_slug}/__switcher.json`
- Edition metadata: `{project_slug}/__editions/{edition_slug}.json` (one per edition)
The double-underscore prefix on these filenames prevents collisions with edition slugs. The only edition slug that uses the `__` prefix is the reserved `__main`, and it appears only in URL paths such as `/v/__main/`, never as a filename at the project root.
The Cloudflare Worker (or equivalent edge function) maps URL paths to these files:
- `project.lsst.io/v/` or `project.lsst.io/v/index.html` → serves `{project_slug}/__dashboard.html`
- `project.lsst.io/v/switcher.json` → serves `{project_slug}/__switcher.json`
- `project.lsst.io/v/{edition}/_docverse.json` → serves `{project_slug}/__editions/{edition_slug}.json`
- Any request that resolves to no object → serves `{project_slug}/__404.html` with a 404 status code
Version switcher and edition metadata JSON#
In addition to the dashboard HTML, Docverse generates static JSON files that client-side JavaScript in the published documentation can consume. These files are generated by the same rendering pipeline as the dashboard and are re-rendered on the same triggers.
Version switcher JSON (__switcher.json)#
The pydata-sphinx-theme version switcher loads a JSON file to populate its version dropdown. Docverse generates this file at {project_slug}/__switcher.json, served at project.lsst.io/v/switcher.json.
The file follows the pydata-sphinx-theme’s expected format – a JSON array of version entries:
[
{
"name": "Latest (main)",
"version": "__main",
"url": "https://pipelines.lsst.io/",
"preferred": true
},
{
"name": "v2.3",
"version": "v2.3",
"url": "https://pipelines.lsst.io/v/v2.3/"
},
{
"name": "v2.2",
"version": "v2.2",
"url": "https://pipelines.lsst.io/v/v2.2/"
}
]
Field mapping from Docverse’s edition model:
| Switcher field | Source | Description |
|---|---|---|
| `name` | edition title or formatted slug | Display label in the dropdown |
| `version` | edition slug | Used by the theme for matching the current version |
| `url` | edition `published_url` | Link target |
| `preferred` | `true` for `__main` and alternate editions | Marks the recommended/stable version |
By default, the switcher JSON includes the `__main` edition and all release, major, minor, and alternate editions, sorted with `__main` first, then alternates alphabetically, then by version descending. Draft editions are excluded by default to keep the dropdown focused on stable versions. Alternate editions are included because they represent long-lived deployment targets that users need to navigate between. Both `__main` and alternate editions get `preferred: true`, since each represents a canonical view for its context.
This behavior is configurable via template.toml:
[switcher]
include_kinds = ["main", "release", "major", "alternate"] # default; add "draft" to include drafts
In conf.py, projects point the pydata-sphinx-theme at the Docverse-generated file:
html_theme_options = {
"switcher": {
"json_url": "https://pipelines.lsst.io/v/switcher.json",
"version_match": version, # matches against the "version" field
}
}
Per-edition metadata JSON (__editions/{slug}.json)#
Each edition gets a small metadata JSON file that client-side JavaScript in the published documentation can fetch to determine which edition is currently being viewed, whether it’s the canonical version, and where the canonical version lives. These files are stored at {project_slug}/__editions/{edition_slug}.json and served at project.lsst.io/v/{edition}/_docverse.json via a Worker URL rewrite.
This approach stores the per-edition files in a separate object store path (__editions/) rather than injecting them into the build’s immutable content. In pointer mode, the edition’s URL space serves content from the build’s object store prefix, so writing metadata into the build would violate immutability. The Worker recognizes requests for _docverse.json within an edition path and rewrites them to the corresponding __editions/ file. In copy mode, the file can optionally be copied into the edition prefix alongside the build content.
Example for a draft edition:
{
"project": {
"slug": "pipelines",
"title": "LSST Science Pipelines",
"published_url": "https://pipelines.lsst.io/"
},
"edition": {
"slug": "DM-12345",
"title": "DM-12345",
"kind": "draft",
"published_url": "https://pipelines.lsst.io/v/DM-12345/",
"tracking_mode": "git_ref",
"date_updated": "2026-02-08T12:00:00Z"
},
"canonical_url": "https://pipelines.lsst.io/",
"is_canonical": false,
"switcher_url": "https://pipelines.lsst.io/v/switcher.json",
"dashboard_url": "https://pipelines.lsst.io/v/"
}
Example for the __main (canonical) edition:
{
"project": {
"slug": "pipelines",
"title": "LSST Science Pipelines",
"published_url": "https://pipelines.lsst.io/"
},
"edition": {
"slug": "__main",
"title": "Latest",
"kind": "main",
"published_url": "https://pipelines.lsst.io/",
"tracking_mode": "git_ref",
"date_updated": "2026-02-10T08:30:00Z"
},
"canonical_url": "https://pipelines.lsst.io/",
"is_canonical": true,
"switcher_url": "https://pipelines.lsst.io/v/switcher.json",
"dashboard_url": "https://pipelines.lsst.io/v/"
}
The canonical_url always points to the __main edition’s published_url. Client-side JavaScript can use this metadata to:
- Display “you’re viewing a draft” or “you’re viewing an older release” banners with a link to the canonical version.
- Inject `<link rel="canonical" href="...">` tags for SEO.
- Integrate with the version switcher to highlight the current edition.
- Show the edition kind and last-updated date in the page footer.
The switcher_url and dashboard_url fields provide stable references to the project’s other Docverse-generated resources, so client JS doesn’t need to construct URLs.
Re-render triggers#
Dashboard re-rendering is triggered by several events, all funneled through the task queue:
| Event | Trigger mechanism | Scope |
|---|---|---|
| Template repo push | GitHub webhook → `dashboard_sync` job | All projects using that template |
| Build processing completes | Final step of `build_processing` job | Single project |
| Edition created/deleted/updated | Final step of the edition update job | Single project |
| Project metadata changed | Enqueued by PATCH handler | Single project |
| Manual re-render | Admin API endpoint (if needed) | Single project or all org projects |
For single-project re-renders triggered within other jobs (build processing, edition update), the render is performed inline as the final step — no separate job is enqueued. Only template syncs (which affect multiple projects) spawn their own dashboard_sync job.
Multiple triggers can race on the same project’s dashboard files (e.g., two build_processing jobs, or a build_processing and a dashboard_sync job running concurrently). Docverse uses Postgres advisory locks at the project and edition level to serialize these writes. See Cross-job serialization for the locking strategy.
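Postgres’s two-argument advisory locks take a pair of signed int32 keys, so string-valued scopes have to be hashed down to integers. A hypothetical key-derivation sketch (the actual scheme used by Docverse is not specified here):

```python
import zlib


def advisory_lock_keys(scope: str, identifier: str) -> tuple[int, int]:
    """Derive a signed int32 key pair for pg_advisory_xact_lock(k1, k2) (sketch)."""

    def to_int32(value: int) -> int:
        # zlib.crc32 returns an unsigned 32-bit value; Postgres wants signed.
        return value - 2**32 if value >= 2**31 else value

    return (
        to_int32(zlib.crc32(scope.encode())),
        to_int32(zlib.crc32(identifier.encode())),
    )


# Usage inside a job's transaction (blocks until the lock is free), e.g.:
#   SELECT pg_advisory_xact_lock($1, $2)
# with the two keys bound from advisory_lock_keys("project", project_slug).
```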
Code architecture#
Docverse follows a hexagonal (ports-and-adapters) architecture where domain logic is isolated from infrastructure concerns. Storage backends are swappable per-organization, and services interact only with protocol interfaces — never with concrete implementations directly.
Layered architecture#
Docverse follows the layered architecture pattern established in Ook:
dbschema: SQLAlchemy models (database tables).
domain: domain models (dataclasses/Pydantic), business logic, protocol definitions. Storage-agnostic.
handlers: FastAPI route handlers, FastStream (Kafka) handlers. Thin — validate, resolve context, delegate to services.
services: orchestration layer. Services coordinate storage backends and domain logic. Called by handlers.
storage: interfaces to external data stores — databases, object stores, CDN APIs, DNS APIs, GitHub API.
CDN and object store abstractions live in the storage layer. Protocol classes defining their interfaces live in the domain layer.
Factory pattern#
Following the Ook factory pattern, Docverse uses a Factory class that is provided as a FastAPI dependency to handlers.
The factory combines process-level singletons (held in a ProcessContext) with a request-scoped database session to construct services and storage clients on demand.
For Docverse’s multi-tenant architecture, the factory additionally handles org-specific client construction. The flow:
1. A handler receives a request scoped to an organization (resolved from URL path or project slug).
2. The handler gets the `Factory` from FastAPI’s dependency injection.
3. The factory loads the org’s configuration from the database.
4. `factory.create_object_store(org)` inspects the org’s provider config and returns the correct implementation (S3 client with that org’s bucket/credentials, or a GCS client, etc.).
5. `factory.create_cdn_client(org)` does the same for the CDN provider.
6. `factory.create_edition_service(org)` wires up the org-specific object store, CDN client, and database stores into the edition update service.
Org-specific clients are cached within a request using a dict keyed by org ID inside the factory instance, so multiple operations on the same org within a single request reuse the same clients.
Open design question — cross-request client pooling: Org-specific clients (and their underlying connection pools) could potentially be cached in ProcessContext across requests, lazily created and keyed by org ID. This would avoid per-request connection setup overhead for frequently-accessed orgs. Needs exploration around credential rotation, memory, and stale-client concerns.
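A structural sketch of the per-request caching, with illustrative class and provider names (not the real Docverse API):

```python
from dataclasses import dataclass, field
from typing import Protocol


class ObjectStore(Protocol):
    def bucket_name(self) -> str: ...


@dataclass
class S3ObjectStore:
    bucket: str

    def bucket_name(self) -> str:
        return self.bucket


@dataclass
class GcsObjectStore:
    bucket: str

    def bucket_name(self) -> str:
        return self.bucket


@dataclass
class Org:
    id: int
    provider: str  # "s3" or "gcs"
    bucket: str


@dataclass
class Factory:
    """Request-scoped factory: org clients cached by org ID for reuse."""

    _object_stores: dict[int, ObjectStore] = field(default_factory=dict)

    def create_object_store(self, org: Org) -> ObjectStore:
        if org.id not in self._object_stores:
            cls = S3ObjectStore if org.provider == "s3" else GcsObjectStore
            self._object_stores[org.id] = cls(org.bucket)
        return self._object_stores[org.id]
```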
Object store abstraction#
A protocol class in the domain layer defines the interface that all object store implementations must satisfy. Concrete implementations in the storage layer:
- S3-compatible: uses `aiobotocore` (async). Covers AWS S3, generic S3-compatible stores (MinIO, etc.), and Cloudflare R2 (which exposes an S3-compatible API).
- GCS: uses `gcloud-aio-storage` (async). GCS’s API is distinct enough from S3 to warrant a separate implementation rather than forcing it through an S3-compatibility shim.
A factory method in the storage layer (called by the Factory) instantiates the correct implementation based on the org’s provider configuration.
Services only interact with the protocol — they are backend-agnostic.
Cloudflare R2 is accessed via the S3-compatible implementation. R2’s S3 API compatibility is comprehensive enough that a dedicated implementation is not required. The key difference is R2’s zero egress fees, which is a cost consideration rather than an API difference.
Object store interface#
The protocol defines these operations:
Individual operations: upload object, copy object, move object (to purgatory prefix), delete object, get object metadata, generate presigned URL (for client uploads).
Bulk operations (critical for performance): batch copy (for edition updates involving 500+ objects), batch move (for purgatory), batch delete (for hard-deleting purgatory contents). Bulk methods accept lists of objects and handle parallelism internally using asyncio semaphores. S3 implementations can additionally leverage S3-specific batch APIs where available.
Listing: list objects under a key prefix.
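The semaphore-bounded parallelism for bulk operations can be sketched generically, with `copy_one` standing in for a backend-specific single-object copy:

```python
import asyncio
from collections.abc import Awaitable, Callable, Sequence


async def batch_copy(
    copy_one: Callable[[str, str], Awaitable[None]],
    pairs: Sequence[tuple[str, str]],
    max_concurrency: int = 50,
) -> None:
    """Copy many objects with bounded parallelism (sketch)."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(src: str, dst: str) -> None:
        # At most max_concurrency copies run at once.
        async with semaphore:
            await copy_one(src, dst)

    await asyncio.gather(*(guarded(src, dst) for src, dst in pairs))
```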
CDN abstraction#
The CDN abstraction follows the same pattern: protocol in domain, implementations in storage, factory-constructed per org.
The CDN interface exposes domain-meaningful operations rather than provider-specific primitives:
Purge edition: invalidate all cached content for a specific edition of a project.
Purge build: invalidate cached content for a build (used during build deletion).
Purge dashboard: invalidate the cached dashboard page for a project.
Update edition mapping: write a new edition→build mapping to the edge data store (pointer mode only; no-op for copy-mode CDNs).
Each CDN implementation translates these operations into provider-specific API calls:
Fastly: purge by surrogate key (a single API call invalidates all objects tagged with the edition’s surrogate key). Edition mappings via Fastly KV Store API.
Cloudflare Workers: purge by URL or embed build ID in cache keys so that mapping changes implicitly invalidate old entries. Edition mappings via Workers KV API.
Google Cloud CDN: purge by URL pattern. No edge data store (copy mode only).
The edition/build/product domain objects (or their identifiers) are passed to the purge methods. The implementation derives whatever provider-specific identifiers it needs (surrogate keys, URL prefixes, URL patterns, KV keys) from the domain objects and org configuration.
CDN implementations declare their capabilities via properties on the class.
The key capability is pointer mode support (supports_pointer_mode): CDNs with programmable edge compute and an edge data store can resolve the edition→build mapping at the edge, while CDNs without edge compute fall back to copy mode.
The edition update service checks supports_pointer_mode at runtime to select the pointer mode or copy mode code path.
Queue backend abstraction#
The queue backend follows the same protocol-based pattern as the object store and CDN abstractions. A protocol class in the domain layer defines the interface for enqueuing jobs and querying job metadata, while concrete implementations in the storage layer adapt specific queue technologies.
The initial implementation uses Arq via Safir’s ArqQueue with Redis as the message transport.
Safir provides both the production ArqQueue and a MockArqQueue for testing, which aligns with Docverse’s existing testing patterns.
The QueueJob Postgres table remains the authoritative state store — the queue backend is treated as a delivery mechanism only.
See the Queue backend abstraction section in the queue design for the full protocol definition, implementation details, and infrastructure notes.
Service layer#
Services are the orchestration layer that ties storage backends, domain logic, and database stores together. They are called by thin handlers that validate input, resolve context (e.g., which org the request targets), and delegate to the service.
For example, EditionService wires together the org-specific object store, CDN client, and database stores to coordinate edition updates.
When pointer mode is available, it writes a KV mapping and purges the CDN cache.
When copy mode is required, it computes a diff against the build inventory and performs the ordered in-place update.
Services are also invoked from background jobs processed by the Task queue design system — the same service logic applies whether the caller is an HTTP handler or a queue worker.
Beyond the server-side architecture, Docverse’s codebase includes a Python client library and CLI that share the repository with the server. This monorepo structure and the client-owned model pattern are integral to maintaining type safety across the API boundary.
Client-server monorepo#
The Python client and the Docverse server share a single Git repository, following the monorepo pattern established by Squarebot.
The server package lives at the repository root (matching the Ook and Safir conventions), which simplifies Docker builds and makes nox-uv’s @session decorator work naturally:
docverse/ # lsst-sqre/docverse
├── pyproject.toml # workspace root = server package
├── uv.lock # single lockfile for entire workspace
├── noxfile.py # shared dev tooling
├── ruff-shared.toml # shared Ruff configuration
├── Dockerfile
├── client/
│ ├── pyproject.toml # docverse-client (PyPI library)
│ └── src/
│ └── docverse/
│ └── client/
│ ├── __init__.py
│ ├── _client.py # DocverseClient
│ ├── _cli.py # CLI entry point
│ ├── _models.py # Pydantic request/response models
│ └── _tar.py # Tarball creation utility
├── src/
│ └── docverse/
│ ├── ...
│ └── models/ # imports and extends client models
└── tests/
| Attribute | Client | Server |
|---|---|---|
| PyPI name | `docverse-client` | `docverse` |
| Import path | `docverse.client` | `docverse` |
| Package style | Namespace package (`docverse.client`) | Namespace package (`docverse`) |
| Location | `client/` subdirectory | Repository root |
The monorepo is motivated by four concerns:
- Atomic model changes: Pydantic models that define the API contract live in the client package. When an endpoint’s request or response shape changes, the model update and the server handler update land in the same commit, eliminating version skew.
- Shared dev tooling: a single noxfile orchestrates linting, type-checking, and testing for both packages. The uv workspace handles editable installs of both packages automatically via `uv sync`.
- Unidirectional dependency: the server depends on the client package (for the shared models); the client never imports from the server. This keeps the client lightweight and installable in CI without pulling in FastAPI, SQLAlchemy, or Redis.
- Lessons from LTD: in the LTD era, the API server (`ltd-keeper`) and upload client (`ltd-conveyor`) lived in separate repositories with independently defined models. The monorepo eliminates model drift between the two.
The docverse-upload GitHub Action is intentionally not part of this monorepo.
The monorepo’s “atomic model changes” benefit applies to Python packages that share Pydantic models at import time; the TypeScript action consumes the API through a generated OpenAPI spec, not through Python imports, so co-location provides no additional type-safety benefit.
Keeping the action in its own repository also avoids Git tag ambiguity — the monorepo already uses prefix-scoped tags (client/v1.0.0) for the Python client, and adding v1-style tags for the action would conflict with any future need for repository-level semver tags.
The OpenAPI spec serves as the contract bridge between the two repositories (see OpenAPI-driven TypeScript development), and spec changes are explicitly visible in action-repo PR diffs, providing a clear review point for API evolution.
Dependency management#
The monorepo uses a uv workspace with a single uv.lock at the repository root that replaces the traditional server/requirements.txt pattern.
Workspace lockfile.
A single uv.lock locks all dependencies for both the server and client packages.
uv computes the workspace’s requires-python as the intersection of all members: if the client declares >=3.12 and the server declares >=3.13, the lockfile resolves at >=3.13.
The client’s pyproject.toml still declares >=3.12 for PyPI consumers — the lockfile constraint only affects the development environment.
Server (pyproject.toml at the repository root):
- `requires-python = ">=3.13"` (single target version).
- Dependencies locked by `uv.lock` for reproducible Docker builds.
- Dependency groups for dev tooling: `dev` (test dependencies), `lint` (pre-commit, ruff), `typing` (mypy and stubs), `nox` (nox, nox-uv).
- `[tool.uv.sources]` maps `docverse-client` to the workspace member, so `uv sync` installs the local client in editable mode.
Client (client/pyproject.toml):
- `requires-python = ">=3.12"` (broad range for library consumers).
- Dependencies use broad version ranges appropriate for a PyPI library.
- Dependency groups mirror the server’s pattern: `dev` (pytest, respx, and other test dependencies). Unlocked nox sessions read these groups via `nox.project.dependency_groups()` so the noxfile never duplicates dependency lists.
Testing matrix#
The nox sessions use two distinct mechanisms to control dependency resolution:
| Session | Mechanism | Resolution | Python | Purpose |
|---|---|---|---|---|
| … | `nox_uv.session` | Locked (`uv.lock`) | 3.13 | Server tests — same deps as Docker |
| … | `nox_uv.session` | Locked (`uv.lock`) | 3.13 | Client tests with lockfile deps |
| `client_test_compat` | `nox.session` + `session.install()` | Highest (unlocked) | 3.12, 3.13 | Client as PyPI users install it |
| … | `nox.session` + `session.install()` | Lowest (unlocked) | 3.12 | Validates client’s lower bounds |
| … | `nox_uv.session` | Locked (`uv.lock`) | 3.13 | Pre-commit hooks |
| … | `nox_uv.session` | Locked (`uv.lock`) | 3.13 | mypy on both packages |
The key distinction: locked sessions use nox_uv.session (which calls uv sync under the hood); compatibility sessions use standard nox.session with session.install() (which calls uv pip install, bypassing the workspace lockfile entirely).
This separation lets the server pin exact versions for reproducibility while the client is tested against the same range of environments its PyPI users will encounter.
import nox
CLIENT_PYPROJECT = nox.project.load_toml("client/pyproject.toml")
@nox.session(python=["3.12", "3.13"])
def client_test_compat(session: nox.Session) -> None:
"""Test the client with unlocked highest dependencies."""
session.install(
"./client",
*nox.project.dependency_groups(CLIENT_PYPROJECT, "dev"),
)
session.run("pytest", "client/tests/")
Docker build#
The Docker build uses a two-stage pattern that separates dependency installation (layer-cached) from application code:
# Install locked dependencies (cached unless uv.lock changes)
COPY pyproject.toml uv.lock ./
COPY client/pyproject.toml client/pyproject.toml
RUN uv sync --frozen --no-default-groups --no-install-workspace
# Install workspace members without re-resolving
COPY client/ client/
COPY src/ src/
RUN uv pip install --no-deps ./client .
uv sync --frozen --no-default-groups --no-install-workspace installs only the locked production dependencies without development groups or the workspace packages themselves.
Both pyproject.toml files must be present for uv to validate the workspace structure.
The subsequent uv pip install --no-deps installs the actual workspace packages without triggering a new resolution.
Release workflow#
- Client (`docverse` monorepo): released to PyPI on `client/v*` tags (e.g., `client/v1.2.0`). The GitHub Actions publish workflow runs `uv build --no-sources --package docverse-client` and uploads to PyPI. The `--no-sources` flag disables workspace source overrides so the built distribution references PyPI package names, not local paths.
- Server (`docverse` monorepo): released as a Docker image on bare semver tags (e.g., `1.2.0`), following the SQuaRE convention of omitting the `v` prefix. The GitHub Actions workflow builds and pushes the Docker image. Phalanx Helm charts reference the tagged image version.
Version metadata#
Both packages use setuptools_scm to derive their version from Git tags.
Because two independent tag namespaces coexist in one repository, each package scopes its tag matching with both tag_regex (for parsing) and a custom describe_command with --match (for discovery).
Both mechanisms must be used together — tag_regex alone is not sufficient because git describe would still pick up the wrong tag.
Server (pyproject.toml at repository root):
[project]
dynamic = ["version"]
[tool.setuptools_scm]
fallback_version = "0.0.0"
tag_regex = '(?P<version>\d+(?:\.\d+)*)$'
[tool.setuptools_scm.scm.git]
describe_command = [
"git", "describe", "--dirty", "--tags", "--long",
"--abbrev=40", "--match", "[0-9]*",
]
- `--match "[0-9]*"` ensures `git describe` only considers bare semver tags, excluding `client/v*` tags.
- `tag_regex` matches bare `1.2.0` without a `v` prefix.
- `fallback_version` provides a version before the first tag exists.
Client (client/pyproject.toml):
[project]
dynamic = ["version"]
[tool.setuptools_scm]
root = ".."
fallback_version = "0.0.0"
tag_regex = '^client/(?P<version>[vV]?\d+(?:\.\d+)*)$'
[tool.setuptools_scm.scm.git]
describe_command = [
"git", "describe", "--dirty", "--tags", "--long",
"--abbrev=40", "--match", "client/v*",
]
- `root = ".."` points setuptools_scm to the repository root, since `client/` is a subdirectory.
- `--match "client/v*"` ensures `git describe` only considers client-prefixed tags.
- `tag_regex` strips the `client/` prefix when parsing the version.
- The slash separator in the tag prefix (`client/v1.2.0`) avoids the dash-parsing ambiguity that setuptools_scm historically had with dashed prefixes (e.g., `client-v1.2.0` could be misinterpreted as a version component separator).
Changelog management#
Both packages use scriv (>= 1.8.0) with separate configuration files and fragment directories, so each package can collect changelog entries independently at its own release cadence.
Layout:
docverse/
├── scriv-client.ini # scriv config for client
├── scriv-server.ini # scriv config for server
├── client/
│ ├── changelog.d/ # client fragments
│ └── CHANGELOG.md # client changelog
├── server-changelog.d/ # server fragments (root level)
└── CHANGELOG.md # server changelog
The scriv configuration files use .ini format because scriv’s --config flag only accepts .ini files, not TOML.
This means the configuration cannot live in the respective pyproject.toml files as [tool.scriv] sections.
scriv-server.ini:

[scriv]
fragment_directory = server-changelog.d
changelog = CHANGELOG.md
format = md

scriv-client.ini:

[scriv]
fragment_directory = client/changelog.d
changelog = client/CHANGELOG.md
format = md
Usage, wrapped in nox targets for convenience:
$ scriv create --config scriv-client.ini # new client fragment
$ scriv create --config scriv-server.ini # new server fragment
$ scriv collect --config scriv-client.ini --version 1.2.0 # collect at release
$ scriv collect --config scriv-server.ini --version 1.2.0 # collect at release
Client-owned Pydantic models#
The client package owns every Pydantic model that appears in API request and response payloads. The server imports these models and can subclass them to add server-side concerns (database constructors, internal validators), but the wire format is always defined in the client.
from pydantic import BaseModel, HttpUrl
class BuildCreate(BaseModel):
"""Request body for POST /orgs/:org/projects/:project/builds."""
git_ref: str
content_hash: str
alternate_name: str | None = None
annotations: dict[str, str] | None = None
class BuildResponse(BaseModel):
"""Response from build endpoints."""
self_url: HttpUrl
project_url: HttpUrl
id: str
status: str
git_ref: str
upload_url: HttpUrl | None = None
queue_url: HttpUrl | None = None
date_created: str
from docverse.client._models import BuildCreate as ClientBuildCreate
from ..db import Build as BuildRow
class BuildCreate(ClientBuildCreate):
"""Server-side build creation with DB integration."""
async def to_db_row(self, project_id: int) -> BuildRow:
return BuildRow(
project_id=project_id,
git_ref=self.git_ref,
content_hash=self.content_hash,
alternate_name=self.alternate_name,
)
This pattern provides:
Single source of truth: one model definition governs both serialization (client) and deserialization (server).
Automatic OpenAPI alignment: FastAPI generates the OpenAPI spec from these Pydantic models. The TypeScript types for the GitHub Action are generated from that same spec (see OpenAPI-driven TypeScript development), creating a chain of consistency from Python to TypeScript.
Safe refactoring: renaming a field or adding a required attribute is a type error in both packages immediately, caught by mypy before CI even runs tests.
Python client library#
DocverseClient is an async HTTP client built on httpx that implements the full upload workflow.
It follows HATEOAS navigation — the client discovers endpoint URLs from the API root and resource responses rather than hardcoding paths.
Key methods#
| Method | Description |
|---|---|
| `create_tarball()` | Create a `.tar.gz` archive of the built docs and compute its content hash |
| `create_build()` | Create a build record (returns a presigned upload URL) |
| `upload_tarball()` | Upload the tarball to the presigned URL |
| `complete_upload()` | Signal upload complete, enqueueing build processing |
| `wait_for_job()` | Poll the queue job with exponential backoff until terminal state |
Upload workflow#
from docverse.client import DocverseClient, create_tarball
async with DocverseClient(
base_url="https://docverse.lsst.io",
token=os.environ["DOCVERSE_TOKEN"],
) as client:
# 1. Create a tarball from the built docs directory
tarball_path, content_hash = create_tarball("_build/html")
# 2. Create a build record (returns presigned upload URL)
build = await client.create_build(
org="rubin",
project="pipelines",
git_ref="tickets/DM-12345",
content_hash=content_hash,
)
# 3. Upload the tarball to the presigned URL
await client.upload_tarball(
upload_url=build.upload_url,
tarball_path=tarball_path,
)
# 4. Signal upload complete (enqueues processing)
job_url = await client.complete_upload(build.self_url)
# 5. Wait for processing to finish
job = await client.wait_for_job(job_url)
print(f"Build {build.id} processed: {job.status}")
Authentication#
The client authenticates via a Gafaelfawr token passed either as the token constructor argument or through the DOCVERSE_TOKEN environment variable.
The token is sent as a bearer token in the Authorization header on every request.
For CI pipelines, this token is typically a bot token stored as a CI secret (see Authentication and authorization for how bot tokens are provisioned).
Tarball creation#
The create_tarball() utility function creates a .tar.gz archive from a directory and computes its SHA-256 content hash:
def create_tarball(source_dir: str | Path) -> tuple[Path, str]:
"""Create a .tar.gz of source_dir; return (path, 'sha256:...' hash)."""
The tarball is written to a temporary file. The content hash is computed during archive creation (streaming through a hashlib.sha256 digest) so the file is read only once.
Job polling#
wait_for_job() polls the queue job endpoint with exponential backoff (initial interval 1 s, max interval 15 s, jitter).
It returns when the job reaches a terminal state (completed or failed).
On failure, it raises BuildProcessingError with the failure details from the job response.
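The backoff schedule might be generated like this; the doubling factor and jitter range are assumptions, only the 1 s initial interval and 15 s cap come from the description above.

```python
import random

def backoff_intervals(initial: float = 1.0, maximum: float = 15.0, factor: float = 2.0):
    """Yield polling intervals: exponential growth capped at `maximum`,
    with up to 10% additive jitter (the jitter range is illustrative)."""
    interval = initial
    while True:
        yield interval + random.uniform(0, interval * 0.1)
        interval = min(interval * factor, maximum)
```

`wait_for_job()` would sleep for each yielded interval between polls until the job reports `completed` or `failed`.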
Command-line interface#
The docverse upload CLI command wraps the Python client library for use in shell scripts and CI pipelines where the GitHub Action is not applicable (e.g., Jenkins, GitLab CI, local builds).
Usage#
$ docverse upload \
--org rubin \
--project pipelines \
--git-ref tickets/DM-12345 \
--dir _build/html
Options#
| Flag | Env var | Default | Description |
|---|---|---|---|
| `--org` | — | — | Organization slug |
| `--project` | — | — | Project slug |
| `--git-ref` | — | current Git HEAD | Git ref for the build |
| `--dir` | — | — | Path to the built documentation directory |
| `--token` | `DOCVERSE_TOKEN` | — | Gafaelfawr authentication token |
| `--base-url` | — | `https://docverse.lsst.io` | Docverse API base URL |
| `--alternate-name` | — | — | Alternate name for scoped editions |
| `--no-wait` | — | wait enabled | Return immediately after signaling upload |
Exit codes#
| Code | Meaning |
|---|---|
| 0 | Build uploaded and processed successfully |
| 1 | Failure (authentication error, upload error, job failed) |
| 2 | Upload succeeded but job had warnings (partial success) |
When --no-wait is used, exit code 0 means the upload was accepted and processing was enqueued — the CLI does not wait for the job to finish.
Task queue design#
Design philosophy#
Docverse interacts with the job queue through a backend-agnostic abstraction layer (see Queue backend abstraction). The initial implementation uses Arq via Safir’s ArqQueue with Redis as the message transport. The queue backend handles delivery, retries, and worker dispatch. All orchestration and parallelism within a job is handled by Docverse’s service layer using standard Python asyncio. This minimizes coupling to any specific queue technology and keeps the business logic testable with plain async functions.
Each user-facing operation that triggers background work results in a single background job. The job’s worker function calls through the service layer, which coordinates the steps internally. Where steps are independent, the service layer uses asyncio.gather() to parallelize them.
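The single-job orchestration with parallel independent steps can be sketched with plain asyncio; the function names below are hypothetical stand-ins for the service layer.

```python
import asyncio

async def update_edition(slug: str) -> str:
    """Stand-in for one edition update; fails for a designated slug."""
    if slug == "broken":
        raise RuntimeError("R2 timeout")
    return slug

async def run_editions_phase(slugs: list[str]) -> tuple[list[str], list[str]]:
    """Run independent edition updates in parallel inside one job,
    collecting failures instead of aborting the whole phase."""
    results = await asyncio.gather(
        *(update_edition(s) for s in slugs), return_exceptions=True
    )
    ok = [r for r in results if isinstance(r, str)]
    failed = [s for s, r in zip(slugs, results) if isinstance(r, Exception)]
    return ok, failed

ok, failed = asyncio.run(run_editions_phase(["__main", "broken", "v2.x"]))
```

Because `return_exceptions=True` keeps `gather()` from raising on the first failure, the job can finish the surviving editions and report a partial result.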
QueueJob table#
Docverse maintains its own QueueJob table in Postgres as the single source of truth for job state and progress. This table serves the user-facing queue API, operator dashboards, and internal coordination (e.g., detecting conflicting concurrent edition updates). The queue backend’s internal state is not queried directly for status — Docverse treats the backend as a delivery mechanism only. See QueueJob in the database schema section for the column reference within the full schema.
| Column | Type | Description |
|---|---|---|
| `id` | int | Internal PK |
|  | int | Crockford Base32 serialized in API |
| `backend_job_id` | str (nullable) | Reference to the queue backend's job ID (e.g., Arq UUID) |
| `status` | enum | `queued`, `in_progress`, `completed`, `completed_with_errors`, `failed` |
| `kind` | enum | Job type (e.g., `build_processing`, `edition_update`, `dashboard_sync`) |
| `phase` | str (nullable) | Current phase: `inventory`, `tracking`, `editions`, `dashboard` |
| `org_id` | FK → Organization | Scoped to org (for operator filtering) |
| `project_id` | FK → Project (nullable) | Set for build/edition jobs |
| `build_id` | FK → Build (nullable) | Set for build processing jobs |
| `progress` | JSONB (nullable) | Structured progress data, phase-specific |
| `errors` | JSONB (nullable) | Collected error details |
| `date_created` | datetime | When enqueued |
| `date_started` | datetime (nullable) | When a worker picked it up |
| `date_completed` | datetime (nullable) | When finished |
Queue backend abstraction#
The queue backend is accessed through a protocol interface, following the same hexagonal architecture pattern as the object store and CDN abstractions. This keeps the service layer decoupled from any specific queue technology and allows backend swaps without disrupting application logic.
Protocol definition#
from typing import Protocol
class QueueBackend(Protocol):
"""Protocol for queue backend implementations."""
async def enqueue(
self,
job_type: str,
payload: dict,
*,
queue_name: str = "default",
) -> str | None:
"""Enqueue a job for background processing.
Returns the backend-assigned job ID (str), or None if the
backend does not assign IDs synchronously.
"""
...
async def get_job_metadata(
self, backend_job_id: str
) -> dict | None:
"""Retrieve metadata about a job from the backend.
Returns backend-specific metadata (e.g., status, result),
or None if the job is not found. Used for diagnostics only —
the QueueJob table is the authoritative state store.
"""
...
async def get_job_result(
self, backend_job_id: str
) -> object | None:
"""Retrieve the result of a completed job.
Returns the job result, or None if not available.
"""
...
Implementations#
ArqQueueBackend wraps Safir’s ArqQueue for production use. Arq uses UUID strings as job IDs, which are stored in the backend_job_id column of the QueueJob table. The worker functions are standard async functions that receive the job payload and call through the service layer.
MockQueueBackend wraps Safir’s MockArqQueue for testing. Jobs are executed in-process, making tests deterministic without requiring a running Redis instance.
Both implementations are constructed by the factory and injected into services, consistent with Docverse’s dependency injection pattern.
Infrastructure#
Arq requires a Redis instance as its message broker. In Phalanx deployments, Redis is a standard in-cluster service. The QueueJob Postgres table remains the authoritative state store — Redis holds only transient message data. If Redis state is lost, in-flight jobs can be re-enqueued from QueueJob records with status = 'queued'.
Progress tracking#
The service layer writes progress to the QueueJob table at each phase transition and within phases where granular tracking is useful. These are lightweight single-row UPDATEs.
Phase transitions#
At each major phase boundary, the service updates the phase column and resets/initializes the progress JSONB:
await queue_job_store.update_phase(job_id, "inventory")
await inventory_service.catalog_build(build)
await queue_job_store.update_phase(job_id, "tracking")
affected_editions = await tracking_service.evaluate(build)
await queue_job_store.start_editions_phase(job_id, affected_editions)
results = await asyncio.gather(
*[self._update_single_edition(e, build, job_id)
for e in affected_editions],
return_exceptions=True,
)
await queue_job_store.update_phase(job_id, "dashboard")
await dashboard_service.render(project)
await queue_job_store.complete(job_id)
Live edition progress via conditional JSONB merge#
During the editions phase, multiple edition update coroutines run concurrently via asyncio.gather(). Each coroutine updates the progress JSONB when it completes, using Postgres jsonb_set to atomically move its edition slug between the editions_in_progress and editions_completed (as a structured object including the edition’s published_url), editions_skipped, or editions_failed arrays:
-- Mark edition completed
UPDATE queue_job
SET progress = jsonb_set(
jsonb_set(
progress,
'{editions_completed}',
(progress->'editions_completed') || :completed_entry::jsonb
),
'{editions_in_progress}',
(progress->'editions_in_progress') - :edition_slug
)
WHERE id = :job_id
Where completed_entry is {"slug": "__main", "published_url": "https://pipelines.lsst.io/"}.
For failures, the slug is moved to editions_failed as a structured object with error context:
-- Mark edition failed
UPDATE queue_job
SET progress = jsonb_set(
jsonb_set(
progress,
'{editions_failed}',
(progress->'editions_failed') || :failed_entry::jsonb
),
'{editions_in_progress}',
(progress->'editions_in_progress') - :edition_slug
)
WHERE id = :job_id
Where failed_entry is {"slug": "DM-12345", "error": "R2 timeout after 3 retries"}.
For skipped editions (superseded by a newer build; see Cross-job serialization), the slug is moved to editions_skipped as a structured object with the reason:
-- Mark edition skipped
UPDATE queue_job
SET progress = jsonb_set(
jsonb_set(
progress,
'{editions_skipped}',
(progress->'editions_skipped') || :skipped_entry::jsonb
),
'{editions_in_progress}',
(progress->'editions_in_progress') - :edition_slug
)
WHERE id = :job_id
Where skipped_entry is {"slug": "v2.x", "reason": "superseded by build 01HQ-3KBR-T5GN-8W"}.
Postgres serializes the row locks, but since these are sub-millisecond metadata writes against a single row, contention is negligible compared to the actual edition update work (KV writes, cache purges, or object copies).
The service layer wraps each edition update coroutine:
async def _update_single_edition(self, edition, build, job_id):
try:
skipped = await self._edition_service.update(edition, build)
if skipped:
await self._queue_store.mark_edition_skipped(
job_id, edition.slug, reason="superseded"
)
else:
await self._queue_store.mark_edition_completed(
job_id, edition.slug, edition.published_url
)
except Exception as e:
await self._queue_store.mark_edition_failed(
job_id, edition.slug, str(e)
)
raise
Progress JSONB structure#
The progress JSONB is phase-specific. During the editions phase:
{
"editions_total": 3,
"editions_completed": [
{ "slug": "__main", "published_url": "https://pipelines.lsst.io/" }
],
"editions_skipped": [
{ "slug": "v2.x", "reason": "superseded by build 01HQ-3KBR-T5GN-8W" }
],
"editions_failed": [
{ "slug": "DM-12345", "error": "R2 timeout after 3 retries" }
],
"editions_in_progress": []
}
The editions_skipped and editions_failed arrays already used structured objects with contextual fields (reason and error, respectively). Promoting editions_completed to a structured object with published_url makes the shape consistent across all terminal-state arrays and enables clients — particularly the GitHub Action’s PR comment feature (Pull request comments) — to discover published URLs directly from job progress without additional API calls. The editions_in_progress array remains a plain string array since in-progress editions have no published URL yet.
For other phases, progress can carry simpler metadata (e.g., {"message": "Cataloging 1,247 objects"} during inventory).
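A client consuming this structure might summarize it as follows; `summarize_editions_progress` is a hypothetical helper, not part of the Docverse API.

```python
def summarize_editions_progress(progress: dict) -> dict:
    """Collapse the editions-phase progress JSONB (shape shown above)
    into counts plus the published URLs of completed editions."""
    done = progress.get("editions_completed", [])
    return {
        "total": progress.get("editions_total", 0),
        "completed": len(done),
        "skipped": len(progress.get("editions_skipped", [])),
        "failed": len(progress.get("editions_failed", [])),
        "in_progress": len(progress.get("editions_in_progress", [])),
        # Structured completed entries carry the published_url directly.
        "published_urls": [e["published_url"] for e in done],
    }
```

This is essentially what the GitHub Action's PR comment feature does: it reads published URLs straight from the job progress rather than issuing extra API calls.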
Cross-job serialization#
Several background jobs can race on the same project’s resources. Two rapid build uploads from the same branch can produce two build_processing jobs that both try to update the same edition concurrently. A build_processing job and a dashboard_sync job can both try to write the same project’s dashboard files at the same time. An edition_update job and a build_processing job can both write the same edition’s metadata JSON. Since asyncio.gather() parallelizes edition updates within a job, and multiple workers can process different jobs simultaneously, these concurrent mutations can lead to interleaved KV writes, partial cache purges, torn dashboard HTML, or inconsistent metadata JSON.
Docverse prevents this with Postgres advisory locks at two granularities — per-edition and per-project — combined with a stale build guard for edition updates.
Lock namespacing#
Advisory locks use the two-argument form pg_advisory_lock(classid, objid) to namespace by resource type, avoiding key collisions between edition and project PKs (which come from different sequences):
- `pg_advisory_lock(1, edition.id)` — edition-level lock, serializes edition content updates and per-edition metadata JSON writes.
- `pg_advisory_lock(2, project.id)` — project-level lock, serializes project-wide dashboard renders (`__dashboard.html`, `__404.html`, `__switcher.json`).
Both services acquire locks through a shared advisory_lock async context manager that wraps the acquire/release pair, making the lock scope visually explicit and eliminating repeated try/finally boilerplate:
@asynccontextmanager
async def advisory_lock(session, classid, objid):
"""Acquire a Postgres advisory lock for the duration of the block."""
await session.execute(
text("SELECT pg_advisory_lock(:classid, :objid)"),
{"classid": classid, "objid": objid},
)
try:
yield
finally:
await session.execute(
text("SELECT pg_advisory_unlock(:classid, :objid)"),
{"classid": classid, "objid": objid},
)
Advisory lock acquisition#
Before performing any mutation, EditionService.update() uses the advisory_lock context manager to hold an advisory lock keyed on the edition’s primary key for the duration of the update:
async def update(self, edition, build) -> bool:
"""Update edition to point to build. Returns True if skipped."""
async with advisory_lock(self._session, 1, edition.id):
current_build = await self._get_current_build(edition)
if current_build and current_build.date_created > build.date_created:
return True # Skipped — edition already has a newer build
# ... perform KV write / copy-mode update ...
# ... log to EditionBuildHistory ...
# ... write __editions/{slug}.json metadata ...
return False
The underlying pg_advisory_lock() call blocks (rather than failing) if another session holds the lock for the same key. This means a competing job simply waits its turn — no job is rejected or fails due to contention.
Stale build guard#
After acquiring the lock, the method compares the candidate build’s date_created against the edition’s current build. If the edition already points to a newer build (because a more recent job acquired the lock first), the update is skipped. The caller logs the skip in the job’s progress JSONB via mark_edition_skipped, and the edition slug appears in the editions_skipped array rather than editions_completed.
This guarantees the edition never regresses to an older build, regardless of the order in which competing jobs acquire the lock.
Project-level lock for dashboard renders#
DashboardService.render(project) acquires a project-level advisory lock before writing the project-wide dashboard files. After releasing the project lock, it acquires each edition’s lock in turn to write per-edition metadata JSON, serializing against any concurrent EditionService.update() that writes the same file.
async def render(self, project):
"""Render all dashboard outputs for a project."""
# Project-wide files under project lock
async with advisory_lock(self._session, 2, project.id):
await self._write_dashboard_html(project)
await self._write_404_html(project)
await self._write_switcher_json(project)
# Per-edition metadata under individual edition locks
for edition in await self._get_editions(project):
async with advisory_lock(self._session, 1, edition.id):
await self._write_edition_metadata_json(edition)
No stale guard is needed for dashboard renders. The dashboard output is deterministic from the current database state, so the last render to complete always produces the correct output. The per-edition metadata writes can be parallelized across editions (different lock keys), but each individual write serializes against any concurrent EditionService.update() for the same edition.
Why this works#
- No concurrent mutation: The edition-level advisory lock serializes all updates to a given edition, whether from `build_processing` or `edition_update` jobs. The project-level lock serializes all dashboard renders for a project, whether from build processing, edition updates, template syncs, or manual re-renders.
- No failures: `pg_advisory_lock()` blocks until the lock is available — the job waits rather than failing.
- Correct final state: If Build B (newer) is processed before Build A (older) due to lock acquisition order, Build A's update is skipped by the stale guard. The edition always reflects the most recent build. Dashboard renders are deterministic from database state, so the last render to complete is always correct.
- Compatible with `asyncio.gather()`: Each edition's lock is independent, so parallel updates of different editions within the same job proceed without contention. Only updates to the same edition across jobs serialize. Similarly, `dashboard_sync` jobs that re-render multiple projects in parallel acquire independent project-level locks.
- Covers all code paths: Placing the edition lock inside `EditionService.update()` covers both the `build_processing` parallel edition phase and the `edition_update` manual reassignment path. Placing the project lock inside `DashboardService.render()` covers all dashboard render triggers. Per-edition metadata JSON is protected in both locations — inside `EditionService.update()` and during `DashboardService.render()`'s per-edition loop.
Connection impact#
The advisory lock holds a database session open for the duration of the locked operation. For edition updates in pointer mode (~2 seconds for KV write + cache purge) this is negligible. For copy mode (longer due to object copies), the session is held longer but this is acceptable given expected concurrency levels — at most a few concurrent builds per project. The project-level dashboard lock is held only for the duration of writing the three project-wide files (HTML + JSON), which is sub-second — significantly shorter than edition content updates.
Operator queries#
The QueueJob table provides a single place for operators to understand system state across all workers:
- Backlog depth: `SELECT count(*), kind FROM queue_job WHERE status = 'queued' GROUP BY kind`
- Active work: `SELECT * FROM queue_job WHERE status = 'in_progress'` — shows what every worker is doing, which phase each job is in, and per-edition progress
- Edition update activity: `SELECT * FROM queue_job WHERE status = 'in_progress' AND project_id = :pid AND phase = 'editions'` — shows concurrent edition work for a project. Advisory locks (see Cross-job serialization) handle serialization automatically; this query is for observability
- Error rates: `SELECT count(*) FROM queue_job WHERE status IN ('failed', 'completed_with_errors') AND date_completed > now() - interval '1 hour'`
- Per-org throughput: `SELECT org_id, count(*) FROM queue_job WHERE status = 'completed' AND date_completed > now() - interval '1 hour' GROUP BY org_id`
- Slow jobs: `SELECT * FROM queue_job WHERE status = 'in_progress' AND date_started < now() - interval '10 minutes'`
Job types#
Build processing (build_processing)#
Triggered when a client signals upload complete (PATCH .../builds/:build with status: uploaded). This is the primary job type.
The service layer executes the following steps inside the single background job:
1. Inventory (sequential) — catalog the build's objects from the object store into the `BuildObject` table in Postgres (key, content hash, content type, size). This is a listing + metadata operation against the object store.
2. Evaluate tracking rules (sequential) — determine which editions should update based on the build's git ref, the project's edition tracking modes, and the org's rewrite rules. Auto-create new editions if needed (e.g., new semver major stream, new git ref). Returns a list of affected editions.
3. Update editions (parallel via `asyncio.gather()`) — for each affected edition, update the edition to point to the new build. In pointer mode this writes a new KV mapping and purges the CDN cache; in copy mode this performs the ordered diff-copy-purge sequence. Each edition update also logs the transition to the `EditionBuildHistory` table. If one edition update fails, the others continue to completion; failures are collected and reported via the `QueueJob` progress JSONB.
4. Render project dashboard (sequential, runs once after all edition updates complete) — re-render the project's dashboard and 404 pages using the current edition metadata from the database and the resolved template. A single build may update multiple editions, but the dashboard reflects the project's full edition list and only needs to be rendered once.
5. Update job status (sequential) — mark the `QueueJob` as `completed`, `completed_with_errors` (if some editions failed), or `failed`. Also update the build's status accordingly.
Edition reassignment (edition_update)#
Triggered when an admin PATCHes an edition with a new build field (manual reassignment or rollback). Simpler than build processing — a single background job that:
1. Updates the edition to point to the specified build (pointer mode KV write or copy mode diff-copy).
2. Logs the transition to `EditionBuildHistory`.
3. Renders the project dashboard.
Dashboard template sync (dashboard_sync)#
Triggered by a GitHub webhook when a tracked dashboard template repository is updated. A single background job that syncs the template files from GitHub to the object store, then re-renders dashboards for all affected projects (all projects in the org for an org-level template, or a single project for a project-level override), using asyncio.gather() to parallelize across projects. See the Dashboard templating system section for the full sync flow.
Lifecycle evaluation (lifecycle_eval)#
Scheduled periodically (see Periodic job scheduling). A single background job that scans all orgs and projects for editions and builds matching lifecycle rules (stale drafts, orphan builds). Soft-deletes matching resources and moves object store content to purgatory. Uses asyncio.gather() to parallelize across orgs.
Git ref audit (git_ref_audit)#
Scheduled periodically (see Periodic job scheduling). A single background job that verifies git refs tracked by editions still exist on their GitHub repositories. Flags or soft-deletes editions whose refs have been deleted (if the ref_deleted lifecycle rule is enabled). Catches cases where GitHub webhook delivery for ref deletion events was missed.
Purgatory cleanup (purgatory_cleanup)#
Scheduled periodically (see Periodic job scheduling). A single background job that hard-deletes object store objects that have been in purgatory longer than the org’s configured retention period. Simple listing + batch delete per org.
Credential re-encryption (credential_reencrypt)#
Scheduled periodically (see Periodic job scheduling). A single background job that iterates over all organization_credentials rows and calls CredentialEncryptor.rotate() on each encrypted_credential value. This re-encrypts every token under the current primary Fernet key. Unlike Vault’s vault:vN: prefix, Fernet tokens don’t indicate which key encrypted them, so the job processes all rows unconditionally — MultiFernet.rotate() is idempotent. This ensures that after a key rotation, all stored credentials are migrated to the new key without ever exposing plaintext. Parallelized across orgs via asyncio.gather().
Periodic job scheduling#
Periodic jobs are scheduled using Kubernetes CronJobs rather than the queue backend’s built-in scheduling features (e.g., Arq’s cron). This keeps scheduling decoupled from the queue backend, enabling backend swaps without changing scheduling infrastructure. Kubernetes CronJobs are well-understood, observable, and already used throughout Phalanx.
Each periodic job type gets a CronJob that runs a thin CLI command to create a QueueJob record and enqueue it via the queue backend:
apiVersion: batch/v1
kind: CronJob
metadata:
name: docverse-lifecycle-eval
spec:
schedule: "0 3 * * *"
jobTemplate:
spec:
template:
spec:
containers:
- name: docverse-enqueue
image: ghcr.io/lsst-sqre/docverse:latest
command: ["docverse", "enqueue", "lifecycle_eval"]
restartPolicy: OnFailure
The docverse enqueue CLI command connects to the database and Redis, creates a QueueJob record with status: queued, enqueues the job via the queue backend, and exits. The actual work is performed by the Docverse worker process.
Schedule table#
| Job type | Default schedule | Description |
|---|---|---|
| `lifecycle_eval` | Daily at 03:00 UTC | Evaluate edition and build lifecycle rules |
| `git_ref_audit` | Daily at 04:00 UTC | Verify git refs tracked by editions |
| `purgatory_cleanup` | Daily at 05:00 UTC | Hard-delete expired purgatory objects |
| `credential_reencrypt` | Weekly (Sunday 02:00 UTC) | Re-encrypt credentials under current primary Fernet key |
Schedules are configurable per-deployment via Phalanx Helm values. Operators can adjust frequencies, add maintenance windows, or disable specific jobs without code changes.
Failure and retry#
The queue backend handles job-level retries. With Arq, retry behavior is configured per job type via the worker’s job definitions. The retry policy varies by job type:
- `build_processing` and `edition_update`: retry with backoff, up to 3 attempts. Jobs are idempotent at each step — inventory upserts, tracking evaluation is deterministic, edition updates use diffs (already-updated editions show no changes on re-run). On retry, the job re-runs from the beginning but completed steps are effectively no-ops. The `QueueJob` progress is reset at the start of each attempt.
- Periodic jobs (`lifecycle_eval`, `git_ref_audit`, `purgatory_cleanup`): retry once, then wait for the next scheduled run. These are self-correcting — anything missed on one run will be caught on the next.
Within a build processing job, individual edition update failures (in the asyncio.gather()) do not fail the entire job. The service layer uses return_exceptions=True, collects results, and marks the job as completed_with_errors if some editions failed while others succeeded. Failed editions are recorded in the progress JSONB with error messages for diagnosis. A subsequent retry or manual edition PATCH can address individual failures.
Job retention#
Completed and failed QueueJob records are retained for a configurable period (default: 7 days) before being cleaned up by the purgatory cleanup job. The queue API returns 404 for expired jobs.
REST API design#
Conventions#
The Docverse REST API follows SQuaRE’s REST API design conventions, and specifically follows patterns established in Ook and Times Square:
- No URL versioning: the API has a single set of paths (no `/v1/` prefix). Breaking changes would be handled by introducing new endpoints alongside deprecated ones. Supporting multiple API versions simultaneously would require significant design work, so committing to path-based versioning now seems premature.
- HATEOAS-style navigation: all resource representations include `self_url` and relevant navigation URLs (e.g., `project_url`, `org_url`, `builds_url`, `editions_url`). Clients navigate the API via these provided URLs rather than constructing their own.
- Collections are top-level arrays: collection endpoints return a JSON array of resource objects. Pagination metadata is carried in response headers, not a wrapper object.
- Keyset pagination: all collection endpoints use keyset pagination via Safir's pagination library. Pagination cursors and links are returned in headers.
- Errors: all error responses follow `safir.models.ErrorModel` — a `detail` array of `{type, msg, loc?}` objects, compatible with FastAPI's built-in validation error format.
- Background job URLs: when a POST or PATCH enqueues background work (e.g., build processing, edition updates), the response includes a `queue_url` pointing to the job status resource.
Resource identifiers#
Resources use human-readable identifiers in URLs rather than auto-generated database primary keys:
| Resource | URL identifier | Format | Example |
|---|---|---|---|
| Organization | slug | lowercase alphanumeric + hyphens | `rubin` |
| Project | slug | URL-safe, matches doc handle | `pipelines` |
| Edition | slug | URL-safe, derived from git ref via rewrite rules | `v2.x` |
| Build | Crockford Base32 ID | 12 chars + checksum, hyphenated | `01HQ-3KBR-T5GN-8W` |
| Queue job | Crockford Base32 ID | 12 chars + checksum, hyphenated | — |
| Org membership | composite key | `principal_type:principal` | `user:jdoe` |
Build and queue job IDs use the Crockford Base32 implementation from Ook, backed by base32-lib. IDs are stored as integers in Postgres but serialized as base32 strings with checksums in the API via Pydantic. Build IDs are randomly generated (not ordered). Queue job IDs are Docverse-owned public identifiers that map internally to the queue backend’s job IDs via the backend_job_id column in the QueueJob table.
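The encoding itself is straightforward; the sketch below shows plain Crockford Base32 integer encoding (an alphabet that excludes I, L, O, and U) and deliberately omits the checksum and hyphenation that base32-lib adds on top.

```python
# Crockford Base32 alphabet: digits plus letters, skipping I, L, O, U
# to avoid visual ambiguity with 1 and 0.
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def encode_crockford(n: int) -> str:
    """Encode a non-negative integer as a Crockford Base32 string."""
    if n == 0:
        return "0"
    digits = []
    while n:
        n, rem = divmod(n, 32)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def decode_crockford(s: str) -> int:
    """Decode a Crockford Base32 string back to an integer."""
    n = 0
    for ch in s.upper():
        n = n * 32 + ALPHABET.index(ch)
    return n
```

Storing the integer in Postgres and encoding only at the API boundary keeps joins and indexes on a compact int column while presenting an opaque, checksummed identifier to clients.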
Endpoint catalog#
Root#
GET / → API metadata and navigation URLs
Returns API version, available org URLs, and links to documentation. No authentication required beyond the base exec:docverse scope.
Superadmin — organization management#
These endpoints are separated under /admin/ to enable the admin:docverse Gafaelfawr scope at the ingress level.
POST /admin/orgs → create organization
DELETE /admin/orgs/:org → delete organization
Organizations#
GET /orgs → list organizations (reader+)
GET /orgs/:org → get organization (reader+)
PATCH /orgs/:org → update org settings (admin)
The GET /orgs/:org response includes navigation URLs and the slug rewrite rules:
{
"self_url": "https://docverse.../orgs/rubin",
"projects_url": "https://docverse.../orgs/rubin/projects",
"members_url": "https://docverse.../orgs/rubin/members",
"slug": "rubin",
"title": "Rubin Observatory",
"slug_rewrite_rules": [
{ "type": "ignore", "glob": "dependabot/**" },
{ "type": "prefix_strip", "prefix": "tickets/", "edition_kind": "draft" }
]
}
Slug preview#
POST /orgs/:org/slug-preview → preview slug resolution (admin)
Tests the org’s (or a specific project’s) edition slug rewrite rules against a git ref without creating any resources. See the Edition slug rewrite rules section for request/response details.
Org membership#
GET /orgs/:org/members → list memberships (admin)
POST /orgs/:org/members → add membership (admin)
GET /orgs/:org/members/:id → get membership (admin)
DELETE /orgs/:org/members/:id → remove membership (admin)
Membership :id uses a composite key format: user:jdoe or group:g_spherex. This is self-documenting in URLs and corresponds directly to the principal_type:principal pair which is unique within an org.
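Parsing the composite key is straightforward; a sketch (illustrative, not the server code):

```python
VALID_PRINCIPAL_TYPES = {"user", "group"}


def parse_membership_id(member_id: str) -> tuple[str, str]:
    """Split a membership URL identifier like 'user:jdoe' into
    (principal_type, principal), validating the principal type."""
    principal_type, sep, principal = member_id.partition(":")
    if not sep or principal_type not in VALID_PRINCIPAL_TYPES or not principal:
        raise ValueError(f"invalid membership id: {member_id!r}")
    return principal_type, principal
```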
Projects#
GET /orgs/:org/projects → list projects (reader+, paginated)
POST /orgs/:org/projects → create project (admin)
GET /orgs/:org/projects/:project → get project (reader+)
PATCH /orgs/:org/projects/:project → update project (admin)
DELETE /orgs/:org/projects/:project → soft-delete project (admin)
The GET /orgs/:org/projects/:project response includes navigation URLs:
{
"self_url": "https://docverse.../orgs/rubin/projects/pipelines",
"org_url": "https://docverse.../orgs/rubin",
"editions_url": "https://docverse.../orgs/rubin/projects/pipelines/editions",
"builds_url": "https://docverse.../orgs/rubin/projects/pipelines/builds",
"slug": "pipelines",
"title": "LSST Science Pipelines",
"doc_repo": "https://github.com/lsst/pipelines_lsst_io",
"slug_rewrite_rules": null
}
The slug_rewrite_rules field is null when the project inherits the org’s rules, or contains a project-specific rule list when overridden.
Editions#
GET /orgs/:org/projects/:project/editions → list editions (reader+, paginated)
POST /orgs/:org/projects/:project/editions → create edition (admin)
GET /orgs/:org/projects/:project/editions/:ed → get edition (reader+)
PATCH /orgs/:org/projects/:project/editions/:ed → update edition (admin)
DELETE /orgs/:org/projects/:project/editions/:ed → soft-delete edition (admin)
PATCH supports setting build to reassign an edition to a specific build (used for rollback or manual reassignment). This enqueues an edition update task and returns a queue_url.
The POST to create an edition accepts the following request body:
{
"slug": "usdf-dev--main",
"title": "USDF Dev — main",
"kind": "alternate",
"tracking_mode": "alternate_git_ref",
"tracking_params": {
"git_ref": "main",
"alternate_name": "usdf-dev"
}
}
| Field | Type | Required | Description |
|---|---|---|---|
| `slug` | string | yes | URL-safe edition identifier. Must be unique within the project. |
| `title` | string | yes | Human-readable title for dashboards and metadata. |
| `kind` | string | yes | Edition kind (`main`, `release`, `draft`, or `alternate`). |
| `tracking_mode` | string | yes | One of the supported tracking modes (see Projects, editions and builds). Determines which builds update this edition. |
| `tracking_params` | object | no | Mode-specific parameters (e.g., `git_ref`, `alternate_name`). |
GET /orgs/:org/projects/:project/editions/:ed/history → edition-build history (reader+, paginated)
The history endpoint returns a paginated list of builds the edition has pointed to, ordered by position (most recent first). Each entry includes the build reference, timestamp, and position.
The GET /orgs/:org/projects/:project/editions/:ed response:
{
"self_url": "https://docverse.../orgs/rubin/projects/pipelines/editions/__main",
"project_url": "https://docverse.../orgs/rubin/projects/pipelines",
"build_url": "https://docverse.../orgs/rubin/projects/pipelines/builds/01HQ-3KBR-T5GN-8W",
"history_url": "https://docverse.../orgs/rubin/projects/pipelines/editions/__main/history",
"published_url": "https://pipelines.lsst.io/",
"slug": "__main",
"kind": "main",
"tracking_mode": "git_ref",
"tracking_params": { "git_ref": "main" },
"date_updated": "2026-02-08T12:00:00Z"
}
Builds#
GET /orgs/:org/projects/:project/builds → list builds (reader+, paginated)
POST /orgs/:org/projects/:project/builds → create build (uploader+)
GET /orgs/:org/projects/:project/builds/:build → get build (reader+)
PATCH /orgs/:org/projects/:project/builds/:build → signal upload complete (uploader+)
DELETE /orgs/:org/projects/:project/builds/:build → soft-delete build (admin)
The POST to create a build returns a single presigned upload URL for the tarball:
Request:
{
"git_ref": "main",
"alternate_name": "usdf-dev",
"content_hash": "sha256:a1b2c3d4...",
"annotations": { "kind": "release" }
}
The alternate_name field is optional — most projects don’t use it. When present, it scopes the build to alternate-aware editions. Builds with alternate_name are not matched by generic git_ref-only tracking editions — the alternate name acts as a namespace. See Compound slug derivation for alternate-scoped builds in the projects section for details on how alternate names interact with edition tracking and slug derivation.
Response:
{
"self_url": "https://docverse.../orgs/rubin/projects/pipelines/builds/01HQ-3KBR-T5GN-8W",
"project_url": "https://docverse.../orgs/rubin/projects/pipelines",
"id": "01HQ-3KBR-T5GN-8W",
"status": "uploading",
"git_ref": "main",
"upload_url": "https://storage.googleapis.com/docverse-staging-rubin/__staging/01HQ...tar.gz?sig=...",
"date_created": "2026-02-08T12:00:00Z"
}
The upload_url is a presigned PUT URL for the tarball. It points to either the staging bucket (if configured) or the publishing bucket’s __staging/ prefix. The client uploads the tarball via HTTP PUT to this URL, then signals upload complete via PATCH.
The content_hash field allows the server to verify tarball integrity after upload. The files field from the per-file upload model is removed – the file manifest is discovered during the unpack step.
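For example, a client could produce the tarball and the `content_hash` value like this (a sketch; the real client's archiving details may differ):

```python
import hashlib
import io
import tarfile
from pathlib import Path


def make_build_tarball(site_dir: Path) -> tuple[bytes, str]:
    """Create a gzipped tarball of a built documentation directory and
    return (tarball_bytes, content_hash) in the API's 'sha256:...' form."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        # Sort entries so the archive layout is deterministic.
        for path in sorted(site_dir.rglob("*")):
            tar.add(path, arcname=str(path.relative_to(site_dir)))
    data = buf.getvalue()
    digest = hashlib.sha256(data).hexdigest()
    return data, f"sha256:{digest}"
```

The client would then PUT `data` to the presigned `upload_url` and send the hash in the build-creation request.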
The PATCH to signal upload complete updates the build status and enqueues background processing:
Request:
{
"status": "uploaded"
}
Response:
{
"self_url": "https://docverse.../orgs/rubin/projects/pipelines/builds/01HQ-3KBR-T5GN-8W",
"queue_url": "https://docverse.../queue/jobs/0G4R-MFBZ-K7QP-5X",
"status": "processing"
}
Queue jobs#
GET /queue/jobs/:job → get job status (authenticated)
Queue jobs provide status tracking for background operations (build processing, edition updates, dashboard rendering). The job resource is identified by a Docverse-owned Crockford Base32 ID. See the Task queue design section for the QueueJob table schema and progress tracking implementation.
{
"self_url": "https://docverse.../queue/jobs/0G4R-MFBZ-K7QP-5X",
"id": "0G4R-MFBZ-K7QP-5X",
"status": "in_progress",
"kind": "build_processing",
"build_url": "https://docverse.../orgs/rubin/projects/pipelines/builds/01HQ-3KBR-T5GN-8W",
"date_created": "2026-02-08T12:00:00Z",
"date_started": "2026-02-08T12:00:01Z",
"date_completed": null,
"phase": "editions",
"progress": {
"editions_total": 3,
"editions_completed": [
{ "slug": "__main", "published_url": "https://pipelines.lsst.io/" }
],
"editions_failed": [],
"editions_in_progress": ["v2.x", "DM-12345"]
}
}
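A client waiting on background work can poll the job URL until a terminal status is reached. A sketch with an injected fetch callable (the terminal status set is an assumption based on the statuses shown in this document):

```python
import time
from typing import Callable

# Assumed terminal states, inferred from the statuses this design mentions.
TERMINAL_STATUSES = {"completed", "failed", "completed_with_errors"}


def wait_for_job(fetch_job: Callable[[], dict],
                 interval: float = 2.0,
                 timeout: float = 900.0) -> dict:
    """Poll a queue job until it finishes or the timeout expires.

    fetch_job is any callable returning the job resource as a dict,
    e.g. a closure over an HTTP GET of the queue_url.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job()
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {job['status']} after {timeout}s")
        time.sleep(interval)
```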
Database models#
Docverse stores all state in a PostgreSQL database accessed through SQLAlchemy (async, via Safir’s database utilities). This section provides a centralized reference for the database schema. Individual sections describe the behavioral design around each table; this section focuses on column definitions and relationships.
Entity-relationship diagram#
erDiagram
Organization ||--o{ Project : "has"
Organization ||--o{ OrgMembership : "has"
Organization ||--o{ organization_credentials : "has"
Organization ||--o{ DashboardTemplate : "has"
Organization ||--o{ QueueJob : "scoped to"
Project ||--o{ Build : "has"
Project ||--o{ Edition : "has"
Project ||--o{ QueueJob : "scoped to"
Project ||--o{ DashboardTemplate : "overrides"
Build ||--o{ BuildObject : "inventories"
Build ||--o{ EditionBuildHistory : "logged in"
Build ||--o{ QueueJob : "tracked by"
Edition ||--o{ EditionBuildHistory : "logged in"
Edition }o--o| Build : "current build"
Organization {
int id PK
string slug UK
string title
string base_domain
enum url_scheme
string root_path_prefix
JSONB slug_rewrite_rules
JSONB lifecycle_rules
datetime date_created
datetime date_updated
}
Project {
int id PK
string slug
string title
int org_id FK
string doc_repo
JSONB slug_rewrite_rules
JSONB lifecycle_rules
datetime date_created
datetime date_updated
datetime date_deleted
}
Build {
int id PK
string public_id
int project_id FK
string git_ref
string alternate_name
string content_hash
enum status
string staging_key
int object_count
bigint total_size_bytes
string uploader
JSONB annotations
datetime date_created
datetime date_uploaded
datetime date_completed
datetime date_deleted
}
Edition {
int id PK
string slug
string title
int project_id FK
enum kind
enum tracking_mode
JSONB tracking_params
int current_build_id FK
bool lifecycle_exempt
datetime date_created
datetime date_updated
datetime date_deleted
}
OrgMembership {
UUID id PK
int org_id FK
string principal
enum principal_type
enum role
}
organization_credentials {
UUID id PK
int organization_id FK
string label
string service_type
string encrypted_credential
datetime created_at
datetime updated_at
}
BuildObject {
int id PK
int build_id FK
string key
string content_hash
string content_type
bigint size
}
EditionBuildHistory {
int id PK
int edition_id FK
int build_id FK
int position
datetime date_created
}
DashboardTemplate {
int id PK
int org_id FK
int project_id FK
string github_owner
string github_repo
string path
string git_ref
string store_prefix
string sync_id
datetime date_synced
}
QueueJob {
int id PK
int public_id
string backend_job_id
enum kind
enum status
string phase
int org_id FK
int project_id FK
int build_id FK
JSONB progress
JSONB errors
datetime date_created
datetime date_started
datetime date_completed
}
Core domain tables#
These tables define the primary domain model for Docverse.
Organization#
The organization is the top-level resource and the sole infrastructure configuration boundary. All projects within an org share the same object store, CDN, root domain, URL scheme, and default dashboard templates. See Organizations design for the full behavioral design.
| Column | Type | Description |
|---|---|---|
| `id` | int | Primary key |
| `slug` | str (unique) | URL-safe identifier (e.g., `rubin`) |
| `title` | str | Human-readable name |
| `base_domain` | str | Root domain for published URLs (e.g., `lsst.io`) |
| `url_scheme` | enum | URL scheme for published sites (subdomain or path prefix) |
| `root_path_prefix` | str | Path prefix for the path-prefix URL scheme |
| `slug_rewrite_rules` | JSONB | Ordered list of edition slug rewrite rules (see Edition slug rewrite rules) |
| `lifecycle_rules` | JSONB | Default lifecycle rules for projects in this org (see Projects, editions and builds) |
| `date_created` | datetime | Creation timestamp |
| `date_updated` | datetime | Last modification timestamp |
Infrastructure connections (object store, CDN, DNS, staging store) are configured through the organization_credentials table and additional org-level configuration fields.
Project#
A documentation site with a stable URL and multiple versions (editions). Projects belong to an organization and inherit its infrastructure and default configuration, with optional per-project overrides for slug rewrite rules and lifecycle rules. See Projects, editions and builds for the full behavioral design.
| Column | Type | Description |
|---|---|---|
| `id` | int | Primary key |
| `slug` | str | URL-safe identifier, unique within org (e.g., `pipelines`) |
| `title` | str | Human-readable name |
| `org_id` | FK → Organization | Owning organization |
| `doc_repo` | str | GitHub repository URL for the documentation source |
| `slug_rewrite_rules` | JSONB (nullable) | When set, completely replaces the org-level rules for this project |
| `lifecycle_rules` | JSONB (nullable) | When set, overrides the org-level lifecycle rules for this project |
| `date_created` | datetime | Creation timestamp |
| `date_updated` | datetime | Last modification timestamp |
| `date_deleted` | datetime (nullable) | Soft-delete timestamp; `null` when active |
Build#
A discrete upload of documentation content for a project. Builds are conceptually immutable after processing and carry metadata about their origin. Builds are identified externally with a Crockford Base32 ID. See Projects, editions and builds for the upload flow and processing pipeline.
| Column | Type | Description |
|---|---|---|
| `id` | int | Internal primary key |
| `public_id` | Crockford Base32 | Externally-visible identifier (e.g., `01HQ-3KBR-T5GN-8W`) |
| `project_id` | FK → Project | Owning project |
| `git_ref` | str | Git branch or tag that produced this build |
| `alternate_name` | str (nullable) | Deployment/variant scope (e.g., `usdf-dev`) |
| `content_hash` | str | SHA-256 hash of the uploaded tarball for integrity verification |
| `status` | enum | Build lifecycle state (`uploading`, `uploaded`, `processing`, …) |
| `staging_key` | str | Object store key for the uploaded tarball under the staging prefix |
| `object_count` | int | Number of files extracted from the tarball (populated during inventory) |
| `total_size_bytes` | bigint | Total size of extracted content (populated during inventory) |
| `uploader` | str | Username of the authenticated uploader |
| `annotations` | JSONB | Client-provided metadata about the build |
| `date_created` | datetime | When the build record was created |
| `date_uploaded` | datetime (nullable) | When the client signaled upload complete |
| `date_completed` | datetime (nullable) | When background processing finished |
| `date_deleted` | datetime (nullable) | Soft-delete timestamp; `null` when active |
Edition#
A named, published view of a project’s documentation at a stable URL.
Editions are pointers — they represent a specific build’s content served at an edition-specific URL path (e.g., /v/main/, /v/DM-12345/).
See Projects, editions and builds for tracking modes, edition kinds, and auto-creation behavior.
| Column | Type | Description |
|---|---|---|
| `id` | int | Primary key |
| `slug` | str | URL-safe identifier, unique within project (e.g., `__main`) |
| `title` | str | Human-readable name for dashboards |
| `project_id` | FK → Project | Owning project |
| `kind` | enum | Edition kind (`main`, `release`, `draft`, `alternate`) |
| `tracking_mode` | enum | Determines which builds update this edition (see Projects, editions and builds for the full list) |
| `tracking_params` | JSONB | Mode-specific parameters (e.g., `{"git_ref": "main"}`) |
| `current_build_id` | FK → Build (nullable) | The build currently served at this edition's URL; `null` until the first update |
| `lifecycle_exempt` | bool | When `true`, lifecycle rules never remove this edition |
| `date_created` | datetime | Creation timestamp |
| `date_updated` | datetime | Last update timestamp (changes when build pointer moves) |
| `date_deleted` | datetime (nullable) | Soft-delete timestamp; `null` when active |
Supporting tables#
These tables support the core domain model with membership, credentials, object inventory, history tracking, dashboard templates, and background job management. Each is discussed in detail in its respective section.
OrgMembership#
Maps users and groups to roles within organizations. See Authentication and authorization for the full authorization model, role definitions, and resolution algorithm.
| Column | Type | Description |
|---|---|---|
| `id` | UUID | Primary key |
| `org_id` | FK → Organization | The organization |
| `principal` | str | A username or group name |
| `principal_type` | enum | `user` or `group` |
| `role` | enum | `reader`, `uploader`, or `admin` |
organization_credentials#
Stores Fernet-encrypted credentials for organization infrastructure services (object stores, CDNs, DNS providers). See Organizations design for the encryption scheme, key rotation, and credential management.
| Column | Type | Description |
|---|---|---|
| `id` | UUID | Primary key |
| `organization_id` | FK → Organization | Owning organization |
| `label` | str | Human-friendly name (e.g., “Cloudflare R2 production”) |
| `service_type` | str | Provider identifier |
| `encrypted_credential` | str | Fernet token containing the encrypted credential value |
| `created_at` | datetime | Creation timestamp |
| `updated_at` | datetime | Last modification timestamp |
Unique constraint on (organization_id, label).
BuildObject#
Inventories every file extracted from a build’s tarball. Populated during the inventory phase of build processing. See Projects, editions and builds for how the inventory enables diff-based edition updates and orphan detection.
| Column | Type | Description |
|---|---|---|
| `id` | int | Primary key |
| `build_id` | FK → Build | Owning build |
| `key` | str | Object store path (e.g., `{project}/__builds/{build_id}/index.html`) |
| `content_hash` | str | ETag or SHA-256 hash of the object content |
| `content_type` | str | MIME type (e.g., `text/html`) |
| `size` | bigint | Object size in bytes |
EditionBuildHistory#
Logs every build that an edition has pointed to, enabling rollback and orphan detection. See Projects, editions and builds for the rollback API and how lifecycle rules reference history position.
| Column | Type | Description |
|---|---|---|
| `id` | int | Primary key |
| `edition_id` | FK → Edition | The edition |
| `build_id` | FK → Build | The build that was served |
| `position` | int | Ordering position (1 = most recent) |
| `date_created` | datetime | When this history entry was recorded |
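Rollback then reduces to finding the previous history entry. An illustrative helper (not the server code):

```python
def rollback_build_id(history: list[dict]) -> int:
    """Given EditionBuildHistory rows for one edition (each with
    'position' and 'build_id', where position 1 is the currently-served
    build), return the build to reassign the edition to on rollback."""
    by_position = {row["position"]: row for row in history}
    previous = by_position.get(2)
    if previous is None:
        raise ValueError("edition has no previous build to roll back to")
    return previous["build_id"]
```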
DashboardTemplate#
Tracks dashboard template sources (GitHub repos) and their sync state. See Dashboard templating system for the template directory structure, sync flow, and rendering pipeline.
| Column | Type | Description |
|---|---|---|
| `id` | int | Primary key |
| `org_id` | FK → Organization | Owning organization |
| `project_id` | FK → Project (nullable) | If set, this is a project-level override |
| `github_owner` | str | GitHub organization or user |
| `github_repo` | str | Repository name |
| `path` | str | Path within the repo to the template directory |
| `git_ref` | str | Branch or tag to track |
| `store_prefix` | str (nullable) | Object store prefix for current synced template files |
| `sync_id` | str (nullable) | Current sync version identifier (timestamp-based) |
| `date_synced` | datetime (nullable) | Last successful sync timestamp |
Unique constraint on (org_id, project_id) — at most one template per org (where project_id is null) and one per project.
QueueJob#
Tracks all background jobs as the single source of truth for job state and progress. The queue backend (Arq/Redis) handles delivery; this table is the authoritative state store. See Task queue design for progress tracking, cross-job serialization, and operator queries.
| Column | Type | Description |
|---|---|---|
| `id` | int | Internal primary key |
| `public_id` | int | Crockford Base32 serialized in API |
| `backend_job_id` | str (nullable) | Reference to the queue backend’s job ID (e.g., Arq UUID) |
| `kind` | enum | Job kind (e.g., `build_processing`) |
| `status` | enum | Job state (e.g., `in_progress`, `completed_with_errors`) |
| `phase` | str (nullable) | Current processing phase (e.g., `editions`) |
| `org_id` | FK → Organization | Scoped to org for filtering |
| `project_id` | FK → Project (nullable) | Set for build/edition jobs |
| `build_id` | FK → Build (nullable) | Set for build processing jobs |
| `progress` | JSONB (nullable) | Structured progress data, phase-specific |
| `errors` | JSONB (nullable) | Collected error details |
| `date_created` | datetime | When the job was enqueued |
| `date_started` | datetime (nullable) | When a worker picked it up |
| `date_completed` | datetime (nullable) | When the job finished |
GitHub Actions action (docverse-upload)#
The docverse-upload action is a native JavaScript GitHub Action published to the GitHub Marketplace from the lsst-sqre/docverse-upload repository.
It provides the same upload workflow as the Python client (see Client-server monorepo) but is purpose-built for GitHub Actions runners.
docverse-upload/ # lsst-sqre/docverse-upload
├── action.yml
├── src/
│ └── index.ts
├── generated/
│ └── api-types.ts # openapi-typescript output
├── openapi.json # pinned OpenAPI spec from docverse
├── package.json
├── tsconfig.json
└── dist/
└── index.js # ncc-bundled output
Why native JavaScript instead of wrapping the Python client#
A composite action that installs Python and then calls docverse upload would work, but carries overhead:
- Python setup cost: `actions/setup-python` adds 15–30 seconds to every job. For documentation builds that already have Python, this is free — but for projects using other languages, or for workflows where the docs build runs in a separate job, the setup time is wasted.
- Node 20 is guaranteed: GitHub Actions runners always have Node 20 available. A JavaScript action runs immediately with zero setup.
- Native toolkit integration: the `@actions/core` toolkit provides first-class support for step outputs, job summaries, annotations, and failure reporting. Wrapping a CLI subprocess requires parsing its output to surface these features.
- Independent versioning: the action is versioned via Git tags (`v1`, `v1.2.0`) following GitHub Actions conventions. It can release on its own cadence without coupling to the Python client’s PyPI release cycle.
Usage#
- name: Upload docs to Docverse
uses: lsst-sqre/docverse-upload@v1
with:
org: rubin
project: pipelines
dir: _build/html
token: ${{ secrets.DOCVERSE_TOKEN }}
Usage with PR comments enabled:
- name: Upload docs to Docverse
uses: lsst-sqre/docverse-upload@v1
with:
org: rubin
project: pipelines
dir: _build/html
token: ${{ secrets.DOCVERSE_TOKEN }}
github-token: ${{ github.token }}
Inputs#
| Input | Required | Default | Description |
|---|---|---|---|
| `org` | yes | — | Organization slug |
| `project` | yes | — | Project slug |
| `dir` | yes | — | Path to built documentation directory |
| `token` | yes | — | Gafaelfawr token for authentication |
| `url` | no | | Docverse API base URL |
| `git-ref` | no | | Git ref (auto-detected from workflow context) |
| `alternate-name` | no | — | Alternate name for scoped editions |
| `wait` | no | | Wait for processing to complete |
| `github-token` | no | — | GitHub token for posting PR comments with links to updated editions. Typically `${{ github.token }}` |
Outputs#
| Output | Description |
|---|---|
| `build-id` | The Docverse build ID |
| `build-url` | API URL of the created build |
| `published-url` | Public URL where the edition is served |
| `job-status` | Terminal queue job status |
| `editions` | JSON array of all updated editions with slugs and published URLs |
Pull request comments#
When github-token is provided, the action posts or updates a comment on the associated pull request summarizing the build and linking to all updated editions.
This gives PR reviewers immediate, clickable access to staged documentation without navigating the Docverse API or dashboards.
PR discovery#
How the action finds the PR number depends on the workflow trigger event:
- `pull_request` / `pull_request_target` events: the PR number is read directly from `github.event.pull_request.number`.
- `push` events: the action queries the GitHub API (`GET /repos/{owner}/{repo}/pulls?head={owner}:{branch}&state=open`) to find open PRs for the pushed branch. If multiple PRs match, the action comments on all of them. If none match, the comment step is skipped silently.
- Other events (`workflow_dispatch`, `schedule`): the comment step is skipped silently.
Comment deduplication#
A hidden HTML marker <!-- docverse:pr-comment:{org}/{project} --> at the top of the comment body identifies the comment, scoped by organization and project.
On each build the action:
1. Lists existing comments on the PR and searches for the marker.
2. If found: updates the existing comment via `PATCH /repos/{owner}/{repo}/issues/comments/{comment_id}`.
3. If not found: creates a new comment via `POST /repos/{owner}/{repo}/issues/{pr_number}/comments`.
Multi-project PRs (repositories that publish to multiple Docverse projects) get one comment per project, each independently updated.
Edge cases#
- Job failed: the comment reports the failure status and build ID instead of an edition table.
- No editions updated: the comment notes that no editions were updated and includes the build ID.
- Partial failure (`completed_with_errors`): successful editions appear in the main table; failed and skipped editions are listed in a collapsible `<details>` block.
- Token lacks permissions: the GitHub API returns 403; the action logs a warning via `core.warning()` but does not fail the step (the upload itself succeeded).
- No PR context: the comment step is skipped silently; the build proceeds normally.
Permissions#
The github-token requires pull-requests: write permission. Workflows must declare this explicitly:
permissions:
pull-requests: write
steps:
- name: Upload docs to Docverse
uses: lsst-sqre/docverse-upload@v1
with:
org: rubin
project: pipelines
dir: _build/html
token: ${{ secrets.DOCVERSE_TOKEN }}
github-token: ${{ github.token }}
Implementation details#
OpenAPI-driven TypeScript development#
The action’s TypeScript types are generated from the Docverse server’s OpenAPI spec, creating a cross-repo type safety chain:
1. Pydantic models in the `docverse` monorepo’s client package define the API contract.
2. FastAPI generates an OpenAPI spec from those models; monorepo CI publishes the spec as a versioned artifact.
3. The `docverse-upload` repository pins a copy of the spec as `openapi.json`.
4. `openapi-typescript` generates TypeScript types (`generated/api-types.ts`) from the pinned spec.
5. `openapi-fetch` provides a type-safe HTTP client that uses those generated types.
When the API contract changes, a developer updates openapi.json in the action repository (either manually or via a Dependabot-style automation).
Because the spec is committed, the diff in the pull request makes every schema change explicitly visible — field renames, added enum values, or new required properties are all reviewable before the action code is updated to match.
This provides a deliberate review gate that catches unintended breaking changes before they ship.
Build and bundle#
The action is built with TypeScript and bundled into a single dist/index.js file using ncc.
The bundled output is committed to the repository (standard practice for JavaScript GitHub Actions) so that the action runs without a node_modules install step.
The action targets the Node 20 runtime.
Tarball creation and upload#
The action uses the Node.js tar package to create .tar.gz archives and computes a SHA-256 hash during creation.
The tarball is uploaded to the presigned URL via the Fetch API.
GitHub Actions integration#
The action uses the @actions/core toolkit for runner integration:
- Step summary: on success, a Markdown summary is written to `$GITHUB_STEP_SUMMARY` showing the build ID, queue job status, and published URL.
- Warning annotations: if the queue job completes with warnings (partial success), the action emits warning annotations visible in the workflow run UI.
- Step failure: if the queue job fails, the action calls `core.setFailed()` with the failure reason, marking the step as failed.
- Outputs: build ID, build URL, published URL, and job status are set as step outputs for downstream workflow steps to consume.
- PR comments: when `github-token` is provided, posts a summary comment on the associated pull request with links to all updated editions (see Pull request comments).
Development workflow#
Action development#
The docverse-upload action uses a standard Node.js development workflow:
- `npm install` — install dependencies.
- `npm run generate-types` — regenerate `generated/api-types.ts` from `openapi.json` (runs `openapi-typescript`).
- `npm test` — run unit tests with Vitest.
- `npm run build` — compile TypeScript and bundle with `ncc` into `dist/index.js`.
Release workflow#
GitHub Action (`docverse-upload` repo): versioned via Git tags following GitHub Actions conventions (`v1`, `v1.2.0`). The `v1` tag is a floating major-version tag updated on each minor/patch release.
Housing the action in its own repository means its v1/v1.2.0 tags reflect the action’s own input/output contract and release cadence, independent of the server or client Python packages.
Migration from LSST the Docs#
The Rubin Observatory LSST the Docs (LTD) deployment at lsst.io serves ~300 documentation projects for the Rubin Observatory software stack and technical notes.
The current deployment runs LTD Keeper 1.23.0 with content stored in AWS S3, served through Fastly CDN, and uploaded via the lsst-sqre/ltd-upload GitHub Action and reusable workflows.
The migration moves this deployment to Docverse, targeting Cloudflare R2 for object storage and Cloudflare Workers for the CDN edge.
The migration involves three concerns: data migration (moving object store content and database records), client migration (updating CI workflows that upload documentation), and a phased rollout that minimizes disruption.
Data migration#
The data migration moves documentation content from the LTD object store layout to the Docverse layout and seeds the Docverse database with project, edition, and build records derived from the LTD Keeper database.
Scope: Only builds that are currently referenced by active (non-deleted) editions are migrated. Historical builds that are not pointed to by any edition are discarded. This significantly reduces the data volume — most projects accumulate hundreds of builds over time, but only a handful of editions (and therefore builds) are active at any given moment.
LTD vs. Docverse object store layout#
The LTD and Docverse object store layouts differ in both structure and semantics:
| Aspect | LTD layout | Docverse layout |
|---|---|---|
| Build storage | `{product}/builds/{build_id}/` | `{project}/__builds/{build_id}/` |
| Edition content | Build files copied into per-edition paths on every edition update | No edition file copies — editions are pointers resolved at the CDN edge via KV lookup |
| Edition metadata | None in object store (stored in LTD Keeper database only) | Metadata JSON rendered into the object store |
| Staging | N/A (files uploaded individually via presigned URLs) | Tarballs uploaded to a staging bucket or the publishing bucket’s `__staging/` prefix |
| Dashboard | Static HTML at domain root | Rendered from synced dashboard templates |
The key architectural difference is that LTD physically copies build files into edition paths on every edition update (the S3 copy-on-publish bottleneck described in Documentation hosting), while Docverse stores builds once and resolves editions to builds via edge KV lookups. This means the migration only needs to copy files for the builds themselves — edition content does not need to be duplicated.
Migration tool design#
The migration is implemented as a docverse migrate CLI command in the Docverse client package.
The tool reads from the LTD Keeper database and source object store, and writes to the Docverse API and target object store.
sequenceDiagram
participant CLI as docverse migrate
participant LTD_DB as LTD Keeper DB
participant S3_SRC as Source S3 Bucket
participant API as Docverse API
participant Store as Target Object Store
participant KV as CDN Edge KV
CLI->>LTD_DB: Query products, editions, builds
LTD_DB-->>CLI: Product→Edition→Build mappings
loop For each product
CLI->>API: Create project (with org mapping)
loop For each unique build referenced by active editions
CLI->>S3_SRC: List objects at {product}/builds/{build_id}/
S3_SRC-->>CLI: Object keys + metadata
loop For each object in build
CLI->>Store: Copy to {project}/__builds/{build_id}/{file}
end
CLI->>API: Register build (metadata, object inventory)
end
loop For each active edition
CLI->>API: Create edition (slug, tracking mode, kind)
CLI->>API: Set edition → build pointer
API->>KV: Write edition→build mapping
end
end
CLI->>API: Trigger dashboard renders
The migration tool proceeds in these steps:
1. Query the LTD Keeper database for all products and their active edition→build mappings. For each product, determine which builds are referenced by at least one non-deleted edition.
2. Create Docverse projects via the API. Map LTD product slugs to Docverse project slugs (typically identical). Associate each project with the appropriate Docverse organization.
3. Copy build objects for each unique build referenced by an active edition. Objects are copied from `{product}/builds/{build_id}/` in the source S3 bucket to `{project}/__builds/{build_id}/` in the target object store. The tool populates the `BuildObject` inventory table during this step by recording each object’s key, content hash, content type, and size.
4. Create edition records in Docverse with mapped tracking modes (see below) and appropriate edition kinds. Create `EditionBuildHistory` entries linking each edition to its current build.
5. Seed the CDN edge data store by writing edition→build mappings to the KV store, enabling the CDN to resolve edition URLs to the correct build paths immediately.
6. Render dashboards and metadata JSON by triggering dashboard render jobs for each migrated project.
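The object-copy step is a pure key rewrite between the two layouts. A sketch:

```python
def map_object_key(src_key: str, product: str, project: str, build_id: str) -> str:
    """Rewrite an LTD object key ({product}/builds/{build_id}/...) into the
    Docverse layout ({project}/__builds/{build_id}/...)."""
    prefix = f"{product}/builds/{build_id}/"
    if not src_key.startswith(prefix):
        raise ValueError(f"key {src_key!r} is outside build {build_id}")
    return f"{project}/__builds/{build_id}/{src_key[len(prefix):]}"
```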
Tracking mode mapping#
LTD Keeper tracking modes map directly to their Docverse equivalents.
The names have changed slightly (e.g., git_refs → git_ref), but the semantics are preserved:
| LTD Keeper mode | Docverse mode | Notes |
|---|---|---|
| `git_refs` | `git_ref` | Singular form; same behavior |
| `lsst_doc` | `lsst_doc` | Unchanged |
| `eups_major_release` | `eups_major_release` | Unchanged |
| `eups_weekly_release` | `eups_weekly_release` | Unchanged |
| `eups_daily_release` | `eups_daily_release` | Unchanged |
Edition kinds are inferred from the tracking mode and edition slug:
- Editions with slug `__main` (or equivalent main branch tracking) → kind `main`
- Editions with `lsst_doc` or `eups_*` tracking → kind `release`
- All other editions → kind `draft`
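As a sketch, the inference rules above reduce to a small function (illustrative; mode names follow the mapping table):

```python
def infer_edition_kind(slug: str, tracking_mode: str) -> str:
    """Map a migrated LTD edition to a Docverse edition kind."""
    if slug == "__main":
        return "main"
    if tracking_mode == "lsst_doc" or tracking_mode.startswith("eups_"):
        return "release"
    return "draft"
```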
Rubin-specific notes#
The Rubin migration involves a cross-cloud transfer from AWS S3 to Cloudflare R2. R2 provides an S3-compatible API, so the migration tool can use standard S3 client libraries (boto3/aioboto3) for both source reads and target writes — no Cloudflare-specific SDK is needed.
Estimated data volume: Based on the ~300 LTD products and typical build sizes, the active build data (excluding historical builds not referenced by editions) is estimated at 50–200 GB. The migration tool should support concurrent transfers with configurable parallelism to complete within a reasonable time window (target: under 4 hours for the full corpus).
DNS cutover: The atomic switchover point for Rubin is the DNS change for *.lsst.io from Fastly to Cloudflare.
Before the cutover, the migrated content can be verified on a test domain (e.g., *.lsst-docs-test.org).
The DNS change is the point of no return — after this, all documentation traffic is served by the Cloudflare Workers stack described in Documentation hosting.
Fastly configuration is retained (but not actively serving) for a rollback window.
Client migration#
Client migration updates the CI workflows that upload documentation builds. The key difference is that LTD uses per-file presigned URL uploads authenticated with LTD API tokens, while Docverse uses tarball uploads authenticated with Gafaelfawr tokens (see Projects, editions and builds and GitHub Actions action (docverse-upload)).
The following table summarizes the differences between the current ltd-upload action and the Docverse upload path:
| Aspect | `ltd-upload` (LTD) | Docverse upload |
|---|---|---|
| Authentication | LTD API username + password (repo or org secret) | Gafaelfawr token (org-level secret) |
| Upload mechanism | Per-file presigned URLs from LTD API | Tarball upload to presigned URL from Docverse API |
| Project identifier | `product` input | `org` + `project` inputs |
| API endpoint | | |
| Build registration | | |
Key insight: Many Rubin documentation projects do not call ltd-upload directly.
Instead, they use the lsst-sqre/rubin-sphinx-technote-workflows reusable workflow (and similar reusable workflows for other document types), which internally calls ltd-upload.
Updating the reusable workflow to use the Docverse upload path migrates all downstream projects with zero per-repo changes.
Only projects with custom CI workflows that reference ltd-upload directly need individual updates.
Option A: Retrofit the composite action#
Update lsst-sqre/ltd-upload (or create a new docverse-upload action) to target the Docverse API.
The action accepts a Gafaelfawr token and uploads to Docverse.
The product input maps to the Docverse project; a new org input specifies the organization (defaulting to rubin for the Rubin deployment).
GitHub organization-level secrets (DOCVERSE_TOKEN) provide the Gafaelfawr token to all repositories in the org without per-repo configuration.
Reusable workflows like rubin-sphinx-technote-workflows absorb the interface change — they update their internal ltd-upload usage to pass the new inputs, and all downstream repositories that use the reusable workflow migrate automatically.
For repositories with custom workflows that reference ltd-upload directly, automated PRs update the workflow YAML (see below).
Pros:
No new service infrastructure to deploy or maintain.
Clear, explicit upgrade path — each repository’s workflow file shows which backend it uses.
Reusable workflow pattern covers the majority of Rubin repositories with zero per-repo changes.
Remaining custom-workflow repositories get automated PRs.
Cons:
Repositories with custom workflows need workflow file changes (but this is automatable).
The `ltd-upload` action name becomes a misnomer once it targets Docverse (mitigated by eventually deprecating it in favor of `docverse-upload`).
Option B: LTD API compatibility shim#
Deploy a compatibility service at ltd-keeper.lsst.codes that translates LTD API calls to Docverse API calls.
Existing workflows continue to call the LTD API endpoints; the shim authenticates LTD credentials and forwards requests to Docverse using a service-level Gafaelfawr token.
Critical problem: upload format translation. The LTD upload flow works as follows:
1. Client calls `POST /products/{slug}/builds/` to register a build.
2. LTD Keeper returns a list of per-file presigned S3 URLs.
3. Client uploads each file directly to S3 using the presigned URLs — these uploads bypass the LTD Keeper API entirely.
4. Client calls `PATCH /builds/{id}` to confirm the upload is complete.
The shim can intercept steps 1, 2, and 4, but cannot intercept step 3 because the client uploads directly to S3 using presigned URLs. Docverse expects a single tarball upload, not individual file uploads. To bridge this gap, the shim would need to either:
(a) Replace presigned S3 URLs with shim-hosted upload endpoints, making the shim a file proxy that receives every file, buffers them, assembles a tarball, and uploads it to Docverse. For a large Sphinx site with thousands of files, this turns the shim into a high-throughput file proxy handling gigabytes of traffic.
(b) Monitor the S3 bucket for uploaded files, detect when all files for a build are present, assemble a tarball, and upload it to Docverse. This is fragile (how to detect “all files uploaded”?) and adds significant latency.
Pros:
Zero workflow changes during the shim's lifetime — existing `ltd-upload` calls work unmodified.
Cons:
The upload format translation (per-file → tarball) makes this architecturally complex. The shim is not a thin API translator — it is a stateful file proxy service.
Significant development cost for temporary infrastructure that will be decommissioned once migration is complete.
The shim must handle the full upload throughput of all documentation builds, adding an operational burden (monitoring, scaling, failure handling).
Development cost is never amortized — the shim is throwaway infrastructure.
Recommendation: Option A (retrofit the composite action)#
The shim’s upload format translation complexity is disproportionate to its benefit. What appears at first glance to be a thin API translator is actually a stateful file proxy service, because LTD’s per-file presigned URL upload pattern cannot be transparently mapped to Docverse’s tarball upload pattern without intercepting and buffering all file uploads.
Meanwhile, the reusable workflow pattern means that most Rubin repositories need zero workflow changes — updating rubin-sphinx-technote-workflows (and similar reusable workflows) migrates all downstream repositories automatically.
The remaining repositories with custom workflows receive automated PRs.
This makes Option A both simpler to implement and less disruptive in practice.
Workflow changes are prepared in advance (PRs opened, reusable workflow branches ready) and merged as part of the cutover maintenance window (see below).
Automated PR generation#
For repositories with custom workflows that reference ltd-upload directly, a migration script scans repositories, identifies ltd-upload usage, generates updated workflow YAML, and opens PRs:
1. Use the GitHub API to list repositories in the `lsst` and `lsst-sqre` organizations.
2. For each repository, search `.github/workflows/*.yml` files for references to `lsst-sqre/ltd-upload`.
3. Skip repositories that use reusable workflows (these are migrated by updating the reusable workflow itself).
4. Generate an updated workflow file that replaces the `ltd-upload` step with the new inputs.
5. Open a PR with a standardized description explaining the migration.
Before (LTD):

```yaml
- name: Upload to LSST the Docs
  uses: lsst-sqre/ltd-upload@v1
  with:
    product: pipelines
    dir: _build/html
  env:
    LTD_USERNAME: ${{ secrets.LTD_USERNAME }}
    LTD_PASSWORD: ${{ secrets.LTD_PASSWORD }}
```

After (Docverse via retrofitted action):

```yaml
- name: Upload to Docverse
  uses: lsst-sqre/ltd-upload@v2
  with:
    org: rubin
    project: pipelines
    dir: _build/html
    token: ${{ secrets.DOCVERSE_TOKEN }}
```

Or, using the dedicated `docverse-upload` action directly:

```yaml
- name: Upload to Docverse
  uses: lsst-sqre/docverse-upload@v1
  with:
    org: rubin
    project: pipelines
    dir: _build/html
    token: ${{ secrets.DOCVERSE_TOKEN }}
```
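The scan step of the migration script can be sketched as follows. This is a minimal textual match and substitution; a real tool would parse the workflow YAML and also rewrite the step's inputs (rename `product` to `project`, add `org` and `token`, and swap the secrets) as in the before/after examples above. Function names are hypothetical:

```python
import re


def references_ltd_upload(workflow_text: str) -> bool:
    """Return True if a workflow file uses the lsst-sqre/ltd-upload action."""
    return bool(re.search(r"uses:\s*lsst-sqre/ltd-upload@", workflow_text))


def rewrite_uses_line(workflow_text: str) -> str:
    """Point the upload step at the new action (inputs rewritten separately)."""
    return re.sub(
        r"uses:\s*lsst-sqre/ltd-upload@v1",
        "uses: lsst-sqre/docverse-upload@v1",
        workflow_text,
    )
```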
Migration phases#
The migration proceeds in four phases, each with a clear milestone and rollback strategy:
| Phase | Description | Key milestone | Rollback strategy |
|---|---|---|---|
| 0: Preparation | Deploy Docverse alongside LTD. Create organizations, configure object stores and CDN, provision Gafaelfawr tokens, validate with test projects. | Docverse deployed and validated with test projects | Remove Docverse deployment; no user impact (LTD unchanged) |
| 1: Data migration | Run the migration tool to copy all active builds, editions, and metadata from LTD. | All active builds and editions migrated and verified | Discard migrated data and re-run; LTD still serving production |
| 2: Client preparation | Prepare all workflow changes without activating them: update reusable workflows and open automated PRs for custom workflows. | All PRs open and tested; org secrets provisioned | Close PRs; no user impact (LTD unchanged) |
| 3: Cutover | Short maintenance window: run final data sync to capture builds since Phase 1, merge all workflow PRs and reusable workflow changes, switch DNS for `*.lsst.io` from Fastly to Cloudflare. | Production DNS on Cloudflare, all repos uploading to Docverse | Revert DNS to Fastly and revert workflow merges; Fastly configuration retained for rollback window |
Phase 0 and Phase 1 proceed with no user-visible changes — LTD continues to serve all production traffic. Phase 2 is preparation — PRs are opened and tested but not merged, so LTD remains fully operational. Phase 3 is the atomic cutover within a short maintenance window. Documentation remains readable throughout (LTD serves until DNS propagates). The window only affects new uploads, which are briefly paused while workflow changes and DNS propagate. A final data sync at the start of Phase 3 ensures Docverse has all builds up to the cutover moment. Estimated window: 1–2 hours.
Risk mitigation#
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| Data corruption during migration | Incorrect documentation served | Low | Verify migrated content against LTD originals using content hash comparison; test domain validation before DNS cutover |
| DNS cutover causes outage | Documentation unavailable | Low | Test DNS configuration in advance; retain Fastly configuration for rapid rollback; use low TTL during cutover window |
| Incomplete build migration | Missing pages or broken links | Medium | Migration tool validates object counts per build against LTD inventory; flag discrepancies for manual review |
| Gafaelfawr token provisioning issues | CI uploads fail | Medium | Provision and test org-level secrets during Phase 2 preparation; validate with test uploads before cutover |
| Builds in flight during cutover | Some CI builds fail or upload to wrong backend | Low | Announce maintenance window in advance; re-trigger any builds that fail during the cutover window |
| Reusable workflow update breaks downstream repos | CI failures across many repos | Low | Test reusable workflow changes against a representative sample of downstream repos before merging; reusable workflow versioning allows rollback |
Comment format#
The comment uses a Markdown table to list all updated editions with their published URLs:
Edition data is extracted from the completed job's `editions_completed` progress array (see Task queue design). For partial failures (job status `completed_with_errors`), successful editions appear in the main table; failed and skipped editions are listed in a collapsible `<details>` block below.
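A sketch of the comment rendering, assuming hypothetical `slug`, `published_url`, and `error` fields on each progress entry (the actual field names are defined by the task queue design):

```python
def render_comment(editions, failed=()):
    """Render the PR comment body: a Markdown table of published editions,
    with failed/skipped editions in a collapsible <details> block."""
    lines = ["| Edition | URL |", "|---|---|"]
    for e in editions:
        lines.append(f"| {e['slug']} | {e['published_url']} |")
    if failed:
        lines.append("")
        lines.append("<details><summary>Failed or skipped editions</summary>")
        lines.append("")
        for e in failed:
            lines.append(f"- {e['slug']}: {e.get('error', 'skipped')}")
        lines.append("")
        lines.append("</details>")
    return "\n".join(lines)
```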