SQR-112

Docverse documentation hosting platform design#

Abstract

For a decade, Rubin Observatory has hosted its documentation sites with its LSST the Docs service. That service has provided excellent capabilities and performance, but its implementation is now out of step with our current application design practices, and some long-standing bugs have proven difficult to fix and new features difficult to retrofit. Docverse is the next iteration of LSST the Docs: it retains the core capability of hosting versioned, static web documentation while resolving the feature gaps, performance issues, and bugs we experienced with LSST the Docs.

Introduction#

Rubin Observatory’s technical documentation has long followed the docs-like-code model where documentation is authored in Git repositories, often alongside and in conjunction with code. The LSST the Docs (LTD) application (SQR-006) has played a critical role in this ecosystem by providing a platform for hosting versioned documentation sites. For users, LTD’s design ensures documentation is served directly through a CDN and object store hosted in the public cloud, so that performance and reliability are excellent even under high load and aren’t affected by application-level issues. For project staff, LTD provides a seamless experience for integrating documentation hosting with their development and deployment workflows. When new branches or tags are pushed to a project, LTD creates documentation editions for those corresponding versions automatically.

Lessons learned from LTD#

After a decade of operating LTD, though, we have identified a number of areas where we either wish to improve the platform or need to resolve issues and limitations in the existing implementation:

  • The LTD codebase is a Flask (synchronous) Python application, whereas we now build applications with FastAPI and use asyncio throughout our codebases.

  • Edition updates need to be near-instant and must never produce 404s for users. With LTD, an edition update for a large documentation site could take nearly 15 minutes.

  • We need greater flexibility in how LTD is configured and operates, such as publishing projects under subpaths rather than subdomains.

  • We need to be able to update and refine the project edition dashboards more easily.

  • Build uploads are slow for large documentation sites.

  • We need to be able to purge outdated draft editions and help projects ensure that readers use the default edition unless they explicitly choose a different one.

From LTD to Docverse#

Around 2022 we started work on a second version of LTD to fill some of these gaps. However, in the scope of that work we wanted to retain the existing Flask codebase and maintain compatibility with the existing API, with only a limited set of new API endpoints. At this point, we’ve realized that a more comprehensive reimplementation is needed to fully address the design goals. Rather than maintain compatibility with LTD, we will migrate existing documentation sites and projects to the new platform.

Key Docverse features and changes from LTD#

  • Implementation with FastAPI and Safir

  • Queue system built on a backend-agnostic abstraction layer, with Arq (via Safir) and Redis as the initial implementation, replacing the Celery system in LTD. The abstraction enables future evaluation of alternative queue backends without disrupting application logic.

  • Works with Gafaelfawr tokens for authentication and group membership, replacing the custom token system in LTD

  • Organization models to support multiple organizations with separate documentation domains and configurations hosted from the same Docverse instance

  • Support for Cloudflare to provide instant edition updates.

  • Improved build upload performance by allowing for multipart tarball uploads rather than requiring clients to upload individual files.

  • Support for deleting draft editions, including automation with GitHub webhooks to delete draft editions for deleted branches.

  • Edition dashboards are built from templates that are synchronized from GitHub so that the edition dashboards are easier to update and customizable by organization members.

  • Support for the versions.json files used by pydata-sphinx-theme and other documentation themes to power client-side edition switching.
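The versions.json payload can be generated from edition metadata at publish time. The sketch below assumes hypothetical edition field names (slug, title, is_default) and the /v/{slug}/ edition URL convention; the entry keys (name, version, url, preferred) follow the pydata-sphinx-theme version switcher format:

```python
import json


def render_versions_json(project_url: str, editions: list[dict]) -> str:
    """Render a versions.json payload for client-side edition switchers.

    Edition dicts are assumed to have ``slug``, ``title``, and
    ``is_default`` keys (illustrative names, not the real model).
    """
    entries = [
        {
            "name": e["title"],
            "version": e["slug"],
            "url": f"{project_url}/v/{e['slug']}/",
            # pydata-sphinx-theme marks the preferred entry and can
            # warn readers viewing any other version.
            "preferred": e["is_default"],
        }
        for e in editions
    ]
    return json.dumps(entries, indent=2)


payload = render_versions_json(
    "https://sqr-006.lsst.io",
    [
        {"slug": "main", "title": "Current", "is_default": True},
        {"slug": "v1", "title": "v1", "is_default": False},
    ],
)
```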

Organization of this technote#

This technote dives into the design of Docverse at a fairly technical level to provide a record of design decisions and a reference for implementation.

  • Organizations discusses the organization model, including how organization configuration is supported by the API and database schemas.

  • Authentication and authorization describes how Docverse uses Gafaelfawr both to protect API endpoints for different roles, and to provide user and group information for fine-grained access control.

  • Projects, editions, and builds describes the core data model that Docverse carries over from LTD, but with substantial improvements to capabilities and performance.

  • Documentation hosting explores CDN and edge compute architectures for serving documentation, explains how pointer mode eliminates the S3 copy-on-publish bottleneck, and how organizations configure hosting infrastructure.

  • Dashboard templating system describes the new dashboard templating system that allows edition dashboards to be built from templates stored in GitHub repositories.

  • Code architecture describes Docverse’s layered architecture, factory pattern for multi-tenant client construction, protocol-based abstractions for object stores and CDN providers, and the client-server monorepo structure including the Python client library and CLI.

  • Queue system describes the queue system for processing edition updates and build uploads, built on a backend-agnostic abstraction with Arq as the initial implementation.

  • REST API design describes the API design and schema definitions for Docverse.

  • GitHub Actions action describes the native JavaScript GitHub Action for uploading documentation builds from GitHub Actions workflows.

  • Migration from LSST the Docs covers the data and client migration plan for moving existing LTD deployments to Docverse, including migration tooling, phased rollout, and risk mitigation.

Organizations design#

With organizations, a single Docverse deployment can host multiple documentation domains. Doing so can enable a single institution (like AURA or NOIRLab) to host documentation for multiple missions with completely separate white-labelled presentations. It can also enable SQuaRE to provide Docverse-as-a-service for partner institutions.

The Organization is the sole infrastructure configuration boundary — all projects within an org share the same object store, CDN, root domain, URL scheme, and default dashboard templates.

Docverse follows a “bring your own infrastructure” strategy: each organization provisions and owns its cloud resources (object store buckets, CDN services, DNS zones) rather than Docverse providing centralized, shared infrastructure. This design is motivated by three concerns. First, cost allocation — organizations pay for and control their own cloud spend directly, avoiding the need for Docverse to meter usage or redistribute costs. Second, data ownership — organizations retain full ownership of their stored documentation artifacts; Docverse itself only stores connection metadata and encrypted credentials, never the data at rest. Third, regulatory compliance — some organizations face restrictions such as ITAR export controls that dictate where documentation can be hosted and who can access the underlying storage, requirements that are simplest to satisfy when the organization controls its own accounts.

Organization configuration#

Each organization owns:

  • Object store: bucket name, credentials, provider (AWS S3, GCS, generic S3-compatible, or Cloudflare R2). The bucket is provisioned externally by the org admin; Docverse stores connection details.

  • Staging store (optional): a separate object store bucket used for build tarball uploads, configured with the same shape as the publishing object store (provider, bucket, credentials). When configured, presigned upload URLs point to the staging bucket and the Docverse worker reads tarballs from it. When not configured, staging uses a __staging/ prefix in the publishing bucket. The staging store optimization is useful when the publishing store is on a different network than the Docverse compute cluster – for example, a GCS staging bucket in the same region as a GKE cluster paired with a Cloudflare R2 publishing bucket. The staging bucket only needs transient storage; tarballs are deleted after processing.

  • CDN: provider choice (Fastly, Cloudflare Workers, Google Cloud CDN), service ID, API keys for cache purging. The CDN is provisioned externally; Docverse only interacts at runtime for cache invalidation and edge data store updates.

  • DNS: For subdomain-based layouts, Docverse registers subdomains via DNS APIs (e.g., Route 53, Cloudflare DNS). When using Cloudflare, wildcard subdomains are supported on all plans via a proxied *.domain DNS record with free wildcard SSL.

  • URL scheme (per-org setting, one of):

    • Subdomain: each project gets project.base-domain (e.g., sqr-006.lsst.io)

    • Path-prefix: all projects under a root path (e.g., example.com/documentation/project)

  • Base domain (e.g., lsst.io) and root path prefix (for path-prefix mode).

  • Dashboard templates: a GitHub repo containing Jinja templates and assets, configured at the org level with optional per-project overrides. See the Dashboard Templating System section for full details.

  • Edition slug rewrite rules: an ordered list of rules that transform git refs into edition slugs. Configured at the org level with optional per-project overrides. See the Edition slug rewrite rules section for the full rule format.

  • Default edition lifecycle rules (Projects, editions and builds).

Credential storage#

Docverse encrypts organization credentials at rest using Fernet symmetric encryption from the cryptography library. Fernet provides AES-128-CBC encryption with HMAC-SHA256 authentication — ciphertext is tamper-evident and self-describing (the token embeds a timestamp and version byte). Encryption and decryption are in-process CPU-bound operations (sub-millisecond), requiring no external service calls or network round-trips. A single Fernet key is stored as a Kubernetes secret, never in the database, so database backups alone cannot decrypt credentials.

This approach avoids the operational complexity of Vault Transit (running a Vault instance, configuring Kubernetes auth, managing Vault policies, network round-trips for every encrypt/decrypt) for what amounts to encrypting a small number of short API tokens and keys. The cryptography library is already a transitive dependency via Safir.

Key provisioning#

The Fernet encryption key is provisioned through Phalanx’s standard secrets management. In the application’s secrets.yaml:

credential-encryption-key:
  description: >-
    Fernet key for encrypting organization credentials at rest.
  generate:
    type: fernet-key

Phalanx auto-generates the key, stores it in 1Password, and syncs it to a Kubernetes Secret. The key never appears in the database.

Key loading#

At startup, the application loads the encryption key from environment variables sourced from the Kubernetes Secret:

  • DOCVERSE_CREDENTIAL_ENCRYPTION_KEY — the current primary Fernet key.

  • DOCVERSE_CREDENTIAL_ENCRYPTION_KEY_RETIRED (optional) — a retired key, present only during rotation periods.

When both keys are present, Docverse constructs a MultiFernet([Fernet(primary), Fernet(retired)]). MultiFernet tries decryption with each key in order, so credentials encrypted under either key are readable, while new encryptions always use the primary key. When only the primary key is present, Docverse still wraps it in MultiFernet([Fernet(primary)]) to provide a uniform interface.
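This dual-key behavior can be exercised directly with the cryptography primitives. The sketch below uses freshly generated keys and an illustrative plaintext rather than real deployment secrets:

```python
from cryptography.fernet import Fernet, MultiFernet

primary_key = Fernet.generate_key()
retired_key = Fernet.generate_key()

# A token written before rotation, under the now-retired key.
old_token = Fernet(retired_key).encrypt(b"r2-api-token")

# During a rotation period both keys are loaded; primary comes first.
multi = MultiFernet([Fernet(primary_key), Fernet(retired_key)])

# Decryption tries each key in order, so the old token still reads.
assert multi.decrypt(old_token) == b"r2-api-token"

# New encryptions always use the first (primary) key.
new_token = multi.encrypt(b"r2-api-token")
assert Fernet(primary_key).decrypt(new_token) == b"r2-api-token"
```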

Database schema#

Fernet tokens are self-describing (they embed a version byte, timestamp, IV, and HMAC), so the database schema needs no separate columns for nonces, key versions, or algorithm metadata:

CREATE TABLE organization_credentials (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    organization_id UUID NOT NULL REFERENCES organizations(id),
    label TEXT NOT NULL,
    service_type TEXT NOT NULL,
    encrypted_credential TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE(organization_id, label)
);

The label is a human-friendly name (e.g., “Cloudflare R2 production”). The service_type identifies the provider (e.g., cloudflare, aws_s3, fastly). Credentials are write-only through the API — the GET response returns metadata (label, service type, timestamps) but never the decrypted value. See Database models for all database tables and organization_credentials for the column reference.

Python integration#

CredentialEncryptor is a thin wrapper around MultiFernet that handles str↔bytes encoding:

from cryptography.fernet import Fernet, MultiFernet


class CredentialEncryptor:
    """Encrypt and decrypt organization credentials using Fernet."""

    def __init__(
        self,
        primary_key: str,
        retired_key: str | None = None,
    ) -> None:
        keys = [Fernet(primary_key)]
        if retired_key:
            keys.append(Fernet(retired_key))
        self._fernet = MultiFernet(keys)

    def encrypt(self, plaintext: str) -> str:
        """Encrypt a credential, returning a Fernet token."""
        return self._fernet.encrypt(
            plaintext.encode()
        ).decode()

    def decrypt(self, token: str) -> str:
        """Decrypt a Fernet token to recover the credential."""
        return self._fernet.decrypt(
            token.encode()
        ).decode()

    def rotate(self, token: str) -> str:
        """Re-encrypt a token under the current primary key.

        If the token is already encrypted under the primary key,
        the result is a fresh token (new IV and timestamp) under
        the same key. MultiFernet.rotate() is idempotent in the
        sense that calling it repeatedly always produces a valid
        token under the primary key.
        """
        return self._fernet.rotate(
            token.encode()
        ).decode()

All methods are synchronous — Fernet operations are sub-millisecond CPU-bound work, so no async/await is needed. The service layer calls decrypt when constructing an org-specific storage or CDN client; the plaintext is held only in memory for the duration of client construction.

In the factory pattern, CredentialEncryptor is a process-level singleton in ProcessContext. Since it holds no network connections or file handles, no shutdown cleanup is needed.

For testing, construct a CredentialEncryptor with Fernet.generate_key() — no mocking or external services required.

Key rotation#

Key rotation uses MultiFernet to provide a zero-downtime transition:

  1. Generate a new Fernet key in 1Password (or let Phalanx regenerate).

  2. Deploy with both keys: set the new key as DOCVERSE_CREDENTIAL_ENCRYPTION_KEY and the old key as DOCVERSE_CREDENTIAL_ENCRYPTION_KEY_RETIRED. Restart pods. At this point, MultiFernet decrypts credentials under either key; new encryptions use the new key.

  3. Run the credential_reencrypt job (scheduled periodically; see Periodic job scheduling). The job iterates over all organization_credentials rows and calls CredentialEncryptor.rotate(), which re-encrypts each token under the current primary key. Unlike Vault’s vault:vN: prefix, Fernet tokens don’t indicate which key encrypted them, so the job processes all rows unconditionally. MultiFernet.rotate() is idempotent — re-encrypting an already-migrated token simply produces a new token under the same primary key.

  4. Remove the retired key: once the re-encryption job completes, remove DOCVERSE_CREDENTIAL_ENCRYPTION_KEY_RETIRED and restart pods.
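Steps 3 and 4 can be checked with the same primitives: after MultiFernet.rotate(), a token decrypts with the primary key alone, so the retired key can safely be dropped. Keys and plaintext below are illustrative:

```python
from cryptography.fernet import Fernet, InvalidToken, MultiFernet

new_key = Fernet.generate_key()
old_key = Fernet.generate_key()

# A credential encrypted before rotation, under the old key.
token = Fernet(old_key).encrypt(b"fastly-api-key")

# Rotation period: new key is primary, old key is retired.
multi = MultiFernet([Fernet(new_key), Fernet(old_key)])

# Step 3: re-encrypt the stored token under the primary key.
rotated = multi.rotate(token)

# Step 4: the rotated token no longer needs the retired key.
assert Fernet(new_key).decrypt(rotated) == b"fastly-api-key"

# The original token was never readable by the new key alone.
try:
    Fernet(new_key).decrypt(token)
except InvalidToken:
    pass
```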

Organization management#

Organizations are created and configured via the Docverse API (not statically in Helm/Phalanx config). This keeps orgs in the same database as projects for consistency. The API has two tiers of admin endpoints:

  • Docverse superadmin APIs: create/delete/list organizations (scoped via Gafaelfawr token scope)

  • Org admin APIs: configure the org’s settings, manage projects, manage edition rules

Dashboard rendering#

Docverse subsumes the role of LTD Dasher. Dashboard pages (/v/index.html) and custom 404 pages are rendered server-side using Jinja templates and project/edition metadata from the database, then uploaded as self-contained HTML files to the object store and served through the CDN. See the Dashboard templating system section for the full design.
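A minimal sketch of this rendering step, with a hypothetical inline template standing in for the GitHub-synchronized templates and illustrative context field names:

```python
from jinja2 import Environment

# Hypothetical template; real templates come from the org's
# GitHub template repository.
TEMPLATE = """\
<h1>{{ project.title }}</h1>
<ul>
{%- for edition in editions %}
  <li><a href="/v/{{ edition.slug }}/">{{ edition.title }}</a></li>
{%- endfor %}
</ul>
"""

env = Environment(autoescape=True)
html = env.from_string(TEMPLATE).render(
    project={"title": "SQR-006"},
    editions=[
        {"slug": "main", "title": "Current"},
        {"slug": "v1", "title": "v1"},
    ],
)
# html is then uploaded to the object store as /v/index.html.
```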

Re-rendering is triggered by:

  • Template repo changes (via GitHub webhook)

  • Project metadata changes

  • Edition updates (new build published, edition created/deleted)

Dashboard rendering is handled asynchronously via the task queue.

Relationship to LTD Keeper v2#

The LTD Keeper v2 Organization model (in keeper/models.py) established much of this design: the OrganizationLayoutMode enum (subdomain vs path), Fernet-encrypted credentials, org-scoped Tags and DashboardTemplates, and the Product→Organization foreign key. Key changes in Docverse:

  • Cloud-agnostic storage (not AWS-only)

  • Configurable CDN provider (not Fastly-only)

  • GitHub-repo-based dashboard templates (not S3-bucket-stored)

  • Gafaelfawr auth replacing the User/Permission model

  • Infrastructure configuration at the org level, not the project level (see Projects, editions and builds)

Authentication and authorization#

Docverse uses a two-layer authorization model:

  1. Ingress layer (Gafaelfawr): authenticates requests and enforces a baseline access scope for Docverse admin and Docverse API user access.

  2. Application layer (Docverse): performs fine-grained, org-scoped authorization using group memberships.

This approach keeps Gafaelfawr configuration minimal (no per-org ingress rules) while giving Docverse full control over org-level access policies that org admins can manage through the API without Phalanx Helm changes.

Ingress layer#

Two GafaelfawrIngress resources protect the Docverse API (REST API design Ingress and authorization mapping). The general ingress requires the exec:docverse scope for all API routes and requests a delegated internal token so that Docverse can query Gafaelfawr’s user-info API for the authenticated user’s group memberships. A separate admin ingress requires the admin:docverse scope for superadmin routes.

apiVersion: gafaelfawr.lsst.io/v1alpha1
kind: GafaelfawrIngress
metadata:
  name: docverse-ingress
config:
  scopes:
    all:
      - "exec:docverse"
  service: docverse
  delegate:
    internal:
      scopes:
        - "exec:docverse"
template:
  metadata:
    name: docverse-ingress
  spec:
    rules:
      - http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: docverse
                  port:
                    name: http
---
apiVersion: gafaelfawr.lsst.io/v1alpha1
kind: GafaelfawrIngress
metadata:
  name: docverse-admin-ingress
config:
  scopes:
    all:
      - "admin:docverse"
  service: docverse
template:
  metadata:
    name: docverse-admin-ingress
  spec:
    rules:
      - http:
          paths:
            - path: /admin
              pathType: Prefix
              backend:
                service:
                  name: docverse
                  port:
                    name: http

Gafaelfawr provides the following headers to Docverse on each authenticated request:

  • X-Auth-Request-User: the authenticated username

  • X-Auth-Request-Email: the user’s email (if available)

  • X-Auth-Request-Token: the delegated internal token

The Docverse superadmin role is enforced at the ingress level. The admin:docverse scope is mapped from an identity-provider group via Gafaelfawr’s groupMapping configuration. Routes that manage organizations (/admin/*) require this scope.

Application-layer authorization#

All org-scoped authorization is handled inside Docverse, using the authenticated user’s username and group memberships.

Retrieving user metadata#

On each request to an org-scoped endpoint, Docverse uses the delegated token from X-Auth-Request-Token to call Gafaelfawr’s /auth/api/v1/user-info API. This returns the user’s full metadata, including group memberships (sourced from LDAP, CILogon, or GitHub depending on the Gafaelfawr deployment). The response is cached within the request (in the factory) so a single request never calls the user-info API more than once.
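The request-scoped cache can be sketched as a tiny memoizing wrapper (names here are illustrative; the real implementation lives in the request factory and calls the user-info endpoint with the delegated token over an HTTP client):

```python
import asyncio


class UserInfoCache:
    """Memoize one user-info lookup for the life of a request."""

    def __init__(self, fetch) -> None:
        # fetch is an async callable that would hit Gafaelfawr's
        # /auth/api/v1/user-info endpoint with the delegated token.
        self._fetch = fetch
        self._info: dict | None = None

    async def get(self) -> dict:
        if self._info is None:
            self._info = await self._fetch()
        return self._info


calls = 0


async def fake_fetch() -> dict:
    """Stand-in for the Gafaelfawr API call, counting invocations."""
    global calls
    calls += 1
    return {"username": "jdoe", "groups": [{"name": "g_spherex"}]}


async def main() -> tuple[dict, dict]:
    cache = UserInfoCache(fake_fetch)
    # Two lookups in the same request hit the API only once.
    return await cache.get(), await cache.get()


first, second = asyncio.run(main())
```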

Org membership table#

Docverse maintains an OrgMembership table in its database that maps users and groups to roles within organizations:

  • id (UUID): Primary key

  • org_id (FK → Organization): The organization

  • principal (str): A username or group name

  • principal_type (enum): user or group

  • role (enum): reader, uploader, or admin

See OrgMembership in the database schema section for the complete column reference.

A membership row can reference either a username (for individual grants, e.g., a CI bot user) or a group name (for group-based grants, matching the groups from Gafaelfawr/LDAP, e.g., g_spherex).

Roles#

Three org-level roles, plus one global role:

  • reader: List and read projects, editions, and builds within the org (read-only API access)

  • uploader: Everything reader can do, plus create builds (upload documentation). Designed to be safe for CI tokens.

  • admin: Everything uploader can do, plus create/manage projects, configure org settings, manage editions and lifecycle rules, and manage org membership.

  • superadmin (global): Create and manage organizations. Enforced via Gafaelfawr scope, not the OrgMembership table.

No per-project ACLs — authorization is at the org level.

Role resolution#

On each request to an org-scoped endpoint, Docverse resolves the user’s effective role for the target organization:

  1. Get the username from X-Auth-Request-User.

  2. Get the user’s group memberships from the Gafaelfawr user-info API (using the delegated token, cached within the request).

  3. Query the OrgMembership table for all rows matching the username or any of the user’s groups, scoped to the target org.

  4. Take the highest-privilege role found (admin > uploader > reader).

  5. If no matching rows exist, the user has no access to that org (403).

This resolution is implemented as a FastAPI dependency that handlers can require, receiving the resolved role (or raising 403).

Examples#

  • “Everyone in g_spherex is an uploader for the SPHEREx org”: single OrgMembership row with principal=g_spherex, principal_type=group, role=uploader.

  • “CI bot docverse-ci-spherex is an uploader for the SPHEREx org”: single row with principal=docverse-ci-spherex, principal_type=user, role=uploader.

  • “User jdoe is an admin for the Rubin org”: single row with principal=jdoe, principal_type=user, role=admin.

Org admins manage membership through the Docverse API — no Phalanx Helm changes needed.

CI bot tokens#

For CI/CD pipelines (e.g., GitHub Actions uploading documentation), Gafaelfawr tokens are created using the admin token creation API with a bot username (e.g., bot-docverse-ci-rubin). The resulting token string is stored as a GitHub Actions secret. The bot username is then added to the relevant org’s OrgMembership table with the uploader role. See Client-server monorepo and GitHub Actions action (docverse-upload) for how the Python client and the GitHub Action consume these tokens.

Testing#

For testing, the X-Auth-Request-User header can be set directly (following the Safir testing pattern). The Gafaelfawr user-info API call can be mocked to return specific group memberships for test users. Safir provides a mock for this purpose via the rubin-gafaelfawr package.

Projects, editions and builds#

The key domain entities from LTD are retained in Docverse.

  • Project: a documentation site with a stable URL and multiple versions (editions). Projects are owned by organizations and have metadata like name, description, and configuration.

  • Edition: a published version of a project, representing a specific build’s content at a stable URL. Editions have tracking modes that determine which builds they follow (e.g., main edition tracks the default branch, DM-12345 edition tracks branches matching tickets/DM-12345).

  • Build: a discrete upload of documentation content for a project. Builds are conceptually immutable and carry metadata about their origin (git ref, uploader identity, etc). Builds are identified with a Crockford Base32 ID.

Project model simplification#

In the original LTD API, projects were called “products” to borrow terminology from EUPS. With time, we realized that EUPS wasn’t relevant to LTD, and we shifted the terminology to “project.”

With Docverse, we further improve projects by removing all infrastructure configuration from the project model and moving it to the organization level. This simplifies the project model and reflects the fact that infrastructure configuration (e.g., object store settings, CDN settings) is typically shared across all projects within an organization. See the Organizations design section for details on the new organization model and how infrastructure configuration is handled there.

Improved build uploads#

An issue with the LTD API was that build uploads were slow for large documentation sites, because the client uploaded each file individually using presigned URLs.

Docverse uses a tarball-to-object-store upload model. The client compresses the built documentation into a tarball, uploads it directly to the object store via a presigned URL, then signals Docverse to process it. This avoids the performance problems of per-file presigned URLs (HTTP overhead per file, no compression, thousands of separate uploads for large Sphinx sites) and keeps the API server thin by never routing large request bodies through the API. See Client-server monorepo for the Python client library and GitHub Actions action (docverse-upload) for the GitHub Action that implement this upload flow.
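The client side of this flow reduces to packing the build output and issuing one PUT. A sketch of the packing step using the standard library (an in-memory variant; a real client would stream tar czf output rather than buffering the whole archive):

```python
import io
import tarfile


def build_tarball(files: dict[str, bytes]) -> bytes:
    """Pack built documentation into a gzipped tarball in memory.

    ``files`` maps site-relative paths to file contents.
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for path, content in files.items():
            info = tarfile.TarInfo(name=path)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


tarball = build_tarball(
    {"index.html": b"<h1>Docs</h1>", "_static/app.css": b"body{}"}
)
# The tarball is then PUT to the presigned URL returned by the
# build creation response, e.g. httpx.put(presigned_url, content=tarball).
```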

End-to-end flow#

This sequence diagram illustrates the end-to-end flow of a build upload and processing:

sequenceDiagram
    participant Client
    participant API as Docverse API
    participant Store as Object Store
    participant Worker as Docverse Worker

    Client->>API: Authenticate (Gafaelfawr token)
    Client->>API: POST create build (git ref, content hash)
    API-->>Client: Presigned upload URL
    Client->>Store: Upload tarball (PUT via presigned URL)
    Client->>API: PATCH signal upload complete
    API-->>Client: queue_url for tracking
    API->>Worker: Enqueue build processing job
    Worker->>Store: Download tarball from staging
    Worker->>Store: Stream-unpack & upload files to build prefix
    Worker->>Store: Delete staging tarball
    Worker->>Worker: Inventory objects in Postgres, evaluate tracking rules
    Worker->>Store: Update editions (pointer or copy mode)
    Worker->>Worker: Render project dashboard
    
  1. Client authenticates with a Gafaelfawr token (org uploader role).

  2. Client creates a build record via the API (POST /orgs/:org/projects/:project/builds). The request includes the git ref and a content hash of the tarball. The response provides a single presigned upload URL pointing to the staging location. See the REST API design section for request/response details.

  3. Client uploads the tarball directly to the object store using the presigned URL. The tarball is a gzipped tar archive (.tar.gz) of the built documentation directory. The client implementation is straightforward – tar czf piped to an HTTP PUT. The object store handles the bandwidth; the API server is not involved. Multipart uploads are supported where the object store provider allows (S3, GCS), enabling resumable uploads for large sites.

  4. Client signals upload complete (PATCH /orgs/:org/projects/:project/builds/:build with status update). The response includes a queue_url for tracking the background processing.

  5. Background processing – a single background job executes the build processing pipeline:

    1. Download tarball from the staging location (staging bucket or __staging/ prefix in the publishing bucket).

    2. Stream-unpack and upload: extract entries from the tar stream and upload individual files to the build’s permanent prefix (__builds/{build_id}/) in the publishing bucket. Uploads are parallelized via an asyncio.Semaphore-bounded pool of concurrent uploads. For a 5,000-file Sphinx site, this processes in well under a minute.

    3. Delete staging tarball after successful extraction.

    4. Inventory the build’s objects in Postgres.

    5. Evaluate tracking rules to determine affected editions.

    6. Update editions in parallel via asyncio.gather().

    7. Render project dashboard and metadata JSON once after all edition updates.

The API handler for step 4 is thin: it validates the request, updates the build status to processing, enqueues the background job, and returns the queue_url.

Staging location#

The tarball is uploaded to a staging location that is separate from the build’s permanent prefix. The staging path is __staging/{build_id}.tar.gz. Where this lives depends on whether the org has a dedicated staging store configured:

  • With staging store: the presigned URL points to the staging bucket. The worker reads from the staging bucket (fast, intra-region) and writes extracted files to the publishing bucket.

  • Without staging store: the presigned URL points to the publishing bucket using the __staging/ prefix. The worker reads and writes within the same bucket.

The staging store optimization matters when the publishing bucket is on a different network than the Docverse compute cluster. For example, in Rubin Observatory’s deployment:

  • Staging bucket: GCS in us-central1 (same region as the GKE cluster). The CI runner uploads the tarball to GCS. The Docverse worker downloads it over Google’s internal network – fast, with zero egress cost.

  • Publishing bucket: Cloudflare R2. The worker uploads extracted files to R2, which is optimized for CDN serving with zero egress cost to readers.

Without the split, the worker would download the tarball from R2 over the public internet, unpack it, then upload thousands of individual files back to R2 over the public internet. With the split, only the final extracted files cross the network boundary, and the tarball round-trip stays within GCP.

For orgs where the publishing store is in the same region as the cluster (e.g., GCS publishing via Google Cloud CDN, or S3 publishing via CloudFront in the same AWS region), the staging store adds no benefit and can be left unconfigured.

Stream unpacking#

The worker unpacks the tarball using Python’s tarfile module in streaming mode, uploading each file to the publishing bucket as it’s extracted. This keeps memory usage bounded – the worker never needs to hold the entire unpacked site in memory or on disk:

async def unpack_and_upload(
    self,
    staging_store: ObjectStore,
    publishing_store: ObjectStore,
    build_id: str,
    semaphore: asyncio.Semaphore,
) -> list[str]:
    """Stream-unpack a tarball and upload files to the build prefix."""
    tarball_stream = await staging_store.get_object_stream(
        f"__staging/{build_id}.tar.gz"
    )
    uploaded_keys: list[str] = []
    upload_tasks: list[asyncio.Task] = []

    # Mode "r|gz" reads the archive as a forward-only stream, so a
    # non-seekable object store stream works without buffering it.
    with tarfile.open(fileobj=tarball_stream, mode="r|gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            file_obj = tar.extractfile(member)
            if file_obj is None:
                continue
            content = file_obj.read()
            key = f"__builds/{build_id}/{member.name}"

            # Acquire before spawning so at most N file payloads are
            # in flight; awaiting here also yields to the event loop,
            # letting uploads overlap with extraction.
            await semaphore.acquire()

            async def _upload(k: str, data: bytes) -> None:
                try:
                    await publishing_store.upload_object(
                        key=k,
                        data=data,
                        content_type=guess_content_type(k),
                    )
                finally:
                    semaphore.release()

            upload_tasks.append(asyncio.create_task(_upload(key, content)))
            uploaded_keys.append(key)

    await asyncio.gather(*upload_tasks)
    await staging_store.delete_object(f"__staging/{build_id}.tar.gz")
    return uploaded_keys

The semaphore bounds concurrency (e.g., 50 concurrent uploads) to avoid overwhelming the object store API. Content types are inferred from file extensions.
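The guess_content_type helper referenced above isn't defined in this document; a minimal sketch based on Python's standard mimetypes module might look like:

```python
import mimetypes


def guess_content_type(key: str) -> str:
    """Infer a Content-Type from the object key's file extension.

    Unknown extensions fall back to application/octet-stream.
    """
    content_type, _encoding = mimetypes.guess_type(key)
    return content_type or "application/octet-stream"
```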

The choice of tar.gz over ZIP is deliberate. Tar archives are designed for sequential streaming (originally tape I/O), and each entry header includes the file size upfront, so the worker can stream each file directly into an object store upload without buffering. Gzip compresses the archive as a single stream, which yields better compression ratios for documentation sites whose HTML, CSS, and JavaScript files share significant redundancy — ZIP, by contrast, compresses each file independently and cannot exploit cross-file similarity. ZIP’s main advantage is random access via its central directory, but that is irrelevant here since the worker extracts every file sequentially.

Build object inventory table#

Each build has an associated set of object records in Postgres: key (object path), content hash (ETag or SHA-256), content type, and size. This is populated during the inventory phase of the build processing job. The inventory enables:

  • Fast diff computation for edition updates (no object store listing calls)

  • Orphan detection for build cleanup rules

  • Metadata for dashboards (e.g., build size)

Note that this inventory table is motivated by the original S3 and Fastly-based architecture, where editions are updated by copying build objects. In the Cloudflare-based architecture, where editions are updated by pointing to the build prefix, the inventory is less critical for edition updates but remains valuable for orphan detection and dashboard metadata. See BuildObject for the column definition.

Edition overview#

An Edition is a named, published view of a Product’s documentation at a stable URL. Editions are pointers — they represent a specific build’s content served at an edition-specific URL path (e.g., /v/main/, /v/DM-12345/, /v/2.x/).

Two concepts govern edition behavior:

  • Tracking mode: determines which builds the edition follows (the algorithm for auto-updating).

  • Edition kind: classifies the edition for dashboard display and lifecycle rule targeting.

Edition slugs#

Editions are identified by URL-safe slugs. The slug system has three layers:

  1. Reserved slugs: __main is the sole reserved slug, representing the default edition that serves at the project root (no /v/ prefix in the URL). It does not correspond to a git ref and uses double-underscore prefix to avoid collisions with any git branch or tag name.

  2. Org-configurable rewrite rules: organizations can configure pattern-based transforms from git ref → edition slug. These are an ordered list of rules where the first match wins, with three rule types: prefix_strip, regex, and ignore. For example, Rubin’s convention uses a prefix_strip rule to rewrite tickets/DM-12345 → slug DM-12345. See the Edition slug rewrite rules section for the full rule format, evaluation algorithm, and examples.

  3. Default behavior: slashes in git refs become dashes (e.g., feature/dark-mode → feature-dark-mode), keeping all edition slugs as single URL path segments.

An edition tracks a slug, not a single canonical git ref. Multiple git refs can map to the same slug through rewrite rules and all contribute builds to that edition. For example, if both tickets/DM-12345 and DM-12345 exist as branches and both rewrite to slug DM-12345, builds from either ref update the same edition. This fixes a bug in LTD Keeper where only one ref could contribute to a special-case edition.

Edition slug rewrite rules#

When a new build arrives and its git ref doesn’t match any existing edition, Docverse uses slug rewrite rules to determine how to transform the git ref into an edition slug. Rules are evaluated in order; the first match wins.

Rule types#

Three rule types cover the practical use cases:

prefix_strip — the workhorse. Matches refs starting with a literal prefix, strips it, and uses the remainder as the slug (with any remaining slashes replaced by a configurable character, default -).

{
  "type": "prefix_strip",
  "prefix": "tickets/",
  "edition_kind": "draft"
}

tickets/DM-12345 → slug DM-12345. tickets/foo/bar → slug foo-bar.

regex — for patterns that prefix stripping can’t express. Uses a Python regex with a named capture group slug to extract the edition slug. Remaining slashes in the captured group are still replaced by slash_replacement by default, protecting against invalid slugs.

{
  "type": "regex",
  "pattern": "^release/(?P<slug>v\\d+\\.\\d+)$",
  "edition_kind": "release"
}

release/v2.3 → slug v2.3 with kind release. release/experimental → no match.

ignore — suppresses edition auto-creation for matching refs. Uses glob patterns (Python fnmatch semantics with ** for recursive matching). Useful for filtering noise from dependency bot branches, CI scratch branches, and similar refs that should never produce editions.

{
  "type": "ignore",
  "glob": "dependabot/**"
}

dependabot/npm/lodash-4.17.21 → no edition created.

Rule fields#

| Field | Type | Applies to | Default | Description |
|---|---|---|---|---|
| type | enum | all | — | prefix_strip, regex, or ignore |
| edition_kind | str | prefix_strip, regex | "draft" | Kind assigned to auto-created editions |
| prefix | str | prefix_strip | — | Literal prefix to match and strip |
| pattern | str | regex | — | Python regex with named group slug |
| glob | str | ignore | — | Glob pattern for ref matching |
| slash_replacement | str | prefix_strip, regex | "-" | Character replacing remaining slashes in the extracted slug. Must be one of -, _, . |

Storage and scoping#

Rules are stored as a JSONB array on the Organization and Project tables:

  • Organization.slug_rewrite_rules (JSONB): ordered rule list applied to all projects in the org.

  • Project.slug_rewrite_rules (JSONB, nullable): when set, completely replaces the org-level rules for that project. No merging or inheritance — if a project needs one different rule, it copies the org rules and modifies. This avoids the complexity of ordered-list merge semantics.

Setting the project-level rules to null (or omitting the field) restores inheritance from the org.

Evaluation algorithm#

1. rules = project.slug_rewrite_rules ?? org.slug_rewrite_rules ?? []
2. For each rule in order:
   a. If type=ignore and glob matches git_ref → return None (suppress)
   b. If type=prefix_strip and git_ref starts with prefix →
      remainder = git_ref[len(prefix):]
      slug = remainder.replace("/", rule.slash_replacement)
      return (slug, rule.edition_kind)
   c. If type=regex and pattern matches git_ref →
      slug = match.group("slug")
      slug = slug.replace("/", rule.slash_replacement)
      return (slug, rule.edition_kind)
3. No rule matched (default fallback):
   slug = git_ref.replace("/", "-")
   return (slug, "draft")

The default fallback (step 3) always applies, so every non-ignored ref produces a valid slug even with zero rules configured. This preserves LTD Keeper’s existing behavior for orgs that don’t configure any rewrite rules.

Slug validation#

After a slug is produced (by rule or default), it is validated:

  • Must be non-empty.

  • Must contain only URL-safe characters: lowercase alphanumeric, hyphens, underscores, dots. Uppercase characters are lowercased.

  • Must not start with __ (reserved prefix for system slugs like __main).

  • Must not exceed 128 characters.

If validation fails, the build is processed but no edition is auto-created. The build record’s status reflects that slug generation failed, and the issue is logged for operator attention.

Example: Rubin Observatory configuration#

[
  { "type": "ignore", "glob": "dependabot/**" },
  { "type": "ignore", "glob": "renovate/**" },
  { "type": "prefix_strip", "prefix": "tickets/", "edition_kind": "draft" },
  {
    "type": "regex",
    "pattern": "^v?(?P<slug>\\d+\\.\\d+\\.\\d+)$",
    "edition_kind": "release"
  }
]

Evaluation for various git refs:

| Git ref | Matched rule | Slug | Kind |
|---|---|---|---|
| dependabot/npm/lodash-4.17.21 | ignore (index 0) | — (suppressed) | — |
| renovate/typescript-5.x | ignore (index 1) | — (suppressed) | — |
| tickets/DM-12345 | prefix_strip (index 2) | DM-12345 | draft |
| tickets/DM-99999 | prefix_strip (index 2) | DM-99999 | draft |
| v2.3.0 | regex (index 3) | 2.3.0 | release |
| 2.3.0 | regex (index 3) | 2.3.0 | release |
| feature/dark-mode | default fallback | feature-dark-mode | draft |
| main | default fallback | main | draft |

Note: for main, the default fallback produces slug main with kind draft, but in practice the __main edition already exists and matches via its tracking mode, so auto-creation is not triggered.

Dry-run endpoint#

A preview endpoint allows org admins to test their rewrite rules against a git ref without creating any resources:

POST /orgs/:org/slug-preview                         → preview slug resolution (admin)

Request:

{
  "git_ref": "tickets/DM-12345",
  "project": "pipelines"
}

The optional project field causes the endpoint to use project-level rule overrides if they exist. Without it, the org-level rules are used.

Response:

{
  "git_ref": "tickets/DM-12345",
  "edition_slug": "DM-12345",
  "edition_kind": "draft",
  "matched_rule": {
    "type": "prefix_strip",
    "prefix": "tickets/",
    "index": 0
  },
  "rule_source": "org"
}

For an ignored ref:

{
  "git_ref": "dependabot/npm/lodash-4.17.21",
  "edition_slug": null,
  "edition_kind": null,
  "matched_rule": {
    "type": "ignore",
    "glob": "dependabot/**",
    "index": 0
  },
  "rule_source": "org"
}

For a ref that hits the default fallback:

{
  "git_ref": "feature/dark-mode",
  "edition_slug": "feature-dark-mode",
  "edition_kind": "draft",
  "matched_rule": null,
  "rule_source": "default"
}

The rule_source field indicates whether the rules came from "org", "project", or "default" (no rules configured, using built-in fallback).

Compound slug derivation for alternate-scoped builds#

When a build includes an alternate_name (see REST API design), slug derivation adds a scoping prefix to keep alternate-specific editions separate from generic ones:

  1. Apply normal slug rewrite rules to git_ref → base slug + edition kind.

  2. Prepend the alternate_name with a -- separator: {alternate_name}--{base_slug}.

  3. Set the edition’s tracking mode to alternate_git_ref with tracking_params: {"git_ref": "<git_ref>", "alternate_name": "<alternate_name>"}.

Example: a build with git_ref: "tickets/DM-12345" and alternate_name: "usdf-dev":

  • Rewrite rules produce base slug DM-12345, kind draft.

  • Final slug: usdf-dev--DM-12345.

  • Tracking mode: alternate_git_ref, tracking params: {"git_ref": "tickets/DM-12345", "alternate_name": "usdf-dev"}.

For builds without alternate_name, slug derivation is unchanged.

The -- separator is chosen because, while double hyphens are legal in git branch names, they are uncommon in practice, making the separator a reliable delimiter between the alternate name and the base slug. Slug validation still applies to the full compound slug.

Tracking modes#

The full set of tracking modes in Docverse:

| Mode | Behavior | Carried from |
|---|---|---|
| git_ref | Track a specific branch or tag | LTD Keeper v1 (was git_refs) |
| lsst_doc | Track latest LSST document version tag (vMajor.Minor) | LTD Keeper v1 |
| eups_major_release | Track latest EUPS major release tag | LTD Keeper v1 |
| eups_weekly_release | Track latest EUPS weekly release tag | LTD Keeper v1 |
| eups_daily_release | Track latest EUPS daily release tag | LTD Keeper v1 |
| semver_release | Track the latest semver release, excluding pre-releases (alpha, beta, rc) | New |
| semver_major | Track the latest release within a major version stream (e.g., latest 2.x.x). Parameterized by major_version. | New |
| semver_minor | Track the latest release within a minor version stream (e.g., latest 2.3.x). Parameterized by major_version, minor_version. | New |
| alternate_git_ref | Track a specific branch or tag scoped to an alternate name. Parameterized by git_ref and alternate_name. | New |

Semver tracking supports tags both with and without a v prefix (e.g., v2.1.0 and 2.1.0).

The alternate_git_ref tracking mode#

The alternate_git_ref mode is the dedicated tracking mode for deployment-scoped editions. It matches builds by both a specific git_ref and an alternate_name, parameterized via tracking_params. For example, edition usdf-dev--DM-12345 uses alternate_git_ref with tracking_params: {"git_ref": "tickets/DM-12345", "alternate_name": "usdf-dev"} — it updates only when a build arrives carrying that exact git ref and alternate name pair.

Builds with alternate_name set are invisible to editions that do not use alternate_git_ref (or otherwise filter on alternate_name). This prevents a deployment-specific build from accidentally updating __main or a generic draft edition. Conversely, builds without alternate_name are invisible to alternate_git_ref editions. The alternate name acts as a namespace partition within a project’s build stream.

Edition kinds#

Edition kinds classify editions for display and lifecycle purposes:

| Kind | Description | Typical tracking mode |
|---|---|---|
| main | The default edition | git_ref tracking main/master |
| release | A stable release | semver_release, lsst_doc |
| draft | A draft/development edition | git_ref tracking a feature branch |
| major | Tracks a major version stream | semver_major |
| minor | Tracks a minor version stream | semver_minor |
| alternate | An alternative product variant or deployment | alternate_git_ref |

When an edition is auto-created, its kind is assigned based on the tracking mode. The kind does not constrain tracking behavior — it provides context for dashboards and lifecycle rules.

alternate editions are exempt from draft_inactivity lifecycle rules by default — they represent long-lived deployment targets, not transient branches. Alternate editions can be created manually via the API, or auto-created when builds carry an alternate_name (see below). Slug rewrite rules can also assign edition_kind: "alternate" to control the kind assigned during auto-creation.

Auto-creation of editions#

When a new build arrives and matches no existing edition’s tracking criteria, Docverse can auto-create editions:

  • git_ref editions: auto-created for new branches/tags (as in LTD Keeper). Classified as draft kind by default.

  • semver_major editions: auto-created when a build introduces a new major version stream (e.g., first v3.x.x tag). Classified as major kind.

  • semver_minor editions: auto-created when a build introduces a new minor version stream (e.g., first v3.1.x tag). Classified as minor kind.

  • alternate_git_ref editions: auto-created when a build carries an alternate_name and its compound slug ({alternate_name}--{base_slug}) does not match an existing edition. The edition is created with alternate_git_ref tracking mode and tracking_params containing both the git_ref and alternate_name. The edition kind comes from slug rewrite rules applied to the base git ref (typically draft for ticket branches).

    Note

    Because auto-creation derives the edition kind from slug rewrite rules, an alternate edition tracking the default branch (e.g., usdf-dev--main) gets kind draft via the default fallback — not kind alternate. This means it would be excluded from the version switcher (which includes kind alternate but not draft) and would be subject to draft_inactivity lifecycle cleanup. To avoid this, pre-create long-lived deployment editions with kind alternate via the POST /orgs/:org/projects/:project/editions endpoint before the first build arrives. See the REST API design section for the edition creation request body.

Auto-creation for semver_major, semver_minor, and alternate_git_ref modes can be disabled at both the org and project level.

Build annotations#

The upload client can optionally annotate a build with metadata about its nature (e.g., “this is a release”, “this is a draft/PR build”). This supplements the pattern-based classification from tracking rules. Two complementary mechanisms:

  • Client annotations: optional metadata on the build record provided at upload time.

  • Project/org pattern rules: configurable rules that classify builds based on git ref patterns (e.g., “tags matching v*.*.* are releases”, “branches matching tickets/* are drafts”).

The tracking system uses both signals, with pattern-based rules as the primary classifier.

Edition-build history#

Docverse maintains an explicit log of every build that an edition has pointed to, stored in an EditionBuildHistory table with the edition ID, build ID, timestamp, and ordering position. This replaces the implicit relationship in LTD Keeper (where you’d have to reconstruct history from build timestamps). The history enables:

  • Rollback API: an org admin can roll an edition back to any previous build in its history with a single API call (PATCH the edition with a build field pointing to the desired build).

  • Orphan build detection: lifecycle rules can reference history position (e.g., “a build that is 5+ versions back and older than 30 days is an orphan”).

See EditionBuildHistory for the column definition.

Edition update strategy#

When an edition is updated to point to a new build, the strategy depends on the CDN’s declared capabilities — pointer mode or copy mode.

Pointer mode (instant switchover)#

For CDNs with edge compute and an edge data store (see CDN provider comparison), the edition is a metadata pointer and no object copying is required. The update process:

  1. Write new mapping: update the edition→build mapping in the edge data store (e.g., Cloudflare Workers KV, Fastly KV Store).

  2. Purge CDN cache: invalidate cached content for the edition so new requests resolve the updated mapping.

  3. Update database: record the new edition→build association and log to EditionBuildHistory.

The edge Worker/Compute function intercepts each request, looks up the current build for the requested edition in the edge data store, and fetches the content directly from the build’s object store prefix. Edition updates are effectively instant — a KV write + cache purge, no bulk object operations.

Copy mode (ordered in-place update)#

For CDNs without edge compute, Docverse performs an ordered in-place update, an improvement over the LTD Keeper approach of delete-then-copy, which caused temporary 404s:

  1. Diff: compare the new build’s object inventory against the edition’s current inventory using the Postgres object tables.

  2. Copy new/changed assets first: images, CSS, JS — so that HTML pages referencing new assets won’t break.

  3. Copy HTML pages: starting from the deepest directory levels, working up to the homepage last. This minimizes the window where a user might see a partially-updated site.

  4. Move orphaned objects to purgatory: objects in the edition that don’t exist in the new build are moved to a purgatory key prefix rather than deleted immediately.

  5. Purge CDN cache: invalidate cached content for the edition.

This ordering ensures that at no point during the update will a user get a 404 for content that should exist.
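The ordering in steps 2–3 can be expressed as a sort key. This is a sketch under the assumption that object keys are bucket-relative paths and that anything other than HTML counts as an asset:

```python
def copy_order_key(key: str) -> tuple[int, int]:
    """Sort key for copy-mode updates.

    Non-HTML assets copy first (group 0); HTML copies afterward (group 1),
    deepest directories first, so the shallowest pages land last.
    """
    is_html = key.endswith((".html", ".htm"))
    depth = key.count("/")
    return (1, -depth) if is_html else (0, 0)


def order_for_copy(keys: list[str]) -> list[str]:
    """Order object keys so no page references a not-yet-copied asset."""
    return sorted(keys, key=copy_order_key)
```

The homepage (the shallowest HTML file) is copied last, minimizing the window in which a user could land on a partially updated site.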

Mode selection#

The edition update service checks the CDN’s declared supports_pointer_mode capability at runtime to select the appropriate code path. This keeps the service logic clean and makes it straightforward to add new CDN providers without modifying orchestration logic. Orgs using Cloudflare Workers + R2 or Fastly Compute get instant switchovers; orgs using Google Cloud CDN or other providers without edge compute get the ordered copy strategy.

Deletion and lifecycle rules#

Soft delete and purgatory#

Docverse uses a two-layer soft delete approach:

  • Database: objects are soft-deleted (marked with a date_ended or date_deleted timestamp) rather than immediately removed.

  • Object store: files are moved to a purgatory key prefix rather than deleted. A background job hard-deletes purgatory objects after a configurable retention period.

This provides reversibility at both layers. The purgatory timeout is configurable at the org level with per-project overrides.

Lifecycle rules#

Lifecycle rules are stored as JSONB in Postgres — a list of rule objects, each with a type discriminator and type-specific parameters. Rules are configured at the org level as defaults and can be overridden per-project. Individual editions can be marked as exempt from all lifecycle rules (a “never delete” flag).

Rule types:

| Rule type | Parameters | Behavior |
|---|---|---|
| draft_inactivity | max_days_inactive | Delete draft editions with no new builds for N days |
| ref_deleted | enabled | Delete editions whose tracked Git ref no longer exists on GitHub |
| build_history_orphan | min_position, min_age_days | Delete builds that are N+ positions back in an edition’s history and older than M days |

Example rule configuration (JSONB):

[
  { "type": "draft_inactivity", "max_days_inactive": 30 },
  { "type": "ref_deleted", "enabled": true },
  { "type": "build_history_orphan", "min_position": 5, "min_age_days": 30 }
]

Different edition kinds can have different default lifecycle rules. For example, draft editions might default to draft_inactivity with 30 days, while release editions have no auto-deletion rules.

GitHub event-driven deletion#

Edition deletion triggered by Git ref deletion is event-driven via GitHub webhooks. Docverse supports two models for receiving GitHub events:

  • Direct webhooks: Docverse receives webhook events directly as a GitHub App, using the Safir GitHub App framework.

  • Kafka via Squarebot: Squarebot acts as a GitHub App events gateway, receiving webhooks and republishing them internally via Kafka. Docverse consumes events from Kafka. This allows Docverse to share a GitHub App installation with other internal tools.

Both models are supported; the deployment can use either or both.

A periodic audit job supplements the event-driven approach by verifying that Git refs referenced by Docverse editions and projects still exist on GitHub. This catches cases where webhook delivery failed or events were missed.

Documentation hosting#

Like LTD before it, Docverse decouples documentation hosting from the API service. The documentation host uses a simple and reliable cloud-based CDN and object store so that documentation is served with low latency and high availability around the world, even if the API service itself is down. In LTD, we used Fastly as the CDN and AWS S3 for object storage. With Docverse, we wish to support multiple hosting stacks, but also to add a new hosting architecture using Cloudflare Workers + R2 that eliminates the S3 copy-on-publish bottleneck and reduces costs by an order of magnitude.

The S3 copy-on-publish bottleneck#

The original LSST the Docs platform serves documentation projects as wildcard subdomains (pipelines.lsst.io, dmtn-139.lsst.io) through Fastly’s CDN. LTD Keeper, a Flask-based REST API, manages three core entities: products (documentation projects), builds (individual CI-produced documentation snapshots), and editions (named pointers like main or v1.0 that track specific builds). Each entity has a corresponding path prefix in a shared S3 bucket.

Fastly VCL intercepts every request and performs regex-based URL rewriting to map the requested URL to an S3 object path. For example, pipelines.lsst.io/v/main/page.html becomes s3://bucket/pipelines/editions/main/page.html. This is elegant but rigid: it can only serve what’s physically at the edition’s S3 path. So when the main edition is updated from build b41 to b42, LTD Keeper must copy every object from pipelines/builds/b42/ to pipelines/editions/main/. The system then purges Fastly’s cache using surrogate keys — which works well at ~150ms global propagation — but the S3 copy itself can take minutes for large documentation sets.

This copy-on-publish approach has the fundamental issue that edition updates are slow and can also be inconsistent for users during the copy window. The original LTD implementation had the additional bug that an edition’s objects would be deleted before the new build’s objects were copied into place, causing 404s for pages that weren’t in the Fastly cache.

The proposed solution replaces this with edge-side dynamic resolution: the CDN intercepts the request, extracts the project name and edition from the URL, consults an edge data store to determine which build the edition currently points to, and fetches the correct object directly from the build’s storage path. Edition updates become a metadata change (updating a key-value mapping) rather than a bulk data operation.

The Cloudflare stack#

Cloudflare can provide this edge-side dynamic edition resolution through a combination of its Workers edge compute platform, R2 object storage, and Workers KV key-value store. Surprisingly, this platform is also substantially more cost-effective than the current Fastly + S3 architecture, even at large scale, due to R2’s zero egress fees, free wildcard TLS certificates, and Workers’ efficient edge execution model.

Architecture overview#

With Cloudflare, a request is handled at the edge by a Worker script that parses the URL to determine the project and edition. With Workers KV, the Worker looks up which build the edition currently points to, constructs the R2 key for the requested object, and fetches it directly from R2 using the native R2 bindings. The Worker then returns the object with appropriate caching headers:

graph TD
    DNS["*.lsst.io<br/>(wildcard DNS → Cloudflare)"]
    DNS --> W1

    subgraph Worker["Cloudflare Worker"]
        W1["1. Parse Host header<br/>→ extract project"]
        W2["2. Parse URL path<br/>→ extract edition"]
        W3["3. KV lookup<br/>edition → build ID"]
        W4["4. Construct R2 key<br/>{project}/builds/{build}/{path}"]
        W5["5. Fetch from R2"]
        W6["6. Return with cache headers"]
        W1 --> W2 --> W3 --> W4 --> W5 --> W6
    end

    W3 -.-> KV[("Workers KV<br/>edition→build mappings")]
    W5 -.-> R2[("R2 Bucket<br/>Stores all builds once")]
    API["Docverse API"] -->|REST API writes| KV
    

With this architecture, builds are only stored in R2 in one place ({project}/builds/{build_id}/), and editions are just pointers to builds in Workers KV. When an edition is updated, only the KV mapping changes — an instant metadata operation instead of a bulk S3 copy.

Worker request flow#

The Worker implementation is approximately 100–200 lines of TypeScript. The request handling flow:

  1. Parse Host header to extract the project name from the subdomain (e.g., pipelines.lsst.io → pipelines).

  2. Parse URL path to determine the edition and file path:

    • /v/{edition}/page.html → named edition

    • /page.html or / → default edition (configurable, typically main)

    • /builds/{build_id}/page.html → direct build access (bypasses KV lookup)

  3. KV lookup: read the edition→build mapping from Workers KV using key {project}/{edition}. KV reads are cached at the edge with a cacheTtl (e.g., 30 seconds) to reduce KV costs and latency on hot paths.

  4. Construct R2 key: {project}/builds/{build_id}/{file_path}.

  5. Fetch from R2 via the native R2 binding (no HTTP overhead).

  6. Return response with appropriate Content-Type, Cache-Control, and Cache-Tag headers.

The Worker also handles routing to the project dashboard page (and other Docverse metadata files) and returns appropriate 404 responses. See dashboard templating system for how dashboard pages are served.
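The request routing in steps 1–4 can be sketched as follows. This is written in Python for illustration (the production Worker is TypeScript), and the returned dict shape is an assumption:

```python
from urllib.parse import urlsplit


def resolve_request(url: str, default_edition: str = "__main") -> dict:
    """Sketch of the Worker's routing steps 1-4.

    kv_key is the Workers KV key to consult for the edition's current
    build; it is None for direct build access, which bypasses KV.
    """
    parts = urlsplit(url)
    project = parts.hostname.split(".")[0]  # pipelines.lsst.io -> pipelines
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) >= 2 and segments[0] == "v":
        # /v/{edition}/... -> named edition
        kv_key, build_id, rest = f"{project}/{segments[1]}", None, segments[2:]
    elif len(segments) >= 2 and segments[0] == "builds":
        # /builds/{build_id}/... -> direct build access (no KV lookup)
        kv_key, build_id, rest = None, segments[1], segments[2:]
    else:
        # Bare paths serve the default edition
        kv_key, build_id, rest = f"{project}/{default_edition}", None, segments
    path = "/".join(rest) or "index.html"
    return {"project": project, "kv_key": kv_key, "build_id": build_id, "path": path}
```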

Cloudflare configuration#

The Worker is configured via a wrangler.toml file:

name = "lsst-io-router"
main = "src/index.ts"
compatibility_date = "2025-01-01"

# Route: catch all subdomains of lsst.io
routes = [
  { pattern = "*.lsst.io/*", zone_name = "lsst.io" }
]

# KV Namespace for edition → build mappings
[[kv_namespaces]]
binding = "EDITIONS"
id = "<KV_NAMESPACE_ID>"
preview_id = "<KV_PREVIEW_NAMESPACE_ID>"

# R2 Bucket for documentation builds
[[r2_buckets]]
binding = "DOCS_BUCKET"
bucket_name = "lsst-io-docs"

# Environment variables
[vars]
DEFAULT_EDITION = "__main"

DNS setup: a proxied wildcard CNAME record (*.lsst.io) combined with an HTTP route pattern (*.lsst.io/*) directs all subdomain traffic through a single Worker. Cloudflare’s Universal SSL automatically provisions and renews a wildcard certificate for *.lsst.io at no additional cost. The underlying A record content is irrelevant (it is never reached) because the Worker intercepts all requests.

Infrastructure provisioning: the KV namespace and R2 bucket are created via the Wrangler CLI:

# Create the KV namespace
npx wrangler kv namespace create "EDITIONS"
npx wrangler kv namespace create "EDITIONS" --preview

# Create the R2 bucket
npx wrangler r2 bucket create lsst-io-docs

Two-tier caching strategy#

The caching architecture uses two layers:

  • Workers KV as the global source of truth for edition→build mappings, with an API for external writes that Docverse calls when an edition is updated. KV propagates updates globally within approximately 60 seconds — acceptable for documentation that doesn’t require sub-second freshness.

  • Per-PoP Cache API as a hot local cache in front of KV reads (cacheTtl on KV read operations), eliminating KV costs and latency for frequently accessed mappings.

For the content itself, the Worker sets Cache-Control headers to layer browser and edge caching:

  • Browser cache: short TTL (5 minutes, max-age=300) so users see updates relatively quickly.

  • Edge cache: longer TTL (1 hour, s-maxage=3600) to reduce R2 read operations.

R2’s built-in Tiered Read Cache automatically caches hot objects closer to users, providing an additional optimization layer.

Cache invalidation#

When an edition is re-pointed to a new build, two things happen:

  1. KV is updated — the new edition→build mapping takes effect within ~60 seconds globally.

  2. Edge cache is purged — so users don’t keep seeing the old build for the duration of the s-maxage.

Three cache purge strategies are available, depending on the Cloudflare plan:

  • Purge by hostname (all plans): purge all cached content for a subdomain (e.g., pipelines.lsst.io). Slightly broader than necessary but fast and simple.

  • Purge by prefix (all plans): purge by URL prefix (e.g., pipelines.lsst.io/v/v23.0/) for more targeted invalidation.

  • Purge by Cache-Tag (Enterprise plan): surgical invalidation using Cache-Tag headers set by the Worker (e.g., edition:pipelines/main).

The complete “re-point edition” operation — a KV write plus a cache purge — replaces the S3 directory copy plus Fastly surrogate-key purge in the current LTD architecture. Total time: under 2 seconds, versus minutes for the S3 copy approach.

Wildcard subdomains and SSL#

Wildcard subdomains work on all Cloudflare plans, including the free tier. A proxied wildcard DNS record (*.lsst.io) combined with an HTTP route pattern (*.lsst.io/*) directs all subdomain traffic through a single Worker. Cloudflare’s Universal SSL automatically provisions and renews a wildcard certificate at no additional cost.

KV management#

The KV namespace is the source of truth for which build each edition points to. Docverse writes to it via the Cloudflare REST API whenever an edition is created, updated, or deleted.

Key format: {project}/{edition} (e.g., pipelines/main)

Value format: JSON with build metadata:

{
  "build_id": "b42",
  "updated_at": "2025-06-15T10:30:00Z",
  "git_ref": "main",
  "title": "Latest (main)"
}

Single edition write — the API call Docverse makes on every edition update:

curl -X PUT \
  "https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/storage/kv/namespaces/${KV_NAMESPACE_ID}/values/pipelines%2Fmain" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" \
  -H "Content-Type: application/json" \
  --data-raw '{"build_id":"b42","updated_at":"2025-06-15T10:30:00Z","git_ref":"main","title":"Latest (main)"}'

Bulk write — for migration or seeding all edition mappings at once (supports up to 10,000 key-value pairs per call):

curl -X PUT \
  "https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/storage/kv/namespaces/${KV_NAMESPACE_ID}/bulk" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" \
  -H "Content-Type: application/json" \
  --data-raw '[
    {
      "key": "pipelines/main",
      "value": "{\"build_id\":\"b42\",\"updated_at\":\"2025-06-15T10:30:00Z\",\"git_ref\":\"main\"}"
    },
    {
      "key": "pipelines/v23.0",
      "value": "{\"build_id\":\"b38\",\"updated_at\":\"2025-05-01T00:00:00Z\",\"git_ref\":\"v23.0\"}"
    }
  ]'

Cost model#

R2’s zero egress fees are the defining cost feature. At 10 million requests per month, the estimated total cost is approximately $6/month. Even at 100 million requests per month, the cost rises to roughly $110/month — still dramatically less than equivalent AWS infrastructure.

| Component               | 10M req/mo               | 100M req/mo |
| ----------------------- | ------------------------ | ----------- |
| Workers base + requests | $5.00                    | $32.00      |
| Workers KV reads        | $0.00 (within free tier) | $45.00      |
| Workers KV writes       | ~$0.05                   | ~$0.05      |
| R2 storage (50 GB)      | $0.60                    | $0.60       |
| R2 Class B ops (reads)  | $0.00 (within free tier) | $32.40      |
| Bandwidth (egress)      | $0.00                    | $0.00       |
| Total                   | ~$6                      | ~$110       |

Other hosting options#

Through configuration, Docverse can support multiple hosting stacks, allowing different organizations to choose their preferred CDN provider and architecture. We can certainly continue to support the existing Fastly VCL and S3 architecture that we have used thus far. This section surveys the other hosting options evaluated: Fastly Compute, CloudFront + Lambda@Edge, and Google Cloud CDN.

Fastly Compute#

Since LSST the Docs already runs on Fastly, migrating from VCL to Fastly Compute avoids changing CDN providers entirely. Compute replaces VCL’s domain-specific language with WebAssembly modules compiled from Rust, JavaScript, or Go, while retaining the same cache infrastructure and Fastly’s ~150ms global surrogate-key purge — the fastest in the industry.

Key characteristics:

  • Dynamic Backends (GA since April 2023) allow the Wasm module to construct S3 URLs at runtime rather than pre-configuring every possible origin.

  • The Fastly KV Store provides an edge-local key-value store for edition→build mappings, readable from any PoP with low latency.

  • Wasm cold starts are 35 microseconds, matching VCL’s near-instant request processing.

  • Existing surrogate key infrastructure carries over directly, preserving the current cache purging workflow.

One structural constraint: VCL and Compute cannot coexist on the same Fastly service. Fastly recommends service chaining during transition — placing a Compute service in front of the existing VCL service, then gradually moving logic to Compute until the VCL service can be retired.

Cost: Fastly requires contacting sales for Compute pricing, with a $50/month minimum for paid accounts. Bandwidth starts at $0.12/GB in North America (versus Cloudflare’s $0.00/GB for R2 egress), making it roughly 5–30x more expensive than Cloudflare Workers + R2 at equivalent scale.

CloudFront + Lambda@Edge#

CloudFront + Lambda@Edge supports dynamic routing through Lambda functions on the origin-request trigger (cache-miss only). The CloudFront KeyValueStore (introduced in late 2024) offers a hybrid approach with sub-millisecond reads from a 5 MB key-value store accessible from CloudFront Functions. However, Lambda@Edge functions must be deployed in us-east-1, logs are scattered across regional CloudWatch instances, cold starts range from 100–520ms, and cache invalidation takes ~2 minutes. At approximately $443/month for 10 million requests, it is the most expensive option evaluated.

Google Cloud CDN#

Google Cloud CDN provides no edge compute capability, restricting it to copy mode only.

CDN provider comparison#

| Criterion             | Cloudflare Workers + R2 | Fastly Compute                  | CloudFront + Lambda@Edge                  |
| --------------------- | ----------------------- | ------------------------------- | ----------------------------------------- |
| Edge API calls        | ✅ fetch(), 1000/req     | ✅ Dynamic backends              | ✅ Lambda@Edge only                        |
| Wildcard subdomains   | ✅ All plans, free SSL   | ✅ Paid, wildcard cert           | ✅ ACM wildcard                            |
| Edge data store       | KV (~60s propagation)   | KV Store (eventually consistent) | KeyValueStore (5 MB max)                  |
| Cache purge speed     | Seconds (URL purge)     | ~150ms (surrogate key)          | ~2 minutes                                |
| Cold starts           | None (V8 isolates)      | None (35μs Wasm)                | 100–520ms (Lambda)                        |
| Global PoPs           | 330+                    | 80+                             | 600+ (but Lambda runs at ~13 regional edges) |
| Egress cost           | $0.00                   | $0.12/GB                        | $0.085/GB                                 |
| Monthly cost (10M req) | ~$6                    | ~$50–200                        | ~$443                                     |
| Implementation effort | Low (~100–200 LOC TS)   | Medium (~200–400 LOC)           | High (multi-service orchestration)        |
| Migration friction    | CDN provider change     | Same provider, new runtime      | CDN provider change                       |

Dashboard templating system#

Overview#

Docverse renders dashboard pages (/v/index.html) and custom 404 pages server-side using Jinja templates, then uploads the rendered HTML to the object store as self-contained files with all assets inlined. This subsumes the role of LTD Dasher and gives organizations full control over the look and feel of their documentation portals.

Three tiers of templates, resolved in priority order:

  1. Project-level override: a project can specify its own template source, allowing it to host its template in its own documentation repo or a dedicated repo.

  2. Org-level template: each organization can configure a GitHub repo containing templates and assets shared across all projects in the org.

  3. Docverse built-in default: a minimal, functional template bundled with the Docverse application, used when no org or project template is configured.

Template source configuration#

A template source is defined by four fields:

| Field        | Type | Default                | Description                                   |
| ------------ | ---- | ---------------------- | --------------------------------------------- |
| github_owner | str  | —                      | GitHub organization or user                   |
| github_repo  | str  | —                      | Repository name                               |
| path         | str  | "" (repo root)         | Path within the repo to the template directory |
| git_ref      | str  | repo’s default branch  | Branch or tag to track                        |

This structure is flexible enough to support several layouts:

  • Dedicated template repo: lsst-sqre/docverse-templates, path "", ref main — an org maintains a standalone repo for their templates.

  • Monorepo with multiple templates: lsst-sqre/docverse-templates, path rubin/ — multiple orgs share a repo but use different subdirectories.

  • Template in a project repo: lsst/pipelines_lsst_io, path .docverse/template/ — a project hosts its own template alongside its documentation source.

The same four fields appear on both the org-level and project-level template configuration. When a project has its own template source, it completely replaces the org template (no inheritance/merging of individual files).

Template directory structure#

At the root of the template directory (as specified by path), a template.toml file declares the template’s contents:

[dashboard]
template = "dashboard.html.jinja"

[dashboard.assets]
css = ["style.css"]
js = ["filter.js"]
images = ["logo.svg", "favicon.png"]

[error_404]
template = "404.html.jinja"

[error_404.assets]
css = ["style.css"]
images = ["logo.svg"]

A minimal template directory might look like:

template.toml
dashboard.html.jinja
404.html.jinja          ← optional
style.css
logo.svg
filter.js

The [error_404] section is optional. If omitted, Docverse uses a built-in default 404 page. The Cloudflare Worker (or equivalent edge function) serves the 404 page when no object matches the requested path within a project.

Asset inlining#

All assets referenced in template.toml are inlined into the rendered HTML at render time, producing a single self-contained HTML file per page type (dashboard and 404). This eliminates the need for relative asset paths, simplifies CDN cache management (one URL to purge per page), and avoids asset/HTML version skew during updates.

The inlining strategy by file type:

  • CSS (.css): inlined into <style> tags. Multiple CSS files are concatenated.

  • JavaScript (.js): inlined into <script> tags. Multiple JS files are concatenated in declared order.

  • SVG (.svg): inlined as raw SVG markup (preserving DOM interactivity and CSS styling).

  • Raster images (.png, .jpg, .gif, .webp): base64-encoded as data URIs.
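A minimal sketch of the per-file inlining and the filename-to-key conversion (function names are illustrative, not the actual implementation):

```python
import base64

# MIME types for the raster formats the renderer inlines as data URIs.
RASTER_MIME = {"png": "image/png", "jpg": "image/jpeg", "jpeg": "image/jpeg",
               "gif": "image/gif", "webp": "image/webp"}

def inline_image(filename: str, data: bytes) -> str:
    """Inline one image asset according to its file type."""
    if filename.endswith(".svg"):
        # Raw SVG markup stays scriptable and CSS-stylable in the page DOM.
        return data.decode("utf-8")
    ext = filename.rsplit(".", 1)[-1].lower()
    encoded = base64.b64encode(data).decode("ascii")
    # Raster formats become data URIs usable directly in <img src="...">.
    return f"data:{RASTER_MIME[ext]};base64,{encoded}"

def image_key(filename: str) -> str:
    """Convert 'favicon.png' to 'favicon_png' for template-friendly access."""
    return filename.replace(".", "_").replace("-", "_")
```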

The renderer makes inlined assets available to the Jinja template as context variables. CSS and JS are provided as concatenated strings; images are provided as a dict keyed by filename (with dots and hyphens converted to underscores for template-friendly access):

<head>
  <style>{{ assets.css }}</style>
</head>
<body>
  <header>{{ assets.images.logo_svg }}</header>
  <!-- ... -->
  <script>{{ assets.js }}</script>
</body>

For a typical dashboard page (CSS + a small JS filter script + an SVG logo + a favicon), the inlined HTML should be 30–80KB — well within reason for a metadata page that users visit occasionally.

Jinja template context#

The template receives a maximalist context with all available project and edition metadata. Since rendering happens in a background job (not on the request path), there is no performance concern with assembling a rich context.

Context structure#

@dataclass
class DashboardContext:
    org: OrgContext
    project: ProjectContext
    editions: EditionsContext
    assets: AssetsContext
    docverse: DocverseContext
    rendered_at: datetime

@dataclass
class OrgContext:
    slug: str
    title: str
    base_domain: str
    url_scheme: str                # "subdomain" or "path_prefix"
    published_base_url: str        # e.g. "https://lsst.io"

@dataclass
class ProjectContext:
    slug: str
    title: str
    doc_repo: str                  # GitHub repo URL
    published_url: str             # root URL, e.g. "https://pipelines.lsst.io"
    surrogate_key: str
    date_created: datetime

@dataclass
class EditionsContext:
    all: list[EditionContext]
    main: EditionContext | None     # the __main edition
    releases: list[EditionContext]  # kind=release, sorted semver descending
    drafts: list[EditionContext]    # kind=draft, sorted date_updated descending
    major: list[EditionContext]     # kind=major, sorted version descending
    minor: list[EditionContext]     # kind=minor, sorted version descending
    alternates: list[EditionContext]  # kind=alternate, sorted by title ascending

@dataclass
class EditionContext:
    slug: str
    title: str
    kind: str
    tracking_mode: str
    tracking_params: dict           # e.g. {"major_version": 2}
    published_url: str
    build: BuildContext | None
    date_created: datetime
    date_updated: datetime
    lifecycle_exempt: bool
    alternate_name: str | None      # deployment/variant name, if scoped

@dataclass
class BuildContext:
    id: str                         # Crockford Base32
    git_ref: str
    annotations: dict               # client-provided metadata
    uploader: str
    object_count: int
    total_size_bytes: int
    date_created: datetime
    date_uploaded: datetime
    alternate_name: str | None      # deployment/variant name, if any

@dataclass
class AssetsContext:
    css: str                        # concatenated CSS
    js: str                         # concatenated JS
    images: dict[str, str]          # filename_with_underscores → inlined content

@dataclass
class DocverseContext:
    api_url: str
    version: str

Pre-grouped editions#

The EditionsContext provides both the flat all list and pre-grouped convenience lists so template authors can iterate by category without writing filter logic. Each group is pre-sorted in the most natural order for display:

  • releases: semver descending (newest release first)

  • drafts: date_updated descending (most recently active first)

  • major/minor: version descending

  • alternates: title ascending

Custom Jinja filters#

The rendering environment registers utility filters:

  • timesince: relative time display (e.g., “3 hours ago”, “2 days ago”)

  • filesizeformat: human-readable byte sizes (e.g., “1.2 MB”)

  • isoformat: ISO 8601 datetime formatting

  • semver_sort: sort a list of editions by semantic version

Template example#

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>{{ project.title }} — Documentation Editions</title>
  <style>{{ assets.css }}</style>
</head>
<body>
  <header>
    {{ assets.images.logo_svg }}
    <h1><a href="{{ project.published_url }}">{{ project.title }}</a></h1>
  </header>

  {% if editions.main and editions.main.build %}
  <section class="current">
    <h2>Current</h2>
    <a href="{{ editions.main.published_url }}">
      Latest ({{ editions.main.build.git_ref }})
    </a>
    <span class="meta">
      Updated {{ editions.main.date_updated | timesince }}
      · {{ editions.main.build.total_size_bytes | filesizeformat }}
    </span>
  </section>
  {% endif %}

  {% if editions.releases %}
  <section class="releases">
    <h2>Releases</h2>
    {% for edition in editions.releases %}
    <div class="edition-row">
      <a href="{{ edition.published_url }}">{{ edition.slug }}</a>
      <span class="ref">{{ edition.build.git_ref }}</span>
      <span class="date">{{ edition.build.date_uploaded | isoformat }}</span>
    </div>
    {% endfor %}
  </section>
  {% endif %}

  {% if editions.alternates %}
  <section class="alternates">
    <h2>Deployments</h2>
    {% for edition in editions.alternates %}
    <div class="edition-row">
      <a href="{{ edition.published_url }}">{{ edition.title }}</a>
      <span class="date">{{ edition.date_updated | timesince }}</span>
    </div>
    {% endfor %}
  </section>
  {% endif %}

  {% if editions.drafts %}
  <section class="drafts">
    <details>
      <summary>{{ editions.drafts | length }} draft(s)</summary>
      {% for edition in editions.drafts %}
      <div class="edition-row">
        <a href="{{ edition.published_url }}">{{ edition.slug }}</a>
        <span class="date">{{ edition.date_updated | timesince }}</span>
      </div>
      {% endfor %}
    </details>
  </section>
  {% endif %}

  <footer>
    Generated by <a href="{{ docverse.api_url }}">Docverse {{ docverse.version }}</a>
    at {{ rendered_at | isoformat }}
  </footer>
  <script>{{ assets.js }}</script>
</body>
</html>

Deployment-scoped draft editions carry an alternate_name field, which templates can use for filtering:

{# Drafts for a specific deployment #}
{% for edition in editions.drafts if edition.alternate_name == "usdf-dev" %}
<div class="edition-row">
  <a href="{{ edition.published_url }}">{{ edition.slug }}</a>
  <span class="badge">usdf-dev</span>
</div>
{% endfor %}

DashboardTemplate table#

| Column       | Type                    | Description                                      |
| ------------ | ----------------------- | ------------------------------------------------ |
| id           | int                     | PK                                               |
| org_id       | FK → Organization       | Owning org                                       |
| project_id   | FK → Project (nullable) | If set, this is a project-level override         |
| github_owner | str                     | GitHub org/user                                  |
| github_repo  | str                     | Repository name                                  |
| path         | str                     | Path within repo (default "" for root)           |
| git_ref      | str                     | Branch/tag to track                              |
| store_prefix | str (nullable)          | Object store prefix for current synced files     |
| sync_id      | str (nullable)          | Current sync version identifier (timestamp-based) |
| date_synced  | datetime (nullable)     | Last successful sync                             |

Unique constraint on (org_id, project_id) — at most one template configuration per org (where project_id is null) and one per project. See DashboardTemplate in the database schema section for the column reference within the full schema.

Template sync#

When a GitHub push webhook fires on a tracked template repo/ref, Docverse enqueues a dashboard_sync job. The sync process:

  1. Match webhook to templates: query the DashboardTemplate table for rows matching the github_owner/github_repo and where the push ref matches the tracked git_ref. Check whether any changed files fall within the tracked path prefix. If no rows match, the webhook is ignored.

  2. Fetch template files: use the GitHub Contents API (via the Safir GitHub App client or a Gafaelfawr-delegated token) to download the template.toml and all referenced template/asset files from the matched directory at the pushed commit SHA.

  3. Write to object store: upload the fetched files to __templates/{org_slug}/{sync_id}/ in the org’s object store bucket, where sync_id is a timestamp-based identifier (e.g., 20260208T120000). This creates a versioned snapshot alongside any previous syncs.

  4. Update database: update the DashboardTemplate row with the new store_prefix, sync_id, and date_synced.

  5. Re-render dashboards: for an org-level template, re-render dashboards for all projects in the org (parallelized via asyncio.gather()). For a project-level template, re-render only that project’s dashboard. Each render reads the template files from the object store snapshot, assembles the Jinja context from the database, inlines assets, renders, and uploads the output HTML.

  6. Clean up previous sync: the previous sync directory in the object store is marked for purgatory cleanup after a retention period. This provides rollback capability — if a new template is broken, an operator can revert the store_prefix in the database to the previous sync_id and re-render.
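Step 5’s fan-out can be sketched as follows (`render_one` stands in for the real per-project render coroutine; names are illustrative):

```python
import asyncio
from collections.abc import Awaitable, Callable

async def rerender_org_dashboards(
    projects: list[str],
    render_one: Callable[[str], Awaitable[str]],
    max_concurrency: int = 10,
) -> list[str]:
    """Re-render every project dashboard in an org after a template sync."""
    # A semaphore bounds the fan-out so the object store and database
    # aren't hit with hundreds of concurrent renders at once.
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(project: str) -> str:
        async with sem:
            return await render_one(project)

    # gather preserves input order in its results.
    return await asyncio.gather(*(bounded(p) for p in projects))
```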

Docverse receives webhooks through two supported channels:

  • Direct GitHub App webhooks: Docverse acts as a GitHub App using the Safir GitHub App framework, receiving push events directly.

  • Kafka via Squarebot: Squarebot receives GitHub webhooks and publishes them to Kafka topics. Docverse consumes the relevant topics via FastStream. This allows sharing a GitHub App installation across multiple internal services.

Both channels feed into the same dashboard_sync job logic.

Rendered output storage#

The dashboard rendering pipeline produces several static files stored at well-known paths in the project’s object store area:

  • Dashboard: {project_slug}/__dashboard.html

  • 404 page: {project_slug}/__404.html

  • Version switcher: {project_slug}/__switcher.json

  • Edition metadata: {project_slug}/__editions/{edition_slug}.json (one per edition)

The double-underscore prefix on these filenames prevents collisions with edition slugs: the only edition slug that uses the __ prefix is the reserved __main, and __main appears in URL paths like /v/__main/, not as a filename at the project root.

The Cloudflare Worker (or equivalent edge function) maps URL paths to these files:

  • project.lsst.io/v/ or project.lsst.io/v/index.html → serves {project_slug}/__dashboard.html

  • project.lsst.io/v/switcher.json → serves {project_slug}/__switcher.json

  • project.lsst.io/v/{edition}/_docverse.json → serves {project_slug}/__editions/{edition_slug}.json

  • Any request that resolves no object → serves {project_slug}/__404.html with a 404 status code

Version switcher and edition metadata JSON#

In addition to the dashboard HTML, Docverse generates static JSON files that client-side JavaScript in the published documentation can consume. These files are generated by the same rendering pipeline as the dashboard and are re-rendered on the same triggers.

Version switcher JSON (__switcher.json)#

The pydata-sphinx-theme version switcher loads a JSON file to populate its version dropdown. Docverse generates this file at {project_slug}/__switcher.json, served at project.lsst.io/v/switcher.json.

The file follows the pydata-sphinx-theme’s expected format – a JSON array of version entries:

[
  {
    "name": "Latest (main)",
    "version": "__main",
    "url": "https://pipelines.lsst.io/",
    "preferred": true
  },
  {
    "name": "v2.3",
    "version": "v2.3",
    "url": "https://pipelines.lsst.io/v/v2.3/"
  },
  {
    "name": "v2.2",
    "version": "v2.2",
    "url": "https://pipelines.lsst.io/v/v2.2/"
  }
]

Field mapping from Docverse’s edition model:

| Switcher field | Source                                  | Description                                        |
| -------------- | --------------------------------------- | -------------------------------------------------- |
| name           | edition title or formatted slug         | Display label in the dropdown                      |
| version        | edition slug                            | Used by the theme for matching the current version |
| url            | edition published_url                   | Link target                                        |
| preferred      | true for __main and alternate editions  | Marks the recommended/stable version               |

By default, the switcher JSON includes the __main edition and all release, major, minor, and alternate editions, sorted with __main first, then alternates alphabetically, then by version descending. Draft editions are excluded by default to keep the dropdown focused on stable versions. Editions of kind alternate are included because they represent long-lived deployment targets that users need to navigate between. Both __main and alternate editions get preferred: true, since each represents a canonical view for its context.
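The selection and ordering logic amounts to something like this sketch (`switcher_entries` is an illustrative name, and sorting by slug stands in for true semver ordering):

```python
def switcher_entries(
    editions: list[dict],
    include_kinds: tuple[str, ...] = ("main", "release", "major", "alternate"),
) -> list[dict]:
    """Build the pydata-sphinx-theme switcher array from edition records."""
    main = [e for e in editions
            if e["kind"] == "main" and "main" in include_kinds]
    alternates = sorted(
        (e for e in editions
         if e["kind"] == "alternate" and "alternate" in include_kinds),
        key=lambda e: e["title"],
    )
    versions = sorted(
        (e for e in editions
         if e["kind"] in include_kinds and e["kind"] not in ("main", "alternate")),
        key=lambda e: e["slug"],  # the real implementation sorts by semver
        reverse=True,
    )
    entries = []
    for e in main + alternates + versions:
        entry = {"name": e["title"], "version": e["slug"],
                 "url": e["published_url"]}
        if e["kind"] in ("main", "alternate"):
            entry["preferred"] = True  # canonical view for its context
        entries.append(entry)
    return entries
```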

This behavior is configurable via template.toml:

[switcher]
include_kinds = ["main", "release", "major", "alternate"]  # default; add "draft" to include drafts

In conf.py, projects point the pydata-sphinx-theme at the Docverse-generated file:

html_theme_options = {
    "switcher": {
        "json_url": "https://pipelines.lsst.io/v/switcher.json",
        "version_match": version,  # matches against the "version" field
    }
}

Per-edition metadata JSON (__editions/{slug}.json)#

Each edition gets a small metadata JSON file that client-side JavaScript in the published documentation can fetch to determine which edition is currently being viewed, whether it’s the canonical version, and where the canonical version lives. These files are stored at {project_slug}/__editions/{edition_slug}.json and served at project.lsst.io/v/{edition}/_docverse.json via a Worker URL rewrite.

This approach stores the per-edition files in a separate object store path (__editions/) rather than injecting them into the build’s immutable content. In pointer mode, the edition’s URL space serves content from the build’s object store prefix, so writing metadata into the build would violate immutability. The Worker recognizes requests for _docverse.json within an edition path and rewrites them to the corresponding __editions/ file. In copy mode, the file can optionally be copied into the edition prefix alongside the build content.

Example for a draft edition:

{
  "project": {
    "slug": "pipelines",
    "title": "LSST Science Pipelines",
    "published_url": "https://pipelines.lsst.io/"
  },
  "edition": {
    "slug": "DM-12345",
    "title": "DM-12345",
    "kind": "draft",
    "published_url": "https://pipelines.lsst.io/v/DM-12345/",
    "tracking_mode": "git_ref",
    "date_updated": "2026-02-08T12:00:00Z"
  },
  "canonical_url": "https://pipelines.lsst.io/",
  "is_canonical": false,
  "switcher_url": "https://pipelines.lsst.io/v/switcher.json",
  "dashboard_url": "https://pipelines.lsst.io/v/"
}

Example for the __main (canonical) edition:

{
  "project": {
    "slug": "pipelines",
    "title": "LSST Science Pipelines",
    "published_url": "https://pipelines.lsst.io/"
  },
  "edition": {
    "slug": "__main",
    "title": "Latest",
    "kind": "main",
    "published_url": "https://pipelines.lsst.io/",
    "tracking_mode": "git_ref",
    "date_updated": "2026-02-10T08:30:00Z"
  },
  "canonical_url": "https://pipelines.lsst.io/",
  "is_canonical": true,
  "switcher_url": "https://pipelines.lsst.io/v/switcher.json",
  "dashboard_url": "https://pipelines.lsst.io/v/"
}

The canonical_url always points to the __main edition’s published_url. Client-side JavaScript can use this metadata to:

  • Display “you’re viewing a draft” or “you’re viewing an older release” banners with a link to the canonical version.

  • Inject <link rel="canonical" href="..."> tags for SEO.

  • Integrate with the version switcher to highlight the current edition.

  • Show the edition kind and last-updated date in the page footer.

The switcher_url and dashboard_url fields provide stable references to the project’s other Docverse-generated resources, so client JS doesn’t need to construct URLs.

Re-render triggers#

Dashboard re-rendering is triggered by several events, all funneled through the task queue:

| Event                           | Trigger mechanism                   | Scope                            |
| ------------------------------- | ----------------------------------- | -------------------------------- |
| Template repo push              | GitHub webhook → dashboard_sync job | All projects using that template |
| Build processing completes      | Final step of build_processing job  | Single project                   |
| Edition created/deleted/updated | Final step of edition_update job    | Single project                   |
| Project metadata changed        | Enqueued by PATCH handler           | Single project                   |
| Manual re-render                | Admin API endpoint (if needed)      | Single project or all org projects |

For single-project re-renders triggered within other jobs (build processing, edition update), the render is performed inline as the final step — no separate job is enqueued. Only template syncs (which affect multiple projects) spawn their own dashboard_sync job.

Multiple triggers can race on the same project’s dashboard files (e.g., two build_processing jobs, or a build_processing and a dashboard_sync job running concurrently). Docverse uses Postgres advisory locks at the project and edition level to serialize these writes. See Cross-job serialization for the locking strategy.

Code architecture#

Docverse follows a hexagonal (ports-and-adapters) architecture where domain logic is isolated from infrastructure concerns. Storage backends are swappable per-organization, and services interact only with protocol interfaces — never with concrete implementations directly.

Layered architecture#

Docverse follows the layered architecture pattern established in Ook:

  • dbschema: SQLAlchemy models (database tables).

  • domain: domain models (dataclasses/Pydantic), business logic, protocol definitions. Storage-agnostic.

  • handlers: FastAPI route handlers, FastStream (Kafka) handlers. Thin — validate, resolve context, delegate to services.

  • services: orchestration layer. Services coordinate storage backends and domain logic. Called by handlers.

  • storage: interfaces to external data stores — databases, object stores, CDN APIs, DNS APIs, GitHub API.

CDN and object store abstractions live in the storage layer. Protocol classes defining their interfaces live in the domain layer.

Factory pattern#

Following the Ook factory pattern, Docverse uses a Factory class that is provided as a FastAPI dependency to handlers. The factory combines process-level singletons (held in a ProcessContext) with a request-scoped database session to construct services and storage clients on demand.

For Docverse’s multi-tenant architecture, the factory additionally handles org-specific client construction. The flow:

  1. A handler receives a request scoped to an organization (resolved from URL path or project slug).

  2. The handler gets the Factory from FastAPI’s dependency injection.

  3. The factory loads the org’s configuration from the database.

  4. factory.create_object_store(org) inspects the org’s provider config and returns the correct implementation (S3 client with that org’s bucket/credentials, or a GCS client, etc.).

  5. factory.create_cdn_client(org) does the same for the CDN provider.

  6. factory.create_edition_service(org) wires up the org-specific object store, CDN client, and database stores into the edition update service.

Org-specific clients are cached within a request using a dict keyed by org ID inside the factory instance, so multiple operations on the same org within a single request reuse the same clients.
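The per-request caching could look like this sketch (constructor injection of `build_store` keeps the example self-contained; the real factory dispatches on the org’s provider configuration):

```python
class Factory:
    """Request-scoped factory combining process singletons with a DB session."""

    def __init__(self, process_context, session, build_store):
        self._process_context = process_context  # process-level singletons
        self._session = session                  # request-scoped DB session
        self._build_store = build_store          # org config -> store client
        self._object_stores: dict = {}           # client cache keyed by org ID

    def create_object_store(self, org_id):
        # Repeat operations on the same org within one request reuse the
        # same client (and its underlying connection pool).
        if org_id not in self._object_stores:
            self._object_stores[org_id] = self._build_store(org_id)
        return self._object_stores[org_id]
```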

Open design question — cross-request client pooling: Org-specific clients (and their underlying connection pools) could potentially be cached in ProcessContext across requests, lazily created and keyed by org ID. This would avoid per-request connection setup overhead for frequently-accessed orgs. Needs exploration around credential rotation, memory, and stale-client concerns.

Object store abstraction#

A protocol class in the domain layer defines the interface that all object store implementations must satisfy. Concrete implementations in the storage layer:

  • S3-compatible: uses aiobotocore (async). Covers AWS S3, generic S3-compatible stores (MinIO, etc.), and Cloudflare R2 (which exposes an S3-compatible API).

  • GCS: uses gcloud-aio-storage (async). GCS’s API is distinct enough from S3 to warrant a separate implementation rather than forcing it through an S3-compatibility shim.

A factory method in the storage layer (called by the Factory) instantiates the correct implementation based on the org’s provider configuration. Services only interact with the protocol — they are backend-agnostic.

Cloudflare R2 is accessed via the S3-compatible implementation. R2’s S3 API compatibility is comprehensive enough that a dedicated implementation is not required. The key difference is R2’s zero egress fees, which is a cost consideration rather than an API difference.

Object store interface#

The protocol defines these operations:

  • Individual operations: upload object, copy object, move object (to purgatory prefix), delete object, get object metadata, generate presigned URL (for client uploads).

  • Bulk operations (critical for performance): batch copy (for edition updates involving 500+ objects), batch move (for purgatory), batch delete (for hard-deleting purgatory contents). Bulk methods accept lists of objects and handle parallelism internally using asyncio semaphores. S3 implementations can additionally leverage S3-specific batch APIs where available.

  • Listing: list objects under a key prefix.
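A sketch of the protocol (method names and signatures are illustrative assumptions, not the actual Docverse interface):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class ObjectStore(Protocol):
    """Interface every object store backend must satisfy."""

    async def upload(self, key: str, body: bytes, content_type: str) -> None: ...
    async def copy(self, src_key: str, dest_key: str) -> None: ...
    async def move(self, src_key: str, dest_key: str) -> None: ...
    async def delete(self, key: str) -> None: ...
    async def presigned_upload_url(self, key: str) -> str: ...
    async def list_keys(self, prefix: str) -> list[str]: ...

    # Bulk operations handle parallelism internally (asyncio semaphores).
    async def batch_copy(self, pairs: list[tuple[str, str]]) -> None: ...
    async def batch_delete(self, keys: list[str]) -> None: ...
```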

CDN abstraction#

The CDN abstraction follows the same pattern: protocol in domain, implementations in storage, factory-constructed per org.

The CDN interface exposes domain-meaningful operations rather than provider-specific primitives:

  • Purge edition: invalidate all cached content for a specific edition of a project.

  • Purge build: invalidate cached content for a build (used during build deletion).

  • Purge dashboard: invalidate the cached dashboard page for a project.

  • Update edition mapping: write a new edition→build mapping to the edge data store (pointer mode only; no-op for copy-mode CDNs).

Each CDN implementation translates these operations into provider-specific API calls:

  • Fastly: purge by surrogate key (a single API call invalidates all objects tagged with the edition’s surrogate key). Edition mappings via Fastly KV Store API.

  • Cloudflare Workers: purge by URL or embed build ID in cache keys so that mapping changes implicitly invalidate old entries. Edition mappings via Workers KV API.

  • Google Cloud CDN: purge by URL pattern. No edge data store (copy mode only).

The edition/build/product domain objects (or their identifiers) are passed to the purge methods. The implementation derives whatever provider-specific identifiers it needs (surrogate keys, URL prefixes, URL patterns, KV keys) from the domain objects and org configuration.

CDN implementations declare their capabilities via properties on the class. The key capability is pointer mode support (supports_pointer_mode): CDNs with programmable edge compute and an edge data store can resolve the edition→build mapping at the edge, while CDNs without edge compute fall back to copy mode. The edition update service checks supports_pointer_mode at runtime to select the pointer mode or copy mode code path.

Queue backend abstraction#

The queue backend follows the same protocol-based pattern as the object store and CDN abstractions. A protocol class in the domain layer defines the interface for enqueuing jobs and querying job metadata, while concrete implementations in the storage layer adapt specific queue technologies.

The initial implementation uses Arq via Safir’s ArqQueue with Redis as the message transport. Safir provides both the production ArqQueue and a MockArqQueue for testing, which aligns with Docverse’s existing testing patterns. The QueueJob Postgres table remains the authoritative state store — the queue backend is treated as a delivery mechanism only.

See the Queue backend abstraction section in the queue design for the full protocol definition, implementation details, and infrastructure notes.

Service layer#

Services are the orchestration layer that ties storage backends, domain logic, and database stores together. They are called by thin handlers that validate input, resolve context (e.g., which org the request targets), and delegate to the service.

For example, EditionService wires together the org-specific object store, CDN client, and database stores to coordinate edition updates. When pointer mode is available, it writes a KV mapping and purges the CDN cache. When copy mode is required, it computes a diff against the build inventory and performs the ordered in-place update.

Services are also invoked from background jobs processed by the task queue (see Task queue design); the same service logic applies whether the caller is an HTTP handler or a queue worker.

Beyond the server-side architecture, Docverse’s codebase includes a Python client library and CLI that share the repository with the server. This monorepo structure and the client-owned model pattern are integral to maintaining type safety across the API boundary.

Client-server monorepo#

The Python client and the Docverse server share a single Git repository, following the monorepo pattern established by Squarebot. The server package lives at the repository root (matching the Ook and Safir conventions), which simplifies Docker builds and makes nox-uv’s @session decorator work naturally:

docverse/                           # lsst-sqre/docverse
├── pyproject.toml                  # workspace root = server package
├── uv.lock                         # single lockfile for entire workspace
├── noxfile.py                      # shared dev tooling
├── ruff-shared.toml                # shared Ruff configuration
├── Dockerfile
├── client/
│   ├── pyproject.toml              # docverse-client (PyPI library)
│   └── src/
│       └── docverse/
│           └── client/
│               ├── __init__.py
│               ├── _client.py      # DocverseClient
│               ├── _cli.py         # CLI entry point
│               ├── _models.py      # Pydantic request/response models
│               └── _tar.py         # Tarball creation utility
├── src/
│   └── docverse/
│       ├── ...
│       └── models/                 # imports and extends client models
└── tests/

Attribute      Client                               Server
PyPI name      docverse-client                      docverse
Import path    docverse.client                      docverse
Package style  Namespace package (docverse.client)  Namespace package (docverse)
Location       client/ subdirectory                 Repository root

The monorepo is motivated by four concerns:

  • Atomic model changes: Pydantic models that define the API contract live in the client package. When an endpoint’s request or response shape changes, the model update and the server handler update land in the same commit, eliminating version skew.

  • Shared dev tooling: a single noxfile orchestrates linting, type-checking, and testing for both packages. The uv workspace handles editable installs of both packages automatically via uv sync.

  • Unidirectional dependency: the server depends on the client package (for the shared models); the client never imports from the server. This keeps the client lightweight and installable in CI without pulling in FastAPI, SQLAlchemy, or Redis.

  • Lessons from LTD: in the LTD era, the API server (ltd-keeper) and upload client (ltd-conveyor) lived in separate repositories with independently-defined models. The monorepo eliminates model drift issues between the two.

The docverse-upload GitHub Action is intentionally not part of this monorepo. The monorepo’s “atomic model changes” benefit applies to Python packages that share Pydantic models at import time; the TypeScript action consumes the API through a generated OpenAPI spec, not through Python imports, so co-location provides no additional type-safety benefit. Keeping the action in its own repository also avoids Git tag ambiguity — the monorepo already uses prefix-scoped tags (client/v1.0.0) for the Python client, and adding v1-style tags for the action would conflict with any future need for repository-level semver tags. The OpenAPI spec serves as the contract bridge between the two repositories (see OpenAPI-driven TypeScript development), and spec changes are explicitly visible in action-repo PR diffs, providing a clear review point for API evolution.

Shared noxfile#

The monorepo uses a shared noxfile at the repository root. Sessions that need the full locked environment use nox_uv.session, which calls uv sync to install both workspace members in editable mode along with all locked dependencies:

Listing 1 noxfile.py (excerpt — locked sessions)#
import nox
from nox_uv import session

@session(uv_groups=["dev"])
def test(session: nox.Session) -> None:
    """Run server tests with locked dependencies."""
    session.run("pytest", "tests/")

@session(uv_groups=["typing", "dev"])
def typing(session: nox.Session) -> None:
    """Type-check both packages."""
    session.run("mypy", "src/", "client/src/", "tests/")

Because uv sync installs all workspace members as editable packages, changes to either package are reflected immediately — no manual pip install -e step is needed.

Dependency management#

The monorepo uses a uv workspace with a single uv.lock at the repository root that replaces the traditional server/requirements.txt pattern.

Workspace lockfile. A single uv.lock locks all dependencies for both the server and client packages. uv computes the workspace’s requires-python as the intersection of all members: if the client declares >=3.12 and the server declares >=3.13, the lockfile resolves at >=3.13. The client’s pyproject.toml still declares >=3.12 for PyPI consumers — the lockfile constraint only affects the development environment.

Server (pyproject.toml at the repository root):

  • requires-python = ">=3.13" (single target version).

  • Dependencies locked by uv.lock for reproducible Docker builds.

  • Dependency groups for dev tooling: dev (test dependencies), lint (pre-commit, ruff), typing (mypy and stubs), nox (nox, nox-uv).

  • [tool.uv.sources] maps docverse-client to the workspace member, so uv sync installs the local client in editable mode.
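
Put together, the workspace-related parts of the root pyproject.toml might look like this sketch (dependency lists elided; names follow the conventions described above):

```toml
[project]
name = "docverse"
requires-python = ">=3.13"
dynamic = ["version"]

[tool.uv.workspace]
members = ["client"]

[tool.uv.sources]
# uv sync installs the local client in editable mode instead of
# resolving docverse-client from PyPI.
docverse-client = { workspace = true }
```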

Client (client/pyproject.toml):

  • requires-python = ">=3.12" (broad range for library consumers).

  • Dependencies use broad version ranges appropriate for a PyPI library.

  • Dependency groups mirror the server’s pattern: dev (pytest, respx, and other test dependencies). Unlocked nox sessions read these groups via nox.project.dependency_groups() so the noxfile never duplicates dependency lists.

Testing matrix#

The nox sessions use two distinct mechanisms to control dependency resolution:

Session             Mechanism                      Resolution          Python      Purpose
test                nox_uv.session                 Locked (uv.lock)    3.13        Server tests — same deps as Docker
client-test         nox_uv.session                 Locked (uv.lock)    3.13        Client tests with lockfile deps
client-test-compat  nox.session + session.install  Highest (unlocked)  3.12, 3.13  Client as PyPI users install it
client-test-oldest  nox.session + session.install  lowest-direct       3.12        Validates client’s lower bounds
lint                nox_uv.session                 Locked (uv.lock)    3.13        Pre-commit hooks
typing              nox_uv.session                 Locked (uv.lock)    3.13        mypy on both packages

The key distinction: locked sessions use nox_uv.session (which calls uv sync under the hood); compatibility sessions use standard nox.session with session.install() (which calls uv pip install, bypassing the workspace lockfile entirely). This separation lets the server pin exact versions for reproducibility while the client is tested against the same range of environments its PyPI users will encounter.

Listing 2 noxfile.py (excerpt — unlocked compatibility session)#
import nox

CLIENT_PYPROJECT = nox.project.load_toml("client/pyproject.toml")

@nox.session(python=["3.12", "3.13"])
def client_test_compat(session: nox.Session) -> None:
    """Test the client with unlocked highest dependencies."""
    session.install(
        "./client",
        *nox.project.dependency_groups(CLIENT_PYPROJECT, "dev"),
    )
    session.run("pytest", "client/tests/")

Docker build#

The Docker build uses a two-stage pattern that separates dependency installation (layer-cached) from application code:

Listing 3 Dockerfile (excerpt)#
# Install locked dependencies (cached unless uv.lock changes)
COPY pyproject.toml uv.lock ./
COPY client/pyproject.toml client/pyproject.toml
RUN uv sync --frozen --no-default-groups --no-install-workspace

# Install workspace members without re-resolving
COPY client/ client/
COPY src/ src/
RUN uv pip install --no-deps ./client .

uv sync --frozen --no-default-groups --no-install-workspace installs only the locked production dependencies without development groups or the workspace packages themselves. Both pyproject.toml files must be present for uv to validate the workspace structure. The subsequent uv pip install --no-deps installs the actual workspace packages without triggering a new resolution.

Release workflow#

  • Client (docverse monorepo): released to PyPI on client/v* tags (e.g., client/v1.2.0). The GitHub Actions publish workflow runs uv build --no-sources --package docverse-client and uploads to PyPI. The --no-sources flag disables workspace source overrides so the built distribution references PyPI package names, not local paths.

  • Server (docverse monorepo): released as a Docker image on bare semver tags (e.g., 1.2.0), following the SQuaRE convention of omitting the v prefix. The GitHub Actions workflow builds and pushes the Docker image. Phalanx Helm charts reference the tagged image version.

Version metadata#

Both packages use setuptools_scm to derive their version from Git tags. Because two independent tag namespaces coexist in one repository, each package scopes its tag matching with both tag_regex (for parsing) and a custom describe_command with --match (for discovery). Both mechanisms must be used together — tag_regex alone is not sufficient because git describe would still pick up the wrong tag.

Server (pyproject.toml at repository root):

Listing 4 pyproject.toml (server — version metadata)#
[project]
dynamic = ["version"]

[tool.setuptools_scm]
fallback_version = "0.0.0"
tag_regex = '(?P<version>\d+(?:\.\d+)*)$'

[tool.setuptools_scm.scm.git]
describe_command = [
    "git", "describe", "--dirty", "--tags", "--long",
    "--abbrev=40", "--match", "[0-9]*",
]
  • --match "[0-9]*" ensures git describe only considers bare semver tags, excluding client/v* tags.

  • tag_regex matches bare 1.2.0 without a v prefix.

  • fallback_version provides a version before the first tag exists.

Client (client/pyproject.toml):

Listing 5 client/pyproject.toml (client — version metadata)#
[project]
dynamic = ["version"]

[tool.setuptools_scm]
root = ".."
fallback_version = "0.0.0"
tag_regex = '^client/(?P<version>[vV]?\d+(?:\.\d+)*)$'

[tool.setuptools_scm.scm.git]
describe_command = [
    "git", "describe", "--dirty", "--tags", "--long",
    "--abbrev=40", "--match", "client/v*",
]
  • root = ".." points setuptools_scm to the repository root, since client/ is a subdirectory.

  • --match "client/v*" ensures git describe only considers client-prefixed tags.

  • tag_regex strips the client/ prefix when parsing the version.

  • The slash separator in the tag prefix (client/v1.2.0) avoids the dash-parsing ambiguity that setuptools_scm historically had with dashed prefixes (e.g., client-v1.2.0 could be misinterpreted as a version component separator).

Changelog management#

Both packages use scriv (>= 1.8.0) with separate configuration files and fragment directories, so each package can collect changelog entries independently at its own release cadence.

Layout:

docverse/
├── scriv-client.ini          # scriv config for client
├── scriv-server.ini          # scriv config for server
├── client/
│   ├── changelog.d/          # client fragments
│   └── CHANGELOG.md          # client changelog
├── server-changelog.d/       # server fragments (root level)
└── CHANGELOG.md              # server changelog

The scriv configuration files use .ini format because scriv’s --config flag only accepts .ini files, not TOML. This means the configuration cannot live in the respective pyproject.toml files as [tool.scriv] sections.

Listing 6 scriv-server.ini#
[scriv]
fragment_directory = server-changelog.d
changelog = CHANGELOG.md
format = md
Listing 7 scriv-client.ini#
[scriv]
fragment_directory = client/changelog.d
changelog = client/CHANGELOG.md
format = md

Usage, wrapped in nox targets for convenience:

$ scriv create --config scriv-client.ini     # new client fragment
$ scriv create --config scriv-server.ini     # new server fragment
$ scriv collect --config scriv-client.ini --version 1.2.0   # collect at release
$ scriv collect --config scriv-server.ini --version 1.2.0   # collect at release

Client-owned Pydantic models#

The client package owns every Pydantic model that appears in API request and response payloads. The server imports these models and can subclass them to add server-side concerns (database constructors, internal validators), but the wire format is always defined in the client.

Listing 8 docverse/client/_models.py#
from pydantic import BaseModel, HttpUrl

class BuildCreate(BaseModel):
    """Request body for POST /orgs/:org/projects/:project/builds."""
    git_ref: str
    content_hash: str
    alternate_name: str | None = None
    annotations: dict[str, str] | None = None

class BuildResponse(BaseModel):
    """Response from build endpoints."""
    self_url: HttpUrl
    project_url: HttpUrl
    id: str
    status: str
    git_ref: str
    upload_url: HttpUrl | None = None
    queue_url: HttpUrl | None = None
    date_created: str
Listing 9 docverse/models/build.py#
from docverse.client._models import BuildCreate as ClientBuildCreate
from ..db import Build as BuildRow

class BuildCreate(ClientBuildCreate):
    """Server-side build creation with DB integration."""

    async def to_db_row(self, project_id: int) -> BuildRow:
        return BuildRow(
            project_id=project_id,
            git_ref=self.git_ref,
            content_hash=self.content_hash,
            alternate_name=self.alternate_name,
        )

This pattern provides:

  • Single source of truth: one model definition governs both serialization (client) and deserialization (server).

  • Automatic OpenAPI alignment: FastAPI generates the OpenAPI spec from these Pydantic models. The TypeScript types for the GitHub Action are generated from that same spec (see OpenAPI-driven TypeScript development), creating a chain of consistency from Python to TypeScript.

  • Safe refactoring: renaming a field or adding a required attribute is a type error in both packages immediately, caught by mypy before CI even runs tests.

Python client library#

DocverseClient is an async HTTP client built on httpx that implements the full upload workflow. It follows HATEOAS navigation — the client discovers endpoint URLs from the API root and resource responses rather than hardcoding paths.

Key methods#

Method             Description
create_build()     POST to create a build record; returns the presigned upload URL
upload_tarball()   PUT the tarball to the presigned URL
complete_upload()  PATCH to signal upload complete; returns the queue job URL
get_queue_job()    GET the queue job status
wait_for_job()     Poll the queue job with exponential backoff until terminal state

Upload workflow#

Listing 10 Full upload using the Python client#
import os

from docverse.client import DocverseClient, create_tarball

async with DocverseClient(
    base_url="https://docverse.lsst.io",
    token=os.environ["DOCVERSE_TOKEN"],
) as client:
    # 1. Create a tarball from the built docs directory
    tarball_path, content_hash = create_tarball("_build/html")

    # 2. Create a build record (returns presigned upload URL)
    build = await client.create_build(
        org="rubin",
        project="pipelines",
        git_ref="tickets/DM-12345",
        content_hash=content_hash,
    )

    # 3. Upload the tarball to the presigned URL
    await client.upload_tarball(
        upload_url=build.upload_url,
        tarball_path=tarball_path,
    )

    # 4. Signal upload complete (enqueues processing)
    job_url = await client.complete_upload(build.self_url)

    # 5. Wait for processing to finish
    job = await client.wait_for_job(job_url)
    print(f"Build {build.id} processed: {job.status}")

Authentication#

The client authenticates via a Gafaelfawr token passed either as the token constructor argument or through the DOCVERSE_TOKEN environment variable. The token is sent as a bearer token in the Authorization header on every request. For CI pipelines, this token is typically a bot token stored as a CI secret (see Authentication and authorization for how bot tokens are provisioned).

Tarball creation#

The create_tarball() utility function creates a .tar.gz archive from a directory and computes its SHA-256 content hash:

def create_tarball(source_dir: str | Path) -> tuple[Path, str]:
    """Create a .tar.gz of source_dir; return (path, 'sha256:...' hash)."""

The tarball is written to a temporary file. The content hash is computed during archive creation (streaming through a hashlib.sha256 digest) so the file is read only once.

Job polling#

wait_for_job() polls the queue job endpoint with exponential backoff (initial interval 1 s, max interval 15 s, jitter). It returns when the job reaches a terminal state (completed, completed_with_errors, or failed). On failure, it raises BuildProcessingError with the failure details from the job response.
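
The interval policy can be sketched as a generator (the 25% jitter factor is an assumption; the documented parameters are the 1 s initial and 15 s maximum intervals):

```python
import random
from collections.abc import Iterator


def backoff_schedule(
    initial: float = 1.0, maximum: float = 15.0
) -> Iterator[float]:
    """Yield poll intervals: exponential growth, capped, with jitter.

    Each interval doubles until it hits the cap, and up to 25% random
    jitter is added so concurrent pollers do not synchronize.
    """
    interval = initial
    while True:
        yield interval + random.uniform(0, interval * 0.25)
        interval = min(interval * 2, maximum)
```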

Command-line interface#

The docverse upload CLI command wraps the Python client library for use in shell scripts and CI pipelines where the GitHub Action is not applicable (e.g., Jenkins, GitLab CI, local builds).

Usage#

$ docverse upload \
    --org rubin \
    --project pipelines \
    --git-ref tickets/DM-12345 \
    --dir _build/html

Options#

Flag              Env var             Default                   Description
--org             DOCVERSE_ORG                                  Organization slug
--project         DOCVERSE_PROJECT                              Project slug
--git-ref         DOCVERSE_GIT_REF    current Git HEAD          Git ref for the build
--dir             DOCVERSE_DIR                                  Path to the built documentation directory
--token           DOCVERSE_TOKEN                                Gafaelfawr authentication token
--base-url        DOCVERSE_BASE_URL   https://docverse.lsst.io  Docverse API base URL
--alternate-name  DOCVERSE_ALTERNATE                            Alternate name for scoped editions
--no-wait                             wait enabled              Return immediately after signaling upload

Exit codes#

Code  Meaning
0     Build uploaded and processed successfully
1     Failure (authentication error, upload error, job failed)
2     Upload succeeded but job had warnings (partial success)

When --no-wait is used, exit code 0 means the upload was accepted and processing was enqueued — the CLI does not wait for the job to finish.

Task queue design#

Design philosophy#

Docverse interacts with the job queue through a backend-agnostic abstraction layer (see Queue backend abstraction). The initial implementation uses Arq via Safir’s ArqQueue with Redis as the message transport. The queue backend handles delivery, retries, and worker dispatch. All orchestration and parallelism within a job is handled by Docverse’s service layer using standard Python asyncio. This minimizes coupling to any specific queue technology and keeps the business logic testable with plain async functions.

Each user-facing operation that triggers background work results in a single background job. The job’s worker function calls through the service layer, which coordinates the steps internally. Where steps are independent, the service layer uses asyncio.gather() to parallelize them.

QueueJob table#

Docverse maintains its own QueueJob table in Postgres as the single source of truth for job state and progress. This table serves the user-facing queue API, operator dashboards, and internal coordination (e.g., detecting conflicting concurrent edition updates). The queue backend’s internal state is not queried directly for status — Docverse treats the backend as a delivery mechanism only. See QueueJob in the database schema section for the column reference within the full schema.

Column          Type                      Description
id              int                       Internal PK
public_id       int                       Crockford Base32 serialized in API
backend_job_id  str (nullable)            Reference to the queue backend’s job ID (e.g., Arq UUID)
kind            enum                      build_processing, edition_update, dashboard_sync, lifecycle_eval, git_ref_audit, purgatory_cleanup, credential_reencrypt
status          enum                      queued, in_progress, completed, completed_with_errors, failed, cancelled
phase           str (nullable)            Current phase: inventory, tracking, editions, dashboard
org_id          FK → Organization         Scoped to org (for operator filtering)
project_id      FK → Project (nullable)   Set for build/edition jobs
build_id        FK → Build (nullable)     Set for build processing jobs
progress        JSONB (nullable)          Structured progress data, phase-specific
errors          JSONB (nullable)          Collected error details
date_created    datetime                  When enqueued
date_started    datetime (nullable)       When a worker picked it up
date_completed  datetime (nullable)       When finished

Queue backend abstraction#

The queue backend is accessed through a protocol interface, following the same hexagonal architecture pattern as the object store and CDN abstractions. This keeps the service layer decoupled from any specific queue technology and allows backend swaps without disrupting application logic.

Protocol definition#

from typing import Protocol


class QueueBackend(Protocol):
    """Protocol for queue backend implementations."""

    async def enqueue(
        self,
        job_type: str,
        payload: dict,
        *,
        queue_name: str = "default",
    ) -> str | None:
        """Enqueue a job for background processing.

        Returns the backend-assigned job ID (str), or None if the
        backend does not assign IDs synchronously.
        """
        ...

    async def get_job_metadata(
        self, backend_job_id: str
    ) -> dict | None:
        """Retrieve metadata about a job from the backend.

        Returns backend-specific metadata (e.g., status, result),
        or None if the job is not found. Used for diagnostics only —
        the QueueJob table is the authoritative state store.
        """
        ...

    async def get_job_result(
        self, backend_job_id: str
    ) -> object | None:
        """Retrieve the result of a completed job.

        Returns the job result, or None if not available.
        """
        ...

Implementations#

ArqQueueBackend wraps Safir’s ArqQueue for production use. Arq uses UUID strings as job IDs, which are stored in the backend_job_id column of the QueueJob table. The worker functions are standard async functions that receive the job payload and call through the service layer.

MockQueueBackend wraps Safir’s MockArqQueue for testing. Jobs are executed in-process, making tests deterministic without requiring a running Redis instance.

Both implementations are constructed by the factory and injected into services, consistent with Docverse’s dependency injection pattern.

Infrastructure#

Arq requires a Redis instance as its message broker. In Phalanx deployments, Redis is a standard in-cluster service. The QueueJob Postgres table remains the authoritative state store — Redis holds only transient message data. If Redis state is lost, in-flight jobs can be re-enqueued from QueueJob records with status = 'queued'.
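
A recovery routine along these lines could be sketched as follows (the list_by_status and set_backend_job_id store methods, and the payload shape, are assumptions for illustration, not part of the documented protocol):

```python
async def reenqueue_lost_jobs(queue_job_store, backend) -> int:
    """Re-enqueue QueueJob rows still marked 'queued' after Redis loss.

    Hedged sketch: the store methods and payload fields used here are
    invented for illustration.
    """
    count = 0
    for job in await queue_job_store.list_by_status("queued"):
        # Rebuild a minimal payload from the job row's foreign keys.
        payload = {"project_id": job.project_id, "build_id": job.build_id}
        backend_job_id = await backend.enqueue(job.kind, payload)
        # Record the fresh backend job ID so status lookups keep working.
        await queue_job_store.set_backend_job_id(job.id, backend_job_id)
        count += 1
    return count
```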

Progress tracking#

The service layer writes progress to the QueueJob table at each phase transition and within phases where granular tracking is useful. These are lightweight single-row UPDATEs.

Phase transitions#

At each major phase boundary, the service updates the phase column and resets/initializes the progress JSONB:

await queue_job_store.update_phase(job_id, "inventory")
await inventory_service.catalog_build(build)

await queue_job_store.update_phase(job_id, "tracking")
affected_editions = await tracking_service.evaluate(build)

await queue_job_store.start_editions_phase(job_id, affected_editions)
results = await asyncio.gather(
    *[self._update_single_edition(e, build, job_id)
      for e in affected_editions],
    return_exceptions=True,
)

await queue_job_store.update_phase(job_id, "dashboard")
await dashboard_service.render(project)

await queue_job_store.complete(job_id)

Live edition progress via conditional JSONB merge#

During the editions phase, multiple edition update coroutines run concurrently via asyncio.gather(). Each coroutine updates the progress JSONB when it completes, using Postgres jsonb_set to atomically move its edition slug between the editions_in_progress and editions_completed (as a structured object including the edition’s published_url), editions_skipped, or editions_failed arrays:

-- Mark edition completed
UPDATE queue_job
SET progress = jsonb_set(
    jsonb_set(
        progress,
        '{editions_completed}',
        (progress->'editions_completed') || :completed_entry::jsonb
    ),
    '{editions_in_progress}',
    (progress->'editions_in_progress') - :edition_slug
)
WHERE id = :job_id

Where completed_entry is {"slug": "__main", "published_url": "https://pipelines.lsst.io/"}.

For failures, the slug is moved to editions_failed as a structured object with error context:

-- Mark edition failed
UPDATE queue_job
SET progress = jsonb_set(
    jsonb_set(
        progress,
        '{editions_failed}',
        (progress->'editions_failed') || :failed_entry::jsonb
    ),
    '{editions_in_progress}',
    (progress->'editions_in_progress') - :edition_slug
)
WHERE id = :job_id

Where failed_entry is {"slug": "DM-12345", "error": "R2 timeout after 3 retries"}.

For skipped editions (superseded by a newer build; see Cross-job serialization), the slug is moved to editions_skipped as a structured object with the reason:

-- Mark edition skipped
UPDATE queue_job
SET progress = jsonb_set(
    jsonb_set(
        progress,
        '{editions_skipped}',
        (progress->'editions_skipped') || :skipped_entry::jsonb
    ),
    '{editions_in_progress}',
    (progress->'editions_in_progress') - :edition_slug
)
WHERE id = :job_id

Where skipped_entry is {"slug": "v2.x", "reason": "superseded by build 01HQ-3KBR-T5GN-8W"}.

Postgres serializes the row locks, but since these are sub-millisecond metadata writes against a single row, contention is negligible compared to the actual edition update work (KV writes, cache purges, or object copies).

The service layer wraps each edition update coroutine:

async def _update_single_edition(self, edition, build, job_id):
    try:
        skipped = await self._edition_service.update(edition, build)
        if skipped:
            await self._queue_store.mark_edition_skipped(
                job_id, edition.slug, reason="superseded"
            )
        else:
            await self._queue_store.mark_edition_completed(
                job_id, edition.slug, edition.published_url
            )
    except Exception as e:
        await self._queue_store.mark_edition_failed(
            job_id, edition.slug, str(e)
        )
        raise

Progress JSONB structure#

The progress JSONB is phase-specific. During the editions phase:

{
  "editions_total": 3,
  "editions_completed": [
    { "slug": "__main", "published_url": "https://pipelines.lsst.io/" }
  ],
  "editions_skipped": [
    { "slug": "v2.x", "reason": "superseded by build 01HQ-3KBR-T5GN-8W" }
  ],
  "editions_failed": [
    { "slug": "DM-12345", "error": "R2 timeout after 3 retries" }
  ],
  "editions_in_progress": []
}

The editions_skipped and editions_failed arrays already used structured objects with contextual fields (reason and error, respectively). Promoting editions_completed to a structured object with published_url makes the shape consistent across all terminal-state arrays and enables clients — particularly the GitHub Action’s PR comment feature (Pull request comments) — to discover published URLs directly from job progress without additional API calls. The editions_in_progress array remains a plain string array since in-progress editions have no published URL yet.

For other phases, progress can carry simpler metadata (e.g., {"message": "Cataloging 1,247 objects"} during inventory).

Cross-job serialization#

Several background jobs can race on the same project’s resources. Two rapid build uploads from the same branch can produce two build_processing jobs that both try to update the same edition concurrently. A build_processing job and a dashboard_sync job can both try to write the same project’s dashboard files at the same time. An edition_update job and a build_processing job can both write the same edition’s metadata JSON. Since asyncio.gather() parallelizes edition updates within a job, and multiple workers can process different jobs simultaneously, these concurrent mutations can lead to interleaved KV writes, partial cache purges, torn dashboard HTML, or inconsistent metadata JSON.

Docverse prevents this with Postgres advisory locks at two granularities — per-edition and per-project — combined with a stale build guard for edition updates.

Lock namespacing#

Advisory locks use the two-argument form pg_advisory_lock(classid, objid) to namespace by resource type, avoiding key collisions between edition and project PKs (which come from different sequences):

  • pg_advisory_lock(1, edition.id) — edition-level lock, serializes edition content updates and per-edition metadata JSON writes.

  • pg_advisory_lock(2, project.id) — project-level lock, serializes project-wide dashboard renders (__dashboard.html, __404.html, __switcher.json).

Both services acquire locks through a shared advisory_lock async context manager that wraps the acquire/release pair, making the lock scope visually explicit and eliminating repeated try/finally boilerplate:

@asynccontextmanager
async def advisory_lock(session, classid, objid):
    """Acquire a Postgres advisory lock for the duration of the block."""
    await session.execute(
        text("SELECT pg_advisory_lock(:classid, :objid)"),
        {"classid": classid, "objid": objid},
    )
    try:
        yield
    finally:
        await session.execute(
            text("SELECT pg_advisory_unlock(:classid, :objid)"),
            {"classid": classid, "objid": objid},
        )

Advisory lock acquisition#

Before performing any mutation, EditionService.update() uses the advisory_lock context manager to hold an advisory lock keyed on the edition’s primary key for the duration of the update:

async def update(self, edition, build) -> bool:
    """Update edition to point to build. Returns True if skipped."""
    async with advisory_lock(self._session, 1, edition.id):
        current_build = await self._get_current_build(edition)
        if current_build and current_build.date_created > build.date_created:
            return True  # Skipped — edition already has a newer build

        # ... perform KV write / copy-mode update ...
        # ... log to EditionBuildHistory ...
        # ... write __editions/{slug}.json metadata ...
        return False

The underlying pg_advisory_lock() call blocks (rather than failing) if another session holds the lock for the same key. This means a competing job simply waits its turn — no job is rejected or fails due to contention.

Stale build guard#

After acquiring the lock, the method compares the candidate build’s date_created against the edition’s current build. If the edition already points to a newer build (because a more recent job acquired the lock first), the update is skipped. The caller logs the skip in the job’s progress JSONB via mark_edition_skipped, and the edition slug appears in the editions_skipped array rather than editions_completed.

This guarantees the edition never regresses to an older build, regardless of the order in which competing jobs acquire the lock.

Project-level lock for dashboard renders#

DashboardService.render(project) acquires a project-level advisory lock before writing the project-wide dashboard files. After releasing the project lock, it acquires each edition’s lock in turn to write per-edition metadata JSON, serializing against any concurrent EditionService.update() that writes the same file.

async def render(self, project):
    """Render all dashboard outputs for a project."""
    # Project-wide files under project lock
    async with advisory_lock(self._session, 2, project.id):
        await self._write_dashboard_html(project)
        await self._write_404_html(project)
        await self._write_switcher_json(project)

    # Per-edition metadata under individual edition locks
    for edition in await self._get_editions(project):
        async with advisory_lock(self._session, 1, edition.id):
            await self._write_edition_metadata_json(edition)

No stale guard is needed for dashboard renders. The dashboard output is deterministic from the current database state, so the last render to complete always produces the correct output. The per-edition metadata writes can be parallelized across editions (different lock keys), but each individual write serializes against any concurrent EditionService.update() for the same edition.

Why this works#

  • No concurrent mutation: The edition-level advisory lock serializes all updates to a given edition, whether from build_processing or edition_update jobs. The project-level lock serializes all dashboard renders for a project, whether from build processing, edition updates, template syncs, or manual re-renders.

  • No failures: pg_advisory_lock() blocks until the lock is available — the job waits rather than failing.

  • Correct final state: If Build B (newer) is processed before Build A (older) due to lock acquisition order, Build A’s update is skipped by the stale guard. The edition always reflects the most recent build. Dashboard renders are deterministic from database state, so the last render to complete is always correct.

  • Compatible with asyncio.gather(): Each edition’s lock is independent, so parallel updates of different editions within the same job proceed without contention. Only updates to the same edition across jobs serialize. Similarly, dashboard_sync jobs that re-render multiple projects in parallel acquire independent project-level locks.

  • Covers all code paths: Placing the edition lock inside EditionService.update() covers both the build_processing parallel edition phase and the edition_update manual reassignment path. Placing the project lock inside DashboardService.render() covers all dashboard render triggers. Per-edition metadata JSON is protected in both locations — inside EditionService.update() and during DashboardService.render()’s per-edition loop.

Connection impact#

The advisory lock holds a database session open for the duration of the locked operation. For edition updates in pointer mode (~2 seconds for KV write + cache purge) this is negligible. For copy mode (longer due to object copies), the session is held longer but this is acceptable given expected concurrency levels — at most a few concurrent builds per project. The project-level dashboard lock is held only for the duration of writing the three project-wide files (HTML + JSON), which is sub-second — significantly shorter than edition content updates.

Operator queries#

The QueueJob table provides a single place for operators to understand system state across all workers:

  • Backlog depth: SELECT count(*), kind FROM queue_job WHERE status = 'queued' GROUP BY kind

  • Active work: SELECT * FROM queue_job WHERE status = 'in_progress' — shows what every worker is doing, which phase each job is in, and per-edition progress

  • Edition update activity: SELECT * FROM queue_job WHERE status = 'in_progress' AND project_id = :pid AND phase = 'editions' — shows concurrent edition work for a project. Advisory locks (see Cross-job serialization) handle serialization automatically; this query is for observability

  • Error rates: SELECT count(*) FROM queue_job WHERE status IN ('failed', 'completed_with_errors') AND date_completed > now() - interval '1 hour'

  • Per-org throughput: SELECT org_id, count(*) FROM queue_job WHERE status = 'completed' AND date_completed > now() - interval '1 hour' GROUP BY org_id

  • Slow jobs: SELECT * FROM queue_job WHERE status = 'in_progress' AND date_started < now() - interval '10 minutes'

Job types#

Build processing (build_processing)#

Triggered when a client signals upload complete (PATCH .../builds/:build with status: uploaded). This is the primary job type.

The service layer executes the following steps inside the single background job:

  1. Inventory (sequential) — catalog the build’s objects from the object store into the BuildObject table in Postgres (key, content hash, content type, size). This is a listing + metadata operation against the object store.

  2. Evaluate tracking rules (sequential) — determine which editions should update based on the build’s git ref, the project’s edition tracking modes, and the org’s rewrite rules. Auto-create new editions if needed (e.g., new semver major stream, new git ref). Returns a list of affected editions.

  3. Update editions (parallel via asyncio.gather()) — for each affected edition, update the edition to point to the new build. In pointer mode this writes a new KV mapping and purges the CDN cache; in copy mode this performs the ordered diff-copy-purge sequence. Each edition update also logs the transition to the EditionBuildHistory table. If one edition update fails, the others continue to completion; failures are collected and reported via the QueueJob progress JSONB.

  4. Render project dashboard (sequential, runs once after all edition updates complete) — re-render the project’s dashboard and 404 pages using the current edition metadata from the database and the resolved template. A single build may update multiple editions, but the dashboard reflects the project’s full edition list and only needs to be rendered once.

  5. Update job status (sequential) — mark the QueueJob as completed, completed_with_errors (if some editions failed), or failed. Also update the build’s status accordingly.
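These five phases can be sketched as a single async function. The callable-based structure and names below are illustrative assumptions, not the actual Docverse service-layer API:

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def process_build(
    inventory: Callable[[], Awaitable[None]],
    evaluate_tracking: Callable[[], Awaitable[Iterable[str]]],
    update_edition: Callable[[str], Awaitable[None]],
    render_dashboard: Callable[[], Awaitable[None]],
) -> str:
    """Run the five build_processing phases and return the final job status."""
    await inventory()  # 1. inventory the build's objects (sequential)
    editions = await evaluate_tracking()  # 2. find affected editions (sequential)

    # 3. Update editions in parallel; one failure does not cancel the others.
    results = await asyncio.gather(
        *(update_edition(slug) for slug in editions),
        return_exceptions=True,
    )
    failures = [r for r in results if isinstance(r, BaseException)]

    await render_dashboard()  # 4. render the project dashboard once

    # 5. Final status mirrors the QueueJob terminal states.
    return "completed_with_errors" if failures else "completed"
```

The dashboard render deliberately runs after the gather completes, so it always reflects the final edition state regardless of individual edition failures.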

Edition reassignment (edition_update)#

Triggered when an admin PATCHes an edition with a new build field (manual reassignment or rollback). Simpler than build processing — a single background job that:

  1. Updates the edition to point to the specified build (pointer mode KV write or copy mode diff-copy).

  2. Logs the transition to EditionBuildHistory.

  3. Renders the project dashboard.

Dashboard template sync (dashboard_sync)#

Triggered by a GitHub webhook when a tracked dashboard template repository is updated. A single background job that syncs the template files from GitHub to the object store, then re-renders dashboards for all affected projects (all projects in the org for an org-level template, or a single project for a project-level override), using asyncio.gather() to parallelize across projects. See the Dashboard templating system section for the full sync flow.

Lifecycle evaluation (lifecycle_eval)#

Scheduled periodically (see Periodic job scheduling). A single background job that scans all orgs and projects for editions and builds matching lifecycle rules (stale drafts, orphan builds). Soft-deletes matching resources and moves object store content to purgatory. Uses asyncio.gather() to parallelize across orgs.

Git ref audit (git_ref_audit)#

Scheduled periodically (see Periodic job scheduling). A single background job that verifies git refs tracked by editions still exist on their GitHub repositories. Flags or soft-deletes editions whose refs have been deleted (if the ref_deleted lifecycle rule is enabled). Catches cases where GitHub webhook delivery for ref deletion events was missed.

Purgatory cleanup (purgatory_cleanup)#

Scheduled periodically (see Periodic job scheduling). A single background job that hard-deletes object store objects that have been in purgatory longer than the org’s configured retention period. Simple listing + batch delete per org.

Credential re-encryption (credential_reencrypt)#

Scheduled periodically (see Periodic job scheduling). A single background job that iterates over all organization_credentials rows and calls CredentialEncryptor.rotate() on each encrypted_credential value. This re-encrypts every token under the current primary Fernet key. Unlike Vault’s vault:vN: prefix, Fernet tokens don’t indicate which key encrypted them, so the job processes all rows unconditionally — MultiFernet.rotate() is idempotent. This ensures that after a key rotation, all stored credentials are migrated to the new key without ever exposing plaintext. Parallelized across orgs via asyncio.gather().
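As a minimal sketch of the rotation step using the cryptography package's MultiFernet (the variable names and example token are illustrative):

```python
from cryptography.fernet import Fernet, MultiFernet

# Simulate a key rotation: a credential was encrypted under the old key.
old_key = Fernet.generate_key()
new_key = Fernet.generate_key()
token = Fernet(old_key).encrypt(b"cdn-api-token")

# The primary (new) key is listed first. rotate() decrypts with whichever
# key matches and re-encrypts under the primary key, never returning
# plaintext to the caller. Rotating an already-current token just
# re-encrypts under the same primary key, which is why the job can
# process all rows unconditionally.
encryptor = MultiFernet([Fernet(new_key), Fernet(old_key)])
rotated = encryptor.rotate(token)

assert Fernet(new_key).decrypt(rotated) == b"cdn-api-token"
```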

Periodic job scheduling#

Periodic jobs are scheduled using Kubernetes CronJobs rather than the queue backend’s built-in scheduling features (e.g., Arq’s cron). This keeps scheduling decoupled from the queue backend, enabling backend swaps without changing scheduling infrastructure. Kubernetes CronJobs are well-understood, observable, and already used throughout Phalanx.

Each periodic job type gets a CronJob that runs a thin CLI command to create a QueueJob record and enqueue it via the queue backend:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: docverse-lifecycle-eval
spec:
  schedule: "0 3 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: docverse-enqueue
              image: ghcr.io/lsst-sqre/docverse:latest
              command: ["docverse", "enqueue", "lifecycle_eval"]
          restartPolicy: OnFailure

The docverse enqueue CLI command connects to the database and Redis, creates a QueueJob record with status: queued, enqueues the job via the queue backend, and exits. The actual work is performed by the Docverse worker process.
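A minimal sketch of such a CLI entry point using argparse. The option shapes and the `PERIODIC_KINDS` tuple are assumptions based on the schedule table; the database and queue connections are elided:

```python
import argparse

# Periodic job kinds from the schedule table; the real CLI may accept more.
PERIODIC_KINDS = (
    "lifecycle_eval",
    "git_ref_audit",
    "purgatory_cleanup",
    "credential_reencrypt",
)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="docverse")
    subcommands = parser.add_subparsers(dest="command", required=True)
    enqueue = subcommands.add_parser("enqueue", help="Create and enqueue a QueueJob")
    enqueue.add_argument("kind", choices=PERIODIC_KINDS)
    return parser


def main(argv):
    args = build_parser().parse_args(argv)
    # The real command would create a QueueJob row (status: queued),
    # enqueue it via the queue backend, and exit.
    return {"command": args.command, "kind": args.kind}
```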

Schedule table#

Job type              Default schedule       Description
lifecycle_eval        Daily at 03:00 UTC     Evaluate edition and build lifecycle rules
git_ref_audit         Daily at 04:00 UTC     Verify git refs tracked by editions
purgatory_cleanup     Daily at 05:00 UTC     Hard-delete expired purgatory objects
credential_reencrypt  Weekly (Sunday 02:00)  Re-encrypt credentials under current primary Fernet key

Schedules are configurable per-deployment via Phalanx Helm values. Operators can adjust frequencies, add maintenance windows, or disable specific jobs without code changes.

Failure and retry#

The queue backend handles job-level retries. With Arq, retry behavior is configured per job type via the worker’s job definitions. The retry policy varies by job type:

  • build_processing and edition_update: retry with backoff, up to 3 attempts. Jobs are idempotent at each step — inventory upserts, tracking evaluation is deterministic, edition updates use diffs (already-updated editions show no changes on re-run). On retry, the job re-runs from the beginning but completed steps are effectively no-ops. The QueueJob progress is reset at the start of each attempt.

  • Periodic jobs (lifecycle_eval, git_ref_audit, purgatory_cleanup): retry once, then wait for the next scheduled run. These are self-correcting — anything missed on one run will be caught on the next.

Within a build processing job, individual edition update failures (in the asyncio.gather()) do not fail the entire job. The service layer uses return_exceptions=True, collects results, and marks the job as completed_with_errors if some editions failed while others succeeded. Failed editions are recorded in the progress JSONB with error messages for diagnosis. A subsequent retry or manual edition PATCH can address individual failures.
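A self-contained sketch of that collection pattern; the edition slugs and the error are fabricated for illustration:

```python
import asyncio


async def update_edition(slug: str) -> str:
    """Stand-in for an edition update that fails for one edition."""
    if slug == "DM-12345":
        raise RuntimeError("CDN purge failed")
    return slug


async def run_editions(slugs):
    # return_exceptions=True keeps sibling updates running and returns
    # exception objects in result order instead of raising.
    results = await asyncio.gather(
        *(update_edition(s) for s in slugs), return_exceptions=True
    )
    completed = [r for r in results if isinstance(r, str)]
    failed = {
        slug: str(err)
        for slug, err in zip(slugs, results)
        if isinstance(err, BaseException)
    }
    return completed, failed


completed, failed = asyncio.run(run_editions(["__main", "v2.x", "DM-12345"]))
# completed == ["__main", "v2.x"]; failed maps "DM-12345" to its error message
```

The `failed` mapping is the shape of information that would land in the progress JSONB for later diagnosis.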

Job retention#

Completed and failed QueueJob records are retained for a configurable period (default: 7 days) before being cleaned up by the purgatory cleanup job. The queue API returns 404 for expired jobs.

REST API design#

Conventions#

The Docverse REST API follows SQuaRE’s REST API design conventions, specifically the patterns established in Ook and Times Square:

  • No URL versioning: the API has a single set of paths (no /v1/ prefix). Breaking changes would be handled by introducing new endpoints alongside deprecated ones. Supporting multiple API versions simultaneously would require substantial design work, so committing to path-based versioning at this stage would be premature.

  • HATEOAS-style navigation: all resource representations include self_url and relevant navigation URLs (e.g., project_url, org_url, builds_url, editions_url). Clients navigate the API via these provided URLs rather than constructing their own.

  • Collections are top-level arrays: collection endpoints return a JSON array of resource objects. Pagination metadata is carried in response headers, not a wrapper object.

  • Keyset pagination: all collection endpoints use keyset pagination via Safir’s pagination library. Pagination cursors and links are returned in headers.

  • Errors: all error responses follow safir.models.ErrorModel — a detail array of {type, msg, loc?} objects, compatible with FastAPI’s built-in validation error format.

  • Background job URLs: when a POST or PATCH enqueues background work (e.g., build processing, edition updates), the response includes a queue_url pointing to the job status resource.

Resource identifiers#

Resources use human-readable identifiers in URLs rather than auto-generated database primary keys:

Resource        URL identifier       Format                                            Example
Organization    slug                 lowercase alphanumeric + hyphens                  rubin, spherex
Project         slug                 URL-safe, matches doc handle                      sqr-006, pipelines
Edition         slug                 URL-safe, derived from git ref via rewrite rules  __main, DM-12345, v2.x
Build           Crockford Base32 ID  12 chars + checksum, hyphenated                   01HQ-3KBR-T5GN-8W
Queue job       Crockford Base32 ID  12 chars + checksum, hyphenated                   0G4R-MFBZ-K7QP-5X
Org membership  composite key        {principal_type}:{principal}                      user:docverse-ci-rubin, group:g_spherex

Build and queue job IDs use the Crockford Base32 implementation from Ook, backed by base32-lib. IDs are stored as integers in Postgres but serialized as base32 strings with checksums in the API via Pydantic. Build IDs are randomly generated (not ordered). Queue job IDs are Docverse-owned public identifiers that map internally to the queue backend’s job IDs via the backend_job_id column in the QueueJob table.
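For illustration, a stdlib-only sketch of Crockford Base32 encoding with hyphen grouping. The production implementation is base32-lib via Ook; the checksum suffix is omitted here:

```python
# Crockford's alphabet excludes I, L, O, and U to avoid misreading.
CROCKFORD_ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def encode_crockford(value: int, length: int = 12) -> str:
    """Encode an integer as a zero-padded, hyphen-grouped Crockford string."""
    digits = []
    while value:
        value, remainder = divmod(value, 32)
        digits.append(CROCKFORD_ALPHABET[remainder])
    raw = "".join(reversed(digits)).rjust(length, "0")
    # Group into chunks of four, matching IDs like 01HQ-3KBR-T5GN.
    return "-".join(raw[i : i + 4] for i in range(0, len(raw), 4))
```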

Ingress and authorization mapping#

Two Gafaelfawr ingresses protect the API:

Ingress path  Required scope  Purpose
/admin/*      admin:docverse  Superadmin operations (create/delete orgs)
/*            exec:docverse   All other operations

All org-level authorization (admin vs uploader vs reader) is enforced at the application layer via OrgMembership checks. Gafaelfawr ingresses cannot express “user X has role Y in org Z” — that granularity requires application-level logic. The ingress layer ensures the user is authenticated and has basic Docverse access; the application layer checks the specific role required for each endpoint.

Endpoint catalog#

Root#

GET  /                                              → API metadata and navigation URLs

Returns API version, available org URLs, and links to documentation. No authentication required beyond the base exec:docverse scope.

Superadmin — organization management#

These endpoints are separated under /admin/ to enable the admin:docverse Gafaelfawr scope at the ingress level.

POST   /admin/orgs                                  → create organization
DELETE /admin/orgs/:org                              → delete organization

Organizations#

GET    /orgs                                         → list organizations (reader+)
GET    /orgs/:org                                    → get organization (reader+)
PATCH  /orgs/:org                                    → update org settings (admin)

The GET /orgs/:org response includes navigation URLs and the slug rewrite rules:

{
  "self_url": "https://docverse.../orgs/rubin",
  "projects_url": "https://docverse.../orgs/rubin/projects",
  "members_url": "https://docverse.../orgs/rubin/members",
  "slug": "rubin",
  "title": "Rubin Observatory",
  "slug_rewrite_rules": [
    { "type": "ignore", "glob": "dependabot/**" },
    { "type": "prefix_strip", "prefix": "tickets/", "edition_kind": "draft" }
  ]
}

Slug preview#

POST   /orgs/:org/slug-preview                       → preview slug resolution (admin)

Tests the org’s (or a specific project’s) edition slug rewrite rules against a git ref without creating any resources. See the Edition slug rewrite rules section for request/response details.
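To make the rule semantics concrete, here is a hedged sketch of how the ordered rewrite rules might be evaluated, based on the rule shapes in the organization example above; the fallback slugification at the end is an assumption, not the documented default:

```python
import fnmatch


def resolve_slug(git_ref, rules):
    """Return an edition slug for git_ref, or None if the ref is ignored.

    Rules are evaluated in order; the first matching rule wins.
    """
    for rule in rules:
        if rule["type"] == "ignore" and fnmatch.fnmatch(git_ref, rule["glob"]):
            return None  # no edition is created for ignored refs
        if rule["type"] == "prefix_strip" and git_ref.startswith(rule["prefix"]):
            return git_ref[len(rule["prefix"]) :]
    # Assumed default: slugify the ref by replacing path separators.
    return git_ref.replace("/", "-")


rules = [
    {"type": "ignore", "glob": "dependabot/**"},
    {"type": "prefix_strip", "prefix": "tickets/", "edition_kind": "draft"},
]
```

With these rules, `tickets/DM-12345` resolves to `DM-12345`, while dependabot refs resolve to `None`.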

Org membership#

GET    /orgs/:org/members                            → list memberships (admin)
POST   /orgs/:org/members                            → add membership (admin)
GET    /orgs/:org/members/:id                        → get membership (admin)
DELETE /orgs/:org/members/:id                        → remove membership (admin)

Membership :id uses a composite key format: user:jdoe or group:g_spherex. This is self-documenting in URLs and corresponds directly to the principal_type:principal pair, which is unique within an org.
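Parsing that composite key is straightforward; a sketch (the helper name is illustrative):

```python
def parse_membership_id(member_id: str):
    """Split a membership URL id like "user:jdoe" into its components."""
    principal_type, sep, principal = member_id.partition(":")
    if not sep or principal_type not in ("user", "group") or not principal:
        raise ValueError(f"invalid membership id: {member_id!r}")
    return principal_type, principal
```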

Projects#

GET    /orgs/:org/projects                           → list projects (reader+, paginated)
POST   /orgs/:org/projects                           → create project (admin)
GET    /orgs/:org/projects/:project                  → get project (reader+)
PATCH  /orgs/:org/projects/:project                  → update project (admin)
DELETE /orgs/:org/projects/:project                  → soft-delete project (admin)

The GET /orgs/:org/projects/:project response includes navigation URLs:

{
  "self_url": "https://docverse.../orgs/rubin/projects/pipelines",
  "org_url": "https://docverse.../orgs/rubin",
  "editions_url": "https://docverse.../orgs/rubin/projects/pipelines/editions",
  "builds_url": "https://docverse.../orgs/rubin/projects/pipelines/builds",
  "slug": "pipelines",
  "title": "LSST Science Pipelines",
  "doc_repo": "https://github.com/lsst/pipelines_lsst_io",
  "slug_rewrite_rules": null
}

The slug_rewrite_rules field is null when the project inherits the org’s rules, or contains a project-specific rule list when overridden.

Editions#

GET    /orgs/:org/projects/:project/editions         → list editions (reader+, paginated)
POST   /orgs/:org/projects/:project/editions         → create edition (admin)
GET    /orgs/:org/projects/:project/editions/:ed     → get edition (reader+)
PATCH  /orgs/:org/projects/:project/editions/:ed     → update edition (admin)
DELETE /orgs/:org/projects/:project/editions/:ed     → soft-delete edition (admin)

PATCH supports setting build to reassign an edition to a specific build (used for rollback or manual reassignment). This enqueues an edition update task and returns a queue_url.

The POST to create an edition accepts the following request body:

{
  "slug": "usdf-dev--main",
  "title": "USDF Dev — main",
  "kind": "alternate",
  "tracking_mode": "alternate_git_ref",
  "tracking_params": {
    "git_ref": "main",
    "alternate_name": "usdf-dev"
  }
}

slug (string, required)
    URL-safe edition identifier. Must be unique within the project.

title (string, required)
    Human-readable title for dashboards and metadata.

kind (string, required)
    Edition kind (main, release, draft, major, minor, alternate). Controls dashboard grouping and lifecycle rule targeting.

tracking_mode (string, required)
    One of the supported tracking modes (see Projects, editions and builds). Determines which builds update this edition.

tracking_params (object, optional)
    Mode-specific parameters (e.g., {"git_ref": "main"} for git_ref mode, or {"git_ref": "main", "alternate_name": "usdf-dev"} for alternate_git_ref mode). Required for parameterized tracking modes.

GET    /orgs/:org/projects/:project/editions/:ed/history → edition-build history (reader+, paginated)

The history endpoint returns a paginated list of builds the edition has pointed to, ordered by position (most recent first). Each entry includes the build reference, timestamp, and position.

The GET /orgs/:org/projects/:project/editions/:ed response:

{
  "self_url": "https://docverse.../orgs/rubin/projects/pipelines/editions/__main",
  "project_url": "https://docverse.../orgs/rubin/projects/pipelines",
  "build_url": "https://docverse.../orgs/rubin/projects/pipelines/builds/01HQ-3KBR-T5GN-8W",
  "history_url": "https://docverse.../orgs/rubin/projects/pipelines/editions/__main/history",
  "published_url": "https://pipelines.lsst.io/",
  "slug": "__main",
  "kind": "main",
  "tracking_mode": "git_ref",
  "tracking_params": { "git_ref": "main" },
  "date_updated": "2026-02-08T12:00:00Z"
}

Builds#

GET    /orgs/:org/projects/:project/builds           → list builds (reader+, paginated)
POST   /orgs/:org/projects/:project/builds           → create build (uploader+)
GET    /orgs/:org/projects/:project/builds/:build    → get build (reader+)
PATCH  /orgs/:org/projects/:project/builds/:build    → signal upload complete (uploader+)
DELETE /orgs/:org/projects/:project/builds/:build    → soft-delete build (admin)

The POST to create a build returns a single presigned upload URL for the tarball:

Request:

{
  "git_ref": "main",
  "alternate_name": "usdf-dev",
  "content_hash": "sha256:a1b2c3d4...",
  "annotations": { "kind": "release" }
}

The alternate_name field is optional — most projects don’t use it. When present, it scopes the build to alternate-aware editions. Builds with alternate_name are not matched by generic git_ref-only tracking editions — the alternate name acts as a namespace. See Compound slug derivation for alternate-scoped builds in the projects section for details on how alternate names interact with edition tracking and slug derivation.
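The namespace behavior can be sketched as a matching predicate. The dict shapes are illustrative; the real tracking-mode evaluation lives in the Docverse service layer and covers more modes:

```python
def edition_matches(edition: dict, build: dict) -> bool:
    """Decide whether a build should update an edition (two modes only)."""
    mode = edition["tracking_mode"]
    params = edition["tracking_params"]
    if mode == "git_ref":
        # Plain git_ref editions ignore alternate-scoped builds entirely.
        return build.get("alternate_name") is None and build["git_ref"] == params["git_ref"]
    if mode == "alternate_git_ref":
        # Alternate-aware editions require both the ref and the name to match.
        return (
            build.get("alternate_name") == params["alternate_name"]
            and build["git_ref"] == params["git_ref"]
        )
    return False
```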

Response:

{
  "self_url": "https://docverse.../orgs/rubin/projects/pipelines/builds/01HQ-3KBR-T5GN-8W",
  "project_url": "https://docverse.../orgs/rubin/projects/pipelines",
  "id": "01HQ-3KBR-T5GN-8W",
  "status": "uploading",
  "git_ref": "main",
  "upload_url": "https://storage.googleapis.com/docverse-staging-rubin/__staging/01HQ...tar.gz?sig=...",
  "date_created": "2026-02-08T12:00:00Z"
}

The upload_url is a presigned PUT URL for the tarball. It points to either the staging bucket (if configured) or the publishing bucket’s __staging/ prefix. The client uploads the tarball via HTTP PUT to this URL, then signals upload complete via PATCH.

The content_hash field allows the server to verify tarball integrity after upload. The files field from the per-file upload model is removed – the file manifest is discovered during the unpack step.
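For reference, a client-side sketch of computing the content_hash value for the build creation request; the sha256: prefix follows the example payload above:

```python
import hashlib


def tarball_content_hash(data: bytes) -> str:
    """Compute the content_hash field for an uploaded tarball's bytes."""
    return "sha256:" + hashlib.sha256(data).hexdigest()
```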

The PATCH to signal upload complete updates the build status and enqueues background processing:

Request:

{
  "status": "uploaded"
}

Response:

{
  "self_url": "https://docverse.../orgs/rubin/projects/pipelines/builds/01HQ-3KBR-T5GN-8W",
  "queue_url": "https://docverse.../queue/jobs/0G4R-MFBZ-K7QP-5X",
  "status": "processing"
}

Queue jobs#

GET    /queue/jobs/:job                              → get job status (authenticated)

Queue jobs provide status tracking for background operations (build processing, edition updates, dashboard rendering). The job resource is identified by a Docverse-owned Crockford Base32 ID. See the Task queue design section for the QueueJob table schema and progress tracking implementation.

{
  "self_url": "https://docverse.../queue/jobs/0G4R-MFBZ-K7QP-5X",
  "id": "0G4R-MFBZ-K7QP-5X",
  "status": "in_progress",
  "kind": "build_processing",
  "build_url": "https://docverse.../orgs/rubin/projects/pipelines/builds/01HQ-3KBR-T5GN-8W",
  "date_created": "2026-02-08T12:00:00Z",
  "date_started": "2026-02-08T12:00:01Z",
  "date_completed": null,
  "phase": "editions",
  "progress": {
    "editions_total": 3,
    "editions_completed": [
      { "slug": "__main", "published_url": "https://pipelines.lsst.io/" }
    ],
    "editions_failed": [],
    "editions_in_progress": ["v2.x", "DM-12345"]
  }
}

Database models#

Docverse stores all state in a PostgreSQL database accessed through SQLAlchemy (async, via Safir’s database utilities). This section provides a centralized reference for the database schema. Individual sections describe the behavioral design around each table; this section focuses on column definitions and relationships.

Entity-relationship diagram#

erDiagram
    Organization ||--o{ Project : "has"
    Organization ||--o{ OrgMembership : "has"
    Organization ||--o{ organization_credentials : "has"
    Organization ||--o{ DashboardTemplate : "has"
    Organization ||--o{ QueueJob : "scoped to"
    Project ||--o{ Build : "has"
    Project ||--o{ Edition : "has"
    Project ||--o{ QueueJob : "scoped to"
    Project ||--o{ DashboardTemplate : "overrides"
    Build ||--o{ BuildObject : "inventories"
    Build ||--o{ EditionBuildHistory : "logged in"
    Build ||--o{ QueueJob : "tracked by"
    Edition ||--o{ EditionBuildHistory : "logged in"
    Edition }o--o| Build : "current build"

    Organization {
        int id PK
        string slug UK
        string title
        string base_domain
        enum url_scheme
        string root_path_prefix
        JSONB slug_rewrite_rules
        JSONB lifecycle_rules
        datetime date_created
        datetime date_updated
    }

    Project {
        int id PK
        string slug
        string title
        int org_id FK
        string doc_repo
        JSONB slug_rewrite_rules
        JSONB lifecycle_rules
        datetime date_created
        datetime date_updated
        datetime date_deleted
    }

    Build {
        int id PK
        string public_id
        int project_id FK
        string git_ref
        string alternate_name
        string content_hash
        enum status
        string staging_key
        int object_count
        bigint total_size_bytes
        string uploader
        JSONB annotations
        datetime date_created
        datetime date_uploaded
        datetime date_completed
        datetime date_deleted
    }

    Edition {
        int id PK
        string slug
        string title
        int project_id FK
        enum kind
        enum tracking_mode
        JSONB tracking_params
        int current_build_id FK
        bool lifecycle_exempt
        datetime date_created
        datetime date_updated
        datetime date_deleted
    }

    OrgMembership {
        UUID id PK
        int org_id FK
        string principal
        enum principal_type
        enum role
    }

    organization_credentials {
        UUID id PK
        int organization_id FK
        string label
        string service_type
        string encrypted_credential
        datetime created_at
        datetime updated_at
    }

    BuildObject {
        int id PK
        int build_id FK
        string key
        string content_hash
        string content_type
        bigint size
    }

    EditionBuildHistory {
        int id PK
        int edition_id FK
        int build_id FK
        int position
        datetime date_created
    }

    DashboardTemplate {
        int id PK
        int org_id FK
        int project_id FK
        string github_owner
        string github_repo
        string path
        string git_ref
        string store_prefix
        string sync_id
        datetime date_synced
    }

    QueueJob {
        int id PK
        int public_id
        string backend_job_id
        enum kind
        enum status
        string phase
        int org_id FK
        int project_id FK
        int build_id FK
        JSONB progress
        JSONB errors
        datetime date_created
        datetime date_started
        datetime date_completed
    }
    

Core domain tables#

These tables define the primary domain model for Docverse.

Organization#

The organization is the top-level resource and the sole infrastructure configuration boundary. All projects within an org share the same object store, CDN, root domain, URL scheme, and default dashboard templates. See Organizations design for the full behavioral design.

id (int): Primary key
slug (str, unique): URL-safe identifier (e.g., rubin, spherex)
title (str): Human-readable name
base_domain (str): Root domain for published URLs (e.g., lsst.io)
url_scheme (enum): subdomain or path_prefix; determines how project URLs are constructed
root_path_prefix (str): Path prefix for the path-prefix URL scheme (e.g., /documentation/)
slug_rewrite_rules (JSONB): Ordered list of edition slug rewrite rules (see Edition slug rewrite rules)
lifecycle_rules (JSONB): Default lifecycle rules for projects in this org (see Projects, editions and builds)
date_created (datetime): Creation timestamp
date_updated (datetime): Last modification timestamp

Infrastructure connections (object store, CDN, DNS, staging store) are configured through the organization_credentials table and additional org-level configuration fields.

Project#

A documentation site with a stable URL and multiple versions (editions). Projects belong to an organization and inherit its infrastructure and default configuration, with optional per-project overrides for slug rewrite rules and lifecycle rules. See Projects, editions and builds for the full behavioral design.

id (int): Primary key
slug (str): URL-safe identifier, unique within org (e.g., sqr-006, pipelines)
title (str): Human-readable name
org_id (FK → Organization): Owning organization
doc_repo (str): GitHub repository URL for the documentation source
slug_rewrite_rules (JSONB, nullable): When set, completely replaces the org-level rules for this project
lifecycle_rules (JSONB, nullable): When set, overrides the org-level lifecycle rules for this project
date_created (datetime): Creation timestamp
date_updated (datetime): Last modification timestamp
date_deleted (datetime, nullable): Soft-delete timestamp; null when active

Build#

A discrete upload of documentation content for a project. Builds are conceptually immutable after processing and carry metadata about their origin. Builds are identified externally with a Crockford Base32 ID. See Projects, editions and builds for the upload flow and processing pipeline.

id (int): Internal primary key
public_id (Crockford Base32): Externally-visible identifier (e.g., 01HQ-3KBR-T5GN-8W)
project_id (FK → Project): Owning project
git_ref (str): Git branch or tag that produced this build
alternate_name (str, nullable): Deployment/variant scope (e.g., usdf-dev); see Compound slug derivation for alternate-scoped builds
content_hash (str): SHA-256 hash of the uploaded tarball for integrity verification
status (enum): pending, uploading, processing, completed, failed
staging_key (str): Object store key for the uploaded tarball (e.g., __staging/{build_id}.tar.gz)
object_count (int): Number of files extracted from the tarball (populated during inventory)
total_size_bytes (bigint): Total size of extracted content (populated during inventory)
uploader (str): Username of the authenticated uploader
annotations (JSONB): Client-provided metadata about the build
date_created (datetime): When the build record was created
date_uploaded (datetime, nullable): When the client signaled upload complete
date_completed (datetime, nullable): When background processing finished
date_deleted (datetime, nullable): Soft-delete timestamp; null when active

Edition#

A named, published view of a project’s documentation at a stable URL. Editions are pointers — they represent a specific build’s content served at an edition-specific URL path (e.g., /v/main/, /v/DM-12345/). See Projects, editions and builds for tracking modes, edition kinds, and auto-creation behavior.

id (int): Primary key
slug (str): URL-safe identifier, unique within project (e.g., __main, DM-12345, v2.x)
title (str): Human-readable name for dashboards
project_id (FK → Project): Owning project
kind (enum): main, release, draft, major, minor, alternate; classifies the edition for dashboards and lifecycle rules
tracking_mode (enum): Determines which builds update this edition (see Projects, editions and builds for the full list)
tracking_params (JSONB): Mode-specific parameters (e.g., {"git_ref": "main"}, {"major_version": 2})
current_build_id (FK → Build, nullable): The build currently served at this edition’s URL; null before first build
lifecycle_exempt (bool): When true, this edition is never deleted by lifecycle rules
date_created (datetime): Creation timestamp
date_updated (datetime): Last update timestamp (changes when the build pointer moves)
date_deleted (datetime, nullable): Soft-delete timestamp; null when active

Supporting tables#

These tables support the core domain model with membership, credentials, object inventory, history tracking, dashboard templates, and background job management. Each is discussed in detail in its respective section.

OrgMembership#

Maps users and groups to roles within organizations. See Authentication and authorization for the full authorization model, role definitions, and resolution algorithm.

| Column | Type | Description |
| ------ | ---- | ----------- |
| id | UUID | Primary key |
| org_id | FK → Organization | The organization |
| principal | str | A username or group name |
| principal_type | enum | user or group |
| role | enum | reader, uploader, or admin |

organization_credentials#

Stores Fernet-encrypted credentials for organization infrastructure services (object stores, CDNs, DNS providers). See Organizations design for the encryption scheme, key rotation, and credential management.

| Column | Type | Description |
| ------ | ---- | ----------- |
| id | UUID | Primary key |
| organization_id | FK → Organization | Owning organization |
| label | str | Human-friendly name (e.g., “Cloudflare R2 production”) |
| service_type | str | Provider identifier (e.g., cloudflare, aws_s3, fastly) |
| encrypted_credential | str | Fernet token containing the encrypted credential value |
| created_at | datetime | Creation timestamp |
| updated_at | datetime | Last modification timestamp |

Unique constraint on (organization_id, label).

BuildObject#

Inventories every file extracted from a build’s tarball. Populated during the inventory phase of build processing. See Projects, editions and builds for how the inventory enables diff-based edition updates and orphan detection.

| Column | Type | Description |
| ------ | ---- | ----------- |
| id | int | Primary key |
| build_id | FK → Build | Owning build |
| key | str | Object store path (e.g., __builds/{build_id}/index.html) |
| content_hash | str | ETag or SHA-256 hash of the object content |
| content_type | str | MIME type (e.g., text/html, image/png) |
| size | bigint | Object size in bytes |

EditionBuildHistory#

Logs every build that an edition has pointed to, enabling rollback and orphan detection. See Projects, editions and builds for the rollback API and how lifecycle rules reference history position.

| Column | Type | Description |
| ------ | ---- | ----------- |
| id | int | Primary key |
| edition_id | FK → Edition | The edition |
| build_id | FK → Build | The build that was served |
| position | int | Ordering position (1 = most recent) |
| date_created | datetime | When this history entry was recorded |

DashboardTemplate#

Tracks dashboard template sources (GitHub repos) and their sync state. See Dashboard templating system for the template directory structure, sync flow, and rendering pipeline.

| Column | Type | Description |
| ------ | ---- | ----------- |
| id | int | Primary key |
| org_id | FK → Organization | Owning organization |
| project_id | FK → Project (nullable) | If set, this is a project-level override |
| github_owner | str | GitHub organization or user |
| github_repo | str | Repository name |
| path | str | Path within repo (default "" for root) |
| git_ref | str | Branch or tag to track |
| store_prefix | str (nullable) | Object store prefix for current synced template files |
| sync_id | str (nullable) | Current sync version identifier (timestamp-based) |
| date_synced | datetime (nullable) | Last successful sync timestamp |

Unique constraint on (org_id, project_id) — at most one template per org (where project_id is null) and one per project.

QueueJob#

Tracks all background jobs as the single source of truth for job state and progress. The queue backend (Arq/Redis) handles delivery; this table is the authoritative state store. See Task queue design for progress tracking, cross-job serialization, and operator queries.

| Column | Type | Description |
| ------ | ---- | ----------- |
| id | int | Internal primary key |
| public_id | int | Crockford Base32 serialized in API |
| backend_job_id | str (nullable) | Reference to the queue backend’s job ID (e.g., Arq UUID) |
| kind | enum | build_processing, edition_update, dashboard_sync, lifecycle_eval, git_ref_audit, purgatory_cleanup, credential_reencrypt |
| status | enum | queued, in_progress, completed, completed_with_errors, failed, cancelled |
| phase | str (nullable) | Current processing phase (e.g., inventory, tracking, editions, dashboard) |
| org_id | FK → Organization | Scoped to org for filtering |
| project_id | FK → Project (nullable) | Set for build/edition jobs |
| build_id | FK → Build (nullable) | Set for build processing jobs |
| progress | JSONB (nullable) | Structured progress data, phase-specific |
| errors | JSONB (nullable) | Collected error details |
| date_created | datetime | When the job was enqueued |
| date_started | datetime (nullable) | When a worker picked it up |
| date_completed | datetime (nullable) | When the job finished |

GitHub Actions action (docverse-upload)#

The docverse-upload action is a native JavaScript GitHub Action published to the GitHub Marketplace from the lsst-sqre/docverse-upload repository. It provides the same upload workflow as the Python client (see Client-server monorepo) but is purpose-built for GitHub Actions runners.

docverse-upload/                    # lsst-sqre/docverse-upload
├── action.yml
├── src/
│   └── index.ts
├── generated/
│   └── api-types.ts                # openapi-typescript output
├── openapi.json                    # pinned OpenAPI spec from docverse
├── package.json
├── tsconfig.json
└── dist/
    └── index.js                    # ncc-bundled output

Why native JavaScript instead of wrapping the Python client#

A composite action that installs Python and then calls docverse upload would work, but carries overhead:

  • Python setup cost: actions/setup-python adds 15–30 seconds to every job. Documentation builds that already set up Python pay this cost anyway — but for projects using other languages, or for workflows where the docs build runs in a separate job, the setup time is pure overhead.

  • Node 20 is guaranteed: GitHub Actions runners always have Node 20 available. A JavaScript action runs immediately with zero setup.

  • Native toolkit integration: the @actions/core toolkit provides first-class support for step outputs, job summaries, annotations, and failure reporting. Wrapping a CLI subprocess requires parsing its output to surface these features.

  • Independent versioning: the action is versioned via Git tags (v1, v1.2.0) following GitHub Actions conventions. It can release on its own cadence without coupling to the Python client’s PyPI release cycle.

Usage#

- name: Upload docs to Docverse
  uses: lsst-sqre/docverse-upload@v1
  with:
    org: rubin
    project: pipelines
    dir: _build/html
    token: ${{ secrets.DOCVERSE_TOKEN }}

Usage with PR comments enabled:

- name: Upload docs to Docverse
  uses: lsst-sqre/docverse-upload@v1
  with:
    org: rubin
    project: pipelines
    dir: _build/html
    token: ${{ secrets.DOCVERSE_TOKEN }}
    github-token: ${{ github.token }}

Inputs#

| Input | Required | Default | Description |
| ----- | -------- | ------- | ----------- |
| org | yes | | Organization slug |
| project | yes | | Project slug |
| dir | yes | | Path to built documentation directory |
| token | yes | | Gafaelfawr token for authentication |
| base-url | no | https://docverse.lsst.io | Docverse API base URL |
| git-ref | no | ${{ github.ref }} | Git ref (auto-detected from workflow context) |
| alternate-name | no | | Alternate name for scoped editions |
| wait | no | true | Wait for processing to complete |
| github-token | no | | GitHub token for posting PR comments with links to updated editions. Typically ${{ github.token }}. When omitted, PR commenting is disabled. |

Outputs#

| Output | Description |
| ------ | ----------- |
| build-id | The Docverse build ID |
| build-url | API URL of the created build |
| published-url | Public URL where the edition is served |
| job-status | Terminal queue job status |
| editions-json | JSON array of all updated editions with slugs and published URLs (e.g., [{"slug": "__main", "published_url": "https://pipelines.lsst.io/"}]) |

Pull request comments#

When github-token is provided, the action posts or updates a comment on the associated pull request summarizing the build and linking to all updated editions. This gives PR reviewers immediate, clickable access to staged documentation without navigating the Docverse API or dashboards.

PR discovery#

How the action finds the PR number depends on the workflow trigger event:

  • pull_request / pull_request_target events: the PR number is read directly from github.event.pull_request.number.

  • push events: the action queries the GitHub API (GET /repos/{owner}/{repo}/pulls?head={owner}:{branch}&state=open) to find open PRs for the pushed branch. If multiple PRs match, the action comments on all of them. If none match, the comment step is skipped silently.

  • Other events (workflow_dispatch, schedule): the comment step is skipped silently.
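The event-based discovery logic above can be sketched as follows — in Python for clarity, though the real action is TypeScript. `list_open_prs` is a hypothetical stand-in for the `GET /repos/{owner}/{repo}/pulls?head=…` API call:

```python
# Sketch of PR discovery by workflow trigger event (assumed helper names).
def discover_pr_numbers(event_name: str, payload: dict, list_open_prs=None) -> list[int]:
    """Return the PR numbers to comment on, or [] to skip commenting."""
    if event_name in ("pull_request", "pull_request_target"):
        # PR number is available directly in the event payload.
        return [payload["pull_request"]["number"]]
    if event_name == "push" and list_open_prs is not None:
        # Query for open PRs whose head is the pushed branch; comment on all.
        branch = payload["ref"].removeprefix("refs/heads/")
        return [pr["number"] for pr in list_open_prs(branch)]
    # workflow_dispatch, schedule, etc.: skip silently.
    return []
```

Returning an empty list (rather than raising) matches the "skipped silently" behavior for events with no PR context.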

Comment format#

The comment uses a Markdown table to list all updated editions with their published URLs:

<!-- docverse:pr-comment:rubin/pipelines -->
### Docverse documentation preview

| Edition | URL |
| ------- | --- |
| `__main` | https://pipelines.lsst.io/ |
| `DM-12345` | https://pipelines.lsst.io/v/DM-12345/ |

Build `01HQ-3KBR-T5GN-8W` processed successfully.

Edition data is extracted from the completed job’s editions_completed progress array (see Task queue design). For partial failures (job status completed_with_errors), successful editions appear in the main table; failed and skipped editions are listed in a collapsible <details> block below.

Comment deduplication#

A hidden HTML marker <!-- docverse:pr-comment:{org}/{project} --> at the top of the comment body identifies the comment, scoped by organization and project. On each build the action:

  1. Lists existing comments on the PR and searches for the marker.

  2. If found: updates the existing comment via PATCH /repos/{owner}/{repo}/issues/comments/{comment_id}.

  3. If not found: creates a new comment via POST /repos/{owner}/{repo}/issues/{pr_number}/comments.

Multi-project PRs (repositories that publish to multiple Docverse projects) get one comment per project, each independently updated.
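The marker-based upsert can be sketched as below — a Python illustration of the TypeScript logic, with `update` and `create` standing in for the PATCH and POST API calls:

```python
# Sketch of marker-based comment deduplication (helper callables assumed).
def upsert_pr_comment(comments, org, project, body, update, create):
    """Update the existing marked comment if present, else create one."""
    marker = f"<!-- docverse:pr-comment:{org}/{project} -->"
    full_body = f"{marker}\n{body}"
    for comment in comments:
        if comment["body"].startswith(marker):
            update(comment["id"], full_body)  # PATCH the existing comment
            return "updated"
    create(full_body)  # POST a new comment
    return "created"
```

Scoping the marker by org and project is what allows multi-project PRs to carry one independently-updated comment per project.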

Edge cases#

  • Job failed: the comment reports the failure status and build ID instead of an edition table.

  • No editions updated: the comment notes that no editions were updated and includes the build ID.

  • Partial failure (completed_with_errors): successful editions appear in the main table; failed and skipped editions are listed in a collapsible <details> block.

  • Token lacks permissions: the GitHub API returns 403; the action logs a warning via core.warning() but does not fail the step (the upload itself succeeded).

  • No PR context: the comment step is skipped silently; the build proceeds normally.

Permissions#

The github-token requires pull-requests: write permission. Workflows must declare this explicitly:

permissions:
  pull-requests: write

steps:
  - name: Upload docs to Docverse
    uses: lsst-sqre/docverse-upload@v1
    with:
      org: rubin
      project: pipelines
      dir: _build/html
      token: ${{ secrets.DOCVERSE_TOKEN }}
      github-token: ${{ github.token }}

Implementation details#

OpenAPI-driven TypeScript development#

The action’s TypeScript types are generated from the Docverse server’s OpenAPI spec, creating a cross-repo type safety chain:

  1. Pydantic models in the docverse monorepo’s client package define the API contract.

  2. FastAPI generates an OpenAPI spec from those models; monorepo CI publishes the spec as a versioned artifact.

  3. The docverse-upload repository pins a copy of the spec as openapi.json.

  4. openapi-typescript generates TypeScript types (generated/api-types.ts) from the pinned spec.

  5. openapi-fetch provides a type-safe HTTP client that uses those generated types.

When the API contract changes, a developer updates openapi.json in the action repository (either manually or via a Dependabot-style automation). Because the spec is committed, the diff in the pull request makes every schema change explicitly visible — field renames, added enum values, or new required properties are all reviewable before the action code is updated to match. This provides a deliberate review gate that catches unintended breaking changes before they ship.

Build and bundle#

The action is built with TypeScript and bundled into a single dist/index.js file using ncc. The bundled output is committed to the repository (standard practice for JavaScript GitHub Actions) so that the action runs without a node_modules install step. The action targets the Node 20 runtime.

Tarball creation and upload#

The action uses the Node.js tar package to create .tar.gz archives and computes a SHA-256 hash during creation. The tarball is uploaded to the presigned URL via the Fetch API.
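The create-and-hash step can be sketched in Python with the standard library (the real action uses the Node `tar` package and Node's crypto module; this is only an illustration of the approach):

```python
import hashlib
import io
import tarfile

def create_tarball_with_hash(files: dict[str, bytes]) -> tuple[bytes, str]:
    """Build a .tar.gz from {path: content} and return (tarball, sha256 hex).

    Illustrative only: the docverse-upload action does this in Node.js.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for path, content in files.items():
            info = tarfile.TarInfo(name=path)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    data = buf.getvalue()
    # The hex digest is what populates the Build record's content_hash column.
    return data, hashlib.sha256(data).hexdigest()
```

The server can recompute the hash of the staged tarball and compare it against the client-reported value for integrity verification.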

GitHub Actions integration#

The action uses the @actions/core toolkit for runner integration:

  • Step summary: on success, a Markdown summary is written to $GITHUB_STEP_SUMMARY showing the build ID, queue job status, and published URL.

  • Warning annotations: if the queue job completes with warnings (partial success), the action emits warning annotations visible in the workflow run UI.

  • Step failure: if the queue job fails, the action calls core.setFailed() with the failure reason, marking the step as failed.

  • Outputs: build ID, build URL, published URL, and job status are set as step outputs for downstream workflow steps to consume.

  • PR comments: when github-token is provided, posts a summary comment on the associated pull request with links to all updated editions (see Pull request comments).

Development workflow#

Action development#

The docverse-upload action uses a standard Node.js development workflow:

  • npm install — install dependencies.

  • npm run generate-types — regenerate generated/api-types.ts from openapi.json (runs openapi-typescript).

  • npm test — run unit tests with Vitest.

  • npm run build — compile TypeScript and bundle with ncc into dist/index.js.

Release workflow#

  • GitHub Action (docverse-upload repo): versioned via Git tags following GitHub Actions conventions (v1, v1.2.0). The v1 tag is a floating major-version tag updated on each minor/patch release.

Housing the action in its own repository means its v1/v1.2.0 tags reflect the action’s own input/output contract and release cadence, independent of the server or client Python packages.

Migration from LSST the Docs#

The Rubin Observatory LSST the Docs (LTD) deployment at lsst.io serves ~300 documentation projects for the Rubin Observatory software stack and technical notes. The current deployment runs LTD Keeper 1.23.0 with content stored in AWS S3, served through Fastly CDN, and uploaded via the lsst-sqre/ltd-upload GitHub Action and reusable workflows. The migration moves this deployment to Docverse, targeting Cloudflare R2 for object storage and Cloudflare Workers for the CDN edge.

The migration involves three concerns: data migration (moving object store content and database records), client migration (updating CI workflows that upload documentation), and a phased rollout that minimizes disruption.

Data migration#

The data migration moves documentation content from the LTD object store layout to the Docverse layout and seeds the Docverse database with project, edition, and build records derived from the LTD Keeper database.

Scope: Only builds that are currently referenced by active (non-deleted) editions are migrated. Historical builds that are not pointed to by any edition are discarded. This significantly reduces the data volume — most projects accumulate hundreds of builds over time, but only a handful of editions (and therefore builds) are active at any given moment.
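The scoping rule reduces to a set comprehension over edition records — a minimal sketch, assuming editions are dicts with `current_build_id` and `date_deleted` fields:

```python
def builds_to_migrate(editions: list[dict]) -> set:
    """Return the unique build IDs referenced by active (non-deleted) editions."""
    return {
        e["current_build_id"]
        for e in editions
        if e["date_deleted"] is None and e["current_build_id"] is not None
    }
```

Because several editions can point at the same build, the set deduplicates; builds referenced by no active edition are simply never selected.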

LTD vs. Docverse object store layout#

The LTD and Docverse object store layouts differ in both structure and semantics:

| Aspect | LTD layout | Docverse layout |
| ------ | ---------- | --------------- |
| Build storage | {product}/builds/{build_id}/{file} | {project}/__builds/{build_id}/{file} |
| Edition content | {product}/editions/{slug}/{file} (physical copy of build files) | No edition file copies — editions are pointers resolved at the CDN edge via KV lookup |
| Edition metadata | None in object store (stored in LTD Keeper database only) | {project}/__editions/{slug}.json (per-edition metadata for client-side JavaScript) |
| Staging | N/A (files uploaded individually via presigned URLs) | {project}/__staging/{build_id}.tar.gz (tarball staging area) |
| Dashboard | Static HTML at domain root | {project}/__dashboards/{slug}.html (template-rendered per edition) |

The key architectural difference is that LTD physically copies build files into edition paths on every edition update (the S3 copy-on-publish bottleneck described in Documentation hosting), while Docverse stores builds once and resolves editions to builds via edge KV lookups. This means the migration only needs to copy files for the builds themselves — edition content does not need to be duplicated.

Migration tool design#

The migration is implemented as a docverse migrate CLI command in the Docverse client package. The tool reads from the LTD Keeper database and source object store, and writes to the Docverse API and target object store.

sequenceDiagram
    participant CLI as docverse migrate
    participant LTD_DB as LTD Keeper DB
    participant S3_SRC as Source S3 Bucket
    participant API as Docverse API
    participant Store as Target Object Store
    participant KV as CDN Edge KV

    CLI->>LTD_DB: Query products, editions, builds
    LTD_DB-->>CLI: Product→Edition→Build mappings

    loop For each product
        CLI->>API: Create project (with org mapping)

        loop For each unique build referenced by active editions
            CLI->>S3_SRC: List objects at {product}/builds/{build_id}/
            S3_SRC-->>CLI: Object keys + metadata

            loop For each object in build
                CLI->>Store: Copy to {project}/__builds/{build_id}/{file}
            end

            CLI->>API: Register build (metadata, object inventory)
        end

        loop For each active edition
            CLI->>API: Create edition (slug, tracking mode, kind)
            CLI->>API: Set edition → build pointer
            API->>KV: Write edition→build mapping
        end
    end

    CLI->>API: Trigger dashboard renders
    

The migration tool proceeds in these steps:

  1. Query LTD Keeper database for all products and their active edition→build mappings. For each product, determine which builds are referenced by at least one non-deleted edition.

  2. Create Docverse projects via the API. Map LTD product slugs to Docverse project slugs (typically identical). Associate each project with the appropriate Docverse organization.

  3. Copy build objects for each unique build referenced by an active edition. Objects are copied from {product}/builds/{build_id}/ in the source S3 bucket to {project}/__builds/{build_id}/ in the target object store. The tool populates the BuildObject inventory table during this step by recording each object’s key, content hash, content type, and size.

  4. Create edition records in Docverse with mapped tracking modes (see below) and appropriate edition kinds. Create EditionBuildHistory entries linking each edition to its current build.

  5. Seed CDN edge data store by writing edition→build mappings to the KV store, enabling the CDN to resolve edition URLs to the correct build paths immediately.

  6. Render dashboards and metadata JSON by triggering dashboard render jobs for each migrated project.
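The object key rewrite performed in step 3 is a mechanical path translation. A minimal sketch, assuming the LTD product slug and Docverse project slug may differ:

```python
def rewrite_key(ltd_key: str, project: str) -> str:
    """Map an LTD build object key to its Docverse location.

    {product}/builds/{build_id}/{file} -> {project}/__builds/{build_id}/{file}
    """
    _product, segment, rest = ltd_key.split("/", 2)
    if segment != "builds":
        raise ValueError(f"not a build key: {ltd_key}")
    return f"{project}/__builds/{rest}"
```

Edition keys ({product}/editions/…) are deliberately rejected: edition content is never copied, since Docverse resolves editions to builds at the CDN edge.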

Tracking mode mapping#

LTD Keeper tracking modes map directly to their Docverse equivalents. The names have changed slightly (e.g., git_refs → git_ref), but the semantics are preserved:

| LTD Keeper mode | Docverse mode | Notes |
| --------------- | ------------- | ----- |
| git_refs | git_ref | Singular form; same behavior |
| lsst_doc | lsst_doc | Unchanged |
| eups_major_release | eups_major_release | Unchanged |
| eups_weekly_release | eups_weekly_release | Unchanged |
| eups_daily_release | eups_daily_release | Unchanged |

Edition kinds are inferred from the tracking mode and edition slug:

  • Editions with slug __main (or equivalent main branch tracking) → kind main

  • Editions with lsst_doc or eups_* tracking → kind release

  • All other editions → kind draft
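The mode mapping and kind inference together can be sketched as:

```python
# LTD Keeper tracking mode -> Docverse tracking mode (from the table above).
TRACKING_MODE_MAP = {
    "git_refs": "git_ref",
    "lsst_doc": "lsst_doc",
    "eups_major_release": "eups_major_release",
    "eups_weekly_release": "eups_weekly_release",
    "eups_daily_release": "eups_daily_release",
}

def infer_kind(slug: str, docverse_mode: str) -> str:
    """Infer the Docverse edition kind from its slug and tracking mode."""
    if slug == "__main":
        return "main"
    if docverse_mode == "lsst_doc" or docverse_mode.startswith("eups_"):
        return "release"
    return "draft"
```

Note this sketch keys the main-edition check on the __main slug alone; the "equivalent main branch tracking" case mentioned above would need an extra check in a real implementation.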

Rubin-specific notes#

The Rubin migration involves a cross-cloud transfer from AWS S3 to Cloudflare R2. R2 provides an S3-compatible API, so the migration tool can use standard S3 client libraries (boto3/aioboto3) for both source reads and target writes — no Cloudflare-specific SDK is needed.

Estimated data volume: Based on the ~300 LTD products and typical build sizes, the active build data (excluding historical builds not referenced by editions) is estimated at 50–200 GB. The migration tool should support concurrent transfers with configurable parallelism to complete within a reasonable time window (target: under 4 hours for the full corpus).

DNS cutover: The atomic switchover point for Rubin is the DNS change for *.lsst.io from Fastly to Cloudflare. Before the cutover, the migrated content can be verified on a test domain (e.g., *.lsst-docs-test.org). The DNS change is the point of no return — after this, all documentation traffic is served by the Cloudflare Workers stack described in Documentation hosting. Fastly configuration is retained (but not actively serving) for a rollback window.

Client migration#

Client migration updates the CI workflows that upload documentation builds. The key difference is that LTD uses per-file presigned URL uploads authenticated with LTD API tokens, while Docverse uses tarball uploads authenticated with Gafaelfawr tokens (see Projects, editions and builds and GitHub Actions action (docverse-upload)).

The following table summarizes the differences between the current ltd-upload action and the Docverse upload path:

| Aspect | ltd-upload (current) | docverse-upload / updated ltd-upload |
| ------ | -------------------- | ------------------------------------ |
| Authentication | LTD API username + password (repo or org secret) | Gafaelfawr token (org-level secret DOCVERSE_TOKEN) |
| Upload mechanism | Per-file presigned URLs from LTD API | Tarball upload to presigned URL from Docverse API |
| Project identifier | product input | project input (+ org input) |
| API endpoint | ltd-keeper.lsst.codes | docverse.lsst.io |
| Build registration | POST /products/{slug}/builds/ | POST /orgs/{org}/projects/{project}/builds |

Key insight: Many Rubin documentation projects do not call ltd-upload directly. Instead, they use the lsst-sqre/rubin-sphinx-technote-workflows reusable workflow (and similar reusable workflows for other document types), which internally calls ltd-upload. Updating the reusable workflow to use the Docverse upload path migrates all downstream projects with zero per-repo changes. Only projects with custom CI workflows that reference ltd-upload directly need individual updates.

Option A: Retrofit the composite action#

Update lsst-sqre/ltd-upload (or create a new docverse-upload action) to target the Docverse API. The action accepts a Gafaelfawr token and uploads to Docverse.

The product input maps to the Docverse project; a new org input specifies the organization (defaulting to rubin for the Rubin deployment). GitHub organization-level secrets (DOCVERSE_TOKEN) provide the Gafaelfawr token to all repositories in the org without per-repo configuration.

Reusable workflows like rubin-sphinx-technote-workflows absorb the interface change — they update their internal ltd-upload usage to pass the new inputs, and all downstream repositories that use the reusable workflow migrate automatically.

For repositories with custom workflows that reference ltd-upload directly, automated PRs update the workflow YAML (see below).

Pros:

  • No new service infrastructure to deploy or maintain.

  • Clear, explicit upgrade path — each repository’s workflow file shows which backend it uses.

  • Reusable workflow pattern covers the majority of Rubin repositories with zero per-repo changes.

  • Remaining custom-workflow repositories get automated PRs.

Cons:

  • Repositories with custom workflows need workflow file changes (but this is automatable).

  • The ltd-upload action name becomes a misnomer once it targets Docverse (mitigated by eventually deprecating it in favor of docverse-upload).

Option B: LTD API compatibility shim#

Deploy a compatibility service at ltd-keeper.lsst.codes that translates LTD API calls to Docverse API calls. Existing workflows continue to call the LTD API endpoints; the shim authenticates LTD credentials and forwards requests to Docverse using a service-level Gafaelfawr token.

Critical problem: upload format translation. The LTD upload flow works as follows:

  1. Client calls POST /products/{slug}/builds/ to register a build.

  2. LTD Keeper returns a list of per-file presigned S3 URLs.

  3. Client uploads each file directly to S3 using the presigned URLs — these uploads bypass the LTD Keeper API entirely.

  4. Client calls PATCH /builds/{id} to confirm the upload is complete.

The shim can intercept steps 1, 2, and 4, but cannot intercept step 3 because the client uploads directly to S3 using presigned URLs. Docverse expects a single tarball upload, not individual file uploads. To bridge this gap, the shim would need to either:

  • (a) Replace presigned S3 URLs with shim-hosted upload endpoints, making the shim a file proxy that receives every file, buffers them, assembles a tarball, and uploads it to Docverse. For a large Sphinx site with thousands of files, this turns the shim into a high-throughput file proxy handling gigabytes of traffic.

  • (b) Monitor the S3 bucket for uploaded files, detect when all files for a build are present, assemble a tarball, and upload it to Docverse. This is fragile (how to detect “all files uploaded”?) and adds significant latency.

Pros:

  • Zero workflow changes during the shim’s lifetime — existing ltd-upload calls work unmodified.

Cons:

  • The upload format translation (per-file → tarball) makes this architecturally complex. The shim is not a thin API translator — it is a stateful file proxy service.

  • Significant development cost for temporary infrastructure that will be decommissioned once migration is complete.

  • The shim must handle the full upload throughput of all documentation builds, adding an operational burden (monitoring, scaling, failure handling).

  • Development cost is never amortized — the shim is throwaway infrastructure.

Recommendation: Option A (retrofit the composite action)#

The shim’s upload format translation complexity is disproportionate to its benefit. What appears at first glance to be a thin API translator is actually a stateful file proxy service, because LTD’s per-file presigned URL upload pattern cannot be transparently mapped to Docverse’s tarball upload pattern without intercepting and buffering all file uploads.

Meanwhile, the reusable workflow pattern means that most Rubin repositories need zero workflow changes — updating rubin-sphinx-technote-workflows (and similar reusable workflows) migrates all downstream repositories automatically. The remaining repositories with custom workflows receive automated PRs. This makes Option A both simpler to implement and less disruptive in practice.

Workflow changes are prepared in advance (PRs opened, reusable workflow branches ready) and merged as part of the cutover maintenance window (see below).

Automated PR generation#

For repositories with custom workflows that reference ltd-upload directly, a migration script scans repositories, identifies ltd-upload usage, generates updated workflow YAML, and opens PRs:

  1. Use the GitHub API to list repositories in the lsst and lsst-sqre organizations.

  2. For each repository, search .github/workflows/*.yml files for references to lsst-sqre/ltd-upload.

  3. Skip repositories that use reusable workflows (these are migrated by updating the reusable workflow itself).

  4. Generate an updated workflow file that replaces the ltd-upload step with the new inputs.

  5. Open a PR with a standardized description explaining the migration.

Before (LTD):

- name: Upload to LSST the Docs
  uses: lsst-sqre/ltd-upload@v1
  with:
    product: pipelines
    dir: _build/html
  env:
    LTD_USERNAME: ${{ secrets.LTD_USERNAME }}
    LTD_PASSWORD: ${{ secrets.LTD_PASSWORD }}

After (Docverse via retrofitted action):

- name: Upload to Docverse
  uses: lsst-sqre/ltd-upload@v2
  with:
    org: rubin
    project: pipelines
    dir: _build/html
    token: ${{ secrets.DOCVERSE_TOKEN }}

Or, using the dedicated docverse-upload action directly:

- name: Upload to Docverse
  uses: lsst-sqre/docverse-upload@v1
  with:
    org: rubin
    project: pipelines
    dir: _build/html
    token: ${{ secrets.DOCVERSE_TOKEN }}
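The detection performed in step 2 of the migration script can be sketched with a simple pattern match — a minimal illustration; a real script would parse the YAML rather than grep it:

```python
import re

# Matches direct uses of the action, e.g. "uses: lsst-sqre/ltd-upload@v1".
LTD_UPLOAD_RE = re.compile(r"uses:\s*lsst-sqre/ltd-upload@v\d+")

def references_ltd_upload(workflow_yaml: str) -> bool:
    """Return True if a workflow file invokes the ltd-upload action directly."""
    return bool(LTD_UPLOAD_RE.search(workflow_yaml))
```

Repositories that only call a reusable workflow never match, which is exactly the step-3 skip condition: those are migrated by updating the reusable workflow itself.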

Migration phases#

The migration proceeds in four phases, each with a clear milestone and rollback strategy:

| Phase | Description | Key milestone | Rollback strategy |
| ----- | ----------- | ------------- | ----------------- |
| 0: Preparation | Deploy Docverse alongside LTD. Create organizations, configure object stores and CDN, provision Gafaelfawr tokens, validate with test projects. | Docverse deployed and validated with test projects | Remove Docverse deployment; no user impact (LTD unchanged) |
| 1: Data migration | Run docverse migrate for all products. Verify migrated content on a test domain. LTD remains authoritative for production traffic. | All active builds and editions migrated and verified | Discard migrated data and re-run; LTD still serving production |
| 2: Client preparation | Prepare all workflow changes without activating them: update ltd-upload/docverse-upload action, update reusable workflows on branches, open automated PRs for custom-workflow repos. Provision DOCVERSE_TOKEN org-level secrets. Test against Docverse staging. | All PRs open and tested; org secrets provisioned | Close PRs; no user impact (LTD unchanged) |
| 3: Cutover | Short maintenance window: run final data sync to capture builds since Phase 1, merge all workflow PRs and reusable workflow changes, switch *.lsst.io DNS from Fastly to Cloudflare. New CI builds now flow to Docverse. | Production DNS on Cloudflare, all repos uploading to Docverse | Revert DNS to Fastly and revert workflow merges; Fastly configuration retained for rollback window |

Phase 0 and Phase 1 proceed with no user-visible changes — LTD continues to serve all production traffic. Phase 2 is preparation — PRs are opened and tested but not merged, so LTD remains fully operational. Phase 3 is the atomic cutover within a short maintenance window. Documentation remains readable throughout (LTD serves until DNS propagates). The window only affects new uploads, which are briefly paused while workflow changes and DNS propagate. A final data sync at the start of Phase 3 ensures Docverse has all builds up to the cutover moment. Estimated window: 1–2 hours.

Risk mitigation#

| Risk | Impact | Likelihood | Mitigation |
| ---- | ------ | ---------- | ---------- |
| Data corruption during migration | Incorrect documentation served | Low | Verify migrated content against LTD originals using content hash comparison; test domain validation before DNS cutover |
| DNS cutover causes outage | Documentation unavailable | Low | Test DNS configuration in advance; retain Fastly configuration for rapid rollback; use low TTL during cutover window |
| Incomplete build migration | Missing pages or broken links | Medium | Migration tool validates object counts per build against LTD inventory; flag discrepancies for manual review |
| Gafaelfawr token provisioning issues | CI uploads fail | Medium | Provision and test org-level secrets during Phase 2 preparation; validate with test uploads before cutover |
| Builds in flight during cutover | Some CI builds fail or upload to wrong backend | Low | Announce maintenance window in advance; re-trigger any builds that fail during the cutover window |
| Reusable workflow update breaks downstream repos | CI failures across many repos | Low | Test reusable workflow changes against a representative sample of downstream repos before merging; reusable workflow versioning allows rollback |