#4175 closed enhancement (fixed)

Provide basic monitoring of critical services

Reported by: btlogy Owned by:
Priority: normal Milestone: undecided
Component: dev-infrastructure Version: n/a
Keywords: Cc: hacklschorsch
Launchpad Bug:

Description (last modified by btlogy)

Scope

AsIs: Some of the critical services powering the Tahoe-LAFS project (mainly this Trac instance) can become unavailable w/o any active member of the community being notified.

On many occasions, downtime has been reported by visitors reaching out on IRC (or elsewhere), asking whether someone with the proper access could take action.

ToBe: Implement a basic monitoring solution tracking the availability of the critical services and allowing relevant people to be notified as soon as one of them is detected as unavailable.

We are proposing to use Upptime to achieve this, and the end result can already be seen here.
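
To illustrate what such an availability check boils down to, here is a minimal Python sketch. It is not how Upptime is implemented. The function names, the 10-second timeout, and the rule that any HTTP status from 200 to 399 counts as "up" are assumptions made for this example only.

```python
import urllib.error
import urllib.request


def is_up(status_code):
    """Treat HTTP statuses 200-399 as 'up' (an assumed rule for this sketch)."""
    return status_code is not None and 200 <= status_code < 400


def probe(url, timeout=10.0):
    """Return the HTTP status code for url, or None if no response was received."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status (4xx/5xx).
        return error.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


def down_services(services, probe_fn=probe):
    """Return the names of services whose probe fails.

    services maps a display name to a URL; probe_fn is injectable so the
    logic can be exercised without network access.
    """
    return [name for name, url in services.items() if not is_up(probe_fn(url))]
```

Calling `down_services({"example": "https://example.org"})` (a placeholder URL) returns the names of the services that did not answer with a healthy status, which is the list a notification step would act on.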

Value

  • Contributors would be able to see past and ongoing downtimes.
  • Maintainers would be able to be notified to take corrective action earlier.
  • Statistics about the availability of the services will be publicly available and support future changes.

Requirements

  • Transfer the existing git repository (already provisioned with Upptime, CI and pages) from LeastAuthority to the Tahoe-LAFS org. on GH.
  • Reconfigure owner/org name where needed.
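
The owner/org reconfiguration is essentially a search-and-replace over the repository's configuration and documentation. A minimal sketch in Python, assuming the old name only needs to be swapped where it appears as a standalone token; the helper name and the sample text are hypothetical, not taken from the actual repository:

```python
import re


def replace_owner(text, old_owner, new_owner):
    """Replace old_owner with new_owner wherever it stands alone.

    A match must not be preceded or followed by a word character or a
    hyphen, so longer names that merely contain old_owner are left intact.
    """
    pattern = re.compile(r"(?<![\w-])" + re.escape(old_owner) + r"(?![\w-])")
    # A function is used as the replacement so new_owner is inserted literally.
    return pattern.sub(lambda _match: new_owner, text)


# Hypothetical configuration snippet, for illustration only.
sample = "owner: LeastAuthority\nrepo-url: https://github.com/LeastAuthority/some-repo\n"
updated = replace_owner(sample, "LeastAuthority", "tahoe-lafs")
```

Reviewing the resulting diff by hand is still advisable, since some references (for example historical notes) may need to keep the old name.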

Additional information

This enhancement is a very nice-to-have for the execution of the MoveOffTrac project, which plans to replace the issue tracking, wiki and web landing page solution, and hopefully improve their availability.

Change History (7)

comment:1 Changed at 2025-05-12T10:30:06Z by btlogy

  • Summary changed from Provide basic monitoring to Provide basic monitoring of critical services

comment:2 Changed at 2025-05-12T14:48:45Z by btlogy

  • Description modified (diff)

comment:3 Changed at 2025-05-14T08:59:43Z by btlogy

During the last N&B (13th of May), it was obvious (to me at least) that Tahoe-LAFS could use some basic monitoring: the Trac server was unresponsive for a few hours and no one in the call was aware of it.

I've demonstrated quickly how one could use the "Watch" feature (Watch > Custom > [x] Issues) to get notified by GH whenever one of the monitored services goes down.

Since no one voiced any concerns, I propose we move forward:

  1. initiate the transfer of the upptime repo. from LeastAuthority to the Tahoe-LAFS org. on GH.
  2. ask an org. owner to accept the transfer and re-establish me and Flo as maintainers
  3. push some commits to replace references to LeastAuthority with Tahoe-LAFS

I've asked Chris to help me with the 2nd step.

For the record: the monitoring covers Codeberg.org, but Tahoe-LAFS does not (yet) rely on this 3rd party. The purpose is to measure its availability before we decide whether to use it.

comment:4 Changed at 2025-05-15T06:32:00Z by btlogy

The repository has been transferred and the status page is live (in GH domain):

A few chores left:

  • references from the infrastructure repo.
  • document how to get notifications
  • plan how to handle the secrets in the long term (today a personal GH_PAT token requested by me and approved by chris) - this is likely a wider topic.

comment:5 Changed at 2025-05-15T11:48:13Z by hacklschorsch

I think this is very useful, thank you!

I proposed removing infrastructure that isn't ours to maintain, probably in the wrong place, here: https://github.com/tahoe-lafs/infrastructure-upptime/issues/372

comment:6 Changed at 2025-05-20T09:11:15Z by btlogy

Aside from some welcome future improvements, I do think the solution works as expected.

comment:7 Changed at 2025-05-20T09:11:26Z by btlogy

  • Resolution set to fixed
  • Status changed from new to closed