Opened at 2025-05-12T10:29:29Z
Closed at 2025-05-20T09:11:26Z
#4175 closed enhancement (fixed)
Provide basic monitoring of critical services
Reported by: | btlogy | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | undecided |
Component: | dev-infrastructure | Version: | n/a |
Keywords: | Cc: | hacklschorsch | |
Launchpad Bug: |
Description (last modified by btlogy)
Scope
AsIs: Some of the critical services powering the Tahoe-LAFS project (mainly this Trac instance) can become unavailable w/o any active member of the community being notified.
On many occasions, downtime has been reported by visitors reaching out on IRC (or elsewhere) to ask whether someone with the proper access could take action.
ToBe: Implement a basic monitoring solution tracking the availability of the critical services and allowing relevant people to be notified as soon as one of them is detected as unavailable.
We are proposing to use Upptime to achieve this, and the end result can already be seen here.
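For illustration only, the kind of availability check described above can be sketched in a few lines of Python. This is not how Upptime is implemented; the function names (`is_up`, `http_status`, `check_services`) and the rule that any 2xx/3xx response counts as "up" are assumptions made for this sketch, and the URLs in the usage note are placeholders.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, List


def is_up(status_code: int) -> bool:
    """Assumed rule for this sketch: 2xx and 3xx responses count as available."""
    return 200 <= status_code < 400


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a HEAD request to `url`."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions but still carry a code.
        return err.code


def check_services(
    urls: List[str],
    fetch_status: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Map each URL to True (up) or False (down or unreachable)."""
    results: Dict[str, bool] = {}
    for url in urls:
        try:
            results[url] = is_up(fetch_status(url))
        except OSError:
            # Covers DNS failures, refused connections and timeouts
            # (urllib.error.URLError is a subclass of OSError).
            results[url] = False
    return results
```

A periodic job could call `check_services(["https://example.org/"])` and notify maintainers for every entry that maps to `False`. Passing `fetch_status` as a parameter keeps the check testable without network access.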
Value
- Contributors would be able to see past and ongoing downtimes.
- Maintainers would be able to be notified to take corrective action earlier.
- Statistics about the availability of the services will be publicly available and support future changes.
Requirements
- Transfer the existing git repository (already provisioned with Upptime, CI and pages) from LeastAuthority to Tahoe-LAFS org. on GH.
- Reconfigure owner/org name where needed.
Additional information
This enhancement is a very nice-to-have for the execution of the MoveOffTrac project, which plans to replace the issue tracking, wiki and web landing page solutions and, hopefully, improve their availability.
Change History (7)
comment:1 Changed at 2025-05-12T10:30:06Z by btlogy
- Summary changed from Provide basic monitoring to Provide basic monitoring of critical services
comment:2 Changed at 2025-05-12T14:48:45Z by btlogy
- Description modified (diff)
comment:3 Changed at 2025-05-14T08:59:43Z by btlogy
comment:4 Changed at 2025-05-15T06:32:00Z by btlogy
The repository has been transferred and the status page is live (in GH domain):
A few chores left:
- references from the infrastructure repo.
- document how to get notifications
- plan how to handle the secrets in the long term (today it is a personal GH_PAT token requested by me and approved by Chris) - this is likely a wider topic.
comment:5 Changed at 2025-05-15T11:48:13Z by hacklschorsch
I think this is very useful, thank you!
I proposed removing infrastructure that isn't ours to maintain, probably in the wrong place, here: https://github.com/tahoe-lafs/infrastructure-upptime/issues/372
comment:6 Changed at 2025-05-20T09:11:15Z by btlogy
Aside from some welcome future improvements, I do think the solution works as expected.
comment:7 Changed at 2025-05-20T09:11:26Z by btlogy
- Resolution set to fixed
- Status changed from new to closed
During the last N&B (13th of May), it was obvious (to me at least) that Tahoe-LAFS could use some basic monitoring: the Trac server was unresponsive for a few hours and no one in the call was aware of it.
I quickly demonstrated how one could use the "Watch" feature (Watch > Custom > [x] Issues) to get notified by GH whenever one of the monitored services goes down.
And since no one had voiced any concerns, I propose we move forward:
I've asked Chris to help me with the 2nd step.
For the record: the monitoring covers CodeBerg.org, although Tahoe-LAFS does not (yet) rely on this third party. The purpose is to measure its availability before we decide whether to use it.