Incident handling

Handle consumer issues' consequences

Why

An issue in the consumer (see the comments of https://binnenland.atlassian.net/browse/OP-2869 for more details) occasionally causes the type of administrative units to be missing in Loket. This then blocks some processes between Loket and Kalliope. Some people receive emails warning that something is off, and we need to act on it.

Strategy

The strategy to temporarily fix this missing data in all the consumers at once, before we fix the consumers themselves, is to:

  1. Delete the type of the problematic admin unit in the producer's graph

  2. Run a healing, which as a side effect re-adds the type to the producer's graph and creates a delta file

  3. Let the consumers ingest the delta file, which re-adds the type to their database and fixes the issue

How

Until now I have been taking care of it, but as I'm going to be away soon, here's how you can fix it:

  1. @felix will share the URIs of the problematic besturen

  2. On the prod server (ssh root@organisaties.abb.vlaanderen.be), create a small migration. You can copy config/migrations/20231214114200-remove-type-in-producer-graph.sparql to a new migration file and update the URI(s) you need to fix

  3. Restart the migrations service so that the migration is applied, then manually run a healing:

drc restart migrations # Check migrations logs to wait until the migration ran
dr inspect app-organization-portal_delta-producer-bg-jobs-initiator-public_1 | grep "IPAddress"
curl -X POST <ip>/healing-jobs # <ip> is the IPAddress returned by the previous command
drc logs -f --tail=100 delta-producer-pub-graph-maintainer-public # It should heal, you'll see some `Hitting database http://virtuoso:8890/sparql with expensive query` console logs. It takes a while, 15min maybe
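
The IP lookup and the curl call above could be combined into one step. What follows is only a minimal sketch, not a tested procedure: it assumes `dr` is an alias for `docker`, and the function names `healing_url`, `container_ip` and `trigger_healing` are hypothetical helpers introduced here for illustration. The container name and the /healing-jobs path are taken from the commands above.

```shell
# Sketch only. Assumes `dr` is an alias for `docker`; all function names
# below are hypothetical helpers, not part of the app.

# Container name as used in the manual commands; override via the environment.
CONTAINER="${CONTAINER:-app-organization-portal_delta-producer-bg-jobs-initiator-public_1}"

# Build the healing endpoint URL from a container IP.
healing_url() {
  printf 'http://%s/healing-jobs\n' "$1"
}

# Print the container's IP address(es), one per line, via docker's Go-template output.
container_ip() {
  docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{"\n"}}{{end}}' "$1"
}

# Look up the first IP and POST to the healing endpoint.
trigger_healing() {
  ip="$(container_ip "$CONTAINER" | head -n 1)"
  if [ -z "$ip" ]; then
    echo "no IP address found for $CONTAINER" >&2
    return 1
  fi
  curl --fail -X POST "$(healing_url "$ip")"
}
```

After sourcing these functions on the server, calling `trigger_healing` once the migration has run would stand in for the inspect and curl steps. Checking the logs to confirm that the healing actually ran stays a manual step.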
