Emergency Response

Micro (Only for Legacy Clients): Unlimited
Basic: Unlimited
Standard: Unlimited
Enhanced: Unlimited

Overview

We offer Emergency Response services for our Core Coverage clients. When there is an “emergency flaw”, we promise to jump in and investigate within our agreed timeline.

We may or may not be able to fix the issue immediately or entirely within Core Coverage – some issues take longer or fall outside our scope.

Process

👉
When we start a new Core Coverage agreement, we should exchange emergency contact information with the client; that is the contact information both sides should use in the process below.

When we identify an emergency flaw

If someone on our team identifies an emergency flaw, they should alert the Strategist immediately by phone, using the emergency contact information on file.

If the Strategist cannot be reached, they should try the PM, then the lead developer, according to the hierarchy in the emergency contact information.

Once the Strategist is alerted, they should confirm the issue and reach out to the relevant people on our side to fix it.

If the issue turns out not to be an emergency or we need more time to fix it, it should be logged in Asana appropriately.

Once the resolution is in progress, the Strategist should alert the client using their emergency contact information. Try phone first, then text, Slack, and email, in that order. This alert should be relatively informal since time is of the essence. Highlight the specific issue that happened, what we believe caused it, what the steps to resolve it are, and what our timeline is.

When the client reports an emergency flaw

The client should reach out first to the Strategist. If they can’t reach the Strategist, they are advised to try the PM, then the lead dev.

Whoever is able to answer should triage the problem. If it is confirmed, they should reach out to the rest of the team as needed to resolve it. If the issue turns out not to be an emergency, it should be logged in Asana as appropriate.

Upon Resolution

💡
The general rule of thumb is: Detailed Issue + Proactive Solution Steps = Effective Communication.

After resolution, the Strategist should inform the client that we have solved the problem. The update should include the fixes put in place, the hours we spent doing so, and any expected ramifications or effects of the issue (e.g. sporadic outages, loss of data, or similar).

We should also prepare the client for any potential side effects moving forward - this includes, but is not limited to, slow build times, image inconsistencies, login issues, delayed form responses, etc. Anything we think is relevant to the client should be shared, but we should be careful not to unnecessarily confuse, concern, or panic our clients as a result.

The report should be informative without being potentially confusing - some clients will understand technical explanations best, while others need layman's terms, so we should err on the side of simpler is better.

We should also ensure we are empowering our clients with the relevant information to take any necessary steps - whether that is letting their stakeholders know about the issue or locking down email services or credit card details, there should be no reason for a client to be delayed or blocked on our end.

Please use the following template for incident reporting during emergency response situations.

Incident Report Template

Summary

THIS ISSUE IS NOW RESOLVED

At approximately 1:12AM EST on 11 March 2024, we experienced a 5-minute period of downtime for APC. This only impacted https://www.approcess.com - https://www.vletherapeutics.com was unaffected.

This downtime was detected by StatusCake and reported to Benny Rostron and Ryan Gibbons through email, Slack, and text message. It appears to be related to a gateway timeout or unresolved gateway.

Benny alerted the team to the issue at 8AM EST and is actively monitoring for further periods of downtime. The team has been tasked with investigating the earlier error in the meantime.

We resolved this issue at 9:30AM EST by updating the DNS settings - it transpired that one of the gateways was misfiring due to a config setting, and this has now been updated.
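For internal reference (not part of the client-facing report), here is a minimal sketch in Python, using the requests library, of the kind of HTTP check an uptime monitor like StatusCake performs. The URLs come from this report; the timeout value and status-code handling are illustrative assumptions, not our actual monitoring configuration.

```python
# Minimal sketch of the kind of HTTP uptime check a monitor like StatusCake
# runs. Not our real monitoring setup; the timeout and the status-code
# handling are illustrative assumptions.
import requests

SITES = ["https://www.approcess.com", "https://www.vletherapeutics.com"]

def check_site(url, timeout=10):
    """Return a short UP/DOWN status string for one URL."""
    try:
        resp = requests.get(url, timeout=timeout)
    except requests.exceptions.RequestException as exc:
        return f"DOWN ({type(exc).__name__})"
    if resp.status_code in (502, 504):
        # A 502/504 is how a gateway timeout / unresolved gateway surfaces.
        return f"DOWN (gateway error {resp.status_code})"
    return f"UP ({resp.status_code})"

for site in SITES:
    print(site, "->", check_site(site))
```

Running a spot check like this by hand during an incident gives us a quick second signal alongside the monitoring alerts.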

Resolution Steps

The team has been asked to investigate this as part of the Core Coverage agreement we have with APC. We are going to start by investigating the current DNS setup, as we recently had to resolve a DNS failure caused by the nameservers being reset to their defaults on GoDaddy - this may be the simplest explanation, and would be expected in this instance.

If the DNS setup is not the issue, then we should attempt to recreate the issue from our end. If this is not possible, then we should look to see if there were any unusual traffic spikes that occurred just before the downtime event - this could explain the downtime, and would help inform a fix.
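As an internal starting point for the DNS check described above, here is a minimal sketch in Python, using the dnspython package, of how we might confirm which nameservers the domain is currently delegated to. The domain comes from this report; the domaincontrol.com suffix is an assumption about GoDaddy's default nameservers, included only for illustration.

```python
# Minimal sketch (dnspython): list the nameservers a domain currently points
# at, and flag anything matching an assumed registrar-default pattern.
import dns.resolver  # pip install dnspython

DOMAIN = "approcess.com"
ASSUMED_GODADDY_DEFAULT = "domaincontrol.com."  # assumed default-nameserver suffix

answers = dns.resolver.resolve(DOMAIN, "NS")
for ns in sorted(str(rr.target) for rr in answers):
    flag = "  <-- looks like a registrar default" if ns.endswith(ASSUMED_GODADDY_DEFAULT) else ""
    print(ns + flag)
```

If the output shows registrar defaults instead of the nameservers we expect, that would match the earlier DNS failure and point to the same fix.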

Client Actions

APC confirmed at 9:15AM EST that they have not drafted or published any changes in the past 3 days that could have resulted in this downtime. As a result, Cantilever focused on the DNS investigation.

Hours Spent

We spent 1.5 hours on this issue in total. As this falls under the Core Coverage agreement, we have logged this time as Emergency Response.

Internal Debrief

After any emergency, we should carry out an internal debrief. This should be at least a Slack conversation, if not a full meeting for a major incident. We should go through the issue, our investigation, the resolution, and how we can avoid it in the future. This should be saved or documented (again, internally) so that if the client, or a client with the same or similar setup, runs into this issue again, we have a quick way to resolve it.