Pre-production Checklist

This checklist contains points that must be satisfied during implementation and verified prior to release.

Please note that all items in the design checklist that were verified at the end of the design phase must still be satisfied at release time (e.g. the design doc must be up to date, the SLOs must be consistent, …).

🔧 Maintainability

Maintainability affects team productivity and system availability. If maintainability is low, the team will take longer to change the service in production, which leads to a longer MTTR (Mean Time To Recover) when failures occur.

Check name

Short Description

Level C

Level B

Level A

Unit test

It has unit tests, and they run in a CI system.

✅️

✅️

✅️

High Test coverage

Its test coverage is over 80%.

✅️

✅️

Config in env-var

Its config can be overridden via environment variables.

✅️

✅️

dockerignore

It has a .dockerignore file to reduce the Docker image size.

✅️

✅️

No latest tag

Its Docker image tag is not latest or master.

✅️

✅️

Dependabot

Its dependencies are automatically updated.

✅️

✅️

Automated build

Its build process is automated (both the binary build and the Docker image build are in scope).

✅️

✅️

✅️

Automatic build

Its automated build process runs in a CI/CD system.

✅️

Automated deploy

Its deploy process is automated.

✅️

✅️

✅️

Automatic deploy

Its automated deploy process runs in a CI/CD system.

✅️

✅️

Gradual deploy

Its deploys can be rolled out gradually when desired.

✅️

Automated rollback

Its rollback process is automated.

✅️

✅️
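
As an illustration of the "Config in env-var" check, the following is a minimal sketch of defaults that can be overridden by environment variables. The `Config` fields, the default values, and the `APP_*` variable names are hypothetical and chosen only for this example; they are not prescribed by the checklist.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    # Hypothetical settings with built-in defaults.
    port: int = 8080
    log_level: str = "info"
    request_timeout_seconds: float = 3.0


def load_config(env=None) -> Config:
    """Build a Config from defaults, overriding any field set in the environment."""
    env = os.environ if env is None else env
    defaults = Config()
    return Config(
        port=int(env.get("APP_PORT", defaults.port)),
        log_level=env.get("APP_LOG_LEVEL", defaults.log_level),
        request_timeout_seconds=float(
            env.get("APP_REQUEST_TIMEOUT_SECONDS", defaults.request_timeout_seconds)
        ),
    )
```

Passing a plain dictionary instead of `os.environ` (for example `load_config({"APP_PORT": "9090"})`) keeps the override logic easy to unit test.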

📉 Observability

Observability affects team productivity and system availability. If observability is low, the team will take longer to notice and investigate problems in production, which makes MTTR (Mean Time To Recover) longer.

Check name

Short Description

Level C

Level B

Level A

Tracing

Its requests are traced by Datadog APM. See Configure Tracing.

✅️

✅️

✅️

Timeboard

Its Datadog Timeboard is created.

✅️

✅️

Screenboard

Its Datadog Screenboard is created.

✅️

GCP metrics

Its GCP projects are integrated with Datadog.

✅️

✅️

Actionable alert

Its Datadog Monitors are created, and their alerts are actionable.

✅️

✅️

✅️

Warning alert

Its warning alerts are sent to Slack or a ticket system instead of PagerDuty.

✅️

✅️

Critical alert

Its critical alerts are sent to PagerDuty.

✅️

✅️

OnCall rotation

It has a PagerDuty team, escalation policy, and schedules.

✅️

✅️

OnCall playbooks

It has OnCall playbooks.

✅️

Log to STDOUT

Its logs are output to STDOUT/STDERR.

✅️

✅️

✅️

Log as JSON

Its logs are emitted in container log format.

✅️

✅️

✅️

Log with annotation

Its logs have a Request ID annotation.

✅️

✅️

✅️

Profiling

It is profiled by GCP Stackdriver Profiler.

✅️

✅️

Error tracking

Its errors are tracked by Sentry.

✅️

✅️
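
The three logging checks (log to STDOUT, structured format, Request ID annotation) can be combined in one small setup. The sketch below uses only the Python standard library; the JSON field names (`severity`, `message`, `logger`, `request_id`) are assumptions made for illustration, so adjust them to whatever your log pipeline expects.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "severity": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            # request_id is attached per call via the `extra` argument;
            # it is None when the caller did not provide one.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str = "app", stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to STDOUT (or to `stream`)."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger
```

A call such as `build_logger().info("order created", extra={"request_id": "req-123"})` then emits one JSON line carrying the request ID.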

✈️ Reliability

Reliability affects availability and productivity. If reliability is low, the system will break down often and the team will have to spend time fixing it, which reduces both system availability and team productivity.

Check name

Short Description

Level C

Level B

Level A

Manual Scale

It can be manually scaled horizontally to handle changes in workload.

✅️

Auto Scale

It automatically scales horizontally to handle fluctuating workloads. Its HPA is set as described in the Resource Requests and Limits documentation, and it can be scaled manually if needed.

✅️

✅️

CPU req/limit

Its CPU limit and request are set as described in the Resource Requests and Limits documentation.

✅️

✅️

✅️

Memory req/limit

Its memory request value is the same as its limit value.

✅️

✅️

✅️

Capacity planning

It can handle the expected load: either a load test has been performed, or the expected traffic is under control (e.g., by Gateway).

✅️

✅️

Zero downtime deploy

Its deploy process does not cause service degradation or downtime (e.g. error rate does not increase during deploy).

✅️

✅️

Graceful shutdown

It can stop gracefully.

✅️

✅️

Graceful degradation

It keeps working, at least partially, while dependencies (e.g. other services or databases) are partially or completely unavailable.

✅️

✅️

PreStop

It has a preStop hook. See Configure PreStop.

✅️

✅️

PDB

It has a PodDisruptionBudget set as described in Configure Pod Disruption Budget.

✅️

✅️

✅️

Liveness Probe

It has a health check endpoint for the liveness probe, and the liveness probe is configured. See Configure Liveness Probe.

✅️

✅️

✅️

Readiness Probe

It has a health check endpoint for the readiness probe, and the readiness probe is configured.

✅️

✅️

Timeout

It sets an appropriate timeout for requests over a network.

✅️

✅️

Smart retry

It performs smart retries when interacting with dependencies (e.g. other services or databases).

✅️
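
The checklist does not define "smart retry". One common reading is a bounded number of attempts, retrying only failures known to be transient, with exponential backoff and jitter between attempts. The sketch below follows that reading; `TransientError`, the parameter names, and the default values are hypothetical and are not part of the checklist.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def retry_with_backoff(call, max_attempts=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, rng=random.random):
    """Call `call()` until it succeeds, retrying only on TransientError.

    The wait before retry n (1-based) is drawn uniformly from
    [0, min(max_delay, base_delay * 2 ** (n - 1))], often called full jitter.
    Any other exception, or the last TransientError, is raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the failure
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * cap)
```

The `sleep` and `rng` parameters exist only so the behaviour can be tested without real waiting. Each individual call should still carry its own network timeout, as required by the Timeout check.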

🔒️ Security

If security is low, customer and company data may be stolen or tampered with (data breaches).

Check name

Short Description

Level C

Level B

Level A

Security review

It has completed the security design review by the security team.

✅️

✅️

Non-root user

Its Docker container runs as a non-root user.

✅️

✅️

✅️

Secrets

Its sensitive configuration is stored in Kubernetes secrets.

✅️

✅️

✅️

Non-sensitive log

It does not write sensitive information to app logs (STDOUT/STDERR).

✅️

✅️

✅️
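
One way to approach the "Non-sensitive log" check is to mask known sensitive fields before a structured event reaches the logger. The following is a minimal sketch under that assumption; the `SENSITIVE_KEYS` set and the `[REDACTED]` placeholder are hypothetical, and a key-based filter like this does not catch secrets embedded in free-text messages.

```python
# Hypothetical list of field names that must never appear in logs.
SENSITIVE_KEYS = {"password", "token", "authorization", "credit_card"}


def redact(value):
    """Return a copy of `value` with sensitive dict entries masked, recursively."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value
```

The function builds new containers instead of mutating its input, so the original event can still be used by the request handler.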

📋️ Accessibility

Accessibility affects team and organization productivity. If accessibility is low, people both inside and outside the team will have difficulty finding information about the microservice, which reduces organization productivity.

Check name

Short Description

Level C

Level B

Level A

Design Doc

Its design doc is up to date with the implementation.

✅️

✅️

✅️

Description

It has a service description.

✅️

✅️

✅️

Contact

It has contact info about the owners.

✅️

✅️

✅️

Source repo

It has links to the source repo.

✅️

✅️

✅️

Docs

It has links to docs for users.

✅️

✅️

✅️

SLOs

Its dashboard shows SLOs.

✅️

✅️

✅️

📁 Data Storage

Checks for services that have one or more data stores (e.g. databases, blob storage, …). In addition to these checks, please refer to the service-specific sections below.

Check name

Short Description

Level C

Level B

Level A

Data Replication

Its data is replicated to BigQuery (if required).

✅️

✅️

Minimal Operator Privileges

Personnel have minimal access privileges, and all access is auditable.

✅️

✅️

Recovery

It can be recovered from backup; the procedure has been defined and tested.

✅️

✅️

Fast Recovery

It can be recovered from backup in less than 2 hours; the procedure is described in the OnCall playbook, and it is practiced every 6 months.

✅️

PIT Recovery

Point-in-time recovery from backup can be completed in less than 2 hours.

✅️

Timeboard

Its GCP databases have a Datadog Timeboard.

✅️

✅️

GCP Cloud SQL (MySQL)

Checks specific to services using GCP Cloud SQL (MySQL).

Check name

Short Description

Level C

Level B

Level A

Maintenance Window

Its databases have a defined maintenance window (during core hours).

✅️

✅️

✅️

Regional HA

Its databases have regional HA enabled.

✅️

✅️

Read Replicas

Its databases have one or more read replicas, and it uses them for reads that do not need strict consistency.

✅️

Missing Master

It keeps correctly serving idempotent requests with no side effects when the master is unavailable (e.g. by sending all reads to the read replicas and returning an internal error to all other requests).

✅️

Failover Tests

It keeps running with minimal disruption during, and fully recovers after, database maintenance or a failover.

✅️

✅️

Operational Guidelines

Its databases are in compliance with the Cloud SQL Operational guidelines, so that they do not fall outside the Cloud SQL SLA.

✅️

✅️

✅️

Automatic Backups

Its databases have automatic backups enabled.

✅️

✅️

✅️

Automatic Storage Increase

Its databases have automatic storage increase enabled.

✅️

✅️

Replication Lag

Alerts should be sent if replication lag (Seconds Behind Master in Stackdriver) is >300s.

✅️

✅️

CPU

CPU usage of each instance (including replicas) should be <50% during peak load, and alerts should be sent if it increases to >80%.

✅️

✅️

✅️

Minimal Data Privileges

It has one or more dedicated MySQL users (not root) that have only the bare minimum set of required privileges (e.g. only SELECT and INSERT, but no UPDATE, DELETE, or any DDL/admin privileges). If the service has both admin and non-admin endpoints, they should use different users with different permissions.

✅️

✅️
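
The "Read Replicas" and "Missing Master" checks together imply a routing decision for each database operation. The sketch below shows one possible shape of that decision, following the example given in the Missing Master row. The function name, its parameters, and `PrimaryUnavailableError` are hypothetical; real services usually make this choice inside their database client or connection pool.

```python
class PrimaryUnavailableError(Exception):
    """Hypothetical error mapped to an internal error response by the caller."""


def choose_backend(is_read: bool, needs_strict_consistency: bool,
                   primary_available: bool) -> str:
    """Return "primary" or "replica" for a single database operation.

    - Primary up: reads that tolerate stale data go to a replica,
      everything else goes to the primary.
    - Primary down: all reads go to a replica, and every other
      operation fails fast instead of waiting on the primary.
    """
    if primary_available:
        if is_read and not needs_strict_consistency:
            return "replica"
        return "primary"
    if is_read:
        return "replica"
    raise PrimaryUnavailableError("primary is unavailable; refusing non-read operation")
```

Note that with the primary down, reads that normally require strict consistency are served from a replica and may observe stale data; whether that is acceptable is a per-endpoint decision.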

GCP Cloud Spanner

Checks specific to services using GCP Cloud Spanner.

Check name

Short Description

Level C

Level B

Level A

Regional Configuration

If it is a service deployed in a single region, its databases are in regional configuration and are deployed in the same region.

✅️

✅️

✅️

Global Configuration

If it is a service deployed in multiple regions, its databases are in multi-regional configuration and they are deployed in the same regions.

✅️

✅️

SLA Exclusions

Its databases are in compliance with the SLA exclusions, so that they do not fall outside of the Cloud Spanner SLA.

✅️

✅️

Automatic Backups

Its databases have scheduled automatic backups.

✅️

✅️

✅️

CPU

CPU usage of each node is monitored and alerts are sent if it is >65% (or >45% for multi-regional instances).

✅️

✅️

✅️

Disk usage

Disk usage of each node is monitored and alerts are sent if it is >75%.

✅️

✅️

✅️

Sessions

Number of sessions on each database+node is monitored and alerts are sent if it is >7500.

✅️

✅️
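
The Cloud Spanner thresholds in the table above (CPU, disk usage, sessions) can be summarized as a single evaluation function. This is only an illustration of the threshold logic, with hypothetical function and parameter names; in practice these conditions would be expressed as monitors in the alerting system and not in application code.

```python
def spanner_alerts(cpu_percent: float, disk_percent: float, sessions: int,
                   multi_regional: bool = False) -> list:
    """Return the names of the thresholds exceeded by one node's metrics."""
    # CPU threshold from the table: >65%, or >45% for multi-regional instances.
    cpu_limit = 45 if multi_regional else 65
    alerts = []
    if cpu_percent > cpu_limit:
        alerts.append("cpu")
    if disk_percent > 75:
        alerts.append("disk")
    if sessions > 7500:
        alerts.append("sessions")
    return alerts
```

For example, a node at 50% CPU raises no alert in a regional instance but does in a multi-regional one, because the CPU threshold drops from 65% to 45%.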