Capability

DC Failover

When the primary DC fails, the DNS answer is updated automatically and traffic moves to a healthy data center without human intervention.

TR7 DC Failover ties data center health directly to DNS responses. Health scenarios defined for each DC monitor access, internet reachability, WAN, LAN and maintenance status; when a DC becomes unhealthy, its records are automatically removed from the DNS answer. In this model, failover is not a zone-file edit, a manually triggered script or a midnight operator call. When a health check (HC) state changes, the associated scenario is re-evaluated, the bound DNS records are re-rendered and clients are guided to healthy targets according to TTL behavior.

Primary, secondary, tertiary or longer DC chains can be configured. During planned maintenance, a DC can be deliberately taken offline through maintenance mode. In disaster-recovery scenarios, DR records can be activated only when specific conditions are met.

The result: TR7 GTM removes the gap between the monitoring system and the DNS system, unifying health scenario evaluation, DNS response rendering, manual cutover and failback protection into a single decision pipeline.

5 automatic health check types per DC: WAN, LAN, access, internet, maintenance
3 s dynamic config regeneration debounce window
N-DC data center priority chain with no theoretical limit

When DC failover is managed through manual DNS changes, RTO is limited by human speed.

In the traditional primary/backup DNS model, a data center outage is detected, the operations team receives an alert, the zone record is updated, the service is reloaded and clients wait for the new DNS answer to propagate. This chain looks straightforward in a runbook; in a real incident, the delays introduced by decision-making, access control, approval and execution stretch RTO considerably.

In many organizations, health checking and DNS operate as separate systems. A monitoring tool sees that a DC is unreachable, but the DNS server continues to answer with the same IP addresses. The bridge between them is typically a script, a manual runbook or a separate automation layer. That gap becomes the weakest link at the moment of failover.

Failback carries equal risk. If a DC bounces in and out quickly, the DNS answer can flip repeatedly — clients are scattered across different data centers and traffic may return before state synchronization is complete. A simple "remove when down, re-add when up" logic is not enough.

The correct model evaluates DC health through boolean scenario logic, reduces flap risk with consecutive success/failure thresholds and makes the DNS response the natural output of that decision. The same model must also cover manual cutover for planned maintenance, a fail-safe answer when all DCs are unhealthy and DR conditions.

TR7 DC Failover delivers this model: it automatically refreshes the DNS answer when a DC health scenario changes and ties the entire failover process to DNS TTL and operator-defined health parameters.

Our approach

TR7 implements DC failover decisions through health scenarios, boolean condition logic, flap protection and a manual cutover mechanism.

Health scenario directly governs DNS response

When the health check state for a DC changes, the associated scenario is re-evaluated. If the scenario result changes, the relevant DNS records are regenerated and the unhealthy DC is removed from the answer.

Boolean conditions can model complex DC health decisions

Condition groups combine with AND logic; groups combine with OR logic. A negative condition can also be defined for each health check, enabling inverse scenarios such as "activate this record when this check is unhealthy."

Stuck-state protection reduces failback oscillation

While a DC is in transition, the previous evaluation result can be preserved. This behavior helps prevent short-lived up/down fluctuations from continuously changing the DNS answer.

Maintenance mode provides manual cutover for planned downtime

During planned maintenance, an operator can take a DC offline with maintenance mode. Even if the DC appears healthy, it can be excluded from the DNS answer so traffic is directed to another DC.

Capabilities

DC Failover is the GTM failover layer that automatically manages DNS responses across multiple data centers based on health state.

N-DC priority chain supports primary, secondary and tertiary flows

TR7 can evaluate DC records as a priority chain ordered by array position. When the primary DC is unhealthy, the secondary takes over; when the secondary is also unhealthy, the tertiary steps in — and longer chains are equally supported. The code model is not theoretically limited to two endpoints. This structure simplifies multi-stage continuity design in financial, government and large-scale SaaS environments.
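The priority chain described above can be pictured as a walk over an ordered list. The following is a minimal sketch, not TR7's implementation: the `DataCenter` class, the `pick_active_dc` function and the documentation-range IP addresses are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataCenter:
    # Hypothetical model of one entry in the DC chain.
    name: str
    address: str
    healthy: bool


def pick_active_dc(chain: List[DataCenter]) -> Optional[DataCenter]:
    """Return the first healthy DC in array order, or None if all are down."""
    for dc in chain:
        if dc.healthy:
            return dc
    return None


# Array position encodes priority: dc1 is primary, dc2 secondary, dc3 tertiary.
chain = [
    DataCenter("dc1", "192.0.2.10", healthy=False),
    DataCenter("dc2", "198.51.100.10", healthy=False),
    DataCenter("dc3", "203.0.113.10", healthy=True),
]
active = pick_active_dc(chain)
print(active.name if active else "no healthy DC")  # prints "dc3"
```

Because the function only iterates the list, adding a fourth or fifth DC is a matter of appending another entry.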

Five automatic health check types are available per DC

TR7 can evaluate health signals at DC level: wanAccess, lanAccess, access, internet and maintenanceMode. WAN reachability, LAN reachability, general access state, internet access and manual maintenance status are each modeled separately. A DC is therefore assessed across multiple access dimensions, not just a single ping result. The DNS answer reflects a more realistic picture of DC health.

Consecutive success and failure thresholds reduce flap risk

requiredSuccess and requiredFailure determine how many consecutive results are needed before a DC is declared up or down. This model prevents unnecessary DNS changes caused by transient packet loss, brief network interruptions or momentary service slowdowns. Operators can use tighter thresholds for critical services and more tolerant ones for noisier links. RTO is planned together with these thresholds and the check interval.
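One way to picture the consecutive-threshold logic is a small counter that only flips the DC state after enough back-to-back results contradict it. This is a sketch under assumptions: the `ThresholdTracker` class and its method names are hypothetical, and only the `requiredSuccess` and `requiredFailure` parameters come from the text above.

```python
class ThresholdTracker:
    """Flip the up/down state only after N consecutive contradicting results."""

    def __init__(self, required_success: int, required_failure: int, up: bool = True):
        self.required_success = required_success  # mirrors requiredSuccess
        self.required_failure = required_failure  # mirrors requiredFailure
        self.up = up
        self._streak = 0  # consecutive results that disagree with the current state

    def record(self, success: bool) -> bool:
        """Feed one check result and return the (possibly updated) state."""
        if success == self.up:
            # The result agrees with the current state, so any streak is broken.
            self._streak = 0
            return self.up
        self._streak += 1
        needed = self.required_success if success else self.required_failure
        if self._streak >= needed:
            self.up = success
            self._streak = 0
        return self.up


tracker = ThresholdTracker(required_success=2, required_failure=3)
results = [False, False, True, False, False, False, True, True]
states = [tracker.record(r) for r in results]
print(states)
```

In this run the first two failures are absorbed because a success interrupts the streak; the DC is only declared down after three failures in a row, and it returns to up after two consecutive successes.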

backupBehavior modes control passive DC behavior

noResponse mode keeps a passive DC silent under normal conditions. onlyNew mode can prevent a DC that has been down for a long time from answering with stale data when it comes back up. This behavior ensures that during failover, only DCs in the correct state produce DNS answers — not merely those that are reachable. It is an important protection layer in environments where stale-data risk is a concern.

DR mode activates disaster-recovery records conditionally

Per-record DR mode allows specific records to become active only when a DR condition is met. The drCond scenario or drIfNoRecords flag triggers the DR record when primary and secondary targets are exhausted. This model keeps remote disaster-recovery IP addresses out of normal DNS answers while holding them on standby for critical situations. The DR strategy becomes controlled at the DNS level.
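The conditional activation of DR records can be sketched as a filter applied while the answer is rendered. The example below is an illustration only: the `Record` class and `render_answer` function are hypothetical, `dr_cond_met` stands in for the result of a drCond scenario, and `dr_if_no_records` stands in for the drIfNoRecords flag. The exact semantics in TR7 may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    address: str
    healthy: bool = True
    dr: bool = False                # assumption: record is marked as a DR-mode record
    dr_cond_met: bool = False       # assumption: result of the record's drCond scenario
    dr_if_no_records: bool = False  # assumption: activate when no normal record is left


def render_answer(records: List[Record]) -> List[str]:
    """Return normal healthy records, plus DR records whose condition is met."""
    normal = [r.address for r in records if not r.dr and r.healthy]
    dr = [
        r.address
        for r in records
        if r.dr and (r.dr_cond_met or (r.dr_if_no_records and not normal))
    ]
    return normal + dr


records = [
    Record("192.0.2.10", healthy=False),            # primary, down
    Record("198.51.100.10", healthy=False),         # secondary, down
    Record("203.0.113.10", dr=True, dr_if_no_records=True),  # remote DR site
]
print(render_answer(records))  # prints ['203.0.113.10']
```

While at least one normal record is healthy, the DR address stays out of the answer in this sketch.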

FailSafe response provides a last resort when all DCs are unhealthy

If no DC is healthy, a response can be generated from the fallbackRecords array. These records can point to a maintenance page, a static emergency endpoint or an alternative recovery service. FailSafe behavior ensures DNS produces a controlled last-resort answer instead of returning nothing. Operators define these records according to their organization's crisis plan.
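As a rough sketch of the fail-safe idea, the answer can fall back to a predefined list whenever the healthy set is empty. The `resolve` function and the dictionary layout are hypothetical; only the `fallbackRecords` concept comes from the text above.

```python
from typing import Dict, List


def resolve(
    dc_records: Dict[str, List[str]],
    dc_health: Dict[str, bool],
    fallback_records: List[str],
) -> List[str]:
    """Return addresses of healthy DCs, or the fallback list if none is healthy."""
    answer = [
        address
        for dc, addresses in dc_records.items()
        if dc_health.get(dc, False)  # assumption: unknown DCs count as unhealthy
        for address in addresses
    ]
    return answer or list(fallback_records)


dc_records = {"dc1": ["192.0.2.10"], "dc2": ["198.51.100.10"]}
fallback = ["203.0.113.99"]  # e.g. a static maintenance page

print(resolve(dc_records, {"dc1": False, "dc2": True}, fallback))   # ['198.51.100.10']
print(resolve(dc_records, {"dc1": False, "dc2": False}, fallback))  # ['203.0.113.99']
```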

State persistence preserves evaluation continuity across restarts

TR7 can store local health check and scenario state data at the file level. After a restart or service reload, the previous state is restored so evaluation does not begin from scratch. This approach reduces unnecessary oscillation in failover decisions during a transient restart. It is especially useful for maintaining consistency during maintenance operations that restart the GTM service.
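File-level state persistence can be illustrated with a save-and-restore pair. This is a generic sketch, not TR7's storage format: the JSON layout, the file name and the `save_state` and `load_state` helpers are assumptions made for the example.

```python
import json
import os
import tempfile


def save_state(path: str, state: dict) -> None:
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
    os.replace(tmp_path, path)


def load_state(path: str) -> dict:
    """Return the saved state, or an empty dict when nothing usable exists."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


with tempfile.TemporaryDirectory() as workdir:
    state_file = os.path.join(workdir, "gtm-state.json")  # hypothetical file name
    save_state(state_file, {"dc1": {"up": False, "streak": 2}})
    restored = load_state(state_file)  # what a restarted service would read back
    print(restored)
```

Writing to a temporary file and renaming it keeps a crash mid-write from leaving a half-written state file behind.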

DC reachability is verified through multiple WAN and LAN targets

wanAccess and lanAccess target lists can be defined per DC. Multiple access targets give a more accurate picture of a DC's external and internal reachability. A transient issue with a single target does not necessarily mark the entire DC as down. This structure enables more comprehensive modeling of data center health.
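A simple way to read "multiple targets" is a quorum: an access dimension counts as reachable when at least a minimum number of its targets respond. The sketch below assumes that policy; the `dimension_reachable` function, the `min_reachable` parameter and the injected `probe` callable are hypothetical, and the actual aggregation rule in TR7 is not stated in the text.

```python
from typing import Callable, Sequence


def dimension_reachable(
    targets: Sequence[str],
    probe: Callable[[str], bool],
    min_reachable: int = 1,
) -> bool:
    """Return True when at least min_reachable targets answer the probe."""
    reachable = sum(1 for target in targets if probe(target))
    return reachable >= min_reachable


# A fake probe for the example; a real one would ping or open a connection.
responding = {"192.0.2.1": True, "192.0.2.2": False, "192.0.2.3": True}
wan_targets = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]  # stand-in for a wanAccess list

print(dimension_reachable(wan_targets, lambda t: responding.get(t, False)))
```

With the default quorum of one, the single failing target in this example does not mark the WAN dimension as down.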

Manual cutover enables controlled traffic transfer during planned maintenance

When maintenanceMode is activated, the relevant DC is deliberately taken offline. This is useful during patches, maintenance windows, migrations or controlled DR tests. The operator can remove the DC from the DNS answer, even when it is healthy, and redirect traffic to another DC. When maintenance is complete, the mode is disabled and normal evaluation resumes.

Status enumeration classifies DC failures more clearly

DC state can be expressed as ok, noInternet, noAccess, noWan or noLan. This classification shows which access dimension is problematic rather than just saying "down." Operations teams can distinguish internet egress, WAN reachability and LAN reachability issues more quickly. The reason behind a failover decision becomes more readable.
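The status values listed above map naturally onto an enumeration. In the sketch below, the enum values are taken from the text, while the `DcStatus` and `classify` names and the order in which the checks are evaluated are assumptions; TR7 may prioritize the dimensions differently.

```python
from enum import Enum


class DcStatus(str, Enum):
    OK = "ok"
    NO_INTERNET = "noInternet"
    NO_ACCESS = "noAccess"
    NO_WAN = "noWan"
    NO_LAN = "noLan"


def classify(internet: bool, access: bool, wan: bool, lan: bool) -> DcStatus:
    """Return the first failing dimension, checked in an assumed priority order."""
    if not internet:
        return DcStatus.NO_INTERNET
    if not access:
        return DcStatus.NO_ACCESS
    if not wan:
        return DcStatus.NO_WAN
    if not lan:
        return DcStatus.NO_LAN
    return DcStatus.OK


print(classify(internet=True, access=True, wan=False, lan=True).value)  # prints "noWan"
```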

DNS config regeneration is triggered automatically on health state change

When the health check state changes, the associated scenario can be re-evaluated immediately. Records bound to the scenario enter the dynamic config regeneration pipeline and the DNS answer is updated. This behavior reduces the need for manual zone edits or external scripts. Changes are grouped by a short debounce to prevent unnecessary repeated regeneration.

Master DNS writes in an HA cluster are handled by a single node

In an HA cluster scenario, DNS config writes are controlled through the master role. If the master node fails, the standby node can take over the role after a defined safety period. This model helps prevent two nodes from producing different DNS configs simultaneously. GTM behavior stays aligned with cluster state.

Operational depth

A DC failover operation is planned together with the check interval, consecutive thresholds, HC ID structure, scenario conditions, regeneration pipeline and RTO parameters.

01

DC checker interval

accessPeriod defines how frequently DC health checks run. It can be configured in seconds or minutes. A shorter period provides faster detection; a longer period gives a quieter, lower-noise evaluation.

02

Required success/failure

requiredSuccess defines how many consecutive successes are needed before a DC is considered up. requiredFailure defines how many consecutive failures are needed before a DC is considered down. These two values set the balance between failover speed and flap protection.

03

DC access type

wanAccess and lanAccess lists define the access targets for a DC. This allows evaluation of whether a DC is reachable not only from the outside but also from the internal network. The distinction is particularly important in inter-DC and hybrid routing scenarios.

04

HC ID format

Automatic HC records follow the format `auto||`. When a negative condition is needed, a `!` suffix can be appended to the ID for inverse evaluation. This structure keeps health check references readable within scenarios.

05

Scenario condition structure

Conditions within a group combine with AND logic; groups combine with OR logic. This structure supports a wide range of decision models, from simple primary-down checks to complex multi-dimension DC health scenarios. Operators are not limited to a single check result.
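The AND-within-group, OR-across-groups structure can be sketched in a few lines. The `evaluate_scenario` function and the health check IDs used here are hypothetical placeholders, not TR7's actual ID format; the only behavior taken from the text is the AND/OR combination and the `!` suffix for negative conditions.

```python
from typing import Dict, List


def evaluate_scenario(groups: List[List[str]], hc_state: Dict[str, bool]) -> bool:
    """True when every condition in at least one group holds."""

    def condition(ref: str) -> bool:
        negated = ref.endswith("!")          # trailing "!" requests inverse evaluation
        hc_id = ref[:-1] if negated else ref
        healthy = hc_state.get(hc_id, False)  # assumption: unknown checks are unhealthy
        return (not healthy) if negated else healthy

    return any(all(condition(ref) for ref in group) for group in groups)


# Hypothetical IDs: "internet AND access" OR "wan AND NOT maintenance".
groups = [
    ["dc1-internet", "dc1-access"],
    ["dc1-wan", "dc1-maintenance!"],
]
hc_state = {
    "dc1-internet": True,
    "dc1-access": False,
    "dc1-wan": True,
    "dc1-maintenance": False,
}
print(evaluate_scenario(groups, hc_state))  # prints True (second group matches)
```

Note that in this sketch an empty group evaluates to true, because `all()` of nothing is true; a production evaluator would likely reject empty groups at configuration time.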

06

Failover decision pipeline

When HC state changes, the scenario is re-evaluated, bound records are identified and dynamic config regeneration is triggered. The pipeline runs with a short debounce so rapid successive changes are merged into a single regeneration pass. The DNS answer is re-rendered according to the current health state.
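The merging effect of the debounce can be illustrated with a small class that collects changed records and releases them as one batch once the window has passed. This is a sketch under assumptions: the `RegenerationDebouncer` class is hypothetical, timestamps are passed in explicitly to keep the example deterministic, and the window is modeled as restarting on every new change, which may not match TR7's exact timing behavior.

```python
from typing import List, Optional, Set


class RegenerationDebouncer:
    """Collect changed records and emit them as one batch after a quiet window."""

    def __init__(self, window_seconds: float = 3.0):
        self.window = window_seconds
        self._pending: Set[str] = set()
        self._deadline: Optional[float] = None

    def notify(self, record: str, now: float) -> None:
        """Register a changed record; each change restarts the window (assumption)."""
        self._pending.add(record)
        self._deadline = now + self.window

    def poll(self, now: float) -> List[str]:
        """Return the batch to regenerate if the window has elapsed, else []."""
        if self._deadline is None or now < self._deadline:
            return []
        batch = sorted(self._pending)
        self._pending.clear()
        self._deadline = None
        return batch


debouncer = RegenerationDebouncer(window_seconds=3.0)
debouncer.notify("www.example.com", now=0.0)
debouncer.notify("api.example.com", now=1.0)  # arrives inside the window
print(debouncer.poll(now=3.5))  # prints [] because the window restarted at t=1.0
print(debouncer.poll(now=4.0))  # prints ['api.example.com', 'www.example.com']
```

Two state changes one second apart result in a single regeneration pass in this model.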

07

RTO parameter dependencies

RTO depends on accessPeriod, requiredFailure count, regeneration debounce duration and client DNS TTL behavior. Rather than claiming a single fixed time, the failover window should be planned to match service requirements. Critical services benefit from shorter TTL and more frequent checks.
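The dependency can be made concrete with a rough upper-bound calculation. The formula below is an approximation under stated assumptions, not a guarantee: it ignores probe timeouts and resolvers that do not honor TTL, and the `estimate_failover_window` function and the example values (other than the 3 s debounce) are hypothetical.

```python
def estimate_failover_window(
    access_period_s: float,
    required_failure: int,
    debounce_s: float,
    ttl_s: float,
) -> float:
    """Rough worst-case seconds from DC failure until clients see the new answer."""
    # Worst case: the DC fails right after a passing check, so detection needs
    # up to required_failure full check periods.
    detection = access_period_s * required_failure
    # Clients may keep the old answer cached for up to one full TTL.
    return detection + debounce_s + ttl_s


# Example: 10 s checks, 3 consecutive failures, 3 s debounce, 30 s TTL.
print(estimate_failover_window(10, 3, 3, 30))  # prints 63
```

Halving the check period or the TTL shortens the window directly, which is why both are tuned together for critical services.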

When to use it

Classic active/passive DC pair

DC1 is defined as primary and DC2 as passive standby. When the internet or access scenario for DC1 fails, DC1 records are removed from the DNS answer and DC2 begins responding.

Three-DC priority chain in a financial institution

Financial institutions can build a DC1 → DC2 → DC3 sequential failover chain. Each tier is evaluated by its own health scenario and an unhealthy DC is automatically removed from the DNS answer.

Planned maintenance with manual cutover

At the start of the maintenance window, DC1 is placed into maintenance mode and traffic is directed to DC2. When maintenance is complete, maintenance mode is disabled and normal health evaluation resumes.

Remote disaster-recovery site activation

When primary and secondary DCs are both unhealthy, DR mode records can be activated. In this scenario the remote disaster-recovery site remains passive under normal conditions and is added to the DNS answer only when the defined conditions are met.

Stale-data-protected secondary DC

When a DC that has been down for an extended period comes back online, it may not be desirable for it to respond with outdated data. The onlyNew behavior keeps an out-of-date DC passive, reducing the risk of publishing stale records.

Geofence and failover hybrid routing

The nearest DC is first selected by country or region; if that DC becomes unhealthy, the standby DC is activated. This model combines performance-based steering with continuity decisions in a single GTM configuration.
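The two-stage decision can be sketched as a regional lookup followed by a standby walk. The `route` function, the region codes and the DC names are hypothetical and only illustrate the order of decisions; they do not reflect TR7 configuration syntax.

```python
from typing import Dict, List, Optional


def route(
    region: str,
    preferred_by_region: Dict[str, str],
    standby_chain: List[str],
    dc_health: Dict[str, bool],
) -> Optional[str]:
    """Pick the region's preferred DC if healthy, else the first healthy standby."""
    preferred = preferred_by_region.get(region)
    if preferred is not None and dc_health.get(preferred, False):
        return preferred
    for dc in standby_chain:
        if dc_health.get(dc, False):
            return dc
    return None


preferred_by_region = {"TR": "dc-istanbul", "DE": "dc-frankfurt"}
standby_chain = ["dc-frankfurt", "dc-istanbul"]
dc_health = {"dc-istanbul": False, "dc-frankfurt": True}

print(route("TR", preferred_by_region, standby_chain, dc_health))  # prints "dc-frankfurt"
```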

Frequently asked questions

When and how is a DC failover decision triggered?
When a health check state changes, the associated scenario is re-evaluated immediately. If the scenario result changes, the bound DNS records enter the dynamic config regeneration pipeline and the DNS answer is updated. Rapid successive changes are grouped by a short debounce into a single regeneration pass, avoiding unnecessary repeated rendering.
How does flap protection work?
requiredSuccess and requiredFailure define how many consecutive successful or failed results are needed before a DC is declared up or down. While a DC is in transition, the stuck-state mechanism preserves the previous evaluation result. These two layers together help prevent short-lived fluctuations from unnecessarily changing the DNS answer.
What RTO can be expected?
RTO depends on accessPeriod, the requiredFailure count, the regeneration debounce duration and client DNS TTL behavior. Rather than claiming a single fixed number, these parameters should be tuned to match service requirements. Critical services can shorten the failover window by using a lower TTL and a more frequent check interval.
How does DR mode differ from regular failover?
Normal DC chain logic adds healthy DCs to the DNS answer and removes unhealthy ones. DR mode activates specific records only when a defined DR condition is met. The drCond scenario or drIfNoRecords flag triggers the DR record when primary and secondary targets are exhausted; under normal conditions the DR IP does not appear in the DNS answer.
Is failover state lost if the GTM service restarts?
No. TR7 can store local health check and scenario state at the file level. After a restart, the previous state is restored and evaluation does not begin from scratch. This is especially useful for maintaining GTM consistency during maintenance operations that require a service restart.
How is a DC taken offline for planned maintenance?
The operator sets the maintenanceMode flag, which removes the DC from the DNS answer. Even if the DC is healthy, it will not generate DNS responses while maintenance mode is active and traffic is redirected to another DC. When maintenance is complete, the mode is disabled and normal evaluation resumes.

Bring DC failover down to DNS TTL speed

Health scenario, DNS response and manual cutover unified in a single decision pipeline. Let's walk through a live setup with your own DC architecture.