Build a Disaster Recovery Solution for OKE with RackWare SWIFT

RackWare SWIFT is a fully automated solution that enables backup and disaster recovery between your Oracle Container Engine for Kubernetes (OKE) environments across regions. SWIFT uses disaster recovery policies to schedule backups of your OKE workloads. During an outage, you can fail over your workloads to a remote cloud location and be up and running in minutes.

SWIFT’s unique cross-cloud and cross-platform migration technology enables you to seamlessly move applications from one container platform to any other container platform.

RackWare SWIFT brings peace of mind by protecting your stateful and stateless Kubernetes objects. With RackWare SWIFT's flexible backup policies, built for large-scale outages, you can plan recovery time objectives (RTOs) and recovery point objectives (RPOs) that meet your needs.
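
As a rough planning aid, the relationship between a backup schedule and the recovery point objective can be sketched in a few lines of Python. This is a simplified model, not RackWare SWIFT behavior: it assumes each scheduled sync captures the state as of its start time and that only a completed sync can be used for recovery. The function names and the sample durations are hypothetical.

```python
from datetime import timedelta


def worst_case_rpo(sync_interval: timedelta, sync_duration: timedelta) -> timedelta:
    """Worst-case data loss under the simplified model described above.

    If a disaster strikes just before a sync completes, the newest usable
    copy is the previous sync, which reflects the state from one full
    interval plus one sync duration earlier.
    """
    return sync_interval + sync_duration


def meets_rpo(sync_interval: timedelta, sync_duration: timedelta,
              target_rpo: timedelta) -> bool:
    """Return True if the schedule keeps worst-case data loss within the target."""
    return worst_case_rpo(sync_interval, sync_duration) <= target_rpo


# Hypothetical numbers: a target RPO of one hour and syncs that take ten minutes.
target = timedelta(hours=1)
duration = timedelta(minutes=10)

hourly_ok = meets_rpo(timedelta(hours=1), duration, target)
half_hourly_ok = meets_rpo(timedelta(minutes=30), duration, target)

print("hourly schedule meets target:", hourly_ok)
print("30-minute schedule meets target:", half_hourly_ok)
```

Under these assumptions, an hourly schedule does not meet a one-hour target, because the time a sync takes to complete adds to the exposure window; a shorter interval does.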

Architecture

This reference architecture describes how you can enable backup and disaster recovery between your OKE setups across regions.

A standby region is configured to receive the OKE clusters in the event of a disaster. This disaster recovery strategy follows the active/passive model, in which the standby region does not go live in production until a disaster is declared.

The following diagram illustrates this reference architecture.



[Architecture diagram: disaster-recovery-oke-ra]

The architecture has the following components:

  • Tenancy

    A tenancy is a secure and isolated partition that Oracle sets up within Oracle Cloud when you sign up for Oracle Cloud Infrastructure. You can create, organize, and administer your resources in Oracle Cloud within your tenancy. A tenancy is synonymous with a company or organization. Usually, a company will have a single tenancy and reflect its organizational structure within that tenancy. A single tenancy is usually associated with a single subscription, and a single subscription usually only has one tenancy.

  • Region

    An Oracle Cloud Infrastructure region is a localized geographic area that contains one or more data centers, called availability domains. Regions are independent of other regions, and vast distances can separate them (across countries or even continents).

  • Compartment

    Compartments are cross-region logical partitions within an Oracle Cloud Infrastructure tenancy. Use compartments to organize your resources in Oracle Cloud, control access to the resources, and set usage quotas. To control access to the resources in a given compartment, you define policies that specify who can access the resources and what actions they can perform.

  • Availability domains

    Availability domains are standalone, independent data centers within a region. The physical resources in each availability domain are isolated from the resources in the other availability domains, which provides fault tolerance. Availability domains don’t share infrastructure such as power or cooling, or the internal availability domain network. So, a failure at one availability domain is unlikely to affect the other availability domains in the region.

  • Virtual cloud network (VCN) and subnet

    A VCN is a customizable, software-defined network that you set up in an Oracle Cloud Infrastructure region. Like traditional data center networks, VCNs give you complete control over your network environment. A VCN can have multiple non-overlapping CIDR blocks that you can change after you create the VCN. You can segment a VCN into subnets, which can be scoped to a region or to an availability domain. Each subnet consists of a contiguous range of addresses that don't overlap with the other subnets in the VCN. You can change the size of a subnet after creation. A subnet can be public or private.

  • Load balancer

    The Oracle Cloud Infrastructure Load Balancing service provides automated traffic distribution from a single entry point to multiple servers in the back end.

  • Security list

    For each subnet, you can create security rules that specify the source, destination, and type of traffic that must be allowed in and out of the subnet.

  • Network address translation (NAT) gateway

    A NAT gateway enables private resources in a VCN to access hosts on the internet, without exposing those resources to incoming internet connections.

  • RackWare SWIFT

    In this architecture, RackWare SWIFT discovers OKE clusters in the primary region and syncs them to the standby region.

  • Oracle Cloud Infrastructure Registry (OCIR)

    Oracle Cloud Infrastructure Registry is an Oracle-managed registry that enables you to simplify your development-to-production workflow.
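
The address rules described for the VCN and subnet component (subnets that do not overlap one another and that fall inside the VCN's CIDR blocks) can be checked before any network is created. The following is a minimal sketch using Python's standard `ipaddress` module. The helper names and all CIDR values are hypothetical, and the comparison between the primary and standby VCNs only matters if you intend to connect the two networks privately.

```python
import ipaddress
from itertools import combinations


def find_overlaps(cidrs):
    """Return every pair of CIDR blocks in the list that overlap."""
    networks = [ipaddress.ip_network(c) for c in cidrs]
    return [(str(a), str(b)) for a, b in combinations(networks, 2) if a.overlaps(b)]


def subnets_outside_vcn(vcn_cidr, subnet_cidrs):
    """Return the subnets that are not fully contained in the VCN block."""
    vcn = ipaddress.ip_network(vcn_cidr)
    return [s for s in subnet_cidrs if not ipaddress.ip_network(s).subnet_of(vcn)]


# Hypothetical address plan for the primary and standby regions.
primary_vcn = "10.0.0.0/16"
standby_vcn = "10.1.0.0/16"
standby_subnets = ["10.1.0.0/24", "10.1.1.0/24", "10.1.1.128/25"]

vcn_overlaps = find_overlaps([primary_vcn, standby_vcn])
subnet_overlaps = find_overlaps(standby_subnets)
stray_subnets = subnets_outside_vcn(standby_vcn, standby_subnets)

print("VCN overlaps:", vcn_overlaps)
print("Subnet overlaps:", subnet_overlaps)
print("Subnets outside the standby VCN:", stray_subnets)
```

In this sample plan the two VCNs are disjoint, but the third standby subnet is deliberately placed inside the second one so that the check has something to report.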

Recommendations

Use the following recommendations as a starting point. Your requirements might differ from the architecture described here.
  • VCN

    When you create a VCN, determine the number of CIDR blocks required and the size of each block based on the number of resources that you plan to attach to subnets in the VCN. Use CIDR blocks that are within the standard private IP address space.

    Select CIDR blocks that don't overlap with any other network (in Oracle Cloud Infrastructure, your on-premises data center, or another cloud provider) to which you intend to set up private connections.

    After you create a VCN, you can change, add, and remove its CIDR blocks.

    When you design the subnets, consider your traffic flow and security requirements. Attach all the resources within a specific tier or role to the same subnet, which can serve as a security boundary.

  • Security Zones

    For resources that require maximum security, Oracle recommends that you use security zones. A security zone is a compartment associated with an Oracle-defined recipe of security policies that are based on best practices. For example, the resources in a security zone must not be accessible from the public internet and they must be encrypted using customer-managed keys. When you create and update resources in a security zone, Oracle Cloud Infrastructure validates the operations against the policies in the security-zone recipe, and denies operations that violate any of the policies.

  • Load balancer bandwidth

    While creating the load balancer, you can either select a predefined shape that provides a fixed bandwidth, or specify a custom (flexible) shape where you set a bandwidth range and let the service scale the bandwidth automatically based on traffic patterns. With either approach, you can change the shape at any time after creating the load balancer.

  • Oracle Container Engine for Kubernetes

    Oracle Container Engine for Kubernetes (OKE) is a fully managed, scalable, and highly available service that you can use to deploy your containerized applications to the cloud. Use OKE when your development team wants to reliably build, deploy, and manage cloud-native applications. You specify the compute resources that your applications require, and OKE provisions them on Oracle Cloud Infrastructure in an existing OCI tenancy.

  • Application Replication
    • Sync Passthrough: Synchronizes container objects and data from the source to the target platform.
    • Stage-1: Synchronizes container objects and data from the source platform to SWIFT. Data is stored in the SWIFT storage pool.
    • Stage-2: Data stored in the SWIFT storage pool is synchronized into the target platform.
  • Registry Replication

    Replicates images from one location to another.
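
The difference between the replication modes listed above (Sync Passthrough, Stage-1, and Stage-2) is mainly where the data rests between source and target. The toy model below illustrates that data flow with plain Python dictionaries. It is not the RackWare SWIFT API: the function names, the dictionary layout, and the sample objects are all hypothetical and exist only to show the two paths side by side.

```python
import copy


def sync_passthrough(source: dict, target: dict) -> None:
    """Copy objects and data from the source platform straight to the target."""
    target.update(copy.deepcopy(source))


def stage_1(source: dict, storage_pool: dict) -> None:
    """Copy objects and data from the source platform into the storage pool."""
    storage_pool.clear()
    storage_pool.update(copy.deepcopy(source))


def stage_2(storage_pool: dict, target: dict) -> None:
    """Copy whatever is held in the storage pool into the target platform."""
    target.update(copy.deepcopy(storage_pool))


# Hypothetical contents of a source platform.
source_platform = {
    "deployment/web": {"replicas": 3},
    "pvc/data": {"size": "50Gi"},
}

# Path 1: a single passthrough sync.
direct_target = {}
sync_passthrough(source_platform, direct_target)

# Path 2: the same content, moved in two stages through a storage pool.
pool = {}
staged_target = {}
stage_1(source_platform, pool)
stage_2(pool, staged_target)

print("targets match:", direct_target == staged_target)
```

In this model both paths leave the target with the same content; the staged path simply keeps an intermediate copy in the pool between the two steps.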

Considerations

Consider the following points when deploying this reference architecture.

  • Sync

    Before RackWare SWIFT can sync your OKE cluster to the secondary region, an OKE cluster must exist in the secondary region. You must also create the namespaces on the secondary cluster before the sync. Each sync job supports one-to-one namespace mapping.

  • Infrastructure

    You must create a VCN in the secondary region before you can sync the two regions.

  • Kubernetes Cluster Information

    Note that RackWare SWIFT does not replicate node labels, node allocations, control plane definitions, or worker node properties. This means that you must maintain your pod topology, pod distribution, node selectors, and affinity settings manually in the secondary region. Design and apply appropriate resource allocation and pod distribution in the secondary location so that workloads behave consistently when a switchover or failover takes place.
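
The considerations above lend themselves to a short pre-sync check: confirm that every mapped namespace already exists on the secondary cluster, and that the node labels your workloads rely on are present there. The following is a minimal sketch in plain Python. It assumes you have already collected the inputs yourself (for example, from `kubectl get namespaces` and `kubectl get nodes --show-labels` on each cluster); the function names and all sample values are hypothetical.

```python
def missing_namespaces(namespace_map, secondary_namespaces):
    """Return target namespaces that do not yet exist on the secondary cluster.

    namespace_map maps each primary namespace to its secondary namespace.
    The mapping must be one-to-one, so duplicate targets are rejected.
    """
    targets = list(namespace_map.values())
    if len(set(targets)) != len(targets):
        raise ValueError("namespace mapping must be one-to-one")
    return sorted(set(targets) - set(secondary_namespaces))


def missing_node_labels(required_labels, secondary_node_labels):
    """Return (key, value) labels that no node on the secondary cluster carries.

    required_labels is the set of labels used by node selectors or affinity
    rules; secondary_node_labels is a list with one label dict per node.
    """
    available = {(k, v) for labels in secondary_node_labels for k, v in labels.items()}
    return sorted(set(required_labels) - available)


# Hypothetical inputs gathered from the two clusters.
namespace_map = {"shop": "shop", "payments": "payments-dr"}
secondary_namespaces = ["default", "kube-system", "shop"]

required_labels = {("tier", "frontend"), ("pool", "gpu")}
secondary_node_labels = [{"tier": "frontend"}, {"tier": "backend"}]

ns_gaps = missing_namespaces(namespace_map, secondary_namespaces)
label_gaps = missing_node_labels(required_labels, secondary_node_labels)

print("Namespaces to create before sync:", ns_gaps)
print("Node labels to add in the secondary region:", label_gaps)
```

With the sample inputs, the check reports one namespace to create and one node label to add before a failover would behave like the primary region.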

Deploy

The example for this reference architecture is available as an image in Oracle Cloud Marketplace.
  1. Go to Oracle Cloud Marketplace.
  2. Click Get App.
  3. Follow the on-screen prompts.

Explore More

Learn more about building a disaster recovery solution for OKE with RackWare SWIFT.

Acknowledgments

  • Author: Saul Chavez
  • Contributors: Wei Han