High-Level Architecture

Oakestra is composed of three key building blocks: the Root Orchestrator, the Cluster Orchestrators, and the Worker Nodes.

Root Orchestrator

The root orchestrator is the centralized control plane that coordinates the participating clusters.

[Figure: the Root Orchestrator's components (System Manager, Scheduler, Root Service Manager, Resource Abstractor, Root Network Component, mongo, celery, and Grafana) with the child clusters attached below it.]

The image above illustrates the components of the root orchestrator. Each component operates as an independent service, integrated and managed using the Docker Compose plugin.

  • The System Manager is the primary entry point for users to the application deployment platform. It exposes two sets of APIs:

    1. A user-facing API that receives deployment commands from the CLI, the Dashboard, or direct REST calls (an illustrative request sketch follows this list).
    2. A cluster-facing API that manages the child Oakestra clusters.
  • The Scheduler determines the most suitable cluster for deploying a given application.

  • Mongo provides database access for the root orchestrator, which stores aggregated information about its child clusters. Oakestra categorizes this data into:

    1. Static metadata, such as IP addresses, port numbers, cluster names, and locations.
    2. Dynamic data, such as the number of worker nodes per cluster, total CPU cores and memory, disk space, and GPU capabilities.
  • The Resource Abstractor standardizes resource management by abstracting generic resources behind a unified interface. Whether the managed entities are clusters or workers, this abstraction lets the same scheduling algorithms run at the root and inside each cluster. Additionally, it provides an interface for managing the service lifecycle (a sketch of such a generic resource interface follows this list).

  • Grafana offers a dashboard with global system alerts, logs, and performance statistics.

  • The Root Network Component manages Semantic IP and Instance IP addresses for each service and the cluster’s subnetworks. Refer to the Networking concepts and manuals for more details.
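
To make the user-facing API concrete, here is a minimal sketch of submitting a deployment request to the System Manager over REST. The endpoint path, port, and payload fields are illustrative assumptions rather than the authoritative Oakestra API; the CLI and Dashboard wrap calls of this kind, and the exact deployment descriptor format is documented in the API reference.

```go
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

// deploymentRequest is a hypothetical, simplified deployment descriptor.
type deploymentRequest struct {
    ApplicationName string   `json:"application_name"`
    Microservices   []string `json:"microservices"`
}

func main() {
    body, _ := json.Marshal(deploymentRequest{
        ApplicationName: "demo-app",
        Microservices:   []string{"docker.io/library/nginx:latest"},
    })
    // Hypothetical System Manager address and route.
    resp, err := http.Post("http://root-orchestrator:10000/api/application",
        "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println("System Manager replied:", resp.Status)
}
```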
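
The Resource Abstractor's unified view can be pictured as an interface like the one below: a cluster (as seen by the root) and a worker node (as seen by its cluster) both expose static metadata and a periodically refreshed capacity, so scheduling logic can be written once against either. All type and field names here are illustrative assumptions, not the actual Resource Abstractor API.

```go
package main

import (
    "fmt"
    "time"
)

// StaticMetadata mirrors the "static" category: reported once at registration.
type StaticMetadata struct {
    Name     string
    Address  string
    Port     int
    Location string
}

// DynamicStatus mirrors the "dynamic" category: refreshed periodically.
type DynamicStatus struct {
    Workers   int // 1 when the resource is a single worker node
    CPUCores  int
    MemoryMB  int
    DiskGB    int
    GPUs      int
    UpdatedAt time.Time
}

// Resource is the generic view a scheduler operates on, regardless of whether
// the underlying entity is a cluster or a worker node.
type Resource interface {
    Metadata() StaticMetadata
    Status() DynamicStatus
}

// cluster is one possible concrete resource as seen by the root.
type cluster struct {
    meta   StaticMetadata
    status DynamicStatus
}

func (c cluster) Metadata() StaticMetadata { return c.meta }
func (c cluster) Status() DynamicStatus    { return c.status }

// fits is an example of scheduling logic written once against the abstraction.
func fits(r Resource, cpu, memMB int) bool {
    s := r.Status()
    return s.CPUCores >= cpu && s.MemoryMB >= memMB
}

func main() {
    c := cluster{
        meta:   StaticMetadata{Name: "edge-cluster-1", Address: "10.0.0.5", Port: 10100, Location: "edge-site-a"},
        status: DynamicStatus{Workers: 3, CPUCores: 12, MemoryMB: 16384, DiskGB: 256, UpdatedAt: time.Now()},
    }
    fmt.Println("can host a 2-core / 4 GiB service:", fits(c, 2, 4096))
}
```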

Cluster Orchestrator

[Figure: the Cluster Orchestrator's components (System Manager, Scheduler, Cluster Service Manager, Cluster Network Component, mongo, celery, and Grafana) and the Mosquitto MQTT broker exposing the nodes/*/information, nodes/*/job, and nodes/*/net/... topics towards the worker nodes.]

The cluster orchestrator functions as a logical twin of the root orchestrator but differs in the following ways:

  • Worker Management: Unlike the root orchestrator, the cluster orchestrator manages worker nodes instead of clusters.

  • Resource Aggregation: The cluster orchestrator aggregates resources from its worker nodes and abstracts the cluster’s internal composition from the root orchestrator. At the root level, a cluster appears as a generic resource with a total capacity equal to the sum of its worker node resources.

  • Intra-Cluster Communication: MQTT is used as the communication protocol for the intra-cluster control plane (see the sketch below).
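
As a rough illustration of this control plane, the sketch below shows a worker publishing its status and subscribing to deployment commands over MQTT, following the topic layout shown in the diagram above (nodes/<id>/information and nodes/<id>/job). The broker address, client library (Eclipse Paho for Go), reporting interval, and payload format are assumptions made for the example, not a prescribed implementation.

```go
package main

import (
    "fmt"
    "time"

    mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
    nodeID := "worker-42" // hypothetical node identifier
    opts := mqtt.NewClientOptions().
        AddBroker("tcp://cluster-orchestrator:1883"). // the cluster's Mosquitto broker
        SetClientID(nodeID)
    client := mqtt.NewClient(opts)
    if token := client.Connect(); token.Wait() && token.Error() != nil {
        panic(token.Error())
    }

    // Receive job (deployment) commands addressed to this node.
    client.Subscribe("nodes/"+nodeID+"/job", 1, func(_ mqtt.Client, msg mqtt.Message) {
        fmt.Printf("job command: %s\n", msg.Payload())
    })

    // Periodically report dynamic node information to the cluster orchestrator.
    for range time.Tick(10 * time.Second) {
        client.Publish("nodes/"+nodeID+"/information", 1, false,
            `{"cpu_used": 0.35, "mem_used": 0.52}`) // illustrative payload
    }
}
```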

Worker Node

The worker node is the component responsible for executing workloads requested by developers.

A worker node is any Linux machine running two essential services:

  • NodeEngine: Manages the deployment of applications based on the installed runtimes.
  • NetManager: Provides the networking components required for seamless inter-application communication.

[Figure: the internals of a Worker Node, split between the Node Engine (MQTT, Models for Node and Service, Jobs, Runtimes, Net API, and the container runtime) and the Net Manager (MQTT, Net API, Environment Manager, Proxy, Translation Table, and Proxy Table).]

The Node Engine is a single binary implemented in Go. Its modules are:
  • MQTT: The communication interface between the worker and the cluster. It receives deployment commands and transmits node status and job updates.

  • Models: Define the structure of nodes and jobs:

    • Node: Represents the resources reported to the cluster (see the sketch after this list). These are divided into:
      • Static resources, transmitted only at startup (e.g., hardware configuration).
      • Dynamic resources, updated periodically (e.g., CPU and memory usage).
    • Service: Represents the services managed by the worker node, including real-time service usage statistics.
  • Jobs: Background processes that monitor the health and status of the worker node and its deployed applications.

  • Runtimes: Supported system runtimes for workload execution. Currently, containers and unikernels are supported.

  • Net API: A local socket interface used for communication with the Net Manager.
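
To illustrate the split between the Models and the Jobs modules, the sketch below defines a node model with a static and a dynamic part, plus a background loop that refreshes the dynamic part on a fixed interval. The struct names, fields, interval, and the stand-in sampling functions are illustrative assumptions; in the Node Engine the reports are published to the cluster over MQTT.

```go
package main

import (
    "fmt"
    "runtime"
    "time"
)

// StaticNodeInfo is transmitted only when the Node Engine starts.
type StaticNodeInfo struct {
    Hostname string
    CPUCores int
    Arch     string
}

// DynamicNodeInfo is refreshed periodically by a monitoring job.
type DynamicNodeInfo struct {
    CPUUsage float64 // fraction of total CPU in use
    MemUsage float64 // fraction of total memory in use
}

func main() {
    static := StaticNodeInfo{Hostname: "worker-42", CPUCores: runtime.NumCPU(), Arch: runtime.GOARCH}
    fmt.Printf("startup report: %+v\n", static) // would be published once via MQTT

    // Background job: sample and report dynamic usage on a fixed interval.
    for range time.Tick(5 * time.Second) {
        dyn := DynamicNodeInfo{CPUUsage: sampleCPU(), MemUsage: sampleMem()}
        fmt.Printf("status report: %+v\n", dyn) // would be published via MQTT
    }
}

// sampleCPU and sampleMem are placeholders for real probes of the host.
func sampleCPU() float64 { return 0.0 }
func sampleMem() float64 { return 0.0 }
```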

The Net Manager is responsible for managing service-to-service communication within and across nodes. It fetches the balancing policy of each service, sets up virtual network interfaces, balances traffic, and tunnels packets across nodes. Its modules are:
  • Environment Manager: Handles the installation of virtual network interfaces, network namespaces, and iptables rules.

  • Proxy: Manages traffic balancing and packet tunneling across nodes.

  • Translation Table: Maintains the mapping of Service IPs to Instance IPs and the associated balancing policy for each service (see the sketch after this list). For more details, refer to the networking concepts documentation.

  • Proxy Table: Serves as a cache for active proxy translations.

  • MQTT Component: Acts as the interface between the Net Manager and the Cluster Network Component. It resolves Service IP translation requests and retrieves the node’s subnetwork during startup.
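
The interplay between the Translation Table and the Proxy can be pictured as a lookup followed by a balancing decision, as in the sketch below. The data structures, the example addresses, and the round-robin policy are illustrative assumptions; the actual Semantic IP and Instance IP semantics are described in the networking concepts documentation.

```go
package main

import (
    "fmt"
    "net"
    "sync/atomic"
)

type entry struct {
    instances []net.IP // Instance IPs of the service's running replicas
    next      uint32   // round-robin cursor (one possible balancing policy)
}

type translationTable struct {
    byServiceIP map[string]*entry
}

// Resolve returns the Instance IP the proxy should tunnel this flow to.
func (t *translationTable) Resolve(serviceIP net.IP) (net.IP, bool) {
    e, ok := t.byServiceIP[serviceIP.String()]
    if !ok || len(e.instances) == 0 {
        return nil, false // cache miss: ask the Cluster Network Component over MQTT
    }
    i := atomic.AddUint32(&e.next, 1)
    return e.instances[int(i)%len(e.instances)], true
}

func main() {
    // Illustrative addresses only.
    tt := &translationTable{byServiceIP: map[string]*entry{
        "10.30.0.1": {instances: []net.IP{net.ParseIP("10.19.1.2"), net.ParseIP("10.19.2.2")}},
    }}
    if ip, ok := tt.Resolve(net.ParseIP("10.30.0.1")); ok {
        fmt.Println("tunnel to instance:", ip)
    }
}
```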

Resilience and Failure Recovery

Root Orchestrator Failure

A centralized control plane introduces a single point of failure. Oakestra mitigates this risk by enabling clusters to autonomously satisfy SLAs for deployed applications. Consequently, a Root Orchestrator failure impacts only the following:

  • Deployment of new applications: System Manager APIs become unavailable, making it impossible to deploy new workloads.
  • Inter-cluster route discovery: Pre-existing P2P communication between worker nodes remains unaffected, but new inter-cluster routes cannot be established.
  • Onboarding new clusters: New clusters cannot join the infrastructure.

A potential solution involves implementing leader election among cluster orchestrators to designate a new root, though this feature is not yet available in the current release.

Cluster Orchestrator Failure

By design, a cluster orchestrator failure does not impact (i) other clusters or (ii) workloads already running on worker nodes. However, the following limitations arise:

  • The cluster cannot accept new workloads.
  • The cluster cannot onboard new worker nodes.
  • The cluster cannot reschedule failed workloads.
  • Inter-cluster route discovery is disabled, though pre-existing connections are preserved.

Cluster and Root Orchestrator failures can be mitigated by deploying a high-availability setup for the control plane’s microservices.

Worker Node Failure

Worker node failures are common and expected in edge environments. The cluster handles such failures as follows:

  • Workloads affected by the failure are automatically re-deployed on other worker nodes.
  • Failures are detected using a heartbeat mechanism. If a worker node is unresponsive for more than 5 seconds, the cluster scales up and reallocates the workload.

The failure detection threshold is configurable in the cluster manager settings; a minimal sketch of the heartbeat mechanism follows.
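
The sketch below assumes the cluster tracks the last report time of each worker and treats any node silent for longer than the configurable threshold (5 seconds by default) as failed, so its workloads can be rescheduled. Names and structure are illustrative, not the actual cluster manager implementation.

```go
package main

import (
    "fmt"
    "sync"
    "time"
)

type heartbeatMonitor struct {
    mu        sync.Mutex
    lastSeen  map[string]time.Time
    threshold time.Duration // configurable; 5s by default
}

// Beat records a heartbeat received from a worker node.
func (m *heartbeatMonitor) Beat(nodeID string) {
    m.mu.Lock()
    defer m.mu.Unlock()
    m.lastSeen[nodeID] = time.Now()
}

// Failed returns the nodes whose workloads should be rescheduled.
func (m *heartbeatMonitor) Failed(now time.Time) []string {
    m.mu.Lock()
    defer m.mu.Unlock()
    var failed []string
    for id, t := range m.lastSeen {
        if now.Sub(t) > m.threshold {
            failed = append(failed, id)
        }
    }
    return failed
}

func main() {
    m := &heartbeatMonitor{lastSeen: map[string]time.Time{}, threshold: 5 * time.Second}
    m.Beat("worker-42")
    time.Sleep(10 * time.Millisecond)
    fmt.Println("failed nodes:", m.Failed(time.Now())) // none yet; worker-42 just reported
}
```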