At its core, Replicante Core is a state events engine: when a state change is detected, actions are taken to adapt or to return the system to a desired state.
The idea of orchestration built on events is not new.
Additionally, tracking state changes tells us what is happening to our system, what we need to change, and what our actions on the system lead to.
The cluster orchestration process continuously evaluates the state of clusters so decisions can be taken, progress tracked and (re)actions triggered.
So how does cluster orchestration work?
1. The `orchestrate` component runs periodically at fixed intervals. The interval should be short because it determines the delay between a cluster needing orchestration and the orchestration being scheduled.
2. Each `orchestrate` run looks for any cluster with an expected next orchestration time in the past.
3. If no cluster needs to be orchestrated, the `orchestrate` run does nothing.
4. Otherwise, the `orchestrate` run schedules an orchestrate task for each cluster that needs to run.
5. The expected next orchestration time of each scheduled cluster is set to `now() + orchestrate interval`.
Because events are generated from differences in observed states, orchestrating the state of a node from multiple processes at once may lead to duplicate and/or missing events as well as inconsistent aggregations.
Distributed locks are used to ensure a cluster is orchestrated by only one task at a time. Any cluster orchestration attempted while another operation is already running will be discarded.
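
The scheduling pass and the lock-based discard described above can be sketched as follows. This is a minimal in-memory illustration, not Replicante Core's actual implementation: the names `ORCHESTRATE_INTERVAL`, `orchestrate_run`, `orchestrate_cluster`, `next_orchestrate` and `cluster_locks` are hypothetical, and a local `threading.Lock` stands in for a distributed lock.

```python
import threading
import time

# Hypothetical interval (in seconds) between orchestrations of one cluster.
ORCHESTRATE_INTERVAL = 60.0

# Stand-ins for persistent state: the expected next orchestration time of each
# cluster and one lock per cluster (assumed; a real system would use a
# distributed lock service instead of in-process locks).
next_orchestrate = {}
cluster_locks = {}


def orchestrate_run(schedule_task, now=None):
    """One scheduling pass: schedule a task for every cluster that is due."""
    now = time.time() if now is None else now
    due = [cid for cid, when in next_orchestrate.items() if when <= now]
    for cluster_id in due:
        schedule_task(cluster_id)
        # Push the expected next orchestration time forward.
        next_orchestrate[cluster_id] = now + ORCHESTRATE_INTERVAL
    return due


def orchestrate_cluster(cluster_id, orchestrate):
    """Run `orchestrate` unless another task already holds the cluster lock."""
    lock = cluster_locks.setdefault(cluster_id, threading.Lock())
    # Non-blocking acquire: a concurrent attempt is discarded, not queued.
    if not lock.acquire(blocking=False):
        return False
    try:
        orchestrate(cluster_id)
        return True
    finally:
        lock.release()
```

Discarding rather than queueing a concurrent attempt is safe in this sketch because the next scheduling pass will pick the cluster up again once its expected next orchestration time has passed.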
The above covers how Replicante Core manages orchestration tasks. This section covers how an orchestration task works against an individual cluster.
The orchestration logic performs the following steps sequentially:
The newly fetched cluster information is used to generate an approximate cluster view. This new view is compared to a view built from the last known cluster data to generate events describing the changes observed in the cluster.
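
Generating events by comparing two views can be illustrated with a small sketch. Everything here is an assumption made for the example: the `ClusterView` fields, the `diff_views` helper, and the event names are hypothetical and do not reflect Replicante Core's actual data model or event types.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterView:
    """Hypothetical approximate view: known nodes and each shard's primary."""
    nodes: set = field(default_factory=set)
    primaries: dict = field(default_factory=dict)  # shard ID -> node ID


def diff_views(old, new):
    """Return a list of events describing how `new` differs from `old`."""
    events = []
    # Membership changes: nodes that appeared in or vanished from the view.
    for node in sorted(new.nodes - old.nodes):
        events.append(("CLUSTER_NODE_ADDED", node))
    for node in sorted(old.nodes - new.nodes):
        events.append(("CLUSTER_NODE_REMOVED", node))
    # Shards whose primary differs from the last known view.
    for shard in sorted(new.primaries):
        before = old.primaries.get(shard)
        after = new.primaries[shard]
        if before != after:
            events.append(("SHARD_PRIMARY_CHANGED", shard, before, after))
    return events
```

Comparing a view with itself yields no events, which is the property that makes the diff safe to repeat on every orchestration cycle.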
Because the cluster view is approximate, node events are always based on reports from the nodes themselves (we do not report a node as down if we see it up, even if another node in the cluster thinks it is down).
Only cluster-level events are generated from these views. Actions must also check that the state of the cluster matches expectations before they are executed.
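
One way to picture the check an action performs before executing is a precondition evaluated against the current view. This is only a sketch under assumed names: `Action`, `expects`, `execute` and `run_action` are hypothetical and are not part of Replicante Core's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    """Hypothetical action with an explicit precondition on the cluster view."""
    name: str
    expects: Callable[[dict], bool]  # does the view match expectations?
    execute: Callable[[dict], None]  # side effect to apply if it does


def run_action(action, view):
    """Execute `action` only when the current view matches its expectations."""
    if not action.expects(view):
        # The cluster changed since the action was planned: skip it.
        return False
    action.execute(view)
    return True
```

Because the view is approximate, re-checking at execution time guards against acting on a state that has already changed.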