Orphan status
Starting with Tarantool version 1.9, there is a change to the
procedure when an instance joins a replica set.
During box.cfg()
the instance tries to join all nodes listed
in box.cfg.replication.
If the instance does not succeed in connecting to the required number of nodes
(see bootstrap_strategy),
it switches to orphan status.
While an instance is in orphan status, it is read-only.
To “join” a master, a replica instance must “connect” to the master node and then “sync”.
“Connect” means contact the master over the physical network and receive acknowledgment. If there is no acknowledgment after box.cfg.replication_connect_timeout seconds (usually 4 seconds), and retries fail, then the connect step fails.
“Sync” means receive updates
from the master in order to make a local database copy.
Syncing is complete when the replica has received all the
updates, or at least has received enough updates that the replica’s lag
(see
replication.upstream.lag
in box.info()
)
is less than or equal to the number of seconds specified in
box.cfg.replication_sync_lag.
If replication_sync_lag
is unset (nil) or set to TIMEOUT_INFINITY, then
the replica skips the “sync” state and switches to “follow” immediately.
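To see where a replica stands relative to this limit, you can read the lag from box.info; a minimal sketch in Lua (assuming the instance is already configured with at least one upstream):

    -- Print each upstream's replication lag and the configured sync lag.
    -- box.info.replication is indexed by instance id; the upstream field
    -- is present only for remote peers this instance follows.
    for id, r in pairs(box.info.replication) do
        if r.upstream ~= nil then
            print(('instance %d: upstream lag = %s seconds'):format(id, tostring(r.upstream.lag)))
        end
    end
    print('replication_sync_lag = ' .. tostring(box.cfg.replication_sync_lag))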
In order to leave orphan mode, you need to sync with a sufficient number of instances (see bootstrap_strategy). To do so, you may either:
- Reset box.cfg.replication to exclude instances that cannot be reached or synced with.
- Set box.cfg.replication to "" (empty string).
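For example, a minimal sketch in Lua (the URI below is a placeholder; substitute the reachable members of your own replica set):

    -- Keep only the peers that can actually be reached and synced with:
    box.cfg{replication = {'replicator:password@192.168.0.101:3301'}}

    -- Or clear the replication source list entirely:
    box.cfg{replication = ''}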
The following situations are possible.
Situation 1: bootstrap
Here box.cfg{}
is being called for the first time.
A replica is joining but no replica set exists yet.
- Set the status to ‘orphan’.
- Try to connect to all nodes from box.cfg.replication. The replica tries to connect for replication_connect_timeout seconds and, if needed, retries every replication_timeout seconds. Abort and throw an error if the replica is not connected to the majority of nodes in box.cfg.replication.
- This instance might be elected as the replica set ‘leader’. Criteria for electing a leader include the vclock value (largest is best) and whether it is read-only or read-write (read-write is best unless there is no other choice). The leader is the master that the other instances must join. The leader is the master that executes box.once() functions.
If this instance is elected as the replica set leader, then it performs an “automatic bootstrap”:
- Set status to ‘running’.
- Return from box.cfg{}.
Otherwise this instance will be a replica joining an existing replica set, so:
- Bootstrap from the leader. See examples in the section Bootstrapping a replica set.
- In the background, sync with all the other nodes in the replica set.
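For illustration, a minimal sketch of a first-time bootstrap call in Lua (the listen port and URIs are placeholders); after box.cfg{} returns, box.info shows which way it went:

    -- Configure the instance; box.cfg{} returns once the instance has either
    -- bootstrapped as leader, synced with the leader, or fallen back to orphan.
    box.cfg{
        listen = 3301,
        replication = {'127.0.0.1:3301', '127.0.0.1:3302'},
    }
    print('status: ' .. box.info.status)          -- e.g. 'running' or 'orphan'
    print('read-only: ' .. tostring(box.info.ro))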
Situation 2: recovery
Here box.cfg{}
is not being called for the first time.
It is being called again in order to perform recovery.
- Perform recovery from the last local snapshot and the WAL files.
- Try to establish connections to all other nodes for replication_connect_timeout seconds. Once replication_connect_timeout has expired or all the connections are established, proceed to the “sync” state with all the established connections.
- If connected, sync with all connected nodes, until the difference is not more than replication_sync_lag seconds.
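A minimal sketch of an instance file that covers this case, with the relevant timeouts made explicit (the values and URIs are placeholders, not recommendations):

    -- On restart, box.cfg{} first recovers from the local snapshot and WAL
    -- files, then reconnects to the peers listed in 'replication' and syncs.
    box.cfg{
        listen = 3301,
        replication = {'127.0.0.1:3301', '127.0.0.1:3302'},
        replication_connect_timeout = 30,  -- seconds to wait for connections
        replication_sync_lag = 10,         -- acceptable lag, in seconds
    }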
Situation 3: configuration update
Here box.cfg{}
is not being called for the first time.
It is being called again because some replication parameter
or something in the replica set has changed.
- Try to connect to all nodes from box.cfg.replication, within the time period specified in replication_connect_timeout.
- Try to sync with the connected nodes, within the time period specified in replication_sync_timeout.
- If earlier steps fail, change status to ‘orphan’. (Attempts to sync will continue in the background and when/if they succeed then ‘orphan’ status will end.)
- If earlier steps succeed, set status to ‘running’ (master) or ‘follow’ (replica).
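For example, a minimal sketch of such a runtime reconfiguration in Lua (the URIs are placeholders):

    -- Replace the replication source list on a running instance, then check
    -- whether the instance synced or switched to orphan status.
    box.cfg{replication = {'127.0.0.1:3301', '127.0.0.1:3303'}}
    print(box.info.status)  -- 'running' if synced, 'orphan' otherwise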
Situation 4: rebootstrap
Here box.cfg{}
is not being called. The replica connected successfully
at some point in the past, and is now ready for an update from the master.
But the master cannot provide an update.
This can happen by accident, or more likely can happen because the replica
is slow (its lag is large),
and the WAL (.xlog) files containing the
updates have been deleted. This is not crippling. The replica can discard
what it received earlier, and then ask for the master’s latest snapshot
(.snap) file contents. Since it is effectively going through the bootstrap
process a second time, this is called “rebootstrapping”. However, there has
to be one difference from an ordinary bootstrap – the replica’s
replica id will remain the same.
If it changed, then the master would think that the replica is a
new addition to the cluster, and would maintain a record of an
instance ID of a replica that has ceased to exist. Rebootstrapping was
introduced in Tarantool version 1.10.2 and is completely automatic.