Data Safety in RabbitMQ

We will walk through the data safety mechanisms supported in RabbitMQ.

Mnesia is the distributed database system that RabbitMQ uses to store information about queues, bindings, exchanges, etc.

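All queue definitions are stored in this database. A queue marked durable survives a node restart, a system crash or a network failure; the messages in it survive a restart only if they were also published as persistent.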

Clustering with Queue mirroring:

-Join multiple nodes into a cluster, then add redundancy through mirroring, which replicates a queue across multiple nodes.

All reads and writes for a mirrored queue go through its master node.

Mirroring is enabled by setting an ha policy, in one of three forms:

ha-mode: all
ha-mode: exactly, ha-params: 2 (one master and one mirror)
ha-mode: nodes, ha-params: rabbit@node1, rabbit@
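As a rough illustration of the ha policy settings above, the sketch below builds a policy definition and the matching rabbitmqctl set_policy argument list. The helper names, the policy name "ha-two" and the queue-name pattern are assumptions made up for this example, and nothing is executed against a broker.

```python
import json
import shlex


def ha_policy_definition(mode, params=None):
    """Build the definition dict for a mirroring (ha) policy."""
    if mode == "all":
        if params is not None:
            raise ValueError("ha-mode 'all' takes no ha-params")
        return {"ha-mode": "all"}
    if mode == "exactly":
        # ha-params is the total number of replicas (master included).
        if isinstance(params, bool) or not isinstance(params, int) or params < 1:
            raise ValueError("ha-mode 'exactly' needs a positive integer")
        return {"ha-mode": "exactly", "ha-params": params}
    if mode == "nodes":
        # ha-params is a list of node names.
        if not params or not all(isinstance(n, str) for n in params):
            raise ValueError("ha-mode 'nodes' needs a list of node names")
        return {"ha-mode": "nodes", "ha-params": list(params)}
    raise ValueError("unknown ha-mode: %r" % (mode,))


def set_policy_command(name, pattern, definition):
    """Return the argv for: rabbitmqctl set_policy <name> <pattern> <definition>."""
    return ["rabbitmqctl", "set_policy", name, pattern, json.dumps(definition)]


if __name__ == "__main__":
    # "ha-two" and the pattern are hypothetical values for illustration.
    definition = ha_policy_definition("exactly", 2)
    print(shlex.join(set_policy_command("ha-two", r"^ha\.", definition)))
```

Keeping the definition as a dict until the last moment avoids hand-written JSON quoting mistakes on the command line.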

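-The publisher receives a confirmation only once the message has been written to disk (for a persistent message routed to a durable queue) and, for a mirrored queue, accepted by all mirrors.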

-When the node hosting the master dies, the cluster looks for the oldest mirror and promotes it to master.

-The other mirrors then synchronize with the newly promoted master.
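The promotion rule in the list above can be pictured with a small toy model. This is only a sketch of the "oldest mirror becomes master" idea, not RabbitMQ's actual implementation; the class name and the node names other than rabbit@node1 are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MirroredQueue:
    """Toy model: one master plus mirrors kept in the order they joined."""
    master: str
    mirrors: List[str] = field(default_factory=list)  # oldest mirror first

    def add_mirror(self, node: str) -> None:
        # New mirrors join at the end, so index 0 is always the oldest.
        self.mirrors.append(node)

    def node_down(self, node: str) -> None:
        if node == self.master:
            if not self.mirrors:
                raise RuntimeError("master lost and no mirror left to promote")
            # Promote the oldest mirror to master.
            self.master = self.mirrors.pop(0)
        elif node in self.mirrors:
            # Losing a mirror only reduces redundancy.
            self.mirrors.remove(node)


if __name__ == "__main__":
    q = MirroredQueue(master="rabbit@node1")
    q.add_mirror("rabbit@node2")
    q.add_mirror("rabbit@node3")
    q.node_down("rabbit@node1")
    print(q.master, q.mirrors)
```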

Detecting Network Partitions:
While a network partition is in place, the two (or more!) sides of the cluster can evolve independently, with both sides thinking the other has crashed. This scenario is known as split-brain.

A node decides that a peer is down if it is unable to contact that peer for a period of time, 60 seconds by default. If two nodes come back into contact, both having thought the other was down, they determine that a partition has occurred. This is written to the RabbitMQ log in a form like:

=ERROR REPORT==== 15-Oct-2012::18:02:30 ===
Mnesia(rabbit@hostname): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network, hare@hostname}

rabbitmqctl cluster_status also reports the partition, in a partitions entry such as:

# => {partitions,[{rabbit@smacmullen,[hare@smacmullen]},
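For monitoring, the Mnesia log entry shown earlier can be picked out of a log file with a regular expression. The following is a minimal sketch that assumes the log format is exactly the one quoted in this post; the function name is made up for this example.

```python
import re

# Matches the Mnesia event tuple quoted above and captures the peer node name.
PARTITION_RE = re.compile(
    r"\{inconsistent_database,\s*running_partitioned_network,\s*([^}\s]+)\}"
)


def partitioned_peers(log_text):
    """Return peer node names reported in partition events, in order, without repeats."""
    peers = []
    for match in PARTITION_RE.finditer(log_text):
        node = match.group(1)
        if node not in peers:
            peers.append(node)
    return peers


SAMPLE_LOG = """\
=ERROR REPORT==== 15-Oct-2012::18:02:30 ===
Mnesia(rabbit@hostname): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network, hare@hostname}
"""

if __name__ == "__main__":
    print(partitioned_peers(SAMPLE_LOG))
```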

RabbitMQ also offers three ways to deal with network partitions automatically: pause-minority mode, pause-if-all-down mode and autoheal mode. The default behaviour is referred to as ignore mode.
-In pause-minority mode, RabbitMQ will automatically pause cluster nodes which determine themselves to be in a minority.
-In pause-if-all-down mode, RabbitMQ will automatically pause cluster nodes which cannot reach any of the listed nodes. In other words, all the listed nodes must be down for RabbitMQ to pause a cluster node.
-In autoheal mode, the cluster instead decides for itself which side of the partition must throw away its data. This mode is good for availability and low-overhead administration, but potentially worse for data loss.
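The two pausing rules can be expressed as small predicates. This is a sketch of the decision logic only, under two assumptions: that "minority" means half or fewer of the cluster nodes (which is how the RabbitMQ documentation describes it), and that a node counts itself as reachable. The function and node names are hypothetical.

```python
def should_pause_minority(cluster_nodes, reachable_nodes, self_node):
    """pause-minority: pause when this node's side holds half or fewer of the nodes."""
    cluster = set(cluster_nodes)
    side = (set(reachable_nodes) | {self_node}) & cluster
    return len(side) * 2 <= len(cluster)


def should_pause_if_all_down(listed_nodes, reachable_nodes, self_node):
    """pause-if-all-down: pause only when none of the listed nodes can be reached."""
    side = set(reachable_nodes) | {self_node}
    return not any(node in side for node in listed_nodes)


if __name__ == "__main__":
    cluster = ["rabbit@a", "rabbit@b", "rabbit@c"]
    # rabbit@c is cut off from the other two nodes.
    print(should_pause_minority(cluster, [], "rabbit@c"))
    print(should_pause_minority(cluster, ["rabbit@b"], "rabbit@a"))
    print(should_pause_if_all_down(["rabbit@a"], [], "rabbit@c"))
```

Under this reading, a two-node cluster that splits evenly pauses both sides in pause-minority mode, which is why that mode is usually paired with an odd number of nodes.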

System Design | Implementation Rule Book | CheatSheet

Saturday, November 2, 2019
