On 04-09-2008 12:11:49 -0400, McKennirey.Matthew wrote:
We are trying to create a deployment with as much redundancy (failover) as we can. We assume hardware will fail (drives, hardware network interfaces, memory, etc, etc) and we may lose a machine, a switch, etc.
understandable
(As an aside, my understanding is that merovingian starts and monitors mserver5 processes on the same machine. I do not see a way to configure merovingian to start mserver5 processes on other machines.)
(it can't start them there, but it *does* discover neighbouring databases)
Our plan is to use multiple instances of MonetDB (running on multiple machines) each serving an architecturally distinct portion of the system's data such that the failure of one instance would not prevent other parts of the system from functioning. However, we would dearly like to have a failover capability on each instance of MonetDB. Again, only one instance of MonetDB would be interacting with a specific dbfarm and dbname at a time, but if it (or the machine it is on) failed to respond, we would redirect the work to a 'backup' instance on another machine. The merovingian daemon of the 'backup' would be started but there would be no activity until needed.
An interesting opportunity here lies in the merovingian "network". Each merovingian announces its own databases and listens for the announcements of others, which makes remote databases known to the local merovingian. The current branch also has code to list this remote information (instead of having to peek into merovingian's logs). For now the idea is very simple: a database is announced, and every merovingian that receives the message stores it. Each database received this way can be redirected to, and merovingian does that transparently when a remote database name is requested. The rules for "resolving" are simple: first look for a local database, and if none is present, look in the remote list. This remote list can be in any order and can contain duplicates; the first match is taken. Currently no priorities are encoded here. However, it is not hard to imagine a priority scheme (like DHCP authority, or WinNT PDC master negotiation) fitting into this picture. It would allow the same database to be installed on several machines, with the primary always appearing first in merovingian's remote list. A stand-alone merovingian could then perform the fail-over step as soon as the primary drops out.
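To make the resolution rule concrete, here is a rough Python sketch. It is not merovingian's code: the `Announcement` record, its fields, and the `priority` field in particular are hypothetical (merovingian encodes no priorities today), as are the host names and the port used below. `resolve` follows the rule described above; `resolve_with_priority` is one possible shape for the priority scheme, where a caller-supplied `is_alive` check (also hypothetical) stands in for however a dead primary would be detected.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Set, Tuple, Union


@dataclass(frozen=True)
class Announcement:
    """One database announced by a neighbouring merovingian (hypothetical layout)."""
    dbname: str
    host: str
    port: int
    priority: int = 0  # hypothetical: lower value = preferred; not in merovingian today


Resolution = Tuple[str, Union[str, Announcement]]


def resolve(dbname: str, local_dbs: Set[str],
            remote: Iterable[Announcement]) -> Optional[Resolution]:
    """The rule as described: a local database wins, else the first remote match."""
    if dbname in local_dbs:
        return ("local", dbname)
    for ann in remote:  # arbitrary order, duplicates allowed; first match is taken
        if ann.dbname == dbname:
            return ("remote", ann)
    return None


def resolve_with_priority(dbname: str, local_dbs: Set[str],
                          remote: Iterable[Announcement],
                          is_alive: Callable[[Announcement], bool]) -> Optional[Resolution]:
    """Hypothetical priority scheme: the best-ranked remote copy that is still alive."""
    if dbname in local_dbs:
        return ("local", dbname)
    candidates = [a for a in remote if a.dbname == dbname and is_alive(a)]
    if not candidates:
        return None
    # min() returns the first of equally ranked entries, so ties fall back to list order.
    return ("remote", min(candidates, key=lambda a: a.priority))
```

With `remote = [backup, primary]`, `resolve` returns the backup simply because it was heard first. `resolve_with_priority` returns the primary until `is_alive` reports it down, after which it falls over to the backup. That fall-over is the step a stand-alone merovingian could take on its own.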
So I guess the question is: when instance 1 of merovingian or an mserver5 process locks a dbfarm and dbname, when does it release the lock? And if it fails (software or hardware failure), I presume the locks still exist, preventing instance 2 from using that dbfarm and dbname? In which case we are out of luck.
The operating system should release all locks as soon as the program is terminated. The lock is only active as long as the file descriptor is held open, and the OS closes all file descriptors when it cleans up a terminated or crashed process. Locks are not "stored" anywhere on disk either, so a stale lock cannot outlive the process, and that should be safe too.
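To convince yourself of the last point, here is a small Unix-only Python sketch. It assumes the database lock is a POSIX advisory lock (`lockf`) held on a lock file inside the dbfarm. The file name `.lock` and the exact locking call MonetDB uses are assumptions, not taken from the MonetDB sources. A forked child plays the primary mserver5: it takes the lock and is then killed with SIGKILL, so none of its own cleanup code runs.

```python
import errno
import fcntl
import os
import signal
import tempfile
from typing import Tuple


def try_lock(path: str) -> bool:
    """Probe: can this process take an exclusive lock right now? Never blocks."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError as e:
        if e.errno in (errno.EACCES, errno.EAGAIN):  # held by another process
            return False
        raise
    finally:
        os.close(fd)  # closing our descriptor drops any lock *we* took


def lock_released_on_crash() -> Tuple[bool, bool]:
    """Return (lockable while the holder runs, lockable after it is killed)."""
    with tempfile.TemporaryDirectory() as farm:
        path = os.path.join(farm, ".lock")  # hypothetical lock file name
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: stands in for the mserver5 that owns the database.
            try:
                os.close(r)
                fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
                fcntl.lockf(fd, fcntl.LOCK_EX)
                os.write(w, b"x")  # tell the parent the lock is now held
                while True:
                    signal.pause()  # never releases the lock voluntarily
            finally:
                os._exit(1)
        os.close(w)
        os.read(r, 1)  # wait until the child holds the lock
        os.close(r)
        before = try_lock(path)       # the "primary" still holds it
        os.kill(pid, signal.SIGKILL)  # simulate a crash: no cleanup code runs
        os.waitpid(pid, 0)            # once reaped, the kernel has closed its fds
        after = try_lock(path)        # the lock went away with the process
        return before, after
```

Calling `lock_released_on_crash()` should give `(False, True)`. Because `lockf` locks belong to a process, the parent's probe (open, try, close) never disturbs the child's lock. Only the child's death releases it.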