Let's assume you have round-robin DNS set up on domain A, pointing to two nodes that handle load balancing, etc. That's cool until one of your load balancers goes down and you have clients who have cached the IP of the downed node. I assume that within a few hours (maybe less than an hour) you'll be able to either bring that node back up or take it out of the round robin. Why not automate this? Here's my idea:
- Run Uptime on Modulus or AppFog. (Uptime now supports plugins.)
- Add both load-balancers as checks in Uptime.
- Create a plugin which deletes that node's A record in CloudFlare (using their API) if the node goes down.
- If/when the node comes back up, have Uptime make a call to CloudFlare to add back that A record.
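The steps above can be sketched roughly as follows. This is only the decision logic in Python, not an actual Uptime plugin and not CloudFlare's API: `DnsClient`, `InMemoryDns`, `Failover`, and `on_check` are hypothetical names made up for illustration, and the in-memory class stands in for the real API calls. It also assumes you would never want to pull the last remaining A record, which the idea above doesn't address.

```python
from dataclasses import dataclass, field
from typing import Dict, Protocol, Set


class DnsClient(Protocol):
    """Hypothetical interface; a real version would wrap the DNS provider's API."""

    def add_a_record(self, name: str, ip: str) -> None: ...
    def delete_a_record(self, name: str, ip: str) -> None: ...


@dataclass
class InMemoryDns:
    """Stand-in for the DNS provider so the sketch runs without network access."""

    records: Dict[str, Set[str]] = field(default_factory=dict)

    def add_a_record(self, name: str, ip: str) -> None:
        self.records.setdefault(name, set()).add(ip)

    def delete_a_record(self, name: str, ip: str) -> None:
        self.records.get(name, set()).discard(ip)


@dataclass
class Failover:
    """Turns up/down check results into A-record changes."""

    dns: DnsClient
    name: str
    node_ips: Set[str]
    live: Set[str] = field(init=False)

    def __post_init__(self) -> None:
        # Assume every node is in the round robin when we start.
        self.live = set(self.node_ips)

    def on_check(self, ip: str, is_up: bool) -> str:
        if ip not in self.node_ips:
            raise ValueError(f"unknown node {ip}")
        if not is_up and ip in self.live:
            if len(self.live) == 1:
                # Assumption: keep the last record rather than serve no IPs at all.
                return "kept"
            self.live.remove(ip)
            self.dns.delete_a_record(self.name, ip)
            return "removed"
        if is_up and ip not in self.live:
            self.live.add(ip)
            self.dns.add_a_record(self.name, ip)
            return "added"
        # State did not change, so no DNS call is made.
        return "unchanged"
```

The point of tracking `live` is that a DNS call only happens when a node changes state, so checks every few seconds don't turn into an API call every few seconds.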
Although this isn't a true "heartbeat"-style failover, you can have the checks run every 10 seconds, which means that, in theory, the bad IP would be pulled within about that time, which is pretty darn fast.
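To put a rough number on that, here is a back-of-the-envelope calculation. It assumes the failure is caught on the first check after it happens and the record is removed immediately. The DNS TTL is my own addition, since a client that has already cached the record can keep using it until the TTL runs out. The 300-second TTL is a made-up value, not a CloudFlare setting.

```python
def worst_case_stale_seconds(check_interval_s: float, dns_ttl_s: float) -> float:
    """Rough upper bound on how long a client may keep using the dead IP.

    Ignores check timeouts and API latency (assumption for simplicity).
    """
    # Up to one full interval passes before the failure is noticed, and a
    # client that resolved the name just before removal holds it for one TTL.
    return check_interval_s + dns_ttl_s


print(worst_case_stale_seconds(10, 300))  # 310
```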
Do you see any issues with this setup, aside from the obvious ones (the Uptime app crashes [I've been running it for quite some time and it seems pretty stable], or Modulus or AppFog has downtime)?