Auto-negotiation of pool updates in RightScale VMs

August 15, 2012

UPDATE [2012-10-03]: RightScale has ongoing issues with the stability of their command-line tools. In particular:

  • rs_run_right_script sometimes fails with a timeout error. You can’t just retry, because sometimes the script you were trying to queue was queued despite the error, and there’s no way to tell.
  • rs_run_right_script sometimes fails with an error “Failed to process request: Failed to instantiate an executable bundle for script – are all required environment inputs available?” when it should have succeeded. To make matters worse, when this error is displayed, rs_run_right_script exits with a status of zero rather than non-zero, so there’s no way for the caller to reliably detect that an error has occurred.
  • Sometimes the rs_tag command (on which the update mechanism described below depends) fails.
  • Sometimes it can take upwards of 45 seconds after a RightScript is queued before it actually executes. This wreaks havoc on the update mechanism described below, which relies on all of the instances being updated to run a script in close proximity to each other so they can synchronize.

I have been in correspondence with RightScale about these issues, and they have been unable to address any of them. As such, I no longer consider their scripting and tagging platform to be stable enough to use for production updates. I therefore no longer recommend the use of the mechanism described below.


At my current gig, our app’s middleware is hosted on a pool of Amazon EC2 instances managed by RightScale. One of the challenges we face is hardly uncommon in this business: how do we easily and reliably roll out a patch without any downtime?

There are two commonly used strategies. Either you roll out the changes gradually to your existing instances, pulling them out of the pool while they are upgraded and putting them back in afterward, or you gradually replace the old instances with new ones running the new code. The latter is very easy if your pool is auto-scaled… you can just reconfigure the pool so that newly created instances are running the new code, and then gradually terminate the instances running the old code so that they are automatically replaced as needed with new ones.

Some day we’ll be so successful that our pool will be auto-scaled, but for the time being, we’re maintaining it by hand, so updating the instances in place is easier than replacing them. Here’s how we were doing that until recently:

  1. Pull half of the servers out of the pool.
  2. Update the pulled servers.
  3. Put the updated servers back into the pool and pull the remaining servers out.
  4. Update the remaining servers.
  5. Put the updated servers back into the pool.

This cries out for automation. An update should be one step, not five. I want to push a button, sit back, and watch the update happen automatically.

The usual solution to this problem is to write a script or internal web app which uses Amazon or RightScale APIs and/or remote shell commands to enumerate the servers, divide them into batches, and pull / update / push each batch. This works well enough, but it violates one of the cardinal principles of the cloud: decentralization. Is there a way to have the middleware servers negotiate amongst themselves to decide who is going to update and in what order, and then step through the updates automatically, without the involvement of an outside controller? Yes there is, and the key to making it work is RightScale tags, because tags attached to an instance are visible to all other instances in the same deployment.
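To make that concrete, here’s a minimal sketch of the tag visibility the whole scheme relies on. It reuses the same rs_tag query and regular expression as the full script further down; nothing else about it is specific to our setup:

import re
import subprocess

# Query every "pending update" tag in the deployment. rs_tag -q
# reports matching tags from all instances, not just our own, so
# each instance can see which of its peers are waiting to update.
proc = subprocess.Popen(('rs_tag', '-q', 'qexec_update:pending'),
                        stdout=subprocess.PIPE)
output = proc.communicate()[0]
peers = sorted(re.findall(r'"qexec_update:pending=(.*)"', output))
print('Instances waiting to update: %s' % ', '.join(peers))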

Here’s the basic idea:

  1. Launch the update script on all of the middleware instances through the RightScale admin app.
  2. Each instance tags itself with a “pending update” tag.
  3. All of the instances wait for the pending update tags to settle, i.e., for some time to elapse with no new tags being created on any instance.
  4. Each instance tags itself with a sorted list of all the instances being updated, i.e., the contents of all the pending update tags.
  5. All of the instances wait for the update list tags to settle, i.e., for all of the instances to tag themselves with the same list of instances to be updated.

Once the update list tags have settled, the instances have successfully negotiated the update list. There are potential race conditions in this process, but they are unlikely, and all of them result in obvious failure modes (you do need to monitor the update logs; you can’t just launch an update and walk away!) that cause no downtime and can be corrected simply by rerunning the update.
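If the notion of tags “settling” seems abstract, here is the polling idea distilled into a standalone helper. The wait_for_settle function below is only an illustration, not part of the actual script, which inlines equivalent logic with a 30-second quiet period for the pending tags and a 10-second deadline for everyone’s list tags to match:

import time

def wait_for_settle(fetch, quiet_period, poll_interval=1):
    """Poll fetch() until its result has stopped changing for
    quiet_period seconds, then return the settled value."""
    value = fetch()
    last_change = time.time()
    while time.time() - last_change < quiet_period:
        time.sleep(poll_interval)
        current = fetch()
        if current != value:
            value = current
            last_change = time.time()
    return value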

Once the negotiation is finished, all that’s left is for the instances to update themselves, in order, while ensuring that only half of them are being updated at any time.

Without further ado, here’s the code. Read and understand all of it before using it!

#!/usr/bin/python
 
import math
import os
import re
import subprocess
import socket
import sys
import time
 
# If you have Python 2.7 or later, you can use
# subprocess.check_output, but the RightScale image we are
# using has Python 2.6, so I had to write my own wrapper
# around subprocess.Popen for fetching the output of a
# command.
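#
# On Python 2.7 or later you could likely just use the following
# instead (untested in our environment, since we're stuck on 2.6):
#
#     output = subprocess.check_output(cmd, shell=shell,
#                                      stderr=subprocess.STDOUT)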
 
def process_output(cmd, shell=False):
    """Run a command and return its output.
 
    For convenience, if the output is only one line long,
    the final newline is removed. Otherwise, it is
    preserved."""
 
    proc = subprocess.Popen(cmd, shell=shell,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    output = proc.communicate()[0]
    proc.wait()
    if proc.returncode != 0:
        raise Exception('%s returned non-zero status' % cmd)
    lines = output.split('\n', 3)
    if len(lines) == 2 and lines[1] == '':
        return lines[0]
    else:
        return output
 
# Our middleware is called "qexec", so I use
# "qexec_update:pending" and "qexec_update:list" as the
# RightScale tags our instances use for their
# negotiation. Obviously, you can change the tag names to
# anything else you want with a simple find/replace on the
# script. Just make sure you format your tags properly; see
# http://support.rightscale.com/12-Guides/RightScale_Methodologies/Tagging
# .
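#
# As a purely hypothetical example, if two instances named qexec01
# and qexec02 have tagged themselves, the query output would
# contain entries along the lines of:
#
#     "qexec_update:pending=qexec01"
#     "qexec_update:pending=qexec02"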
 
def get_pending():
    output = process_output(('rs_tag', '-q',
                             'qexec_update:pending'))
    pending = sorted(re.findall(r'"qexec_update:pending=(.*)"',
                                output))
    return pending
 
# Trivial class for unbuffering output so that we can
# redirect it into a log file and watch what's happening in
# real time. Thanks to
# http://stackoverflow.com/questions/107705/python-output-buffering
# for the tip.
 
class Unbuffered:
    def __init__(self, stream):
        self.stream = stream
    def write(self, data):
        self.stream.write(data)
        self.stream.flush()
    def __getattr__(self, attr):
        return getattr(self.stream, attr)
 
# Paranoia: Make sure all the external executables we need
# are in our search path.
 
os.environ['PATH'] = '/sbin:/bin:/usr/bin:' + os.environ['PATH']
 
# We use papertrail (https://papertrailapp.com/) for our log
# aggregation. The log file opened below is aggregated to
# papertrail automatically. Adjust as appropriate for your
# environment so that you can easily monitor the progress of
# the update on all of your instances.
 
sys.stdout = Unbuffered(open('/var/log/safe_qexec_update.log', 'a'))
sys.stderr = sys.stdout
 
# The purpose of the following block of code is to
# automatically roll the upgrade through all the servers in
# the pool, such that only half of the servers are being
# upgraded at any given time. To do that, we need to first
# execute a handshake protocol to ensure that all of the
# servers that are upgrading know about each other, and that
# any other servers which try to jump in late in the game
# will realize they're latecomers and give up.  Here's how
# we do that:
#    
# 1. Check if any servers have the tag qexec_update:list. If
#    so, then we missed the boat, so log an error and abort.
# 2. Add the tag qexec_update:pending=[our-hostname] to
#    ourselves.
# 3. Once per second, fetch a list of all pending servers.
# 4. When the server list hasn't changed for 30 seconds,
#    sort and concatenate it into qexec_update:list
# 5. Once per second, fetch the qexec_update:list setting
#    for all servers.
# 6. If it hasn't converged to the same value for everyone
#    within 10 seconds, then log an error and abort.
#
# At this point all of the servers that are being upgraded
# have mutually agreed on who they are and on the order in
# which they will upgrade themselves. Now, here's how we
# decide when it's our turn to upgrade:
#
# 1. Let n be the count of servers in qexec_update:list
#    divided by 2, rounding up.
# 2. Fetch and sort all the existing qexec_update:pending
#    tags.
# 3. If we fall within the first n servers in that list,
#    we're allowed to update. Otherwise, wait 5 seconds and
#    try again.
#
# Once we update successfully, we remove our
# qexec_update:pending and qexec_update:list tags, thus
# cleaning up and allowing additional servers to update
# themselves.
 
if process_output(('rs_tag', '-f', 'text', '-q',
                   'qexec_update:list')):
    sys.stderr.write('Found unexpected qexec_update:list; aborting.\n')
    sys.exit(1)
 
hostname = re.sub(r'\..*', '', socket.gethostname())
tag1 = 'qexec_update:pending=%s' % hostname
print('Adding tag %s' % tag1)
subprocess.check_call(('rs_tag', '-a', tag1),
                      stdout=sys.stdout, stderr=sys.stdout)
 
print('Waiting for pending tags to converge')
changed = time.time()
joined = hostname
# 30 seconds because I've seen delays of nearly that long
# between when a RightScript is queued and when it is
# actually started.
while time.time() - changed < 30:
    time.sleep(1)
    pending = get_pending()
    new_joined = ' '.join(pending)
    if joined != new_joined:
        joined = new_joined
        changed = time.time()
 
tag2 = 'qexec_update:list=%s' % joined
print('Adding tag %s' % tag2)
subprocess.check_call(('rs_tag', '-a', tag2),
                      stdout=sys.stdout, stderr=sys.stdout)
 
print('Waiting for list tags to converge')
converged = False
start = time.time()
while time.time() - start < 10:
    time.sleep(1)
    output = process_output(('rs_tag', '-q',
                             'qexec_update:list'))
    lists = re.findall(r'"qexec_update:list=(.*)"', output)
    if len(lists) != len(pending):
        continue
    if all(l == joined for l in lists):
        converged = True
        break
 
if not converged:
    sys.stderr.write('qexec_update:list failed to converge\n')
    sys.exit(1)
 
# You can adjust this easily if you only want 1/3, or 1/4,
# or whatever, of your instances to be out of the pool at a
# time.
 
# Here's an example of why you need to be monitoring the
# update, not just setting and forgetting it. If the update
# of individual instances is failing for whatever reason,
# then this is going to loop forever.
 
updatable_count = int(math.ceil(len(pending) / 2.0))
while get_pending().index(hostname) >= updatable_count:
    print('Waiting for permission to update')
    time.sleep(5)
 
# Here's where you do your update! Do whatever you need to
# do to pull your instance out of the pool (we use iptables
# to block the port that our front end uses to submit new
# jobs to the middleware), wait for active jobs to finish,
# do the update, and then put the instance back into the
# pool.
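#
# For illustration only, here is a hedged sketch of what this
# section might look like. The port number, drain step, and
# package name are hypothetical placeholders (adapt them to your
# environment), which is why the example is left commented out:
#
# # 1. Pull ourselves out of the pool by blocking the front end's
# #    job-submission port.
# subprocess.check_call(('iptables', '-I', 'INPUT', '-p', 'tcp',
#                        '--dport', '8080', '-j', 'REJECT'))
# # 2. Wait for in-flight jobs to drain, e.g. by polling your
# #    middleware's own status interface.
# # 3. Install the new middleware build.
# subprocess.check_call(('yum', '-y', 'update', 'qexec'))
# # 4. Put ourselves back into the pool.
# subprocess.check_call(('iptables', '-D', 'INPUT', '-p', 'tcp',
#                        '--dport', '8080', '-j', 'REJECT'))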
 
print('Removing update tags')
subprocess.check_call(('rs_tag', '-r', tag1))
subprocess.check_call(('rs_tag', '-r', tag2))
 
# All done!