Reproducing OpenStack Grenade gate job results

stevelle, Wed 15 March 2017, Code

This is just a really quick post to capture something learned.

Imagine your CI job fails. Imagine you can’t tell why easily because there are so many moving parts in the CI. Imagine you are asked to fix the broken build.

This post is about how I got past that.

Step 1: Set the stage

Provision a new VM which could run a gate job.
Prepare it with any personal customizations and tailor to your tool preferences appropriate to a disposable working environment.

Step 2: Prepare the host to emulate the gate

In the VM

cd /opt # or somewhere you want to put stuff
git clone https://github.com/JohnVillalovos/devstack-gate-test.git
cd devstack-gate-test
./vm-setup.sh
su - jenkins

Step 2.5: [optional]

Snapshot the instance here for convenience

Step 3: Execute the reproduction event

In a browser

click the link of the correct gate job from the review you want to reproduce
find the reproduce.sh script in the logs directory
copy link location for this script

In the VM

cd ..
wget [paste-url]
chmod u+x reproduce.sh
./reproduce.sh

Now just kick back and watch the carnage!

Write once, Read and rewrite many

stevelle, Fri 03 February 2017, Code

process, python, refactoring, style

“Indeed, the ratio of time spent reading versus writing is well over 10 to 1. We are constantly reading old code as part of the effort to write new code. …[Therefore,] making it easy to read makes it easier to write.”

—Robert C. Martin, Clean Code: A Handbook of Agile Software Craftsmanship

Most of my time in software development is spent reading and thinking about code [1] rather than writing it. This is not a groundbreaking revelation [2] but it is particularly true for me.

I tend to balance a little more on the side of preferring quality over speed of delivery. That means when I write code I will often want to look at it and shuffle the logic. The editing process is rewarding for me. I thrive on the puzzle, in rearranging the pieces, shaving here and gluing there to make the shape serve this purpose or that one better.

I got to exercise that recently when I was reviewing a change and find the following logic proposed by folks, somewhat in jest: [3] [4]

# Tao of Python says:
#   if the implementation is hard to explain, it's a bad idea
return reduce(lambda x, y: x or y,
              map(lambda x: x.has_migrations(), migrations))

Myself, I like functional style solutions to problems. But in Python it can take a form that looks unlike many others. The lambda syntax is a bit verbose. The primitive building blocks to functional style like map, filter and reduce functions are relegated to the corner and the alternatives are promoted (moved from builtin space to functools in Python 3). All of this holds with the Zen of Python [5].

The map usage here is basic. If you can’t read that the purpose is to generate a list of True of False values, one for each entry in the migrations list. It is the equivalent to either of the following:

# Tao of Python says:
#   if the implementation is easy to explain, it may be a good idea
def like_map(migrations):
  result = []
  for m in migrations:
    result.append(m.has_migrations())
  return result

# Tao of Python says: flat is better than nested
[x.has_migrations() for x in migrations]

But what is that reduce operation doing? To read that at a first pass it helps to have experience working in functional languages in which case you might have seen the pattern. Reduce will take N items as input and reduce them to a single output, in this case True or False. This reduce expression (x or y) then will just or the list returned by the map operation. If any m.has_migrations() for m in migrations, then the final result is True, otherwise False. We could simplify that logic to either of the following:

# Tao of Python says: readability counts
pending_migrations = [x.has_migrations() for x in migrations]
return True in pending_migrations

# Tao of Python says: simple is better than complex
for m in migrations:
  if m.has_migrations:
    return True
return False

Either of these would be a great solution. They are both explicit, simple, and easy to explain.

The first is also flat and readable. The list comprehension [6] is the most complex yet concise element here, but with a basic familiarity with comprehensions it reads very well.

The second relies on syntax that anyone who has done a couple weeks of programming in nearly any language can decipher (I avoided the use of for-else because while it would be technically correct it is an unnecessary use of that language feature and more verbose) but it has non-linear flow control and while the logic is simple it doesn’t convey meaning concisely.

We can do better.

# Tao of Python says: beautiful is better than ugly
return any([x.has_migrations() for x in migrations])

Now the code reads beautifully. If you forgive the syntax and a bit of the dialect of writing software, it expresses an idea simply:

“Does any x has_migrations for each x in migrations?”

“Does any object in this list have migrations to perform?”

When someone comes back to read this, it should take very little time to comprehend regardless of their experience level with the language. When we strive toward any of these last two sets of solutions and use the concise and unambiguous elements of our language we place a lower working memory load [7] on ourselves and others. We are not likely to spend less time reading code, in the end for a few reasons but if we apply this refinement technique in some parts of a project that frees us to focus on the hard parts that really are complex.

[1] MSDN Blogs: What do programmers really do anyway?

[2] MSDN Blogs: Code is read much more often than it is written, so plan accordingly

[3] OpenStack Change-Id: Ie839e0f240436dce7b151de5b464373516ff5a64

[4] This logic is not in a tight loop, and doesn’t operate over large data sets so the concerns of efficiency, performance, or memory optimization are not paramount in this case so I’m not going to mention them.

[5] Python PEP 20 — The Zen of Python

[6] Python PEP 202 — List Comprehensions

[7] Wikipedia: Working Memory

Disposable Development Environments

stevelle, Mon 23 January 2017, Code

environment, productivity, tools

I might be taking things too far with my development environments, but I really don’t like the idea that my development environment might be special. That could mean many things.

Consider the case where my development environment has lots of stuff installed which might not be enumerated in the Developer Documentation or Getting Started guide or README file that makes things work for me but keeps others from being able to quickly have the same success with whatever environment they might be starting from. That is a rotten way to onboard new team members or welcome new contributors, so keeping yourself aware of what it takes to go from zero to developing is important.

I deteste the similar case where my development environment has lots of stuff installed which cause my environment to behave differently from an automated testing environment or any place it might be deployed. There are a lot of ways to get an advantage over this particular gremlin too.

Lastly, I loath the situation where you have to work on a new or temporary device, or you end up having to nuke and start over with a fresh operating system. Lose all produtivity while you install and customize your working environment during production environment outage due to a critical bug one time and you might feel the same.

So I realize that app containers are the hotness, but none of the apps I work on for OpenStack or for Rackspace include manifests for the dominant container orchestration tools. That isn’t to say nobody has run them in Docker, but I’m not really that interested in dinking around with deploying all the various pieces needed and fixing all the broken windows along the way.

Take the Glance project as an example. A typical deploy of Glance requires MariaDB, RabbitMQ, Glance API, Glance Registry, the ability to run Glance Manage and Glance Cache, and possibly also a Glance Scrubber in daemon mode in order to have a complete ecosystem. That is all needed just to use the filesystem storage driver in the container. I don’t really want to maintain 7 different app containers on my development host box (murdering my battery life as they spin up and down). That is neglecting the need to keep 3 versions of each manifest of the deploy tools tailored to the needs of each branch of Glance (master an 2 stable branches) in service at a time as well as having each manifest accomodate the various customizations needed in service configuration, and keep them all in sync.

This is in part why we have Devstack [1] within the OpenStack community, as it provides a ready-to-eat means of deploying and configuing all the pieces to a single [virtual] host. That could be an OS container [2] (such as LXD [3]) as well, but whatever.

I work from any of two different Mac laptops a Windows desktop, or a Linux workstation, but mostly I work from one of the laptops. The churn of builds and package installs is slower locally and kills my laptop’s battery life, so I use virtual machines in the Rackspace public cloud for almost all of my work. But this requires a fair bit of machinery, I want pip to install the right python package versions of the OpenStack and Nova clients, and their prerequisites. I want the right cloud.yml or open.rc file which are used to contain authorization credentials and I want to ensure my SSH private key is used for authentication. And even then, I don’t want to use the OpenStack or Nova client directly, when there are only two to three things I might want different between each virtual machine instance I work from, (name, flavor, image).

So I go one step further. I install VirtualBox and Vagrant directly on the laptop, and I pull down one private git repository in order to get the laptop set up as a development environment. From there it’s as easy as changing directory into the repository and entering one command.

$ vagrant up && vagrant ssh

The git repository has a Vagrantfile which specifies a current distro release to use as a development jumpbox. The provisioning scripting in the Vagrantfile sets up all of the libraries, SSH Agent, and credentials for me under the vagrant user and then pulls down another git repository [4] which contains a few more shortcuts to simplify my work (at time of writing I have a bunch of changes on my private git hosts which I haven’t cloned to github so what’s visible may not even work but I assure you I have a git source which does should the laptop need to be nuked) including setting up my shell, vim, etc. preferences inside the cloud VM.

I can spin up new development environments for any project I want to work on after that, isolating each project along with it’s system and language-specific package requirements, and the language specific tooling. Sometimes that is done with Ansible playbooks, sometimes using project-specific bootstrapping scripts, (all helpfully cloned into the Vagrant VM by the provisioning scripts) from the Vagrant VM.

To recap: I navigate from the laptop where I do most of my work, to a VM on the host, to a VM in the cloud where my workspace lives. It’s a bit convoluted but the battery drain isn’t too bad (compared to just invoking ssh directly from the laptop, which is an option but not always as convenient), all the bits are highly agnostic to host OS, and the steps needed to get myself into a productive mode on any given environment are really minimal and stable.

On a regular basis I seem to blow away the VM on the laptop and rebuild for one reason or another and this has been remarkably stable over time, with only one or two things I tweak every few months as I come up with more customizations or resolve a new issue. Most recently I found that my VirtualBox upgraded to a version more recent than that supported by Vagrant, so I just updated that and everything started to hum again. On the other hand when I end up with any kind of dependency hell on the jumpbox VM it’s never further away than:

$ vagrant destroy -f && vagrant up

All things considered, I could simplify this set up considerably by eliminating the jumpbox VM with the use of a virtual environment to contain the bits needed to connect to the various OpenStack clouds I might operate my development VMs on. The problem there is of course that this sort of refactoring usually happens at highly irregular intervals and I just haven’t found the time [5].

[1]

Devstack

[2]	LinuxContainers.org

[3]

LXD

[4]	github.com/stevelle/instancer

[5]	The cobblers children have no shoes

Making It Right

stevelle, Fri 20 January 2017, Code

ci, integration, openstack, quality, testing, zuul

We broke a downstream project in the last week before their release deadline. This is about making it right for them.

Today members of the Glace contributors team were alerted [1] that we had broken and blocked OpenStackClient with our change [2] to support community images [3]. Folks were already in the process of beginning to diagnose the issue when my day started.

It became clear that we had lots of cooks in this particular kitchen so I moved over to another VM where I was testing changes [4] to the python-glanceclient project related to the community images feature.

A candidate fix for the breakage became available [5] from another contributor. The change was small, one change to logic, and a new functional test. I switched contexts to begin reviewing it and testing it. Zuul [6] was reporting a long queue, as lots of projects are feeling the crunch of the clients release deadline, Ocata-3 milestone and feature and string freezes coming quickly. Because of this I expected Jenkins gates to take their time coming back with a vote so I started automated tests featuring just the additional test and without the community images commit as a baseline. While that was running I started combing over the changes in detail.

I got distracted from this by an email that came in signaling something in Rackspace that needed me to respond quickly. That dispatched I returned to see that the baseline testing showed the additional test should pass before the community images feature, which is what was expected. I smashed the keys a bit and started a test run to test the fix with the community images feature. The work email from before resurfaced again, and so I hammered out another response, caught up on IRC traffic, and returned to that code review.

When I completed the review, I tabbed over to look at the tests again and as hoped everything passed. I updated folks on IRC and we proceeded to tag the as ready to be merged. All the while, Zuul was busy grinding away on that backlog of work before it could start the first pass of testing for this fix.

Fast forward

Jenkins reports that the fix failed to pass the multiple scenarios in testing. Inspection reports a few broken functional tests, 23 in all. What did I do wrong? At this point I see the same set of tests failing under different scenarios, and those tests are all related to the v1 API. That’s why I know I screwed up the review and approved a bad patch. My suspicion is that I made a mistake somewhere in set up of that second test run.

At this point, it’s more important to get things fixed because downstream projects are still broken. It’s a Friday afternoon and I’m sure at this point we have shot any productivity down there but nobody wants to come in to find the world broken on a Monday morning on the week of a deadline, so I’m expecting to kill non-critical distractions and get into it again. I know I’m not alone in the desire to do that, but the original author of the fix has finished up his workday and checked-out so we have one less core, and one less set of eyeballs familiar with the context.

The failing tests are all functional tests, and all focused on the v1 API. The failures seem to highlight a failure to create images (note: the candidate fix that causes the failures was addressing an issue with updating images). That leads me to suspect that the issue is with the specific change of the candidate fix. I started combing through the file (related to the sqlalchemy db storage engine, as opposed to the simple db storage engine which is the only other concrete db engine supported. The change which broke things only touched the sqlalchemy engine code, and the sqlalchemy engine code is specifically exercised by some if not all of the failing tests, so that helps me choose which engine to fix, but I inspected both as a means of contrasting them. I’m staring at a function that runs if something is true, and some other thing that happens otherwise… the pieces are coming together.

The important point here is that clues about the scope of the breakage are invaluable in pinpointing potential causes. Identifying what is common among a set of tests which break is a helpful step. In this case it was many tests breaking because they all used a common bit of code to create an image in Glance as part of setting the tests preconditions.

The whole thing started because there was a gap in the existing automated testing, both functional and unit testing could have identified the problem with the initial community images implementation but code coverage is always a balancing act. The more tests you have, the safer you might be but time to imagine them is limited when there are other features or bugs you could spend that time on. And in any long-lived project the coverage you have tends to never be good enough. I find it helpful to just keep my expectations low and hope that the tests will catch stuff but never be surprised when a gap is discovered. Expect that you are working with incomplete information.

The original candidate fix was crafted with the help of a functional test which was used to first model the bug before the cause was identified. This is a great way to begin pinning down the problem, not only because it allows you to capture the information from a bug report but it gives you leverage toward identifying the cause as you can quickly drive toward the point of the failure by running the test, combined with debug breakpoints or trace logging as breadcrumbs leaving a trail through the code. Finally having that test allows you to verify your fix and address the gap in test coverage gives confidence that you won’t have to repeat this work later.

One final lesson is to be found in investigating why my testing of the first candidate fix passed when they failed in Jenkins. For that I had to inspect my shell history. The gist of it is that when I was juggling states with git I ended up with an incomplete application of the candidate fix in my workspace. Specifically I ended up without the change to the sqlalchemy engine at all. In that case, good git habits and workspace hygiene is important. Managing distractions is the other side of this because the point where I was interrupted by email the first time is when I made my mistake in setting up the workspace for the second run.

The fix is merged and the downstream projects are unblocked. Maybe Monday will be less exciting.

[1] [openstack-dev] [osc][openstackclient][glance] broken image functional test

[2] OpenStack Change-Id: I94bc7708b291ce37319539e27b3e88c9a17e1a9f

[3] Glance spec: Add community-level image sharing

[4] OpenStack Change-Id: If8c0e0843270ff718a37ca2697afeb8da22aa3b1

[5] OpenStack Change-Id: I996fbed2e31df8559c025cca31e5e12c4fb76548

[6] OpenStack Zuul — common continuous integration service

stevelle.me

Recent Posts

Reproducing OpenStack Grenade gate job results

Write once, Read and rewrite many

Disposable Development Environments

Making It Right

Fast forward