Developing and deploying Python in private repos

February 12, 2018
Engineering

At Hosted Graphite, most of our deployed services are written in Python, and run across a large installation of Ubuntu Linux hosts.

Unfortunately, the Python packaging and deployment ecosystem is something of a tire fire, particularly if your code is in private Git repositories. There are quite a few ways to do it, and not many of them work well.

This post tells the story of what we have tried, where we are now, and what we recommend to programmers in a similar situation.

TL;DR

  • Avoid ad-hoc solutions. Everything you do that is weird—from the mainstream Python packaging point of view—may lead to hours of debugging dependency issues.
  • Consider using virtualenvs as the deployable objects for each service you run. Package and ship them to target systems however works best for you.
  • Consider building a local PyPI so you don’t have to get into the business of having strong git+ssh dependencies. That’s a fairly big step though (and a new service to run), so if you’re using git+ssh dependencies for libraries, try to stick to properly written setup.py rather than requirements.txt, and read up on the experience of others.

Swimming upstream

Upstream
Image source: https://flic.kr/p/55HmEN (CC BY 2.0)

Python vs. the monorepo

In the Before Times, we had a monorepo, maintained in GitHub. Separate projects lived somewhere within that source tree, and we manually built packages by specifying a debian/ directory for each. This told Ubuntu where to put our code and what other packages it depended on.

Later, we started to use stdeb to automatically build Debian packages, with dependencies being inferred from install_requires in each project’s setup.py.

Possibly our most ad-hoc system was used for deploying our web projects: basically just a “git pull” + a horrible Puppet "god module" to install Python package dependencies. While it was a configuration management headache on a number of occasions, this worked surprisingly well.

Next, we introduced CircleCI for continuous integration. This clarified—or forced—an issue we had been struggling with for a while: the monorepo was a disaster because

  • each continuous integration cycle built and tested everything, which became painful as our team grew, and
  • the open source Python tooling ecosystem was built around the idea of “one repo per child” and didn’t support our weird use-cases well.

So, we broke our projects out into separate repos, depending on each other, and got them building and testing on each commit. That highlighted another problem with Python tooling: private Git repository dependencies – particularly transitive – are not well supported.

Python vs. the private repo

We had issues where one package we were building downloaded a project of the same name from PyPI rather than our private Git repo as specified in setup.py’s dependency_links.

This is pip brokenness: even with --process-dependency-links specified, for example in requirements.txt, you can still get version conflicts between PyPI and your dependency_links. See https://stackoverflow.com/questions/11032125 and the sad cluster of similar questions.

We were able to work around this in that case by getting pip out of the build flow and relying on python setup.py install, which managed dependency_links properly.
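To make the moving parts concrete, here is a rough sketch of the kind of arguments such a setup.py carries. Every project name, URL and version below is hypothetical, and the exact link form accepted has varied between tool versions. The point is that a pinned name in install_requires only resolves to the private repo if a dependency link advertises a matching #egg= fragment; anything else is looked up on the package index by name.

```python
# Keyword arguments of the kind a setup.py would pass to setuptools.setup().
# Every name, URL and version here is hypothetical.
SETUP_KWARGS = {
    "name": "example-service",
    "version": "0.1.0",
    "install_requires": ["example-utils==1.2.0"],
    "dependency_links": [
        "git+ssh://git@github.com/example-org/example-utils.git"
        "@v1.2.0#egg=example-utils-1.2.0",
    ],
}


def egg_names(dependency_links):
    """Return the '<name>-<version>' egg fragments declared by the links."""
    return [
        link.rsplit("#egg=", 1)[1]
        for link in dependency_links
        if "#egg=" in link
    ]


def unmatched_requirements(kwargs):
    """List requirements that no dependency link claims to provide.

    These are the ones an installer would have to find on the package
    index by name, which is where a same-named public project can sneak in.
    """
    eggs = set(egg_names(kwargs.get("dependency_links", [])))
    missing = []
    for requirement in kwargs.get("install_requires", []):
        name, _, version = requirement.partition("==")
        if f"{name}-{version}" not in eggs:
            missing.append(requirement)
    return missing
```

Calling unmatched_requirements(SETUP_KWARGS) on the sketch above returns an empty list; adding an unpinned "requests" to install_requires would make it show up in the result, since no link provides it.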

Later though, an apparently unrelated change completely broke our builds with transitive dependencies – another straw for the camel’s back. :o)

See also the discussion / private repos support group in https://github.com/pypa/pip/issues/2124.

Python vs. system packaging

At the same time, with more people making changes to more services, we were struggling with Debian package dependency issues among our projects: since they were installed into the “system Python”, upgrading a particular dependency (e.g. a utility library) on a host could affect all services running there, even though the change might be targeted at only one.

A still and silent pool

A still and silent pool
Image source: https://flic.kr/p/c5F5hE (CC BY 2.0)

Problems and solutions

At this point we had identified two main sources of friction in our build / test / deploy cycle:

  1. We were depending directly on private GitHub repositories (“git+ssh://...” deps) in setup.py;
  2. We were building each library or deployable as a separate Debian package, deployed into the system Python.

virtualenv to the rescue for the latter problem: it is the way to build a “somewhat hermetic” environment with a particular set of Python package versions. Combined with dh_virtualenv we can build deployable, self-contained, versioned Debian packages for each service we run. In the future we expect to do basically the same thing with containers.
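As a minimal sketch of the per-service environment idea, the snippet below uses the standard library's venv module as a stand-in for the virtualenv tool (an assumption made to keep the example dependency-free; it is not how dh_virtualenv is invoked). The directory name is hypothetical.

```python
import os
import sys
import venv
from pathlib import Path


def create_service_env(env_dir: Path) -> Path:
    """Create an isolated environment and return the path to its interpreter.

    with_pip=False keeps the sketch fast and offline; a real build would
    bootstrap pip and install the service's pinned dependencies into it.
    """
    builder = venv.EnvBuilder(
        clear=True,                   # start from an empty directory
        with_pip=False,
        symlinks=(os.name != "nt"),   # symlink the interpreter on POSIX
    )
    builder.create(env_dir)

    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"
```

Each service gets its own directory from create_service_env(Path("example-service-env")), so upgrading a library inside one environment leaves every other service on the host untouched.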

To solve the setup.py dependency problem, we needed to get out of the business of using “git+ssh://...” deps entirely. To do this, we set up our own private PyPI server using pyshop, which has the access control and caching features we wanted. Where we have wanted new features, they’ve been easy to add.

State of the (imperfect) art

We now have the following setup for our main and any new projects:

  • We develop each project in a private GitHub repository;
  • Each project may be
      • a library that acts as a dependency for others, or
      • a deployable, e.g. a service we run on our machines;
  • Each is built and tested under continuous integration;
  • The “deploy” phase of our continuous integration system, for a green build on master
      • uploads libraries to our private PyPI service, or
      • uploads deployables as dh_virtualenv packages to our private Debian package repository.

Developers need to (one time) set up ~/.pip/pip.conf and ~/.pypirc such that they can access pyshop for package dependencies, using read-only “installer” credentials.

The CI system does the same, but with read-write credentials so it can upload packages.  
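For readers who have not written these two files before, here is a hedged sketch that renders their contents with the standard library's configparser. The host name, the "internal" repository label, the credentials and the /simple/ path are all assumptions for illustration; check your own index server's documentation for the URLs it expects.

```python
import configparser
import io
from urllib.parse import quote


def render_ini(sections):
    """Serialise a {section: {key: value}} mapping as INI text."""
    # interpolation=None so that '%' in a password is taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_dict(sections)
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


def pip_conf(index_host, user, password):
    """Contents for pip.conf: point pip at the private index."""
    # Credentials embedded in a URL must be percent-encoded.
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    url = f"https://{credentials}@{index_host}/simple/"
    return render_ini({"global": {"index-url": url}})


def pypirc(index_host, user, password, repo_name="internal"):
    """Contents for .pypirc: where uploads go, and as whom."""
    return render_ini({
        "distutils": {"index-servers": repo_name},
        repo_name: {
            "repository": f"https://{index_host}/simple/",
            "username": user,
            "password": password,
        },
    })
```

A developer would write pip_conf(...) with the read-only credentials, while CI would additionally write pypirc(...) with the read-write ones. Both files contain secrets, so they should be created with restrictive permissions.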

setup.py for any particular project can specify install_requires etc. as if it were using upstream PyPI.  

We build new projects off a “base project” template for cookiecutter that includes all of the CI and project boilerplate, so engineers just have to set up a few credentials in each new CI build.

The base template includes a Makefile that does local builds using the same Docker image that CI uses.

Shippily ever after

Houseboat
Image source: https://flic.kr/p/21tyusq (CC BY 2.0)

So after all, we’re in a much better place! If you’re curious about other examples, Nylas’ experience report reaches a similar conclusion.

Caveats

We still burn the occasional hour or three on a dependency issue. For example, we’ve found we sometimes need to tell our dh_virtualenv packages that they depend on a new non-Python dependency, like a MySQL library version.

Pyshop is not (yet) perfect. We’ve seen issues when it’s operated as a read-through cache. It can time out while downloading the metadata for e.g. a popular project with many versions on upstream PyPI, like requests. However, when this gets annoying enough, we will fix it. :o)

setuptools doesn’t handle .pypirc quite right, see here. Generally we use

$ pip install -e .

to get all the deps for a particular project and work on it.

One potential gotcha that we haven’t explored a lot is changing a library and then needing to rebuild all services that depend upon it. While awkward, this is a necessary outcome of the intended result—independent dependency management for each service. We see this as a problem with the tooling around our CI rather than packaging per se.
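One way such CI tooling could work out what to rebuild is to walk the dependency graph backwards from the changed library. The sketch below is an illustration under assumed inputs, not something we run: the project names are hypothetical, and the requires mapping is presumed to have been collected from each project's install_requires.

```python
from collections import defaultdict, deque


def services_to_rebuild(requires, deployables, changed):
    """Return the deployables that depend, directly or transitively, on `changed`.

    requires    -- mapping of project name to the internal projects it requires
    deployables -- set of project names that are shipped as services
    changed     -- name of the library that was just modified
    """
    # Invert the graph: for each project, who depends on it?
    dependents = defaultdict(set)
    for project, deps in requires.items():
        for dep in deps:
            dependents[dep].add(project)

    # Breadth-first walk from the changed library through its dependents.
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for parent in dependents[current]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)

    return sorted(p for p in seen if p in deployables and p != changed)
```

With a graph where "web" requires "example-utils" and "ingest" requires "example-metrics", which in turn requires "example-utils", a change to "example-utils" yields both "ingest" and "web", while a service that only uses third-party packages is left alone.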

If you have any questions or corrections on this post, we’d love to hear from you – get in touch or send us a tweet.

 

This post was written by Cian Synnott, Programmer at Hosted Graphite.

