Thursday, February 17, 2011

Python subprocess vs os.popen overhead

Let's say you're writing a Python program and you want to run an external command and read its output. The right way to do that used to be:

 import os
 p = os.popen("command")
 output = p.read()

But there were a lot of ways to run programs depending on what kind of output you wanted to read (if any), what kind of control you wanted (if any), and so on. Thus was born the subprocess module. In the current (2.7) documentation for the os module, there's a note on popen:
Deprecated since version 2.6: This function is obsolete. Use the subprocess module. Check especially the Replacing Older Functions with the subprocess Module section.
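
For the three-line popen example above, the pattern that section recommends looks roughly like this (my paraphrase of the documented replacement, not a verbatim quote):

 import subprocess
 p = subprocess.Popen("command", shell=True, stdout=subprocess.PIPE)
 output = p.communicate()[0]
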
Well, that's pretty definitive, right? Unfortunately, not so much.

Typically, process creation overhead doesn't matter a great deal. If a program needs to run another program, then the startup time involved in the creation of the process is probably an order of magnitude (often several) less than the time the new program takes to do its work. So you typically only care about process creation overhead when you're creating a very large number of children.

Unfortunately, I work in the world of system monitoring, and in that world, creating a few hundred or a few thousand programs a second during peak times (subordinate monitoring tools) is not a rarity, and even when the load is much lighter, large amounts of process creation overhead aren't always ignorable. For example, if your system is doing a lot of I/O, then large memory operations during process creation might reduce the amount of caching the system can do.

All of these factors led me to test the subprocess module against os for a simple case: I want to run a process under a shell with its standard output captured. With the os module, my test looked like this:

 import os
 # spawn 10,000 short-lived shells via os.popen, discarding the output
 for i in range(10000):
    f = os.popen("exit 0")
    f.close()

With the subprocess module, my test looked like this:

 import subprocess
 # spawn the same 10,000 short-lived shells via subprocess.Popen
 for i in range(10000):
    f = subprocess.Popen("exit 0", shell=True)
    f.wait()
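
For what it's worth, the two loops can also be compared from a single script with the timeit module. This harness is just a sketch with illustrative helper names; the numbers below came from timing the two scripts separately:

 import os
 import subprocess
 import timeit

 def via_os_popen():
    f = os.popen("exit 0")
    f.close()

 def via_subprocess():
    f = subprocess.Popen("exit 0", shell=True)
    f.wait()

 # time 10,000 child processes for each approach, matching the loops above
 for name, func in [("os.popen", via_os_popen), ("subprocess", via_subprocess)]:
    print name, timeit.timeit(func, number=10000)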

I timed the two scripts and came to a surprising conclusion: subprocess has about 40% more process creation overhead than os.popen! That's an awfully large increase, so what could be going on? My next step was to use strace to determine what could be taking up that extra time. Here's a partial strace of the subprocess example under Linux:

pipe([3, 4])                            = 0
fcntl(4, F_GETFD)                       = 0
fcntl(4, F_SETFD, FD_CLOEXEC)           = 0
clone(...) = 3306
close(4)                                = 0
mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f358acc6000
read(3, "", 1048576)                    = 0
mremap(0x7f358acc6000, 1052672, 4096, MREMAP_MAYMOVE) = 0x7f358acc6000
close(3)                                = 0
munmap(0x7f358acc6000, 4096)            = 0
wait4(3306, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 3306

The os.popen example had one fewer fcntl and an extra fstat, both of which are fairly lightweight. The real culprits here are the mmap, mremap and munmap calls that subprocess is doing. Why are those there, I wondered. Their sizes line up with the 1048576-byte read on the pipe, so subprocess appears to be setting up and tearing down a roughly one-megabyte buffer for every child; looking at the subprocess code, this may be part of the machinery subprocess uses to manage reads and writes on child input and output, but I'm not sure. What is clear is that the subprocess module is about 1,300 lines of Python, while os.popen is a builtin supplied by the interpreter.

Conclusion

subprocess is a valiant attempt to turn a complex snarl of library calls into a uniform tool. The problem is that process creation is one of the most fundamental operations a language performs, and when a simple task like running a child process and reading its output becomes too heavy, the language suffers for it. Perhaps subprocess should be simplified and its convenience routines rewritten as low-level operations that are optimized per platform. Or perhaps os.popen should be undeprecated. After all, I'm willing to bet that managing fork, pipe and exec operations from Python will never be as low-impact as calling the C library popen(3) function.
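
To make that last point concrete, here is a stripped-down sketch of what a popen-style call looks like when the fork, pipe and exec plumbing is done by hand in Python. It is purely illustrative (the function name is made up, and it has none of the error handling that os.popen or subprocess provide):

 import os

 def crude_popen(command):
    # hand-rolled pipe/fork/exec: roughly the work any pure-Python
    # popen replacement has to do for every child it starts
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # child: point stdout at the pipe, then become the shell
        os.close(read_fd)
        os.dup2(write_fd, 1)
        try:
            os.execv("/bin/sh", ["sh", "-c", command])
        finally:
            os._exit(127)  # only reached if the exec itself fails
    # parent: hand back a file object for the read end of the pipe
    os.close(write_fd)
    return os.fdopen(read_fd), pid

 f, pid = crude_popen("exit 0")
 output = f.read()
 f.close()
 os.waitpid(pid, 0)

Even this toy version makes a pile of Python-level calls per child; subprocess layers argument handling, file descriptor management and exec-failure reporting on top of that, largely in interpreted code, while popen(3) does its equivalent work in a single C library call.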