Sunday, February 21, 2010

Perl and Python documentation or Pydoc considered harmful

If you've read my programming posts before you know that I've transitioned from being mostly a Perl programmer to mostly a Python programmer (and from C to Perl before that). For the most part, this has been a painless transition. Python tends to take itself too seriously (which leaves me thinking: odd choice of name), but other than that it's a good language. Perl has its strengths, to be sure, but so does Python. One area, however, that causes me no end of frustration is reading Python module and program documentation. Coming from Perl, as I do, I'm used to flowing, descriptive documentation, the crafting of which is as much a part of the authoring of a piece of public code as the source.

In the Python world, it's another story. There are projects with excellent documentation like django, but I've never seen one that used the built-in documentation system in Python that was worth the time it took to create.



In Perl, the WWW::Mechanize module is used for Web scraping, automation and testing. Its documentation is nearly perfect, walking the user through its use by example and as a reference document at the same time.

In Python, I just came across scrape.py, which is meant as a replacement for Beautiful Soup. Beautiful Soup's documentation is excellent; though not as good a reference document as WWW::Mechanize's, it's still a very good document for its primary purpose: explaining how to use the library. scrape.py's documentation, on the other hand, is terrible. It documents everything with equal weight and has no flow at all. What's the difference between Beautiful Soup and scrape.py? scrape.py uses Python's built-in documentation system (source code download link) while Beautiful Soup uses HTML.

So, do Perl's POD documentation and HTML have some hidden advantage over Pydoc or is this just a cherry-picked comparison of unequal documentation styles? In my experience, thus far, it's the former. Python documentation comes in two flavors: pydoc and useful. I've yet to see the exception, though I'm sure someone, somewhere has used Pydoc to create some beautiful and useful documentation (just as someone used staples to create beautiful art on a wall).

What makes Pydoc so bad? A few things. First of all, it's not free-form at all. It's a tool for documenting code for maintenance programmers to read (not shocking, given the motivation behind writing Python in the first place). This means that the author never has to stop to transition from "coding mode" to "documentation mode" and documentation ends up reading like source. Frankly, if I wanted to read source code, I would. What I really need when I'm going to the docs is for the programmer to put the code down, take a deep breath, and think about what their software means to the user. Perhaps I'm just a bad programmer, but I can't do that while I'm coding.

OK, so I'm advocating for going back and writing the documentation once the code is done. I can do that in Pydoc, right? Well... yes, but then you get to the second problem: flow. Read WWW::Mechanize's documentation. It begins with some examples, then gets into how to instantiate the core class using new(), then touches on a related startup method, and then talks about a stand-alone helper function that you can use in conjunction with that method. This is documenting the use of the library, not the use of a specific class. The code might be organized very differently, but the documentation is laid out as it makes sense for the audience, not the source.

In Pydoc, documentation is data that lives in the code. It is fundamentally tied to the source in a way that prevents you from making any structural changes to the resulting document without also making those changes to the source. This means that you have to risk damaging the functionality of the program (and, of course, damage much of the source control history of the source) in order to change the flow of the documentation. That results in a high degree of documentation inertia.

You also cannot easily transition from reference style to descriptive style of documentation. If, for example, you wanted to have a section of a security module that discussed the statistics behind its choices, you might want that to be introduced after the general documentation about the module's use and its primary classes or functions. However, after that, you might then delve into deeper topics like helper functions/methods and other topics that are important, but certainly less weighty than the first two sections. In Pydoc you cannot do this. Documentation is sectioned in a way that the programmer cannot control.
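
To make the sectioning point concrete, here is a minimal sketch using the standard pydoc module. The Scraper class and its methods are hypothetical, invented only for this illustration. It renders the help text for a small class so you can see where pydoc puts the free-form docstring relative to the per-method docs.

```python
import pydoc


class Scraper:
    """Hypothetical scraper class, used only to show pydoc's layout.

    This free-form paragraph is the one place for narrative documentation.
    """

    def fetch(self, url):
        """Fetch a page (stub; no network access happens here)."""
        return url

    def parse(self, text):
        """Parse fetched text (stub)."""
        return text.split()


# render_doc returns the help text as a string; plain() strips the
# overstrike characters pydoc uses for bold in a terminal pager.
rendered = pydoc.plain(pydoc.render_doc(Scraper))
print(rendered)
```

In the output, the class docstring comes first and the method docstrings follow under a heading that pydoc generates itself ("Methods defined here:"). Nothing in the docstrings lets the author put a narrative section between two methods or after the reference material.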

Pydoc has one thing going for it: everything ends up in the documentation because source code auto-generates the framework for the documentation. This is a good thing, and I'd like to see a way to bring the two styles together, but as it stands I always dread having to look at the documentation for anything in Python unless I already know that it's been done using an external tool. That really should not be the case.

On a side note, here's a fun trick to try. To get documentation on a specific builtin function in Perl, do this:
perldoc -f sort
Try the equivalent in Python:
pydoc -k sorted
I'm sure I'm missing some trick, but on my system the latter results in a traceback which kindly explains that "'NoneType' object has no attribute 'get'" which, while technically true, is just about the worst excuse for user feedback I've ever seen. I'll probably need to write a followup article at some point on why tracebacks are destructive and rainforest clear-cutting.
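
For what it's worth, here is a small sketch of one way to get at the documentation for a builtin from inside Python, using the standard pydoc module. It assumes a reasonably current Python 3; the behavior on the interpreter that produced the traceback above may well differ. Note that the -k flag asks pydoc for a keyword search across module synopsis lines, which is not the same as looking up a single name.

```python
import pydoc

# render_doc builds the same text that help(sorted) would page through;
# plain() removes the terminal bold (overstrike) characters.
text = pydoc.plain(pydoc.render_doc(sorted))
print(text)
```

This is closer in spirit to perldoc -f sort than a keyword search is, though it still returns reference text rather than anything descriptive.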

8 comments:

  1. Aaron, nice article. A few quickie comments. Sorry for not responding more fully.

    doc for a specific definition is help(definition). e.g., import os; help(os.system)

    The comparison with POD doesn't work well for me since I don't know POD format. Can you associate doc with definitions as well as with the module itself? In Python you can write free-form doc in your module-level pydoc and that will show up in addition to the pydoc for specific definitions. e.g., help(os)

    It is definitely sad that the built-in HTML generation in pydoc is so bad. Though I haven't really checked it since Python 2.4 since we use epydoc. epydoc was better at the time we switched.

  2. Carl, I do understand that associating code with documentation in a programmatic way has value. That was a bit of where I was going in my comment toward the end. That said, I don't think it's worth sacrificing the quality of the overall documentation.

    As far as being able to write a free-form blurb at the start, that's great, but it means that you have, at best, a documentation part and a reference part which have a strict and unmodifiable relationship to each other.

    It's sad that the only practical solution right now is to ignore pydoc and write free-form documentation, but look at all of the python projects that do just that. They throw up a pro forma pydoc somewhere and ignore it while generating their real documentation in some other medium, entirely dissociated from the code. Maybe that's the right way to go. Maybe if the documentation ends up being so much better when a few loose conventions are applied to free-form text, then we should just stop trying to associate our documentation with our source code.

    It's just that I know the programming by contract guys are going to be really mad ;-)

    PS: Never read your own essay in Tracy Jordan's voice from 30 Rock. The results are very much unflattering....

  3. It sounds like in addition to the tool itself being deficient in various ways, there's a social problem afoot. People are using the fact that they get pydoc 'for free' as an excuse to write API level docs for their code without giving any thought or putting any effort toward enabling humans to actually *use* their work.

    I've seen this in the Java and Ruby camps as well, though from what you say rdoc and javadoc are better tools for the same job than pydoc is.

  4. I suppose this is no different from the traceback problem in Python, since you put it that way, Chris. Traceback support by default seems like a good idea, but then people get lazy and every error message is accompanied by a page of stack frames.

  5. Exactly! This exact same problem exists in JavaSpace as well. People for a long time thought that a stack trace was a fine substitute for a human-readable error message.

    I'm always a big fan of including the stack trace *somewhere* where geeks who are so inclined can find it, but there's nothing quite like trying to explain what's happening to my wife when she gets a faceful of stack trace out of a web app she's using.

    (Channeling Crash)Fire all the bad programmers!(/Crash)

  6. Good article

    The most important place for documentation is interfaces. It is probably harmful to most people to generate documentation for internal classes. Increases the noise.

  7. it's good to see this information in your post, i was looking the same but there was not any proper resource, thanx now i have the link which i was looking for my research.

  8. I've just come across your article while searching the web for ways to get around the strict formatting of pydoc. I'm trying to document a command-line tool I'm writing, which is not a use that seems well supported by pydoc. I'm disappointed to hear that I'm searching in vain... I really wish I had something more like POD at my fingertips for this.
