Sunday, January 9, 2011

Evaluating the "readability" of programming languages

As a programmer who works in Perl (among many other languages), I've often entered into the debate over what "readability" is. People who don't program in Perl, and many who do, find it difficult to read. The problem is that those who know the language typically don't mean the same thing at all when they say, "Perl is unreadable," as those who don't know the language. Perl isn't unique in having readability concerns directed at it. C, C++, Lisp, FORTRAN, Haskell, Ruby, PHP, Bourne Shell and friends, AWK, and dozens of others have had the same complaints leveled against them, and in many cases, the same dichotomy exists between those who know the language and those who don't.

So, if you don't program in a language, what makes it "readable"? Presumably, it "reads" as if it were a language you know. For example, if many of your operators are English words, and you seek to minimize the amount of punctuation in a language, then people who don't know the language might feel more comfortable with it than if you repurpose every bit of punctuation that one particular extended keyboard symbol set provides (yes, I'm looking at you, APL). Now, once you learn that symbol set, of course, you have an entirely different problem on your hands: can you understand the code that you can now map to a sequence of operations in your mind?

It turns out that this latter question is where many users of Perl over the years have staked out a more sophisticated set of concerns. For example, the language provides no fundamental object model, just the tools with which to build one. So, when you see code that uses objects, you have to wonder, "what is this doing?" When you have to ask that, then I think it's fair to criticize code as "unreadable."

On the other hand, I've had a debate with someone recently about this snippet of Perl 6:

 1, 1, *+* ... *

This is the infinite Fibonacci sequence in Perl 6, and while it's an obscure-looking thing to those who don't work with Perl 6 on a regular basis, to someone who knows the language, there is no ambiguity at all. You never find yourself wondering what the hidden mechanics are here, because there are none. It's simply a series of numbers, generated by adding the previous two values together to get the next value.
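
For readers who don't know Perl 6, roughly the same lazy sequence can be sketched in Python. This is an analogy rather than a translation: the name `fibonacci` is my own, and Python has to spell out the bookkeeping that `*+*` and `... *` express directly.

```python
from itertools import islice

def fibonacci():
    """Yield the Fibonacci sequence forever, starting 1, 1."""
    a, b = 1, 1          # the two seed values, like "1, 1" in the Perl 6 version
    while True:          # no upper bound, like the trailing "*"
        yield a
        a, b = b, a + b  # next value is the sum of the previous two, like "*+*"

# The generator is infinite, so take only as many values as needed.
first_ten = list(islice(fibonacci(), 10))
print(first_ten)  # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```

Like the Perl 6 version, nothing is computed until a value is asked for.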

Perl 6 has its problems with readability, to be sure, but until we have a very large base of programmers using it on a regular basis, I don't know that we'll have a clear handle on what those are. I think the adverbial tagging of expressions will either end up being a boon or a substantial hindrance to readability, for example. The ability to build mini-languages à la Lisp is extremely powerful, but to say that it's open to levels of abuse that could make the International Obfuscated C Code Contest look like a poetry slam is an understatement of epic proportions. Will this be a problem in reality? Maybe.

Moving past Perl, however, I'd love to have a generic set of metrics to apply to any language so that we can understand what it is that we're talking about. I see five of them.

First, a definition: a "symbol" is any sequence of one or more characters that has a defined meaning in the language. In Perl, for example, "my", "+" and "die" are all symbols. The name of a user-defined function is not a symbol because its meaning is not part of the definition of the language.

Granularity - This is a measure of code abstraction. High granularity means that it takes many symbols to represent a given concept while low granularity means that a smaller number of symbols is required. One measure of granularity is the number of symbols in an average piece of code vs. cyclomatic complexity, but I don't believe that this is sufficient in modern programming languages.

Density - A simple average of the number of characters (in whatever character set you like) per symbol the language uses.

Vocabulary - The number of symbols in the language.

Textiness - The ratio of symbols which are comprised of "letters" (which may vary in meaning by character set) to those comprised of non-letters or a mix.
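
The three counting metrics above (density, vocabulary and textiness) are simple enough to sketch in Python. Everything here is an assumption made for illustration: `TOY_SYMBOLS` is a made-up symbol table, not the real symbol set of any language, and I'm reading density as an average over the vocabulary rather than over symbols as they occur in real code.

```python
# A made-up symbol table for a toy language (hypothetical, for illustration only).
TOY_SYMBOLS = ["my", "die", "if", "else", "+", "-", "==", "=>", "x="]

def vocabulary(symbols):
    """Vocabulary: the number of distinct symbols in the language."""
    return len(set(symbols))

def density(symbols):
    """Density: average number of characters per distinct symbol."""
    unique = set(symbols)
    return sum(len(s) for s in unique) / len(unique)

def textiness(symbols):
    """Textiness: ratio of all-letter symbols to non-letter or mixed symbols."""
    unique = set(symbols)
    # str.isalpha() is Unicode-aware, so "letter" follows the character set in use.
    letters = sum(1 for s in unique if s.isalpha())
    others = len(unique) - letters
    return letters / others if others else float("inf")

print(vocabulary(TOY_SYMBOLS))         # 9
print(round(density(TOY_SYMBOLS), 2))  # 2.11
print(textiness(TOY_SYMBOLS))          # 0.8
```

Here "x=" counts as a mix, so four symbols are all letters and five are not.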

Morphability - The ease with which a programmer may change the behavior of the language or invoke behavior which is ambiguous at compile-time. This includes everything from textual macros to operator overloading to Lisp-style macros to simple polymorphism: every way in which a piece of code may have multiple valid meanings, or in which its meaning might change depending on the behavior of previously evaluated code. It's important to understand that "if a then b else c" has two behaviors that depend on the value of a, and that alone is not morphability. But if we could expect this code to compute pi because of something I've done ahead of time, then it would demonstrate high morphability.
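
As a rough illustration of morphability, here is a Python sketch using operator overloading. Python won't let me redefine "if ... else", so I've used "+" instead; the class names `Plain` and `Surprising` and the function `add` are hypothetical names of my own.

```python
import math

class Plain:
    """A wrapper whose "+" does what a reader would expect."""
    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        return self.value + other.value

class Surprising(Plain):
    """Same interface, but "+" ignores its operands and returns pi."""
    def __add__(self, other):
        return math.pi

def add(a, b):
    # The text of this expression never changes; what it means depends on
    # definitions that were evaluated before it ran.
    return a + b

ordinary = add(Plain(1), Plain(2))           # 3
morphed = add(Surprising(1), Surprising(2))  # math.pi
print(ordinary, morphed)
```

A reader looking only at `a + b` cannot tell which of the two behaviors they will get, which is the property the metric is trying to capture.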

These five metrics need more specific definitions. We need to understand what their domains are and how different languages map into those domains. I'll tackle that in a later article...