HOWTO use the Force
Jul. 30th, 2016 03:46 pm
Source Code Literacy for Padawans
Preintroduction
This article is a long-term work in progress; the reader might find it worthwhile to check back occasionally to see if anything useful has been added. I started it as a set of notes for a talk of the same name at SeaGL 2016.
Talks on this topic have been done before, no doubt better. The following list contains at least one such talk, and much further material on the subject.
- Resources
- https://github.com/aredridel/how-to-read-code/blob/master/how-to-read-code.md
- Blog post pretty-printed version of above: http://aredridel.dinhe.net/2015/03/29/how-to-read-source-code/
- https://blog.codinghorror.com/learn-to-read-the-source-luke/
- https://news.ycombinator.com/item?id=3769446
- http://www.tutorialspoint.com/developers_best_practices/handy_tools_techniques.htm
- http://www.gigamonkeys.com/code-reading/
- http://himmele.blogspot.com/2012/01/how-do-you-read-source-code.html
- http://matthieu.io/blog/2016/10/23/debugging-101/
Introduction
I've been reading source code, or attempting to, for over 20 years. It's hard to get started, and I never found a good guide on how to do it, so I'm going to attempt to write one even though I'm not remotely qualified to do so. I'm writing this in the hope that it may be useful to those who haven't yet learned the tricks I know, and that others with greater knowledge of the subject, or at least additional knowledge besides that which is presented here, will speak up either in the comments here or elsewhere.
Why is it so hard to read source code?
Programming is an act of translation. Someone gets an idea in the meat head and then translates it into a language. Typically the idea is translated first into a human-intelligible language, and then into a machine-recognizable one.
This is an iterative process. An author will write a fragmentary first draft of part of the program, and concurrently revise already-written parts while adding new parts. By 'concurrently' I mean 'within the space of a commit'. A commit is a discrete set of changes pushed to a version-control system. Not every programmer uses a formal VCS, but most do, and if you are reading source code of a project where VCS is not used then you won't have as much to look at.
Because the writing of source is an iterative process, multiple versions exist of nearly every project. Typically there will be a 'release version' which is regarded as stable, and one or more 'development versions' where new features are added. At any given time there will be a particular version or branch which is most appropriate for checking out by those who would work on a particular aspect of the program. If you are just reading the code for edification or bug-hunting, then you will want to acquire the most recent stable version, or ask the project's developers which version is the canonical one for this purpose.
The source code is the One True Source for documentation of the program to which it compiles. It contains machine-readable parts and human-readable parts (README files and comments, or other inline documentation). The machine-readable parts are those which the compiler takes as input, which input results in the running code. For this reason the machine-readable parts are the real documentation in that they describe in a non-ambiguous way what the software actually does, while the human-intelligible part is some human's idea of what the software is supposed to do.
Frequently authors will skip the human-intelligible part. In my opinion this is a Bad Idea. This is because machine-recognizable languages fall into a category of languages which is more restrictive than human-intelligible languages are: any machine-recognizable language must at minimum be recursively enumerable, so that any given statement in that language can be represented by a tree whose nodes represent legal tokens in the language, and whose edges represent the operation of valid generator rules. It's very easy for humans to describe things in human languages, and difficult to describe things in machine languages. IMO it's easier to translate from human language to machine language than vice versa; this is one reason why documentation is difficult to write.
What 'Machine Readable' means
WIP
What can a machine read?
Formal Languages are sets of symbols and rules for combining those symbols into 'tokens' which a certain type of machine can recognize as belonging to the language.
There are four known types of formal language, organized in a nested hierarchy called the Chomsky Hierarchy. The two simplest types, Regular and Context-free languages, can be recognized respectively by a finite automaton and a push-down automaton.
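To make the two machine types a little less abstract, here is a minimal sketch in Python. Both example languages (binary strings with an even number of 1s, and balanced parentheses) and all the names are my own choices for illustration, not anything standard.

```python
# Finite automaton: accepts binary strings containing an even number of 1s.
# It needs only a fixed set of states and no other memory.
DFA_TRANSITIONS = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}


def accepts_even_ones(text):
    state = "even"
    for symbol in text:
        key = (state, symbol)
        if key not in DFA_TRANSITIONS:
            return False  # symbol outside the alphabet
        state = DFA_TRANSITIONS[key]
    return state == "even"


# Push-down automaton: accepts strings of balanced parentheses.
# The stack is the extra memory that a finite automaton lacks.
def accepts_balanced(text):
    stack = []
    for symbol in text:
        if symbol == "(":
            stack.append(symbol)
        elif symbol == ")":
            if not stack:
                return False  # closing paren with nothing to match
            stack.pop()
        else:
            return False  # symbol outside the alphabet
    return not stack
```

The first recognizer only ever remembers which of two states it is in; the second needs a stack, because the depth of nesting is not bounded in advance. That difference is roughly what separates the regular languages from the context-free ones.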
Now you know as much as you did before.
Acquiring Source Code
If you are using Debian, then acquiring the source code of any program in Debian's archive is as easy as using the `apt-get source` command, which will fetch the source code from the archive and unpack it into a child of the current directory. If you are using an operating system other than Debian, you can get the source code of any Debian package via packages.debian.net. Source code for software which is not packaged by Debian is available from other sources. The GNU project maintains source code repositories of many programs; others are available from version-control repositories of various kinds provided by the authors or by others.
git clone
`git clone` is the customary way of acquiring code from GitHub and other git repositories. It downloads the repository, including its history and branches, and checks out the current state of the default branch into a new directory.
tarballs
Lots of source code is available for download as archives with the `*.tar` file extension. This extension indicates that the archive is in TAR format. Use the `tar -x` command to extract it (see `man tar` for details before proceeding: `tar` is a very complex program with many switches).
Many tar files are also compressed. Gzip, bz2 and LZMA are all popular compression formats used with tarfiles. Usually the compression format will be reflected in the file extension: `*.tar.gz`, `*.tar.bz2` and the like. Sometimes the format is not reflected in the file extension; use the `file` command to discover it.
What you get
When you download and extract a source code archive, you get a file tree. It will look something like the following imaginary `ls -R` output:
sourcedir:
README main.lang includes/ resources/
sourcedir/includes:
library.lib
sourcedir/resources/:
media.ogg
The README file is arguably the most important, especially when it is absent.
README files often contain build instructions, configuration instructions, and information on how to contact the developers.
The code provided by the project is often in the root directory ('sourcedir' in the example), though sometimes it is in one or more subdirectories.
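If you would rather see the tree before unpacking anything, Python's standard `tarfile` module can list an archive's contents. The following is only a sketch: the function names `list_archive` and `find_readmes` are my own, and the archive path is whatever you downloaded.

```python
import tarfile


def list_archive(path):
    """Return the member names of a tar archive without extracting it."""
    # Mode "r:*" asks tarfile to detect gzip, bz2 or xz compression itself.
    with tarfile.open(path, "r:*") as archive:
        return sorted(archive.getnames())


def find_readmes(names):
    """Pick out members whose file name starts with README (any case)."""
    return [
        name for name in names
        if name.rsplit("/", 1)[-1].upper().startswith("README")
    ]
```

Calling `find_readmes(list_archive("project.tar.gz"))` on a hypothetical `project.tar.gz` would tell you where to start reading before you have extracted a single file.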
Most non-trivial programming projects will `#include` or `use` code from other projects, in order to save time and prevent re-inventing wheels. This re-used code will often take the form of "libraries" of code snippets written expressly for re-use. For example, jQuery is a JavaScript library.
Tools
It's dangerous to go alone. Take these.
Shell
Your shell is the program that intermediates your computing experience. When you are using the command line, the shell is the program that writes the prompt, receives your input, and returns the results. When you use a graphical interface, the shell receives input about program interactions from the window manager and sends data about program output to the window manager so that it can be drawn into the appropriate window.
Shells typically implement a command scripting language, allowing the user to automate tasks by placing lists of commands and arguments in a file to be invoked as a unit.
Text Editor
A text editor is similar to a word processor, but different in that it is optimized for syntax transformation (for example, searching and replacing text) rather than for producing pretty-looking documents which look the same when printed out. Text editors are never WYSIWYG, because the whole point of source code is to be transformed into something else.
Editor vs. IDE
Some text editors are 'lightweight', in that they offer a small 'footprint' in the computer's resources and don't offer a great many features. For this reason some editors of this type might be considered useful only for working with configuration files (or as a fancy pager) and not useful for programming. This is entirely a matter of opinion (I have successfully edited source code using only `less` and `sed`), but it serves to illustrate a distinction between 'lightweight' editors and "programmer's" editors.
A programmer's editor is one which is optimized for programming. It will offer features like syntax highlighting, code folding, or auto-completion. Emacs is an example of a "programmer's editor", although some people might consider it an IDE.
An Integrated Development Environment or IDE is a group of programs which are designed to work together to make programming easier. Typically such a collection of programs will include an editor, an interactive interpreter, and a debugger among other components. Eclipse is an example of an IDE.
Searching and Text Processing
Text editors frequently offer search-and-replace facilities of varying sophistication.
In the shell
Grep and Regular Expressions
Parser searching
Understand the Model
If source code is a representation of an idea, then it may be regarded as a model of that idea. The idea is itself frequently a model of something that exists in the "real world". For example, some computer games feature a character which the player navigates through an imaginary world. The world in these games is frequently called a "map", because that's what it is: a map is a model of a place, in this case a model of a (usually) imaginary place which may or may not contain maps of maps.
Read the Comments
Not here. These comments are garbage. I should know, I write them myself.
The comments in the code are the ones you should read. Each language has its own comment syntax; here are some reasonably common ones.
// One line comment in C, PHP, Javascript
/* This is a
multi-line comment
in C, PHP, Javascript
the extra whitespace at the beginning of lines
is optional.
*/
;; One line comment in Lisp or .ini files
# One line comment in many config files
Comments are there to illustrate the intent of the author, to provide context for code, links to outside resources, or source code for automatically-generated documentation.
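Sometimes it helps to skim nothing but the comments first. As a sketch of how that might be done for Python source specifically, the standard `tokenize` module can tell real comments apart from `#` characters inside strings. The function name `extract_comments` and the sample source are invented for this example; other languages need their own tooling.

```python
import io
import tokenize


def extract_comments(source):
    """Return (line_number, comment_text) pairs for a string of Python source."""
    comments = []
    readline = io.StringIO(source).readline
    for token in tokenize.generate_tokens(readline):
        if token.type == tokenize.COMMENT:
            comments.append((token.start[0], token.string))
    return comments


SAMPLE = '''# Compute the area of a circle.
def area(r):
    return 3.14159 * r * r  # crude value of pi
label = "# not a comment"
'''
```

Running `extract_comments(SAMPLE)` picks up the comments on lines 1 and 3 and ignores the `#` inside the string on line 4, which a plain text search for `#` would not.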
Cheat Code: talk to the Developers
The people best qualified to discuss the mapping between a program and its underlying model are those who wrote the program. These people may or may not have time to talk to you or to answer your questions. Attempting conversation with strangers, even over the internet, may be difficult for you, and it's possible that the developers themselves might be difficult. Fortunately, it's possible to ask without asking: lurk on project IRC channels, watch presentations, peruse mailing-list archives. Once you have a feel for how people on the list respond to emails, you might feel more comfortable contacting them, or you might not. Either way you will learn something.
Know the Language
Obviously it's easier to read the source code if you already know the language. Well-written source code is often easy to read for those who know the language. Other sources are less easy to read.
Some languages are designed for ease of reading, and some programming styles such as Literate Programming or Document-Driven Development emphasize it.
Some programming languages which are widely considered easy to read:
- Python
- Lua
- Ruby
Some programming languages which are considered hard to read:
- Assembly language
- C or C++
- PostScript
My own opinions don't necessarily accord with these lists.
Specific things to know about your language
Constants, variables and arrays
Each language handles these things differently. Some languages don't have variables at all, only constants. Sometimes arrays are a subtype of variable, but this is not always the case.
In this article I'm using the word "array" to describe something which has different (but not unique) names in different languages, and the names mean different things depending on which language it is.
In general there are a few different kinds of arrays. All of them represent some form of key-value tuple. "Lists" usually comprise an index and a series of values like so:
1. one value
2. another value
3. still another value
Some languages start their indices at 1 like the list above, and some start at 0:
0. foo
1. fnord
2. bar
It's very important to keep track of this in order to avoid a class of errors called "off-by-one".
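Here is how that looks in Python, which counts from 0. This is only a sketch; the helper `nth` is something I made up to show the conversion from a human-style 1-based position.

```python
values = ["one value", "another value", "still another value"]

# Python counts from 0, so the first element is values[0].
first = values[0]
# The last valid index is len(values) - 1; values[len(values)] raises IndexError.
last = values[len(values) - 1]


def nth(items, position):
    """Return the item at a 1-based position, as a human would count it."""
    if position < 1 or position > len(items):
        raise IndexError("position out of range: %d" % position)
    return items[position - 1]
```

The `position - 1` in `nth` is exactly the adjustment that gets forgotten when code written with one convention in mind is read or ported with the other.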
Types
Some languages are "strongly typed", meaning that every value expressed in that language has a type, whether it's an integer, a TRUE or FALSE, a character, a string of characters or whatever. Many languages have facilities for users to define their own types. It's important to know this syntax so that you can understand the constraints that apply to things named in the program.
Types generally constrain the type of value that can be bound to a name and define the operations that can be performed on those values. For example, it doesn't make sense to "add" one character to another; each language defines differently what "a" + "b" is equal to.
It could be "ab" (concatenation) or it could be "undefined"/NULL/nil or whatever; it could be the ASCII values of each character added together, or something entirely unexpected depending on the language and environment.
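As one concrete data point, this is what Python does with the same question. Other languages will answer differently, so treat it as an illustration of one language's choices and nothing more.

```python
concatenated = "a" + "b"          # Python joins the two strings
code_sum = ord("a") + ord("b")    # adding the character codes must be asked for explicitly

try:
    "a" + 1                       # mixing a string and an integer is refused
    mixed = "no error"
except TypeError:
    mixed = "TypeError"
```

Here `concatenated` is `"ab"`, `code_sum` is `195` (97 plus 98), and `mixed` ends up as `"TypeError"` because Python will not guess what adding a string to a number should mean.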
What to do if you don't know the language
If you want to learn the language then perhaps https://koanhead.dreamwidth.org/2335.html can be of help.
Quite often well-written source code is intelligible even to those who don't know the language. When it isn't, hopefully some of the pointers below can assist you.
The trick to understanding source is in understanding the underlying model it describes, and that holds true whether you know the language or don't.
If you are familiar with BNF or other methods for specifying context-free grammars, then it can be useful to skim the formal specification of the language you are dealing with (if one exists). If you are unfamiliar with context-free grammars (CFGs), then the ECMAScript (ECMA-262) specification has a nice introduction.
Language specifications tend to be quite long, but they contain long stretches of formal production rules that you can skip over if they do not describe the parts of the language currently of interest to you.
What to Look For
Entry Points
C programs usually have a function called `main()`. The main function is the starting point for execution, so it makes sense to start reading C programs at that function, then read the header files mentioned there (if any) and then proceed from there.
Unfortunately, not all programming languages feature such a convenient convention. When there's no obvious starting point, one can sometimes find a useful starting point by reasoning about the context of the program. For example, PHP is a language usually used for generating Web pages. It's embeddable into the pages themselves, and the Web server interprets the embedded code 'on-the-fly'. In such a situation, if you know that your Web server will serve a page called 'index.html' by default, it makes sense to look for a file called 'index.html' or 'index.php'.
Names
Quite often you'll be looking in the source code for something particular, for example to answer the question, "How does this program do $this_thing?"
A naive but effective tactic is to simply search the tree for $this_thing in order to find the places in the code where $this_thing is addressed. There are many ways to accomplish this: see How To Find What You're Looking For below.
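The naive search can be sketched in a few lines of Python, for those times when grep is not at hand. The function name `search_tree` is my own invention, and the sketch assumes the files are text that can be read as UTF-8 (undecodable bytes are replaced rather than causing an error).

```python
import os


def search_tree(root, needle):
    """Yield (path, line_number, line) for each line containing needle."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            path = os.path.join(dirpath, filename)
            try:
                with open(path, encoding="utf-8", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if needle in line:
                            yield path, number, line.rstrip("\n")
            except OSError:
                continue  # unreadable file: skip it and carry on
```

Something like `for hit in search_tree("sourcedir", "this_thing"): print(hit)` then gives a list of places to start reading.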
Kinds of names
There are many different kinds of names, and it can be helpful to know how the language in question treats them. If your reasoning about $this_thing leads you to believe that you're looking for a function that takes no arguments, and the code is in C, then it might be worthwhile for you to search for "$this_thing()" for example.
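A regular expression can encode that kind of reasoning. The sketch below looks for a name followed by empty parentheses; `call_pattern`, `find_calls` and the C fragment are all made up for the example.

```python
import re


def call_pattern(name):
    """Build a regex matching name() with optional whitespace, e.g. name ( )."""
    return re.compile(r"\b" + re.escape(name) + r"\s*\(\s*\)")


def find_calls(name, source):
    """Return the 1-based line numbers on which name() appears."""
    pattern = call_pattern(name)
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if pattern.search(line)
    ]


SAMPLE_C = """\
int reset_state(void);
void run(void) {
    reset_state();
    reset_state_later(1);
    log("reset_state");
}
"""
```

With this sample, `find_calls("reset_state", SAMPLE_C)` finds only line 3. The prototype on line 1, the longer name on line 4 and the string on line 5 are all skipped. It is still a text match and not a parser, so an old-style declaration such as `int reset_state();` would match too.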
Rules about names
Programming languages have rules about what constitutes a valid identifier. For example, in PHP a variable name must begin with $, followed by a letter or underscore, and then any number of letters, digits or underscores. A language may have different rules about different kinds of identifiers, which can help you to determine what sort of thing a particular name refers to.
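Rules like these translate naturally into patterns you can search with. As a sketch, here is a simplified, ASCII-only version of the PHP variable rule written as a Python regular expression. Real PHP also accepts certain non-ASCII bytes in names, which this deliberately ignores, and the function names are mine.

```python
import re

# Simplified PHP rule (ASCII only): "$", then a letter or underscore,
# then any number of letters, digits or underscores.
PHP_VARIABLE = re.compile(r"\$[A-Za-z_][A-Za-z0-9_]*")


def is_php_variable(name):
    """True if the whole string is a valid (ASCII) PHP variable name."""
    return PHP_VARIABLE.fullmatch(name) is not None


def php_variables(source):
    """Return every variable-looking name found in a string of PHP source."""
    return PHP_VARIABLE.findall(source)
```

So `is_php_variable("$this_thing")` is true, while `is_php_variable("$9lives")` and `is_php_variable("this_thing")` are both false: the first starts with a digit after the `$`, and the second has no `$` at all.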
Inclusions
Lots of files "include" other files by some mechanism or other. Reading those included files will make it easier to understand the one you are looking at.
Include graphs and call graphs
An include graph is a directed graph in which the nodes represent files in the source tree and the arrows point from nodes that include others to the included ones.
A call graph is a directed graph in which the nodes represent functions and the edges point from a calling function to the one which is called.
Doxygen can generate both kinds of graphs.
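For a small C tree you can get a rough include graph without any special tools. The following sketch is not how Doxygen does it: it simply scans `.c` and `.h` files for `#include` lines, does not resolve include paths, and ignores conditional compilation. The names `INCLUDE` and `include_graph` are invented for the example.

```python
import os
import re

# Matches lines such as:  #include <stdio.h>   or   # include "util.h"
INCLUDE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]')


def include_graph(root):
    """Map each .c/.h file under root to the list of names it #includes."""
    graph = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith((".c", ".h")):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8", errors="replace") as handle:
                included = [m.group(1) for m in map(INCLUDE.match, handle) if m]
            graph[os.path.relpath(path, root)] = included
    return graph
```

Each key in the returned dictionary is a node and each entry in its list is an arrow pointing from the including file to the included one, which is enough to see at a glance which files sit at the centre of the project.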
How to Find What you're Looking For
Using grep
Using a linker
Using Doxygen
Using Egypt
How to Find only what you're looking for
If only I had an answer to this.