Jul. 30th, 2016

Source Code Literacy for Padawans


This article is a long-term work in progress; the reader might find it worthwhile to check back occasionally to see if anything useful has been added. I started it as a set of notes for a talk of the same name at SeaGL 2016.

Talks on this topic have been done before, no doubt better. The following list contains at least one such talk, and much further material on the subject.

  • https://github.com/aredridel/how-to-read-code/blob/master/how-to-read-code.md

  • Blog post pretty-printed version of above: http://aredridel.dinhe.net/2015/03/29/how-to-read-source-code/

  • https://blog.codinghorror.com/learn-to-read-the-source-luke/

  • https://news.ycombinator.com/item?id=3769446

  • http://www.tutorialspoint.com/developers_best_practices/handy_tools_techniques.htm

  • http://www.gigamonkeys.com/code-reading/

  • http://himmele.blogspot.com/2012/01/how-do-you-read-source-code.html

  • http://matthieu.io/blog/2016/10/23/debugging-101/

    Introduction

    I've been reading source code, or attempting to, for over 20 years. It's hard to get started, and I never found a good guide on how to do it, so I'm going to attempt to write one even though I'm not remotely qualified to do so. I'm writing this in the hope that it may be useful to those who haven't yet learned the tricks I know, and that others with greater knowledge of the subject, or at least additional knowledge besides that which is presented here, will speak up either in the comments here or elsewhere.

    Why is it so hard to read source code?

    Programming is an act of translation. Someone gets an idea in the meat head and then translates it into a language. Typically the idea is translated first into a human-intelligible language, and then into a machine-recognizable one.

    This is an iterative process. An author will write a fragmentary first draft of part of the program, and concurrently revise already-written parts while adding new parts. By 'concurrently' I mean 'within the space of a commit'. A commit is a discrete set of changes pushed to a version-control system. Not every programmer uses a formal VCS, but most do, and if you are reading source code of a project where VCS is not used then you won't have as much to look at.

    Because the writing of source is an iterative process, multiple versions exist of nearly every project. Typically there will be a 'release version' which is regarded as stable, and one or more 'development versions' where new features are added. At any given time there will be a particular version or branch which is most appropriate for checking out by those who would work on a particular aspect of the program. If you are just reading the code for edification or bug-hunting, then you will want to acquire the most recent stable version, or inquire among the project's developers as to which version is the canonical one for this purpose.

    The source code is the One True Source for documentation of the program to which it compiles. It contains machine-readable parts and human-readable parts (README files and comments, or other inline documentation). The machine-readable parts are those which the compiler takes as input to produce the running code. For this reason the machine-readable parts are the real documentation, in that they describe unambiguously what the software actually does, while the human-intelligible part is some human's idea of what the software is supposed to do.

    Frequently authors will skip the human-intelligible part. In my opinion this is a Bad Idea. Machine-recognizable languages fall into a more restrictive category than human-intelligible ones: a machine-recognizable language can be at most recursively enumerable, and in practice the syntax of most programming languages is specified by a context-free grammar, so that any given statement in the language can be represented by a tree whose leaves are legal tokens and whose internal nodes represent the application of the grammar's production rules. It's very easy for humans to describe things in human languages, and difficult to describe things in machine languages. IMO it's easier to translate from human language to machine language than vice versa; this is one reason why documentation is difficult to write.

    What 'Machine Readable' means


    What can a machine read?

    A formal language is a set of strings built from a fixed alphabet of symbols, usually defined by rules (a grammar) for combining those symbols. A given type of machine can recognize whether a string belongs to the language.

    The Chomsky hierarchy organizes formal languages into four nested types: regular, context-free, context-sensitive and recursively enumerable. The two simplest types, regular and context-free languages, can be recognized respectively by a finite automaton and a push-down automaton; the other two require a linear bounded automaton and a Turing machine.
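    To make the difference between the two simplest types a little more concrete, here is a minimal Python sketch. The names `is_identifier` and `is_balanced` are invented for this illustration. The first recognizer needs only a regular expression; the second needs a stack, which is what separates a push-down automaton from a finite one.

```python
import re

# A regular language: identifiers made of a letter or underscore followed
# by letters, digits or underscores. A regular expression of this kind
# can be recognized by a finite automaton.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_identifier(text: str) -> bool:
    """True if the whole of text is a single identifier."""
    return IDENTIFIER.fullmatch(text) is not None


def is_balanced(text: str) -> bool:
    """Recognize balanced brackets, a classic context-free language.

    The explicit stack plays the role of the push-down automaton's stack.
    Characters other than brackets are ignored.
    """
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in text:
        if ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            # A closing bracket must match the most recent opening one.
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack
```

    No finite automaton can do the second job, because matching brackets to arbitrary depth requires unbounded memory.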

    Now you know as much as you did before.

    Acquiring Source Code

    If you are using Debian then acquiring the source code of any program in Debian's archive is as easy as using the `apt-get source` command, which will fetch the source package from the archive and unpack it into a child of the current directory (this requires deb-src entries in your APT sources). If you are using an operating system other than Debian you can get the source code of any Debian package via packages.debian.org. Source code for software which is not packaged by Debian is available from other sources. The GNU project maintains source code repositories of many programs; others are available from version-control repositories of various kinds provided by the authors or by others.

    git clone

    git clone is the customary way of acquiring code from GitHub and other git repositories. It downloads the repository's history, including its branches, and checks out the current state of the default branch into a new directory.


    Lots of source code is available for download as archives with the *.tar file extension. This extension indicates that the archive is in TAR format. Use the tar -xf command followed by the archive's name to extract it (see man tar for details before proceeding: tar is a very complex program with many switches).

    Many tar files are also compressed. gzip, bzip2 and xz (LZMA) are all popular compression formats used with tarfiles. Usually the compression format will be reflected in the file extension: *.tar.gz, *.tar.bz2, *.tar.xz and the like. Sometimes the format is not reflected in the file extension; use the file command to discover it.
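    If you would rather not memorize tar's switches, the same job can be sketched with Python's standard tarfile module, which detects the compression format by itself. The function names and the demo archive below are made up for this example; only the tarfile calls are real.

```python
import io
import tarfile
from pathlib import Path


def list_and_extract(archive: Path, dest: Path) -> list[str]:
    """Return the member names of a tar archive and unpack it into dest."""
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:*" asks tarfile to detect gzip, bzip2 or xz compression itself.
    with tarfile.open(archive, mode="r:*") as tar:
        names = tar.getnames()
        # Newer Python versions offer a "data" filter that refuses hostile
        # member paths; use it when available. Only unpack archives you trust.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest, **kwargs)
    return names


def make_demo_archive(path: Path) -> None:
    """Build a tiny gzip-compressed tarball so the example is self-contained."""
    data = b"Build with: make\n"
    info = tarfile.TarInfo(name="sourcedir/README")
    info.size = len(data)
    with tarfile.open(path, mode="w:gz") as tar:
        tar.addfile(info, io.BytesIO(data))
```

    Calling make_demo_archive() and then list_and_extract() on the result should leave a sourcedir/README file under the destination directory.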

    What you get

    When you download and extract a source code archive, you get a file tree. It will look something like the following imaginary ls -R output:

    sourcedir:
    README  main.lang  includes/  resources/



    The README file is arguably the most important, especially when it is absent.
    README files often contain build instructions, configuration instructions, and information on how to contact the developers.

    The code provided by the project is often in the root directory ('sourcedir' in the example), though sometimes it is in one or more subdirectories.

    Most non-trivial programming projects will #include or otherwise use code from other projects, in order to save time and avoid re-inventing wheels. This re-used code will often take the form of "libraries" of code written expressly for re-use. For example, jQuery is a JavaScript library.


    It's dangerous to go alone. Take these.


    Shell

    Your shell is the program that intermediates your computing experience. When you are using the command line, the shell is the program that writes the prompt, receives your input, and returns the results. When you use a graphical interface, the shell receives input about program interactions from the window manager and sends data about program output to the window manager so that it can be drawn into the appropriate window.

    Shells typically implement a command scripting language, allowing the user to automate tasks by placing lists of commands and arguments in a file to be invoked as a unit.

    Text Editor

    A text editor is similar to a word processor, but different in that it is optimized for syntax transformation (for example, searching and replacing text) rather than for producing pretty-looking documents which look the same when printed out. Text editors are never WYSIWYG, because the whole point of source code is to be transformed into something else.

    Editor vs. IDE

    Some text editors are 'lightweight', in that they offer a small 'footprint' in the computer's resources and don't offer a great many features. For this reason some editors of this type might be considered useful only for working with configuration files (or as a fancy pager) and not useful for programming. This is entirely a matter of opinion (I have successfully edited source code using only less and sed), but it serves to illustrate a distinction between 'lightweight' editors and "programmer's" editors.

    A programmer's editor is one which is optimized for programming. It will offer features like syntax highlighting, code folding, or auto-completion. Emacs is an example of a "programmer's editor", although some people might consider it an IDE.

    An Integrated Development Environment or IDE is a group of programs which are designed to work together to make programming easier. Typically such a collection of programs will include an editor, an interactive interpreter, and a debugger among other components. Eclipse is an example of an IDE.

    Searching and Text Processing

    Text editors frequently offer search-and-replace facilities of varying sophistication.

    In the shell

    Grep and Regular Expressions

    Parser searching

    Understand the Model

    If source code is a representation of an idea, then it may be regarded as a model of that idea. The idea is itself frequently a model of something that exists in the "real world". For example, some computer games feature a character which the player navigates through an imaginary world. The world in these games is frequently called a "map", because that's what it is: a map is a model of a place, in this case a model of a (usually) imaginary place which may or may not contain maps of maps.

    Read the Comments

    Not here. These comments are garbage. I should know, I write them myself.

    The comments in the code are the ones you should read. Each language has its own comment syntax; here are some reasonably common ones.

    // One line comment in C, PHP, JavaScript
    /* This is a
    multi-line comment
    in C, PHP, JavaScript
    the extra whitespace at the beginning of lines
    is optional. */

    ;; One line comment in Lisp or .ini files

    # One line comment in many config files

    Comments are there to illustrate the intent of the author, to provide context for code, links to outside resources, or source code for automatically-generated documentation.
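    Because comments follow a fixed syntax, a program can pull them out of a file for you, which is roughly how documentation generators get started. The following is only a sketch for Python source, using the standard tokenize module; `extract_comments` and `SAMPLE` are names invented for this example.

```python
import io
import tokenize


def extract_comments(source: str) -> list[tuple[int, str]]:
    """Return (line number, comment text) pairs found in Python source."""
    comments = []
    reader = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(reader):
        # The tokenizer knows the difference between a real comment and
        # a "#" that merely appears inside a string literal.
        if tok.type == tokenize.COMMENT:
            comments.append((tok.start[0], tok.string))
    return comments


SAMPLE = '''\
# Convert Celsius to Fahrenheit.
def c_to_f(c):
    label = "# not a comment"
    return c * 9 / 5 + 32  # scale, then offset
'''
```

    Running extract_comments(SAMPLE) should report the comments on lines 1 and 4 and skip the string on line 3.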

    Cheat Code: talk to the Developers

    The people best qualified to discuss the mapping between a program and its underlying model are those who wrote the program. These people may or may not have time to talk to you or to answer your questions. Attempting conversation with strangers, even over the internet, may be difficult for you, and it's possible that the developers themselves might be difficult. Fortunately, it's possible to ask without asking: lurk on project IRC channels, watch presentations, peruse mailing-list archives. Once you have a feel for how people on the list respond to emails, you might feel more comfortable contacting them, or you might not. Either way you will learn something.

    Know the Language

    Obviously it's easier to read the source code if you already know the language. Well-written source code is often easy to read for those who know the language. Other sources are less easy to read.

    Some languages are designed for ease of reading, and some programming styles such as Literate Programming or Document-Driven Development emphasize it.

    Some programming languages which are widely considered easy to read:

    • Python

    • Lua

    • Ruby

    Some programming languages which are considered hard to read:

    • Assembly language

    • C or C++

    • PostScript

    My own opinions don't necessarily accord with these lists.

      Specific things to know about your language

      Constants, variables and arrays

      Each language handles these things differently. Some languages don't have variables at all, only constants. Sometimes arrays are a subtype of variable, but this is not always the case.

      In this article I'm using the word "array" to describe something which has different (but not unique) names in different languages, and the names mean different things depending on which language it is.

      In general there are a few different kinds of arrays. All of them represent some form of key-value tuple. "Lists" usually comprise an index and a series of values like so:

      1. one value

      2. another value

      3. still another value

      4. foo

      5. fnord

      6. bar

      Some languages start their indices at 1 like the list above, and some start at 0.
      It's very important to keep track of this in order to avoid a class of errors called "off-by-one".
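      As a small illustration, here is the list above in Python, which counts from 0. The helper `nth` is hypothetical; it just shows where the "minus one" has to go when translating between the two conventions.

```python
values = ["one value", "another value", "still another value",
          "foo", "fnord", "bar"]

# Python counts from 0, so the first item is values[0] and the last is
# values[len(values) - 1]; values[len(values)] would raise IndexError.
first = values[0]
last = values[len(values) - 1]

# enumerate(..., start=1) reproduces the 1-based numbering of the list above.
numbered = [f"{n}. {v}" for n, v in enumerate(values, start=1)]


def nth(items: list[str], n: int) -> str:
    """Return the n-th item using 1-based counting, as a human would."""
    if not 1 <= n <= len(items):
        raise IndexError(f"n must be between 1 and {len(items)}")
    return items[n - 1]
```

      Item 5 of the numbered list ("fnord") lives at values[4]; forgetting that adjustment is the off-by-one error.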


      Some languages are "strongly typed", meaning that every value expressed in that language has a type (an integer, a TRUE or FALSE, a character, a string of characters or whatever) and the language restricts how values of different types may be combined. Many languages have facilities for users to define their own types. It's important to know this syntax so that you can understand the constraints that apply to things named in the program.

      Types generally constrain the kind of value that can be bound to a name and define the operations that can be performed on those values. For example, it isn't obvious what it should mean to "add" one character to another, so each language defines differently what "a" + "b" is equal to. It could be "ab" (concatenation) or it could be "undefined"/NULL/nil or whatever; it could be the ASCII values of each character added together, or something entirely unexpected depending on the language and environment.
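      Here is how one particular language, Python, answers the question. This is a sketch of Python's behaviour only; other languages answer differently (in C, for instance, 'a' + 'b' is plain integer arithmetic on the character codes).

```python
# Python treats "+" on two strings as concatenation.
joined = "a" + "b"

# Adding the character codes is a different operation and must be spelled out.
code_sum = ord("a") + ord("b")  # 97 + 98

# Mixing a string and an integer is rejected rather than guessed at.
try:
    "a" + 1
    mixed = "no error"
except TypeError as exc:
    mixed = type(exc).__name__
```

      After this runs, joined holds "ab", code_sum holds 195, and mixed holds "TypeError".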

      What to do if you don't know the language

      If you want to learn the language then perhaps https://koanhead.dreamwidth.org/2335.html can be of help.

      Quite often well-written source code is intelligible even to those who don't know the language. When it isn't, hopefully some of the pointers below can assist you.

      The trick to understanding source is in understanding the underlying model it describes, and that holds true whether you know the language or don't.

      If you are familiar with BNF or other methods for specifying context-free grammars (CFGs), then it can be useful to skim the formal specification of the language you are dealing with (if one exists). If you are unfamiliar with CFGs then the ECMAScript language specification (ECMA-262) has a nice introduction.

      Language specifications tend to be quite long, but they contain long stretches of formal production rules that you can skip over if they are not describing the parts of the language currently of interest to you.

      What to Look For

      Entry Points

      C programs usually have a function called `main()`. The main function is the starting point for execution, so it makes sense to start reading C programs at that function, then read the header files mentioned there (if any) and then proceed from there.

      Unfortunately, not all programming languages feature such a convenient convention. When there's no obvious starting point, one can sometimes find a useful starting point by reasoning about the context of the program. For example, PHP is a language usually used for generating Web pages. It's embeddable into the pages themselves, and the Web server interprets the embedded code 'on-the-fly'. In such a situation, if you know that your Web server will serve a page called 'index.html' by default, it makes sense to look for a file called 'index.html' or 'index.php'.
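      For C, the hunt for main() can be automated with a rough heuristic. The sketch below is an assumption-laden shortcut, not a parser: the function name `find_c_entry_points` is made up, and the pattern will miss unusual formatting and may match text inside comments.

```python
import re
from pathlib import Path

# Heuristic: a definition of main looks like "int main(" (or, in some
# older code, "void main(") at the start of a line.
MAIN_DEF = re.compile(r"^\s*(?:int|void)\s+main\s*\(", re.MULTILINE)


def find_c_entry_points(root: str) -> list[Path]:
    """Return the .c files under root that appear to define main()."""
    hits = []
    for path in sorted(Path(root).rglob("*.c")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if MAIN_DEF.search(text):
            hits.append(path)
    return hits
```

      A project may legitimately contain several files that define main(), for example one per command-line tool or test program, so expect more than one hit in larger trees.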


      Quite often you'll be looking in the source code for something particular, for example to answer the question, "How does this program do $this_thing?"

      A naive but effective tactic is to simply search the tree for $this_thing in order to find the places in the code where $this_thing is addressed. There are many ways to accomplish this: see How To Find What You're Looking For below.
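      The naive search can be sketched in a few lines of Python. This is a stand-in for what grep -rn does, not a replacement for it; `search_tree` is a name invented for this example.

```python
from pathlib import Path


def search_tree(root: str, needle: str) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line) for each line containing needle."""
    matches = []
    base = Path(root)
    for path in sorted(base.rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                rel = path.relative_to(base).as_posix()
                matches.append((rel, number, line.strip()))
    return matches
```

      The line numbers matter: they let you jump straight to each hit in your editor instead of paging through the file.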

      Kinds of names

      There are many different kinds of names, and it can be helpful to know how the language in question treats them. If your reasoning about $this_thing leads you to believe that you're looking for a function that takes no arguments, and the code is in C, then it might be worthwhile for you to search for "$this_thing()" for example.

      Rules about names

      Programming languages have rules about what constitutes a valid identifier. For example, in PHP a variable name must begin with $ followed by a letter or underscore, which may then be followed by any number of letters, digits or underscores. A language may have different rules for different kinds of identifiers, which can help you to determine what sort of thing a particular name refers to.
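      Naming rules like these translate directly into search patterns. As a sketch, here is the PHP variable rule ($, then a letter or underscore, then any number of letters, digits or underscores) written as a Python regular expression. It is simplified to ASCII, and the function names are hypothetical.

```python
import re

# Simplified to ASCII: the PHP manual also permits bytes 0x80-0xff in names.
PHP_VARIABLE = re.compile(r"\$[A-Za-z_][A-Za-z0-9_]*")


def is_php_variable(name: str) -> bool:
    """True if name looks like a valid PHP variable such as $user_id."""
    return PHP_VARIABLE.fullmatch(name) is not None


def find_php_variables(source: str) -> list[str]:
    """Return the distinct variable names in source, in order of first use.

    This is a text search, so names inside strings and comments count too.
    """
    seen = []
    for name in PHP_VARIABLE.findall(source):
        if name not in seen:
            seen.append(name)
    return seen
```

      Listing the variables a file uses is a cheap way to get a first impression of what the file is about.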


      Lots of files "include" other files by some mechanism or other. Reading those included files will make it easier to understand the one you are looking at.

      Include graphs and call graphs

      An include graph is a directed graph in which the nodes represent files in the source tree and the arrows point from nodes that include others to the included ones.

      A call graph is a directed graph in which the nodes represent functions and the edges point from a calling function to the one which is called.

      Doxygen can generate both kinds of graphs.
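      To show what an include graph contains, here is a rough sketch that builds one for C files by scanning for #include lines and then prints it in Graphviz DOT syntax. This is not how Doxygen does it; the function names are invented, and the scan ignores conditional compilation and include search paths.

```python
import re
from pathlib import Path

# Matches lines such as:  #include "util.h"   or   #include <stdio.h>
INCLUDE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]', re.MULTILINE)


def include_graph(root: str) -> dict[str, list[str]]:
    """Map each .c/.h file under root to the names it #includes."""
    base = Path(root)
    graph = {}
    for path in sorted(base.rglob("*")):
        if path.suffix not in {".c", ".h"} or not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        graph[path.relative_to(base).as_posix()] = INCLUDE.findall(text)
    return graph


def to_dot(graph: dict[str, list[str]]) -> str:
    """Render the graph in Graphviz DOT syntax, one edge per include."""
    lines = ["digraph includes {"]
    for source, targets in graph.items():
        for target in targets:
            # The arrow points from the including file to the included one.
            lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)
```

      Files that many others point to are good candidates for early reading, since they tend to define the project's shared vocabulary.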

      How to Find What you're Looking For

      Using grep

      Using a linker

      Using Doxygen

      Using Egypt

      How to Find only what you're looking for

      If only I had an answer to this.


      The Linux kernel


 Hillary Clinton is the Democratic Party's nominee for President. She is running against Donald Trump, who regularly (but not consistently) beats her in the polls. The outcome is uncertain, because many people who are against Trump are also against Hillary. A large subset of these people are supporters of Bernie Sanders' campaign for president, and are strongly coupled with the issues addressed in the Sanders campaign platform as promulgated at https://berniesanders.com/issues/ and the pages linked from that one.

Bernie Sanders and his campaigners have repeatedly stated that the campaign is not about Bernie Sanders but about the issues linked above. Hillary Clinton may or may not support the policies promulgated by the Sanders campaign which address these issues. There is no way for an outside observer, that is to say, anyone other than Hillary Clinton herself, to tell. 

It's reasonable to suppose that, if Hillary Clinton's position on these policies were sufficiently similar to that of the Sanders campaign's supporters, those supporters would then support Hillary Clinton. To do otherwise would undermine the claim that the campaign itself is not about Bernie Sanders but about the relevant issues.

This is a classic game-theoretic scenario. Hillary Clinton has signalled a limited willingness to co-operate with the Left (used herein as a catchall term for those who support the relevant issues as described above) but the Left has no way in which to gauge the probability of defection. Insufficient information exists to determine either the actual extent of the proffered co-operation or the extent to which the proffer is binding. As a result members of the class I'm calling "the Left" have no reason to co-operate with Ms Clinton by voting for or otherwise supporting her campaign.

In order for "the Left" to have sufficient information to decide whether to support Ms Clinton's campaign, the campaign could provide an unambiguous and binding signal of co-operation. In general such a signal could take many forms. I propose that of a contract.

The two parties in question, the Clinton campaign and the "Revolution" frequently invoked by the Sanders campaign and for which authoritative members of that campaign may be considered able to speak, could mutually and publicly negotiate a contract specifying policies which the prospective President will support to address the issues linked above, actions to be taken to implement those policies by a specific time, and penalties for failure to perform the contract's terms.

It doesn't appear that people in general are willing to trust either candidate. In the absence of trust, unambiguous and verifiable signals may serve to show willingness to co-operate. Hillary Clinton needs the co-operation of as many people as possible in order to win the Presidential race against Trump. A clear set of signals exists that could potentially win the co-operation of a very large number of people. If the Clinton campaign can secure this co-operation then victory will be much more likely than otherwise.
Libreboot is a Free Software BIOS replacement based on coreboot. Coreboot is an Open Source project to replace BIOS. BIOS is the vendor-provided software that provides initial hardware information to the Operating System. BIOS implementations are, with a sole exception, non-Free software. BIOS is available only as a binary blob. Coreboot is Open Source software, but it contains blobs for various hardware. Libreboot respects your freedoms and contains enhancements to make it easier for newcomers to install and use.


What libreboot is

A free-software version of coreboot. It supports only hardware that can run without blobs. It lives in the boot flash chip, initializes RAM, builds a table called the coreboot table, then launches a payload. The default payload is GRUB2; SeaBIOS and the Linux kernel are also available.

Why you should care about it

Libreboot represents one way to eliminate (or at least minimize) opaque software blobs from your computing. Blobs represent potential security vulnerabilities and other problems.

Relationship to coreboot

Deliberately non-forking. Tracks main branch of coreboot (or some branch). If you want to add hardware support to libreboot, get it into coreboot first and then Free it.
Talk will include a demonstration of installing or upgrading (depending on the state of available target hardware) libreboot using software flashing or an external programmer.
The talk also introduces SILLY, a local project to provide Libre laptops for activists and training in Libreboot and related things.
