Binary problem in source code management

DevOps culture understands the basics of source code management (SCM) well. Teams collectively do a good job of keeping program texts in shape. When it comes to JSON, XML, and other not entirely readable formats, though, even subtle missteps can have big costs.

Binaries under source control

Suppose you maintain a tiny website, with a few static HTML pages, a CSS source, and a folder with the two dozen images and icons end users need to see. You’re professional, so you make sure that everything is under revision control.

That’s a problem, though. The images are binaries. Small HTML sources typically occupy a few kilobytes, but while images can be tiny, typical photographs, diagrams or design elements are tens or a few hundreds of kilobytes. They blow up the footprint of a repository — and what it takes to back it up — by a considerable factor.

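There’s worse, though. For a previous generation of revision control systems, including Apache Subversion (SVN), the effect of including binaries was simply to take up more space. With more modern SCM like Git, though, the damage is harder to undo: every full clone carries the whole history, so each committed version of a large binary travels with the repository from then on unless the history is rewritten.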

As Robin Winslow, among others, explained several years ago, a fundamental premise of distributed version control systems (DVCS) is that it’s “cheap and easy to clone and navigate” repositories. Binaries clog DVCS operations and at least partially nullify those benefits.

What to do? Binaries that are small and infrequently updated — a logo, for instance — fit quite well in all SCMs. Those are no hazard. In contrast, multimedia files, high-resolution photographs, and especially anything that changes frequently, including thumbnails with short lifetimes, deserve a better home.
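One way to find out which files deserve that better home is to scan the working tree for large files that look binary. The following is a minimal sketch, not a definitive tool: the function name `find_large_binaries`, the 100,000-byte threshold, and the "NUL byte in the first 8000 bytes" test for binary content are all assumptions chosen for illustration.

```python
import os


def find_large_binaries(root, min_bytes=100_000):
    """Return (path, size) pairs for files under root that look binary
    and are at least min_bytes long, largest first."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip the SCM's own metadata directories.
        dirnames[:] = [d for d in dirnames if d not in {".git", ".svn"}]
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size < min_bytes:
                continue
            with open(path, "rb") as handle:
                head = handle.read(8000)
            # Heuristic (an assumption): a NUL byte near the start
            # suggests binary rather than text content.
            if b"\0" in head:
                hits.append((path, size))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for path, size in find_large_binaries("."):
        print(f"{size:>12}  {path}")
```

Size alone does not settle the question; how often a file changes matters at least as much, so treat the output as a list of candidates to review.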

Those binaries should generally live on a dedicated server or even in a full-blown content management system (CMS), with a reference to their content appearing in SCM:

<img src="$TOP_STORY"> ...

The lifecycles of content and application source are different, and a scheme of references like this decouples them so each can be managed more rationally.
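How a reference such as `$TOP_STORY` gets resolved is not specified here. One possibility, sketched below under that caveat, is to substitute it when the page is built or deployed. The names `PAGE_TEMPLATE`, `render`, and `assets`, and the example URL, are hypothetical.

```python
from string import Template

# Page source as it would live in SCM: only a reference, no image bytes.
PAGE_TEMPLATE = Template('<img src="$TOP_STORY">')


def render(template, asset_urls):
    """Fill in asset references at build or deploy time."""
    return template.substitute(asset_urls)


# Hypothetical mapping, for instance produced by the CMS or asset server.
assets = {"TOP_STORY": "https://assets.example.com/stories/top.jpg"}

print(render(PAGE_TEMPLATE, assets))
```

With this arrangement the image can be replaced many times a day without a single commit to the application’s repository.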

Newlines

That’s not all. How you handle text also makes a difference.

Consider a specific example, with two different representations:

<items>
    <item id="0001" type="icecream">
        <name>Rocky Road</name>
        <schedule>ScheduleA</schedule>
        <price>Level B</price>
        <ingredients>
            <ingredient id="1001">vanilla extract</ingredient>
            <ingredient id="1009">cocoa power</ingredient>
            <ingredient id="1033">chopped pecans</ingredient>
            ...
        </ingredients>
        <topping id="5001">none</topping>
        <topping id="5002">sprinkles</topping>
        <topping id="5005">Oreos</topping>
        <topping id="5006">melted marshmallows</topping>
        ...
    </item>
    ...
</items>

vs.

<items><item id="0001" type="icecream"><name>Rocky Road</name><schedule>ScheduleA</schedule><price>Level B</price><ingredients><ingredient id="1001">vanilla extract</ingredient><ingredient id="1009">cocoa power</ingredient><ingredient id="1033">chopped pecans</ingredient> ... </ingredients><topping id="5001">none</topping><topping id="5002">sprinkles</topping><topping id="5005">Oreos</topping><topping id="5006">melted marshmallows</topping> ... </item> ... </items>

The content of these two XML fragments is identical. Their sizes are 633 and 469 bytes, respectively. That makes the latter a better format, right?

Wrong. Or, at least, problematic: SCM tools typically privilege newlines in a way you need to understand.

Suppose you change <price> from Level B to Level C in each of these documents. Reports of such a change typically look like this:

- <price>Level B</price>
+ <price>Level C</price>

for the former, but for the latter, it’s:

-<items><item id="0001" type="icecream"><name>Rocky Road</name><schedule>ScheduleA</schedule><price>Level B</price><ingredients><ingredient id="1001">vanilla extract</ingredient><ingredient id="1009">cocoa power</ingredient><ingredient id="1033">chopped pecans</ingredient> ... </ingredients><topping id="5001">none</topping><topping id="5002">sprinkles</topping><topping id="5005">Oreos</topping><topping id="5006">melted marshmallows</topping> ... </item> ... </items>
+<items><item id="0001" type="icecream"><name>Rocky Road</name><schedule>ScheduleA</schedule><price>Level C</price><ingredients><ingredient id="1001">vanilla extract</ingredient><ingredient id="1009">cocoa power</ingredient><ingredient id="1033">chopped pecans</ingredient> ... </ingredients><topping id="5001">none</topping><topping id="5002">sprinkles</topping><topping id="5005">Oreos</topping><topping id="5006">melted marshmallows</topping> ... </item> ... </items>

Do you see the problem? The comparison of the first case immediately makes the point that the update had to do with <price>. For the second case, most humans can’t spot the change without considerable and error-prone effort. Even though the data content is identical in the two cases, formatting without newlines makes typical changes much harder to read and analyze.

Most SCM tools compute differences line by line, so they lose much of their value when a document has no newlines. Massive newline-free documents become as difficult to manage as binaries, even though they are nominally text.
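You can reproduce the effect without any SCM at all. The sketch below uses Python’s standard `difflib` module as a stand-in for an SCM’s diff; the helper `changed_lines` and the shortened documents `PRETTY` and `FLAT` are made up for illustration.

```python
import difflib


def changed_lines(before, after):
    """Return only the removed and added lines of a line-based diff."""
    diff = difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm="", n=0
    )
    # Drop the two file-header lines, then keep only "-" and "+" lines
    # (hunk headers start with "@@").
    body = list(diff)[2:]
    return [line for line in body if line.startswith(("+", "-"))]


PRETTY = "<item>\n  <name>Rocky Road</name>\n  <price>Level B</price>\n</item>"
FLAT = "<item><name>Rocky Road</name><price>Level B</price></item>"

for document in (PRETTY, FLAT):
    edited = document.replace("Level B", "Level C")
    print("\n".join(changed_lines(document, edited)))
    print()
```

For `PRETTY` the report contains only the two `<price>` lines. For `FLAT` it contains the entire document twice, because the whole document is a single line.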

The conclusion: If at all possible, “pretty-print” any XML, JSON, CSV and other formats that appear in your repository with newlines so that the results are more readable to humans. You’ll simultaneously find that many of the SCM’s tools work better as well.
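Pretty-printing is easy to automate before a commit. The following is a minimal sketch using only Python’s standard library. The function names are hypothetical, `ET.indent` requires Python 3.9 or later, and sorting JSON keys is an extra choice made here so that key order stays stable between commits.

```python
import json
import xml.etree.ElementTree as ET


def pretty_xml(text):
    """Re-serialize an XML document with one element per line."""
    root = ET.fromstring(text)
    ET.indent(root, space="  ")  # available in Python 3.9+
    return ET.tostring(root, encoding="unicode")


def pretty_json(text):
    """Re-serialize a JSON document with indentation and sorted keys."""
    return json.dumps(json.loads(text), indent=2, sort_keys=True)


print(pretty_xml("<item><name>Rocky Road</name><price>Level B</price></item>"))
print(pretty_json('{"price": "Level B", "name": "Rocky Road"}'))
```

Note that re-serializing changes whitespace inside the document, so check that no consumer of the file treats that whitespace as significant.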

Gross test comparisons

A final tip for wrangling binaries around SCM: Make tests manageable.

It might happen, for instance, that the desired outcome of a particular test sequence is a specific binary pattern in memory. Comparison with that pattern makes for a good test, of course. A better test, though, probably results when you serialize the binary sequence into something humans can read, understand and more easily maintain. Also, as previously mentioned, when humans can read it, the SCM tooling will probably work better with it.
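One simple serialization is a hex dump with a fixed number of bytes per line, so that a change to a few bytes touches only a line or two of the stored expectation. This is a sketch of the idea rather than a recommended format; the function name `to_hex_lines` and the width of eight bytes are assumptions, and `bytes.hex` with a separator requires Python 3.8 or later.

```python
def to_hex_lines(data, width=8):
    """Render bytes as text: one line per `width` bytes,
    each line holding the offset followed by hex pairs."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        lines.append(f"{offset:04x}: {chunk.hex(' ')}")
    return "\n".join(lines)


# A test would compare this text against an expected text file kept in SCM.
print(to_hex_lines(bytes(range(10))))
```

If the binary pattern has a known structure, decoding it into named fields is likely to be more readable still than raw hex.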

Another aspect of manageability with tests is to lighten them. It can be tempting to take log or event sequences from production and convert them to tests. While that approach has benefits, indiscriminate use of production data often leads to test sets with gigabytes of data when a much smaller, more carefully chosen set (in extreme cases, only a few kilobytes) would exercise the code just as well. It’s common to find applications whose source barely fills a megabyte while their test data takes up hundreds of times as much space.
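What counts as “carefully chosen” depends on the application. As one illustration only, the sketch below keeps the first record of each kind from a production log, on the assumption that records of the same kind exercise the same code path. The function name, the log format, and the grouping key are all hypothetical.

```python
def representative_sample(records, key, per_group=1):
    """Keep the first `per_group` records for each distinct key value,
    preserving the original order."""
    counts = {}
    sample = []
    for record in records:
        group = key(record)
        if counts.get(group, 0) < per_group:
            counts[group] = counts.get(group, 0) + 1
            sample.append(record)
    return sample


# Hypothetical production log: the first word is the event type.
LOG = [
    "LOGIN user=a",
    "LOGIN user=b",
    "PURCHASE user=a item=17",
    "LOGIN user=c",
    "LOGOUT user=a",
]

for line in representative_sample(LOG, key=lambda line: line.split()[0]):
    print(line)
```

A coverage report is a better judge than event type of whether the reduced set still exercises the code as well as the full one.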

Sometimes gigabytes of test data represent real value: careful comparisons that help keep the application healthy. Sometimes they just represent accidental bloat. Take a little time to judge the benefits your test data bring your application.

About the Author

Cameron Laird is an award-winning software developer and author. Cameron participates in several industry support and standards organizations, including voting membership in the Python Software Foundation. A long-time resident of the Texas Gulf Coast, Cameron's favorite applications are for farm automation.
