Big data at scale
It’s the era of big data, and that means we need to focus on big test data as well. Getting our approaches to testing and managing these huge volumes of data right is crucial.

Let’s start with a little background. “Big data” generally means something like “analysis of data sets so large that conventional tools are inadequate.” Like “modern” or “legacy,” the meaning of “big data” shifts through time: what was large in 2005 is a classroom exercise in 2019. Still, so many organizations speak of big data that it’s useful to pin it down a bit more.

For 2019, big data almost certainly combines some or all of MapReduce, NoSQL, data lakes, machine learning, HPC (high-performance computing), and distributed file systems and data stores.

Testing at scale, then, has at least a couple of distinct meanings:
● Large-scale test data for conventional applications
● Test data for big data applications

The first piece of good news about testing at scale is that, in many cases, not much scale is required. Consider a concrete example of a realistic requirement: a particular service analyzes files from customers, except that for any file over 5 gigabytes it simply returns, “This file of size S is larger than the current limit of 5 gigabytes; contact our Professional Services department for possible alternatives.”

A simple-minded implementation and test of this requirement might look something like this (the source appears here in Python as a convenient pseudocode):

def analyze(file):
    filesize = size(file)
    if filesize > 5 * GIGABYTE:
        raise FileError(f"This file of size {filesize} bytes is larger ...")
    ...

def test_sizeRequirement(self):
    standard_result = ...
    ok_file = create_file(4 * GIGABYTE)
    result = analyze(ok_file)
    self.assertEqual(result, standard_result)
    too_big_file = create_file(6 * GIGABYTE)
    self.assertRaises(FileError, analyze, too_big_file)

While this is a desirable test, it might be infeasible; a complete analysis, or even the creation, of multi-gigabyte files might take so long that developers begin to ignore the test result.

An effective way to change this is to generalize the specification to something like:

def analyze(file, size_limit=5 * GIGABYTE):
    filesize = size(file)
    if filesize > size_limit:
        raise FileError(f"This file of size {filesize} bytes is larger ...")
    ...

def scaled_test(self, scale):
    standard_result = ...
    ok_file = create_file(4 * scale)
    result = analyze(ok_file, 5 * scale)
    self.assertEqual(result, standard_result)
    too_big_file = create_file(6 * scale)
    self.assertRaises(FileError, analyze, too_big_file, 5 * scale)

def test_sizeRequirementSmall(self):
    self.scaled_test(KILOBYTE)

def test_sizeRequirementFull(self):
    self.scaled_test(GIGABYTE)

With this scheme, test_sizeRequirementSmall() verifies the general operation of the size limitation in milliseconds rather than many minutes. test_sizeRequirementFull then presumably becomes an acceptance- or integration-level test, which is allowed a longer latency.

Quite a few large-data requirements can be tested this way, with a frequent small-scale test using “small” data combined with a specific full-scale test that runs for a longer span. This is an example of a more general principle that certain requirements become more testable when properly abstracted. While finding the right abstraction is generally a hard problem, testing specifically for scale frequently rewards this approach.
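Filled out into a self-contained, runnable sketch, the scheme looks like the following. The stubbed create_file and the specific return value of analyze are assumptions made here for illustration; a real helper would write actual bytes to disk.

```python
import unittest

KILOBYTE = 1024
GIGABYTE = 1024 ** 3

class FileError(Exception):
    pass

def create_file(size):
    # Stub for illustration: a real helper would write `size` bytes to disk.
    return {"size": size}

def analyze(file, size_limit=5 * GIGABYTE):
    filesize = file["size"]
    if filesize > size_limit:
        raise FileError(f"This file of size {filesize} bytes is larger "
                        f"than the current limit of {size_limit} bytes")
    return "analysis complete"

class SizeRequirementTest(unittest.TestCase):
    def scaled_test(self, scale):
        # The 4x/6x proportions bracket the 5x limit at any scale.
        ok_file = create_file(4 * scale)
        self.assertEqual(analyze(ok_file, 5 * scale), "analysis complete")
        too_big_file = create_file(6 * scale)
        self.assertRaises(FileError, analyze, too_big_file, 5 * scale)

    def test_sizeRequirementSmall(self):
        self.scaled_test(KILOBYTE)

    def test_sizeRequirementFull(self):
        self.scaled_test(GIGABYTE)
```

Because the proportions rather than the absolute sizes carry the logic, the small-scale test exercises exactly the same branch structure as the full-scale one.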

Peculiarities of MapReduce

Big data applications themselves can often be tested with surprisingly little difficulty. While the meaning of “big data” varies from one organization to the next, it’s often equated with Hadoop and MapReduce.

Crucial to the scalability of these technologies is that their algorithmic behavior is fully manifest even on small data sets. Unit testing of big data programs, at least, requires only a few lines. The Hadoop team illustrates this with an example of how to apply MRUnit that nearly fits on a single typed page.

Big data applications still deserve integration- or acceptance-level tests at full scale. Validation of algorithmic correctness, though, should be no harder than unit testing any conventional application.
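To see why small data sets suffice, consider word count, the canonical MapReduce example, sketched here in plain Python rather than MRUnit. The separate map, shuffle and reduce steps are exactly the structure a Hadoop job would have, so a handful of records exercises the whole algorithm:

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for each word in the record.
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce phase: combine all counts emitted for one key.
    return word, sum(counts)

def run_job(lines):
    # Shuffle phase: group mapper output by key, as the framework would.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in sorted(groups.items()))

# Three short records are enough to exercise every phase.
print(run_job(["big data", "big test data", "data"]))
# {'big': 2, 'data': 3, 'test': 1}
```

The same mapper and reducer that process three lines here process terabytes in production; only the framework around them changes scale.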

Many zeros, quickly

Consider cases in which large files truly are needed. There are still fast ways to create them. Under Unix, for instance:

GIBIBYTE=1073741824
dd if=/dev/zero of=my-big-file-of-zeros bs=1 count=$GIBIBYTE

This creates a file named my-big-file-of-zeros with just over a billion zeros in it. “Quickly” here is relative: with a block size of a single byte, even indifferent hardware takes more than an hour to write all those zeros, though a larger block size (for instance, bs=1M count=1024) produces the same file far faster. Linux also offers /dev/random and other special files that can supply content other than just zeros.

That’s not all, though. Linux also offers a truncate command that finishes almost instantaneously:

truncate -s 1G $FILENAME

The effect is to extend $FILENAME to exactly 1 gigabyte; the new bytes read back as zeros, and because most filesystems store such a file sparsely, almost no data is actually written, which is why the command is so fast.
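The same near-instant trick is available from within a test suite itself. Python file objects expose truncate(), which creates the same kind of sparse file; the helper name below is an illustration, not part of any standard API:

```python
import os
import tempfile

GIBIBYTE = 1024 ** 3

def create_sparse_file(path, size):
    # truncate() records the size in the file's metadata; on most
    # filesystems no data blocks are allocated, so this finishes
    # almost instantly regardless of `size`.
    with open(path, "wb") as f:
        f.truncate(size)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "big-file-of-zeros")
    create_sparse_file(path, GIBIBYTE)
    print(os.path.getsize(path))   # 1073741824
    with open(path, "rb") as f:
        f.seek(GIBIBYTE - 1)
        print(f.read(1))           # b'\x00'
```

As with the truncate command, reads from the unwritten regions return zeros, so the file behaves like a fully materialized gigabyte of nulls.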

Conclusion: A need for large files as formulaic test data can be met dynamically. It’s practical to generate at least some large test files at the time of testing, without keeping a copy in a repository of test data.

Realistic content

Tests of more valuable algorithms generally require more meaningful content, though. A function designed to parse a text into terms, and report on its findings, for instance, deserves test data more interesting than just a lot of zeros.

One natural inclination is to use customer data; a quick source of a volume of textual content might be all email messages received in 30 days. Don’t do this. While good tests require realistic data, real data involve prohibitive privacy and security risks.

A few obvious alternatives are also impractical. While it’s common enough to experiment with lorem ipsum text and “The quick brown fox jumped over the lazy dog,” these don’t scale well; hand-entering a megabyte, let alone a gigabyte, of useful content is infeasible. Moreover, behavior “in the small” often doesn’t adequately predict what a system does at full scale, and it can be essential to test a big data system with actual big data.

Anonymized data can, in principle, make good tests. The main difficulty of this approach is that anonymization is a subtle subject in which few testers are trained. It’s not enough to pull customer records, change all the names to “John615 Smith” and think that the data are adequately anonymized. They’re not. An approach like this too easily leaks correlated information about ages, addresses and so on, to embarrassing effect.

Large-scale English-language test content generally comes from a few sources:

  • Well-known, freely available corpora such as the Enron trial documents, Shakespeare’s plays or the Bible—many of these have appeared as sources for academic analysis
  • Closely related to the previous possibility are government publications on a variety of subjects, such as monthly reports of automotive recall information by the National Highway Traffic Safety Administration or publications from the Census Bureau
  • Several more commercial sources of content are sufficiently public and scalable to minimize security hazards, including data sets from Kaggle, feeds from Twitter or Reddit, or results of Google searches

Content in other languages or from other domains, such as accounting data, is also available from specialists.
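One lightweight compromise, assuming word-level statistics are enough for the system under test, is to expand a small seed vocabulary into files of arbitrary size. The pangram seed below is illustrative; in practice the vocabulary might be drawn from Shakespeare, Census Bureau publications or another public corpus:

```python
import random

# Illustrative seed; a real test would draw vocabulary from a public corpus.
SEED_TEXT = "the quick brown fox jumps over the lazy dog"

def generate_text(target_bytes, seed=SEED_TEXT, rng=None):
    # A fixed random seed keeps the generated data reproducible run to run.
    rng = rng or random.Random(42)
    vocabulary = seed.split()
    words, size = [], 0
    while size < target_bytes:
        word = rng.choice(vocabulary)
        words.append(word)
        size += len(word) + 1  # +1 for the separating space
    return " ".join(words)

sample = generate_text(1024)
print(len(sample))  # approximately 1024 bytes
```

The same loop generates a kilobyte or a gigabyte on demand, so nothing need be stored in a test-data repository, and no customer information is ever at risk.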

Summary

When testing big data at scale, remember these ideas:

  • Convert requirements expressed in terms of large data into more general requirements parametrized by scale, then test at several scales
  • Sufficiently simple large-scale data can be generated dynamically, often with common operating-system utilities
  • Large-scale, realistic data can easily become a privacy hazard
  • Requirements for certain systems become visible only at large scale
  • Take advantage of the work others have done to assemble large data sets


About the Author

Cameron Laird is an award-winning software developer and author. Cameron participates in several industry support and standards organizations, including voting membership in the Python Software Foundation. A long-time resident of the Texas Gulf Coast, Cameron's favorite applications are for farm automation.
