Popular conceptions of computing idolize the algorithm. However, working data scientists recognize that the data side of a project often demands the most effort and ingenuity.
In a testing context, this means that working out the right test data can be hard: Data needs to be realistic while not invading privacy, and inexpensive to manage but representative of real costs.
The good news is that you’re not alone. Plenty of other people need test data, and some of them share their results. Just as open-source software is “best of breed” in embeddable SQL engines, typesetting sophisticated mathematics, time zone management and several other domains, many wonderfully useful data sets are freely available. Here are a few to give you an idea of what’s out there.
Government and NGO Data
As an agency of the United States government itself writes, the data that agencies collect and curate is a national treasure. Data.gov is a clearinghouse for about a quarter of a million distinct, freely available data sets. These include everything from Coast Guard records of historical ship routes (where exactly across the Ohio River did barges pass before World War II?) to bullying rates in US schools as reported in the School Crime Supplement. The quality and convenience of these data sets vary widely: many references are provided as text rather than clickable hyperlinks, and tens of thousands of items, especially those related to geography, are formatted only as PDFs. Also, Data.gov isn’t comprehensive; if an item isn’t available through Data.gov, a different agency may still publish the same item in a convenient form. If nothing else, though, Data.gov is a great place to browse.
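Browsing can also be done programmatically. The following is a minimal sketch, not an official client: it assumes the Data.gov catalog still exposes the standard CKAN `package_search` action at the URL shown, and the function names are made up for illustration. It lists each matching data set's title together with the file formats on offer, which makes PDF-only entries easy to spot.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the Data.gov catalog serves the standard CKAN search action here.
CATALOG_SEARCH_URL = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(query, rows=5):
    """Return a CKAN package_search URL for a free-text query."""
    return CATALOG_SEARCH_URL + "?" + urlencode({"q": query, "rows": rows})


def summarize_results(payload):
    """Extract (title, sorted formats) pairs from a CKAN-style search response."""
    summaries = []
    for dataset in payload.get("result", {}).get("results", []):
        # Resources without a declared format are reported as UNKNOWN.
        formats = sorted(
            {(res.get("format") or "unknown").upper()
             for res in dataset.get("resources", [])}
        )
        summaries.append((dataset.get("title", ""), formats))
    return summaries


def search_catalog(query, rows=5):
    """Query the catalog over the network and summarize the matches."""
    with urlopen(build_search_url(query, rows), timeout=30) as response:
        return summarize_results(json.load(response))
```

Calling `search_catalog("school crime")` would then return a short list of titles and formats, provided the endpoint behaves as assumed. Keeping the URL building and the response parsing in separate functions lets you check the parsing against a saved response without touching the network.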
Plenty of other government units publish useful data sets online; together these number in the tens of millions. Countries such as Estonia and states such as Colorado explicitly prioritize transparency in their measured operations. But even with the best of intentions, plenty of obstacles interfere with such priorities, and self-serving actors have a stake in obscuring corruption or low performance.
Other prominent units with troves of social and political data include the United Nations, the World Bank and the European Union. As of this writing, DataWorld publishes references to 256 nongovernmental organization (NGO) data sets, among its other offerings. Enigma Public is another aggregator you should get to know for government, NGO, nonprofit and other data.
The ethos that motivates governments and charitable organizations to support open data generally holds that citizens own the data and that transparency promotes democracy and solidarity. Science’s relation to data availability is even more fundamental: “Open data means better science,” as one article put it.
Massive individual data sets from such big-science areas as particle physics and genomics increasingly are finding their way online alongside collections of open data pertinent to individual projects or publications.
Data science, or the investigation of information retrieval and management questions particularly suited to digital methods, has a culture of its own, distinct from the broader realm of science. For-profit companies like Kaggle and Google aggressively make specific collections of data sets freely available to researchers and students. Dataquest recently published a piece on freely available data sets valuable for practitioners of data science.
Self-organizing communities of advocates, practitioners, and affiliates also occasionally make useful data sets available. A subreddit devoted to data sets can be a great starting point for locating everything from free weaving patterns to chord progressions to birdwatchers’ habitat reports.
Your Data’s Out There
No single comprehensive, centralized authority exists for useful data sets. If you want a data set, you’ll likely have to do at least a little exploration on your own. Moreover, what’s digitized, online and freely available is only a fraction of a fraction of all the data that might be useful. And simple-minded searches with the usual search engines turn up only a fraction of those.
Once you’re clear on these hurdles, though, you can expect at least partial success in any data set project. Read through the different indexes and libraries of data sets, and familiarize yourself with their practices and conventions. You’ll then have a good grasp of what’s available and where to look for more specific results.
Whatever your domain, chances are good an applicable large-scale test set exists already.