According to our reading of Hadley Wickham’s article “Tidy Data”, a dataset is “tidy” if (1) each variable forms a column, (2) each observation forms a row, and (3) each type of observational unit forms a table. Otherwise, the data is not structured in a way that lets the analyst easily derive meaning from it (Wickham 1-3). For historians, this “tidy” structure is useful in part because of its sheer intuitiveness. While transcribing a portion of the 1799 George Washington slave ledger, I was struck by the realization that I was building my spreadsheet to be “tidy” in exactly this sense without intentionally designing it as such: designating the first column with the variable “name” and ascribing other variables like “location” or “labor status” to subsequent columns simply came naturally. The way George Washington structured his ledger was useful only for his own purposes, whereas the tidy structure means anyone can read from left to right and observe that the enslaved man Sam Cook was considered property of George Washington, lived in his Mansion House, and was considered too old to work (“passed labor”).
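The tidy structure described above can be sketched in a few lines of code. This is only an illustration under my own assumptions: the column names (`name`, `location`, `labor_status`) are my transcription choices, the `lookup` helper is hypothetical, and the single row shown is the Sam Cook entry discussed in the text.

```python
# A minimal sketch of a tidy table: each variable is a column (dict key),
# each observation (one person's ledger entry) is a row. Only the row
# for Sam Cook is taken from the ledger as discussed above.
ledger = [
    {"name": "Sam Cook",
     "location": "Mansion House",
     "labor_status": "passed labor"},
]

def lookup(rows, name):
    """Return the single row (observation) for a given name, or None."""
    return next((row for row in rows if row["name"] == name), None)

row = lookup(ledger, "Sam Cook")
print(row["location"])      # Mansion House
print(row["labor_status"])  # passed labor
```

Reading one row left to right recovers the full observation, which is exactly the “anyone can look from left to right” property the tidy layout provides.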
Of course, this process has also highlighted the significance of the contrast between received and derived data. The original intent behind the received data can be obscured or recreated in derived data (and which is “better” is always a contextual matter). For example, when I created the column “labor status,” I drew on how George Washington categorized slaves at different locations under the special subheadings “passed labor” or “child,” and turned that into three “labor status” designations in my table: “working,” “passed labor,” or “child.” This single decision involved multiple layers of interpretive judgment. On one hand, it assumes that in categorizing his slaves this way, Washington operated on a subtext of thinking of the enslaved as either useful, no longer useful, or eventually useful. But in deciding to carry this subtext into my derived data (which in theory stays true to the intent of the received data George Washington left us), I also had to recognize that I was recreating slavery’s injustice and dehumanization in doing so. In most circumstances, “child” would be considered a social status tied to age and kinship; listing it as a labor status feels blatantly wrong. Yet on the flip side, if one were to make “child status” a yes-or-no category in the derived data, this would avoid recreating the injustice, but at the cost of erasing the injustice contained in the received data. And when one derives categories that simply are not listed in the received data (such as inferring the gender of enslaved people from their names or marital status in a ledger), these interpretive risks grow steeper still. If nothing else, this highlights why it is so important that we as historians be transparent not just about what our final data analysis says, but about how we reached that data and why we derived it as we did.
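The two competing derivations discussed above can be made explicit in code. This is a hedged sketch of my own reasoning, not the ledger’s logic: the function names, the `None` convention for entries under no subheading, and the “working” default are all my interpretive assumptions.

```python
# My derivation: Washington's special subheadings become values of a
# derived "labor status" column; entries under no subheading are
# assumed to be "working". This encodes (and so recreates) the
# useful / no-longer-useful / eventually-useful subtext.
def derive_labor_status(subheading):
    """Map a ledger subheading (or None) to a derived labor status."""
    if subheading in ("passed labor", "child"):
        return subheading
    return "working"

# The alternative discussed above: a separate yes-or-no child category,
# which avoids recreating the injustice but erases it from the record.
def derive_is_child(subheading):
    """Return True only for entries under the 'child' subheading."""
    return subheading == "child"

print(derive_labor_status(None))    # -> working
print(derive_labor_status("child")) # -> child
print(derive_is_child("child"))     # -> True
```

Writing both versions side by side is one concrete way of being transparent about how and why the derived data was shaped as it was.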
Data and statistics are powerful tools. As the practice of deriving data shows, we can pull far more than meets the eye from the information we are directly given in order to answer historical questions, but we can also create lies in doing so. By the same token, one should not accept contemporary data at face value, but should always ask how that data was created.