A Brief Guide to Legal Corpus Linguistics, the Unholy Fusion of Big Data and Originalism

As its 6-3 conservative supermajority increasingly feels relieved of the burden to announce carefully reasoned decisions, the Supreme Court is getting lazy. In several cases this term, the justices announced that originalism is no longer a way to read the Constitution; it simply is the test for whether a law violates the Constitution. After Bruen, a gun restriction can survive only if Congress would have adopted it in 1787. And after Dobbs, you only have the due process rights John Locke wrote about in his diary during the 1680s.

Now that originalism is becoming the only test that matters, litigants in constitutional cases must come up with the most reasonable-sounding argument for why a true originalist must rule for their client. And in the arms race among Republican try-hards searching for a foolproof method for making this point, the hottest new toy in conservative legal academia is corpus linguistics.

A branch of the academic discipline of linguistics, corpus linguistics starts from the uncontroversial idea that a word or phrase’s meaning can be discerned by looking at how the word is used in real life. If you want to figure out what people meant when they said a certain thing at a certain point in history, you could look back at how people actually used that word at the time. You see this logic on a smaller scale when judges spend hours poring through old dictionaries to try to divine what Congress meant in 1983, or what the Framers meant two centuries earlier.

Corpus linguistics applies the power of Big Data to this process. Linguists assemble a massive online database of books and newspapers, pipe them through Adobe Acrobat so the words are searchable, and then use that database of text as a sort of preserved-in-amber Urban Dictionary. Among many others, linguists have constructed a Corpus of Founding-Era English (COFEA), Corpus of Early Modern English (COEME), Corpus of Historical American English (COHA) and Corpus of Contemporary American English (COCA). COHA made news earlier this year when Judge Kathryn Mizelle, a former clerk to Justice Clarence Thomas, held that the CDC didn’t have the authority to require people to wear masks on planes. In her opinion, Mizelle spent multiple paragraphs explaining why COHA yielded a definition of the word “sanitize” that compelled her to hogtie the CDC and leave it on a nearby railroad track.

Wealthy white dictionary enthusiasts discussing their passions, 1787 (Fotosearch / Stringer / Getty Images)

Corpus linguistics as real linguists use it has been around for a while. It was only ported into legal academia about a decade ago, in what is now the epicenter of legal corpus linguistics: Brigham Young University Law School. BYU runs the Law and Corpus Linguistics Technology Platform, the flagship legal project on corpus linguistics, and hosts an annual conference put on by Senator Mike Lee’s older brother, former Utah Supreme Court Justice Thomas Lee.

For a conservative judge, it’s easy to see the appeal of corpus linguistics: Deferring to what pops out of the computer screen gives a veneer of scientific authority to their arguments about the original meaning of the Constitution. It is a way to show liberal colleagues that what they’re doing isn’t just raw politics, but some rigorous, dignified academic effort.

But of course, we wouldn’t be talking about corpus linguistics if there weren’t serious problems with judges cosplaying as linguists. Defenders of legal corpus linguistics argue that using a database prevents a lawyer from cherry-picking historical sources or dictionaries to get the result they want. But some databases are so limited in scope that the cherry-picking is baked in to the process: For example, one analysis of COHA found that “the vast majority” of its identifiable authors are men, and that men outnumber women by “several orders of magnitude.” Another found that a whopping 30 percent of COFEA was written by six people: George Washington, John Adams, Thomas Jefferson, James Madison, Benjamin Franklin, and Alexander Hamilton.

In other words, a database of documents from an era when most people didn’t receive formal public schooling—meaning they were largely illiterate and may have had very localized, unique ideas about what a particular word meant—is disproportionately composed of the writings of a half-dozen dudes who graduated from colleges that required competency in two different non-English languages just to be admitted. As Notre Dame law professor Donald Drakeman pointed out in a 2020 law review article, many of the pamphlets and books of that era not related to law and politics were forms of religious printing often written by non-Americans. Suffice to say that these databases can’t answer the question lawyers and judges are asking: what the Constitution or a particular law meant to ordinary people at the time it was adopted. They can only hint at what a strikingly small group of people might have meant: white men of a certain educational, religious, and economic subset. Limiting your “search” to this cohort perhaps isn’t the best way to definitively find the “original public meaning” of the Second Amendment.

Sourcing isn’t the only problem. Like any tool, databases can’t answer questions on their own; someone must search them. And there are many ways to conduct electronic searches, none more obviously correct than the others. To understand why that’s important, it’s worth thinking about how lawyers and judges read laws in the first place. Law students are often confronted with a classic scenario when they learn about constitutional and statutory interpretation: Say you’re pushing your terrible son in a stroller toward a park when you see a sign that says “No vehicles in the park.” Which “vehicles” are banned? Some answers, like F-350s, are easy. But what about an ambulance in an emergency? Or a bicycle? Most importantly for you, what about the stroller your terrible son is asleep in?

This is a time-honored law school exercise for a reason. It asks law students to think about what the word “vehicle” means. But it also asks what the park meant to ban when it put the sign up, and whether it makes sense to apply the law to you and your terrible son.

So how could corpus linguistics answer this dilemma for you? Maybe you’ll search the word “vehicle” and see how often people used it to mean “stroller” when the sign was posted. Maybe you do things the other way around, and search for a bunch of synonyms for the word “vehicle” (transport, carrier, conveyance), and compare how often those synonyms meant “stroller” with how often the word “vehicle” meant “stroller.” Maybe you look up every single use of the word “vehicle” in every form of media from 1973 onward and come to a sacred, mystical communion with the Parks Committee from that era.

A LAWBREAKER (MAYBE) (Photo by Rasid Necati Aslim/Anadolu Agency via Getty Images)

These are entirely different ways of answering the question, and they can produce very different answers. That would be a real problem for linguists, who want to ensure that any search result or analysis could be reproduced by a future researcher—a key part of the scientific method. Judges don’t have to worry about that.
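To make that concrete, here is a minimal Python sketch of two of those search strategies run against the same text. Everything in it is an assumption for illustration: the six-sentence `CORPUS` is invented, the function names are hypothetical, and counting which words share a sentence is a crude stand-in for the hand-coding of meaning that real corpus work involves. It is not how COHA, COFEA, or any court actually runs a search.

```python
import re

# Invented toy corpus (assumption); real corpora hold millions of words.
CORPUS = [
    "the vehicle sped down the road past the park",
    "she pushed the baby carriage, a humble conveyance, through the park",
    "no vehicle may enter the park after dark",
    "the stroller is a vehicle for infants, the salesman joked",
    "a conveyance such as a stroller or pram is welcome on the path",
    "the ambulance is an emergency vehicle",
]


def tokens(sentence):
    """Lowercase a sentence and split it into alphabetic words."""
    return re.findall(r"[a-z]+", sentence.lower())


def cooccurrence_rate(term, target, corpus):
    """Strategy 1: share of sentences using `term` that also mention `target`."""
    hits = [s for s in corpus if term in tokens(s)]
    if not hits:
        return 0.0
    return sum(target in tokens(s) for s in hits) / len(hits)


def best_synonym(terms, target, corpus):
    """Strategy 2: which near-synonym shares a sentence with `target` most often."""
    rates = {t: cooccurrence_rate(t, target, corpus) for t in terms}
    return max(rates, key=rates.get), rates


vehicle_rate = cooccurrence_rate("vehicle", "stroller", CORPUS)
winner, rates = best_synonym(["vehicle", "conveyance", "carrier"], "stroller", CORPUS)
print(vehicle_rate)  # 0.25
print(winner)        # conveyance
```

On this made-up corpus, the first search lets you say that a quarter of the sentences using “vehicle” involve a stroller, which sounds like strollers are covered. The second lets you say that strollers turn up twice as often alongside “conveyance” as alongside “vehicle,” which sounds like they are not. Same text, same word, and the answer depends on which query the person at the keyboard felt like running.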

This is troubling even for researchers earnestly attempting to use corpus linguistics to shed light on a legal issue. You may recall that in the early years of the Trump administration, the president was accused of receiving illegal “emoluments” from foreign dignitaries by charging obscene rates at his D.C. hotels. In a 2017 paper, BYU research fellow Sara White and Chapman University professor James Phillips set out to determine what “emoluments” meant when the Founders prohibited the president from receiving them. They searched their chosen corpus linguistics databases for the word “emolument” and reviewed a huge sample of results, but found they could not agree on how each historical source was using the word as much as 30 percent of the time. So two professionals, in a serious academic effort, could not even agree on the meaning of over a quarter of their results!

That’s not great! Imagine how unreliable you would find a study of vaccines where researchers couldn’t agree whether 30 percent of patients were sick or not. And more crucially, we only know about this massive rate of disagreement because White and Phillips showed their work. How a freak Fifth Circuit judge ran a corpus linguistics analysis to justify overturning a ban on assault weapon possession by toddlers may be far less clear.
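For a sense of how researchers usually quantify this kind of disagreement, here is a short Python sketch that computes raw percent agreement and Cohen’s kappa, a standard statistic that discounts the agreement two coders would reach by chance. The labels below are invented for illustration and are not the White and Phillips data; the “broad” and “narrow” categories and the function names are likewise assumptions.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two coders chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("coders must label the same non-empty set of items")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both coders pick the same label at random,
    # given how often each coder used each label.
    p_e = sum(counts_a[label] * counts_b[label] for label in set(a) | set(b)) / (n * n)
    if p_e == 1:
        raise ValueError("kappa is undefined when both coders use one identical label")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical codings of ten uses of a word (assumption, not real data).
coder_a = ["broad"] * 5 + ["narrow"] * 5
coder_b = ["broad"] * 4 + ["narrow"] * 4 + ["broad"] * 2

print(percent_agreement(coder_a, coder_b))       # 0.7
print(round(cohens_kappa(coder_a, coder_b), 2))  # 0.4
```

In this invented example the coders agree on 70 percent of items, but once chance agreement is removed the kappa is only about 0.4. The point is not the specific number; it is that a disagreement rate like this can at least be measured and reported when researchers show their work, which a judicial opinion is under no obligation to do.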

For as much as proponents of this shit protest how cleanly they can wash their hands of biases, they never seem to acknowledge how dangerous it is that a ton of discretion is still left to the person doing the searches. This is hardly a surprise: As with every other exercise in originalism or textualism, judges never seem to stop and question whether they’re qualified to be doing any of this. That’s how you end up with a 35-year-old with a lifetime appointment to the federal bench requiring that flight attendants get COVID-19 every week forever.

All that points to the fundamental problem with corpus linguistics, or dictionary-shopping, or any other method for Scientifically Determining the Meaning of the Law: A judge—not an incomplete linguistic database, not a dictionary—is saying what the law is. And a judge is just a person, with all their preconceived beliefs, biases, and decades of Federalist Society membership shaping how they answer any question before them. What makes a judge’s answer the right answer is just the fact that they’re the ones saying it.

Unfortunately, corpus linguistics isn’t just an academic project for Provo-area law professors anymore. Several state supreme courts are adherents now. At least three different circuit courts have directed parties to brief the application of corpus linguistics, and judges on the Sixth Circuit have used it in a plurality opinion. Both Thomas and Justice Samuel Alito have written concurrences citing psycho conservative think tank amicus briefs that either rooted their argument in corpus linguistics or cited databases of Founding-era texts. Lawyers in one case this term, ZF Automotive US, Inc. v. Luxshare, Ltd., even pulled Chief Justice John Roberts and Justice Amy Coney Barrett into an extended discussion about corpus linguistics during oral argument. Corpus linguistics is breaking containment, and it’s coming to a Supreme Court majority near you.

As public scrutiny of this Court increases, conservative judges need ways to bolster their legitimacy and lull the public or the media into opposing broader reform of the federal judiciary. Corpus linguistics will give them something on which to hang their hats; they can convince themselves, and maybe even the press that covers them, that Big Data and the science of linguistics compel an outcome. They don’t. These judges are doing what they want, and corpus linguistics is helping free them to do it.

Legal Culture

James LaRock

Author

James LaRock is a practicing lawyer and recovering legal podcaster whose commentary has been featured in The Outline. The views he expresses on this web site do not represent those of his employer.
