Machines Shouldn’t Have to Spy On Us to Learn

In old spy novels, when two secret agents need to communicate with each other out in the field, one of them often leaves a document in an assigned place—tucked in the hollow of a tree trunk or between the pages of a certain library book. Once the first agent has safely vacated the scene, the second one moves in to fetch it.

This maneuver—called a dead drop—may seem straightforward. But if you think about it, there’s a serious hitch: Somehow, the location of the dead drop has to be prearranged.

This isn’t just a problem with the genre conventions of spy thrillers. For thousands of years, this was, in fact, a fundamental flaw in human communication. Whether you were Caesar, Napoleon, or a spy using shortwave radio during the Cold War, if you wanted to communicate secretly in the future, you had to first manage to communicate secretly in the past. And arranging that moment of first contact between two agents was always dangerous, difficult, and prone to interception.

Then in 1976, two privacy-oriented computer scientists, Whitfield Diffie and Martin Hellman, cracked this age-old problem with an idea called public key cryptography. They came up with an ingenious protocol that involved the use of one-way functions—mathematical calculations that are easy to solve in one direction but very, very difficult in the other. (It’s easy to multiply two large prime numbers together to get a very large result, but it’s super hard to work backward to find those original two numbers—unless you own billions of dollars’ worth of supercomputers. Hi, NSA!) Here’s how it most often works in practice: One person, let’s call her Alice, creates a private key, which gets fed through an algorithm using one-way math functions to create a public key. After that, anyone can send an encrypted message to Alice, and only she will be able to unlock it easily.
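
If you want to see the shape of the trick in code, here is a minimal sketch in Python, using the open source cryptography package and RSA, a later public-key scheme built on the prime-number trapdoor described above; the message, key size, and padding choices are illustrative assumptions, not a security recipe.

```python
# Minimal sketch of public-key encryption with RSA, which rests on the
# "easy to multiply primes, hard to factor" one-way function.
# Requires the third-party package: pip install cryptography
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Alice creates a private key; the matching public key is derived from it.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# Anyone holding the public key can encrypt a message to Alice...
ciphertext = public_key.encrypt(b"meet me at the old oak tree", oaep)

# ...but only Alice, holding the private key, can easily decrypt it.
plaintext = private_key.decrypt(ciphertext, oaep)
assert plaintext == b"meet me at the old oak tree"
```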

The implications of this breakthrough have been vast. For most of human history, if anyone wanted to communicate secretly, they had to engage in a bunch of perilous, time-consuming, and expensive spycraft. That was the trade-off, the cost of doing business. Public key cryptography dramatically changed that. All of a sudden, in one swift revolutionary moment, people could communicate secretly ab initio—not only spies, but everybody. And it was fairly painless. Just think of all the confidential exchanges you conduct with strangers online: Every time you log in to a bank account or shopping site, browse an “https” web page, or send a digital signature, you’re relying on a user-friendly version of public key cryptography. Diffie and Hellman didn’t just render the dead drop obsolete; they made the internet as we know it possible.

Now consider another, more modern phenomenon that needs to be made obsolete—a data transaction that is very different from a dead drop but also terribly flawed.

Today, if you’re a company that benefits from machine learning, chances are that, somewhere along the line, you’ve engaged in a bunch of nightmarish surveillance. Right now that’s just the cost of doing business with this technology. Machine learning, as you may have heard, is a computational method that involves training algorithms to recognize patterns in lots and lots of data until, eventually, they can answer questions, make predictions, and solve problems. For example, Google Translate works by chewing through billions of words of existing translated text—its “training corpus”—to generate the most probable correct translation. The results aren’t perfect, but they can be uncannily effective.
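
To make that pattern-matching concrete, here is a toy sketch in Python. It tallies a tiny, hypothetical parallel corpus and answers with the most frequent translation it has seen, which is only the crude statistical skeleton of the idea; a real system like Google Translate is vastly more sophisticated.

```python
from collections import Counter, defaultdict

# A hypothetical, hand-made "training corpus" of (source, translation) pairs.
corpus = [
    ("good morning", "buenos dias"),
    ("good morning", "buenos dias"),
    ("good morning", "buen dia"),
    ("thank you", "gracias"),
]

# "Training": tally which translations show up for each source phrase.
counts = defaultdict(Counter)
for source, translation in corpus:
    counts[source][translation] += 1

def translate(phrase):
    """Answer with the most probable translation seen in the corpus."""
    return counts[phrase].most_common(1)[0][0]

print(translate("good morning"))   # -> buenos dias
```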

They’re effective enough, anyway, that the rush to capitalize on this form of AI has set off a feeding frenzy. Companies are collecting, scraping, mining, and buying every scintilla of data they can get their hands on. Unscrupulous third parties are doing even worse. Wherever lots of data is stored, there’s a good chance it will be breached, pilfered, sold, hacked, and eventually fed into machine-learning protocols, sometimes for malevolent purposes. The technology’s success is simultaneously a privacy-violating disaster for society.

At the same time, other pitfalls with today’s machine learning hold it back from being useful where it might really help. For instance, it’s a challenge right now for responsible actors to use machine learning in scenarios where it’s not legally possible or ethically desirable to share underlying data. Say two hospitals each have a corpus of data about patient outcomes after initial Pap smears—a notoriously difficult diagnostic test for cervical cancer. Machine learning has shown some real promise at being able to read these tests. It might be highly beneficial to train an algorithm on the whole data set, but that would require bringing it all together somehow. Hospitals may not be legally allowed to share patient information. Their hands are tied.

So how do we preserve—and even expand—the benefits of machine learning while also reconciling it with basic standards of confidentiality? We need some new breakthroughs that fundamentally change the rotten trade-off we now make between privacy and AI. The good news is that there’s a growing research effort in what’s called “privacy-preserving” machine learning. Academics are trying to develop algorithms that can operate on encrypted data, which means they wouldn’t need to access anyone’s data directly. Other researchers are figuring out ways to combine insights from different machine-learning models without needing to merge all their underlying data. Companies like Apple, Google, and Microsoft already have teams working on such projects.
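
One of those ideas, often called federated learning, can be sketched in a few lines of Python. In this toy example, built on made-up data and a single averaging step as simplifying assumptions, two hospitals each train a small model locally and share only the learned weights, which a coordinator averages into a combined model; no patient record ever leaves either hospital.

```python
import numpy as np

def train_logistic(X, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression, run only on local data."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        preds = 1.0 / (1.0 + np.exp(-X @ w))    # predicted probabilities
        w -= lr * X.T @ (preds - y) / len(y)    # gradient step
    return w

rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0, 0.5])             # made-up "ground truth"

# Each hospital's patient records stay on its own servers.
X_a = rng.normal(size=(200, 3))
y_a = (X_a @ true_w > 0).astype(float)
X_b = rng.normal(size=(200, 3))
y_b = (X_b @ true_w > 0).astype(float)

w_a = train_logistic(X_a, y_a)                  # trained at hospital A
w_b = train_logistic(X_b, y_b)                  # trained at hospital B

# Only the learned weights travel; the coordinator never sees patient data.
w_combined = (w_a + w_b) / 2
print("combined model weights:", w_combined)
```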

With effective regulation, those teams would grow faster. We should ban the spurious and excessive data collection that is currently the norm—not only to put a stop to obviously abusive practices, but also to speed innovation. When faced with new regulatory barriers, companies and researchers will pour effort into developing new, compliant ways to have their cake and eat it too.

With any luck, their breakthroughs will make it newly possible for, say, medical researchers to use machine learning on sensitive, private data sets—and for the rest of us to enjoy the perks of ubiquitous AI without having our privacy savaged. Those researchers’ challenge, and ours, should be to render the headlines of the early 21st century (Equifax hacked! Facebook data pilfered!) obsolete. Heck, one day they may even seem quaint—like the bizarre practices of spies in old paperback novels.


Zeynep Tufekci (@zeynep) is a WIRED contributor and a professor at the University of North Carolina at Chapel Hill.

This article appears in the April issue.

