Hi, I’m Jovan and Post Analogue is my blog. I’m a Computer Science PhD student at the University of Cambridge, working in Security and Privacy. My work on privacy-preserving technologies also touches on law, economics, and ethics.
Also I’m a standup comedian, but you won’t find a special on your favoured streaming service… yet.
Hit me up, Netflix; let’s make it happen.

What actually is a data clearinghouse?

And other questions for Tim Cook’s put-upon architect

Tim Cook, everyone in tech’s friendly uncle, had an interesting op-ed in Time recently about taking your privacy back. People I talk to generally fall into one of two camps on Uncle Tim and his loud privacy talk: they consider him and Apple to be either:

  • crusaders for privacy rights,
  • or opportunists conveniently attacking the business model that has helped Google and Facebook lap them on functionality (i.e. leveraging “Machine Learning” on “Big Data”).

But I don't really care about that debate; I care about the debate that seems to be increasingly prevalent in public discourse—the concern about our ability to tell who’s using our data for what. (And whatever you think of Uncle Tim, you probably agree that this is a question worth asking.) So what's the problem the data clearinghouse is supposed to solve, and what would it have to look like?


What market would the clearinghouse actually break up?

There are a whole host of names out there for the data collection and sale phenomenon we have become subject to: the data-industrial complex, surveillance capitalism, Sheryl and Mark’s playground, etc. But while it’s becoming increasingly apparent that this vast economic engine (whatever you choose to call it) exists and pervades our lives, even the most informed of us have only a vague sense of its scale, let alone its shape.

Last night I attended an event put on by the Trust & Technology Initiative here in Cambridge, as part of the launch of Shoshana Zuboff’s new book on surveillance capitalism. The chair of the panel described the problem of understanding the surveillance capitalist economy using an old adage about a group of artists depicting an elephant. Because the elephant is so big, the artists split up the task between them, each looking at a different part of the elephant and trying to put together an image. It’s an excellent analogy for the problem we face as researchers of data markets.

For example, I hit a wall with some research on advertising privacy last year. I believed I understood the mechanics of the problem—the technical basis for targeted advertising—and I wanted to propose some technical privacy protections. After I’d laid out the basics of my idea, my supervisor stopped me. He reminded me of the point that underlies all the research we’re undertaking: preserving privacy is not a solely technical problem; it has economic and legal factors too. And while I had tried to address the market as I saw it, I didn’t really know enough of the whole picture to know if I was addressing the right problem. He told me:

If you wanted to do this, you could go to Google and do an internship with the Search team. Then you’ll know what the problems are, but until then I don’t think this is worth your time.

He was right. I couldn't see the whole elephant, so I couldn't really diagnose its ills and treat it. (My metaphor is getting away from me.) In fact, I soon experienced a perfect example of this problem—while attempting to perform experiments with targeted ads, someone on the inside of the market noticed me, and revoked my access to the advertising tools...

From the little I have read of Zuboff’s book so far, it seems that she has done a better job than anyone yet of painting the whole elephant. But we still live in that difficult phase where researchers and civil liberties campaigners might have to resort to corporate espionage to even know if they’re asking the right questions.

For now, what we can confidently say about data markets is that:

  1. there are brokerage apparatuses that allow the collection and fusion of almost all aspects of a person’s behaviour,
  2. these brokerage tools are a black box, and become any more transparent only to those who build business relationships with the data companies,
  3. and that regulation remains toothless while no-one knows what compliance looks like.

In Uncle Tim’s words:

One of the biggest challenges in protecting privacy is that many of the violations are invisible. For example, you might have bought a product from an online retailer—something most of us have done. But what the retailer doesn’t tell you is that it then turned around and sold or transferred information about your purchase to a “data broker”—a company that exists purely to collect your information, package it and sell it to yet another buyer. The trail disappears before you even know there is a trail. Right now, all of these secondary markets for your information exist in a shadow economy that’s largely unchecked—out of sight of consumers, regulators and lawmakers. Let’s be clear: you never signed up for that. We think every user should have the chance to say, “Wait a minute. That’s my information that you’re selling, and I didn’t consent.”

On that cheery note, now that we know how much we don’t know about the problem, let’s interrogate the idea of data clearinghouses as a solution.

Who operates the clearinghouse?

The idea of a clearinghouse is appealing for addressing the lack of transparency in data sharing. Tim again:

That’s why we believe the Federal Trade Commission should establish a data-broker clearinghouse, requiring all data brokers to register, enabling consumers to track the transactions that have bundled and sold their data from place to place, and giving users the power to delete their data on demand, freely, easily and online, once and for all.

Uncle Tim wants the FTC to step in and operate this clearinghouse—one clearinghouse, under the oversight of the US government. Let’s take that step by step.

One clearinghouse, under God

The appeal of a single clearinghouse is clear. In the digital economy, where network effects are strong and marginal costs are low, markets become winner-takes-all. The biggest clearinghouse would have the most data linked into it; everyone would flock there, and our decentralised system would effectively be nothing of the sort. It becomes increasingly easy to miss violations from the smaller clearinghouses, as you have to expend all your resources monitoring the quasi-monopolist. So why bother with allowing a market?

If a market of clearinghouses were to exist, it might be simpler to transition to (perhaps existing brokerage platforms would transpose themselves into this new model). Data protection agencies, in that case, would be forced to take on the role of ombudsman. I can’t claim to be optimistic about this prospect, given the mishmash that is GDPR compliance and enforcement right now. Maybe asking DPAs to make sure everyone does their business in clearinghouses and monitor all the clearinghouses for compliance is too much of an ask. Plus, if the idea is to give users control, do they then have to go check their dashboard on every single clearinghouse’s website?

Instead, better to only allow a single clearinghouse—minimise the number of cracks in the system so that no smaller clearinghouses can work dark deeds in the shadows, right?

The fiscal conservatives among you are probably up in arms right now, and not for no reason. Having a single clearinghouse, operating the data brokerage economy under its own rules, is anathema to the ideal of a free market economy. And while I don’t believe that free market economies ever really exist, the question must be asked: at what point do the policies and interface of the clearinghouse amount to anticompetitive regulation? The availability of data has enabled some remarkable advancements—any strong hand that clamps down on this will have a tough balancing act to perform, not to mention defend politically.

One clearinghouse, under the FTC

If we’re going to control the data brokerage market in one place, who’s going to run it? Uncle Tim suggests the FTC, but what about the rest of us, who don’t live in the US? Presumably the EU would set up a similar system, but given the cross-border nature of the internet, the two would have to be sufficiently aligned that data held by American companies on EU citizens shows up in the EU’s clearinghouse. As with the current Privacy Shield arrangement that lets US companies comply with the GDPR, that means giving a foreign regulatory agency visibility into, and direct enforcement powers over, your company.

Politically alone, this would be a remarkable undertaking, but Privacy Shield does establish a precedent. The real question is whether enforcement becomes feasible at the international scale. It’s hard enough for EU DPAs to prosecute Google and Facebook under the GDPR; a clearinghouse regime might be harder yet.

The idealist in me hopes that the solution might be a supranational regulatory body, but 2016 taught me not to trust in optimism.

How do we know everyone’s participating in the clearinghouse?

Compliance, compliance, compliance. Having a centralised clearinghouse does make the job of investigation easier—you’re either in or you’re out. Then again, who’s investigating? Today, in the US, this comes down to the FBI; in the EU each nation has its own DPA. The jury is still out on how effective they have been as prosecutors of data protection infringements.

My impression (and I welcome correction) is that the oversight they provide only comes into play once an infringement is reported: in the case of the FBI, a suspicion of criminal activity is the impetus; for a European DPA I’m still not clear (a recent conversation with an officer from the Irish DPA gave me the distinct impression that they’re not interested in prosecuting at all).

Perhaps, if governments ever decide to be more hawkish on data protection, we’ll see data audits in addition to tax audits. Maybe, instead of the clearinghouse being operated by the FTC, the US government will establish a data equivalent to the IRS—to operate the clearinghouse and audit businesses. A boy can dream…

How much activity gets forced into the clearinghouse?

So, we’re pulling the data economy out of the shadows and into the light. Great! But to what degree do we force all data to flow through the centralised point we’re creating? If my thermostat company wants to sell my data to Google, then that’s a clearinghouse-relevant transaction, sure. But there are more nuanced cases that need to be addressed, too.

Breaking up the data banks

What happens when data-dealing companies are vertically integrated—i.e. when the same company operates the businesses of data collection, fusion, and whatever goes on top of that (advertising, surveillance, etc)? Take Google, for example. If we’re forcing data flow to happen in the clear, does this mean that it can’t simply pass the stuff it reads in your Gmail inbox along to its search algorithms? By the philosophy of visibility, accountability, and control that led us to the clearinghouse model, Google would have to make this data flow visible in the clearinghouse.

This would obviously have huge implications for the business practices of big data-centric companies. The data flows that were previously a given, and unregulated, for Facebook and Google would now be subject to the restrictions and controls imposed by clearinghouse policy.

The financial impact of this move would be multifaceted. Firstly, the processes being performed would take an efficiency and power hit as their data access is restricted. Secondly, it would break the horizontal and vertical integration in the big businesses we know today, and create a more commoditised data market.

For example, with a weak instantiation of this rule, one could keep their Facebook profile from being linked to their Instagram profile, thereby separating their social lives on the platforms. This would obviously reduce Facebook’s power to provide targeted advertising on Instagram, but might bolster the appeal of Instagram to younger users who want to separate their online social personas.

A strong instantiation of the rule would prohibit Facebook from making a special brokerage arrangement for Instagram—either they don’t link data across the services or that linkage must be made available to all businesses via the clearinghouse. This would essentially have the effect of breaking up Facebook (which I also support for many other reasons).

While this would obviously break the integrated business models in play on the internet now, it would replace them with other financial opportunities (Google and Facebook would be forced to sell their data more broadly, on a standardised platform). With proper oversight, this might lead to an internet that hews closer to the democratised ideal of yesterdecade.

Either way, for the clearinghouse model to truly provide the visibility and control envisaged, it must reach into the big integrated businesses themselves, and alter the way that data flows within them, not simply how it flows into and out of them.

Handling inference and fusion

Not all personal data that gets passed around is directly identifiable. Sometimes you get one “anonymous” row of data from one source, and another “anonymous” row from another, and put them together and tada! You’ve got a new row of data that is clearly specific to a particular person. How to handle these cases is not obvious.
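To make that concrete, here’s a toy sketch in Python of a linkage attack: two “anonymised” tables, neither containing a name, joined on the quasi-identifiers they happen to share. The datasets, column names, and values are all invented for illustration; nothing here describes any particular broker’s format.

```python
import pandas as pd

# Hypothetical "anonymised" purchase records from a retailer: no names.
purchases = pd.DataFrame([
    {"postcode": "CB3 0FD", "birth_year": 1993, "gender": "M", "item": "smart thermostat"},
    {"postcode": "SW1A 1AA", "birth_year": 1971, "gender": "F", "item": "running shoes"},
])

# Hypothetical "anonymised" check-in records from an app: also no names.
checkins = pd.DataFrame([
    {"postcode": "CB3 0FD", "birth_year": 1993, "gender": "M", "place": "comedy club, Tuesdays"},
    {"postcode": "SW1A 1AA", "birth_year": 1971, "gender": "F", "place": "gym, 6 a.m."},
])

# Joining on the shared quasi-identifiers fuses the two rows into a single,
# much richer record; in a small enough population, that record is clearly
# specific to one person, even though neither source "identified" anyone.
fused = purchases.merge(checkins, on=["postcode", "birth_year", "gender"])
print(fused)
```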

For example, take targeted advertising. Selling services on top of data leaks data—when an ad gets shown to me, that impression is reported to the buyer, leaking the information that someone at location _X_ with features _Y_ was on their phone at time _T_. This quickly adds up and can be used to build profiles. The clearinghouse in that case has to handle data transmission by implication. Or is this handled through audits by the IRS4Data?
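As a crude sketch of how that adds up, here’s what accumulation on the buyer’s side might look like, assuming impression reports arrive keyed on some stable identifier; the device ID, field names, and values below are all invented for illustration.

```python
from collections import defaultdict

# Each impression report is individually innocuous: someone with features Y
# was at location X at time T. Accumulated against a stable identifier,
# they become a movement profile that nobody explicitly handed over.
impression_reports = [
    {"device": "a91f", "location": "Cambridge",    "features": ["android", "18-24"], "time": "08:12"},
    {"device": "a91f", "location": "King's Cross", "features": ["android", "18-24"], "time": "09:47"},
    {"device": "a91f", "location": "Shoreditch",   "features": ["android", "18-24"], "time": "22:30"},
]

profiles = defaultdict(list)
for report in impression_reports:
    profiles[report["device"]].append((report["time"], report["location"]))

print(profiles["a91f"])  # a morning commute and a night out, from three "harmless" ads
```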

This act of building profiles is, in fact, the main business model that underpins Zuboff’s “Surveillance Capitalism”. Google’s strength is its capacity to fuse data together into rich profiles, which it feeds into models to gain remarkable predictive power. Facebook can similarly draw together disparate pieces of behaviour witnessed across the web and tie them to your Facebook profile. The field of privacy-preserving data analysis is still nascent and offers no strong answers yet to the problem of imposing limits on the invasiveness of these practices. The best answer we have so far, and the route a clearinghouse would have to take, is to ensure auditability.

When a company collects data from the clearinghouse, the trail might end there: any manipulations performed on the data afterwards can still be done in the dark. Google feeds the data “legitimately and consensually obtained” from the clearinghouse into its opaque machine learning models. Put simply, you know what information you gave it, but you don’t know what it learned. True auditability would require Google to leave some paper trail of what learning processes your data is subjected to after leaving the clearinghouse.
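I don’t know what that paper trail should look like, but as a strawman, an entry per downstream use might be as simple as the record below. The fields, and the idea of hashing the released data so an auditor can later verify what went in, are my own assumptions, not anything Cook’s op-ed proposes.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """One hypothetical paper-trail entry: who took what out of the
    clearinghouse, and which process the data was fed into afterwards."""
    recipient: str     # company that obtained the data
    purpose: str       # declared downstream use
    process: str       # the model or pipeline the data entered
    data_digest: str   # hash of the released records, checkable at audit time
    timestamp: str

def record_release(recipient, purpose, process, records):
    digest = hashlib.sha256(json.dumps(records, sort_keys=True).encode()).hexdigest()
    return ProvenanceRecord(recipient, purpose, process, digest,
                            datetime.now(timezone.utc).isoformat())

# Proves *that* your data entered an opaque model, without requiring the
# model owner to disclose the model itself.
entry = record_release("ExampleCo", "ad targeting", "interest-prediction model v3",
                       [{"user": "a91f", "interests": ["comedy", "privacy"]}])
print(asdict(entry))
```

Whether an entry like this amounts to meaningful auditability, rather than a fig leaf, is exactly the open question.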

Google would fight tooth and nail (justifiably so) to prevent the strongest route to such auditability: forcing companies either to perform all their operations in the open (perhaps on clearinghouse infrastructure), or to put all data post-fusion back into the clearinghouse. That’s their work, their economic value added; why should everyone get access to it? A middle ground is needed, one in which provenance can be proven without full disclosure, and the results of unseen manipulations explained to the average citizen. I won’t presume to have any answers here; it’s a huge and difficult field of research, made all the harder by the advent of unexplainable AI.

What tech will the clearinghouse run on?

I’ve spent a lot of time here wargaming the economic and legal implications of attempting a data clearinghouse model, and not talked much about the technical needs of such a system. But as I noted in my discussion of provenance in data fusion, the technical needs are still open questions. As a technologist, I must resist the urge to put the cart before the horse and propose technical solutions before the high-level framework is discussed. Understanding what a data clearinghouse is supposed to be must come before the discussion of how we build it.


I intended for this to be a short set of thoughts on Tim Cook’s proposal, and somehow I’ve spun his ten-paragraph position statement into a whole essay. In any case, I think it’s worth thinking about the implications of the idea. It’s an attractive idea for sure, but upon interrogation proves to be far too deep to point to as the clear solution for our problems—as tempting as that may be.


