Computers, Privacy & the Constitution

Search Engines and Technological Privacy Solutions

Few of us share all of our intimate thoughts, anxieties and desires with even the closest of friends. Yet we have no similar qualms about dutifully recording each of our fleeting thoughts in the query field of a search engine. The AOL search data fiasco amply demonstrates just how much information can be gleaned about a person even from 'anonymous' search logs. True, the New York Times did most of its sleuthing the old-fashioned way, with reporters poring over the logs. But there's no reason to think the same degree of profiling can't be achieved in automated fashion, and applied to all search engine users, as data mining techniques mature. Nor are government agencies ignorant of what can be learned from search logs.

But even the privacy-conscious tend to balk at the thought of search-engine abstinence. There have been calls for search engines to limit what data they retain and how long they store it for. Such proposals go hand in hand with calls for new legislation and government oversight. As Eben has suggested, however, many privacy concerns can be alleviated by general adoption of freedom-enabling software. Can we rely on hacks to blunt search engine profiling?

When discussing social networking sites and the privacy issues surrounding the service provider's ability to monitor which profiles each user spends time browsing, Eben has advanced wall-warts as a potential solution: small, cheap, network-connected Linux servers that can host the owner's social networking profile (and perhaps provide back-up hosting for the profiles of his friends). It is easy to imagine how wall-warts could replace Facebook. A personal server, always on and accessible from everywhere, could host your email or your documents, removing the need for third-party services like Gmail or Google Docs. Once the appropriate wall-wart software is written, many online privacy concerns could disappear without any need for legislative solutions.

But not search engine surveillance. A web search engine requires significant hardware investments: servers to constantly index web pages, store the results, and scan through the abstract web map produced to return relevant search results. Google reportedly maintains at least half a million servers dedicated to these tasks. Since no one has figured out an adequate way to do the indexing and searching without a central server, privacy hacks have focused on enabling users to access the indexes created by companies like Google and Yahoo while revealing as little information to the search provider as possible.

One approach is to hide true searches amongst a cloud of ghost queries. This is the approach attempted by the TrackMeNot Firefox plugin, which periodically sends randomized search queries to popular search engines like AOL, Yahoo!, Google, and MSN, hoping to obfuscate a user's real searches with background noise. A nice idea in theory, but not so practical if the random search noise is easy to filter out. Because TrackMeNot is open-source, concerned search providers can examine its noise-generating algorithms, making it easier to identify features shared by the fake queries it generates. If fake queries can be categorized, search engines can sort the wheat from the chaff, or fight back by blocking access to users of the plug-in. This is not to say that the approach is entirely without merit; newer versions of the TrackMeNot plugin have implemented increasingly sophisticated techniques geared towards making fake queries look more like the real thing. As in the realm of cryptology, understanding the algorithm won't improve the chances of defeating it if the searches it generates are indistinguishable from real user searches.
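
To make the weakness concrete, here is a minimal Python sketch of the ghost-query idea. It is not TrackMeNot's actual algorithm: the function names, the tiny decoy pool, and the sample queries are all hypothetical and chosen only for illustration. It hides one real query among randomly chosen decoys, then shows how a search provider that knows the decoy source can strip the noise back out.

```python
import random

# Hypothetical decoy pool (an assumption for this sketch); a real tool would
# need a far larger and constantly changing source of plausible queries.
DECOY_POOL = [
    "weather forecast",
    "pasta recipes",
    "train timetable",
    "used bicycles",
    "movie reviews",
]


def build_query_batch(real_query, decoy_pool, n_decoys, rng=None):
    """Return the real query hidden among n_decoys decoys, in random order."""
    rng = rng or random.Random()
    if n_decoys > len(decoy_pool):
        raise ValueError("not enough decoys in the pool")
    batch = rng.sample(decoy_pool, n_decoys) + [real_query]
    rng.shuffle(batch)
    return batch


def filter_known_decoys(batch, known_decoys):
    """What a provider can do once it knows which queries are noise."""
    known = set(known_decoys)
    return [query for query in batch if query not in known]


batch = build_query_batch(
    "symptoms of insomnia", DECOY_POOL, n_decoys=3, rng=random.Random(0)
)
recovered = filter_known_decoys(batch, DECOY_POOL)
print(batch)
print(recovered)
```

In this toy setting the provider recovers the real query exactly, because the decoys come from a fixed list it can read. The obfuscation only holds to the extent that generated queries cannot be told apart from genuine ones.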

Scroogle exemplifies another common approach, which involves anonymizing search queries by routing them through a portal used by a number of other users. Since the search engine sees all the queries as coming from the proxy, it cannot use the originating computer's IP address as a unique identifier; it cannot categorize a series of searches as the thoughts of any particular person. The problem? One must trust the proxy not to keep its own logs, for one. And even if a trustworthy proxy exists (say, a website based in a country with laws severely limiting data retention), search engines can simply block requests from that proxy, once it is discovered. Unlike the game of whack-a-mole between the content industry and peer-to-peer file sharing services, search engines can block anonymizing proxies like Scroogle faster than new ones can gain popularity, since they do not need any judicial imprimatur to engage in effective self-help.
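
The effect of a shared proxy on the search engine's logs, and the ease of blocking it, can be sketched in a few lines of Python. This is a toy simulation rather than a description of how Scroogle works: the addresses come from the reserved documentation ranges, and every name and query below is a hypothetical chosen for illustration.

```python
from collections import defaultdict

# Hypothetical proxy address, taken from a documentation-only IP range.
PROXY_IP = "203.0.113.10"


def via_proxy(client_ip, query):
    """What the search engine receives: the proxy's address, not the client's."""
    return (PROXY_IP, query)


def group_by_source(log):
    """Build per-address 'profiles' the way a log analyst might."""
    profiles = defaultdict(list)
    for source_ip, query in log:
        profiles[source_ip].append(query)
    return dict(profiles)


def is_accepted(entry, blocklist):
    """A search engine can refuse any request from a blocked address."""
    source_ip, _query = entry
    return source_ip not in blocklist


# Hypothetical (client address, query) pairs from two different users.
requests = [
    ("198.51.100.1", "divorce lawyer"),
    ("198.51.100.2", "cheap flights"),
    ("198.51.100.1", "apartment rentals"),
]

direct_profiles = group_by_source(requests)
proxied_log = [via_proxy(ip, query) for ip, query in requests]
proxied_profiles = group_by_source(proxied_log)
accepted = [entry for entry in proxied_log if is_accepted(entry, {PROXY_IP})]

print(direct_profiles)
print(proxied_profiles)
print(accepted)
```

Queried directly, the two users produce two separable profiles. Through the proxy, the engine sees a single undifferentiated stream, but one blocklist entry is enough to reject all of it.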

Tor is still the gold standard in terms of online anonymity, but the exit nodes of the Tor network can also be identified and blocked by search providers if few volunteers are willing to run relays. Many potential relay operators are dissuaded by the possibility of incurring legal liability for abetting criminal conduct by other users of the Tor network. Even if no liability exists, relay operators may still come under investigation by law enforcement, which can be a burden in itself.

Perhaps the solution lies in combining the Tor and TrackMeNot approaches. A Firefox plugin could route the search requests of other plugin users, so that queries originating from any particular address would represent the thoughts of many actual users. Since all queries would be user-generated, the plugin would be very difficult to detect. And because only search queries would be routed, there would be no danger of abetting anonymous copyright infringement, defamation, or trafficking in child pornography. More people should be willing to run a highly limited search-anonymization plugin than a full-fledged Tor relay. Still, what if one bad apple uses the plugin to make incriminating queries? If law enforcement has access to search engine logs and uses them as a means to narrow down the list of suspects in a given crime, innocent people may come under investigation simply for running a plugin meant to preserve their privacy. The threat of that may be enough to dissuade many from employing such a plugin.
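
A rough Python simulation of the relaying idea follows. It is a sketch under stated assumptions, not a design for the plugin: peer discovery, transport, and returning results to the original author are all ignored, and the function name, user names, and queries are hypothetical. Each query leaves the network through a randomly chosen peer other than its author, so the log kept against any one address mixes the searches of other people.

```python
import random
from collections import defaultdict


def relay_queries(queries_by_user, rng):
    """Send each query out through a randomly chosen peer other than its author.

    Returns the search engine's view: exit peer -> queries seen from that peer.
    """
    users = list(queries_by_user)
    if len(users) < 2:
        raise ValueError("need at least two peers to relay")
    seen_by_engine = defaultdict(list)
    for author, queries in queries_by_user.items():
        other_peers = [user for user in users if user != author]
        for query in queries:
            seen_by_engine[rng.choice(other_peers)].append(query)
    return dict(seen_by_engine)


# Hypothetical users and queries, for illustration only.
queries_by_user = {
    "alice": ["bankruptcy filing", "night shift jobs"],
    "bob": ["hiking boots"],
    "carol": ["migraine triggers"],
}

seen = relay_queries(queries_by_user, random.Random(1))
print(seen)
```

The same mixing that protects each author is what creates the bad-apple problem: whatever one user searches for is attributed, in the engine's logs, to whichever peer happened to relay it.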

This suggests that FOSS alone probably cannot solve the problem. At the very least, what is needed are restrictions on the circumstances under which search logs are subject to subpoena. Law enforcement should not be allowed to go on fishing expeditions through the records of everyone's thoughts. A subpoena for the search history of a particular IP address should require preexisting evidence reasonably linking that IP address to illegal behavior.

-- AndreiVoinigescu - 17 May 2009

If we had wallwart servers, wouldn't this be pretty easy? It would be simple to maintain some Tor-like routing daemon that could run in the wallwart and reroute queries semi-anonymously. I also like the idea of having it only run for queries.

The real question I had when reading this was whether people would accept even this as a solution... it would be popular among a particular privacy-valuing subset, but it seems to me as though the average person may actually value Google's "value adding" services enough to eschew the privacy filter and purposefully give them their information.

-- TheodoreSmith - 20 May 2009

It looks like I was beaten to the punch a bit by Ted's comment. You propose some interesting ideas here, Andrei, but even assuming you can get people to see that there is a problem, I, too, wonder if these are solutions people can accept. In my own paper, I argued that part of the reason people don't do the things you propose is that they spurn freedom itself, but I suspect laziness and technological ineptness are also partly to blame. I'll throw myself to the fire by saying that while I have AdBlock and TrackMeNot (because they were easy to install and worked in a framework I already understand), really, the only way I will get a wall wart server (might I also embarrassingly contend that this name is, well, sort of distasteful to those of us non-tech people that need to be seduced by it?) is if Justin or Ted a) came to my house with the wallwart, b) installed the wallwart, and c) agreed to maintain the wallwart for me, forever. In some sense, this is just an example of the spurning of freedom I discuss in my paper; I could certainly learn to use this technology (there is no reason I need others to do it for me), but despite my strong feelings about these problems, I have yet to take some of these steps because they seem out of (easy) technological reach. I doubt that making the technology easier or integrating it into familiar contexts will have any effect on the underlying problem of rejecting the burdens of freedom, but it may at least cause people to seriously consider the questions, rather than rejecting them outright on ease or ability grounds.

-- DanaDelger - 20 May 2009

Dana, I could perhaps have made the point a little clearer in the paper, but what I'm proposing wouldn't require a wall-wart or anything of the kind. It could be implemented using only software, and the end user would not be required to engage in any configuration or maintenance beyond that involved in setting up AdBlock, TrackMeNot or any other Firefox plugin.

It may indeed be the case that people will reject even easy-to-use privacy enabling software in favor of the convenience offered by Google's "value adding" services, of course. I take it that the loss of those services is the burden of freedom Dana refers to. A really successful TrackMeNot plugin could put a big dent in Google's revenue. If this leads to a choice between search with voluntary surrender of privacy and no search at all (because no company can turn search into a successful business model), then I suspect most will voluntarily surrender their privacy.

But that doesn't have to be the only choice. A tool for indexing and searching for networked content is the kind of public good that government should provide and support using tax revenue, if necessary.

-- AndreiVoinigescu - 21 May 2009

While a Firefox plug-in that generates excess searches, based on other user-made searches, would indeed serve to reduce Google revenues ... what about the other 4/5 of the web-using populace that does not have Firefox?

-- JonathanBonilla - 23 May 2009

Jonathan, that's a problem I've been thinking about, but it probably merits its own paper. It seems like the FOSS projects that have enjoyed the widest adoption among non-technical users (Firefox, Ubuntu, etc.) are also the ones with significant corporate backers that can pay for marketing or invest in pretty/easy-to-use interfaces that can capture the general public.

Like any patronage system though, this setup gives the patron considerable influence in the strategy of the project. Google's donations, for instance, are probably among the reasons why AdBlock is not bundled in Firefox. Perhaps some sort of government fund to support FOSS projects (especially infrastructure ones like operating systems and browsers) would help.

-- AndreiVoinigescu - 24 May 2009

  • In my view, the wall-wart was a distraction for you. You want to call upon a simple proposition from Philippe Aigrain's proposed taxonomy of services: there are those that can be forced to the edge through decomposition and those that cannot be downscaled, like search. Once you've done that, the server side is no longer of interest: you are talking about a question which is actually a species in the genus, "How can client software, by spoofing or otherwise, inhibit or devalue the surveillance conducted by servers?"

  • I agree with Dana that the wall-wart personal server you aren't really writing about won't take hold if its form factor is also its name. I think you made a mistake there. This is a personal server appliance that you buy and bring home, which hooks itself up automatically and smartly, lets you configure it through a wireless web interface that just appears in your apartment or house when you plug it in, allows you to import your existing social network profiles, email, etc., learns and protects your passwords, recognizes your personal authentication tokens etc., and then goes live. It becomes the safely stored system that feeds the world your information: tells your status, sends out feeds of your activities, manages your web presences including photos and video, gets your email, and so on. It arranges for safe, encrypted backup to the network, either destroys the logs of your conduct and your friends' activities there or puts them somewhere safe, and can move with you no more complexly than you would move your stereo to another apartment. If it breaks, you just plug in another one and do basic configuration, and it downloads the rest and becomes your identity manager again. It's your face to the network. Why not call it, for example, your Facebox?

  • Your suggestion that people proxy their interactions with search engines looks eminently feasible to me, but I probably am satisfied with a more rudimentary spoofing client. Even if the search engine is trying to normalize my stream, and I doubt they will bother unless events force them to pay far more computational attention to my stream than it can possibly be worth, their incentives remain all on the side of counting all my searches as valid, because they must perform the searches submitted and can only benefit by also serving the related advertisements. Analytically they are better off letting me masquerade as whatever my spoofer tries to make me than spending money to distinguish the actual me who consumes many fewer ads.

  • That argument implies that we may be able to satisfy ourselves with a slightly more comprehensive but essentially simple spoofing platform resembling TMN. You might want to give a little more analytical, game-theoretic attention to the relationship between a search engine selling advertising and a user spoofing a large number of excess searches to hide her own.

  • Jonathan's question should not have been answered by talking about how free software solutions become popular. He was really asking a question that applies also to Chrome and Opera: how can anyone achieve privacy in the use of the Web while operating an unfree browser, or at any rate a browser you as its operator can't absolutely trust? The answer is that the user is at the mercy of the untrusted browser, and that no browser can be trusted whose source code can't be read. People have begun to understand this about voting machines, but other technology that controls their lives is still given a pass.




r8 - 05 Jan 2010 - 22:31:33 - IanSullivan
All material on this collaboration platform is the property of the contributing authors.
All material marked as authored by Eben Moglen is available under the license terms CC-BY-SA version 4.