Technical Analysis
How Online Tracking Companies Know Most of What You Do Online (and What Social Networks Are Doing to Help Them)
Technical Analysis by Peter Eckersley
3rd party advertising and tracking firms are ubiquitous on the modern web. When you visit a webpage, there's a good chance that it contains tiny images or invisible JavaScript that exists for the sole purpose of tracking and recording your browsing habits. This sort of tracking is performed by many dozens of different firms. In this post, we're going to look at how this tracking occurs, and how it is being combined with data from accounts on social networking sites to build extensive, identified profiles of your online activity.
How 3rd parties get to see what you do on the web.
Let's start with an example of 3rd party tracking: when we went to CareerBuilder.com, which is the largest online jobs site in the United States, and searched for a job, CareerBuilder included JavaScript code from 10 (!) different tracking domains: Rubicon Project, AdSonar, Advertising.com, Tacoda.net (all three are divisions of AOL advertising), Quantcast, Pulse 360, Undertone, AdBureau (part of Microsoft Advertising), Traffic Marketplace, and DoubleClick (which is owned by Google). On other visits we've also seen CareerBuilder include tracking scripts and non-JavaScript web bugs from several other domains. There are pretty sound reasons to hope that when you search for a job online, that fact isn't broadcast to dozens of companies you've never heard of — but that's precisely what's happening here.

(in this screenshot, NoScript is being used to identify the third parties whose code is embedded in the page)
Each of these tracking companies can track you over multiple different websites, effectively following you as you browse the web. They use either cookies, or hard-to-delete "super cookies", or other means, to link their records of each new page they see you visit to their records of all the pages you've visited in the previous minutes, months and years. The widespread presence of 3rd party web bugs and tracking scripts on a large proportion of the sites on the Web means that these companies can build up a long term profile of most of the things we do with our web browsers.
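The linking step described above can be sketched in a few lines. This is a minimal simulation, not any real tracker's code: the `Tracker` and `Browser` classes, the `uid-N` cookie format, and the `.example` URLs are all assumptions made for illustration.

```python
import itertools
from collections import defaultdict


class Tracker:
    """A hypothetical third-party tracker embedded on many sites."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.history = defaultdict(list)  # cookie ID -> pages seen

    def handle_request(self, cookie, referer):
        # First contact: the browser has no cookie yet, so mint an identifier.
        if cookie is None:
            cookie = f"uid-{next(self._ids)}"
        # The Referer header names the page that embedded the tracker.
        self.history[cookie].append(referer)
        return cookie  # returned to the browser as a Set-Cookie value


class Browser:
    def __init__(self):
        self.cookie_jar = {}  # tracker domain -> cookie value

    def visit(self, page_url, tracker_domain, tracker):
        # Loading the page also loads the embedded tracker resource.
        cookie = self.cookie_jar.get(tracker_domain)
        self.cookie_jar[tracker_domain] = tracker.handle_request(cookie, page_url)


tracker = Tracker()
browser = Browser()
for page in ["http://jobs.example/search?q=nurse",
             "http://news.example/politics",
             "http://health.example/conditions"]:
    browser.visit(page, "tracker.example", tracker)

# One cookie ID now ties together visits to three unrelated sites.
profile = tracker.history[browser.cookie_jar["tracker.example"]]
print(profile)
```

The point of the sketch is that the sites never talk to each other; the browser itself carries the identifier from one site to the next.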
They can track us, but do they know who we are?
Given how much tracking firms know about our browsing history, it's worth asking whether these companies also know who we are. The answer, unfortunately, appears to be "yes", at least for those of us who use social networking sites. A recent research paper by Balachander Krishnamurthy and Craig Wills shows that social networking sites like Facebook, LinkedIn and MySpace are giving the hungry cloud of tracking companies an easy way to add your name, lists of friends, and other profile information to the records they already keep on you.
The main theme of the paper is that when you log in to a social networking site, the social network includes advertising and tracking code in such a way that the 3rd party can see which account on the social network is yours. They can then just go to your profile page, record its contents, and add them to their file. Of the 12 social networks surveyed in the paper, only one (Orkut) didn't leak any personally identifying information to 3rd parties.
There are some interesting technical details in how the social networking sites leak this data. In some cases, the leakage may be unintentional, but in others, there is clever and surreptitious anti-privacy engineering at work.
Paths for Data Leakage from Social Networks to 3rd party Tracking Firms
The most obvious way that a 3rd party tracker might learn which account on a social networking site is yours is via the HTTP Referrer header. A typical URL on a social networking site includes a username or user ID number, and any 3rd party will be able to see that.1 A second and slightly more revealing method that some social networks use to leak personal information is through URL/URI parameters for the 3rd party content. Here's an anonymized example from the paper:
GET /track/?...&fb_sig_time=1236041837.3573&
fb_sig_user=123456789&...
Host: adtracker.socialmedia.com
Referer: http://apps.facebook.com/kick_ass/...
(In this request, a Facebook app is sending the user's Facebook user ID and sign-in time to adtracker.socialmedia.com)
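To see how little work the receiving server has to do, here is a minimal Python sketch that pulls the identifiers out of a request target like the one above. Only the two parameters visible in the example are used; the elided parts of the original request are left out, and the function name `extract_identifiers` is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Only the parameters visible in the example; the elided parts are omitted.
request_target = "/track/?fb_sig_time=1236041837.3573&fb_sig_user=123456789"


def extract_identifiers(target):
    """Return the query parameters a third party receives with this request."""
    query = urlsplit(target).query
    # parse_qs maps each name to a list of values; keep the first of each.
    return {name: values[0] for name, values in parse_qs(query).items()}


params = extract_identifiers(request_target)
print(params["fb_sig_user"])
```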
The third and most surprising method for leaking personal information is to alias 3rd party tracking servers into the host site's domain name in such a way that the 3rd party can see the host site's cookies, in violation of the same origin policy. Here's an example from the paper:
GET /st?ad_type=iframe&age=29&gender=M&e=&zip=11301&...
Host: ad.hi5.com
Referer: http://www.hi5.com/friend/profile/displaySameProfile.do?userid=123456789
Cookie: LoginInfo=M_AD_MI_MS|US_0_11301; Userid=123456789;Email=jdoe@email.com;
(ad.hi5.com is actually ad.yieldmanager.com, and it's receiving different bits of personal information via referrer, URI parameters, and the hi5.com cookie which the same origin policy wouldn't have allowed it to have, so it's an example of all three leakage methods)
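The aliasing trick works because browsers decide where to send a cookie by comparing hostnames, not by checking who operates the server behind the name. The sketch below is a simplified form of the cookie domain-matching rule from RFC 6265 (it ignores IP-address hosts), and it assumes the login cookie is scoped to the whole hi5.com domain, which the example request suggests but the source does not state outright.

```python
def domain_match(host, cookie_domain):
    """Simplified cookie domain matching (after RFC 6265, section 5.1.3).

    IP-address hosts are not handled in this sketch.
    """
    host = host.lower()
    cookie_domain = cookie_domain.lower().lstrip(".")
    return host == cookie_domain or host.endswith("." + cookie_domain)


# Assumption: the site's login cookie is set for the whole hi5.com domain.
cookie_domain = "hi5.com"
for host in ["www.hi5.com", "ad.hi5.com", "ad.yieldmanager.com"]:
    print(host, domain_match(host, cookie_domain))
```

Under that assumption the cookie is sent to ad.hi5.com but would not be sent to ad.yieldmanager.com, even though both names lead to the same ad server.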
What can I do to protect myself?
Unfortunately, there is no easy way to use modern, cookie- and JavaScript-dependent websites and social networking sites and avoid tracking at the same time. In order to be substantially protected against these tracking mechanisms, you'd need to do the following:
- Pick a good cookie policy for your browser, like "only keep cookies until I close my browser", or manual approval of all cookies.
- Disable Flash Cookies and all the other kinds of "super cookies". You can test for these here.
- Use the Firefox extensions RequestPolicy and NoScript to control when 3rd party sites can include content in your pages or run code in your browser, respectively. These tools are very effective, but be aware that they're hard to use: lots of sites that depend on JavaScript will need to be whitelisted before they work correctly.
- Use the Targeted Advertising Cookie Opt-Out plugin. This will automatically opt you out of any 3rd party trackers who have an opt out somewhere that requires you to accept a cookie. Be aware that not all 3rd parties will offer opt outs, or that some of them may interpret "opt out" to mean "do not show me targeted ads", rather than "do not track my behavior online".
- As always, it doesn't hurt to use Tor via TorButton to hide your IP address and other browser characteristics when you want maximal browser privacy.
Unfortunately, many of the steps above are quite difficult to follow, and we're fearful that the vast majority of Internet users will continue to be tracked by dozens of companies — companies they've never heard of, companies they have no relationship with, companies they would never choose to trust with their most private thoughts and reading habits.
It isn't going to be easy to fix this mess. On the technical side, all of this tracking follows from the design of the Web as an interactive hypertext system, combined with the fact that so many websites are willing to assist advertisers in tracking their visitors. Browsers could be altered to make them harder to track, but great care and clever design will be required to achieve that without undermining the virtues of interactive hypertext in the first place. It's not clear that anyone has found the right way to do that yet.
On the legal side, it's clear that the current U.S. privacy regime isn't working: behavioral tracking companies can put whatever they want in the fine print of their privacy policies, and few of the visitors to CareerBuilder or any other website will ever realize that the trackers are there, let alone read their policies. It's time we found legal rules to ensure that people actually know when their privacy is part of the price they pay to visit a site.
- 1. One subtlety here is that sometimes the 3rd party won't be able to tell whether a profile is yours or belongs to someone else. But there are several ways around that: they can look for URLs associated with profile editing or other activities that your friends can't do to your profile; they can see which profile you visit first when you log in to the site; and they can see which profile you visit most often over time.
New Cookie Technologies: Harder to See and Remove, Widely Used to Track You
Technical Analysis by Seth Schoen
This is part 1 of a three-part series on user tracking on the web today. You can read Part 2 here.
Cookies are still a privacy problem for web users, many years after privacy advocates first raised concerns about their use to track web browsing. Today, cookies are one of the main mechanisms that advertising companies like Google use to track and profile users across sites and over time -- often building up a single gigantic profile for years and years. Many EFF members respond to this threat by using their browsers' cookie management features to limit which cookies they'll accept or how long they'll be retained.
But it turns out that the cookie situation is quite a bit trickier today, and sites that want to track users have new technical options that are hard for users to respond to. The traditional "cookie" is an HTTP cookie, invented by Lou Montulli and John Giannandrea at Netscape in 1994. But today many browsers implement a range of things with the same kind of cookie-like tracking behavior -- mechanisms that are far less familiar, harder to notice, and often harder to control.
A great overview of the wide range of cookie technologies confronting us today is Cleaning Up After Cookies, an article published last year by Katherine McKinley at iSEC Partners. McKinley describes five cookie-like tracking methods that go beyond traditional HTTP cookies, and explains how browsers often fail to let users exercise meaningful control over these varieties of tracking.
The most prominent of these tracking methods is the so-called "Flash cookie", a kind of cookie maintained by the Adobe Flash plug-in on behalf of Flash applications embedded in web pages.1 These cookie files are stored outside of the browser's control. Web browsers do not directly allow users to view or delete the cookies stored by a Flash application, users are not notified when such cookies are set, and these cookies never expire. Flash cookies can track users in all the ways traditional HTTP cookies do, and they can be stored or retrieved whenever a user accesses a page containing a Flash application. Some of the problems are highlighted by Rob Savoye, the developer of Gnash, an open source Flash implementation.
Last month, a group of researchers at UC Berkeley led by Ashkan Soltani released a study, Flash Cookies and Privacy, about this technology and the ways it's being used to track Internet users today. The study found that Flash cookies are extensively used by popular sites, and that most users probably don't know about them or how to delete them. They also found that at least one major site uses them in a way that violates the advertising industry's own rules on tracking.
What's more, the Berkeley researchers found that Flash cookies are often used to deliberately circumvent users' HTTP cookie policies. That is, a site may intentionally store the same information redundantly in both HTTP cookie and Flash cookie forms. When a user deletes the HTTP cookie, the site may "respawn" it from the copy that was stored as a Flash cookie! It seems clear that site operators know many users don't want to be tracked with cookies, but have found a way of circumventing those users' privacy preferences.
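The respawning logic is simple enough to sketch. The following Python simulation stands in for the real thing: two dictionaries play the roles of the browser's HTTP cookie store and the Flash plug-in's storage, and the `Site` class and `uid` key are hypothetical names, not taken from the Berkeley study.

```python
import uuid


class Site:
    """Hypothetical site that mirrors its tracking ID in two stores."""

    def identify(self, http_cookies, flash_storage):
        # Use whichever copy of the ID survives; mint one only if both are gone.
        uid = http_cookies.get("uid") or flash_storage.get("uid") or uuid.uuid4().hex
        # Write the ID back to both stores, "respawning" any deleted copy.
        http_cookies["uid"] = uid
        flash_storage["uid"] = uid
        return uid


site = Site()
http_cookies, flash_storage = {}, {}

first = site.identify(http_cookies, flash_storage)
http_cookies.clear()  # the user deletes their HTTP cookies
second = site.identify(http_cookies, flash_storage)

print(first == second)  # True: the same identifier came back
```

Clearing only one of the two stores never resets the identifier, which is why a user's HTTP cookie policy alone does not help here.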
These privacy-invasive marketing practices need greater scrutiny. We need more research to reveal whether the other kinds of cookies McKinley described are also being used to track users, as Soltani and his collaborators showed that Flash cookies are. It's entirely possible that Flash cookies will turn out to be just the tip of the next-generation user tracking iceberg.
Meanwhile, browser developers should do more to let users understand and control how they're being tracked -- using any of these techniques. Unfortunately, Adobe has made that extremely difficult with regard to Flash cookies, since they're stored outside of the browser's control, and since the official Flash plug-in isn't open source, users can't easily fix this for themselves. The BetterPrivacy Firefox plugin tries to address this by finding Flash cookies on the hard drive and regularly deleting them, but Adobe could help by ensuring their cookie system follows the browser's privacy settings.2
Clearly, there's a lot of work to be done to bring these next-generation cookies even to the same level of visibility and control that users experience with regular HTTP cookies.
- 1. Adobe refers to Flash cookies as Local Shared Objects. Aside from Flash cookies, the other kinds of cookie-like objects McKinley identifies are HTML 5 DOM storage, Microsoft Silverlight cookies, Microsoft Internet Explorer User Data Persistence, and Google Gears data.
- 2. Adobe currently provides an interface to manage Flash cookies, but most users are unaware of it, it's not integrated with browser cookie policies at all, and it seems to focus as much on the disk space Flash cookies can take up as on their privacy implications. It also doesn't provide some of the kinds of control over cookies that regular browsers and browser plugins can.
What Information is "Personally Identifiable"?
Technical Analysis by Seth Schoen
Mr. X lives in ZIP code 02138 and was born July 31, 1945.
These facts about him were included in an anonymized medical record released to the public. Sounds like Mr. X is pretty anonymous, right?
Not if you're Latanya Sweeney, a Carnegie Mellon University computer science professor who showed in 1997 that this information was enough to pin down Mr. X's more familiar identity -- William Weld, the governor of Massachusetts throughout the 1990s.
Gender, ZIP code, and birth date feel anonymous, but Prof. Sweeney was able to identify Governor Weld through them for two reasons. First, each of these facts about an individual (or other kinds of facts we might not usually think of as identifying) independently narrows down the population, so much so that the combination of (gender, ZIP code, birthdate) was unique for about 87% of the U.S. population. If you live in the United States, there's an 87% chance that you don't share all three of these attributes with any other U.S. resident. Second, there may be particular data sources available (Sweeney used a Massachusetts voter registration database) that let people do searches to bootstrap what they know about someone in order to learn more -- including traditional identifiers like name and address. In a very concrete sense, "anonymized" or "merely demographic" information about people may be neither. (And a web site that asks "anonymous" users for seemingly trivial information about themselves may be able to use that information to make a unique profile for an individual, or even look up that individual in other databases.)
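The "each fact narrows the population" argument can be checked on any table of records by counting how many rows are unique under a given combination of columns. The sketch below uses five made-up records (not real data) and a hypothetical helper, `unique_fraction`; it illustrates the counting method, not Sweeney's actual analysis.

```python
from collections import Counter

# Hypothetical records: (gender, ZIP code, birth date). Not real data.
records = [
    ("M", "02138", "1945-07-31"),
    ("F", "02138", "1945-07-31"),
    ("M", "02139", "1962-03-14"),
    ("M", "02139", "1962-03-14"),
    ("F", "02139", "1980-11-02"),
]


def unique_fraction(rows, columns):
    """Fraction of rows whose values in `columns` appear exactly once."""
    keys = [tuple(row[i] for i in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


print(unique_fraction(records, [0]))        # gender alone: 0.0
print(unique_fraction(records, [1]))        # ZIP code alone: 0.0
print(unique_fraction(records, [0, 1, 2]))  # all three together: 0.6
```

No single column identifies anyone in this toy table, yet the combination singles out three of the five records.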
Many contemporary privacy rules and debates center on the notion of "personally identifiable information" (PII). The PII concept is used by several legal regimes and many organizations' privacy policies; generally, information that identifies a particular person is considered much more sensitive than information that does not. For instance,
- Federal telecommunications privacy laws use "individually identifiable information" (about a subscriber) as a basis for the category of protected information called Customer Proprietary Network Information (CPNI);
- Federal health privacy regulations use "individually identifiable health information" (about a patient) as a basis for the category called Protected Health Information (PHI);
- Federal financial privacy laws, the EU Data Protection Directive, and state privacy laws all employ similar terms and concepts;
and, in each case, facts deemed "personally identifiable" or "individually identifiable" may receive dramatically higher protections under these laws and regulations.
But research by Prof. Sweeney and other experts has demonstrated that surprisingly many facts, including those that seem quite innocuous, neutral, or "common", could potentially identify an individual. Privacy law, mainly clinging to a traditional intuitive notion of identifiability, has largely not kept up with the technical reality.
A recent paper by Paul Ohm, "Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization", provides a thorough introduction and a useful perspective on this issue. Prof. Ohm's paper is important reading for anyone interested in personal privacy, because it shows how deanonymization results achieved by researchers like Latanya Sweeney and Arvind Narayanan seriously undermine traditional privacy assumptions. In particular, the binary distinction between "personally-identifiable information" and "non-personally-identifiable information" is increasingly difficult to sustain. Our intuition that certain information is "anonymous" is often wrong. Given the proper circumstances and insight, almost any kind of information might tend to identify an individual; information about people is more identifying than has been assumed, and in the long run the whole enterprise of classifying facts as "PII" or "not PII" is questionable.
Statistical inference and clever use of databases have resulted in impressive examples of deanonymization of supposedly anonymous data, the kinds of data that most organizations have not regarded as PII. Apart from combinations of demographic data, some of the sorts of things that may well uniquely identify you include your search terms; your purchase habits; your preferences or opinions about music, books, or movies; and even the structure of your social networks -- in a purely abstract sense, even when shorn of the identities of your friends and contacts. Deanonymization is effective, and it's dramatically easier than our intuitions suggest. Given the number of variables that potentially distinguish us, we are much more different from each other than we expect, and there are more sources of data than we realize that may be used to narrow down exactly who a particular record refers to.
Many of these papers were meant as proofs of concept: they show that people can potentially be re-identified by these kinds of data, not that everyone will be. Not everyone's medical records were as easy to put a name to as Governor Weld's. And Narayanan and Shmatikov's research definitively identified only two Netflix users from their movie ratings -- not every user whose ratings were published by Netflix. Still, many of these research results deliberately do not use all the data available about individuals because their goal is to show the effectiveness of mathematical techniques, not to violate individuals' privacy. Real-world attacks will use many more kinds of available information simultaneously to narrow in on people's identities. As Bruce Schneier has observed, such attacks only get better over time; they never get worse.
Ohm argues that it's more appropriate to think of identifiability as a continuum. The notion of "anonymized" or "sanitized" data is then problematic; researchers habitually share, or even publish, data sets which assign code numbers to individuals. There have already been conspicuous problems with this practice, like when AOL published "anonymized" search logs which turned out to identify some individuals from the content of their search terms alone.
We hope "Broken Promises of Privacy" encourages people who work with personal data to think more critically about their retention and sharing practices and the effectiveness of the anonymization or pseudonymization techniques they're using. We also hope it finds a broad audience and helps start a wider discussion among researchers, technologists, and lawyers about what "privacy protection" should mean in the era of deanonymization.
The UK's Surveillance Society: Half A Million Intercepts of Communications Data in 2008
Technical Analysis by Danny O'Brien
This week, the United Kingdom's Interception of Communications commissioner, Sir Paul Kennedy, announced his latest statistics for Britain's phone and email surveillance systems, to generally shocked responses by the British public. In 2008, law enforcement, local authorities and the secret services in that country demanded "communication data" (the "who, how, when and where", but not the actual content of messages) 504,073 times. That's 1,381 times a day; or one inquiry every year for every 78 people in the UK.
Sir Kennedy's report is, in many ways, all the public oversight these half a million requests get.
In the United Kingdom, there is no judicial review of these requests; law enforcement together with the Information Commission regulate their own regime, and are bound only to a government "code of conduct".
Communications data continues to be viewed by lawmakers as non-invasive and therefore not regarded as requiring strict regulation, despite the growing range of personal information that can now be revealed by a communications data intercept request. These orders can reveal lists of websites visited, email headers, name and address lookups, and, perhaps most controversially, the real-time location of a particular mobile telephone.
Such a breadth of readily available information makes these intercepts increasingly tempting for law enforcement; modern technology makes them far easier to capture and process en masse; and with no probable cause or other conditions on obtaining such data, these numbers will keep rising. To guard against the misuse of these invasive powers, we need more than just aggregate statistics presented at the end of the year. Across the world, these frequent invasions of privacy need full judicial oversight, one case at a time.
Several Facts about Google and HTTPS
Technical Analysis by Peter Eckersley
Three simple facts about Google and HTTPS:
One: as we posted last week, we're very pleased to hear that Google is trialling full HTTPS encryption of all Gmail pages.
Two: if Google's trials are successful, and the company does indeed make HTTPS encryption the default protocol for reading and writing Gmail messages, it will have taken a two-step lead on its competitors in the free webmail and social networking spaces. People use Yahoo! Mail, Hotmail, LiveJournal and Facebook for their private communications, but all of the private messages on those services travel over the network unprotected.1 MySpace doesn't even support HTTPS for passwords!
Three: webmail is one thing, but search is another. Sadly, it isn't possible to use Google's excellent search engine over HTTPS. If you attempt to visit google.com via https, you'll just be redirected back to unencrypted HTTP. If you try the same thing at Yahoo or Microsoft, you'll receive unhelpful error messages.
We've been privately urging Google to make their search service available by HTTPS for some time, but nothing has happened. Yahoo and Microsoft should of course do the same. At the moment, the only search engine that offers protection against eavesdropping is a metasearch site called Ixquick (they also have a truly excellent privacy policy). We hope that some day, the major search engines can catch up with Ixquick.
Those are three simple observations. If you're interested in some less-simple technical detail about what HTTPS actually does, why it's important, and what its limitations are, continue reading below the fold.
- 1. Yahoo! Mail is the least worst of these services, since it defaults to HTTPS login, but all of these services are severely lacking in security.
Last.fm and the Diabolical Power of Data Mining
Technical Analysis by Peter Eckersley
Recently, there was a minor scandal when TechCrunch accused Last.fm of turning over information (the identities of people listening to copies of a leaked U2 album) to the RIAA. Last.fm issued a scathing denial of these allegations, and it's good to hear that the site hasn't turned into a worldwide music surveillance system. Not on purpose, that is.
Last.fm's avowed innocence isn't quite the end of the story. The whole kerfuffle should remind us that websites that collect and republish seemingly innocuous facts about their users are often vulnerable to data mining. It doesn't matter whether you keep the users' names and addresses secret — the facts you publish about them may be sufficient to ensure that there is only one person on the whole wide web to whom those facts pertain.1
This isn't a problem that's unique to Last.fm in any way. Networked computer systems often leak secrets in unexpected ways, but Last.fm serves as a particularly clear example of why anonymity is hard to achieve.
More on this risk, and what to do about it, after the jump.
- 1. There are only 7 billion people on the planet, and only about a billion on the Internet. Every fact about a person (Are they male or female? Where do they live? Do they listen to Brian Eno?) slices that number down by a significant fraction. If you have enough facts about a person (33 bits of independent facts, it turns out, because log2(7,000,000,000) ≈ 32.7), you can determine who they are.
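The 33-bit figure is easy to reproduce. The short Python sketch below recomputes it and shows how a single fact translates into bits; the two example facts and their population shares are illustrative assumptions, not measurements.

```python
import math

world_population = 7_000_000_000

# Bits needed to single out one person among 7 billion.
bits_needed = math.log2(world_population)
print(round(bits_needed, 1))  # 32.7


def bits_from_fact(share_of_population):
    """Bits of identifying information in a fact true of this share of people."""
    return -math.log2(share_of_population)


# Illustrative shares, assumed for the example rather than measured.
facts = {
    "gender": 1 / 2,
    "lives in a city of 1 million": 1_000_000 / world_population,
}
total_bits = sum(bits_from_fact(share) for share in facts.values())
print(round(total_bits, 1), "of", math.ceil(bits_needed), "bits")
```

Bits only add up this way when the facts are independent of one another, which is the caveat the footnote's "independent facts" wording carries.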
Laboratories and Roadmaps for Network Testing
Technical Analysis by Peter Eckersley
Today, the New America Foundation, PlanetLab and Google announced the launch of the Measurement Lab project, an initiative to provide server resources for researchers interested in network neutrality and performance testing. This is good news for the community of academics and activists who are trying to map, measure and record the state of network management by ISPs as well as many other aspects of Internet performance.
The Measurement Lab is an alternative version of a pre-existing network called PlanetLab, which is run by a consortium headquartered at Princeton. Essentially, PlanetLab is a large network of computers that researchers can run experiments on. Until now, it has been hard to use PlanetLab for network neutrality tests because the system wasn't designed for it: all the code runs in virtual machines and might be starved of CPU time right in the middle of an attempt to take high-precision network latency measurements. M-Lab is a version of PlanetLab that is designed to ensure that when a test is running, it has near-exclusive use of a CPU core and network interface.
M-Lab is not a testing tool in and of itself; rather, it is a platform that will save researchers from having to deploy their own servers in order to run "active" network tests. Active tests are those in which the clients send synthetic traffic that is made up simply for the purposes of the test (M-Lab only works with active tests initiated by clients run on users' computers). M-Lab won't be useful for "passive" network tests which examine the way the network carries traffic that your computer was sending independently of the test. You can see a list of active and passive network testing tools here; EFF's Switzerland software is an example of the passive testing approach, although M-Lab will be useful if we add synthetic traffic generation features to Switzerland in the future.
M-Lab gets good marks for openness and privacy by design. The active-testing paradigm ensures that the network's servers will never receive real user traffic, which would need very high levels of privacy protection. Essentially, the servers may record traffic sent to their IP addresses, but the only software that will be sending such traffic will be clients that generate synthetic messages.
The code for the servers will be free/open sourced, and all of the experimental data it collects will be published. The most noteworthy disadvantage of the project is that use of M-Lab is currently limited to PlanetLab members, so testing projects that are not affiliated with research institutions will need to find academics to collaborate with if they want to use M-Lab servers.
And speaking of Switzerland: development on EFF's network testing project was slow in the past few months, but coding has started again, and we have a roadmap and release schedule for new Switzerland versions. We'll be posting to announce a new release (and reporting on some of the interesting network phenomena Switzerland has detected to date) in the next few weeks.
Comcast Unveils Its New Traffic Management Architecture
Technical Analysis by Peter Eckersley
Late on Friday night, Comcast filed an overview of its new traffic management arrangements with the FCC. This is the long term replacement for its controversial practice of using forged TCP Reset packets to limit the use of peer to peer protocols.
The new system appears to be a reasonable attempt at sharing limited bandwidth amongst groups of users. Unlike TCP RST spoofing, it doesn't explicitly discriminate against some applications, and it doesn't threaten protocol developers with interoperability problems and uncertainty about network behavior.
Comcast's objective here is still largely to prioritize non-P2P traffic above P2P traffic. But the criterion they use is the amount of data a cable modem sends during each 15 minute period, which is a much fairer rule than examining the traffic protocol. The way deprioritization works is simple: high priority machines get to send data, and if there is any transmission capacity left over, the low priority machines get a share of that.
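A rough sketch of this two-tier scheme, in Python, might look like the following. It is an illustration of the idea only, not Comcast's implementation: the `allocate` function, the threshold value, the modem names, and the choice to scale a group's demands proportionally when capacity runs short are all assumptions.

```python
def allocate(capacity, demands, recent_usage, heavy_threshold):
    """Share `capacity` among modems, serving recent heavy users last.

    demands and recent_usage map modem name -> amount of data.
    heavy_threshold is a hypothetical cutoff on usage in the last period.
    """
    heavy = {m for m in demands if recent_usage[m] > heavy_threshold}
    high_priority = [m for m in demands if m not in heavy]
    low_priority = [m for m in demands if m in heavy]

    grants = {}
    remaining = capacity
    # High-priority modems are served first; low priority gets what is left.
    for group in (high_priority, low_priority):
        wanted = sum(demands[m] for m in group)
        # Assumption: if the group wants more than remains, scale it down evenly.
        scale = 1.0 if wanted <= remaining else remaining / wanted
        for m in group:
            grants[m] = demands[m] * scale
        remaining -= wanted * scale
    return grants


grants = allocate(
    capacity=100,
    demands={"a": 40, "b": 30, "c": 60},
    recent_usage={"a": 10, "b": 20, "c": 500},
    heavy_threshold=100,
)
print(grants)  # {'a': 40.0, 'b': 30.0, 'c': 30.0}
```

Note that the rule looks only at how much each modem has sent recently, never at which protocol it was using.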
EFF is proud that our work helped to expose Comcast's misadventures in network management last year, and we're pleased to see Comcast returning to congestion management practices that are transparently disclosed and avoid protocol discrimination.
The new traffic management setup should not be confused with the 250 GB/month cap which Comcast announced last month; the two will exist side by side.
DRM for Streaming Music Dies a Quiet Death
Technical Analysis by Fred von Lohmann
Yet another nail has been driven into DRM's coffin, this time for streaming audio (PCPro has a nice overview of the state of DRM for digital music).
Two of the leading on-demand streaming music sites, iMeem and LaLa, are not using DRM on their audio streams, instead sending the music as MP3s dusted with a dash of obfuscation. This is significant because both sites have been licensed by all the major record labels -- the very same record labels that were just last year pushing Congress to require DRM on all noninteractive webcasts. So it looks like the RIAA companies have changed their minds, dropping DRM requirements for the on-demand streaming music services.
This should put an end to legislation to mandate DRM on noninteractive webcasters. After all, why should these webcasters be in a worse position than the free, on-demand music services like LaLa and iMeem?
This also undermines the argument that DRM for music is necessary for subscription services. If the major labels have given up DRM for free, ad-supported (correction: iMeem is ad-supported, LaLa is free for a first listen of a track, 10 cents for repeat listening), on-demand streaming services like LaLa and iMeem, there's no plausible reason to insist on DRM for paid subscription services like Rhapsody and Napster 2.0. After all, there's no reason to think that those who prefer commercial-free subscriptions like Rhapsody are more likely to "pirate" streams than those who prefer ad-supported services like LaLa and iMeem.
LaLa and iMeem each take slightly different approaches to streaming music. LaLa uses HTTP to download each requested song as an MP3 to your browser, but relies on aggressive "no-cache" headers and pre-expired date stamps to suggest that your browser not make a copy of the file on your hard drive. Using a packet sniffer to capture the entire HTTP session, however, easily reveals the complete MP3 embedded right after the HTTP headers.

iMeem also downloads and caches each requested song, but sends the MP3 as the audio track of a Flash Video file. This FLV file is typically saved (cached) on your hard drive as an obscurely named temporary file, which is overwritten when you request your next song (we mentioned iMeem's approach back in January, and it's essentially unchanged). Copy this temp file, however, and you can easily extract the audio track from the Flash video, saving it as a stand-alone MP3 file.

(The location of this TemporaryItems folder, and its equivalent on other operating systems, varies significantly depending on operating system and version. On some operating systems it's buried deep within the directory hierarchy, but it can be found automatically with standard tools.)
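The extraction step can be sketched in a few lines as well. The example below walks the tag structure of an FLV file and concatenates the payload of every audio tag whose sound format is MP3. It is a simplified illustration under stated assumptions: the input is a complete, well-formed FLV, and the function name is ours. It is not the tool we used, and existing media utilities handle many more cases.

```python
TAG_HEADER_SIZE = 11    # type (1) + data size (3) + timestamp (4) + stream id (3)
TAG_TYPE_AUDIO = 8
SOUND_FORMAT_MP3 = 2


def extract_mp3_from_flv(flv: bytes) -> bytes:
    """Return the concatenated MP3 frames carried in an FLV file's audio tags."""
    if flv[:3] != b"FLV":
        raise ValueError("not an FLV file")
    # Bytes 5-8 of the file header hold the header length (normally 9).
    data_offset = int.from_bytes(flv[5:9], "big")
    pos = data_offset + 4          # skip the first PreviousTagSize field
    audio = bytearray()
    while pos + TAG_HEADER_SIZE <= len(flv):
        tag_type = flv[pos] & 0x1F
        data_size = int.from_bytes(flv[pos + 1:pos + 4], "big")
        start = pos + TAG_HEADER_SIZE
        payload = flv[start:start + data_size]
        if len(payload) < data_size:
            break                  # truncated file: stop at the last full tag
        if tag_type == TAG_TYPE_AUDIO and data_size > 0:
            # The first payload byte holds flags; its top four bits are the codec.
            if payload[0] >> 4 == SOUND_FORMAT_MP3:
                audio += payload[1:]
        pos = start + data_size + 4  # skip the trailing PreviousTagSize field
    return bytes(audio)
```

Calling `extract_mp3_from_flv(open(path, "rb").read())` on a copy of the cached temporary file, where `path` is wherever your system put it, would return bytes that can be written out as a stand-alone MP3.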
While the light obfuscation used by iMeem and LaLa might create a "speed bump" of inconvenience for users who want to keep the MP3 files, it doesn't rise to the level of a "technical protection measure" protected by the DMCA. In short, this is yet another example of why there is no legitimate business case for DRM on music -- it doesn't prevent piracy and it's not necessary to enable "new business models" like subscription or ad-supported music. (Of course, as the movie industry has demonstrated, DRM can still be valuable for impeding competition and putting the brakes on disruptive innovation. But it's hard to see how the law should protect those goals.)
Embedded Video and Your Privacy
Technical Analysis by Seth Schoen
We've recently started embedding video from YouTube and elsewhere into Deeplinks and other areas of EFF.org. This posed a challenge: On one hand, embedded video is an important tool that we want to be able to use. But, on the other hand, embedded video has worrisome privacy implications that we thought we should do something about.
All embedded, in-line, or off-site content on the World Wide Web implies some privacy risk because of the way most web browsers work. Whenever you follow a link, or download an embedded or off-site resource, your browser sends a referer header (sic) that tells the web site what web page you came from. And whenever you load any document, your browser may send cookies that show whether you've visited the same site before, and that may even identify you directly. For instance, if you're logged into YouTube and you watch an embedded YouTube video on some other site, YouTube can still recognize you because your browser will still send a personalized YouTube cookie.
This means that loading an embedded video from within a blog could enable the video hosting site (and, in some cases, its advertising partners) to compile a history of which blog entries you were reading and when — even if you didn't try to play the video. When the video hosting site uses an <IFRAME> tag (an increasingly common technique), your browser will automatically load an entire web page from the hosting site; in the course of displaying that page, your browser might send several dozen cookies to several different entities including portal sites or advertising networks. (Even using software like a Flash blocker won't stop this from happening.)
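A short sketch shows why those two pieces of information are enough. Suppose a video host records, for each request for an embedded player, the time, the cookie value, and the referer. Grouping the records by cookie then yields a per-visitor reading history. The field names, cookie values, and URLs below are invented for illustration. We are not describing any particular company's logging.

```python
from collections import defaultdict


def build_histories(requests):
    """Group third-party request records into a per-cookie browsing history.

    Each record is a dict with hypothetical keys "time", "cookie" and
    "referer". Records lacking a cookie or a referer cannot be linked to
    a visitor or a page, so they are skipped.
    """
    histories = defaultdict(list)
    for req in requests:
        cookie = req.get("cookie")
        referer = req.get("referer")
        if cookie and referer:
            histories[cookie].append((req["time"], referer))
    # Sort each visitor's visits chronologically.
    return {cookie: sorted(visits) for cookie, visits in histories.items()}


# Invented log entries: one visitor loads two blog posts that embed a video.
sample_log = [
    {"time": "2008-01-22T10:05", "cookie": "visitor-42",
     "referer": "https://blog.example.org/post-b"},
    {"time": "2008-01-22T09:00", "cookie": "visitor-42",
     "referer": "https://blog.example.org/post-a"},
    {"time": "2008-01-22T09:30", "cookie": None,
     "referer": "https://blog.example.org/post-a"},
]
histories = build_histories(sample_log)
```

Note that nothing in this sketch depends on the visitor pressing play. The request for the embedded player is what generates the record.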
So, that's the challenge we faced: We want to embed video here in the Deeplinks blog because it's an important way of communicating with our readers. But we've also gone to great lengths to protect our visitors' privacy; we believe that when you visit EFF.org, nobody but you should know about it.
As a compromise, we've developed a script called MyTube to protect your privacy. When we embed a video using MyTube, Deeplinks readers will see only a thumbnail from the embedded video — hosted on EFF's own servers — in their web browsers. MyTube prevents the third-party-hosted video from being loaded until and unless the user clicks to play it.
To learn more and get the code, visit our MyTube homepage. You can see the script in action here and here.
This prevents YouTube.com (and other third-party video hosts) from knowing that you've been to EFF.org or are reading Deeplinks unless you specifically click to watch the video.
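The general click-to-load idea can be sketched as follows. This is not MyTube's actual code, which is available from the MyTube homepage. It is a simplified illustration with a hypothetical function name, paths, and markup. The page initially references only a locally hosted thumbnail, and the third-party player is created in the browser only after a click.

```python
from html import escape


def render_placeholder(thumbnail_path: str, embed_url: str, title: str) -> str:
    """Return HTML for a click-to-load video placeholder.

    The only resource fetched on page load is the thumbnail, assumed to be
    served from the site's own server. The third-party URL appears only as a
    link target and is loaded into an iframe when the reader clicks.
    """
    template = (
        '<a class="video-placeholder" href="{url}" '
        'onclick="var f=document.createElement(\'iframe\');'
        'f.src=this.href;this.parentNode.replaceChild(f,this);return false;">'
        '<img src="{thumb}" alt="{title}"></a>'
    )
    return template.format(
        url=escape(embed_url, quote=True),
        thumb=escape(thumbnail_path, quote=True),
        title=escape(title, quote=True),
    )


# Hypothetical values for illustration only.
html = render_placeholder(
    "/thumbs/example-video.jpg",
    "https://video.example.com/embed/abc123",
    "Example video",
)
```

Until the click, the reader's browser sends no request, and therefore no referer or cookies, to the video host.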
As the web gets smarter and more powerful, a broad range of exciting new tools for enabling collaboration and communication are emerging, of which embedded video is just one. As these capabilities grow, it's important to keep an eye on the unexpected privacy implications. Increasingly often, loading a website or even using a desktop application can send information to multiple third parties without the user's knowledge or consent. EFF encourages the web community to help us find ways to make these information leaks transparent and controllable for the average user.
Updated Jan 23 2008: We removed a line about EFF's site search which was no longer accurate, and added a link to the new MyTube Homepage.
