Freep this poll!

October 9th, 2008 by Grace Meng

Have you ever been asked to “Freep this poll”?

The word “freep” comes from the “Free Republic,” an online forum for conservatives whose members are regularly informed of online polls and told to go vote en masse.  Although they don’t necessarily admit to “cheating” the polls, they have been accused of clearing cookies or otherwise circumventing the systems set up to prevent one person from voting multiple times.

Conservatives aren’t the only ones “freeping,” though.  The term has migrated across the political spectrum, and readers of decidedly more liberal sites, like DailyKos, are regularly asked to freep a poll.  And right after a presidential debate is prime freeping time for everyone, as nearly every newspaper and cable news channel will set up online polls asking, “Who won?”

I think freeping is great.

Freeping makes obvious how ridiculously inaccurate online polls can be.  Der Spiegel, a German magazine, was shocked when 59% of the respondents to its 2004 online poll rated President Bush’s performance in office “excellent.”  It turned out the poll had been freeped.  When freeping skews results to the point that no one can believe them, well, that’s a blow for truth, not ideology.

But being an ever-so-optimistic sort of person, I think freeping also shows the potential of online polls, and online measures of public opinion in general, to be more accurate than they are today.  Online polls are popular, despite being obviously inaccurate, because they’re cheap and fun (for those who just can’t get enough of sharing their opinions).  Most of all, at least in theory, they can reach a much larger group of people than professional pollsters.

The problem is that this larger group, even before freepers get involved, is shaped by the website and the audience it tends to draw.  (And of course, the world of people online is already smaller than the world as a whole.)  It wasn’t surprising, nor particularly revealing, that the people who went to the conservative Drudge Report and voted in its poll rating the Palin-Biden VP debate overwhelmingly found that Palin had won.  But if liberal online politicos had freeped the poll, they could have made the poll more representative of our country’s mix of conservatives and liberals.  And vice versa.

My point is that freeping, as creepy as it seems, is one of those strategies that’s open to everyone, left, right, liberal, conservative, polka-dotted or striped.  Some people will always just enjoy freeping for the sake of messing up the system, to enjoy their power to clear cookies and skew polls, though as I stated above, that can easily go so far that no one believes the results.  But if freeping pushes people to participate in polls in forums where they normally wouldn’t be heard, well, that sounds kind of democratic.  Sure, we still have that problem with ensuring one vote per person, but if we thought online polling could have more than entertainment value, maybe we would try harder to come up with better systems.  (I wonder if it would be possible to set up an online poll that actually let you vote as often as you wanted, but indicated you had done so.  Sometimes it’s entertaining to see who cares the most, or maybe more accurately, has the most time on his hands.)  As Mimi stated earlier, choosing to participate in polls, surveys, and studies that shape our world and our lives is increasingly becoming as democratic a duty as voting in the election booth.
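The parenthetical idea above, a poll that lets people vote as often as they like but shows that they did, can be sketched in a few lines. This is only an illustration under a big assumption: that each voter can be tied to a stable `voter_id`, which is exactly the unsolved one-vote-per-person problem. The class and method names are hypothetical.

```python
from collections import Counter, defaultdict


class OpenPoll:
    """Hypothetical poll that allows repeat votes but reports them openly."""

    def __init__(self, options):
        self.options = list(options)
        # votes[voter_id] counts how many times that voter chose each option.
        self.votes = defaultdict(Counter)

    def vote(self, voter_id, option):
        if option not in self.options:
            raise ValueError(f"unknown option: {option}")
        self.votes[voter_id][option] += 1

    def raw_totals(self):
        """Every click counts: the number a freeped poll would display."""
        totals = Counter({option: 0 for option in self.options})
        for ballot in self.votes.values():
            totals.update(ballot)
        return dict(totals)

    def one_per_voter_totals(self):
        """Each voter counts once, for the option they chose most often."""
        totals = {option: 0 for option in self.options}
        for ballot in self.votes.values():
            favorite, _ = ballot.most_common(1)[0]
            totals[favorite] += 1
        return totals

    def most_active_voters(self, n=3):
        """Voters ranked by how many times they voted, busiest first."""
        activity = {voter: sum(ballot.values()) for voter, ballot in self.votes.items()}
        return sorted(activity.items(), key=lambda item: item[1], reverse=True)[:n]


poll = OpenPoll(["Palin", "Biden"])
for _ in range(50):
    poll.vote("freeper-1", "Palin")
poll.vote("reader-2", "Biden")
poll.vote("reader-3", "Biden")

print(poll.raw_totals())
print(poll.one_per_voter_totals())
print(poll.most_active_voters(1))
```

In this made-up run the raw tally shows Palin ahead 50 to 2, the one-per-voter tally shows Biden ahead 2 to 1, and the activity list shows who had the most time on their hands. Publishing all three views side by side is the point of the sketch.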

Politics and Privacy, Part II

October 2nd, 2008 by Grace Meng

Last week, I wrote about how political data collection has shown that data collection doesn’t have to be a completely one-way street, but rather, can involve individuals’ active and sometimes almost enthusiastic participation.  Part of the enthusiasm comes from a belief that this is what democracy is about—we have the right to try to persuade our fellow citizens, whether from a soap box in the town square or by calling a voter list through a phone bank.  But the data collection by political campaigns encompasses a lot more than name, occupation, and email address.  Karl Rove revolutionized it, with his famous use of consumer preferences to identify and target likely Republican voters, but the Democrats have worked hard to catch up, Catalist being one of the big players in this effort. It’s one thing to compile donor lists; another to cross-reference “beer versus wine” preferences to voter lists.  How is democracy affected by intense, data-based voter profiling?

As Solon Barocas pointed out during his talk on voter profiling at the recent DIMACS workshop, researchers have found that micro-targeting voters can increase polarization and divisiveness.  As candidates are able to air one radio ad for the Latino voters in one state and a different one for the white voters in another, they’re able to espouse more extreme positions than they would if forced to appeal to a more general audience.

If true, this is a serious problem.  But I like to believe that in the long run, and done right, political data collection and analysis could actually enable new kinds of consensus and coalition-building.  For one, in an era where blogs monitor political campaigns hour-by-hour, a local radio ad can be made available to a national audience no matter which micro-audience was originally targeted.  (Update: we can even find out about “telephone” calls to the deaf community!)

But more importantly, I can imagine that if voters and not just campaigns were able to see who else felt the way they did on major issues, many might be surprised.  Solon mentioned that despite the headlines, the algorithms by which likely Democratic or Republican voters are identified are not as simple as beer = conservative, wine = liberal.  Yes, campaigns believe they can figure out who in a community might lean in their direction, but it’s a much more complicated calculation.

So if people chose to share and know who else felt similarly, in ways that were more fine-grained than national polls, really interesting things could happen to our political discourse.  The Left Coast environmentalist might learn the hunter in South Dakota shares a commitment to conservation.  The pro-choice atheist and the pro-life Catholic might learn they both oppose the death penalty.  I’m not advocating that we throw open the curtains on the voting booth.  But knowing how our fellow citizens feel about the issues facing all of us—it almost sounds like that old-fashioned American democratic institution, the town hall meeting.

After all, democracy is the ultimate social activity.  We’re supposed to be making decisions together.

Politics and Privacy, Part I

September 26th, 2008 by Grace Meng

Rock the Vote Application

Catalist and Rock the Vote recently launched an effort to increase voter registration through a very exact tool, a Facebook application.   Using Catalist’s voter targeting databases, and knowing who has downloaded a voter registration form from Rock the Vote’s widget, they’re asking Facebook users to call the people who never actually sent in their forms and remind them to do so.

I’m curious to know how potential voters are responding to these phone calls.  Given Rock the Vote’s target demographic, and the age of most users on Facebook, they may not be as shocked to get a phone call as an older voter might. And in general, I think people are more aware that their personal information is being collected, analyzed, and shared in the political context than they are in other contexts.  I’ve had friends tell me they don’t make donations, even to candidates they support, for fear of getting on “some list.”  And anyone who has ever lived in a state or district involving a close race knows that it’s not uncommon to have a total stranger call you or even knock on your door and ask for you by name.

These kinds of intrusions can be annoying, and in some communities, being outed as a Democrat or a Republican can have more serious repercussions.  But in general, I don’t think the public is as uncomfortable with this kind of data collection by political campaigns and the Federal Election Commission as they are when it’s being done by search engines or ISPs.  (I’m not talking specifically about detailed voter profiling and data mining, which I think is slightly different and will blog about separately.)

I think there are a couple of reasons for this.  First, people believe there are a number of issues that have to be weighed.  It’s not just their privacy rights versus a company’s profits, but their privacy rights versus democratic principles, like government transparency in the case of FEC disclosure.  Second, the data collection is extremely obvious.  We all know campaigns are tracking who’s donated, so they can ask again and again and again, at least until that maximum contribution limit is reached.

Most importantly, though, people want their candidate to win.  If they are contributing more than $200 in an election cycle to a political candidate, they can live with being in the campaign’s database, as well as the FEC’s.  If they care enough to go to a rally and then are asked for their email address, they don’t mind being sent emails from the campaign day after day.  They know that if they are called during dinner and reminded to vote for their candidate, the other likely voters are being called, too.  Heck, the most enthusiastic supporters are using the data themselves, by volunteering for phonebanks and canvassing, as with the Catalist/Facebook application.

Political data collection has some lessons to teach data collection in other arenas.  Don’t try to hide what you’re doing—be obvious.  Even more importantly, give people an incentive to provide information.  Google and Yahoo can assure us that the log data, the IP addresses, the tracking they do when we’re logged into their email accounts, are all meant to provide us a better service, but we don’t really feel like we’re getting something out of it, especially compared to what they’re getting out of it.  These companies currently seem to be working on the model of “Don’t worry, whatever we’re doing won’t hurt you.”  The model should be, “Participate and get value out of the data yourselves.”

DIMACS Workshop on Internet Privacy

September 25th, 2008 by The Common Data Project

Intuitive as a door

Slide from our presentation; image from Harpeth Presbyterian Church

The Common Data Project recently attended the DIMACS Workshop on Internet Privacy at Rutgers University.  Since we’d already introduced the basic idea of a datatrust at the last DIMACS workshop we attended in February, we decided to do a presentation on a more specific aspect of our work—how an individual user might interact with the datatrust.  We want to create a new paradigm, a completely new way for individuals to collect their own personal information and share it with others—whether friends, researchers, or businesses—in ways individuals dictate.  Alex emphasized how such a model must be more intuitive than the opt-in/opt-out models available today, and walked through how this might be possible.

Given that the topic “Internet Privacy” covers a range of issues, the workshop drew a diverse group of participants. We heard a presentation by Adam Smith at Penn State University on differential privacy, a new area of research that we’ve been interested in for some time now, with the hope that it could be useful to our datatrust.  Daniel Howe from NYU and Felipe Saint-Jean from Yale presented on TrackMeNot and Private Web Search, two different approaches to obscuring identification by search engines, leading to an intense discussion on the ethics of purposefully messing with the business model of Google and the other search engines.  EJ Jung from the University of Iowa gave a fascinating talk on the ways controls have been placed on access to data in the Medical Image File Archive (MIFAR) at the Radiology Department.  We found her talk particularly compelling, as her project deals very practically with existing data and the obvious needs of doctors, researchers, and patients.  Solon Barocas at NYU, who also spoke on our panel, shared his research on how data-mining is used by political campaigns for voter profiling, which raises interesting and possibly troubling implications for democracy.

We were also struck by Naftaly Minsky’s presentation on preventing servers from abusing their clients, as he discussed the possibility of hypothetical “trusted third parties” to act as intermediaries between individuals with information and businesses and other organizations that seek information.  His description of the “trusted third party” seemed to us somewhat similar to our conception of a datatrust.  We’re looking forward to exploring further how his research, as well as the other research we learned about, could shape our work.

P.S. 8 in Brooklyn Heights gets an F. Really?!

September 19th, 2008 by Mimi Yin

Is this a case of “question the data when it tells you something you don’t want to hear”?

Or is there something genuinely broken about NYC’s grading system? How could a school in a neighborhood where the median home sale price for June-August of this year was $2.775 million receive a failing grade? For that matter, what is an F going to do to the median home sales price? (But that’s neither here nor there.)

For what it’s worth, I think the grading system sounds reasonably well thought out: 5% is based on attendance and 10% on parent surveys (which actually gives P.S. 8’s parents an opportunity to influence the grade by expressing their satisfaction over the school’s various extra-curricular activities).

The lion’s share of the grade is based on year-over-year improvement (not for a particular grade, as in this year’s 4th graders versus last year’s 4th graders, but for the same class of students: this year’s 4th graders versus last year’s 3rd graders). It’s a subtle but important difference.

In effect, what’s being measured is year-over-year improvement of the students, not the teachers. The remaining 25% is based on median scores (but calculated against other schools and schools with similar demographics).
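To make the arithmetic concrete, here is a rough sketch of how such a weighted grade could be computed. It is not the city’s actual algorithm. The 5%, 10% and 25% weights come from the account above; the 60% weight for progress is inferred as whatever is left over, and the 0-100 scale, the function names and the sample numbers are all assumptions made for illustration.

```python
WEIGHTS = {
    "attendance": 0.05,
    "parent_surveys": 0.10,
    "progress": 0.60,      # inferred: the remainder after the three stated weights
    "performance": 0.25,
}


def cohort_progress(last_year_scores, this_year_scores):
    """Average score change for the same students, matched by student id.

    This compares this year's 4th graders with themselves as last year's
    3rd graders, rather than with a different group of children.
    """
    shared = last_year_scores.keys() & this_year_scores.keys()
    if not shared:
        raise ValueError("no students appear in both years")
    return sum(this_year_scores[s] - last_year_scores[s] for s in shared) / len(shared)


def overall_score(components):
    """Weighted sum of component scores, each assumed to be on a 0-100 scale."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# A hypothetical school with perfect attendance, delighted parents and top
# test scores, but no measured year-over-year progress.
flat_school = {
    "attendance": 100,
    "parent_surveys": 100,
    "progress": 0,
    "performance": 100,
}
print(overall_score(flat_school))
```

Under these assumed weights, a school that is perfect on everything except progress still tops out at 40 points out of 100, which shows how heavily a system built this way leans on the progress measure.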

Still, F seems a bit harsh. And the consequences of receiving an F sound harsh too.

Mr. Klein has said that schools that receive a D or an F two years in a row could be closed or the principal could be removed.

So, is NYC really better off without P.S. 8 than with it?

Perhaps there is something not quite right about the algorithm. Unless I’m completely misunderstanding this third-hand account of how the algorithm works, recent demographic changes in the P.S. 8 district may be the reason why the school received an F, as opposed to a D or C. (P.S. 8 received a C in 2007.)

A quarter of the students now qualify for free lunch, compared with 98 percent in 2002, and more than half the students are white or Asian-American, up from 11 percent in 2002. Most of these changes are happening among the youngest children, before tests begin in the third grade.

In short, P.S. 8 is now competing against some of the most privileged NYC schools. However, P.S. 8’s privileged (aka white) kids are, for the most part, still too young to be tested. As a result, the burden of competing with the city’s other well-off schools is carried mostly by P.S. 8’s less privileged middle-schoolers.

However, this rather significant oversight in the city’s algorithm is not what P.S. 8’s parents (at least the ones who were quoted in the NYT article) are putting forth as evidence of the grading system’s inadequacy.

Several P.S. 8 parents suggested that the F said more about the grading system than the school. They cited events like the annual read-a-thon fund-raiser, an art program that culminated in student work’s being showcased at the Guggenheim, and the school’s recent selection as Brooklyn’s Rising Star Public Elementary School for 2008 by Manhattan Media, a publisher of weekly newspapers.

But, what does the Guggenheim have to do with anything? Not much, if you buy into the city’s reasoning for what the grading system is trying to measure.

“What we want with progress reports is to measure what schools add to kids, not what kids bring to the schools,” Mr. Liebman said.

So what’s the issue here? A fundamental disagreement about what should be graded? Dissatisfaction with how it’s graded? Sour grapes over the results of the grades?

It’s hard to imagine that anyone would have made a peep about the city’s algorithm if P.S. 8 had received an A. At the end of the day, we all lend too much credence to data that tells us what we want to hear and too readily discount data that surprises us with bad news. It’s a modern-day variant of “shooting the messenger.” And although in the case of P.S. 8 the messenger does seem to have munged up the message a bit, it certainly wasn’t munged as badly as some would like to believe.

Upcoming CDP Presentation at DIMACS

September 16th, 2008 by The Common Data Project

The Common Data Project is excited to announce we will be presenting at the DIMACS conference this week.  Officially called the “Workshop on Internet Privacy: Facilitating Seamless Data Movement with Appropriate Control,” the conference is organized by Dan Boneh, Ed Felten, and Helen Nissenbaum.

Alex Selkirk will be speaking on a panel on Thursday, September 18, called, “Aggregation, Mining, Profiling: Who should be in control?”  We’re looking forward to the feedback we’ll get at the conference, as we’re eager to share our ideas and learn from others who are on the program.  We’ll provide more information on our presentation after the conference, and we look forward to hearing your thoughts.

Stand up and be counted!

September 11th, 2008 by Mimi Yin

Just to quote a different news source:

John McCain’s choice of Alaska Governor Sarah Palin as his running mate is sitting very well with a lot of American voters, according to the latest FOX News poll.

The new survey also shows that—among all four candidates running—Palin (at 33 percent) is seen as most likely to understand “the problems of everyday life”—barely outpacing Barack Obama (32 percent), and finishing significantly ahead of both McCain (17 percent) and Joe Biden (10 percent).

Among independent voters, Palin’s lead over Obama on this score widens to 13 points (35 percent to 22 percent).

Fox also provides a link to the “raw data”. (Don’t get too excited.)

Apparently, raw data means a list of the questions asked, broken down by party (Republicans, Democrats and Independents) over time.

Raw data doesn’t include: Who are these people being polled? How did they get so lucky as to be selected to represent the viewpoints of their fellow voting Americans?

I’m not going to get into the validity of political polling here - we have explored poll validity at some length before. (e.g. Polls are conducted entirely through land lines, excluding the 15% of Americans that only have a cellphone number.) Frankly I don’t know enough to say anything useful about them, other than to express the usual lay person skepticism about such things.

Instead, I think a far more interesting question the recent flurry of political polling highlights is how such surveys, or any data collection effort for that matter, affect the people who choose not to participate or are never selected to participate in such polls. Most of us breathe a sigh of relief when we dodge a telemarketer who wants to spend “a quick 10 minutes” with us on the phone to answer a “simple survey”. But what are we passing up by not participating when so much decision-making is data-driven?

A missed questionnaire about cleaning products is nothing to fret about. But what about declining to hand over personal medical data to the doctors, nurses (not to mention hospitals, researchers and insurance companies) we depend on to treat our ailments and prevent future problems?

To cite yet another poll to back up my point about polls ;)

“According to a recent poll, one in six adults (17%) – representing 38 million persons – say they withhold information from their health providers due to worries about how the medical data might be disclosed.

Persons who report that they are in fair or poor health and racial and ethnic minorities report even higher levels of concern about the privacy of their personal medical records and are more likely than average to practice privacy-protective behaviors.”

Harris Interactive Poll #27, March 2007.

More info on the poll here.

Case in point: Back in April, I wrote about the site PatientsLikeMe.com, which provides a wonderful new service that allows individual users to share the most intimate details of their medical conditions and treatments, which in turn creates a pool of invaluable information that is publicly available. However, I also wonder about how their data may be skewed because their users are limited to the pool of people who are comfortable sharing their HIV status and publicly charting their daily bowel movements. The question we have for PatientsLikeMe is: Who isn’t being represented in your data set? And how does that affect the relevance of your data to the average person who comes to your site looking for information? Who won’t find your data helpful because it’s not relevant to their personal situation?

Increasingly, companies, agencies at all levels of government, researchers who advise policy-makers and even individuals are making “data-driven” decisions.

Yet, how often do we dismiss a study by scoffing at the limited range of its participants?

So what do we do? Tell everyone Privacy is soo 20th century. The new millennium is all about self-exposure.

How we resurrect the notion of privacy in a world that can no longer depend on a closed door to protect us against invasions is a question we must find answers to. However, the solution is not to keep people away from sharing personal data. Doing so means giving up our place at the discussion table when it comes to influencing decisions, from ones as mundane as “How reliable does cellphone reception need to be?” to life-or-death ones such as “How many ambulances does my local hospital need?” and “What combination of therapies will work best for my condition?”

To state my case more strongly: Participating as a data point in data-driven research is a passive form of voting, the most basic of rights in a functioning democracy.

To be sure, this is an idealistic take on data collection. There remains the much thornier issue of how to ensure that data is used for mutual benefit, not for monitoring or predatory manipulation. (Factory workers being watched by union bosses at the voting booth faced similar challenges.)

However, to repeat our favorite mantra: The enemy isn’t the data!

Google announces data will be “anonymized” after nine months–but then what?

September 9th, 2008 by Grace Meng

Everyone is in a tizzy with the news that Google is slashing its data-retention policy from 18 months to nine.  To be more specific, Google will “anonymize IP addresses on our server logs after 9 months.”  The announcement, though, only highlights for me the lack of clarity around the word “anonymize” and the general lack of information around what these data retention policies are actually doing for users’ privacy.

Data-retention is a big issue for some privacy advocates, on the theory that something like the AOL privacy scandal wouldn’t have happened if AOL hadn’t been storing the search queries to begin with.  But as we’ve stated before, we at CDP don’t think data deletion is the answer.  In fact, we’re concerned that announcements like the one today from Google can actually further confuse consumers about what’s at stake.

To begin with, Google isn’t promising to delete its data after nine months, just to “anonymize” it.  The company knows that the word “anonymize” can mean quite a lot of things, and even says so: “We haven’t sorted out all of the implementation details, and we may not be able to use precisely the same methods for anonymizing as we do after 18 months…”

Google is being prodded by the European Union’s stricter regulations around privacy, but even the EU directive on data retention only states, “Such data must be erased or made anonymous when no longer needed for the purpose of the transmission of a communication, except for the data necessary for billing or interconnection payments.”  No clear directive on what “made anonymous” means.

When AOL made its search query data public, the company thought it had “anonymized” it.  Same when Netflix released its data.  That didn’t stop people from individually identifying people in the “anonymized” data set.  I trust that Google’s engineers are not using AOL’s and Netflix’s “anonymization” techniques, but it’s clear that focusing so much on the length of time data is retained draws attention away from what happens after the nine months are up.
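To see why the word matters, consider one thing “anonymize” could mean: zeroing out the last part of an IP address in a log entry. The sketch below is purely illustrative and is not a description of Google’s method, which by the company’s own account had not been settled. The function names, the log format and the sample addresses (taken from a range reserved for documentation) are assumptions.

```python
import ipaddress


def truncate_ipv4(address, keep_bits=24):
    """Zero out the low-order bits of an IPv4 address, keeping `keep_bits`."""
    network = ipaddress.IPv4Network(f"{address}/{keep_bits}", strict=False)
    return str(network.network_address)


def candidates_remaining(keep_bits=24):
    """How many distinct addresses collapse into one truncated value."""
    return 2 ** (32 - keep_bits)


def scrub_log_entry(entry, keep_bits=24):
    """Return a copy of a log entry with only the IP address truncated.

    Everything else, including the search query itself, is left untouched.
    """
    scrubbed = dict(entry)
    scrubbed["ip"] = truncate_ipv4(entry["ip"], keep_bits)
    return scrubbed


entry = {"ip": "203.0.113.77", "query": "hypothetical search about my own street"}
print(scrub_log_entry(entry))
print(candidates_remaining())
```

With the default setting, each scrubbed address could still belong to any of 256 machines, and the query text is stored exactly as typed. Whether that counts as “anonymous” is the question a retention period alone does not answer.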

How should we define “personal information”?

September 4th, 2008 by Grace Meng

We at CDP recently decided that in keeping with our work on developing new standards for online data collection, we should also create a survey of the privacy policies of the biggest online companies. We want to help users not only understand privacy policies more quickly and easily, but also compare the practices of different companies.

As a result, I’ve been spending a lot of time reading privacy policies.  I knew it wouldn’t be a fun activity, but it’s also been challenging in ways I didn’t quite anticipate.  As I started to sit down and actually compare policies across a set of specific issues, it quickly became obvious that although they use many of the same words (private, personal, anonymous), they aren’t all using the same definitions.

For example, Yahoo defines “personal information” as “information about you that is personally identifiable like your name, address, email address, or phone number, and that is not otherwise publicly available.”  Although it discusses the collection of other information, like log data and IP addresses, it never calls this information “personal.”  Ask.com takes a similar tack, disclosing that it does collect such information, but calling it “anonymous information.”

AOL, in contrast, defines “AOL Network Information” as “personally identifiable information” that includes data like IP addresses, sites visited, and search history.  Of course, AOL can’t pretend that such data is actually “anonymous.”  After all, its proud release of “scrubbed” search query data two years ago was quickly shown to reveal the individual identities of thousands of users.

So what do you think?  When a privacy policy makes promises about your “personal information,” should that include your search query history, your IP address, and your log data?  If not, does that mean these companies are free to do what they will with this data?  Leave it unsecured? Hand it over to marketers, government, anyone?

And what does it mean to us, as a society, that companies are defining these words on their terms?

The Common Datatrust Foundation Changes Name to The Common Data Project

August 18th, 2008 by The Common Data Project

We are excited to announce that we have a new name, The Common Data Project. We’ve changed our name to avoid confusion around our use of the words “trust” and “foundation.” As an organization trying to create a new kind of nonprofit institution, we were interested in using these words to help explain our work through analogies to existing institutions–a datatrust that holds an individual’s personal information like a personal financial account, an organization that provides “grants” of information to researchers and nonprofit organizations. But given the specific legal definitions of a financial “trust” and “foundation,” we’ve decided that it’s more important to avoid public confusion. After all, we’re very decidedly neither an investment company nor a private foundation.

In any case, we like the immediacy of the word “project”! We’re excited about moving forward on our Project and we hope you’ll get involved with our Project as well.