Special Session: danah boyd’s WWW2010 Keynote Talk, April 29 | Imagining the Internet

Social media expert from Harvard and Microsoft shares insights

Brief description: danah boyd, an expert on the social uses of the Internet who works for Microsoft and Harvard University’s Berkman Center, delivers a keynote speech at the WWW2010 conference.

Details of the session

danah boyd is known for her incisive analysis of where our use of social networks is taking us. In her keynote talk at the WWW2010 conference in Raleigh on April 29, she took a new direction from her highly publicized speech this spring at the South By Southwest conference in Austin, where she made a big splash discussing the implications of sharing our personal information online. She delivered a rousing challenge to the Web engineers and entrepreneurs in the audience to think before they act.

Her WWW2010 talk, titled “Privacy and Publicity in the Context of Big Data,” zeroed in on the theme of Web 2.0 and the influence of this era where “data is cheap, but making sense of it is not.” She said that in our obsession with data-sharing online, “we’ve forgotten to ask some of the hard critical questions about what all this data means and how we should be engaging with it.

“Privacy concerns are not new; people have been talking about privacy – or the lack thereof – forever,” she noted. “So what’s different now? The difference is big data… the kinds of data that marketers and researchers and business folks are currently salivating over… The opportunities to scrape data – or more politely, ‘leverage APIs’ – are also unprecedented. And folks are buzzing around wondering what they can do with all of the data they’ve got their hands on.”

Pointing out that the Internet has created new opportunities for people to produce and share data, interact with and remix data, aggregate and organize data, she said, “Data is the digital air in which we breathe and countless efforts are being put into trying to make sense of all of the data swirling around. When we talk about privacy and publicity in a digital age, we can’t avoid talking about data.”

She said the WWW conference is the perfect place to think critically about this issue and to start a conversation on methodology and ethics, because understanding the challenges will allow everyone to better frame how to address concerns over privacy and publicity.

The noted ethnographer explained that she maps culture. “Using social science logic,” she said, “I want to discuss four things that all working with big data must understand:”

1) Bigger Data are Not Always Better Data – “You need to know your dataset. Just because you’re seeing millions and millions of pieces of data doesn’t mean that your data is random or representative or anything. To make claims about your data, you need to know where the data comes from.”

2) Not All Data are Created Equal – “Big data has its limitations – it can only reveal certain things and it’s outright dangerous to assume that it says more than it does… Big Data introduces two new popular types of social networks derived from data traces: articulated social networks and behavioral social networks. Articulated networks are those that result from the typically public articulation of social networks. As in the public list of people’s Friends on Facebook. Behavioral networks are those that are derived from communication patterns and cell coordinates. Each of these networks are extraordinarily interesting, but they are NOT the same as what sociologists have historically measured or theorized… Measuring tie strength through frequency or public articulation misses the point: tie strength is about a mental model of importance and signals trust and reliability and dependence. Data is not generic. It doesn’t say generic things just because you can model it, graph it, or compute it. You need to understand the meaning behind the representations to understand what it can and cannot say. Not all data is equivalent even if it can be represented similarly.”

3) What and Why are Different Questions – “Nobody loves big data better than marketers. And nobody misinterprets Big Data better than marketers. My favorite moment came when I was on a panel where a brand marketer from Coca-Cola proudly announced that they had lots of followers on MySpace. I couldn’t help but burst out laughing. Coincidentally, I had noticed that Coca-Cola was quite popular as a ‘friend’ and so I had started poking around to figure out why. After interviewing a few people, I found the answer: Those who were linking to coke were making an identity statement, but it wasn’t the fizzy beverage that they were referring to. Analyzing traces of people’s behaviors and interactions is an extremely important research task. But it’s only the first step to understanding social dynamics. You can count until you’re blue in the face, but unless you actually talk to people, you’re not going to know why they do what they do. What and Why are different questions. If you want to work with big data, you need to know what questions you can answer and which ones you can’t. And projecting Why into What based on your own guesses is methodologically irresponsible. You can motivate your analysis through whatever you’d like, but if you’re going to make claims about your data, you better be sure that you’re measuring what you think you’re measuring.”

4) Be Careful of Your Interpretations – “Every act of data analysis involves interpretation, regardless of how big or mathematical your data is. There’s a mistaken belief that qualitative researchers are in the business of interpreting data and quantitative researchers are in the business of producing facts. As computational scientists have started engaging in acts of social science, it’s painfully common for them to come with a belief that they are in the business of facts. You can build a model that is mathematically sound but the moment you try to understand what it means, you’re engaging in an act of interpretation. You can execute an experiment that is structurally sound, but the moment you try to understand the results, you’re engaging in an act of interpretation… Interpretation is the hardest part of doing data analysis. And no matter how big your data is, if you don’t understand the limits of it, if you don’t understand your own biases, you will misinterpret it. This is precisely why social scientists have been so obsessed with methodology. So if you want to understand big data, you need to begin by understanding the methodological processes that go into analyzing social data.”

Boyd said she believes the number-one “destabilizer of privacy” today results from the obsession with big data. She added that biases, misinterpretations and the way they play out are affecting people’s lives, noting “the Uncertainty Principle doesn’t just apply to physics – the more you try to formalize and model social interactions, the more you disturb the balance of them. When you implement new features based on misinterpretations, you can hurt people.”

She said the biggest methodological danger zone this presents is “just because data is accessible doesn’t mean that using it is ethical” and added that this is why methodological ethics matters.

Boyd said the idea of privacy is a “collective understanding of a social situation’s boundaries and knowing how to operate within them, it’s about having control over a situation, it’s about understanding the audience and knowing how far information will flow. It’s about trusting the people, the situation, the context.”

She pointed out that “people seek privacy so they can make themselves vulnerable in order to gain something: personal support, knowledge, friendship and so forth.”

She said she wanted to emphasize five key points:

1) Security Through Obscurity Is a Reasonable Strategy – “In mediated settings like Facebook, recording and amplifying are now default. The very act of interacting with these systems involves accounting for the role of technology. As people make sense of each new system, they interpret the situation and try to act appropriately. When the system changes, when the context changes, people must adjust. But each transition can have consequences. People’s encounters with social systems rely on their interpretation of the context. And they’ve come to believe that, even when their data is recorded, they’re relatively obscure, just like they’re obscure when they’re in the ocean. And generally, that’s pretty true. Just because technology can record things doesn’t mean that it brings attention to them. So people rely on being obscure, even when technology makes that really uncertain. You may think that they shouldn’t rely on being obscure, but asking everyone to be paranoid about everyone else in the world is a very very very unhealthy thing. People need to understand the context and they need to have a sense of boundaries to cope. Even in public situations, people regularly go out of their way to ignore others, to give them privacy in a public setting. Sociologist Erving Goffman refers to this as ‘civil inattention.’”

2) Not All Publicly Accessible Data is Meant to be Publicized – “When we argue for the right to publicize any data that is publicly accessible, we are arguing that everyone deserves the right to be stalked like a celebrity. Even with the money and connections to actually maintain some kind of control, many celebrities go crazy or even die trying to navigate paparazzi. What might be the psychological consequences of treating everyone this way?”

3) People Who Share PII Aren’t Rejecting Privacy – “Historically, our conversations about privacy centered on ‘personally identifiable information’ or PII. When we’re thinking about governments and corporations, we usually resort back to PII. But people regularly share their name or other identifying information with others for all sorts of legitimate reasons. They almost always share PII when engaging in a social interaction. Social media is all about social interactions so, not surprisingly, people are sharing PII all the time. What they care about, what they’re concerned about is PEI: Personally Embarrassing Information. That’s what they’re trying to maintain privacy around. Too many people working with big data assume that people who give out PII want their data to be aggregated and shared widely. But this isn’t remotely true. And they certainly don’t want PEI mixed with PII and spread widely. They share data in context and, by and large, they want it to remain in context.”

4) Aggregating and Distributing Data Out of Context is a Privacy Violation – “Context Matters. There are two kinds of content that we focus on when we think about Big Data – that which is shared explicitly and that which is implicitly derived. There’s a nice parallel here to what sociologist Erving Goffman describes as that which is given and that which is given off. When people share something explicitly, they assess the situation and its context and choose what to share. When they produce implicit content, they’re living and breathing the situation without necessarily being conscious of it. Context still matters. It shapes the data that’s produced and what people’s expectations are. When you take content produced explicitly or implicitly out of its context, you’re violating social norms. When you aggregate people’s content or redistribute it without their consent, you’re violating their privacy. At some level, we know this. This is precisely why we force people to sign contracts in the form of Terms of Use that take away their right to demand contextual integrity. This may be legal, but is it ethical? Is it healthy? What are the consequences?”

5) Privacy is Not Access Control – “When we talk about privacy in technical circles, it’s hard to get past the technical issue: How does one represent privacy? We have a long history of thinking of content as public or private, of representing privacy through numerical sequences like 700. But this collapses two things: privacy and accessibility. File permissions are about articulating who can and cannot access something. Privacy is about understanding the social conditions and working to manage the situation. Limiting access can be one mechanism in one’s effort to maintain privacy, but it is not privacy itself. Privacy settings aren’t privacy settings; they’re accessibility settings. Privacy settings should be about defining the situation and communicating one’s sense of the situation to others. In LiveJournal, it’s common for participants to lock a post and then write at the top of the post everyone who has access to it. This process is context setting; it’s letting everyone who can see the post understand the situation in which the post is being produced and who is expected to be in the conversation. It’s dangerous to read the accessibility settings and assume that this conveys the privacy expectations. Unfortunately, because access controls are so common, we’ve lost track of the fact that accessibility and privacy are not the same things. And privacy settings don’t address the core problem.”
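For readers unfamiliar with the “numerical sequences like 700” mentioned in the fifth point, the short sketch below decodes such a number, on the assumption that it refers to a Unix-style file permission mode (the talk’s mention of file permissions suggests this, but the source does not say so outright). The helper name describe_mode is hypothetical and chosen only for illustration.

```python
import stat


def describe_mode(octal_digits: str) -> str:
    """Return the rwx string for a Unix permission mode written in octal, e.g. "700"."""
    mode = int(octal_digits, 8)
    # stat.filemode expects file-type bits as well; S_IFREG marks a regular file.
    return stat.filemode(stat.S_IFREG | mode)


if __name__ == "__main__":
    # Assumption: "700" is read as owner / group / other permission digits.
    print(describe_mode("700"))  # -rwx------ (owner may read, write, execute; nobody else)
    print(describe_mode("644"))  # -rw-r--r-- (owner may read and write; everyone may read)
```

The decoded string answers only the question of who can and cannot access a file. It records nothing about the situation, the intended audience or how far the owner expects the information to flow, which is the distinction between accessibility and privacy drawn in the talk.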

Boyd noted that “publicity twists it all” – saying, “All five of these issues present ethical questions for big data. Just because we can rupture obscurity, should we? Just because we can publicize content, should we? Just because we can leverage PII, should we? Just because we can aggregate and redistribute data, should we? The answers to these questions aren’t clear.

“Social norms can and are changing, but that doesn’t mean that privacy has been thrown out the door. People care deeply about privacy, care deeply about maintaining context. But they also care about publicity, or the right to walk out in public and be seen. Technology has provided new opportunities for people to actively seek to distribute their content. They can and should have a right to leverage technology to demand a presence in public. And technology that helps them scale is beneficial. The problem is that it’s hard to differentiate between publicly accessible data that is meant to be widely distributed and that which is meant to simply be accessible. It’s hard to distinguish between the content that people want to share to be aggregated for their own gains and that which is never meant for any such thing. It’s hard to distinguish between PII that is shared for social purposes and that which is shared as a self-branding exercise.

“This goes back to our methodological conundrum with big data. Not all data are created equal and it’s really hard to make reasonable interpretations from 30,000 feet without understanding the context in which content is produced and shared. Treating data as arbitrary bytes is bound to get everyone into trouble. So we’re stuck with an ethical conundrum: do we err on the side of making sure that we care for those who are most likely to be hurt or do we accept the costs of exposing people?”

She talked in detail about Facebook’s privacy-boundaries-pushing actions over the past several years. “Facebook has slowly dismantled the protective walls that made users trust Facebook. Going public is not inherently bad – there are plenty of websites out there where people are even more publicly accessible by default. But Facebook started out one way and has slowly changed, leaving users either clueless or confused or outright screwed. This is fundamentally how contexts get changed in ways that make people’s lives really complicated. Facebook users are the proverbial boiling frog – they jumped in when the water was cold but the water has slowly been heating up and some users are getting cooked.

“…Healthy social interaction depends on effectively interpreting a social situation and knowing how to operate accordingly. This, along with an understanding of how information flows, is central to the process of privacy. When people cannot get a meaningful read on what’s happening, people are likely to make numerous mistakes that are socially costly. Facebook does a great job of giving people lots of settings for adjusting content’s visibility, but they do a terrible job of making them understandable. Even when they inform people that change is underway, they opt people in by default rather than doing the work of convincing people that a new feature might be valuable to them. The opt-out norm in Facebook – and on many other sites – is not in the better interest of people; it’s in the better interest of companies.”

She said Facebook could tell you all of the services that have accessed your data through their APIs and all of the accounts inspecting any item of content. People would like to have this feature, she noted, “but it’s not in the company’s better interest,” because it is likely to stifle participation: people will recognize that they are being scrutinized and may withdraw from sharing their information in public.

“It’s easy to swing to extremes, preaching about the awesomeness of all of these new technologies or condemning them as evil,” she said. “But we know that reality is much more complicated and that the pros and cons are intricately intertwined. Teasing out how to walk the tightrope of privacy and publicity is going to be a critical challenge of our era.”

She said social norms are messy and unstable, and that data is made of people: people producing data in a context and for a purpose. She added that “just because it’s technically possible to do all sorts of things with that data doesn’t mean that it won’t have consequences for the people it’s made of.”

She directed her closing remarks at the crowd of Web engineers in her audience. “You have the technical and organizational chops to shape the future of code. What you choose to build and how you choose to engage with big data matters. What is possible is wide open, but so are the consequences of your decisions… Privacy will never be encoded in zeros and ones. It will always be a process that people are navigating. Your challenge is to develop systems and do analyses that balance the complex ways in which people are negotiating these systems. You are shaping the future. Build the future you want to inhabit.”

(Thanks to danah for sharing her notes so this representation of her talk could provide so many deep, specific details and complete, accurate direct quotations. She always publishes her talks, and you will soon find the complete transcript here.)
