At ScienceOnline Together 2014, I moderated a session titled “social media as a scientific research tool” (background information here). We had a great discussion, and I want to thank everyone who came or participated virtually. For the benefit of those who couldn’t make it, here is a summary of what we discussed.
1) Social media and “big data” can be incredibly powerful research tools. The ability to study what hundreds, thousands, or millions of people are saying, and therefore thinking, about a given topic has countless implications for research. I listed five examples in my background blog post, but we discussed many more in the session.
2) It is relatively easy to use social media as a research tool. Though some of the software, like Radian6, can be expensive, if your project is relatively small, it can be inexpensive (even free) and simple to get the data you need. To demonstrate this, attendee Edmund Hart performed an analysis of tweets about the NC Natural Sciences Museum within a few hours of my session ending. Someone made a NodeXL graph of the resulting #scioResearch Twitter conversation before I had even gotten snacks in the break after my session. The programming involved is relatively simple, which means that, in the words of an attendee, even if you can’t do it yourself, you can buy your friendly neighborhood programmer a beer and they can do it for you quickly.
3) The ease in getting the data doesn’t mean that you shouldn’t involve a trained social scientist in study design and data analysis. In order to ensure that the data is analyzed correctly, be sure to involve someone in the project who is familiar with content and discourse analysis, the study of knowledge and attitudes, etc. Just because you can easily obtain the data does not mean that you have the training needed to properly analyze it. As has frequently been discussed, social science is a technical and rigorous discipline.
4) It is important to understand what data you AREN’T getting from social media. Any tool has its limitations. For example, if someone isn’t on Twitter (for any reason, including but not limited to “doesn’t have a smartphone/computer” or “doesn’t live in an area with internet/3G”), then you simply won’t be able to study their knowledge and attitudes using their tweets.
5) There are important ethical considerations when using social media and “big data” to study certain subjects. While someone’s tweets are essentially public statements from a legal perspective, someone with only a few followers on Twitter is probably not thinking of their tweets in this way. While a study like mine (knowledge and attitudes with respect to shark conservation and management) is unlikely to have any negative effect on the stakeholders I study, it is easy to see how a study that can detect medical conditions like postpartum depression could cause problems for the research subjects if the information got into the wrong hands. As with any research involving humans, it is important to get approval from your Institutional Review Board (IRB).
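As a rough illustration of point 2, here is a minimal Python sketch of the kind of simple analysis that came up in the session: counting which hashtags appear in a set of tweets that have already been collected. The sample tweets and the function name `count_hashtags` are made up for illustration; this is not the code any attendee actually used, and it says nothing about how the tweets were gathered.

```python
import re
from collections import Counter

# Matches a "#" followed by letters, digits, or underscores.
HASHTAG = re.compile(r"#(\w+)")

def count_hashtags(tweets):
    """Count how many tweets mention each hashtag, ignoring case."""
    counts = Counter()
    for text in tweets:
        # Use a set so a hashtag repeated within one tweet counts once.
        tags = {tag.lower() for tag in HASHTAG.findall(text)}
        counts.update(tags)
    return counts

# Hypothetical sample data, invented for this example.
sample_tweets = [
    "Great session on social media research #scioResearch",
    "Sharks need better PR #sharks #scioresearch",
    "No hashtags in this one",
]

counts = count_hashtags(sample_tweets)
for tag, n in counts.most_common():
    print(tag, n)
```

Even for something this small, points 3 through 5 still apply: the counts only describe the tweets you managed to collect, not everyone with an opinion on the topic.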
Thanks again to all who participated in the discussion! If you have any questions, please let me know in the comments!
> it can be inexpensive (even free) and simple to get the data you need.
It may not be as simple as it appears. To take the example of Twitter, probably the most-used and most-studied social data source: most collection tools rely on either Twitter’s Search API or its Streaming API, both of which have known incompleteness and sample bias. So, for example, a collection of “all” tweets using a given hashtag, made with those tools, will likely not include every tweet actually sent with that hashtag. It is also hard to know what portion of the tweets was missed, or in what pattern.
The only data source for which Twitter even claims completeness is the full “firehose” data, available only by arrangement with Twitter or one of its data partners, such as Gnip. Even with this data, there are questions about how its completeness or neutrality might be assessed or verified. The scrupulous path, I think, is to assume there isn’t really any “raw” or self-evidently neutral data from any source as complex and mediated as Twitter; there are just data artifacts, which have to be critically interpreted.
Conversary, Palo Alto