## Background
Recent research finds that political polarization in America is growing (Moody & Mucha, 2013), in part because of the diminishing overlap in news media consumption between liberals and conservatives (Pariser, 2011; Pew Research Center, 2014). More than ever before, liberals and conservatives exist in different worlds of news about politics and government. They turn to and trust different sources for information about current events (Iyengar & Hahn, 2009; Pew Research Center, 2014). These patterns are reinforced by the proliferation of social media: Because individuals often choose to friend or follow others who share their political views (among other characteristics), the news and opinions they see on their newsfeeds and threads are more likely to align with their political ideologies (Bakshy, Messing, & Adamic, 2015; Pariser, 2011).
Scientists have only recently begun studying this polarization in media consumption. Much of this research has focused on “echo chambers” (e.g., Barberá, Jost, Nagler, Tucker, & Bonneau, 2015) or “filter bubbles” (e.g., Pariser, 2011) in social media, examining whether social media has resulted in increased selective exposure to viewpoints in line with users’ own political preferences. In addition, network analyst Valdis Krebs has examined polarization in individuals’ trust of news sources (2014) and political book purchases (2000; 2003; 2004). This past work largely concentrates on how social media platforms and individuals themselves shape the way they interact with online news media.
The present research takes a step back, setting aside the behavior of individual actors to examine first the new landscape of online news media. A few decades ago, news in America was primarily delivered through a small number of network newscasts and newspapers that targeted all consumers with balanced "point-counterpoint" coverage of events (Iyengar & Hahn, 2009). Consumers chose from this small number of available sources and received the majority of their news from that program and/or newspaper (Messing & Westwood, 2014). Today, the number of news outlets available to consumers has grown exponentially. The rapid growth of cable television and the digitalization of news and radio have led to a "fragmented information environment" that gives consumers unprecedented choice in the news they consume (Iyengar & Hahn, 2009). This has led news media to restructure completely in an effort to attract and retain consumers in a divided market. From catering to consumers' political perspectives (Mullainathan & Shleifer, 2005) to attempting to attract viewers to specific stories in the new absence of source loyalty (Messing & Westwood, 2014), the underlying structure of news media has fundamentally changed. In this research, we aim to better understand the nature of the new structure of online news media and, ultimately, how this news environment may affect consumers' understanding of events and attitudes.
We will focus on two aspects of online news media in particular. The first is **hypertextuality** (Deuze, 2003), a new feature of journalism unique to the online arena. Through external hypertextuality (hyperlinks in one article that link to stories on other media sources’ websites), news media may create an inherent structure in which some sources are connected through hyperlinks frequently, whereas other connections are much rarer.
The second aspect of underlying structure we will examine in this research is **thematic clustering**. Unlike the "point-counterpoint" style of earlier news media, today's sources are likely to convey different perspectives on the same event. In doing so, we might expect that they will focus on different aspects of these events through the text, photographs, reposted tweets, and Instagram posts they choose to include. As such, certain online sources could cluster together thematically, choosing similar sorts of content to emphasize and differentiating themselves from other sources that tend to cover different aspects of events.
We hypothesize that hypertextuality and thematic clustering will overlap, such that sources that tend to link to one another will also tend to emphasize the same themes in their news coverage. Such a finding would suggest that the new structure of online news media may inherently encourage "echo chambers" or "filter bubbles"--even apart from the influence of social media filtering and individual choice--through its organization into connected pockets of articles espousing homogeneous perspectives.
We will examine online news media in the context of a recent polarizing racialized event, the shooting of a black unarmed teenager in Ferguson, MO, by a white police officer. The topic of race and cross-race relations is particularly pertinent in the context of political polarization, as survey research shows that attitudes toward race, ethnicity, and "identity politics" increasingly differ along party lines (Pew Research Center, 2014), including when it comes to interpreting the role of race in events such as recent police shootings of black Americans (Pew Research Center, 2015). This focus is thus timely, and also will allow us to meld our study of online news media with existing research on intergroup relations. For example, past research has found evidence of linguistic intergroup bias (the tendency to communicate about racial in-group and out-group behavior at different levels of abstraction; Gorham, 2006; Karpinski & von Hippel, 1996; Maass, Salvi, Arcuri, & Semin, 1989; Perdue et al., 1990; Semin & Fiedler, 1988) in individuals’ speech, a potential indicator of implicit prejudice (von Hippel, Sekaquaptewa, & Vargas, 1997). Thus, as part of our thematic investigation, we will examine whether different media sources show evidence of linguistic intergroup bias as well, and whether the degree to which this bias appears in news coverage differs as a function of source. By examining a racialized event, this work will have implications for our understanding of the effects of media on intergroup relations in addition to our understanding of political polarization in the media.
## Study Design
Articles will be collected from approximately 70 sources identified by the Pew Research Center in January 2015 as the top online news entities and the top African American-oriented websites in America. Using the search engines of these websites and Google search functions, articles from August 9th-19th, 2014, will be identified using the keyword “Ferguson.” These will be screened by research assistants to make sure they are about the correct “Ferguson.” Article contents (including article source, author, title, text, images, and hyperlinks) will then be extracted and entered into spreadsheets.
## Network Analysis
#### Network 1: Hyperlinks
Data will be transformed into a network with each source representing one vertex and each hyperlink between two sources representing an edge. For example, if an article in the Huffington Post links to a Washington Post article, this would be represented by an edge connecting the Huffington Post and Washington Post vertices. Edges will be weighted such that edge weights represent the number of times any article from one source links to any article of another. Both directed (from the source containing the link to the linked source) and undirected networks will be formed.
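As a minimal sketch of this step (with hypothetical source names and link records standing in for the extracted spreadsheet data), the weighted directed and undirected edge counts can be tallied directly:

```python
from collections import Counter

# Each record is (linking source, linked source), one per hyperlink found.
# These example records are hypothetical placeholders.
hyperlinks = [
    ("Huffington Post", "Washington Post"),
    ("Huffington Post", "Washington Post"),
    ("Washington Post", "New York Times"),
]

# Directed edges: weight = number of hyperlinks from one source to another.
directed = Counter(hyperlinks)

# Undirected edges: collapse direction by sorting each vertex pair.
undirected = Counter(tuple(sorted(pair)) for pair in hyperlinks)

print(directed[("Huffington Post", "Washington Post")])  # weight of 2
```

These weighted edge lists can then be loaded into any network-analysis package for the measures described below.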
We will also attach the following attributes to these graphs: which list of top media sources they came from (Pew Top 50 News Websites or Pew Top African American-Oriented Websites), national ratings of trustworthiness from liberals and conservatives (Pew, 2014), amount of traffic to the source (i.e., number of unique visitors and average time spent on site; Pew, 2015), any demographics of the sources' readership, demographics of the articles' authors, and demographics of the sources' editorial staff.
We will analyze these networks at three levels:
1. ***Network-level:***
- *Network connectedness:* First, we will examine the connectedness of the network as a whole. We will calculate various network-level measures as indicators of overall network structure, such as average degree (weighted & unweighted), geodesic distance (i.e., mean shortest path length), clustering coefficient, density, inclusiveness, reciprocity, and connected components. These values will primarily be used as descriptive statistics for the network. However, we can also explore whether our observed values differ from what we would expect by chance by generating 10,000 random networks and calculating the same network statistics on each. If a value as extreme as our observed value (in either direction--i.e., a two-tailed test) occurs in less than 5% of the 10,000 random networks, we will conclude that the observed network differs significantly from what we would expect by chance (based on an inference criterion of *p* = 0.05).
- *Assortativity:* Based on attached attributes, we will also examine assortativity of the network (i.e., the degree to which sources link to other sources that have similar characteristics in the network in general). We can calculate assortativity based on source origin (general top 50 or top African-American sites), trustworthiness by partisan users, site popularity/traffic, and other attributes. Using the random sampling method we outlined above, we can also test whether assortativity based on these attributes differs from what we would expect by chance.
2. ***Subgroup-level:*** We will use community detection methods to identify subgroups in the network. There are several methods that each define subgroups slightly differently, but the overall goal is to identify groups of sources that tend to link frequently to other sources inside the group and less frequently to other sources outside of that group. We will begin by identifying cliques--instances where every vertex is directly connected to every other vertex--which is the most stringent criterion for identifying subgroups, and will then proceed to relax this criterion using n-cliques, k-cores, and hierarchical clustering. This type of method will not be feasible if our network is too sparsely connected; in that case, community detection methods such as the Girvan-Newman algorithm (with the stopping rule of maximizing modularity) will be more appropriate. Recent research also suggests community assortativity as a method of assessing community structure that is robust to sampling error and the number of observations (Shizuka & Farine, 2016). The goal of these analyses is to identify any meaningful subgroups in the overall online news media network, which we will use in the article content analysis to determine whether membership in a certain subgroup predicts different themes in coverage of the Ferguson shooting. We will also examine whether key attributes (e.g., political leaning, size, or audience of source) predict membership in different network subgroups.
3. ***Actor-level:***
- *Centrality:* We will use actor-level centrality measures (degree, in-degree, out-degree, Eigenvector, betweenness, and closeness) to examine which sources are the most and least central to the online news media network (in the context of the Ferguson shooting coverage). We will also examine whether centrality differs as a function of any key attributes (e.g., political leaning, size, or audience of source).
- *Brokerage:* Using the communities detected in the subgroup-level analyses, we will examine brokerage at the vertex-level to determine whether any sources play representative, gatekeeper, or liaison roles between different communities in the network. For example, if we detect 3 subgroups--A, B, and C--a representative source will show an A-->**B**-->B relationship, indicating that sources of another subgroup tend to link to the representative source (in bold), which links to other sources in its own subgroup. A gatekeeper source would show an A-->**A**-->B relationship, indicating that sources of the same subgroup tend to link to the gatekeeper source (in bold), which then links out to sources of other subgroups. A liaison source shows an A-->**B**-->C relationship, serving as the bridge between sources of other subgroups. The emergence of the remaining brokerage roles--coordinator (A-->**A**-->A) and itinerant (A-->**B**-->A)--indicates that a particular source serves as a bridge between other sources of the same subgroup. The purpose of conducting brokerage analysis is primarily descriptive, but we can also compare the ways that sources in the same community cover the Ferguson shooting depending on whether they are a bridge to sources outside the community (representative, gatekeeper, liaison) or inside the community (coordinator).
- *Homogeneity of linked sources:* Actor-level measures of homogeneity will be used to supplement the network-level assortativity measure. Actor-level measures include the Herfindahl-Hirschman (H-H) Index (a measure of how concentrated or diverse a node's ties are) and homophily (a measure of how similar a node's ties are to the node).
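The random-network significance test described at the network level above can be sketched as follows, using reciprocity as the example statistic. The tiny example network, the number of random draws, the randomization scheme (fixed numbers of vertices and edges), and the distance-from-the-null-mean definition of "as extreme" are illustrative assumptions, not final analysis choices:

```python
import random

def reciprocity(edges):
    """Fraction of directed edges whose reverse edge is also present."""
    edge_set = set(edges)
    return sum((b, a) in edge_set for (a, b) in edge_set) / len(edge_set)

def random_directed_graph(n_vertices, n_edges, rng):
    """Random directed graph with fixed numbers of vertices and edges."""
    possible = [(i, j) for i in range(n_vertices)
                for j in range(n_vertices) if i != j]
    return rng.sample(possible, n_edges)

# Hypothetical observed network over vertices 0-3; reciprocity = 2/5.
observed_edges = [(0, 1), (1, 0), (1, 2), (2, 3), (3, 1)]
observed = reciprocity(observed_edges)

# Null distribution (1,000 draws here; 10,000 in the planned analysis).
rng = random.Random(42)
null = [reciprocity(random_directed_graph(4, 5, rng)) for _ in range(1000)]

# Two-tailed p: share of random networks at least as far from the null mean
# as the observed value.
mean_null = sum(null) / len(null)
p = sum(abs(r - mean_null) >= abs(observed - mean_null) for r in null) / len(null)
```

The same loop applies to any of the network-level statistics listed above by swapping out the statistic function.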
#### Network(s) 2+: Images / Content Themes
The above networks are driven purely by the hyperlinks present in the articles collected, and are therefore agnostic to any of the content analysis outlined below. We plan to form the hyperlink networks first, then proceed to content analysis. After the content analysis, however, we can return to the network approach by forming an incidence matrix based on images or content themes. In other words, we will create a spreadsheet indicating which source employed which image or content theme (a binary judgment -- it did or it did not). This can then be converted to a one-mode network by imputing an edge between any two sources that both used a certain image or content theme. If there are multiple image types or content themes, these edges can be weighted by how frequently the two sources overlap along different dimensions of content.
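A minimal sketch of this two-mode-to-one-mode conversion, with hypothetical source names and content themes:

```python
from itertools import combinations

# Incidence data: which (hypothetical) source used which content theme.
incidence = {
    "Source A": {"protest images", "police militarization"},
    "Source B": {"protest images", "looting"},
    "Source C": {"police militarization", "protest images"},
}

# One-mode projection: an edge between two sources if they share at least
# one theme, weighted by the number of shared themes.
edges = {}
for s1, s2 in combinations(sorted(incidence), 2):
    shared = incidence[s1] & incidence[s2]
    if shared:
        edges[(s1, s2)] = len(shared)

print(edges)
```

The resulting weighted edge list has the same shape as the hyperlink networks above, so the two can be compared directly.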
This is an important step because it will be informative about the kinds of inferences we can make based on hypertextuality. Does one source frequently linking to another mean that they share a similar perspective on the Ferguson shooting? Or do sources link to other sources that have opposite perspectives, either to call attention to or discredit other interpretations?
If the networks based on hyperlinks and the networks based on shared content are similar, this will support the idea that online news media has organized into "echo chambers" where similar sources link to one another, facilitating movement between homogeneous-perspective sources, while dissimilar sources do not link to one another, impeding movement between heterogeneous-perspective sources. If the hyperlink-driven and content-driven networks do *not* resemble each other, this will not support the echo chamber hypothesis. Instead, this will suggest that the news media environment facilitates (or at least, doesn't impede) movement across heterogeneous-perspective sources, which may suggest that any "echo chambers" that are created may be more influenced by social media and individual choice than any inherent property of online news media today.
## Text Analysis
The text from each of the articles and sources will be explored in a number of ways in order to understand how coverage of this event differs among different types of media sources.
#### Sentiment Analysis
Sentiment analysis is a straightforward method of exploring the positivity/negativity of a body of text. The metrics we will derive using this approach include:
- The overall positivity of a given document. Where $posemo$ and $negemo$ are the sets of positive and negative words in the LIWC dictionaries (Pennebaker, Booth, & Francis, 2007), and $w_{doc}$ are the words in the document, the overall positivity $LIWC_{pos}(Doc)$ is:
\begin{equation}
LIWC_{pos}(Doc) = \frac{\sum_{{w}\in{posemo}} w_{doc}}{\sum_{{w}\in{posemo}\cup{negemo}} w_{doc}}
\end{equation}
- The overall subjectivity of a document (Godbole, Srinivasaiah & Skiena, 2007):
\begin{equation}
subjectivity(Doc) = \frac{\sum_{{w}\in{posemo}\cup{negemo}} w_{doc}}{\sum_{w} w_{doc}}
\end{equation}
- We will also explore using a VADER score as a metric of sentiment. VADER is a sentiment analysis tool that has recently shown good performance in comparison to LIWC on a variety of datasets, and is able to capture some context by using a set of simple rules and some crowdsourced labeling (Hutto & Gilbert, 2014).
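The two LIWC-based scores above reduce to simple word counts. A minimal sketch, using small stand-in word lists rather than the actual LIWC posemo/negemo dictionaries:

```python
# Hypothetical stand-ins for the LIWC posemo and negemo dictionaries.
posemo = {"peaceful", "calm", "support"}
negemo = {"angry", "violent", "riot"}

def positivity(words):
    """LIWC-style positivity: positive words / (positive + negative words)."""
    pos = sum(w in posemo for w in words)
    neg = sum(w in negemo for w in words)
    return pos / (pos + neg) if pos + neg else None  # undefined if no emotion words

def subjectivity(words):
    """Share of all words in the document that are emotion words."""
    return sum(w in posemo or w in negemo for w in words) / len(words)

doc = ["the", "crowd", "was", "angry", "but", "peaceful"]
print(positivity(doc), subjectivity(doc))  # 0.5 and 1/3
```

The real LIWC dictionaries include word stems rather than exact forms, so the actual implementation will need prefix matching rather than set membership.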
#### Linguistic distance
A second goal is to quantify exactly how different the language is across each type of source. To answer this question, we will use two metrics:
- Cosine similarity between tf-idf document vectors. Here, we will treat all documents from each source/type of source as one entire document. From these documents, we can then construct the tf-idf matrix, and obtain distance scores between all pairings. We can also attempt to uncover any structure that is apparent in the raw textual similarity between sources by running a clustering algorithm (e.g., k-means) over the tf-idf vector space, where each source is treated as its own document.
- Recent developments in computational approaches to text have taken advantage of the power of distributed word representations, such as Stanford's GloVe (Pennington, Socher & Manning, 2014) and Google's Word2Vec (Mikolov et al., 2013). We can use these more nuanced representations in a similar approach to that described above, using techniques described by Kusner et al. (2015).
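The tf-idf cosine-similarity comparison can be sketched in a few lines; the toy per-source word lists are hypothetical stand-ins for the real concatenated corpora:

```python
import math
from collections import Counter

# Hypothetical per-source documents (all of a source's articles, tokenized).
docs = {
    "src1": "police shooting protest ferguson protest".split(),
    "src2": "protest ferguson community vigil".split(),
    "src3": "stock market earnings report".split(),
}

vocab = sorted(set(w for words in docs.values() for w in words))
n_docs = len(docs)
df = {t: sum(t in words for words in docs.values()) for t in vocab}  # doc frequency

def tfidf(words):
    """tf-idf vector over the shared vocabulary."""
    counts = Counter(words)
    return [counts[t] / len(words) * math.log(n_docs / df[t]) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

v1, v2, v3 = (tfidf(docs[s]) for s in ("src1", "src2", "src3"))
print(cosine(v1, v2), cosine(v1, v3))
```

In practice we would use a library implementation, but the pairwise similarity matrix this produces is the input to the clustering step described above.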
#### Named entity resolution/information extraction
One question of interest is the way in which different sources and articles refer to the actors in the event. For instance, Michael Brown and Darren Wilson may be referred to by pronouns, by their names, or by some other descriptive noun (e.g., teen, policeman). We will use hand-curated lists and named-entity extraction techniques to identify the ways in which different news organizations refer to these individuals. We will also do this to examine news organizations' references to the individuals engaged in protest on the streets of Ferguson in the days after the shooting.
In addition to retrieving the varied ways in which news organizations refer to these primary actors, we will also explore the types of descriptors that are used in reference to these actors (for instance, the frequency with which race or age are mentioned). We can also explore the use of adjectives describing these actors, especially for Michael Brown and Darren Wilson. The adjectives used in conjunction with these individuals can tell us about the way in which a particular cluster of media sources refers to these actors. Once we have a set of entity references for each individual, we can obtain the adjectives that occur in connection to these actors. We will run the obtained descriptors through the same sentiment analysis techniques described above, as well as explore the use of words in the LIWC categories for anger and anxiety, and the lexicon for abstractness, as described below.
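A simple illustration of the reference-counting step, using small hypothetical reference lists; the real lists will be hand-curated by research assistants and supplemented with named-entity extraction:

```python
import re
from collections import Counter

# Hypothetical hand-curated reference forms for the primary actors.
references = {
    "Michael Brown": ["michael brown", "the teen", "the teenager"],
    "Darren Wilson": ["darren wilson", "the officer", "the policeman"],
}

def count_references(text):
    """Count occurrences of each reference form per actor. Forms are counted
    independently, so overlapping forms would each be tallied."""
    text = text.lower()
    counts = {actor: Counter() for actor in references}
    for actor, forms in references.items():
        for form in forms:
            pattern = r"\b" + re.escape(form) + r"\b"
            counts[actor][form] = len(re.findall(pattern, text))
    return counts
```

For example, `count_references("The officer, Darren Wilson, shot the teen.")` tallies one "the officer" and one "darren wilson" for Wilson, and one "the teen" for Brown.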
We anticipate performing these tasks using Python's [Natural Language Toolkit](http://www.nltk.org/) and [Information Extraction framework](https://github.com/machinalis/iepy). However, it should be noted that methodology in this area is moving quickly, and the specific tools may be adapted, depending on needs and suitability.
#### Dictionary methods
One question of interest, as noted by the [Pew Research Center](http://www.people-press.org/2014/08/18/stark-racial-divisions-in-reactions-to-ferguson-police-shooting/), is how salient issues of race are to different groups. We can derive a lexicon of race-related words by first hand-curating a list of seed words that are directly related to race (e.g., "black," "white," "race," "ethnicity," "diversity"), and then appending words to the list that score high on a measure of Pointwise Mutual Information (Turney, 2002; Balasubramanyan et al., 2012; Bouma, 2009). More formally, we can select the top $N$ words $w$ from the vocabulary of our corpus, sorted according to the average race PMI:
\begin{equation}
racePMI_{w} = \frac{\sum_{{s}\in{R}} PMI(w, s)}{|R|}
\end{equation}
where $R$ is the list of seed words related to race.
We are not currently aware of any pre-developed race lexicons, but we could begin with a hand-curated list and then 'boost' it by identifying other similar words (see e.g., the PMI approach in Balasubramanyan et al., 2012).
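A sketch of the average race-PMI score defined above, using document-level co-occurrence over a toy corpus and seed list (both hypothetical); flooring non-co-occurring pairs at zero is a simplification for illustration:

```python
import math

# Hypothetical seed list and corpus (each document as a set of words).
seeds = {"race", "black", "white"}
docs = [
    {"race", "protest"},
    {"black", "protest"},
    {"market", "earnings"},
    {"white", "protest", "race"},
]
n = len(docs)

def p(*words):
    """Probability that a document contains all the given words."""
    return sum(all(w in d for w in words) for d in docs) / n

def pmi(w, s):
    joint = p(w, s)
    if joint == 0:
        return 0.0  # simple floor for pairs that never co-occur
    return math.log(joint / (p(w) * p(s)))

def race_pmi(w):
    """Average PMI of w with the race seed words."""
    return sum(pmi(w, s) for s in seeds) / len(seeds)
```

Words with the highest `race_pmi` would then be appended to the seed list to form the boosted lexicon.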
We will create lexicons for other constructs relevant to this topic. In particular, the distinction between individualism and egalitarianism has been previously highlighted as an important distinction in the way that race is covered by the press (Kellstedt, 2000).
We can also leverage preexisting lexicons, when available. For instance, previous work has suggested that there are differences in the abstractness of language used to describe ingroups versus outgroups (Gorham, 2006). Brysbaert, Warriner & Kuperman (2014) have provided a crowd-sourced lexicon of concreteness for 40k English lemmas that we can use to explore this issue.
Additionally, because we are focusing on one particular event, there are a number of specific terms and phrases that are of interest. For instance, before the encounter with Darren Wilson, Michael Brown stole cigarillos from a local convenience store. This fact led to discussion over whether the shooting was in part justified because Brown stole this item. We believe a simple normalized frequency (by article or by source) of mentions of the word *cigarillo* or *cigar* could highlight the degree to which this event was a focus of the news coverage. We will apply similar analyses to the other terms of interest, including *Graduation* and *High School* (reflecting his youth), and *Eric Garner*, *Trayvon Martin*, and *Ezell Ford* (reflecting the increase in awareness of police killings of unarmed black men).
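The normalized frequency measure can be sketched in one function; the example sentence is a hypothetical stand-in for real article text:

```python
def mentions_per_1000(text, term):
    """Mentions of `term` per 1,000 words of text. Substring matching means
    plural forms such as "cigarillos" also count, which is desirable here."""
    words = text.lower().split()
    return text.lower().count(term.lower()) / len(words) * 1000

# Hypothetical 10-word example: one mention -> 100 mentions per 1,000 words.
article = "witnesses said brown took cigarillos from the store counter shelf"
print(mentions_per_1000(article, "cigarillo"))  # 100.0
```

The same function applies unchanged to the other terms of interest, aggregated by article or by source.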
### Connecting network and text metrics
### Conclusion
Rapid growth in online news media permits unprecedented choice in what news content to consume and increasingly allows individuals to operate in different worlds of information. Each individual has their own set of 'accepted' beliefs, their own set of facts and trusted data sources, and even their own way of approaching and thinking through problems. These effects may be magnified by emerging "echo chambers" (e.g., Barberá, Jost, Nagler, Tucker, & Bonneau, 2015), especially those driven by political orientation or other group characteristics of media sources.
The analyses described here are the first step in a research program moving beyond studying individual-based patterns in media consumption, to instead focus on how the hypertextuality and thematic clustering of the media environment itself may reinforce these separate worlds. This project is largely exploratory. While there is published research that can guide us in terms of expected relationships between media clustering and the content of the coverage within these clusters, this work seeks to identify which of those relationships are the most robust. We intend to use the results of these analyses to devise a more formal experiment featuring random assignment of participants to experimental conditions, which will allow us to (1) test the reliability of any notable findings we obtain in this initial exploration, and (2) study the psychological impact of any differences we find in coverage of racialized events. We will preregister this second study in due time. Our primary goal in registering the current project is to outline our general research questions and the types of analyses we plan to use to address these questions.
### References
Bakshy, E., Messing, S., & Adamic, L. A. (2015). Exposure to ideologically diverse news and opinion on Facebook. Science, 348(6239), 1130–1132. doi:10.1126/science.aaa1160
Balasubramanyan, R., Cohen, W. W., Pierce, D., & Redlawsk, D. P. (2012). Modeling Polarizing Topics: When Do Different Political Communities Respond Differently to the Same News?. In ICWSM.
Barberá, P., Jost, J. T., Nagler, J., Tucker, J. A., & Bonneau, R. (2015). Tweeting From Left to Right: Is Online Political Communication More Than an Echo Chamber? Psychological Science, 26(10), 1531–1542. doi:10.1177/0956797615594620
Bouma, G. (2009). Normalized (pointwise) mutual information in collocation extraction. Proceedings of GSCL, 31-40.
Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3), 904-911.
Deuze, M. (2003). The Web and its Journalisms: Considering the Consequences of Different Types of Newsmedia Online. New Media & Society, 5(2), 203–230. doi:10.1177/1461444803005002004
Godbole, N., Srinivasaiah, M., & Skiena, S. (2007). Large-Scale Sentiment Analysis for News and Blogs. ICWSM, 7(21), 219-222.
Gorham, B. W. (2006). News Media’s Relationship With Stereotyping: The Linguistic Intergroup Bias in Response to Crime News. Journal of Communication, 56(2), 289–308. doi:10.1111/j.1460-2466.2006.00020.x
Hutto, C. J., & Gilbert, E. (2014). Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International AAAI Conference on Weblogs and Social Media.
Iyengar, S., & Hahn, K. S. (2009). Red Media, Blue Media: Evidence of Ideological Selectivity in Media Use. Journal of Communication, 59(1), 19–39. doi:10.1111/j.1460-2466.2008.01402.x
Karpinski, A., & Von Hippel, W. (1996). The Role of the Linguistic Intergroup Bias in Expectancy Maintenance. Social Cognition, 14(2), 141–163. doi:10.1521/soco.1996.14.2.141
Kellstedt, P. M. (2000). Media framing and the dynamics of racial policy preferences. American Journal of Political Science, 44, 245-260.
Kusner, M. J., Sun, Y., Kolkin, N. I., & Weinberger, K. Q. (2015). From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning, 957-966.
Maass, A., Salvi, D., Arcuri, L., & Semin, G. R. (1989). Language use in intergroup contexts: The linguistic intergroup bias. Journal of Personality and Social Psychology, 57(6), 981–993. doi:10.1037/0022-3514.57.6.981
Messing, S., & Westwood, S. J. (2014). Selective Exposure in the Age of Social Media: Endorsements Trump Partisan Source Affiliation When Selecting News Online. Communication Research, 41(8), 1042–1063. doi:10.1177/0093650212466406
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111–3119.
Mitchell, A., Gottfried, J., Kiley, J., & Matsa, K. E. (2014, October 21). Political Polarization & Media Habits. Retrieved from http://www.journalism.org/2014/10/21/political-polarization-media-habits/
Moody, J., & Mucha, P. J. (2013). Portrait of political party polarization. Network Science, 1, 119-121.
Mullainathan, S., & Shleifer, A. (2005). The market for news. The American Economic Review, 95(4), 1031-1053.
Pariser, E. (2011). The Filter Bubble: How the New Personalized Web Is Changing What We Read and How We Think. Penguin.
Pennebaker, J. W., Booth, R. J., & Francis, M. E. (2007). Linguistic inquiry and word count: LIWC [Computer software]. Austin, TX.
Pennington, J., Socher, R., & Manning, C. D. (2014, October). GloVe: Global Vectors for Word Representation. In Proceedings of EMNLP, 1532–1543.
Perdue, C. W., Dovidio, J. F., Gurtman, M. B., & Tyler, R. B. (1990). Us and them: Social categorization and the process of intergroup bias. Journal of Personality and Social Psychology, 59(3), 475–486. doi:10.1037/0022-3514.59.3.475
Pew Research Center (2014). Political Polarization in the American Public. Retrieved from http://www.people-press.org/2014/06/12/political-polarization-in-the-american-public/
Pew Research Center. (2015). State of the news media 2015. Retrieved from: http://www.journalism.org/files/2015/04/FINAL-STATE-OF-THE-NEWS-MEDIA1.pdf
Semin, G. R., & Fiedler, K. (1988). The cognitive functions of linguistic categories in describing persons: Social cognition and language. Journal of Personality and Social Psychology, 54(4), 558–568. doi:10.1037/0022-3514.54.4.558
Shizuka, D., & Farine, D. R. (2016). Measuring the robustness of network community structure using assortativity. Animal Behaviour, 112, 237–246. doi:10.1016/j.anbehav.2015.12.007
Turney, P. D. (2002, July). Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 417-424.
von Hippel, W., Sekaquaptewa, D., & Vargas, P. (1997). The Linguistic Intergroup Bias As an Implicit Indicator of Prejudice. Journal of Experimental Social Psychology, 33(5), 490–509. doi:10.1006/jesp.1997.1332