Dataset Construction on 1981 Riots
The construction of governmental and public narratives about race riots has partly relied on analyzing and presenting quantifiable data drawn from police reports and court hearings. By critically reviewing how the data on the riots of 1981 was constructed and analyzed, I begin to trace how past and contemporary discourses on riots in the U.K. have primarily reinforced the perspectives of the U.K. Government of the 1980s, as reflected in the data sources.
Contributing a “geographical perspective on the 1981 urban riots in England”, Ceri Peach, a geographer at St. Catherine’s College, Oxford, statistically investigated the aggregate data on arrests during the race riots of 1981, publishing his analysis five years after the events. Peach argued that the rioting did not correlate with census-type measures, but instead with a less measurable factor: the “pervasive racism in British society, particularly in its manifestation in the relationship between the police and Afro-Caribbean youth.”[58]
Peach used the “brief and standardized” data on incidents reported by 25 police authorities across the UK, including the Metropolitan Police in Brixton.[59] The analysis categorized arrests by: total number of arrests, and the percentages of those arrested who were female, under 17, aged 17-20, not ‘white’, not local, with a Criminal Records Office record, and unemployed (see Figure 1). Peach’s summary of the arrest statistics concluded that “the young, the black, the unemployed, the male, and those with police records were the most highly represented among those arrested” during the riots of 1981.[60] When Peach analyzed the impact of each category separately, unemployment did not correlate strongly with riot arrests.[61] Importantly, the research concluded that the rioting of 1981 “seems more likely to be the product of relations between the police and the Afro-Caribbean population than of the correlation with census-type variables of deprivation,” a view that differed from governmental officials’ responses to the riots of 1981.[62]
Peach’s conclusion pointed to police racism as a primary cause of the rioting of 1981, though his inferences appear limited because the underlying data was produced only by local police authorities. The analysis does not incorporate the perspectives of the rioters, and, importantly, it does not investigate data on the individuals and groups within the police forces in charge of arresting the rioters. The work was published at a time when the Government publicly denied the presence of institutional and police racism and constructed a negative perception of rioters, primarily Afro-Caribbean males, as criminals, while portraying police officers as victims deserving of respect and support.[63]
Figure 1. “Table 1. Persons arrested in serious incidents of public disorder, by police force area and types of person,” published by Peach in “A geographical perspective on the 1981 urban riots in England” (p. 400).
Dataset Construction on 2011 Riots
The Guardian newspaper published a project titled “U.K. riots: every verified incident. Download the full list”,[64] a work of data journalism on the riots of 2011. It allows us to compare how various types of data on the London riots of 2011 and the Brixton riots of 1981 were collected, analyzed, and narrated. Two issues arise when trying to access the project: (1) the interactive map of riots was unavailable on the Guardian’s website when checked between August 2022 and October 2022 (Figures 2-3), and (2) the governmental data on the police reports was removed from public view, though it was previously reported that the U.K. government made such data accessible for The Guardian’s analysis and public use in 2011.[65]
Figure 2. Screenshot of “The riots animated by Nick Evershed”, the animation which is not functioning as of October 29, 2022.
Figure 3. Screenshot of “UK riots: every verified incident - interactive map,” the data-driven map which is not functional as of October 29, 2022.
We can retrieve one dataset that The Guardian has preserved within the same publication: a Google Sheet titled “UK RIOT LOCATIONS”, which can be accessed via the following link:
https://bit.ly/UK_Riot_Locations_Guardian-2011
Dataset Description
The United Kingdom experienced a series of riots from August 4 to August 10, 2011, beginning in London and spreading across the country.[66] The public in London began protesting against Tottenham Police following the death of Mark Duggan, a 29-year-old Black British man who was shot by the Metropolitan Police on August 4.[67] The Guardian’s dataset includes 245 records of places where rioting occurred between August 4 and August 10, with each record including the following variables:
Variable Name | Variable Description
Time (Approx), if known | Time in the format HH:MM:SS, indicating the start of the event.
Date | Date in the format Month, Day, Year.
Day (from noon to 6am) | Day of the week when the event was recorded.
Place | Place where the event occurred (e.g., name of the store/building/facility).
Location details | The detailed address of the location (street number, county, city/town, postcode).
Local authority | Name of the authority operating in the district.
What happened? | Brief description of what happened at the location, including the damage done there.
Source | The source and evidence confirming the event.
Picture links | Links to pictures taken at the location during the event.
Icon name | The name of the event icon on the map, used for mapping purposes.
We cannot identify how exactly the records on riots were populated into the spreadsheet, as The Guardian did not publicly document this process. Neither the article nor the spreadsheet describes how the dataset was constructed (e.g., who the contributors were who defined the time of each event, wrote the “What happened” descriptions, and identified the sources for a particular event), thus failing to represent the data transparently. The public working with this data can easily view it as objective and complete, and, when presenting it via visualizations or maps, further reinforce the biases of the individuals who initially collected the data and constructed these data sources.
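One way to make such gaps visible rather than implicit is a quick completeness audit of the spreadsheet before any transformation. The sketch below uses a small, hypothetical sample shaped like the Guardian data (the values are invented for illustration) and computes the share of populated cells per column:

```python
import pandas as pd

# toy rows shaped like the Guardian spreadsheet; the values are
# hypothetical and only illustrate the structure of the audit
df_sample = pd.DataFrame({
    "Place": ["Electronics store", "Supermarket", "Sports shop"],
    "Local authority": ["Haringey", "Salford", None],
    "Source": ["https://example.org/report", None, None],
})

# share of populated (non-missing) cells per column, as a percentage
completeness = df_sample.notna().mean().mul(100).round(1)
print(completeness)
```

Run on the full dataset, such an audit quantifies how far the spreadsheet is from “complete” before any map or visualization is built on top of it.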
Loading and Cleaning the Data Source on 2011 Riots
The intention behind loading and transforming this dataset is to critique how it was constructed and then approach the same data on riots and rioters more ethically, emphasizing not just the physical damage the rioters did but also accounting for the possible motivations behind the rioting in the first place. Because this dataset primarily documents property damage in its “What happened” variable, and because additional publicly available datasets on the riots of 2011 are lacking, I assume my transformation of this data will be rather limited. However, the process I use should be transferable to critiquing and improving similarly constructed data sources on riots in the United Kingdom.
I decided to perform two data transformations: one analyzing the data on riots for each local authority separately, and the other flagging the availability or absence of evidence behind each record in the dataset. Keeping in mind that I might want to share the data publicly and visualize it interactively for comparing different subsets, I decided to use Google’s Data Studio and BigQuery for documenting and visualizing the processed data later. Another feasible, publicly accessible option would be posting the data, the transformations, and the project documentation on GitHub; however, this option is less desirable if I intend to include visualizations and create relationships (e.g., schemas and variable relations) between variables while working with multiple subsets of data (e.g., the original data table and the tables for each district (local authority) where riots occurred).
The following pages outline the Python code and guidance for the data transformation, showing how I approached cleaning the data for my research purposes while critiquing how it was constructed.
Below I start implementing two data transformations:
- Zooming-In on the Districts (Local Authorities)
- Rationale: to analyze the records on riots in a specific local context rather than the larger national context, as has been done so far
- Dividing the Dataset into two parts - with sources available (supporting the records) and sources missing (to be added as the evidence behind the record)
- Rationale: to critique the ‘objective’ portrayal of data in the Guardian’s dataset; to separate the records of riots that are and are not evidence-based into two separate tables
# importing packages for transforming the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# loading the initial dataset produced by The Guardian
df_init = pd.read_csv('1. UK Riot Locations_initial.csv')
df_init.head(5)
# defining a new dataset to transform the data
# keeping all columns besides "Icon name" - a variable created by The Guardian for mapping purposes
df = pd.DataFrame(df_init, columns=['Time (Approx), if known', 'Date',
                                    'Day (from noon to 6am)', 'Place',
                                    'Location details', 'Local authority',
                                    'What happened?', 'Source', 'Picture links'])
# renaming columns to fit schema requirements in Google BigQuery, and for better reusability in general
# BigQuery requires column names without spaces or special characters
df = df.rename(columns={"Time (Approx), if known": "Timestamp",
                        "Day (from noon to 6am)": "Day",
                        "Location details": "Address",
                        "Local authority": "Authority",
                        "What happened?": "EventDescription",
                        "Source": "Source1",
                        "Picture links": "PhotoLink"})
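The renaming above is done by hand; for reuse with other similarly constructed spreadsheets, a small helper can generalize the rule. The function below is a hypothetical sketch (not part of the original workflow) that rewrites any column name into the letters-digits-underscores form BigQuery column names traditionally require:

```python
import re

def bigquery_safe(name: str) -> str:
    """Sketch of a BigQuery-safe column name: letters, digits, and
    underscores only, not starting with a digit."""
    # replace every disallowed character with an underscore
    safe = re.sub(r"[^0-9A-Za-z_]", "_", name)
    # collapse runs of underscores and trim them from the ends
    safe = re.sub(r"_+", "_", safe).strip("_")
    # prefix names that would otherwise start with a digit
    if safe and safe[0].isdigit():
        safe = "_" + safe
    return safe

print(bigquery_safe("Time (Approx), if known"))  # Time_Approx_if_known
```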
Data Transformation 1: Observing the number of authorities (local districts) and creating separate tables for each district.
# there might be missing data! Substituting float-type 'NaN' values with string-type "None"
# to avoid data-type errors when transforming the data later
df["Authority"].fillna("None", inplace=True)
# getting all the possible authority names
authorities = df['Authority'].unique()
# viewing the authorities; note - there's also a "None" value - the authority is missing for some records!
authorities
# converting the numpy array into a list, to allow iterating through it later
all_authorities = authorities.tolist()
# the number of authority-specific datasets we will get as of now
print('The number of unique authorities is', len(all_authorities))
# sorting the authorities alphabetically
all_authorities_sorted = sorted(all_authorities)
# viewing the list to look for any errors
all_authorities_sorted
Errors identified:
1. Typo: 'Gloucstershire' should be 'Gloucestershire'
2. Research required: 'Salford' and 'Salford City Council' might indicate one authority
3. Research required: 'Wolverhampton' and 'Wolverhampton City Council' might indicate one authority
4. Research required: 'Waltham Forest' and 'Walthamstow' might indicate one authority
After checking with the official list of local authorities in the United Kingdom (https://www.local.gov.uk/our-support/guidance-and-resources/communications-support/digital-councils/social-media/go-further/a-z-councils-online), we can define:
- 'Salford' for both 'Salford' and 'Salford City Council'
- 'Wolverhampton' for both 'Wolverhampton' and 'Wolverhampton City Council'
- 'Walthamstow' for both 'Waltham Forest' and 'Walthamstow'
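A check like this can also be automated: comparing the cleaned values against a reference list of authority names flags typos such as 'Gloucstershire' for manual review. The sketch below uses a hypothetical, truncated reference list; in practice, the full A-Z list of councils from local.gov.uk would be used:

```python
# hypothetical, truncated reference list of authority names;
# in practice, use the full A-Z list of councils from local.gov.uk
official = {"Gloucestershire", "Salford", "Haringey"}

cleaned = ["Gloucstershire", "Salford", "Haringey"]

# flag any cleaned value absent from the reference list
unmatched = sorted(set(cleaned) - official)
print(unmatched)
```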
# substituting any NaN (missing) value with "None" for further analyses with string data types
df.fillna("None", inplace=True)
# fixing errors 1-4 identified above (renaming the values)
df['Authority'].replace({"Gloucstershire": "Gloucestershire",
                         "Salford City Council": "Salford",
                         "Waltham Forest": "Walthamstow",
                         "Wolverhampton City Council": "Wolverhampton"}, inplace=True)
# checking the number of authorities now
authorities_count = len(df['Authority'].unique())
print('The number of unique authorities is', authorities_count)
# getting all authorities sorted alphabetically
authorities = df['Authority'].unique()
all_authorities = authorities.tolist()
all_authorities_sorted = sorted(all_authorities)
all_authorities_sorted
After fixing the errors, we now have 43 unique authorities instead of the initial 47. We can view how many records there are for each authority; this corresponds to the number of individual tables we will be creating. However, a particular data limitation should be noted: several boroughs within the London area are listed separately, though they could potentially be clustered together, since London has a higher density of boroughs than the rest of the country. Some of these boroughs, such as Islington and Westminster, are located geographically next to each other and together account for 4 riot records that might have happened in close proximity to one another (2 events per district, potentially drawing the public living in both districts). While I do not proceed to cluster these districts yet, doing so is one potential suggestion for future work with the data.
# viewing the number of records per authority
df['Authority'].value_counts()
# visualizing the frequencies of records per authority, for the top 20 values
df['Authority'].value_counts().head(20).plot(kind='barh', color='olivedrab')
Since these tables will be stored separately in the Google BigQuery environment, we can save them as csv files that can later be ingested into the database/data warehouse environment for analysis, more systematic documentation of the processes, and visualization.
# creating a csv file per authority
for i in all_authorities_sorted:
    authority_name = str(i)
    table = df[df['Authority'] == authority_name]
    table.to_csv(authority_name + '_Riot_Records_2011' + '.csv')
# testing whether the files have been saved successfully by opening some of the saved csv records
# Haringey - top 3 records
pd.read_csv('Haringey_Riot_Records_2011.csv').head(3)
# Birmingham - top 3 records
pd.read_csv('Birmingham_Riot_Records_2011.csv').head(3)
# Manchester - top 3 records
pd.read_csv('Manchester_Riot_Records_2011.csv').head(3)
Data Transformation 2: Separating the dataset into records with sources on the events and records without sources (to motivate finding sources/evidence for the records that are missing them).
# creating a data frame of records that include sources
sources_available = df[df['Source1'] != 'None']
print('The number of records supported with sources is', len(sources_available))
# briefly looking into the sources referenced in the dataset
sources_available['Source1'].value_counts()
We can see that in the 'Source1' column, “Agency reports” is referenced 16 times, though the name of the specific agency and a link to the agency’s source are missing. Articles from the Birmingham Mail, The Guardian, and Thames Valley Police are referenced 11 times each. As of now, the Birmingham Mail and Thames Valley Police articles are no longer viewable. Having these sources support so many of the event records implies that the records reflect a particular perspective on the riots and rioters, which may limit more complex historical and social analyses of the data on these events.
# creating a data frame of records that are missing sources
sources_missing = df[df['Source1'] == 'None']
print('The number of records not supported with sources is', len(sources_missing))
# viewing the missing sources per authority
sources_missing['Authority'].value_counts()

# visualizing the frequencies of missing sources per authority
sources_missing['Authority'].value_counts().plot(kind='barh', color='olivedrab')

# viewing the missing sources per date
sources_missing['Date'].value_counts()

# visualizing the frequencies of missing sources per date
sources_missing['Date'].value_counts().plot(kind='barh', color='olivedrab')
The records supported with sources amount to 225, while the records without evidence amount to 20.
Stratifying the data by “Date” shows that most of the records missing sources were recorded on August 7, 2011; stratifying by “Authority” shows that Haringey has the most records with missing sources (N = 7), followed by Lambeth, Enfield, and Walthamstow (N = 3 each).
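Because authorities differ widely in their total number of records, raw counts of missing sources can mislead; dividing by each authority's record count gives a missing-source rate instead. The sketch below runs on toy data (hypothetical values, reusing the column names defined earlier):

```python
import pandas as pd

# toy records: Authority plus Source1, where "None" marks a missing source
toy = pd.DataFrame({
    "Authority": ["Haringey"] * 10 + ["Enfield"] * 4,
    "Source1": ["None"] * 3 + ["link"] * 7 + ["None"] * 2 + ["link"] * 2,
})

# share of records per authority whose source is missing
missing_rate = (toy["Source1"] == "None").groupby(toy["Authority"]).mean()
print(missing_rate)
```

On this toy sample, Haringey has more missing sources in absolute terms but a lower missing-source rate than Enfield, illustrating why normalization matters.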
Due to such limited sample sizes, we may not be able to reach significant inferences about causality; however, Haringey’s high number of missing sources is likely explained by it being the district with the highest number of records overall. We can now save these tables as csv files as well, to be later ingested into database systems for the digital humanities project:
table = sources_available
table.to_csv('Sources_Available_Riot_Records_2011.csv')
table = sources_missing
table.to_csv('Sources_Missing_Riot_Records_2011.csv')
Limitations of the Data Transformation Approaches
As I aim to motivate two arguments during the data transformation process, one focusing on subsets of data by local authority and one critiquing the supporting evidence behind the records, below I raise the limitations of this approach in relation to both arguments. These limitations fall under the critique of the data source as an objective source of truth: the subjective beliefs of the data source’s creators can influence both the local authority subsets and the sources of evidence. The data cleaning approach risks reinforcing these beliefs when it extracts only the records present in this dataset and lacks exploratory data analysis of the “Authority” and “Source” values.
Because the official documentation of how The Guardian constructed the dataset, and of the decision-making criteria involved, is missing, it is impossible to know which local authority records might have been excluded from it. When the dataset is divided into local authority subsets, we should further investigate whether other records on 2011 riot incidents in the corresponding authority exist in other publicly available and archived sources, as The Guardian journalists might have focused primarily on a national-level view of these events at the expense of records valuable for analyzing riot incidents in a particular local authority context. By working only with the local authority data from the given source, we would be reinforcing one undocumented approach to representing the data on riots nationally, which is limiting when trying to investigate the local contexts of events. To address this limitation, we can offer a selection of other available data sources pertinent to the 2011 riots in each local authority where riot incidents were recorded.
Another limitation related to the local authority records is that some of them may be part of a larger metropolitan area; although they are formally defined and divided as separate, a closer geospatial look may motivate us to cluster some local authorities together, due to their proximity and the intersections between the social services that communities of residents interact with across local authorities. A practical example is the London metropolitan area, which contains authorities such as Lambeth (including Brixton), Haringey (including Tottenham), Westminster, and more; here we can identify clusters of riot incidents occurring in different authorities that potentially share the same characteristics, such as historical actors and causes behind the rioting. In preparing to analyze the data on each local authority separately, we do not consider identifying such clusters of local authorities, which could strengthen data inferences about the causes of the riots in communities across the United Kingdom.
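As a sketch of the clustering idea (with a hypothetical grouping; a real one would need geographic and administrative justification), adjacent boroughs can be rolled up via a mapping before counting records:

```python
import pandas as pd

# hypothetical cluster mapping: adjacent central-London boroughs grouped
# together; any authority not listed simply maps to itself
clusters = {"Islington": "Central London cluster",
            "Westminster": "Central London cluster"}

records = pd.DataFrame({"Authority": ["Islington", "Westminster",
                                      "Islington", "Haringey"]})
records["Cluster"] = records["Authority"].map(clusters).fillna(records["Authority"])
print(records["Cluster"].value_counts())
```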
An additional limitation concerns the incomplete representation of the “Source” variable in the dataset. While I focus on flagging the empty cells of the “Source” variable (marked with the string value “None”), a closer look at the non-empty cells reveals that some source descriptions include only the titles of sources (e.g., “Agency reports”) and lack a proper reference or link, and thus lack credibility. We need to differentiate between “Source” values that contain a link, making them accessible and more credible evidence, and values that lack one. The current approach does not account for such discrepancies within the non-empty “Source” cells and would benefit from a way to flag less credible sources, e.g., by defining a new variable “Source_Classified” that preserves the format of the old records in “Source” but extends it with a classification such as “Missing Source”/“Missing Link, Source Name Only”/“Source with a Link.”
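A minimal sketch of the proposed “Source_Classified” variable, assuming that an accessible link can be detected by an 'http' prefix within the cell:

```python
import pandas as pd

def classify_source(value: str) -> str:
    # classify a Source1 cell into the three proposed categories
    if value == "None":
        return "Missing Source"
    if "http://" in value or "https://" in value:
        return "Source with a Link"
    return "Missing Link, Source Name Only"

demo = pd.DataFrame({"Source1": ["None", "Agency reports",
                                 "https://www.theguardian.com/uk"]})
demo["Source_Classified"] = demo["Source1"].apply(classify_source)
print(demo["Source_Classified"].tolist())
```

The original “Source1” column is preserved; the new column only adds the credibility flag, as proposed above.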
Summarizing the Data Transformation
In this section, we completed the first important step of working with the data - by intentionally transforming it and considering data cleaning objectives to prepare it for testing of particular historical analysis questions. In the next section, I explore how we can interactively visualize such data, utilizing the affordances of its variables for more effective visual and persuasive communication with our intended project audiences.
Footnotes
[58] Peach, Ceri. "A geographical perspective on the 1981 urban riots in England." Ethnic and Racial Studies 9, no. 3 (1986): 396–411. doi:10.1080/01419870.1986.9993541.
[59] Ibid.
[60] Ibid.
[61] Ibid.
[62] Ibid.
[63] van Dijk, Teun A. "Race, riots and the press: An analysis of editorials in the British press about the 1985 disorders." Gazette (Leiden, Netherlands) 43, no. 3 (1989): 229–253. https://doi.org/10.1177/001654928904300305.
[64] Rogers, Simon, and Lisa Evans. "UK riots: every verified incident. Download the full list." The Guardian, 2011. https://www.theguardian.com/news/datablog/2011/aug/09/uk-riots-incident-listed-mapped.
[65] Rogers, Simon, and Lisa Evans. "UK riots: the demographics of magistrate cases and convictions." The Guardian, 2011. https://www.theguardian.com/news/datablog/2011/aug/11/uk-riots-magistrates-court-list.
[66] “Riots 10 years on: The five summer nights when London burned.” BBC, 2021. https://www.bbc.com/news/uk-england-london-58058031.
[67] Ibid.