Transforming and Visualizing Data


I continue working with The Guardian’s dataset on the 2011 riot incidents in the United Kingdom. After cleaning and transforming the original dataset source, I use the transformed dataset to create an interactive dashboard in Google’s Data Studio (Figure 4), accessible via the link - Final Capstone Project Dashboard.


Figure 4. The interactive Data Studio dashboard for investigating The Guardian's Data Records on the 2011 Riot Incidents in the United Kingdom (Link).


Below I summarize the data transformation steps that resulted in current differences between the two versions of the dataset:
  1. I renamed the “Authority” variable values after identifying and fixing spelling errors in the names of local authorities, referring to the official sources[68] with the lists of names of all authorities in the United Kingdom.
  2. I removed the “Icon name” column, as The Guardian likely created it for mapping purposes, visually defining the data points (1 point as “capital_small,” 243 points as “placemark_circle,” and 1 point as “placemark_circle_highlight”). As such a variable is not coming from the primary source data I focus on, and as the documentation on the creation of such a variable is missing, I will not be using it for the purposes of visualizing data in this assignment.
  3. I substituted empty cells in the “Source” and “Picture links” columns with a string value of “Missing Data”, to emphasize the records that lack credibility, as The Guardian refers to the dataset as including “every verified incident”[69] of 2011 riots, though its entries are missing the “Source” value to verify them.
  4. I geocoded the “Location details” variable values by using the extension “Geocode by Awesome Table,” which extended the dataset with corresponding “Latitude” and “Longitude” values. Such transformation was necessary to represent the data geospatially in the dashboard visualization. During this process, 1 out of 245 entries geocoded incorrectly (“Location details” with “Chatham, Kent”) (placing the coordinates in Canada!); I resolved this issue by also adding the country name in the cell, and thus 245 out of 245 entries were geocoded successfully.
  5. As two other columns of “Time” and “Authority” had empty cells, I substituted them with a “Missing Data” string value as well, to transparently communicate that such data is missing for some entries, which should be accounted for and emphasized when making inferences about the data in this source.
  6. I renamed some of the original variables to avoid running into ingestion data errors in Google’s Data Studio and BigQuery, as these tools require ingesting variables with a one-word name length. I also renamed other variables to follow a more clear and concise naming convention. In the table below I document the updates made to the original source variables, a rationale for such updates, and variable descriptions:


Original Variable NameCurrent Variable NameRationale for UpdatingVariable Description
Time (Approx), if knownTimestampGoogle BigQuery and Data Studio require 1-word variable names; the representation of time as HH:MM:SS can be better defined as a timestamp. The timestamp of the incident (in a “HH:MM:SS” format).
DateDate

The date of the incident (from April 4 to April 10, 2011).
Day (from noon to 6am)DayGoogle BigQuery and Data Studio requirements.The day of the week when an incident happened.
PlacePlace

The place where the incident happened (e.g., a grocery store, the bus station, etc.).
Location detailsAddressGoogle BigQuery and Data Studio requirements; “Address” is more representative of the data entries, which include street names, building numbers, postal codes, etc.The address of the incident.
Local authorityAuthorityGoogle BigQuery and Data Studio requirements.Name of the local authority in which the incident occurred.
What happened?EventDescriptionGoogle BigQuery and Data Studio requirements; “EventDescription” is a more practical name for defining data in BigQuery and Data Studio schemas.  Description of the incident.
SourceSource1Keeping in mind that only one Source column is present, I wanted to extend the dataset with columns Source1, Source2, and Source3 to motivate the review of the dataset and input of evidence, to support and challenge the perceived credibility of each record. The Source2 and Source3 columns are proposed to be added later to continue working with this dataset.Link to the primary source evidence.
Picture linksPhotoLinkGoogle BigQuery and Data Studio requirements; changed “Picture” to “Photo” as it might associate more with the primary photo evidence of incidents in sources such as news articles.Link to the primary photo evidence.
Icon name
The variable was removed due to not representing primary data on the incidents.


EventIDA new variable to refer to the records via assigned ID.The ID value assigned to a record (from 1 to 245, with the date and timestamp sorted chronologically).

LatitudeA new location variable.The computed value of the Latitude of the “Address.”

LongitudeA new location variable.The computed value of the Longitude of the “Address.”


Table 1. Representing changes in the names of variables from the original dataset and the descriptions of variables in the transformed dataset.

Motivating the Dashboard Data Visualization


After identifying the uses of different data variables, particularly for spatial mapping and visualizing changes in data over time, we effectively transformed the primary dataset on 2011 riot incidents. We can further note several limitations of representing the data only in a current tabular/dataset format:
  • Lacking Evidence: 2 of the data records are not supported with any evidence (in “Source1”), though they are presented in a dataset along with more credible, evidence-based records (with the links to articles, police records, etc.). The columns such as “PhotoLink”, “Authority”, “Timestamp” also have missing data.
  • Lacking Geospatial Representation: Tabular representation of data does not effectively represent spatial differences in the records, which can be mapped and further explored by visualizing locations behind records, data clusters, etc.
  • Lacking Representation of Change over Time: As records include dates from April 4 to April 10, 2011, we can benefit from looking at the change over time in all records, as well as the filtered records (e.g., by “Authority”), which we fail to grasp by only looking at a tabular representation of such data.
  • Lacking Explorability on the Levels of Authority/Date/Time/Source: To explore different historical arguments pertinent to the data, it can be especially useful to analyze the records by filtering them on separate levels - looking into select local authorities, dates and timestamps, and the data sources.

I thus note how these issues can be addressed in a form of an interactive dashboard with multiple data visualizations and an option to cross-filter data between them:
  • Conditional Formatting: The possibility of incorporating tabular, conditional formatting in a dashboard can visually emphasize missing data. While such formatting can be achieved via Google Spreadsheets,[70] dashboards allow to cross-filter of multiple visualizations, connecting a tabular representation of data with its corresponding map representation, charts, and other visualizations.[71] We can conditionally format the “Source1”, “PhotoLink,” as well as “Authority” and “Timestamp” values by flagging their missing record values in red.
  • Geospatial Chart: The dashboard can include a map of data points, given that we have the values of Latitude and Longitude; when we select a data point on the map, we can display its information in other dashboard figures such as charts and tables (e.g., representing the record’s variables in a table format, as it might not be possible to visualize all of the available record’s data on a map).
  • Time Series Chart: We can create a time series histogram that displays how the number of records change between August 4 and August 11, 2011; once we select one of the bars on the histogram, the other dashboard component can be cross-filtered - e.g., the geospatial chart will only map the records with the exact value of date selected from a histogram.
  • Filtering by: To investigate the data on multiple variable levels, we can create filters for “Authority,” “Date,” “Time,” and “Source” variables, and once the records are customized based on these filters, the proposed visualizations above will display only the data that is filtered by select values of these variables.

Keeping in mind such use of the dashboard functionalities, I create an interactive visualization, accessible via Final Capstone Project Dashboard; it  includes five pages (besides documentation) - the first page being the landing page describing this Capstone Project (Figure 5).



Figure 5. The interactive Data Studio dashboard landing page of the Capstone Project (Link).

The second page of the dashboard (Figure 4, on p. 46) utilizes the data transformation we have completed so far, and ingests and visualizes such data in Google’s Data Studio. The remaining three pages of the dashboard visualize the data we will be exploring in the next two sections - investigating (1) the Entity Sentiment Analysis for the riot event “Description” variable, (2) the arrests and interviews records of the 2011 riots in the United Kingdom, and (3) the Entity Sentiment Analysis for the “Interview Response” variable.

Dashboard Documentation


Below I outline the dashboard documentation for the second dashboard page investigating The Guardian's Data Records on the 2011 Riot Incidents in the United Kingdom.  This documentation is also user-facing, on the dashboard’s Page 6 and Page 7.

1. About Data Source:

The data ingested into the dashboard comes from The Guardian's datablog on the 2011 riots in the United Kingdom. It is publicly available via a spreadsheet containing 245 records of riot incidents from April 4 - April 10, 2011. We transform the source into a new project dataset by fixing the errors in the namings of "Authorities," substituting empty cells with "Missing Data" values, and geocoding the "Address" values to visualize the data spatially. We establish a Data Studio connection with a project dataset by clicking on "Add data" and selecting the corresponding dataset from Google Sheets.



Figure 6.
Page 1 of dashboard documentation including description of the data source, a way to connect it to Data Studio, and steps to produce Filters and Figure 1.


2. Steps to Produce:

(a) Filters (Objective: to achieve higher explorability of data on the levels of select authorities, dates, and sources):
  1. To filter data in the dashboard by Authority, Date, Time, and Source, we select "Add a control" and "Drop-down list."
  2. In the "Setup" section, we input a corresponding filter variable in the "Control field"; for "Metric," we choose "Record Count," marking "Show values," and for "Order," we select "Dimension" and "Ascending" to display the values alphabetically or chronologically.
  3. In the "Style" section, we select the "Enable search box" to search for records of interest inside the filter boxes.



Figure 7. Page 2 of dashboard documentation including steps to produce Figures 2-3.



(b) Figure 1 (Objective: to represent the data spatially, allowing the user to view clusters of incidents across the country):
  1. We click on "Add a chart" and select the "Bubble Map" under the "Google Maps" section.
  2. In the "Setup" section, under "Fields," we input the "LatLon" location variable, which we compute by merging the dataset's Latitude and Longitude fields; under "Tooltip," we input the variable "Place," which will be shown once you click on the data points on the map; under "Chart interactions" we enable "Cross-filtering," which will sync actions of the user on the map to other visualizations - displaying the selected data point on the histogram of Figure 2 and table of Figure 3.
  3. In the "Style" section, we select "Map" under the "Background Layer" field, define "Style" as "Dark," and "Layer Type" as "Bubbles"; under "Map Controls," we mark "Allow pan and zoom" to be able to zoom in and out on the map.

(c) Figure 2 (Objective: to represent and compare change over time in the count of records):
  1. We click on "Add a chart" and select the "Time Series Chart" under the "Time Series" section.
  2. In the "Setup" section, under "Dimension," we input the "Date" variable; we select "Count" under "Metric" and enable "Cross-filtering" for this figure, too - once a user clicks on the bar of the histogram (a specific count of records per date), the map will zoom in on the corresponding location automatically, both map and the table will display the records behind the selected histogram bar.
  3. In the "Style" section, we select "Bars" under "Series #1" and mark "Show data labels"; under "Axes," we mark "Show axes"; and under both "Left Y-Axis" and "X-Axis" we select "Show Axis Title."

(d) Figure 3 (Objective: to extend data with a "Description" variable and highlight the records with missing evidence):
  1. We click on "Add a chart" and select "Table" under the "Table" section.
  2. In the "Setup" section, we input variables of "Date," "Time," "Authority," "Place," "Description, "Source," and "Photo"; we choose "Ascending" under both "Sort" (input "Date") and "Secondary sort" (input "Time" to order the records chronologically; we enable "Cross-filtering" to allow to click on the table entry and reflect its data spatially on the map.
  3. In the "Style" section, we mark "Show Header" and "Wrap Text" under "Table Header" and mark "Row Numbers" and "Wrap Text" under "Table Body."
  4. In the "Conditional Formatting" under "Style," we define 4 rules to emphasize the missing data in the "Source," "Photo," "Time," and "Authority" variables:

Rule 1: Source: if a source is missing, we highlight the entire entry in dark red.

Rule 2: Photo: if a source photo is missing, we highlight the cell in dark red.

Rules 3, 4: Time, Authority: if data on these variables is missing, we highlight their cells in light red; we visually prioritize flagging missing "Source" and "Photo," as they provide primary evidence to support the record's credibility.


Highlighting Dashboard Functionality


Below I represent the use case of the dashboard, based on filtering by authority (e.g., Manchester) and exploring the corresponding map, histogram, and table entries.

Once we click on the filter “Authority,” we can select “Manchester”. The map in the dashboard is cross-filtered with an “Authority” filter, and it zooms in on the data points (records) in Manchester. We can note that for Manchester there are 22 records in total; the histogram displays that 21 of those are coming from August 9, 2011, and the remaining one record is from August 10, 2011:



The table represents the date, time, authority, place, description, source, and photo variables for these 22 records:



Once we click on the entry in the table, it gets highlighted and updates the map, only showing the corresponding data point (Figure 7):



Figure 8. View of the dashboard when an individual table entry is selected.


Similarly, once we click on the data point on the map, it gets highlighted, and the table displays the corresponding data record (Figure 8):



Figure 9. View of the dashboard when a data point on the map is selected.


Considerations for Dashboard Improvements


While we can see how the missing data is highlighted in the table on the dashboard, the documentation of the rest of the sources can be critiqued further, as some of their values include but are not limited to string values of “Agency reports,” “Agency,” “Channel 4 news”, “Eyewitness report,” not effectively supported with links to these sources (Figure 10). To address this issue, we can format such “Source” cells with similar red highlights to emphasize the lack of credibility of these records and only define the “Source” values represented in a link format as credible.



Figure 10. Source filter view of the dashboard for “Manchester” authority.


An important ethical consideration behind producing such a dashboard is that it represents the data on 2011 riot incidents recorded by The Guardian, recognizing and documenting the crimes committed during the events. Such data is representative of the perspectives of people whose private, business, and other properties had been damaged due to riots, exemplifying the violent nature of the rioting events across different authorities. However, such data representation can also reinforce prior narratives of the riots as unjustified and criminal,[72] lacking the recognition of the causes behind the riots, such as ongoing tensions with police that different communities have been facing.[73] While the dashboard does not effectively represent the perspectives of protesters, it contributes a more critical analysis of The Guardian's data by highlighting the missing data entries, "Source" in particular, and by providing a more exploratory approach to viewing the data - allowing the users to filter by authorities, date and time, and sources.

Such an approach draws on the concept of "generous interfaces"[74] - inviting the user to explore the data (with multiple variables presented in a map, a histogram, and a table) and support browsing of it (by allowing to search the variable values of interest and to filter the data by selecting only the variables of interest). By doing so, I also aimed to portray the dataset's limitations, such as the inclusion of records that are not evidence-based, to challenge the potential view of the data source as being "objective."[75] Such an approach to visualizing the riot incidents data encourages the viewers to analyze it more critically and identify its flaws, challenging the validity of a largely “criminal” prior perception of riots and potentially motivating them to investigate the data on causes behind the events, represented by data sources driven by the 2011 riot participants. Thus I further extend the dashboard with additional pages that ingest and visualize data with the motivations behind the 2011 riots (representing riot participants' perspectives collected from interview records and other qualitative data sources) (Figure 10), with the steps outlining this process in ArcGIS mapping in the following section, which also includes transferring these maps from ArcGIS to Google Data Studio for a complete dashboard including all of the project approaches.



Figure 11. View of the dashboard’s page 3,  representing the Home Office’s official data on riot arrests, and Newburn et al. interview data records on the 2011 Riots in the United Kingdom.

Footnotes

[68] Office for National Statistics. (2020). “Local Authority Districts (April 2020) Names and Codes in the United Kingdom.” Data.gov.uk. https://www.data.gov.uk/dataset/24d87ad2-0fa9-4b35-816a-89f9d92b0042/local-authority-districts-april-2020-names-and-codes-in-the-united-kingdom.

[69] Rogers, Simon. "UK Riots: Every Verified Incident. Download the Full List." The Guardian, 2011. https://www.theguardian.com/news/datablog/2011/aug/09/uk-riots-incident-listed-mapped.

[70] “Use Conditional Formatting Rules in Google Sheets - Computer - Google Docs Editors Help.” Google.com, 2022. https://support.google.com/docs/answer/78413?hl=en&co=GENIE.Platform%3DDesktop.

[71] “Chart Cross-Filtering - Looker Studio Help.” Google.com. https://support.google.com/looker-studio/answer/9173401?hl=en#zippy=%2Cin-this-article.

[72] Murji, Karim and Sarah Neal. “Riot: Race and Politics in the 2011 Disorders.” Sociological Research Online 16, no. 4 (2011): 216-220. https://doi.org/10.5153/sro.2557.

[73] Newburn, Tim, Rebekah Diski, Katy Cooper, Robert Deacon, Alex Burch, and Mike Grant. “‘The biggest gang’? Police and people in the 2011 England riots.” Policing and Society 28, no. 2 (2018): 205-222. https://doi.org/10.1080/10439463.2016.1165220.

[74] Whitelaw, Mitchell. “Generous Interfaces for Digital Cultural Collections.” Digital Humanities Quarterly 9, no. 1 (2015). http://www.digitalhumanities.org/dhq/vol/9/1/000205/000205.html.

[75] D’Ignazio, Catherine and Lauren Klein. “Data Feminism.” Data Feminism, 2020. https://data-feminism.mitpress.mit.edu/pub/5evfe9yd/release/5.