
Building a Risk Map of New Zealand's Roads

Introduction

In this workflow we will use Edge to quickly find insights from across a public road traffic event database. We will cover how Edge can be used to automatically reveal outliers and hidden patterns from across a dataset's many dimensions, and accelerate exploratory data analysis to ultimately build a personalised risk dashboard within minutes.


Overview of Data

The dataset that we will be loading and exploring today is NZTA Waka Kotahi’s Crash Analysis System dataset, which tracks traffic incidents across New Zealand. It contains details such as the types of vehicles involved, local environment features, and weather conditions during each event.

The dataset comes in the form of a large, sparse, tabular .csv file which can be overwhelming and challenging to query without code and a clear direction.

Let's use Edge to quickly make sense of this dataset.


From Upload to Dashboard

Data can be imported into Edge using the API, by connecting to a database, or by uploading a file from your local drive. During upload, all your data is converted automatically into a knowledge graph which is then presented back to you in the form of a simplified dashboard containing views on the data that Edge thinks might be of interest to you.

Read more in How To Upload Data.

For this workflow, we uploaded the .csv directly: Open Dataset in Edge

Data Model and Taxonomy Overview

Before diving in, let’s quickly relate what we’re looking at here back to the original data structure.

What we see in Edge is the graph version of the large csv file we saw earlier, but instead of getting lost in all the rows and columns of the table, every collision and feature has become a single node which we can interact with directly.
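To make the table-to-graph idea concrete, here is a minimal sketch in plain Python. It is not Edge's API or its actual data model: the sample rows, the column names, and the rule that empty cells produce no edge are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the sparse .csv file.
rows = [
    {"id": "e1", "weatherA": "Fine", "pedestrian": "1", "trafficSign": ""},
    {"id": "e2", "weatherA": "Light rain", "pedestrian": "", "trafficSign": "1"},
    {"id": "e3", "weatherA": "Fine", "pedestrian": "1", "trafficSign": ""},
]

def rows_to_graph(rows, id_column="id"):
    """Link each feature (column) node to the event nodes that hold a value for it."""
    edges = defaultdict(set)
    for row in rows:
        for column, value in row.items():
            # Assumption: an empty cell means "no edge", which keeps the graph sparse.
            if column != id_column and value != "":
                edges[column].add(row[id_column])
    return dict(edges)

graph = rows_to_graph(rows)
print(sorted(graph["pedestrian"]))
```

Each column becomes one feature node and each row one event node, so a question like "which events involve a pedestrian?" becomes a single lookup rather than a scan of the table.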

For example, click on an event node from the spatial view and the Inspect Agent will present to us all information related to this event. You will notice that selecting any nodes in one view will highlight them in all other views too - this helps us quickly understand distributions across the dataset's many dimensions.

The "Home" view gives you a full picture of the taxonomy defining these data (see Starting Views). Simply select any feature node and the Inspect Agent will display example values as well as statistics like the mean, min, and max values.

For example, if we select "trafficSign" from the Home view and take a look at the Inspect Agent on the right, we can immediately see a few example values for rows (events) in the dataset, that events have a min of 0 and max of 1 traffic sign present in the local environment of the collision, and that this field is only filled in for 3k of the 10k records contained in this sample of the dataset.
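The kind of summary the Inspect Agent shows for a feature can be approximated in a few lines of plain Python. This is only a sketch of the statistics described above, not how Edge computes them; the `summarise_field` helper and the sample records are hypothetical.

```python
def summarise_field(records, field):
    """Return example values, min, max, mean and fill count for one field."""
    values = [r[field] for r in records if r.get(field) is not None]
    if not values:
        return {"filled": 0, "total": len(records)}
    return {
        "examples": values[:3],
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
        "filled": len(values),
        "total": len(records),
    }

# Hypothetical sample: the field is missing or empty for some records.
records = [
    {"trafficSign": 0}, {"trafficSign": 1}, {"trafficSign": None},
    {"trafficSign": 1}, {}, {"trafficSign": 0},
]
summary = summarise_field(records, "trafficSign")
print(summary)
```

Comparing `filled` with `total` gives the same sparsity signal as the "3k of the 10k records" figure mentioned above.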

In this example, Edge recognized the spatial context of the data and so in addition to the "Home" view which provides taxonomic coverage, Edge also automatically opened the Spatial View to display the data's spatial coverage.

Edge ran a quick clustering of all the collision records based on similarity to create the "Embedding coloured by tlaId" view, which reveals patterns hidden across the dataset's many dimensions that we may not otherwise have known to look for. Finally, as well as showing us the forest for the trees in the embedding view, Edge found a particular variable in the dataset which contained outliers and as a result opened "Histogram of Longitude" to surface the anomalous data points.

Within seconds of upload, Edge has given us an accelerated starting point from which we can understand the data's structure, scale, patterns and anomalies.


Reviewing Geospatial Coverage

Edge recognised spatial dimensions and so opened a map view to show us the geospatial distribution of our data. For example, by selecting the event nodes located in the North Island and then checking out the node count (bottom LHS of view), we immediately find out that the vast majority of incidents take place in the North Island, compared to the South.


You can toggle the Google Maps-style background on and off using the Globe icon in the toolbar.



Verifying an Anomaly

During upload, Edge searches the taxonomy for fields with unusual distributions and presents them to you. In this case, we find that a histogram of the "Longitude" field has been opened.


Over on the LHS of the view, we can see some events that have much lower values of Longitude than the rest. Why might this be?


By selecting individual nodes and opening the Inspect agent from the RHS drawer, we can read through all the details logged for each of these collision events. A feature related to their location which pops up is "Chatham Islands Territory".

Having recognised the spatial dimensions, a bolstering agent which works in the background during data upload has assigned a Google Maps link to every collision record. Once a record is selected, we can use the Open URL button from the context menu to jump over to Google Maps and check out the environment in which the collision took place.


They did indeed take place on the Chatham Islands, and as such this anomaly is likely the product of the coordinate conversion between NZMG and Lat/Long. At this point, we can either update their coordinate values by moving them in the view, or simply not consider them by removing them from our histogram and geospatial map views.

Edge helped surface outliers which we can quickly validate, then clean or remove from our investigation.
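For readers who want to reproduce this kind of check outside Edge, a common way to flag values that sit far from the bulk of a distribution is the interquartile range (IQR) rule. The sketch below uses it on made-up longitude values; the source does not say which method Edge uses, so treat the rule, the `k=1.5` threshold, and the sample data as assumptions.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical longitudes: most sit in a narrow band, one is far lower.
longitudes = [174.76, 174.78, 172.64, 175.28, 170.50, 176.17, 174.90, -176.56]
outliers = iqr_outliers(longitudes)
print(outliers)
```

Flagged records still need the manual validation step described above before they are cleaned or removed.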


Backtracking Outlier to Source

Let's now take a look at this embedding view which was created for us during import...

First, remind me, what actually is an embedding again?

Embeddings are a common AI tool that cluster records based on their similarity according to the features that define them. In this case, all the event features we see in the "Home" view were used to cluster the collisions such that similar collision types will be located close together in the embedding.

Embeddings are a great tool for working with high-dimensional data: having only just uploaded the dataset, we already have an immediate feel for its size, shape, and even underlying patterns that are not easily noticeable in tabular form.
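The idea that "similar collisions sit close together" can be illustrated with a small sketch: scale each feature so that no single one dominates, then measure the distance between records. This shows only the underlying notion of similarity, not Edge's embedding method; the feature names and values are hypothetical.

```python
import math
import statistics

def standardise(rows):
    """Scale each column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stdevs = [statistics.pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stdevs)] for row in rows]

# Hypothetical features per collision: [speedLimit, numberOfLanes, vehicles]
collisions = [
    [50, 4, 3],    # urban, multi-lane
    [50, 4, 2],    # urban, multi-lane
    [100, 2, 1],   # open road
]
scaled = standardise(collisions)
d_urban = math.dist(scaled[0], scaled[1])   # two similar collisions
d_mixed = math.dist(scaled[0], scaled[2])   # two dissimilar collisions
print(d_urban < d_mixed)
```

An embedding projects these many-dimensional distances onto two axes so that the same closeness can be seen at a glance.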

Read more here.


As we can read from the title of the embedding view, the collision events have been coloured by the tlaId field (territorial local authority ID). Looking at the combination of both the spatial view and embedding surfaces an immediate pattern which would otherwise be hidden across the dataset's many dimensions: the collisions which have taken place in Auckland are fundamentally different to those in the rest of the country.


We notice this due to the red cluster of Auckland collisions being displaced vertically in the embedding view. What features of the dataset are causing this displacement? Why are the Auckland collisions different?


To answer this, let's open the "PCA Feature Weights" view from the navigation drawer. This view shows us how the clusters in our embedding view were formed: the features at the top of the view will have pulled the scenarios which contained that feature upwards, and vice versa.

You can read this view as follows: the incidents clustered to the LHS have high latitude values, while those at the top involved the highest speed limits and tend to result in more serious injuries. The scenarios towards the bottom of the embedding involved the highest numbers of cars, road lanes, and parked vehicles, and are more likely to result in only minor injuries.
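To see where feature weights of this kind come from, the sketch below computes the loadings of the first principal component with power iteration, using only the standard library. It is an illustration under stated assumptions, not Edge's implementation: the feature names, the sample rows, and the choice of power iteration are ours.

```python
import math
import statistics

def first_component(rows, iterations=200):
    """Return the first principal component's loadings via power iteration."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stdevs = [statistics.pstdev(c) or 1.0 for c in columns]
    z = [[(v - m) / s for v, m, s in zip(row, means, stdevs)] for row in rows]
    n, d = len(z), len(columns)
    # Correlation matrix of the standardised features.
    corr = [[sum(r[i] * r[j] for r in z) / n for j in range(d)] for i in range(d)]
    # Start from the first axis so the first feature's loading stays positive.
    vec = [1.0] + [0.0] * (d - 1)
    for _ in range(iterations):
        vec = [sum(corr[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec

# Hypothetical feature names and collision records.
features = ["speedLimit", "numberOfLanes", "vehicles"]
rows = [
    [100, 2, 1], [100, 2, 2], [80, 2, 1],
    [50, 4, 3], [50, 4, 2], [30, 3, 3],
]
weights = dict(zip(features, first_component(rows)))
for name, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{name:<14}{weight:+.2f}")
```

Features with weights of opposite sign pull records towards opposite ends of the axis, which is how a cluster of low-speed, multi-lane, multi-vehicle collisions ends up displaced from the rest.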

In a short span of time, we have discovered that collisions in Auckland typically happen at lower local speed limits, on roads with more lanes, and involve more cars, but with a lower overall crash severity.


Querying the Data

We can quickly answer a few questions of the dataset...

Where do collisions involving pedestrians take place?

We can use Edge to quickly select subsets of our data. If we are interested in the events that involved a pedestrian hazard, simply click on the “Pedestrian” feature in the Home view then Select -> Successors. All the collisions that were connected to the pedestrian node are now highlighted, giving us immediate access to the spatial distribution.
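Conceptually, Select -> Successors follows the edges leading out of the selected feature node. A minimal sketch of that lookup, assuming a simple feature-to-events adjacency map (the node names and coordinates are hypothetical, and this is not Edge's API):

```python
# Hypothetical graph: feature node -> event nodes connected to it.
edges = {
    "pedestrian": {"e1", "e3"},
    "weatherA": {"e1", "e2", "e3"},
}
# Hypothetical (latitude, longitude) for each event node.
coords = {
    "e1": (-36.85, 174.76),
    "e2": (-41.29, 174.78),
    "e3": (-43.53, 172.64),
}

def successors(edges, node):
    """Return the nodes reachable by one outgoing edge from `node`."""
    return edges.get(node, set())

selected = successors(edges, "pedestrian")
selected_coords = {event: coords[event] for event in sorted(selected)}
print(selected_coords)
```

Because the selection is just a set of event identifiers, the same set can be highlighted in every other view at once.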

What is the weather distribution across these collision events?

Select the "Weather A" feature from our Home view, followed by New View -> Barplot.

Immediately we can see that more collisions take place in light rain than heavy rain, and by selecting each group of nodes we can see where across the country such weather events are most likely.
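The bar plot is, in essence, a count of events per category. A quick sketch of the same tally in plain Python, using made-up weather values rather than the real dataset:

```python
from collections import Counter

# Hypothetical "Weather A" values; None marks a record with no value.
weather = ["Fine", "Light rain", "Fine", "Heavy rain", "Light rain", "Fine", None]

counts = Counter(w for w in weather if w is not None)
for label, n in counts.most_common():
    # Draw one '#' per event as a text-mode bar.
    print(f"{label:<12}{'#' * n} {n}")
```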

What are the highest risk areas of New Zealand?

We can colour the whole view by node density using Colour By -> Density. We find that the densest region is of course Auckland in the North Island, which makes sense with respect to the country's population distribution.

But Edge is flexible in that every tool can be applied to any subset of the data you care about. So if we select the events that took place in Auckland, then use Select -> Inverse and reapply Colour By -> Density, the density colouring now excludes the Auckland region.

Red hotspots pop up elsewhere around the country. We can quickly find out what differentiates these data points from the rest by highlighting these dense regions and checking the Describer tool. In this example, the Describer tells us these collision events take place in Christchurch city, in the Canterbury region.
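The select, invert, and recolour sequence can be sketched as a grid count applied first to all events and then to the inverse of a selection. The source does not describe how Edge estimates density, so the fixed 0.5 degree grid, the `densest_cell` helper, and the sample events are assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical events with a region label and coordinates.
events = [
    {"id": "e1", "region": "Auckland", "lat": -36.85, "lon": 174.76},
    {"id": "e2", "region": "Auckland", "lat": -36.86, "lon": 174.77},
    {"id": "e3", "region": "Auckland", "lat": -36.90, "lon": 174.80},
    {"id": "e4", "region": "Auckland", "lat": -36.88, "lon": 174.74},
    {"id": "e5", "region": "Canterbury", "lat": -43.53, "lon": 172.64},
    {"id": "e6", "region": "Canterbury", "lat": -43.55, "lon": 172.62},
    {"id": "e7", "region": "Canterbury", "lat": -43.52, "lon": 172.60},
    {"id": "e8", "region": "Wellington", "lat": -41.29, "lon": 174.78},
]

def densest_cell(events, cell=0.5):
    """Return ((row, column), count) for the busiest grid cell of `cell` degrees."""
    counts = Counter(
        (math.floor(e["lat"] / cell), math.floor(e["lon"] / cell)) for e in events
    )
    return counts.most_common(1)[0]

# Select the Auckland events, then take the inverse of that selection.
selected_ids = {e["id"] for e in events if e["region"] == "Auckland"}
inverse = [e for e in events if e["id"] not in selected_ids]

overall = densest_cell(events)
without_auckland = densest_cell(inverse)
print(overall, without_auckland)
```

With the dominant cluster excluded, the next densest cell becomes visible, mirroring the way secondary hotspots appear once Auckland is removed.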


Personalising the Dashboard

Since the dataset we are examining is essentially a log of vehicle collisions across New Zealand, let's transform it into a risk map. We will first categorize the incidents based on the severity of the collision.

  1. Search for the "crashSeverity" feature in the Home view
  2. Select the "crashSeverity" node, then navigate from New View -> Bar Plot to open the distribution of this dataset across severity in a new view
  3. Update the node size and colour of each node group to differentiate each class of event
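The grouping and styling in the steps above amount to a lookup from severity class to a visual style. A minimal sketch, assuming illustrative severity labels, colours, and sizes (none of these are taken from Edge or confirmed against the dataset):

```python
from collections import defaultdict

# Assumed style table: labels, colours and sizes are illustrative only.
STYLE = {
    "Fatal Crash": {"colour": "red", "size": 12},
    "Serious Crash": {"colour": "orange", "size": 8},
    "Minor Crash": {"colour": "yellow", "size": 5},
    "Non-Injury Crash": {"colour": "grey", "size": 3},
}
DEFAULT_STYLE = {"colour": "lightgrey", "size": 2}

# Hypothetical events.
events = [
    {"id": "e1", "crashSeverity": "Fatal Crash"},
    {"id": "e2", "crashSeverity": "Minor Crash"},
    {"id": "e3", "crashSeverity": "Minor Crash"},
    {"id": "e4", "crashSeverity": "Unknown"},
]

def style_by_severity(events):
    """Group event ids by severity and attach a colour and size to each group."""
    groups = defaultdict(list)
    for event in events:
        groups[event["crashSeverity"]].append(event["id"])
    return {
        severity: {"ids": ids, **STYLE.get(severity, DEFAULT_STYLE)}
        for severity, ids in groups.items()
    }

styled = style_by_severity(events)
print(styled["Fatal Crash"])
```

Giving the most severe class the largest, most saturated marker keeps rare but important events visible among the far more numerous minor ones.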

We have now personalised our dashboard to display road risk across New Zealand. Let's zoom and pan through the view to identify patterns of risk at the local region or road level.

Beyond Auckland, which we know is a hotspot in general, we can see a few highways which have suffered a noticeably large number of fatal collisions, for example State Highway 1 in the Kapiti Coast region.


Sharing Insights

Having identified a local area with high road risk, we may want to make a note of this insight. To do so, group the collisions that took place here under a new dedicated header by first selecting the data points, then creating a new node and assigning it a meaningful label. Following this, there are a few options for how we can export and share this insight:

  1. Share the graph with your team or individual email addresses using the Share tool located in the top RHS of your screen
  2. Share just this spatial view by right-clicking on the view title and selecting Copy View URL
  3. Export the view to a csv by selecting all nodes in the spatial view and then using Export -> Selected subgraph as .csv
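For the third option, the exported file is simply the selected nodes written back out as rows. A sketch of an equivalent export with Python's csv module, where the `export_selected` helper, node fields, and values are hypothetical and Edge's actual export format may differ:

```python
import csv
import io

def export_selected(nodes, selected_ids, fields):
    """Write the selected nodes as CSV text, keeping only the given fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for node in nodes:
        if node["id"] in selected_ids:
            writer.writerow(node)
    return buffer.getvalue()

# Hypothetical nodes; "group" stands in for the label of the new header node.
nodes = [
    {"id": "e1", "lat": -40.91, "lon": 175.01, "group": "High-risk stretch"},
    {"id": "e2", "lat": -36.85, "lon": 174.76, "group": ""},
    {"id": "e3", "lat": -40.95, "lon": 175.03, "group": "High-risk stretch"},
]
csv_text = export_selected(nodes, {"e1", "e3"}, ["id", "lat", "lon"])
print(csv_text)
```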

In a matter of minutes, we have identified a region of high risk, recorded this insight, and shared it with a colleague.


Summary

Within seconds of uploading the NZTA Crash Analysis dataset to Edge, we received a fully interactive dashboard showing us the spatial and taxonomic coverage of road risk in New Zealand, as well as outliers and patterns hidden across the dataset's many dimensions. From here we:

  • Developed an understanding of our data's distribution across taxonomy features.
  • Investigated patterns surfaced by the tool and traced clusters to their origins.
  • Addressed outliers with contextual assistance provided by the tool.
  • Identified collision classes exhibiting unexpected properties.
  • Formulated data inquiries and personalised views based on relevant metrics, such as severity.
  • Pinpointed high-risk regions for further investigation or collaboration with colleagues.

In the subsequent workflow, we'll delve deeper by integrating additional diverse data sources into our risk mapping analysis.



Last update: 2024-06-07