Predicting New York Taxi Fare Costs

Introduction

The NYC taxi-fare prediction challenge is a Kaggle classic. The goal is simple: given a couple of features (number of passengers, pick-up/drop-off locations), can you predict the cost of the fare?

Today, we'll do the challenge in Conode, without writing a single line of code.

Open Dataset in Conode Demo Video

Baseline Prediction

When we upload the data into Conode, the import process converts the CSV table into a graph. It also suggests other views that might be interesting - in this case, a map of the drop-off locations:

Map View

If you don't see the geographic map in the view, you can toggle it on and off with the earth icon at the top left of the view.

Let’s start by getting a baseline, which gives us something to compare against. After loading the dataset, we can use the fits drawer to predict the target variable, ‘fare_amount’.

We obtained an R^2 value of 0.02, likely because numerous outliers in the latitude and longitude data are confusing the model. So, as with many datasets, the first thing we should do is clean the data.
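For readers who want to sanity-check the score outside Conode, here is a minimal sketch of the standard R^2 (coefficient of determination). It assumes the value reported in the fits drawer follows this usual definition; the function name `r_squared` is made up for illustration.

```python
def r_squared(actual, predicted):
    """Standard coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Squared error left over after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of always predicting the mean fare.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy fares, not taken from the dataset.
fares = [5.0, 10.0, 15.0]
perfect = r_squared(fares, [5.0, 10.0, 15.0])
mean_only = r_squared(fares, [10.0, 10.0, 10.0])
```

A score near 0, like the 0.02 above, means the model does barely better than always predicting the average fare.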

Data Cleaning

To predict the fare amount well, we need to make sure that anomalous values are discarded and won’t negatively impact our models. To do so, let’s look at some of our main features, for example ‘fare_amount’ and ‘passenger_count’.

We can create views for both of the variables to look for anomalies.

In both instances we notice a couple of outlier results that look like mistakes:

Rides with passenger_count = 0
Fares that cost < 0.

Both of these look like corrupt data, so it’s better to remove them.
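In Conode this pruning is done by selecting the offending nodes in the views. As a rough code equivalent, the two rules above can be sketched as a row filter. The sample rides are invented for illustration, and treating a fare of exactly 0 as valid is an assumption (the text only rules out fares below 0).

```python
# Invented sample rows; only the two column names come from the dataset.
rides = [
    {"fare_amount": 12.5, "passenger_count": 1},
    {"fare_amount": -3.0, "passenger_count": 2},  # negative fare
    {"fare_amount": 8.0, "passenger_count": 0},   # no passengers
    {"fare_amount": 6.5, "passenger_count": 3},
]


def is_valid_ride(ride):
    """Keep rides with at least one passenger and a non-negative fare."""
    return ride["passenger_count"] > 0 and ride["fare_amount"] >= 0


cleaned_rides = [ride for ride in rides if is_valid_ride(ride)]
```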

Similarly we can check the same thing for the pick-up & drop-off coordinates to find outliers there:

We can see that most of the nodes are in the NYC area. But there are also several data points that lie far away from it - clearly these are bad data points too. We could be more precise and also remove points that appear in water or in other implausible areas, but for now this should be enough: we can be quite confident that our dataset is now consistent.

We have successfully found the outliers and pruned them from our dataset.
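The coordinate pruning above is done visually on the map. A comparable rule in code is a simple bounding box around the NYC area, sketched below. The box limits are illustrative guesses rather than values from this walkthrough, and the coordinate column names are assumed to match the Kaggle dataset.

```python
# Illustrative bounding box around the NYC area (assumed limits).
NYC_BOUNDS = {"lat_min": 40.4, "lat_max": 41.1, "lon_min": -74.5, "lon_max": -73.4}


def in_nyc_area(lat, lon):
    """True if the point falls inside the assumed bounding box."""
    return (NYC_BOUNDS["lat_min"] <= lat <= NYC_BOUNDS["lat_max"]
            and NYC_BOUNDS["lon_min"] <= lon <= NYC_BOUNDS["lon_max"])


def ride_in_bounds(ride):
    """Require both the pick-up and the drop-off to be inside the box."""
    return (in_nyc_area(ride["pickup_latitude"], ride["pickup_longitude"])
            and in_nyc_area(ride["dropoff_latitude"], ride["dropoff_longitude"]))


# Invented sample rows.
rides = [
    {"pickup_latitude": 40.7580, "pickup_longitude": -73.9855,
     "dropoff_latitude": 40.6413, "dropoff_longitude": -73.7781},
    {"pickup_latitude": 0.0, "pickup_longitude": 0.0,        # missing GPS fix
     "dropoff_latitude": 40.7300, "dropoff_longitude": -73.9900},
    {"pickup_latitude": -73.9855, "pickup_longitude": 40.7580,  # lat/lon swapped
     "dropoff_latitude": 40.7300, "dropoff_longitude": -73.9900},
]

rides_in_nyc = [ride for ride in rides if ride_in_bounds(ride)]
```

A box like this will not catch points that land in water, which matches the level of precision the walkthrough settles for.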

Baseline Prediction: Take 2

Now that we are confident in the quality of our data, we can repeat the baseline prediction of ‘fare_amount’. Again, head over to the fits drawer:

We now get R^2 = 0.30, showing that there is some signal once the data is cleaned.

Feature Engineering

Let’s see if we can beat that. Let's plot ‘fare_amount’ to see if we can spot any patterns. The histogram reveals that there is structure (spikes) at certain fare values, which we can investigate further by cross-comparing with other dimensions of the data. To do this I’ve coloured the anomalies in orange and the rest of the data in grey.

Using the spatial dimensions, we can immediately spot that these are rides between Manhattan and JFK airport - which of course have a flat fare. The other spikes are probably the other airports. With this insight, let's create a new feature representing rides to and from each airport.

(NOTE: we could do even better and make a feature for rides just between Manhattan and the airport, but this will do for now.)
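To make the airport feature concrete, here is a hedged sketch of one way to express it in code: flag a ride when its pick-up or drop-off lies within a small radius of an airport. The airport coordinates are approximate, the 2 km radius and the `*_ride` feature names are assumptions for illustration, and this is not necessarily how Conode builds the feature internally.

```python
import math

# Approximate airport coordinates (latitude, longitude).
AIRPORTS = {
    "jfk": (40.6413, -73.7781),
    "lga": (40.7769, -73.8740),
    "ewr": (40.6895, -74.1745),
}
NEAR_AIRPORT_KM = 2.0  # assumed radius, tune as needed


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def airport_features(ride):
    """One boolean per airport: does the ride start or end near it?"""
    features = {}
    for code, (lat, lon) in AIRPORTS.items():
        pickup_near = haversine_km(
            ride["pickup_latitude"], ride["pickup_longitude"], lat, lon) <= NEAR_AIRPORT_KM
        dropoff_near = haversine_km(
            ride["dropoff_latitude"], ride["dropoff_longitude"], lat, lon) <= NEAR_AIRPORT_KM
        features[f"{code}_ride"] = pickup_near or dropoff_near
    return features


# Invented example: Midtown Manhattan to JFK.
ride = {"pickup_latitude": 40.7580, "pickup_longitude": -73.9855,
        "dropoff_latitude": 40.6413, "dropoff_longitude": -73.7781}
ride_features = airport_features(ride)
```

A single flag per airport covers both directions of travel, which mirrors the "to and from" feature described above.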

We can then re-do the fit:

Great, we have gone from 0.30 → 0.48 with the addition of a simple feature! Depending on our use case we could keep going and find new features to boost accuracy further.

This is a simple but great example of how seeing the data through different dimensions can surface structure that wouldn't otherwise be obvious.

Conclusion

In just a few operations, and with the aid of our human intuition, we have cleaned our dataset, created new features and improved the model's R^2 by nearly 0.2 (from 0.30 to 0.48) - all without ever writing a single line of code.

Last update: 2025-01-06