Introduction to Enron Email Corpus in Conode

Introduction

The what

We’ll be exploring the Enron Email Corpus to showcase how Conode can be used for forensic accounting.

The data

The dataset contains emails by employees of Enron Corporation, an American energy, commodities and services company in Texas. These were obtained by the Federal Energy Regulatory Commission during its investigation of widespread accounting fraud within Enron. The data was made public and can be sourced from Kaggle here.

For this first session, we focus only on the emails of 26 C-level executives rather than the full 500k. The features are From, To, CC and BCC (all email addresses), plus Subject of Email and Date Email Sent, and each of them connects to the emails.
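
For readers who want to inspect the same fields outside Conode, the sketch below pulls them from a single raw message with Python's standard `email` package. The sample message is invented, the `extract_features` helper is hypothetical, and the assumption that each record is a raw message string with standard headers is ours; nothing here describes how Conode imports the data.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Invented sample message, for illustration only.
RAW = """\
From: alice@example.com
To: bob@example.com, carol@example.com
Cc: dave@example.com
Subject: Q3 numbers
Date: Mon, 14 May 2001 16:39:00 -0700

Please review the attached figures.
"""


def extract_features(raw):
    """Return the header features listed above as a plain dict (hypothetical helper)."""
    msg = message_from_string(raw)

    def addresses(header):
        # getaddresses copes with comma-separated lists and display names.
        return [addr for _, addr in getaddresses(msg.get_all(header, []))]

    date_header = msg.get("Date")
    return {
        "from": addresses("From"),
        "to": addresses("To"),
        "cc": addresses("Cc"),
        "bcc": addresses("Bcc"),
        "subject": msg.get("Subject", ""),
        # A missing Date header is kept as None rather than guessed.
        "date": parsedate_to_datetime(date_header) if date_header else None,
    }


features = extract_features(RAW)
print(features["from"], features["to"], features["subject"])
```

Each address list, the subject and the date map onto the features named above, all hanging off one email.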

The how

We’ll begin by asking our graph some basic questions to understand what is in the dataset and generate distribution plots to visualise the results. We’ll also use the opportunity to interact with the graph and edit the views to better understand what we’re seeing. Next we’ll enrich our graph using the Extract agent, and conclude by exploring the graph structure of email connections.

The why

This investigation ultimately aims to identify any evidence of fraud within the emails. The value of Conode lies in its ability to speed up the investigation and to cross-highlight between views, which gives analysts a unified view of the data while they query the graph.

Try with us!

We’ve included this dataset as an example graph in Conode, so you can jump straight in and follow along as we query, enrich, and explore the graph structure to analyse the Enron Corpus. If you are a new user, please follow the simple directions to create an account first, and then feel free to use Conode to your heart’s content!

Hop into Conode

Approach

To begin, log in to Conode, navigate to the example graphs and select the Enron Emails dataset. Within the graph, go to the ‘Ask Graph’ interface.

Part 1: Querying the graph with data interactivity

1. Let’s say you haven’t looked at the CSV file at all and want to find out very quickly what the dataset contains. Use this prompt:
What is this graph about?

2. Who exactly are these 26 C-level executives? Let’s find out a bit more about their email activity with this:
Who are the key actors in this dataset and how many emails did they each send?

3. Let’s retrieve and look at an email from our most frequent sender with this:
Can you show me an email sent from Louise?

4. We’d imagine timelines are important to any investigation to see the progression of events, so let’s get a view of our emails spread across time with this:
Can you visualise the distribution of emails over time based on when they were sent?

Our resulting view is a histogram made up of individual data points of the emails, each represented by a node. Select a node and inspect it to see other features which describe the data, such as the email subject, the date the email was sent, etc.

💪🏻 Tip

The x-axis of this histogram is a datetime feature that is not rendered properly by default. To fix the values shown on the x-axis ticks, toggle the format temporal axis button in the view toolbar.

💡 FYI

Our previous conversation isn’t lost in the graph view. It’s been moved to the right-side drawer so we can continue it alongside the data. Conode champions data observability!

5. Did you notice there are outliers in our histogram? Let’s investigate this!

Upon inspection, these outlier emails appear to be spam that contains no datetime information, so we simply remove these nodes from the view.
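
Outside Conode, the time view from step 4 and the outlier check from step 5 come down to bucketing emails by send date and setting aside records whose date cannot be parsed. A minimal sketch of that idea follows, using invented records and a hypothetical `split_by_date` helper; it is not a description of how Conode builds its histogram.

```python
from collections import Counter
from email.utils import parsedate_to_datetime

# Invented (sender, Date header) pairs, for illustration only.
RECORDS = [
    ("david@example.com", "Tue, 5 Dec 2000 09:15:00 -0800"),
    ("david@example.com", "Wed, 20 Dec 2000 11:02:00 -0800"),
    ("louise@example.com", "Mon, 14 May 2001 16:39:00 -0700"),
    ("spam@example.net", ""),  # no datetime information
]


def split_by_date(records):
    """Count dated emails per month and collect undated ones separately."""
    per_month = Counter()
    undated = []
    for sender, date_header in records:
        try:
            sent = parsedate_to_datetime(date_header)
        except (TypeError, ValueError):
            # Missing or malformed date: treat the record as an outlier.
            undated.append((sender, date_header))
            continue
        per_month[sent.strftime("%Y-%m")] += 1
    return per_month, undated


per_month, undated = split_by_date(RECORDS)
for month in sorted(per_month):
    print(month, "#" * per_month[month])  # crude text histogram
print("undated:", len(undated))
```

Catching both exception types is deliberate: depending on the Python version, an unparseable date raises either `TypeError` or `ValueError`.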

6. Now that we have the volume of emails across time, let’s shift our focus back to our senders with this:
Can you generate a barplot of sender and frequency of emails sent?

🎨 Coloring our nodes by group (email sender) helps us identify them more easily in the time series plot. From cross-highlighting, we can clearly see that David was much more active earlier on, whereas Louise’s emails only picked up in the later period.
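
The two views being cross-highlighted here can be thought of as two aggregations over the same emails: a count per sender, and a per-sender breakdown over time. The sketch below illustrates that with invented data; the names and months are placeholders rather than values from the corpus.

```python
from collections import Counter, defaultdict

# Invented (sender, month sent) pairs, for illustration only.
EMAILS = [
    ("david", "2000-12"),
    ("david", "2000-12"),
    ("david", "2001-01"),
    ("louise", "2001-10"),
    ("louise", "2001-11"),
]

# Bar plot data: number of emails per sender.
emails_per_sender = Counter(sender for sender, _ in EMAILS)

# Cross-view data: for each sender, how many emails fall in each month.
months_by_sender = defaultdict(Counter)
for sender, month in EMAILS:
    months_by_sender[sender][month] += 1

for sender, total in emails_per_sender.most_common():
    print(sender, total, dict(sorted(months_by_sender[sender].items())))
```

Grouping by sender first and by month second is what makes it possible to say that one sender was active early and another late.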

Part 2: Enriching the graph

Up to this point we’ve queried the data using features that exist in its original taxonomy, such as the sender’s email address and the date the email was sent. The analysis could benefit further from looking at the sentiment expressed in the email contents, which may help identify unusual activity.

To achieve this, we enrich and build on the graph by using the Extract agent. Take all the emails from one particular individual (Rick, in this example), feed them into the Extract agent and ask:
Can you detect instances of high stress tones in these emails?

You’ll notice that while the agent can suggest features, we ultimately have the final say on what gets added to enrich the graph, and can edit or remove what we want extracted.

Once satisfied, we run the extractor. The results in the extracted view help us ‘slice the graph’ and focus only on the emails from Rick that contain stress-related keywords.
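
To make the idea of slicing on stress-related keywords concrete, here is a deliberately naive keyword matcher. It is a stand-in for illustration only: how the Extract agent works internally is not described here, and the keyword list, email bodies and `stress_terms` helper are all invented.

```python
import re

# Hypothetical keyword list; the features the Extract agent suggests may differ.
STRESS_KEYWORDS = ("urgent", "asap", "immediately", "deadline", "worried")

# Single pattern with word boundaries, so keywords only match whole words.
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, STRESS_KEYWORDS)) + r")\b",
    re.IGNORECASE,
)


def stress_terms(body):
    """Return the distinct stress keywords found in an email body, sorted."""
    return sorted({match.lower() for match in PATTERN.findall(body)})


# Invented email bodies, for illustration only.
EMAILS = {
    "email-1": "We need the revised numbers ASAP. This is urgent.",
    "email-2": "Thanks for lunch yesterday.",
}

# Keep only the emails where at least one keyword was found.
flagged = {}
for email_id, body in EMAILS.items():
    terms = stress_terms(body)
    if terms:
        flagged[email_id] = terms

print(flagged)
```

A plain keyword match will miss tone that is not expressed through these exact words, which is why reviewing and editing the suggested features before running the extractor matters.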

⭐ Easy peasy

The same simple use of the Extract agent carries over to any other individual or question we’re interested in.

Part 3: Exploring the graph structure

An advantage of using knowledge graphs is the ability to have a single unified view of our data through the graph structure. This scenario is a perfect opportunity to show the connections between sender, email, and recipient with a force-directed layout of our nodes.

💡 Note

Notice we can immediately spot a standalone cluster of nodes in the top-right corner of the view. Upon inspecting them, we come to understand that these are company-wide announcements sent by Ken Skilling to all Enron employees.
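
One plausible reading of a standalone cluster is a group of nodes with no path to the rest of the graph, that is, a separate connected component. The sketch below builds a sender, email, recipient graph from invented data and lists its components with a breadth-first search. The node names are placeholders, and this is not a description of Conode's layout algorithm.

```python
from collections import defaultdict, deque

# Invented (sender, email id, recipients) triples, for illustration only.
# Person names and email ids are assumed never to collide.
EMAILS = [
    ("louise", "e1", ["david"]),
    ("david", "e2", ["louise", "rick"]),
    ("announcer", "e3", ["all-staff"]),
]


def build_graph(emails):
    """Undirected adjacency sets linking each email to its sender and recipients."""
    graph = defaultdict(set)
    for sender, email_id, recipients in emails:
        for person in [sender, *recipients]:
            graph[person].add(email_id)
            graph[email_id].add(person)
    return graph


def connected_components(graph):
    """Group nodes into components using breadth-first search."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in graph[node] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
        components.append(component)
    return components


components = connected_components(build_graph(EMAILS))
for component in sorted(components, key=len, reverse=True):
    print(sorted(component))
```

In this toy data the announcement email and its recipient form their own component, which is the kind of structure a force-directed layout tends to push away from the main cluster.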

Summary

In this session we have:

Queried our graph
Produced a time-series chart
Identified spam and removed outlier emails
Colored and cross-highlighted our data across multiple views
Used the no-code Extract agent to look into sentiments from emails
Explored the graph structure using a force-directed layout

This initial investigation can already serve as a springboard for uncovering fraudulent activity within the emails, and will be especially valuable to those working in Forensic Investigations or Audit and Assurance.

Last update: 2025-03-25