With the release of a slice of the #GuptaLeaks e-mail correspondence by parliament, there is now an opportunity to do some data mining. For this article we will focus on looking at the network of e-mails between individuals in the network.
Where did the data come from?
In this case, from the Parliamentary Portfolio Committee on Public Enterprises (https://pplaaf.in/gleaks/). On the website it states that "South Africa's parliamentary Portfolio Committee on Public Enterprises, and at the request of Honorable Acting Chair Z Rantho, PPLAAF has made available a small selection on documents and emails from the Guptaleaks."
As such, the information used in this article was published by parliament. The analysis done here merely uses this source material.
"South Africa's parliamentary Portfolio Committee on Public Enterprises, and at the request of Honorable Acting Chair Z Rantho, PPLAAF has made available a small selection on documents and emails from the Guptaleaks."
First, all of the PDFs from the website were converted to simple text. This makes it much easier to analyse.
We need to extract and clean the data. We will be focusing on the e-mail network for this article, not the content. A few assumptions are going to be made that should be stated right at the beginning. We will assume that the first e-mail we encounter in each of the documents is the "from" email address and all subsequent emails are those from the to, cc and bcc.
This rule already breaks down as shown in the image above because an e-mail sometimes contains a chain of emails. For simplicity we will keep our (slightly flawed) assumption on who is the sender and who the e-mail is addressed to. This could thus be improved.
Now to cleaning. The extraction of e-mail addresses is not perfect. See the image below.
One can see that on the left the e-mails have errors after the .com. A Python library, to test if the email address is valid and points to a real server, was used to find all the emails that had problems. From there all the common errors were spotted and fixed, resulting in the clean emails that were at are on the right on the image.
Now to the interesting stuff. Now that we have been able to extract all of these e-mails, we can create the #GuptaLeaks e-mail networks. A snapshot of this network is below.
You can explore this network at pygraphistry. You can visually inspect the network which is a directed graph. The graph keeps the properties of who sent what to whom and how many times those connections were made. We can also calculate some network properties that can allow us to understand some characteristics of this e-mail network. The first is a histogram of a the degree for each person on the network. The degree, specifically 'In-Degree' (how many e-mails you have received) and 'Out-Degree' (how many e-mails you have sent) will likely show that most people have very infrequent communication, while a few will be sending or receiving most of the emails. This is proven to be true below in the degree histogram
Most (the mode of the distribution) of the people in the network have sent 0 e-mails (Out-Degree), while also most have only received 1 email (In-Degree). We can look at who sent the most emails:
- email@example.com, 268 (emails sent)
- firstname.lastname@example.org, 104
- email@example.com, 91
- firstname.lastname@example.org, 77
- email@example.com, 65
- firstname.lastname@example.org, 61
- email@example.com, 45
- firstname.lastname@example.org, 39
- email@example.com, 34
- firstname.lastname@example.org, 31
Or received the most emails
- email@example.com, 105
- firstname.lastname@example.org, 53
- email@example.com, 32
- firstname.lastname@example.org, 28
- email@example.com, 23
- firstname.lastname@example.org, 18
- email@example.com, 16
- firstname.lastname@example.org, 14
- email@example.com, 13
- firstname.lastname@example.org, 13
The top domains given all the unique users:
- gmail.com, 117
- sahara.co.za, 49
- kpmg.co.za, 42
- oakbay.co.za, 36
- eskom.co.za, 32
- dha.gov.za, 27
- jic.co.za, 16
- oberoihotels.com, 16
- yahoo.com, 13
- dirco.gov.za, 11
- goldridge.co.za, 11
- t-systems.co.za, 11
- treasury.gov.za, 11
- flysaa.com, 10
- ann7.com, 10
- tarsus.co.za, 10
We can also calculate the centrality measures (they quantify how important someone is in the communication network) to identify the important people in the communication.
The plot above was created using code from [NetworkX introduction: Hacking social networks using the Python programming language](https://github.com/drewconway/NetworkX_Intro_Materials).
As one can see, there are a number of stand outs in the centrality plot, one that is interesting is email@example.com who tends to not send out a lot of emails but sends/receives them mostly to those who are very much central, as such solidifying his high centrality too.
This was just a dive into the #GuptaLeaks e-mail) network using graph mining to understand some phenomena that you can discover from looking at connections between individuals.