Data Science Townhall and Q&A #1 (#DSIDE2017)

We have started the second session of the 2017/2018 DSIDE program. Our team at CSIR Modelling and Digital Science takes about 18 out of the 50 students on the program yearly. On the 8th of December we hosted Tefo Mohapi (Tech Journalist and Owner of iAfrikan).

Data Science Townhall and Q&A

Tefo talked a bit about his work and its relation to data. The Q&A session was free-flowing, with questions on the ethics of data and on challenges with privacy and POPI. What became apparent was the general lack of public knowledge of the Protection of Personal Information (POPI) Act.

I want to express a lot of gratitude to Tefo for availing himself for the session.

SOCML: Self-Organizing Conference on Machine Learning 2017

I was invited to the Self-Organizing Conference on Machine Learning (SOCML) and attended at the end of November 2017. As the name suggests, the conference did not have a preset schedule. Instead, the participants themselves put together the agenda each day and then held the sessions. These are just my high-level thoughts on the process.

Ian Goodfellow opening SOCML and planning the sessions for the day

This was my second "Unconference" format. I had attended a South African Treasury Unconference in September. The sessions I attended and wanted to share are next.

Analysis of the #GuptaLeaks e-mail network

Appeared on iAfrikan.

With the release of a slice of the #GuptaLeaks e-mail correspondence by parliament, there is now an opportunity to do some data mining. For this article we will focus on looking at the network of e-mails between individuals in the network.

Where did the data come from? In this case, from the Parliamentary Portfolio Committee on Public Enterprises. The website states that "South Africa's parliamentary Portfolio Committee on Public Enterprises, and at the request of Honorable Acting Chair Z Rantho, PPLAAF has made available a small selection on documents and emails from the Guptaleaks."

As such, the information used in this article was published by parliament. The analysis done here merely uses this source material.

First, all of the PDFs from the website were converted to simple text. This makes it much easier to analyse.

What's next?

We need to extract and clean the data. We will be focusing on the e-mail network for this article, not the content. A few assumptions are going to be made that should be stated right at the beginning. We will assume that the first e-mail address we encounter in each document is the "from" address and that all subsequent addresses are the recipients from the To, Cc and Bcc fields.
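The assumption above can be sketched in a few lines of Python. This is only an illustration, not the code used for the article: the regular expression is deliberately simple, the function name `sender_and_recipients` and the sample text are made up, and dropping repeated recipients within one document is an extra assumption.

```python
import re

# Deliberately simple pattern (an assumption); real addresses can be more exotic.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def sender_and_recipients(text):
    """Split the addresses in one document into (sender, recipients).

    The first address found is treated as the sender and every later
    address as a recipient, which is the (slightly flawed) rule above.
    """
    found = [addr.lower() for addr in EMAIL_RE.findall(text)]
    if not found:
        return None, []
    sender = found[0]
    # Keep each recipient once, in order of appearance, and skip the sender.
    recipients = list(dict.fromkeys(a for a in found[1:] if a != sender))
    return sender, recipients


# Hypothetical document text, not taken from the leaked e-mails.
sample = """From: Alice <alice@example.com>
To: bob@example.org
Cc: carol@example.net, bob@example.org
"""
sender, recipients = sender_and_recipients(sample)
print(sender, recipients)
```

Each (sender, recipient) pair produced this way becomes one directed connection in the network.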

This rule already breaks down, as shown in the image above, because an e-mail sometimes contains a chain of e-mails. For simplicity we will keep our (slightly flawed) assumption about who the sender is and who the e-mail is addressed to. This could be improved.

Now to cleaning. The extraction of e-mail addresses is not perfect. See the image below.

One can see that on the left the e-mail addresses have errors after the .com. A Python library that tests whether an e-mail address is valid and points to a real server was used to find all the addresses that had problems. From there, the common errors were spotted and fixed, resulting in the clean addresses on the right of the image.
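One of the common errors described above, extra text glued on after the domain ending, could be repaired along these lines. This is a rough sketch and not the cleaning code used for the article: the list `KNOWN_SUFFIXES` and the function `trim_after_suffix` are hypothetical, and the approach would wrongly cut a legitimate domain that contains one of the endings in the middle.

```python
# Hypothetical list of expected domain endings; extend it for the data at hand.
KNOWN_SUFFIXES = (".co.za", ".com", ".org", ".net")


def trim_after_suffix(address):
    """Cut off anything that follows the first known domain ending."""
    lowered = address.lower()
    at = lowered.find("@")
    if at == -1:
        return lowered  # not an address we know how to repair
    cut = None
    for suffix in KNOWN_SUFFIXES:
        pos = lowered.find(suffix, at)
        if pos == -1:
            continue
        # Prefer the ending that starts earliest in the domain part.
        if cut is None or pos < cut[0]:
            cut = (pos, pos + len(suffix))
    return lowered[:cut[1]] if cut else lowered


print(trim_after_suffix("someone@example.comSent"))
```

Addresses that still fail validation after a pass like this would need to be inspected by hand.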

Now to the interesting stuff. Having extracted all of these e-mail addresses, we can create the #GuptaLeaks e-mail network. A snapshot of this network is below.

You can explore this network at pygraphistry. You can visually inspect the network, which is a directed graph. The graph keeps the properties of who sent what to whom and how many times those connections were made. We can also calculate some network properties that help us understand some characteristics of this e-mail network. The first is a histogram of the degree for each person in the network. The degree, specifically 'In-Degree' (how many e-mails you have received) and 'Out-Degree' (how many e-mails you have sent), will likely show that most people communicate very infrequently, while a few send or receive most of the e-mails. The degree histogram below shows that this is the case.

Most people in the network (the mode of the distribution) have sent 0 e-mails (Out-Degree), and most have received only 1 e-mail (In-Degree). We can look at who sent the most e-mails:

  •, 268 (emails sent)
  •, 104
  •, 91
  •, 77
  •, 65
  •, 61
  •, 45
  •, 39
  •, 34
  •, 31

Or who received the most e-mails:

  •, 105
  •, 53
  •, 32
  •, 28
  •, 23
  •, 18
  •, 16
  •, 14
  •, 13
  •, 13
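The degree counts and the two rankings above can be reproduced from a list of (sender, recipient) pairs. The sketch below uses only the standard library and made-up addresses. It counts every e-mail, so the degrees are weighted by how often a connection occurs, which is one reading of the In-Degree and Out-Degree described earlier.

```python
from collections import Counter

# Hypothetical (sender, recipient) pairs, one per e-mail sent.
edges = [
    ("a@example.com", "b@example.com"),
    ("a@example.com", "b@example.com"),
    ("a@example.com", "c@example.com"),
    ("b@example.com", "c@example.com"),
    ("d@example.com", "a@example.com"),
]

# Edge weights: how many times each sender -> recipient connection was made.
weights = Counter(edges)

out_degree = Counter(sender for sender, _ in edges)        # e-mails sent
in_degree = Counter(recipient for _, recipient in edges)   # e-mails received

# Everyone who appears in the network, including people who never sent anything.
people = {person for edge in edges for person in edge}

# Histogram: how many people have each Out-Degree (0 included).
out_hist = Counter(out_degree.get(person, 0) for person in people)

print("top senders:", out_degree.most_common(2))
print("top receivers:", in_degree.most_common(2))
print("out-degree histogram:", sorted(out_hist.items()))
```

NetworkX offers the same quantities on a `DiGraph` through its `in_degree` and `out_degree` views.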

The top domains given all the unique users:

  •, 117
  •, 49
  •, 42
  •, 36
  •, 32
  •, 27
  •, 16
  •, 16
  •, 13
  •, 11
  •, 11
  •, 11
  •, 11
  •, 10
  •, 10
  •, 10
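Counting domains over the unique users is a small step once the addresses are clean. A minimal sketch with made-up addresses:

```python
from collections import Counter

# Hypothetical set of unique, already cleaned addresses.
users = {
    "a@example.com",
    "b@example.com",
    "c@example.org",
    "d@example.co.za",
}

# The domain is everything after the last "@".
domain_counts = Counter(user.rsplit("@", 1)[1].lower() for user in users)

print(domain_counts.most_common(10))
```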

We can also calculate the centrality measures (they quantify how important someone is in the communication network) to identify the important people in the communication.

The plot above was created using code from "NetworkX introduction: Hacking social networks using the Python programming language".

As one can see, there are a number of standouts in the centrality plot. One interesting case is a person who tends not to send many e-mails but who exchanges them mostly with people who are very central, which solidifies his own high centrality too.
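The article does not say which centrality measures were plotted, so the sketch below shows only the simplest one, degree centrality, as an assumption: the number of distinct people someone is connected to, divided by the number of other people in the network. The node labels are made up.

```python
from collections import defaultdict

# Hypothetical distinct directed connections (sender, recipient).
edges = {("a", "b"), ("a", "c"), ("b", "c"), ("d", "a")}

successors = defaultdict(set)    # who each person has written to
predecessors = defaultdict(set)  # who has written to each person
for sender, recipient in edges:
    successors[sender].add(recipient)
    predecessors[recipient].add(sender)

nodes = set(successors) | set(predecessors)
# Normalise by the number of other people in the network.
scale = 1 / (len(nodes) - 1) if len(nodes) > 1 else 0.0

in_centrality = {n: len(predecessors.get(n, ())) * scale for n in nodes}
out_centrality = {n: len(successors.get(n, ())) * scale for n in nodes}

for node in sorted(nodes):
    print(node, round(in_centrality[node], 2), round(out_centrality[node], 2))
```

NetworkX provides these as `in_degree_centrality` and `out_degree_centrality`. A case like the one described above, someone who sends little but is tied to very central people, is the kind of pattern that measures such as eigenvector centrality are designed to pick up.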

This was just a dive into the #GuptaLeaks e-mail network, using graph mining to understand some of the phenomena you can discover by looking at connections between individuals.

Can you call the EMPD?

So, I got a speeding ticket notification via my bank. When I investigated it I found out it was in Ekurhuleni. I checked the rest of the details of the ticket. On that date I had not driven in the Ekurhuleni Municipality. Following this I got access to the speeding fine pictures and, lo and behold, the car was a bakkie (which I do not own). Then I noticed that the number plate was almost identical to my car's except for one letter. So the automatic number plate reader had made a mistake. Whew. Now how to fix this? Let me get hold of the EMPD traffic office. Easy, right? Well, this has now turned into a nightmare that keeps on going.

TLDR: I still have not gotten hold of an EMPD office after 43 calls to 20 unique numbers. It's getting a bit frustrating and perplexing.

Here I present all the calls I have made to try to resolve this issue. My biggest problem is that I have not been able to talk to anyone at EMPD. I have not at any time been able to connect to an EMPD office, even after being given so many different numbers by the 3 main call centers the City of Ekurhuleni runs.

Ekurhuleni Calls

Where do we start? At the beginning of course, asking on Twitter how to get in contact with EMPD.

Well. Got two numbers. Let's try both. Well .....

Well, this was just the beginning. Finally I got a call back from one of the numbers, and was told that they don't deal with traffic complaints (odd, since the city's Twitter account shared their number with a notice that it is for traffic complaints). Anyway, I got a number that I would later learn is for the emergency services. I felt really bad when I found this out; you don't want to misuse an emergency number. Well, the story gets even more interesting. The emergency center could not guarantee that any of the numbers they gave me would work, meaning they knew there is a problem with the numbers. Just FYI, these are the only contacts the city shares on its website.

Ekurhuleni Contacts

Did it get any better from here? No. Not by a long shot. I also got in contact with the official City of Ekurhuleni call center, where the operator matter-of-factly told me they do not keep any numbers for EMPD. They only have municipal office numbers. They proceeded to give me a list of numbers. By now you should know the pattern: none of the phone numbers worked. None, no joke.

Here I am a week later. I still have a fine for a car I don't own. I can't get in touch with EMPD. Help.

You can view my call data here

NIPS Authors 2006 - 2016

As part of the Deep Learning Indaba, we obtained a dataset of the accepted papers from NIPS 2006 - 2016 and the countries of the corresponding authors. We wanted to understand the scale of the non-participation of countries/regions as well as overall trends. The main post on this is now available on the Deep Learning Indaba page: Missing Continents: A Study using Accepted NIPS Papers. This post just adds some other visualisations that might provide more context. It's a short post.

Number of Accepted NIPS papers 2006 - 2016

Total Authors per Country [2006-2016]

Gratitude to all of those we spoke to during this exploration and also acknowledging the work of Emily Muller who we worked with on the visualisations.


The data was prepared by NIPS. There were some challenges with using it directly. NIPS did not use ISO country names. Some countries we had to check from the original papers. Authors seem to be counted primarily against the country of their institution, but we did see a few cases where the author's country was not the institution's country and NIPS used the author's country. We still believe that the analysis is interesting.
