Training future Data Scientists - Part 1: What's in a season? (A retrospective on DSIDE 2017)

Our work is never over - Kanye West, Stronger

We just finished another season of the Data Science for Insight and Decision Enablement (DSIDE) program. DSIDE is a Data Science training program that recruits 50 undergraduates (3rd and 4th years) and MSc/PhD students to come to the CSIR and tackle some of South Africa's challenges using a Data Science approach. The students spend 3 months at the CSIR, broken up into 1 month in the winter and 2 months in the summer. The program has been running since 2014 and the Department of Science and Technology is the main sponsor. You can find out more about the program on the program website. Now that the formalities are done, I want to look back at the just-finished season and highlight some of the changes, successes and failures. Running such programs is very interesting and stretches the limits of the program team every year. Just a caveat: the DSIDE program is run between the Modelling and Digital Science and Meraka units at the CSIR. Some experiences are shared between the groups of students based at both units, but some are unique to each unit. I will highlight this in the post.

What's in a Season?

First, let's describe what actually happens during a season. The 50 students recruited work in groups of 2 or 3 on a number of projects. We have had about 16 projects a year in the last few years. I believe a team of about 3 per project is a good number. It makes it easy to break ties and make decisions :D. Our Data Science team at MDS takes 18 students, so 6 projects a year. The students are split into these teams and then assigned a project topic and a mentor. The project topic is not simply a description of the project, but also access to a partner (who contributed the project topic) and data. The teams work to tackle the project challenge during the 3-month period they are given. The 1st month is focused on exploratory data analysis (EDA). The teams use it to refine their project challenge after spending some time with the data, essentially working out how feasible it is to tackle the challenge given the data, the partner interactions and the tools available.

Student team in meeting with a project partner, Mr. Piet Maseema from the City of Tshwane


The last 2 months are spent doing advanced analysis and modelling to create a product the teams can show as an output of their project. The modelling might use Machine Learning (ML)/Artificial Intelligence (AI), statistics, mathematical models and other tools in the Data Scientist's toolbox. The "product" can be insights, a dashboard, a model or anything else that can be used by a decision maker. We have been adjusting what these outputs are over time and still retain flexibility depending on the project challenge.

During the project execution, the students also get enrichment. This takes the form of workshops on specific topics such as Exploratory Data Analysis, Machine Learning, Code Management and other advanced topics. They are also encouraged to learn on their own. We make heavy use of online resources, encouraging students to use online lectures and tutorials. You can see some of the resources we recommend on our website. The students also author reports on their project and create artefacts such as posters or presentations. The capstone of the program is a public presentation and exhibition where the students talk to the public about their projects and exhibit their artefacts and insights. The program prepares the students to be Data Scientists, allows researchers to look at interesting problems in society and gets South Africa ready for the 4th Industrial Revolution. Whew, that was a 3-paragraph synopsis of DSIDE.

This is the first in a series of blog posts covering the 2017/2018 DSIDE program. In the next few blog posts I'll cover what happens in preplanning, what has changed over the years and where we might go.


Data Science Townhall and Q&A #1 (#DSIDE2017)

We have started the second session of the 2017/2018 DSIDE program. Our team at CSIR Modelling and Digital Science takes about 18 out of the 50 students on the program yearly. On the 8th of December we hosted Tefo Mohapi (Tech Journalist and Owner of iAfrikan).

Data Science Townhall and Q&A


Tefo talked a bit about his work and its relation to data. The Q&A session was free flowing, with questions on the ethics of data and the challenges with privacy and POPI. What became apparent was the general lack of public knowledge of the Protection of Personal Information (POPI) Act.

I want to express a lot of gratitude to Tefo for availing himself for the session.


SOCML: Self-Organizing Conference on Machine Learning 2017

I was invited to the Self-Organizing Conference on Machine Learning (SOCML) and attended at the end of November 2017. As per the name, the conference did not have a preset schedule but the participants themselves would put together the agenda on a daily basis and then have sessions. These are just my high level thoughts on the process.

Ian Goodfellow opening SOCML and planning the sessions for the day


This was my second "Unconference" format. I had attended a South African Treasury Unconference in September. The sessions I attended and wanted to share are next.

Analysis of the #GuptaLeaks e-mail network

Appeared on iAfrikan

With the release of a slice of the #GuptaLeaks e-mail correspondence by parliament, there is now an opportunity to do some data mining. For this article we will focus on looking at the network of e-mails between individuals in the network.

Where did the data come from? In this case, from the Parliamentary Portfolio Committee on Public Enterprises. On the website it states that "South Africa's parliamentary Portfolio Committee on Public Enterprises, and at the request of Honorable Acting Chair Z Rantho, PPLAAF has made available a small selection on documents and emails from the Guptaleaks."

As such, the information used in this article was published by parliament. The analysis done here merely uses this source material.


First, all of the PDFs from the website were converted to simple text. This makes it much easier to analyse.

What's next?

We need to extract and clean the data. We will be focusing on the e-mail network for this article, not the content. A few assumptions are going to be made that should be stated right at the beginning. We will assume that the first e-mail address we encounter in each document is the "from" address and that all subsequent addresses are those in the To, Cc and Bcc fields.
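As a rough illustration of that rule, here is a minimal sketch. The regular expression, the function name `split_sender_recipients` and the sample text are all assumptions made for this example, not the code used for the article. Lower-casing, removing duplicates and dropping the sender from the recipient list are extra choices made here for tidiness.

```python
import re

# A deliberately simple pattern for e-mail addresses (an assumption for this
# sketch; real-world addresses can be messier than this allows).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def split_sender_recipients(text):
    """Apply the simplifying rule: first address is the sender, the rest are recipients."""
    found = [addr.lower() for addr in EMAIL_RE.findall(text)]
    if not found:
        return None, []
    sender = found[0]
    recipients = []
    for addr in found[1:]:
        # Keep each recipient once, in order of first appearance.
        if addr != sender and addr not in recipients:
            recipients.append(addr)
    return sender, recipients


# Made-up document text, standing in for one converted PDF.
sample = """From: alice@example.com
To: bob@example.org; carol@example.net
Cc: bob@example.org
"""
sender, recipients = split_sender_recipients(sample)
```

A pattern like this will happily swallow text that runs straight on from an address, which is one way the extraction errors described further on can arise.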

This rule already breaks down as shown in the image above because an e-mail sometimes contains a chain of emails. For simplicity we will keep our (slightly flawed) assumption on who is the sender and who the e-mail is addressed to. This could thus be improved.

Now to cleaning. The extraction of e-mail addresses is not perfect. See the image below.

One can see that on the left the e-mails have errors after the .com. A Python library that tests whether an e-mail address is valid and points to a real server was used to find all the e-mails that had problems. From there, the common errors were spotted and fixed, resulting in the clean e-mails on the right of the image.
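One way such a fix could look is sketched below. It assumes the errors are stray characters glued on directly after a known suffix and starting with something other than a lowercase letter or a dot, which may not match every error in the real data. The suffix list and the name `trim_after_suffix` are hypothetical, and this pattern-based repair does not replace the validity check mentioned above.

```python
import re

# Hypothetical list of suffixes to repair; extend it for the data at hand.
KNOWN_SUFFIXES = r"\.(?:com|co\.za|net|org)"

# A known suffix followed by a character that cannot continue a domain name
# in this sketch (anything but a lowercase letter or a dot), then the rest.
TRAILING_JUNK = re.compile(rf"({KNOWN_SUFFIXES})[^a-z.].*$")


def trim_after_suffix(addr):
    """Cut off text that was glued on after a known domain suffix."""
    return TRAILING_JUNK.sub(r"\1", addr)


cleaned = [trim_after_suffix(a) for a in [
    "john@example.comSent",   # junk starting with an uppercase letter
    "mary@example.co.za>",    # junk starting with punctuation
    "sam@example.com",        # already clean, left unchanged
]]
```

Junk that starts with a lowercase letter is not caught by this sketch and would still need manual attention.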

Now to the interesting stuff. Now that we have been able to extract all of these e-mails, we can create the #GuptaLeaks e-mail networks. A snapshot of this network is below.

You can explore this network at pygraphistry. You can visually inspect the network, which is a directed graph. The graph keeps the properties of who sent what to whom and how many times those connections were made. We can also calculate some network properties that help us understand some characteristics of this e-mail network. The first is a histogram of the degree of each person in the network. The degree, specifically 'In-Degree' (how many e-mails you have received) and 'Out-Degree' (how many e-mails you have sent), will likely show that most people communicate very infrequently, while a few send or receive most of the e-mails. This is borne out by the degree histogram below.
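The bookkeeping behind such a directed graph can be sketched with the standard library alone. The message list below is made up, and the function names are assumptions for this example. Degrees here are weighted: each sender-to-recipient pair counts once per e-mail, which is one reading of "how many e-mails you have sent".

```python
from collections import Counter


def build_edges(messages):
    """Count how many times each (sender, recipient) pair occurs."""
    edges = Counter()
    for sender, recipients in messages:
        for recipient in recipients:
            edges[(sender, recipient)] += 1
    return edges


def degrees(edges):
    """Return weighted (in_degree, out_degree) counters from the edge counts."""
    in_degree, out_degree = Counter(), Counter()
    for (sender, recipient), weight in edges.items():
        out_degree[sender] += weight
        in_degree[recipient] += weight
    return in_degree, out_degree


def degree_histogram(degree, people):
    """Map each degree value to the number of people who have it."""
    return Counter(degree[person] for person in people)


# Made-up messages in (sender, recipients) form.
messages = [
    ("a@example.com", ["b@example.com", "c@example.com"]),
    ("a@example.com", ["b@example.com"]),
    ("b@example.com", ["c@example.com"]),
]
edges = build_edges(messages)
in_degree, out_degree = degrees(edges)
people = {person for pair in edges for person in pair}
in_hist = degree_histogram(in_degree, people)
out_hist = degree_histogram(out_degree, people)
```

Sorting `out_degree` or `in_degree` with `most_common(10)` would give rankings of the kind listed further on.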

Most (the mode of the distribution) of the people in the network have sent 0 e-mails (Out-Degree), while also most have only received 1 email (In-Degree). We can look at who sent the most emails:

  •, 268 (emails sent)
  •, 104
  •, 91
  •, 77
  •, 65
  •, 61
  •, 45
  •, 39
  •, 34
  •, 31

Or who received the most emails:

  •, 105
  •, 53
  •, 32
  •, 28
  •, 23
  •, 18
  •, 16
  •, 14
  •, 13
  •, 13

The top domains given all the unique users:

  •, 117
  •, 49
  •, 42
  •, 36
  •, 32
  •, 27
  •, 16
  •, 16
  •, 13
  •, 11
  •, 11
  •, 11
  •, 11
  •, 10
  •, 10
  •, 10
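A count like the one above can be sketched as follows, assuming the unique users are available as a collection of cleaned addresses. The function name `top_domains` and the sample addresses are made up for this example.

```python
from collections import Counter


def top_domains(addresses, n=10):
    """Count domains over unique users, so each address is counted once."""
    unique_users = {addr.lower() for addr in addresses}
    counts = Counter(addr.rsplit("@", 1)[1] for addr in unique_users)
    return counts.most_common(n)


# Made-up addresses; the second one differs from the first only in case.
addresses = ["a@example.com", "A@example.com", "b@example.com", "c@example.org"]
ranking = top_domains(addresses)
```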

We can also calculate the centrality measures (they quantify how important someone is in the communication network) to identify the important people in the communication.

The plot above was created using code from "NetworkX introduction: Hacking social networks using the Python programming language".

As one can see, there are a number of standouts in the centrality plot. One that is interesting is someone who tends not to send out a lot of emails, but who exchanges them mostly with people who are themselves very central, which solidifies his high centrality too.
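The article does not say which centrality measures were plotted, so the sketch below shows only the simplest one, degree centrality, as an assumption. It counts the distinct connections touching each person (in either direction) and divides by the number of other people in the network, which is the normalisation NetworkX's `degree_centrality` also applies. Self-loops are ignored here as a simplification, and the edge list is made up.

```python
from collections import Counter


def degree_centrality(edge_pairs):
    """Degree centrality on a directed graph given as (sender, recipient) pairs."""
    # Distinct edges only; repeated e-mails between the same pair count once.
    edge_set = {(s, r) for s, r in edge_pairs if s != r}
    nodes = {node for pair in edge_set for node in pair}
    degree = Counter()
    for sender, recipient in edge_set:
        degree[sender] += 1      # outgoing connection
        degree[recipient] += 1   # incoming connection
    scale = 1 / (len(nodes) - 1) if len(nodes) > 1 else 0.0
    return {node: degree[node] * scale for node in nodes}


# Made-up edges: "a" is connected to everyone else.
centrality = degree_centrality([("a", "b"), ("b", "a"), ("a", "c"), ("d", "a")])
most_central = max(centrality, key=centrality.get)
```

Measures such as betweenness or eigenvector centrality reward being connected to well-connected people, which is closer to the effect described above, but they need a graph library or more code than fits in a short sketch.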

This was just a dive into the #GuptaLeaks e-mail network, using graph mining to understand some of the phenomena you can discover by looking at connections between individuals.


Can you call the EMPD?

So, I got a speeding ticket notification via my bank. When I investigated it I found out it was in Ekurhuleni. I checked the rest of the details of the ticket. On that date I had not driven in the Ekurhuleni Municipality. Following this I got access to the speeding fine pictures and, lo and behold, the car was a bakkie (which I do not own). Then I noticed that the number plate was almost identical to my car's except for one letter. So the automatic number plate reader had made a mistake. Whew. Now how to fix this? Let me get hold of the EMPD traffic office. Easy, right? Well, this has now turned into a nightmare that keeps on going.

TLDR: I still have not gotten hold of an EMPD office after 43 calls to 20 unique numbers. It's getting a bit frustrating and perplexing.

Here I present all the calls I have made to try to resolve this issue. My biggest problem is that I have not been able to talk to anyone at EMPD. I have not at any time been able to connect to an EMPD office, even after being given so many different numbers by the 3 main call centers the City of Ekurhuleni runs.

Ekurhuleni Calls


Where do we start? At the beginning of course, asking on Twitter how to get in contact with EMPD.

Well. Got two numbers. Let's try both. Well .....

Well, this was just the beginning. Finally I got a call back from one of the numbers, and was told that they don't deal with traffic complaints (odd, the city Twitter account shared their number with a notice that it is for traffic complaints). Anyway, I got a number that I would later learn is for the emergency services. I felt completely bad when I found this out; you don't want to misuse an emergency number. Well, the story gets even more interesting. The emergency center could not guarantee that any of the numbers they gave me would work, meaning they knew there is a problem with the numbers. Just FYI, these are the only contacts the city shares on their website.

Ekurhuleni Contacts


Did it get any better from here? No. Not by a long shot. I also got in contact with the official City of Ekurhuleni call center, where the operator told me, matter-of-factly, that they do not keep any numbers for EMPD. They only have municipal office numbers. They proceeded to give me a list of numbers. You should know the pattern by now: none of the phone numbers worked. None, no joke.

Here I am a week later. I still have a fine for a car I don't own. I can't get in touch with EMPD. Help.

You can view my call data here
