Training future Data Scientists - Part 2: Preplanning

Reach for the stars so if you fall you land on a cloud - Kanye West, Homecoming

Before the students step on to the CSIR campus, a lot of preparation happens. How much preparation? A lot! Let’s talk about preplanning. We are going to break it down into program design, finding partners & problems, and recruiting students.

Program Design and Preplanning

So how many people does it take to run a program with 50 students showing up every season? A lot. We have 4 core program leads at the moment: Nyalleng Moorosi, Dr. Quentin Williams, Dhiren Seetharam and myself. We work together on the design of the program and on the coordination and organisation of all the other parts. The program design is always a work in progress. Our goal is to provide a rigorous training program that delivers value for our partners. As such, we have set expectations on:

  • what happens in both parts of the season,
  • what a typical day should be like,
  • what happens during a typical week,
  • when deliverables are due,
  • what a deliverable is,
  • when we start,
  • what workshops will be available,
  • evaluation,
  • ambitions to get better,
  • which other non-curricular enhancements we add to the schedule.

Thus in this preplanning phase, we discuss the philosophies we each might have and what changes we might introduce in the new season. This is a collaboration that stretches all of us and pushes us to think of the impact our own decisions have on the program. Our ambitions for each season have to be high, and we are cognisant that this also means more pressure on the rest of the participants. To reach our goals, we work with other CSIR staff for recruitment, CSIR researchers for project leads, mentors who each oversee a single project, etc. Read more ›


Training future Data Scientists - Part 1: What's in a DSIDE season?

Our work is never over - Kanye West, Stronger

We just finished another season of the Data Science for Insight and Decision Enablement (DSIDE) program. DSIDE is a Data Science training program that recruits 50 undergraduates (3rd & 4th years) and MSc/PhD students to come to the CSIR and tackle some of South Africa's challenges using a Data Science approach. The students spend 3 months at the CSIR, broken up into 1 month in the winter and 2 months in the summer. The program has been running since 2014 and the Department of Science and Technology is the main sponsor. You can find out more about the program on the program website. Now that the formalities are done, I wanted to look back at the just-finished season and highlight some of the changes, successes and failures. Running such programs is very interesting and stretches the limits of the program team every year. Just a caveat: the DSIDE program is run between the Modelling and Digital Science and Meraka Units at the CSIR. Some experiences will be shared between the groups of students based at both, but some are unique to each unit. I will highlight this in the post.

What's in a Season?

First let's start by describing what actually happens during a season. The 50 students recruited work in groups of 2 or 3 on a number of projects. We have had about 16 projects a year in the last few years. I believe a team of about 3 per project is a good number. It makes it easy to break ties and make decisions :D. Our Data Science team at MDS takes 18 students, so 6 projects a year. The students are split into these teams and then assigned a project topic and a mentor. The project topic is not simply a description of the project, but also access to a partner (who contributed the project topic) and data. The teams work to tackle the project challenge during the 3 month period they are given. The 1st month is focused on exploratory data analysis (EDA). The teams refine their project challenge after spending some time with the data, essentially working out how feasible it is to tackle the challenge with the data given, the partner interactions and the tools available. Read more ›


Data Science Townhall and Q&A #1 (#DSIDE2017)

We have started the second session of the 2017/2018 DSIDE program. Our team at CSIR Modelling and Digital Science takes about 18 out of the 50 students on the program yearly. On the 8th of December we hosted Tefo Mohapi (Tech Journalist and Owner of iAfrikan).

Data Science Townhall and Q&A


Tefo talked a bit about his work and its relation to data. The Q&A session was free flowing, with questions on the ethics of data and challenges with privacy and POPI. What became apparent was the general lack of public knowledge of the Protection of Personal Information (POPI) Act.

I want to express a lot of gratitude to Tefo for availing himself for the session.


SOCML: Self-Organizing Conference on Machine Learning 2017

I was invited to the Self-Organizing Conference on Machine Learning (SOCML) and attended at the end of November 2017. As per the name, the conference did not have a preset schedule but the participants themselves would put together the agenda on a daily basis and then have sessions. These are just my high level thoughts on the process.

Ian Goodfellow opening SOCML and planning the sessions for the day


This was my second "Unconference" format. I had attended a South African Treasury Unconference in September. The sessions I attended and wanted to share are next. Read more ›

Analysis of the #GuptaLeaks e-mail network

Appeared on iAfrikan

With the release of a slice of the #GuptaLeaks e-mail correspondence by parliament, there is now an opportunity to do some data mining. For this article we will focus on looking at the network of e-mails between individuals in the network.

Where did the data come from? In this case, from the Parliamentary Portfolio Committee on Public Enterprises. On the website it states that "South Africa's parliamentary Portfolio Committee on Public Enterprises, and at the request of Honorable Acting Chair Z Rantho, PPLAAF has made available a small selection on documents and emails from the Guptaleaks."

As such, the information used in this article was published by parliament. The analysis done here merely uses this source material.


First, all of the PDFs from the website were converted to simple text. This makes it much easier to analyse.

What's next?

We need to extract and clean the data. We will be focusing on the e-mail network for this article, not the content. A few assumptions are going to be made that should be stated right at the beginning. We will assume that the first e-mail address we encounter in each of the documents is the "from" address and that all subsequent addresses are those in the to, cc and bcc fields.

This rule already breaks down as shown in the image above because an e-mail sometimes contains a chain of emails. For simplicity we will keep our (slightly flawed) assumption on who is the sender and who the e-mail is addressed to. This could thus be improved.
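The first-address-is-the-sender rule can be sketched in a few lines of Python. This is only an illustration of the assumption described above, not the code used for the article: the regular expression, the function name `extract_edges` and the sample text are all made up for the example, and the pattern is deliberately simpler than real address syntax.

```python
import re

# Deliberately simple pattern (an assumption); real address syntax is broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_edges(text):
    """Return (sender, recipient) pairs using the first-address-is-sender rule."""
    addresses = [a.lower() for a in EMAIL_RE.findall(text)]
    if not addresses:
        return []
    sender, recipients = addresses[0], addresses[1:]
    # Skip the sender's own address if it shows up again, e.g. in a quoted chain.
    return [(sender, r) for r in recipients if r != sender]


# Hypothetical document text, not taken from the released files.
sample = """From: alice@example.com
To: bob@example.org
Cc: carol@example.net
"""
edges = extract_edges(sample)
```

Every address after the first becomes a recipient, which is exactly why a forwarded chain inside one document produces some wrong edges.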

Now to cleaning. The extraction of e-mail addresses is not perfect. See the image below.

One can see that on the left the e-mails have errors after the .com. A Python library that tests whether an email address is valid and points to a real server was used to find all the emails that had problems. From there, all the common errors were spotted and fixed, resulting in the clean emails on the right of the image.
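To give a feel for the kind of fix involved, here is a minimal sketch that cuts an address off after a known domain suffix when extra text has been glued on. It does not reproduce the validation library mentioned above (no server lookup is done), and the suffix list and the function name `clean_address` are assumptions for illustration only.

```python
import re

# Hypothetical list of suffixes to trust; the article does not list the ones it used.
KNOWN_SUFFIXES = ("co.za", "gov.za", "com", "org", "net")
_SUFFIX_ALT = "|".join(re.escape(s) for s in KNOWN_SUFFIXES)

# The address part is matched case-insensitively. The lookahead stays
# case-sensitive so that "comSent" is cut after "com" while "company" is not.
ADDRESS_RE = re.compile(
    r"(?i:([a-z0-9._%+-]+@[a-z0-9.-]+?\.(?:" + _SUFFIX_ALT + r")))(?![a-z])"
)


def clean_address(raw):
    """Trim trailing debris after a known suffix; return None if nothing matches."""
    match = ADDRESS_RE.match(raw.strip())
    return match.group(1).lower() if match else None
```

Addresses that come back as None would still need to be checked by hand, for example when the glued-on text starts with a lowercase letter.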

Now to the interesting stuff. Now that we have been able to extract all of these e-mails, we can create the #GuptaLeaks e-mail networks. A snapshot of this network is below.

You can explore this network at pygraphistry. You can visually inspect the network, which is a directed graph. The graph keeps the properties of who sent what to whom and how many times those connections were made. We can also calculate some network properties that allow us to understand some characteristics of this e-mail network. The first is a histogram of the degree for each person in the network. The degree, specifically 'In-Degree' (how many e-mails you have received) and 'Out-Degree' (how many e-mails you have sent), will likely show that most people communicate very infrequently, while a few send or receive most of the emails. This is borne out by the degree histogram below.

Most people in the network (the mode of the distribution) have sent 0 e-mails (Out-Degree), and most have received only 1 e-mail (In-Degree). We can look at who sent the most emails:

  •, 268 (emails sent)
  •, 104
  •, 91
  •, 77
  •, 65
  •, 61
  •, 45
  •, 39
  •, 34
  •, 31

Or who received the most emails:

  •, 105
  •, 53
  •, 32
  •, 28
  •, 23
  •, 18
  •, 16
  •, 14
  •, 13
  •, 13
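The degree counts and rankings above can be reproduced from a list of (sender, recipient) pairs with the standard library alone. The sketch below uses made-up addresses and is not the code behind the article's numbers. It counts each e-mail, so a repeated connection adds to the degree, in line with the definitions of In-Degree and Out-Degree given earlier.

```python
from collections import Counter

# Hypothetical (sender, recipient) pairs, one per e-mail connection found.
edges = [
    ("a@example.com", "b@example.com"),
    ("a@example.com", "b@example.com"),
    ("a@example.com", "c@example.com"),
    ("b@example.com", "c@example.com"),
    ("a@example.com", "d@example.com"),
]

# Edge weight = how many times a sender wrote to a recipient.
weights = Counter(edges)

out_degree = Counter()  # e-mails sent per person
in_degree = Counter()   # e-mails received per person
for (sender, recipient), count in weights.items():
    out_degree[sender] += count
    in_degree[recipient] += count

people = set(out_degree) | set(in_degree)

# Histogram: degree value -> number of people with that degree.
# A Counter returns 0 for people who never sent (or never received) anything.
out_hist = Counter(out_degree[p] for p in people)
in_hist = Counter(in_degree[p] for p in people)

top_senders = out_degree.most_common(3)
top_receivers = in_degree.most_common(3)
```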

The top domains given all the unique users:

  •, 117
  •, 49
  •, 42
  •, 36
  •, 32
  •, 27
  •, 16
  •, 16
  •, 13
  •, 11
  •, 11
  •, 11
  •, 11
  •, 10
  •, 10
  •, 10
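Counting domains over unique users is a small variation on the same idea. Again, this is a sketch with invented addresses; the only point it shows is that each address is counted once, however many e-mails it appears in.

```python
from collections import Counter

# Hypothetical cleaned addresses, including a repeat that should count once.
addresses = [
    "a@example.com", "b@example.com", "a@example.com",
    "c@example.org", "d@example.net", "e@example.org", "f@example.com",
]

unique_users = set(addresses)
# Everything after the last "@" is treated as the domain.
domain_counts = Counter(addr.rsplit("@", 1)[1] for addr in unique_users)
top_domains = domain_counts.most_common(2)
```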

We can also calculate the centrality measures (they quantify how important someone is in the communication network) to identify the important people in the communication.

The plot above was created using code from "NetworkX introduction: Hacking social networks using the Python programming language".

As one can see, there are a number of stand-outs in the centrality plot. One interesting case is a person who tends not to send out a lot of emails but sends/receives them mostly to and from those who are very central, which solidifies his high centrality too.
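The effect described here, where a person with few connections still scores highly because those connections are central, is what eigenvector-style centrality captures. The article does not say which centrality measures were plotted, so the sketch below is an assumption: it compares plain degree centrality with an eigenvector centrality computed by power iteration, on a small invented network that ignores edge direction.

```python
from collections import defaultdict

# Hypothetical undirected contact list: h1 and h2 are busy hubs,
# q only talks to the two hubs, p only talks to two minor contacts.
links = [
    ("h1", "h2"), ("h1", "a"), ("h1", "b"), ("h1", "c"),
    ("h2", "d"), ("h2", "e"), ("h2", "f"),
    ("q", "h1"), ("q", "h2"), ("p", "a"), ("p", "b"),
]

neighbours = defaultdict(set)
for u, v in links:
    neighbours[u].add(v)
    neighbours[v].add(u)


def degree_centrality(neighbours):
    """Fraction of the other nodes that each node is directly linked to."""
    n = len(neighbours)
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


def eigenvector_centrality(neighbours, iterations=100):
    """Power iteration: a node scores highly when its neighbours score highly."""
    score = {node: 1.0 for node in neighbours}
    for _ in range(iterations):
        # Adding the node's own score (a shifted update) avoids oscillation.
        new = {
            node: score[node] + sum(score[m] for m in neighbours[node])
            for node in neighbours
        }
        norm = sum(value * value for value in new.values()) ** 0.5
        score = {node: value / norm for node, value in new.items()}
    return score


degree = degree_centrality(neighbours)
eigen = eigenvector_centrality(neighbours)
```

In this toy network q and p have the same degree, yet q ends up with the higher eigenvector score because both of its contacts are hubs.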

This was just a dive into the #GuptaLeaks e-mail network using graph mining to understand some phenomena that you can discover from looking at connections between individuals.
