Analysing Open Data from the Canadian Government with Python
I am a business analyst who wants to do a better job of process monitoring and improvement. This is why I became interested in data analysis and started taking courses on data science with R (the Coursera courses on Data Science) and with Python (the Coursera course on Applied Data Science with Python).
To test what I have learned on a new example, I defined the following research question: What is the percentage of approved Labour Market Impact Assessments out of requested work positions, by province and for the country, from 2008 to 2015?
The page on Temporary Foreign Worker Program 2008-2015 has several publicly accessible datasets. My analysis focuses on the years 2008-2015, but as this page is often updated, you may find datasets with data more recent than 2015. Looking through them, I selected two related datasets, each stored as a CSV file. Both datasets give the number of temporary foreign worker (TFW) positions by province/territory. The first dataset covers positive Labour Market Impact Assessments (LMIAs) and the second one covers requested LMIAs.
Before looking into the data, note that you can access the entire source code and run it either in the Jupyter Notebook or in the Python IDE PyCharm, after setting up the environment with these steps:
- Go to the folder where the file is located:
cd data-science
- Create a virtual environment to hold the libraries needed for this specific project:
python3 -m venv venv
- Activate the virtual environment:
source venv/bin/activate
- Install the libraries:
pip install pandas
- In PyCharm, use this virtual environment by editing the Run/Debug configurations and choosing the virtual environment as the Python interpreter.
Explore the data
When starting the analysis, the first step is to explore the data. For that, I read the CSV files using the read_csv function from the pandas library.
The first lines are to import the libraries:
Then, read the CSV file from a URL:
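As a minimal sketch of these two steps, the import and a first read could look like the following. The dataset URL is not reproduced here, so DATA_URL is a placeholder and the small inline table is a hypothetical, simplified stand-in (without the extra title and footer lines) that lets the example run offline.

```python
import io

import pandas as pd

# Placeholder only: the real dataset URL is not reproduced here.
# pd.read_csv accepts a URL string in the same way as a local file path.
# DATA_URL = "https://example.org/hypothetical-lmia-positions.csv"
# df = pd.read_csv(DATA_URL)

# Hypothetical stand-in data (invented numbers) so the sketch runs offline.
sample = io.StringIO(
    "Province/Territory,2008,2009\n"
    "Alberta,1200,1500\n"
    "Ontario,2300,2100\n"
)
df = pd.read_csv(sample)

# A quick look at the first rows and the column types.
print(df.head())
print(df.dtypes)
```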
When you look at the dataset, it starts with a header line followed by a blank line, and it ends with some explanations in the footer. Here’s an extract:
More arguments are added to the read_csv function to skip the first two rows and the last 4 lines in the footer, and to set the thousands separator to a comma.
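The call described above can be sketched as follows. The sample text is hypothetical: it only imitates the layout described (a title line, a blank line, the column names, values with comma thousands separators, and four footer lines), and the helper name load_lmia_table is an assumption made for illustration. Note that skipfooter is only supported by the Python parsing engine, so engine="python" is set explicitly.

```python
import io

import pandas as pd

# Hypothetical sample imitating the described layout; all values are invented.
SAMPLE_CSV = (
    "Hypothetical table title\n"
    "\n"
    "Province/Territory,2008,2009\n"
    'Alberta,"1,200","1,500"\n'
    'Ontario,"2,300","2,100"\n'
    'Canada,"3,500","3,600"\n'
    "Notes:\n"
    "1. Hypothetical note one.\n"
    "2. Hypothetical note two.\n"
    "Source: hypothetical.\n"
)


def load_lmia_table(source):
    """Read one table, skipping the 2 leading rows and the 4 footer lines."""
    return pd.read_csv(
        source,
        skiprows=2,       # title line and blank line
        skipfooter=4,     # explanatory lines at the end of the file
        thousands=",",    # parse "1,200" as the integer 1200
        engine="python",  # skipfooter is not supported by the C engine
    )


table = load_lmia_table(io.StringIO(SAMPLE_CSV))
print(table)
```

The same function would accept a URL or a file path in place of the in-memory sample.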
Here’s an extract of the dataset:
The visual graphic to address the research question
This visual addresses the research question by showing the percentage of approved Labour Market Impact Assessments (LMIAs) out of those requested, by province and for the country, for the years 2008 to 2015.
The datasets were selected from the Open Data Initiative website of the Government of Canada. To understand the analysis, it is important to know that, as stated on the website of the Government of Canada, not all positions on an approved LMIA result in a work permit or citizenship, which are covered by other statistics.
This analysis takes the annual LMIA statistics per province and for the country from 2008 to 2015. First, the two selected datasets are merged, one for requested LMIAs and the other for approved ones. The annual values are then summed to give the total requested and the total approved. The total discontinued is the total requested minus the total approved. Discontinued requests can represent different outcomes, such as rejections or withdrawals. Finally, the percentages are calculated and displayed in descending order, starting with the provinces that have the highest percentage of approved LMIAs, which is the aim of the research question.
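The steps above can be sketched with pandas as follows. This is only an illustration under stated assumptions: the figures are invented, and the column name "Province/Territory" and the function name approval_summary are hypothetical, not taken from the original source code.

```python
import pandas as pd

# Hypothetical figures standing in for the two datasets (positions per year).
requested = pd.DataFrame({
    "Province/Territory": ["Alberta", "Ontario", "Canada"],
    "2008": [1000, 2000, 3000],
    "2009": [1000, 2000, 3000],
})
approved = pd.DataFrame({
    "Province/Territory": ["Alberta", "Ontario", "Canada"],
    "2008": [900, 1500, 2400],
    "2009": [700, 1500, 2200],
})


def approval_summary(requested, approved, key="Province/Territory"):
    """Merge both tables, sum the years, and compute approval percentages."""
    merged = requested.merge(approved, on=key, suffixes=("_requested", "_approved"))
    requested_cols = [c for c in merged.columns if c.endswith("_requested")]
    approved_cols = [c for c in merged.columns if c.endswith("_approved")]

    summary = pd.DataFrame({
        key: merged[key],
        "requested": merged[requested_cols].sum(axis=1),
        "approved": merged[approved_cols].sum(axis=1),
    })
    # Discontinued = requested minus approved (rejections, withdrawals, etc.).
    summary["discontinued"] = summary["requested"] - summary["approved"]
    summary["approved_pct"] = 100 * summary["approved"] / summary["requested"]
    summary["discontinued_pct"] = 100 - summary["approved_pct"]
    # Highest approval percentage first, as in the visual.
    return summary.sort_values("approved_pct", ascending=False).reset_index(drop=True)


summary = approval_summary(requested, approved)
print(summary)
```

Keeping the country row in the same table makes it easy to see where it falls among the provinces once the rows are sorted.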
Seven provinces have the same or a higher percentage of approvals than the country as a whole. The remaining provinces are not far below the country percentage.
Analysis of the visual graphic
I use Alberto Cairo’s principles to analyse my own visual graphic:
1) Truthful. For each dataset, requested and approved, the values for each year are summed to give the total per province. The requested total is the whole, and the approved total is a percentage of it. The remaining percentage covers the requests that were not approved, whatever the outcome (e.g. withdrawal, rejection), so they are treated as discontinued requests. Once the approved and discontinued totals are calculated per province and for the country, the percentages are calculated.
Percentages are displayed because the quantities differ substantially between provinces. With absolute quantities, only the provinces with higher quantities were visible, and those with lower quantities were not. Therefore, the percentages were calculated and displayed in the graph to represent the approvals per province and for the country.
2) Functionality. Since the focus of the research question is on the approval of LMIAs, each bar of the visual carries a label with the percentage of approvals for that province or for the country. To organise the percentage of approvals by province, a stacked bar showing both approved and discontinued requests seems to be the most appropriate.
3) Beauty. To make the visual captivating and easy to understand, it is kept clean: the fonts have a readable size, the colors of the bars were selected to be aesthetically pleasing, the ticks on both the x and y axes were removed, the axis labels were left out because the title already mentions them, a transparent grid was put in the background as a guide, and the legend was placed in a blank area to make the best use of the image space.
4) Insightful. Ordering the bars from the province with the highest approval percentage to the lowest aims to make the visual more memorable and to show the results quickly and clearly. For people interested in applying for LMIAs, it should give an ah-ha moment by helping them quickly identify the provinces with the highest percentage of approvals from 2008 to 2015.
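The design choices listed above can be sketched with matplotlib as follows. This is an illustration, not the original plotting code: matplotlib is an assumption (it is not among the libraries installed in the setup steps and would need `pip install matplotlib`), the percentages are invented, and the colors, title and legend position are hypothetical choices.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical percentages, already sorted from highest to lowest approval.
regions = ["Alberta", "Canada", "Ontario"]
approved_pct = [80.0, 76.7, 75.0]
discontinued_pct = [100 - p for p in approved_pct]

fig, ax = plt.subplots(figsize=(8, 4))

# Stacked bars: approved at the bottom, discontinued on top.
approved_bars = ax.bar(regions, approved_pct, color="#4c72b0", label="Approved")
ax.bar(regions, discontinued_pct, bottom=approved_pct, color="#c7c7c7",
       label="Discontinued")

# Label each bar with its approval percentage.
for bar, pct in zip(approved_bars, approved_pct):
    ax.text(bar.get_x() + bar.get_width() / 2, pct / 2, f"{pct:.0f}%",
            ha="center", va="center", color="white", fontsize=12)

ax.tick_params(axis="both", length=0)  # remove tick marks on both axes
ax.grid(axis="y", alpha=0.3)           # light, transparent guide lines
ax.set_axisbelow(True)                 # keep the grid behind the bars
ax.set_ylim(0, 115)                    # leave blank space above the bars
ax.legend(loc="upper center", ncol=2, frameon=False)
ax.set_title("Approved LMIA positions as % of requested, 2008-2015")

# fig.savefig("lmia_approvals.png")  # uncomment to write the image to disk
```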