This is a preliminary report for questions and comments.
The data used in the report was obtained from the City of Cape Town Open Data Portal in January 2017 from [2].
The data in this report are not in any way certified to be correct and in no way does the author accept any loss or liability from damages using this report or data. Please contact the author [1] for any corrections or questions. It should be noted that the author is not an air quality expert.
We confined ourself to the pollution data at the following test stations
We used the following files (here we use standardized file names)
We globbed the various pollution readings together into one table as
The data in chain brackets is nominally permissible levels. We discuss this later. Note the warning to their correctness.
Here ug/m3 is micro grams per cubic meter \(\dfrac {\mu g }{ m^{3} }\), mb is millibars, ppb is parts per billion and lngy/m2 is Langley per square meter (which is 41 840 Joules/m² or 11.622 watt-hours per square meter \(\dfrac{W h}{m^2}\) where \(\dfrac{1 W } {m^2}\) = \(\dfrac{24 W h} {m^2 day}\) .)
The work was done using Java, Postgres and R. The code (.java and report) is available on Bitbucket ([3], sign in required). However the data is too big and sits in csv files (the author will provide on request and make the data available via a API DB query. Currently there are 163 807 rows.)
We had to clean the data to make it usable. File names had to be standardized, .xls and .ods formats were changed to .csv.
Most of the pollution data is integer, however strange types occur such as in SO2 there is a row 26/10/2014 04:00;0.8;0;15;0;5;3;0;1;1;1. Here the float was replaced by 1.
Some samples contained “<Samp”, “RS232”, “Down”, “NoData”, “InVld”, “Calib” and other text labels which were replaced with “null”.
The author used the calculation of the Air Quality Index of the United Kingdom\(^{[4]}\) and Europe\(^{[5]}\) to determine permissible levels because of their availability. Clearly the values are merely reference values, because for example, we do not account for the period of measurement (nor is that information available). The permissible levels for Cape Town South Africa, will differ. Presumably the SABS will have appropriate levels.
This report should be understood as someone with a torch going into a dark forest to find what they can find.
We begin our investigation by looking at the Atlantis testing station.
We choose to look at Nitrogen Monoxide levels.
First we load the data then we visualise it.
Let us look at a summary of the data:
## date time no
## Min. :2013-01-01 Length:20449 Min. : -9
## 1st Qu.:2013-08-02 Class :character 1st Qu.: 2
## Median :2014-03-03 Mode :character Median : 6
## Mean :2014-03-20 Mean : 11
## 3rd Qu.:2014-10-02 3rd Qu.: 13
## Max. :2015-08-31 Max. :159
## NA's :9467
## [1] 20449
We see that we have 9467 nulls or not available data. The data goes from January 2013 to August 2015, with big holes in it. We have to be really careful to treat null data, no data and zero data correctly.
There are 20 449 lines of data. To get some kind of concept on the size we have \(365 \times 2 \times 24 + (30 + 31 + 31 + 31) \times 24 = (365\times2 + 3\times31 + 30) \times 24 = 20 472\) possible samples. The small discrepancy between 20449 and 20 472 should me investigated but is not material.
Doing the same count across all suburbs and testing stations: \(7 \times 365 \times 2 \times 24 + 14 \times (30 + 31 + 31 + 31) \times 24 = ( 365 + 3\times31 + 30) \times 24 \times 14 = 20 472\)
We look at 7 stations in the 2013 and 2014 data (14 in the 2015) which implies 163 968 rows which is not too different from 163 807 in the table.
Next we look at the histogram
The data is largely zero, and does not exceed 50. Unfortunately we have some negative data, which makes no sense.
We now look at the time series.
Initially we do a crude approximation of the x-axis as an index. We checked that the dates behaved as expectedly by sorting them and plotting them against the index. This is a naive plot to give us a feel for the data.
As mentioned unavailable data is simply not plotted - this is not merely data flagged as null, or “zero” data, but data that does not exist. This is remedied in the next plot.
Then we plot the x axis as date & time.