During last weeks I’ve been doing a MOOC on Data visualization for storytelling and discovery with Alberto Cairo, which I intensely recommend. I’ll post here some of the findings I’ve got from there. The studies are not totally finished as they would need more work to be presented as a journalistic piece, so shouldn’t be taken as more than an exercise in the learning process.
1. Dataset BiciMad
First, I wanted to go local, and I live in Madrid. In my city we have a relatively new public bike rental service, and they have their datasets available, so I got a dataset with the data on the new daily users.
In the histogram I can see the concentration and the spread of the data. There’s a curious outlier that corresponds with the maximum value of the dataset: 1446 and there’s another isolated value around 700. I find those two points like something worth of more research. Probably they correspond to the day that the service started or went open to the public.
The x axis represents the number of new users of annual tickets per day. The y line represents the number of days that those users where registered. The distribution is skewed to the right, due to the outliers to the higher values of annual passes some few (2-6) days.
The box plot shows the concentration of what could be a usual number of new users per day. The median is 132 and the mean is 133, so during that year (2014) that is the number of new users per day of this service. It could be useful to compare it with datasets of other years and other kind of information to see what variables make people decide to hop on bikes as a way of transportation in the city.
2. Second case: Comparing education expenditure (%) with GINI Index in the last years in Argentina
I was born in Argentina, and there we have been having not very good official statistics in the last years in terms of transparency, so getting good analysis on that kind of data is usually extremely complicated.
So I used data of the World Bank on three variables: total government expenditure on education, school enrollment primary private and GINI index. I know GINI is made of several indicators and not only education but I wanted to give it a try and see how it correlates.
I used data from 1980 to 2015. The highest expenditure in education in general was in 2015, with 5.875 % of the GDP. In 1980 there is an outlier point with 2,6 % of GDP expended before a dark period of 15 years where there are no registry or the data we have goes below 2,6 %.
From 1996 the line rises and shows a positive evolution until the last year in the series (2015), with some hiccup between 2002 and 2005, the years of the default crisis and political unstability in Argentina. The trend overall is positive, with a rank correlation of 0.86 (using Spearman’s Rank Correlation).
The GINI index is the most commonly used measurement of inequality. A Gini coefficient of 1 (or 100%) expresses maximal inequality among values. So if the GINI index goes down it’s best in terms of equality for the country. For OECD countries, in the late 20th century, considering the effect of taxes and transfer payments, the income Gini coefficient ranged between 0.24 and 0.49.
When I added the GINI index using the colors in the values, I found that there’s a positive correlation, as in the last years where the expenditure on education is higher, the GINI index goes down (which means that Argentina gets closer to equality). There are some quite interesting periods of time, anyway, when this correlation does not happen.
One is during 1980-1990 the expenditure was lower, quite less than 2,6%, and the GINI index kept below 45. It should be said that we have some missing values those years, and we should investigate further to reach any conclusion.
The other is an outlier in 2001, when the government expenditure on education is 4.833740234, the highest in the period until 2009, but the GINI index in that year is the highest of the total number of observations, that is very bad for the equality in the country. I find this observation interesting as 2001 is one of the worst years of the crisis, when Argentina went into financial default.
Una respuesta a «Exploring datasets: Bikes in Madrid and education expenditure in Argentina»