Here at Stack Overflow, we’re interested in using our data to share insights about the worldwide software development community. This recent post on the distribution of mobile developers is a good example: it explored traffic to Android questions from around the world, and found that Android tended to be visited more from lower-income countries than from higher-income ones.
This leads us to wonder how else programming technologies may differ between rich and poor countries, and how that affects our picture of the global software development industry. In this post, we’ll explore these differences, and show that it’s useful to segment the software development industry into high-income countries and the rest of the world.
All the analyses explored here use data from 2017 so far (January through August), restricted to the 250 tags that had the most traffic during that time. To reduce the effect of noise, we analyzed only the 64 countries that had at least 5 million question visits in this time period. It’s also worth noting that this data represents activity among developers who understand English (some analyses of the Spanish and Portuguese sites suggest that similar trends apply for non-English speakers in countries such as Mexico and Brazil).
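The filtering described above can be sketched in a few lines. This is a minimal illustration rather than the code behind the post: the record layout, the `filter_visits` helper, and every number in `visits` are invented, and only the two thresholds (5 million visits, 250 tags) come from the text.

```python
from collections import defaultdict

# Hypothetical records: (country, tag, question_visits). All numbers invented.
visits = [
    ("United States", "python", 9_000_000),
    ("United States", "php", 4_000_000),
    ("India", "android", 6_000_000),
    ("India", "php", 5_000_000),
    ("Iceland", "python", 300_000),
]

MIN_COUNTRY_VISITS = 5_000_000  # threshold stated in the post
TOP_N_TAGS = 250                # number of tags stated in the post


def filter_visits(records, min_country_visits, top_n_tags):
    """Keep only well-sampled countries and the most visited tags."""
    country_totals = defaultdict(int)
    tag_totals = defaultdict(int)
    for country, tag, n in records:
        country_totals[country] += n
        tag_totals[tag] += n

    # Countries with enough total traffic to give stable percentages.
    kept_countries = {
        c for c, total in country_totals.items() if total >= min_country_visits
    }
    # The top-N tags by overall traffic.
    top_tags = set(sorted(tag_totals, key=tag_totals.get, reverse=True)[:top_n_tags])

    return [r for r in records if r[0] in kept_countries and r[1] in top_tags]


filtered = filter_visits(visits, MIN_COUNTRY_VISITS, TOP_N_TAGS)
print(sorted({country for country, _, _ in filtered}))
```

Dropping low-traffic countries first matters because a tag’s share of a small country’s visits can swing widely on a handful of page views.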
Technologies correlated with GDP per capita
In a recent post, we saw that the traffic to Android questions (as a percentage of a country’s Stack Overflow visits) tends to be negatively correlated with a country’s GDP per capita. This may lead us to wonder if the same is true of any other tags.
When we explore major programming languages and platforms, some that stand out besides Android include PHP, Python, and R.
The amount of Android and PHP traffic is negatively correlated with a country’s income, while Python and R are positively correlated. In each case we can see exceptions (Korea uses more Android than we’d expect, and China more Python), but generally the correlations are strong. (Each has an R² of around 0.5 to 0.6, with p-values < 10⁻⁶ after adjusting for multiple testing.)
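For readers who want to see what those two quantities mean in practice, here is a rough sketch. It rests on assumptions the post does not confirm: that income is compared on a log scale, and that the multiple-testing adjustment is Bonferroni (the post does not name the method used). The data, `r_squared`, and `bonferroni` are all made up for illustration.

```python
import math

# Hypothetical per-country data: GDP per capita (USD) and the percentage of
# that country's Stack Overflow visits going to one tag. Invented numbers.
gdp_per_capita = [1_900, 3_600, 8_700, 15_000, 27_000, 42_000, 57_000]
tag_percent = [7.9, 7.1, 6.0, 5.2, 4.1, 3.5, 2.8]


def r_squared(x, y):
    """Coefficient of determination for a simple linear fit of y on x.

    For one predictor this equals the squared Pearson correlation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Assumption: income is compared on a log scale.
log_gdp = [math.log10(g) for g in gdp_per_capita]
print(f"R^2 = {r_squared(log_gdp, tag_percent):.2f}")

# Hypothetical raw p-values, one per tag tested.
raw_p = [1e-9, 4e-8, 0.03]
print(bonferroni(raw_p))
```

The adjustment is needed because testing hundreds of tags at once would otherwise turn up some strong-looking correlations by chance alone.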
We’ll emphasize that we’re not suggesting any causality here. We’re certainly not suggesting that programming language choice affects a country’s average income, but we’re also not saying that a country’s wealth directly influences their use of technologies. We suspect that the drivers are likely a mixture of economic and social factors (level of education, age of the software industry, level of outsourcing) that are, in general, correlated with a country’s wealth.
How can we segment the software development industry in two?
When we’re examining trends, it’s useful to talk about two groups of countries (high income and non-high income) rather than considering a pile of correlations. As a useful pre-existing categorization, we could use the World Bank income classification, which is based on GNI (gross national income) per capita (see here for discussion of this categorization).
There are 78 high-income economies, largely made up of the US and Canada, Western Europe, parts of the Middle East and East Asia, and Australia/New Zealand. I’ve done some analyses of the fundamental drivers of the between-country variation (such as principal component analysis) that suggest this is a reasonable division, and that it’s more meaningful than other ways we could divide them, such as Eastern vs Western Hemisphere. (For instance, Australia is generally more similar to the US and Europe in terms of visited technologies than it is to China or Indonesia).
The division splits Stack Overflow traffic into groups of about two-thirds and one-third: 63.7% of Stack Overflow’s traffic comes from high-income countries. (This is likely due to a combination of a greater proportion of software development, more widespread internet access, and a disproportionate share of English speakers.) Much of the traffic from non-high-income countries comes from India, followed by Brazil, Russia, and China.
How do high-income countries differ in the technologies they use?
We’ve now divided the software development world into two segments. How do high-income and non-high-income countries differ in terms of the technologies they use?
We can extract several interesting insights:
- Difference in data science technologies: As we saw earlier, Python and R are associated with a country’s income. Python is visited about twice as often in high-income countries as in the rest of the world, and R about three times as much. We might also notice that among the smaller tags, many of the greatest shifts are in scientific Python and R packages such as pandas, numpy, matplotlib and ggplot2. This suggests that part of the income gap in these two languages may be due to their role in science and academic research. It makes sense these would be more common in wealthier industrialized nations, where scientific research makes up a larger portion of the economy and programmers are more likely to have advanced degrees.
- C/C++: C and C++ are two other notable languages that tend to be visited more from high-income countries. One hypothesis is that this may have to do with education: as we saw in a previous post, C and C++ are among the languages disproportionately visited from American universities. It could also be related to the geographic distribution of the electronics and manufacturing industries.
- PHP and Android: We explored Android development around the world in a previous post, but PHP is another technology that’s notably associated with lower-income countries. It’s interesting to see that CodeIgniter, an open source PHP framework, is the single most disproportionately visited tag from lower-income countries, by a large margin. Further examination shows it is especially heavily visited in South/Southeast Asia (particularly India, Indonesia, Pakistan and the Philippines) while it has very little traffic from the US and Europe. It’s possible that CodeIgniter is a common choice for outsourcing firms building websites.
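Statements like “visited about twice as often” compare a tag’s share of traffic in one group with its share in the other. The sketch below shows one way to compute that comparison, assuming a log2 ratio of shares as the measure; the counts and the `tag_shares` and `log2_ratios` helpers are hypothetical, chosen so that Python comes out at 2x and R at 3x to mirror the figures above.

```python
import math

# Hypothetical visit counts per tag in each income group. Invented numbers.
high_income = {"python": 800, "r": 300, "php": 500, "android": 400}
rest_of_world = {"python": 200, "r": 50, "php": 450, "android": 300}


def tag_shares(counts):
    """Convert raw visit counts to each tag's fraction of the group's traffic."""
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def log2_ratios(high, rest):
    """log2 of (share in high-income) / (share elsewhere), for shared tags.

    +1 means twice the share in high-income countries, -1 means half.
    """
    high_share = tag_shares(high)
    rest_share = tag_shares(rest)
    return {
        tag: math.log2(high_share[tag] / rest_share[tag])
        for tag in high_share
        if tag in rest_share and rest_share[tag] > 0
    }


ratios = log2_ratios(high_income, rest_of_world)
for tag, value in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tag:10s} {value:+.2f}")
```

Comparing shares rather than raw counts keeps the larger total traffic of high-income countries from inflating every tag’s ratio.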
Conclusion: why does this matter?
I was certainly interested in these results as a fun fact about the programming language ecosystem. But they also have implications for other data explorations we’ll be publishing in the near future.
When we ask questions about the software development industry, it’s important to know that we’re really answering two separate questions that have been “blended” together, and that separating them can sometimes give us more informative answers.
For example, we’re often interested in understanding which technologies drive the most traffic, such as examining technologies like Flash that are shrinking over time. If we were to create a list of the most visited programming technologies, it would be different for high-income and non-high-income countries:
For instance, in 2017 so far, Python is the second most visited tag among high-income countries, while it’s only the 8th most visited in the rest of the world. My language of choice, R, is the 15th most visited tag in high-income countries, but it doesn’t even make the top 50 most visited tags elsewhere.
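Producing the two separate rankings is a matter of sorting each group’s tags by visits. Here is a small sketch with invented counts; `rank_tags` is a hypothetical helper, and the resulting ranks do not reproduce the real ones quoted above.

```python
# Hypothetical visit counts per tag in each income group. Invented numbers.
high_income = {"javascript": 950, "python": 900, "java": 850, "r": 200}
rest_of_world = {
    "javascript": 700,
    "java": 680,
    "android": 600,
    "python": 300,
    "r": 20,
}


def rank_tags(counts):
    """Map each tag to its 1-based rank by visits (1 = most visited)."""
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {tag: position for position, tag in enumerate(ordered, start=1)}


high_rank = rank_tags(high_income)
rest_rank = rank_tags(rest_of_world)

# Print the two ranks side by side; "-" marks a tag absent from a group.
for tag in sorted(set(high_rank) | set(rest_rank)):
    print(tag, high_rank.get(tag, "-"), rest_rank.get(tag, "-"))
```

A single blended ranking would hide exactly the kind of gap this comparison is meant to expose.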
This is important context when we’re using Stack Overflow data to learn about the developer ecosystem. An American tech recruiter interested in the future of the industry will need a different set of answers than an Indian student wondering what language to learn, or an investor looking to understand tech companies in Kenya.
In future posts, we’ll sometimes refer back to this division as we continue to explore the worldwide developer ecosystem.