We recently showed that, based on Stack Overflow question visits, Python has a claim to being the fastest-growing major programming language, and that it has become the most visited tag on Stack Overflow within high-income countries.
Why is Python growing so fast? Python is used for a variety of purposes, ranging from web development to data science to DevOps, and it’s worth understanding which particular applications of Python have recently become more common. I’m a data scientist who uses R, so I’m certainly interested in how much of Python’s growth has been within my own field. In this post, I’ll take another look at Stack Overflow data to understand what kinds of Python development have been growing, and in what kinds of companies and organizations it’s most used.
These analyses suggest two conclusions. First, the fastest-growing use of Python is for data science, machine learning, and academic research. This is particularly visible in the growth of the pandas package, which is the fastest-growing Python-related tag on the site. Second, as for which industries are using Python, we found that it is more visited in a few industries, such as electronics, manufacturing, software, government, and especially universities. However, Python’s growth is spread pretty evenly across industries. In combination, this tells a story of data science and machine learning becoming more common in many types of companies, and Python becoming a common choice for that purpose.
Just like in the previous post, all of these analyses are constrained to World Bank high-income countries.
Types of Python Development
Python is a versatile language used for a variety of tasks, such as web development and data science. How could we disentangle Python’s recent growth across these fields?
For starters, we could examine the growth in traffic to tags representing notable Python packages in each field. We could compare the web frameworks Django and Flask to the data science packages NumPy, matplotlib, and pandas. (You can also use Stack Overflow Trends to compare rates of questions asked, rather than ones visited).
In terms of Stack Overflow traffic from high-income countries, pandas is clearly the fastest-growing Python package: it had barely been introduced in 2011 but now makes up almost 1% of Stack Overflow question views. Questions about NumPy and matplotlib have also grown in their share of visits over time. In contrast, traffic to Django questions has stayed fairly steady during that time, and while Flask is growing, it remains at a smaller share. This suggests that much of Python’s growth may be due to data science, rather than to web development.
However, this gives us only part of the picture, since it can measure only widely used Python-specific packages. Python is also popular among system administrators and DevOps engineers, who might visit Linux, Bash, and Docker questions alongside Python questions. Similarly, plenty of Python web development is done without Django or Flask, and such developers would likely visit JavaScript, HTML and CSS as “supporting” tags. We can’t simply measure the growth of tags like linux, bash, javascript and assume they’re associated with Python. Thus, we’d like to measure the tags visited alongside Python.
We’ll consider only visits in the summer (June-August) of 2017, which helps reduce the effect of undergraduate students, focuses on recent traffic, and helps reduce the computational challenges of summarizing traffic across long periods of time. We’ll include only signed-in users who visited at least 50 Stack Overflow questions during that time. We’ll count someone as a Python user only if a) their most visited tag is Python, and b) Python makes up at least 20% of their visits.
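As a rough sketch of that filter (not the code behind this post), the rules could be written in Python like this. The function name and the shape of its inputs are assumptions made for illustration; only the thresholds come from the text above.

```python
MIN_QUESTION_VISITS = 50   # at least 50 question visits in the period
MIN_PYTHON_SHARE = 0.20    # Python makes up at least 20% of visits


def is_python_user(question_visits, tag_visits):
    """Decide whether one user counts as a Python user.

    question_visits: total questions the user visited (assumed input).
    tag_visits: dict mapping tag name -> number of visited questions
        carrying that tag (assumed input; a question can carry several tags).
    """
    if question_visits < MIN_QUESTION_VISITS or not tag_visits:
        return False
    # Rule a): Python must be the user's most visited tag.
    top_tag = max(tag_visits, key=tag_visits.get)
    if top_tag != "python":
        return False
    # Rule b): Python must make up at least 20% of the user's visits.
    return tag_visits["python"] / question_visits >= MIN_PYTHON_SHARE


# Hypothetical user: 100 question visits, 40 of them tagged python.
print(is_python_user(100, {"python": 40, "pandas": 30, "javascript": 10}))
```

Note that this sketch measures Python's share against the number of questions visited rather than against the sum of tag counts, since one question can carry several tags. That choice is a guess on our part.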
Which tags were often visited by the same people who tended to visit Python?
Pandas is by a large margin the tag most visited by Python developers, which isn’t surprising after we saw its earlier growth. The second most visited tag by Python visitors is JavaScript, which likely represents the set of Python web developers (as does Django a few slots lower). This confirms our suspicion that we should consider what tags are visited alongside Python, and not just the growth of Python-related tags in general.
Going down the list, we can see other “clusters” of technologies. We can examine their relationships by considering what pairs of tags tend to be correlated: that is, whether Python users who visit one tag are disproportionately likely to visit the other as well. By filtering for pairs of tags with a high Pearson correlation, we can display these relationships in a network diagram (see here for more on this kind of visualization).
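Here is a minimal sketch of that pairwise filtering step. It assumes each user is reduced to the set of tags they visited, so each tag becomes a 0/1 column across users. The post doesn't say whether indicators or visit counts were used, and the threshold and sample users below are made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined; treat the pair as unrelated
    return cov / sqrt(var_x * var_y)


def correlated_pairs(user_tags, threshold):
    """Return (tag_a, tag_b, r) for every tag pair with r >= threshold.

    user_tags: list of sets, one set of visited tags per user (assumed input).
    """
    tags = sorted(set().union(*user_tags))
    # One 0/1 indicator column per tag: did each user visit it?
    columns = {t: [1 if t in visited else 0 for visited in user_tags]
               for t in tags}
    edges = []
    for a, b in combinations(tags, 2):
        r = pearson(columns[a], columns[b])
        if r >= threshold:
            edges.append((a, b, r))
    return edges


# Hypothetical users, purely to exercise the function.
example_users = [
    {"pandas", "numpy"},
    {"pandas", "numpy", "matplotlib"},
    {"javascript", "html"},
    {"javascript", "html", "django"},
    {"linux", "bash"},
]
for tag_a, tag_b, r in correlated_pairs(example_users, threshold=0.5):
    print(tag_a, tag_b, round(r, 2))
```

Each returned pair would become an edge in the network diagram, with the correlation available as an edge weight.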
We can see a few large clusters of technologies, which roughly describe categories of problems that are often solved with Python. In the upper center we see a cluster for data science and machine learning: it has pandas, NumPy, and matplotlib at the center, and is closely connected to technologies like R, Keras, and TensorFlow. The cluster below describes web development, with tags like JavaScript, HTML, CSS, Django, Flask, and jQuery. Two other clusters we can spot are system administration/DevOps on the left (centered around Linux and Bash), and data engineering on the right (Spark, Hadoop, and Scala).
Growth by topic
We’ve seen how Python-related Stack Overflow traffic can generally be divided into a few topics. This lets us examine which of the topics is responsible for most of Python’s growth in Stack Overflow visits.
Imagine we were looking at the history of a user, and we see that Python is their most visited tag. How might we guess whether they are a web developer, data scientist, system administrator, or something else? Well, we could consider their second most visited tag, then their third, and work our way down the list of their most visited tags until we saw something recognizable from one of the clusters above.
Thus, we propose the following simple approach for classifying a user into a topic: for each user, we find which of the nine tags listed below they visited most, and use that tag to classify them.
- Data scientist: Pandas, NumPy, or Matplotlib
- Web developer: JavaScript, Django, or HTML
- Sysadmin/DevOps: Linux, Bash, or Windows
- None: None of the nine tags above made up more than 5% of their traffic.
This isn’t very sophisticated, but it lets us quickly estimate the influence of each major category on Python’s growth. We also tried the more rigorous approach of latent Dirichlet allocation, and got qualitatively similar results.
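The rule above is simple enough to sketch in a few lines of Python. The tag groups and the 5% cutoff come from the list above, while the function name and input shapes are assumptions for illustration rather than the code we actually ran.

```python
# Tag groups from the list above; tag names lowercased as on the site.
TOPIC_TAGS = {
    "Data scientist": ("pandas", "numpy", "matplotlib"),
    "Web developer": ("javascript", "django", "html"),
    "Sysadmin/DevOps": ("linux", "bash", "windows"),
}
MIN_SHARE = 0.05  # a tag must exceed 5% of the user's traffic to count


def classify(question_visits, tag_visits):
    """Classify one Python user by their most visited tag among the nine.

    question_visits: total questions the user visited (assumed input).
    tag_visits: dict mapping tag name -> visits to that tag (assumed input).
    """
    best_topic, best_visits = "None", 0
    for topic, tags in TOPIC_TAGS.items():
        for tag in tags:
            visits = tag_visits.get(tag, 0)
            if visits > best_visits:
                best_topic, best_visits = topic, visits
    # If even the top tag is not above the cutoff, the user is unclassified.
    if question_visits == 0 or best_visits / question_visits <= MIN_SHARE:
        return "None"
    return best_topic


# Hypothetical user whose second most visited tag is pandas.
print(classify(100, {"python": 60, "pandas": 20, "javascript": 10}))
```

Ties between tags are broken by the order of the groups in `TOPIC_TAGS`, which is an arbitrary choice in this sketch.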
Which categories of Python developer have become more common over time? Note that since we’re categorizing users rather than question visits, we’re showing this as a percentage of Stack Overflow registered visitors (whether they visited Python or not).
We can see that the share of Python visitors who work with web technologies or system administration has grown at a slow or moderate pace over the last three years, measured out of all visitors to Stack Overflow. But the share of Python developers who are visiting data science technologies is growing very rapidly. This suggests that Python’s popularity in data science and machine learning is probably the main driver of its fast growth.
We could also consider growth on the level of individual tags, by calculating the traffic to tags visited by Python developers in 2016 and 2017. For instance, it’s possible that JavaScript traffic is steady overall, but that it’s shrinking as a percentage of visits from Python developers. Once we have those per-tag growth rates, it’s useful to lay them out in our network to understand what topics are growing and shrinking.
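One way to sketch that per-tag calculation is to compare each tag's share of Python developers' visits across the two years. The ratio used below is one reasonable definition of a growth rate, not necessarily the one behind the figure, and the visit counts are invented for illustration.

```python
def tag_growth(visits_2016, visits_2017):
    """Ratio of each tag's 2017 share to its 2016 share.

    Both arguments map tag name -> visits from Python developers in that
    year (assumed inputs). A value above 1 means the tag makes up a larger
    part of Python developers' traffic than it did the year before.
    """
    total_2016 = sum(visits_2016.values())
    total_2017 = sum(visits_2017.values())
    growth = {}
    for tag in visits_2016.keys() & visits_2017.keys():
        if visits_2016[tag] == 0:
            continue  # no baseline, so the ratio is undefined
        share_2016 = visits_2016[tag] / total_2016
        share_2017 = visits_2017[tag] / total_2017
        growth[tag] = share_2017 / share_2016
    return growth


# Made-up counts: JavaScript visits are flat in absolute terms,
# yet its share among Python developers shrinks.
example_2016 = {"pandas": 100, "javascript": 300, "django": 100}
example_2017 = {"pandas": 300, "javascript": 300, "django": 150}
for tag, ratio in sorted(tag_growth(example_2016, example_2017).items()):
    print(tag, round(ratio, 2))
```

In the network diagram, these ratios would then determine the color of each tag's node.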
This helps confirm our suspicion that a lot of the growth within Python is related to data science and machine learning. Most of that cluster is shifted towards orange, meaning those tags have started making up a larger part of the Python ecosystem.
Industry
Another way we can understand the growth of the Python language is by considering what types of companies it is visited from. This is a separate question from the type of developer a visitor is: both retail companies and media firms could employ data scientists or web developers.
We’ll focus on two of the countries in which Python’s growth is most notable: the United States and the United Kingdom. In these countries, we’re able to segment our traffic by industry (just as we did to compare AWS and Azure).
The industry with the greatest amount of Python traffic (by a large margin) is academia, which comprises colleges and universities. Is this because Python is often taught in undergraduate programming classes?
Partially, but not entirely. As we saw in a previous post, Python traffic from universities is common in the summer, not just in the fall and spring. For instance, Python and Java are the most visited tags from universities, and we can see the difference in their seasonal trends.
As a percentage, we can see that traffic to Java drops more sharply during each summer, because Java is a relatively common subject in undergraduate classes. (We’ll be exploring what programming languages are most taught at universities in a future post). In contrast, Python makes up a larger share of each summer’s traffic. The high traffic to Python questions from universities is therefore due partly to academic researchers, who generally work throughout the entire year. This provides more evidence that Python’s growth is due to its capacity for scientific computing and data analysis.
As for the other industries, we’ve already seen that Python is popular and fast-growing in the government sector, but we can see it’s also widely used in the electronics and manufacturing industries. I’m less familiar with those industries and would be interested in insights as to why. The language still hasn’t caught on as much in retail or insurance companies (some investigation shows that Java remains dominant there).
This post is primarily investigating causes of Python’s growth. Was Python traffic growing more quickly in some industries than others?
The growth of Python in the last year has been pretty evenly spread out across industries, at least in the US and UK. In each industry the traffic to Python increased about 2-3 absolute percentage points. (Note that this represents a larger relative growth in industries where it wasn’t already common, such as insurance and retail).
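To make the absolute-versus-relative distinction concrete, here is a small sketch. The two starting shares are made-up numbers chosen for illustration, not figures from our data.

```python
def share_growth(share_before, share_after):
    """Return (absolute change in percentage points, relative change in %)."""
    absolute_points = (share_after - share_before) * 100
    relative_percent = (share_after / share_before - 1) * 100
    return absolute_points, relative_percent


# Hypothetical industries: both gain two percentage points,
# but the one starting from a lower share grows more in relative terms.
for label, before, after in [("low starting share", 0.04, 0.06),
                             ("high starting share", 0.10, 0.12)]:
    points, percent = share_growth(before, after)
    print(f"{label}: +{points:.1f} points, +{percent:.0f}% relative")
```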
In many of these industries, Java remains the most-visited tag based on 2017 year-to-date traffic, but Python has been making progress. For example, within finance (one of the larger contributors to Stack Overflow traffic out of these industries), Python went from being the fourth most visited tag in 2016 to the second most visited in 2017.
Conclusion
As a data scientist who previously worked in Python but now works in R, should this push me towards switching back?
I don’t think so. For one thing, R has been growing rapidly as well; we saw in the last post that it’s the second-fastest-growing major programming language, after Python. But secondly, the reasons I prefer using R for data analysis aren’t particularly related to its relative popularity. (I’m planning on writing a personal blog post about my own journey from Python to R, what I like about both languages, and why I don’t feel compelled to switch back).
In any case, data science is an exciting and growing field, and there’s plenty of room for multiple languages to thrive. My main conclusion is to encourage developers early in their career to consider building skills in data science. We’ve seen here that it’s among the fastest-growing components of the software development ecosystem, and one that’s become relevant across many industries.