Thursday, December 26, 2019

10 Best Frameworks and Libraries for AI

Look at some high-quality libraries that are used for artificial intelligence, their pros and cons, and some of their features.

Artificial intelligence has existed for a long time. However, it has become a buzzword in recent years due to huge improvements in this field. AI used to be known as a field for total nerds and geniuses, but due to the development of various libraries and frameworks, it has become a friendlier IT field and has lots of people going into it.

In this article, we will be looking at top-quality libraries that are used for artificial intelligence, their pros and cons, and some of their features. Let's dive in and explore the world of these AI libraries!

1. TensorFlow

"Computation using data flow graphs for scalable machine learning."

Language: C++ or Python.

When getting into AI, one of the first frameworks you'll hear about is Google's TensorFlow.

TensorFlow is open-source software for carrying out numerical computations using data flow graphs. The framework is known for an architecture that allows computation on any CPU or GPU, whether on a desktop, a server, or even a mobile device. It is available in the Python programming language.

TensorFlow sorts through data layers called nodes and makes decisions with whatever information it gets. Check it out!
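
To make the idea of a data flow graph more concrete, here is a minimal sketch in plain Python. It deliberately does not use TensorFlow's API; the `Node` class and the `constant` and `evaluate` helpers are hypothetical names invented for this illustration. Each node stores an operation and its inputs, and nothing is computed until the graph is evaluated.

```python
import operator


class Node:
    """One vertex in a toy data flow graph (hypothetical, not TensorFlow's API)."""

    def __init__(self, op=None, inputs=(), value=None):
        self.op = op          # callable applied to the input values, or None for constants
        self.inputs = inputs  # upstream nodes whose outputs feed this node
        self.value = value    # fixed value for constant nodes


def constant(value):
    """Create a node that simply holds a value."""
    return Node(value=value)


def evaluate(node):
    """Recursively compute a node's value by first evaluating its inputs."""
    if node.op is None:
        return node.value
    return node.op(*(evaluate(n) for n in node.inputs))


# Build the graph for (a + b) * c; nothing is computed at this point.
a, b, c = constant(2.0), constant(3.0), constant(4.0)
total = Node(op=operator.add, inputs=(a, b))
result = Node(op=operator.mul, inputs=(total, c))

graph_output = evaluate(result)  # computes (2.0 + 3.0) * 4.0
print(graph_output)
```

Separating graph construction from evaluation is what lets a real framework decide where and how each operation runs.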

Pros:

Uses an easy-to-learn language (Python).
Uses computational graph abstraction.
Availability of TensorBoard for visualization.

Cons:

It's slow, as Python is not the fastest of languages.
Lack of many pre-trained models.
Not completely open-source.

2. Microsoft CNTK

"An open-source deep learning toolkit."

Language: C++.

We could call this Microsoft's response to Google's TensorFlow.

Microsoft's Computational Network ToolKit is a library that enhances the modularization and the maintenance of separating computation networks, providing learning algorithms and model descriptions.

CNTK can take advantage of many servers at the same time in a case where lots of servers are needed for operations.

It is said to be close in functionality to Google's TensorFlow; however, it is a bit speedier. Learn more here.

Pros:

It is very flexible.
Allows for distributed training.
Supports C++, C#, Java, and Python.

Cons:

It is implemented in a new language, Network Description Language (NDL).
Lack of visualizations.

3. Theano

"A numerical computation library."

Language: Python.

A strong competitor to TensorFlow, Theano is a powerful Python library that allows for numerical operations involving multi-dimensional arrays with a high level of efficiency.

The library's transparent use of a GPU for carrying out data-intensive computations instead of a CPU results in high efficiency in its operations.

For this reason, Theano has been used in powering large-scale computationally intensive operations for about a decade.

However, in September 2017, it was announced that major developments of Theano would cease after the 1.0 release, which was released in November 2017.

This doesn't mean it is a less powerful library in any way. You can still carry out deep learning research with it any time. Learn more here.

Pros:

Properly optimized for CPU and GPU.
Efficient for numerical tasks.

Cons:

Raw Theano is somewhat low-level compared to other libraries.
Needs to be used with other libraries to gain a high level of abstraction.
A bit buggy on AWS.

4. Caffe

"Fast, open framework for deep learning."

Language: C++.

Caffe is a powerful deep learning framework.

Like the other frameworks on this list, it is very fast and efficient for deep learning research.

With Caffe, you can very easily build a convolutional neural network (CNN) for image classification. Caffe works well on GPU, which contributes to its great speed during operations. Check out the main page for more information.

Pros:

Bindings for Python and MATLAB are available.
Great performance.
Allows for the training of models without writing code.

Cons:

Bad for recurrent networks.
Not great with new architectures.

5. Keras

"Deep learning for humans."

Language: Python.

Keras is an open-source neural network library written in Python.

Unlike TensorFlow, CNTK, and Theano, Keras is not meant to be an end-to-end machine learning framework.

Instead, it serves as an interface and provides a high level of abstraction, which makes for easy configuration of neural networks regardless of the framework it is sitting on.

Keras currently supports Google's TensorFlow as a backend, and Microsoft's CNTK is expected to be supported in little or no time. Learn more here.

Pros:

It is user-friendly.
It is easily extensible.
Runs seamlessly on both CPU and GPU.
Works seamlessly with Theano and TensorFlow.

Cons:

Can't be efficiently used as an independent framework.

6. Torch

"An open-source machine learning library."

Language: C.

Torch is an open-source machine learning library for scientific and numerical operations.

It's a library based on — no, not Python — the Lua programming language.

By providing a large number of algorithms, it makes for easier deep learning research and improved efficiency and speed. It has a powerful N-dimensional array, which helps with operations such as slicing and indexing. It also offers linear algebra routines and neural network models. Check it out.

Pros:

Very flexible.
High level of speed and efficiency.
Lots of pre-trained models available.

Cons:

Unclear documentation.
Lack of plug-and-play code for immediate use.
It's based on a not-so-popular language, Lua.

7. Accord.NET

"Machine learning, computer vision, statistics, and general scientific computing for .NET."

Language: C#.

Here is one for the C# programmers.

The Accord.NET framework is a .NET machine learning framework that makes audio and image processing easy.

This framework can efficiently handle numerical optimization, artificial neural networks, and even visualization. Aside from this, Accord.NET is powerful for computer vision and signal processing and also makes for an easy implementation of algorithms. Check the main page.

Pros:

It has a large and active development team.
Very well-documented framework.
Quality visualization.

Cons:

Not a very popular framework.
Slow compared to TensorFlow.

8. Spark MLlib

"A scalable machine learning library."

Language: Scala.

Apache's Spark MLlib is a very scalable machine learning library.

It is very usable in languages such as Java, Scala, Python, and even R. It is very efficient, as it interoperates with the NumPy library in Python and with R libraries.

MLlib can easily be plugged into Hadoop workflows. It provides machine learning algorithms such as classification, regression, and clustering.

This powerful library is very fast when it comes to processing of large-scale data. Learn more on the website.

Pros:

Very fast for large-scale data.
Available in many languages.

Cons:

Steep learning curve.
Plug-and-play available for Hadoop only.

9. scikit-learn

"Machine learning in Python."

Language: Python.

scikit-learn is a very powerful Python library for machine learning that is mainly used for building models.

Built using other libraries such as numpy, SciPy, and matplotlib, it is very efficient for statistical modeling techniques such as classification, regression, and clustering.

scikit-learn comes with features such as supervised learning algorithms, unsupervised learning algorithms, and cross-validation. Check it out.

Pros:

Availability of many of the main algorithms.
Efficient for data mining.

Cons:

Not the best for building models.
Not very efficient with GPU.

10. MLPack

"A scalable C++ machine learning library."

Language: C++.

MLPack is a scalable machine learning library implemented in C++. Because it's in C++, you can guess that it is great for memory management.

MLPack runs with great speed, as quality machine learning algorithms come along with the library. This library is novice-friendly and provides a simple API for use. Check it out.

Pros:

Very scalable.
Python and C++ bindings available.

Cons:

Not the best documentation.

Wrapping It Up

The libraries discussed in this article are very efficient and have proven over time to be of high quality. Big companies like Facebook, Google, Yahoo, Apple, and Microsoft make use of some of these libraries for their deep learning and machine learning projects, so why shouldn't you?

Can you think of any other library that you make use of very often that isn't on this list? Kindly share with us in the comments section!

Tuesday, December 24, 2019

A Tale of Two Industries: How Programming Languages Differ Between Wealthy and Developing Countries

Here at Stack Overflow, we’re interested in using our data to share insights about the worldwide software development community. This recent post on the distribution of mobile developers is a good example: it explored traffic to Android questions from around the world, and found that Android tended to be visited more from lower-income countries than from higher-income ones.

This leads us to wonder how else programming technologies may differ between rich and poor countries, and how that affects our picture of the global software development industry. In this post, we’ll explore these differences, and show that it’s useful to segment the software development industry into high-income countries and the rest of the world.

All the analyses explored here were performed on data from 2017 so far (January-August), on the 250 tags that had the most traffic during that time. To reduce the effect of noise, we analyzed only the 64 countries that had at least 5 million question visits in this time period. It’s also worth noting that this data represents activity among developers who understand English (some analyses of the Spanish and Portuguese sites suggest that similar trends apply for non-English speakers in countries such as Mexico and Brazil).

Technologies correlated with GDP per capita

In a recent post, we saw that the traffic to Android questions (as a percentage of a country’s Stack Overflow visits) tends to be negatively correlated with a country’s GDP per capita. This may lead us to wonder if the same is true of any other tags.

When we explore major programming languages and platforms, some that stand out besides Android include PHP, Python, and R.

The amount of Android and PHP traffic is negatively correlated with a country’s income, while Python and R are positively correlated. In each case we can see exceptions (Korea uses more Android than we’d expect, and China more Python), but generally the correlations are strong. (Each has an R^2 around .5-.6, with p-values < 10^-6 after adjusting for multiple testing).
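
As a rough illustration of what an R^2 value like this measures, the sketch below fits a one-variable least-squares line and reports the share of variance it explains. The numbers are made-up toy values, not Stack Overflow data, and `r_squared` is a hypothetical helper name.

```python
from statistics import mean


def r_squared(xs, ys):
    """R^2 of a simple least-squares line fitted to (xs, ys)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variation left after the fit vs. total variation in ys.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Toy data (assumed): log GDP per capita vs. percent of visits to some tag.
log_gdp = [7.5, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 11.0]
tag_share = [2.1, 2.6, 2.2, 3.4, 3.0, 4.1, 3.9, 4.8]

fit_quality = r_squared(log_gdp, tag_share)
print(round(fit_quality, 2))
```

A value near 1 means income explains most of the between-country variation in a tag's traffic share; a value near 0 means it explains almost none.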

We’ll emphasize that we’re not suggesting any causality here. We’re certainly not suggesting that programming language choice affects a country’s average income, but we’re also not saying that a country’s wealth directly influences their use of technologies. We suspect that the drivers are likely a mixture of economic and social factors (level of education, age of the software industry, level of outsourcing) that are, in general, correlated with a country’s wealth.

How can we segment the software development industry in two?

When we’re examining trends, it’s useful to talk about two groups of countries (high income and non-high income) rather than considering a pile of correlations. As a useful pre-existing categorization, we could use World Bank income classification, which is based on GNI (gross national income) per capita (see here for discussion of this categorization).

There are 78 high-income economies, largely made up of the US and Canada, Western Europe, parts of the Middle East and East Asia, and Australia/New Zealand. I’ve done some analyses of the fundamental drivers of the between-country variation (such as principal component analysis) that suggest this is a reasonable division, and that it’s more meaningful than other ways we could divide them, such as Eastern vs Western Hemisphere. (For instance, Australia is generally more similar to the US and Europe in terms of visited technologies than it is to China or Indonesia).

The division splits Stack Overflow traffic into groups of about two-thirds and one-third: 63.7% of Stack Overflow’s traffic comes from high income countries. (This likely is due to a combination of greater proportion of software development, more widespread internet access, and a disproportionate share of English-speakers). Much of the traffic from non-high-income countries comes from India, followed by Brazil, Russia, and China.

How do high-income countries differ in the technologies they use?

We’ve now divided the software development world into two segments. How do high-income and non-high-income countries differ in terms of the technologies they use?

We can extract several interesting insights:

Difference in data science technologies: As we saw earlier, Python and R are associated with a country’s income. Python is visited about twice as often in high-income countries as in the rest of the world, and R about three times as much. We might also notice that among the smaller tags, many of the greatest shifts are in scientific Python and R packages such as pandas, numpy, matplotlib and ggplot2. This suggests that part of the income gap in these two languages may be due to their role in science and academic research. It makes sense these would be more common in wealthier industrialized nations, where scientific research makes up a larger portion of the economy and programmers are more likely to have advanced degrees.
C/C++: C/C++ are two other notable languages that tend to be visited from high-income countries. One hypothesis is that this may have to do with education: as we saw in a previous post, C and C++ are among the languages more disproportionately visited from American universities. It could also be related to the geographic distribution of the electronics and manufacturing industries.
PHP and Android: We explored Android development around the world in a previous post, but PHP is another technology that’s notably associated with lower-income countries. It’s interesting to see that CodeIgniter, a PHP open source framework, is the tag that’s singularly most disproportionately visited from lower-income countries, by a large margin. Further examination shows it is especially heavily visited in South/Southeast Asia (particularly India, Indonesia, Pakistan and the Philippines) while it has very little traffic from the US and Europe. It’s possible that CodeIgniter is a common choice for outsourcing firms building websites.
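
Comparisons such as "visited about twice as often" are ratios of traffic shares rather than of raw visit counts. A minimal sketch of that arithmetic follows; the visit counts are hypothetical and `share_ratio` is an illustrative name, not part of the original analysis.

```python
def share_ratio(tag_high, total_high, tag_other, total_other):
    """Ratio of a tag's share of visits in high-income countries to its share elsewhere."""
    share_high = tag_high / total_high
    share_other = tag_other / total_other
    return share_high / share_other


# Hypothetical visit counts, chosen only to illustrate the arithmetic.
ratio = share_ratio(tag_high=80_000, total_high=1_000_000,
                    tag_other=20_000, total_other=500_000)
print(ratio)  # a value of 2 would mean "visited twice as often"
```

Dividing by each group's total traffic first keeps the comparison fair even though the two groups differ in size.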

Conclusion: why does this matter?

I was certainly interested in these results as a fun fact about the programming language ecosystem. But it also has implications for other data explorations we’ll be publishing in the near future.

When we ask questions about the software development industry, it’s important to know that we’re really answering two separate questions that have been “blended” together, and that separating them can sometimes give us more informative answers.

For example, we’re often interested in understanding which technologies drive the most traffic, such as examining technologies like Flash that are shrinking over time. If we were to create a list of the most visited programming technologies, it would be different for high-income and low-income countries:

For instance, in 2017 so far, Python is the second most visited tag among high-income countries, while it’s only the 8th most visited in the rest of the world. My language of choice, R, is the 15th most visited tag in high-income countries, but it doesn’t even make the top 50 most visited tags elsewhere.

This is important context when we’re using Stack Overflow data to learn about the developer ecosystem. An American tech recruiter interested in the future of the industry will need a different set of answers than an Indian student wondering what language to learn, or an investor looking to understand tech companies in Kenya.

In future posts, we’ll sometimes refer back to this division as we continue to explore the worldwide developer ecosystem.

Developers Who Use Spaces Make More Money Than Those Who Use Tabs

Do you use tabs or spaces for code indentation?

This is a bit of a “holy war” among software developers; one that’s been the subject of many debates and in-jokes. I use spaces, but I never thought it was particularly important. But today we’re releasing the raw data behind the Stack Overflow 2017 Developer Survey, and some analysis suggests this choice matters more than I expected.

Spaces make more money than tabs

There were 28,657 survey respondents who provided an answer to tabs versus spaces and who considered themselves a professional developer (as opposed to a student or former programmer). Within this group, 40.7% use tabs and 41.8% use spaces (with 17.5% using both). Of them, 12,426 also provided their salary.

Analyzing the data leads us to an interesting conclusion. Coders who use spaces for indentation make more money than ones who use tabs, even if they have the same amount of experience:

Indeed, the median developer who uses spaces had a salary of $59,140, while the median tabs developer had a salary of $43,750. (Note that all the results were converted into US dollars from each respondent’s currency). Developers who responded “Both” were generally indistinguishable from ones who answered “Tabs”: I’ll leave them out of many of the remaining analyses.
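
For readers who want to try this kind of comparison themselves, here is a small sketch of computing a median salary per indentation style. The responses below are hypothetical values, not rows from the survey data, and the variable names are only illustrative.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (indentation style, salary in USD) pairs, not real survey responses.
responses = [
    ("Spaces", 62_000), ("Tabs", 41_000), ("Spaces", 55_000),
    ("Tabs", 48_000), ("Both", 44_000), ("Spaces", 71_000),
    ("Tabs", 39_000), ("Both", 43_000),
]

# Group the salaries by indentation style.
salaries_by_style = defaultdict(list)
for style, salary in responses:
    salaries_by_style[style].append(salary)

# The median is less sensitive to a few very high salaries than the mean.
median_by_style = {style: median(vals) for style, vals in salaries_by_style.items()}
print(median_by_style)
```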

This is an amusing result, but of course it’s not conclusive by itself. When I first discovered this effect, I assumed that it was confounded by a factor such as country or programming language. For example, it’s conceivable that developers in low GDP-per-capita countries could be more likely to use tabs, and therefore such developers tend to have lower salaries on average.

We could examine this by considering whether the effect occurs within each country, for several of the countries that had the most survey respondents.

The effect is smaller in Europe and especially large in India, but it does appear within each country, suggesting this isn’t the sole confounding factor.

As another hypothesis, we know that different types of developers often use different indentation (e.g. with DevOps developers more likely to use spaces and mobile developers more likely to use tabs), often because they use different editors and languages. The Developer Survey asked both about what programming languages each respondent uses (Python, JavaScript, etc.) and what “type” of developer they are (web developer, embedded developer, etc.).

Did we see the same tabs/spaces gap within each of these groups?

Yes, the effect existed within every subgroup of developers. (This gave a similar result even when filtering for developers only in a specific country, or for ones with a specific range of experience). Note that respondents could select multiple languages, so these groups overlap to some degree.

I did several other visual examinations of possible confounding factors (such as level of education or company size), and found basically the same results: spaces beat tabs within every group. Now that the raw data is available, I encourage other statisticians to check other confounders themselves.

Estimating the effect

If we control for all of the factors that we suspect could affect salary, how much effect does the choice of tabs/spaces have?

To answer this, I fit a linear regression, predicting salary based on the following factors.

Tabs vs spaces
Country
Years of programming experience
Developer type and language (for the 49 responses with at least 200 “yes” answers)
Level of formal education (e.g. bachelor’s, master’s, doctorate)
Whether they contribute to open source
Whether they program as a hobby
Company size

The model estimated that using spaces instead of tabs is associated with an 8.6% higher salary (confidence interval (6%, 10.4%), p-value < 10^-10). (By predicting the logarithm of the salary, we were able to estimate the % change each factor contributed to a salary rather than the dollar amount). Put another way, using spaces instead of tabs is associated with as high a salary difference as an extra 2.4 years of experience.
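
The link between a log-salary regression and a percent change can be sketched with a few lines of arithmetic. This is only a back-of-the-envelope illustration using the two figures reported above; it assumes that experience enters the model linearly in log(salary), which the post does not state, and the variable names are hypothetical.

```python
import math

reported_pct = 0.086      # 8.6% higher salary, as reported in the post
equivalent_years = 2.4    # years of experience with the same effect, as reported

# In a regression on log(salary), a coefficient beta corresponds to a
# multiplicative change of exp(beta), i.e. a percent change of exp(beta) - 1.
beta_spaces = math.log(1 + reported_pct)

# Assumption: experience enters the model linearly in log(salary).
beta_per_year = beta_spaces / equivalent_years
pct_per_year = math.exp(beta_per_year) - 1

print(f"coefficient for spaces: {beta_spaces:.4f}")
print(f"implied raise per year of experience: {pct_per_year:.1%}")
```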

Conclusion

So… this is certainly a surprising result, one that I didn’t expect to find when I started exploring the data. And it is impressively robust even when controlling for many confounding factors. As an exercise I tried controlling for many other confounding factors within the survey data beyond those mentioned here, but it was difficult to make the effect shrink and basically impossible to make it disappear.

Correlation is not causation, and we can never be sure that we’ve controlled for all the confounding factors present in a dataset, or indeed that the confounders were measured in the survey at all. If you’re a data scientist, statistician, or analyst, I encourage you to download the raw survey data and examine it for yourself. You can find the code behind this blog post here if you’d like to reproduce the analysis. In any case we’d be interested in hearing hypotheses about this relationship.

Though for the sake of my own salary, I’m sticking with spaces for now.

Why Python is Popular Despite Being (Super) Slow

Python is one of the most widely used programming languages, and it has been around for more than 28 years now. A common question, especially among beginners, is why Python is so popular despite being slow, or why developers don’t seem to care about its speed and performance limitations. In this post, I will go through some of the main reasons.

Why is Python Slow in Terms of Speed?

Before diving into why Python is popular despite being slow, I will briefly explain why Python is slower than other popular programming languages such as C and C++.

High-level programming language: With Python, the code looks very close to how humans think. To achieve this, it must abstract away the details of the computer, such as memory management and pointers. Hence, it is slower than lower-level languages like C;
Python is interpreted and not compiled: Sure, this statement is a gross simplification, but it’s broadly correct. Python code is interpreted at runtime instead of being compiled to native code ahead of time;
Python is a dynamically typed language: Unlike statically typed languages like C, C++ or Java, you don’t have to declare a variable’s type, such as String, boolean or int. The less you do, the more your computer has to work. Each attribute access requires many lookups. In addition, being very dynamic makes it incredibly hard to optimize Python;
Global Interpreter Lock (GIL): The GIL prevents threads from running Python code in parallel by allowing only a single thread to execute within a single process (an instance of the Python interpreter) at a time.
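
The dynamic typing point can be seen in a short sketch. The function below has no type declarations, so what `+` means has to be worked out from the operands each time it runs. The function and variable names are only illustrative.

```python
def add(a, b):
    # No type declarations: the meaning of "+" is looked up on the
    # operands' types every time this line runs.
    return a + b


int_sum = add(2, 3)           # integer addition
text = add("py", "thon")      # string concatenation
merged = add([1, 2], [3])     # list concatenation

# A name can even be rebound to a value of a different type at run time.
value = 10
value = "ten"

print(int_sum, text, merged, type(value).__name__)
```

This flexibility is convenient for the programmer, but it means the interpreter cannot assume in advance which operation it will have to perform.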

Why is Python Still so Popular?

I would say that 9/10 times the slower performance of Python does not matter. Below I will discuss some major aspects and reasons.

End-users just don’t care

Can you really feel the difference between 0.001 seconds and 0.01 seconds? The answer is most likely “No”. Normally, it doesn’t matter much to end-users if your program takes just a little longer to run. As long as we don’t write a program that takes centuries to execute and ruins the end-user experience, it’s fine. If it does take too long, horizontal scaling can remove many of the bottlenecks introduced by Python and speed up execution.

More Productive

The first and foremost reason why Python is so popular is that it is highly productive compared to other programming languages like C++ and Java. It is a much more concise and expressive language and requires less time, effort, and lines of code to perform the same operations.

Python code is very simple and easy to read

Python features like one-liners and a dynamic type system allow developers to write far fewer lines of code for tasks that require more lines in other languages. This makes Python a very easy-to-learn programming language, even for beginners. For instance, Python programs are slower than Java programs, but they also take much less time to develop, as Python code is typically 3 to 5 times shorter than the equivalent Java code.

Python is also well known for its simple syntax, code readability and English-like commands, which make coding in Python a lot easier and more efficient.
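
As a small illustration of this conciseness, the sketch below counts words twice: once with an explicit loop, and once with a one-liner from the standard library. The sample text and variable names are made up for the example.

```python
from collections import Counter

text = "the quick brown fox jumps over the lazy dog the end"

# Explicit loop, similar to how it might be written in a more verbose language.
counts_loop = {}
for word in text.split():
    if word in counts_loop:
        counts_loop[word] += 1
    else:
        counts_loop[word] = 1

# The same result as a one-liner using the standard library.
counts_short = Counter(text.split())

# A list comprehension replaces a loop-and-append pattern.
long_words = [w for w in text.split() if len(w) > 4]

print(counts_short["the"], long_words)
```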

Execution Speed does not matter as much as Business Speed

There was a time when computer run time was the main issue and the most expensive resource. But now, things have changed. Computers, servers and other hardware have become much cheaper than ever, and speed has become a less important factor. Today, in terms of cost, development time matters more than execution speed in most cases, as employees’ time has become one of the most expensive resources, if not the most expensive. Reducing the time needed for each project saves companies tons of money.

As far as the execution speed or performance of the program is concerned, we can often manage it by horizontal scaling, which means running more servers to reach the required level of speed or performance. In this modern era, where high computing power and multi-core processors keep getting cheaper, many speed and performance issues can be resolved this way. But it is not the same story for human cost. It will just keep increasing over time.

In short, the time you save in the development process will likely be worth more than whatever performance and execution speed you gain in the application.

A shorter development process not only saves money but also improves your competitiveness. Faster prototyping and delivery enable companies to innovate and get ahead of the competition.

As a CEO, which option will you choose? (1) Complete a project in 6 months, or (2) complete exactly the same project in 4 months but pay 20% more for the servers. If execution speed is your main concern, then (1) is your choice. But if you focus on development speed and faster innovation, (2) should be your choice.
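
To put rough numbers on this trade-off, here is a small cost sketch. Every figure in it (team cost, server cost, operating period) is hypothetical and only illustrates the arithmetic; `total_cost` is an illustrative helper, and real projects will differ.

```python
def total_cost(dev_months, monthly_team_cost, monthly_server_cost, server_months):
    """Development cost plus server cost over a fixed operating period."""
    return dev_months * monthly_team_cost + server_months * monthly_server_cost


# All figures are hypothetical and only illustrate the trade-off.
TEAM_COST = 40_000      # cost of the development team per month
SERVER_COST = 2_000     # baseline server cost per month
SERVER_MONTHS = 24      # how long the servers run after launch

option_1 = total_cost(6, TEAM_COST, SERVER_COST, SERVER_MONTHS)
option_2 = total_cost(4, TEAM_COST, SERVER_COST * 1.2, SERVER_MONTHS)

print(option_1, option_2, option_1 - option_2)
```

With these assumed numbers, the two months of saved development time outweigh the extra 20% in server costs; with a much larger server bill the balance could tip the other way.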

That’s where Python gains its popularity, as the time required to build a program using Python is very short compared to other programming languages.

Is Speed the only factor you should consider?

When choosing any programming language to develop any type of application, there are several tens or hundreds of factors that you should consider, and speed is surely one of them. But, there are other things that also matter like language suitability.

Python has been around for a very long time and its community is very big. Thus, it is easy to find Python developers and support.

In addition, the language has a rich set of libraries and frameworks for many purposes. For example, Django and Flask for developing web applications, TensorFlow for deep learning, and pandas for data analysis.

Is Python Good for Speed-Intensive Applications?

So far we have discussed why Python is slow and why it is popular despite being slow. But what if you strictly require high performance and fast execution speed in certain applications? In this case, I would say that Python is not a good fit. Sure, you can optimize it, but in general, other programming languages should be used. For example, for game development, C# would be a better option.
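
As a hint of what "optimizing it" can look like, the sketch below shows three ways to compute the same sum of squares. A common approach is to reduce the work done in the interpreter loop, either by leaning on built-ins or by picking a better algorithm. No timings are shown because they depend on the machine and the Python version, and the function names are only illustrative.

```python
def sum_of_squares_loop(n):
    """Straightforward version: every iteration runs in the interpreter."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_of_squares_builtin(n):
    """Same computation with the accumulation handled by the built-in sum()."""
    return sum(i * i for i in range(n))


def sum_of_squares_formula(n):
    """Closed form for 0^2 + 1^2 + ... + (n-1)^2; no loop at all."""
    return (n - 1) * n * (2 * n - 1) // 6


n = 1_000
print(sum_of_squares_loop(n), sum_of_squares_builtin(n), sum_of_squares_formula(n))
```

All three functions return the same value, so the faster variants can replace the loop without changing behavior.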

In short, Python is widely used even though it is somewhat slower than other languages because:

Python is more productive
Companies can optimize their most expensive resource: employees
Fast innovation improves competitiveness
Rich set of libraries and frameworks
Large community

However, it is not suitable for speed-intensive applications, including games that require high performance, or for OS and system-level applications.