Machine learning for dummies

This started as a post about particular applications of machine learning (ML) in protected area management, but while I was writing the ‘one paragraph’ introduction to machine learning, it became two paragraphs and then some more. Then it became a post in itself. Obviously, you will find other introductions if you search for them. I suggest that you do so, because there is no set definition and the dominant versions will differ over time, as with all complex stuff.

1 ML: the one paragraph summary

At present, ML is a simple version of one of the ways in which humans learn: showing examples to the machine to teach it something (class phase), testing whether it has learned what it should have learned (exam phase), and then either more teaching or letting it apply what it learned (work phase).

The following three sections go into these three phases. After that, the whole idea of ML does not sound so good anymore, but there are some benefits that may still make it worthwhile, which is the topic of the last section.

2 Class

The class phase: one shows the machine a lot of examples and says something like ‘notice that this and this is important’, or ‘this example and this example are the same’. The task boils down to some form of pattern recognition, including classification and some form of prediction. Hopefully, the machine will learn to recognize the patterns in the data that we want it to learn. The result of this learning is called a ‘model’: the machine has built a model of the pattern that it needs to recognize.

For example, let’s say one wants to teach the computer to identify a sunflower in a picture. One shows it pictures with a sunflower and tells it (through tagging, for example) that it is looking at pictures with a sunflower. And one shows it pictures without a sunflower and tells it that it is looking at pictures without a sunflower. The computer then (hopefully) builds a model of a picture with a sunflower.
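To make the sunflower example a bit more concrete, here is a minimal sketch in Python. It is a toy under loud assumptions, not how real image recognition works: each picture is reduced to two made-up numbers (the fraction of yellow pixels and a 'roundness' score), and the 'model' is nothing more than the average of those numbers per tag. The function names train and predict are my own choices for illustration.

```python
# Toy 'class phase': learn a model from tagged examples.
# Assumption for illustration: each picture is described by two made-up numbers,
# (fraction of yellow pixels, roundness of the main shape).

def train(examples):
    """Build a 'model': the average feature values per tag."""
    sums, counts = {}, {}
    for features, tag in examples:
        total = sums.setdefault(tag, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[tag] = counts.get(tag, 0) + 1
    return {tag: [s / counts[tag] for s in total] for tag, total in sums.items()}

def predict(model, features):
    """Return the tag whose average is closest to the given features."""
    def distance(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(model, key=lambda tag: distance(model[tag]))

# The tagged training pictures (hypothetical values).
tagged_pictures = [
    ((0.60, 0.90), "sunflower"),
    ((0.55, 0.80), "sunflower"),
    ((0.70, 0.85), "sunflower"),
    ((0.05, 0.20), "no sunflower"),
    ((0.10, 0.40), "no sunflower"),
    ((0.15, 0.30), "no sunflower"),
]

model = train(tagged_pictures)

# A picture the machine has not seen before: quite yellow and quite round.
new_picture_tag = predict(model, (0.65, 0.75))
```

With these made-up numbers, the new picture lies much closer to the 'sunflower' average than to the 'no sunflower' average, so it gets tagged as a sunflower. Real applications use far richer descriptions of a picture and far more examples, but the idea of distilling tagged examples into a model is the same.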

The ground truth

A hard, but not necessarily complex, part is creating the training materials. The machine needs to be presented with what the experts call a ‘ground truth’. This is data with preferably perfect highlighting of what is important. So in the example of the sunflowers, one needs pictures that have been correctly tagged with ‘sunflower present’ and ‘no sunflower present’.

Secondly, it does not suffice to show three pictures. One needs a lot of examples. Like hundreds, thousands or millions, depending on how complex the patterns are that one wants the machine to learn to recognize, and on how well one wants it to learn. It is a matter of statistics, which is beyond me.

Thirdly, just like with all computer programming, the adage is ‘garbage in, garbage out’. If one learns from the wrong examples, one learns the wrong thing. So, during the training, it would be bad to show a picture with a sunflower and tag it ‘no sunflower present’. And it would be worse to show a lot of those bad examples.
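The effect of bad tags can be shown with a small toy in Python. This is a sketch under assumptions: each picture is reduced to a single made-up number (the fraction of yellow pixels), the 'model' is simply the average of that number per tag, and the bad training set is the extreme case in which every single tag is wrong.

```python
# Toy illustration of 'garbage in, garbage out'.
# Assumption for illustration: one made-up number per picture,
# the fraction of yellow pixels.

def train(examples):
    """The 'model' is the average yellow fraction per tag."""
    grouped = {}
    for yellow, tag in examples:
        grouped.setdefault(tag, []).append(yellow)
    return {tag: sum(values) / len(values) for tag, values in grouped.items()}

def predict(model, yellow):
    """Pick the tag whose average is closest to this picture."""
    return min(model, key=lambda tag: abs(model[tag] - yellow))

good_tags = [
    (0.60, "sunflower"), (0.70, "sunflower"), (0.65, "sunflower"),
    (0.05, "no sunflower"), (0.10, "no sunflower"), (0.15, "no sunflower"),
]

# The same pictures, but every tag is wrong (the extreme case).
bad_tags = [
    (yellow, "no sunflower" if tag == "sunflower" else "sunflower")
    for yellow, tag in good_tags
]

very_yellow_picture = 0.68
with_good_tags = predict(train(good_tags), very_yellow_picture)
with_bad_tags = predict(train(bad_tags), very_yellow_picture)
```

Trained on the good tags, the toy model calls the very yellow picture a sunflower. Trained on the bad tags, the very same procedure confidently calls it 'no sunflower'. The machine did nothing wrong; it learned exactly what it was shown.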

Summarizing, one needs a lot of examples that cover a high diversity of possibilities, and they all need to be correctly tagged. The important question to ask is how one can be sure that the ground truth is indeed a ground truth. Perhaps the smartest thing is not to have a machine create these examples. That would just reiterate the problem. So then we need humans. Right? Because they are infallible? No, they are not. That is why we need machines … back to square one.

Well, one does not have to solve the dilemma. After all, the ground truth does not have to be 100% perfect and neither does the model. But when thinking of an ML application, one needs to ask what would be acceptable, whether that can be achieved, and how.

The machine

I easily hopped over a difficult part, which is the creation of the machine. The machine obviously is a computer that is programmed in a particular way. I am not familiar with the details, and interestingly, the wiki page on ML that I just checked does not provide any insight. I guess that there are many ways to program a learning machine. The page even mentions that ‘In practice, it can turn out to be more effective to help the machine develop its own algorithm, rather than having human programmers specify every needed step’. If the programmers need help from the very thing that they are programming, then to me, that makes it a dark art. How do we know that the machine is programming itself correctly?

The take-away from this discussion of the class phase: it is simple on one level, but complex on another. Finally, one could do with some skepticism and distrust of the result, which is one important reason to have an exam.

3 Exam

In short, the result of the class phase is that the machine has built a model of what it is supposed to recognize. Now, one wonders: how well did it do that?

To find out, one shows the machine examples that it has not seen before and asks it to point out the important things, or to do whatever it had to learn. Then one checks if it got it right, or at least right enough for one’s purposes. If not, then it needs more or better training, and perhaps the machine needs reprogramming. If it passes the exam, one puts it to work in the real world.

It sounds really simple, but the number of blog posts and Medium articles pointing out that some data scientists still don’t get it is staggering. It is like with humans. If one tests with the exact same examples that were used in class, then the test is flawed. A good score may mean that the student has simply memorized the examples. It does not necessarily mean that the student has learned what a sunflower looks like. It is the same with learning machines.
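The memorizing student can be sketched in a few lines of Python. This is a deliberately silly toy, with assumptions stated up front: the pictures are stand-in numbers, the 'student' does nothing but store the class examples with their tags, and for any picture it has never seen it falls back on a fixed guess of 'no sunflower'.

```python
# A 'student' that only memorizes: it stores every class example with its tag.
# Assumption for illustration: pictures are represented by stand-in numbers.

def train_by_memorizing(examples):
    """The 'model' is just a lookup table of the class examples."""
    return dict(examples)

def predict(memory, picture):
    # Unseen pictures get a fixed guess, because nothing general was learned.
    return memory.get(picture, "no sunflower")

def score(memory, examples):
    """Fraction of examples for which the prediction matches the tag."""
    correct = sum(1 for picture, tag in examples if predict(memory, picture) == tag)
    return correct / len(examples)

class_examples = [
    (1, "sunflower"), (2, "no sunflower"), (3, "sunflower"), (4, "no sunflower"),
]
exam_examples = [
    (5, "sunflower"), (6, "sunflower"), (7, "no sunflower"), (8, "sunflower"),
]

memory = train_by_memorizing(class_examples)
score_on_class_examples = score(memory, class_examples)   # the flawed exam
score_on_unseen_examples = score(memory, exam_examples)   # the proper exam
```

On the examples from class, this student scores a perfect 1.0. On the four unseen examples, it gets only one right, a score of 0.25. Only the second number says anything about what was actually learned.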

So, the problem of providing a ground truth extends to the exam phase. If one has no good exam data, then one cannot even have an indication that the model works well. Put such a model to work, and someone will suffer the consequences.

4 Work

After the hard work of class and exam has been done, it is time for the machine to do the work.

Usually, this is when most people hear about the existence of the machine. Class and exam are black boxed into ‘a computer’ that does some impressive work.

For humans, the learning is never over. For an ML application, that may or may not be the case. Once the training and testing are done, the resulting model can live on and be applied as it is. However, I would argue that it is better if the machine keeps learning, for the same reasons that it is good for humans to keep learning. Things change over time, even the shape of a sunflower, and it would be good to keep track of that.

In this post, I have nothing else to say about the work phase. Actual cases and stories of ML applications are needed, which is outside the scope of this post. But look out for the next post in this blog. For now, let’s move on to the last topic of this post: if there are these serious caveats and problems, what is the point of putting in all the effort?

5 What is the point?

One point is that whereas people can easily learn to recognize the image of a sunflower in a picture, recognizing sunflowers in a million pictures will take a lot of time. Computers would be a lot faster, and perhaps more important, easier to motivate to even start on the work.

Secondly, whereas humans may still outcompete computers in recognizing images, when it comes to textual and numerical data, computers tend to do a lot better than humans. Let’s say that one wants to find a pattern in a data set with hundreds of thousands of observations, each containing 100 variables. Once the model is trained, the machine will have a fighting chance.

Thirdly, if a task is too big for one computer, the model can be copied to another, and another, or to an entire warehouse of computers. In this way, scaling up is relatively straightforward, though perhaps expensive.

‘Big data’ and ML are often used in the same sentence and these are powerful reasons for that.

Okay, but what about the mistakes? As should have become clear by now, ML is not perfect, or in other words, through ML computers suddenly are able to make honest mistakes. It is indeed shocking compared to the notion of computers as infallible machines that has been around for half a century. However, that notion was false in the first place and humans are quite used to dealing with mistakes. In certain situations, there is no need for a 100% perfect result. Or a few percent of bad results is acceptable if the alternative is a lot worse or completely absent.

After writing these last paragraphs, one final question came to my mind: what about Galaxy Zoo? As I remember, Galaxy Zoo was launched because at that time, back in 2007, computers could not classify the image of a galaxy. They could not be trained to tell if a galaxy’s shape was round, cigar-shaped, or a spiral. It turned into a famous, if not the first, citizen science project. Do they still need volunteers to do such work? Find out for yourself.

24 September 2021

Frank van der Most

Rubber Boots Data
