xkcd's Machine Learning Explained: Stirring the Compost of Data for Answers
Machine learning is a method of automating complex tasks: algorithms apply statistical analysis and pattern recognition to data in order to generate output. The xkcd comic uses a humorous analogy to explain this complex field.
The Comic Strip Analogy: A Compost Pile of Linear Algebra
The comic depicts a scene where Cueball stands next to what looks like a pile of garbage (or compost), with a Cueball-like friend standing atop it. The pile has a funnel (labelled "data") at one end and a box labelled "answers" at the other. Here and there mathematical matrices stick out of the pile. As the friend explains to the incredulous Cueball, data enters through the funnel, undergoes an incomprehensible process of linear algebra, and comes out as answers. The friend appears to be a functional part of this system himself, as he stands atop the pile stirring it with a paddle.
The main joke is that, although this description is vague and gives no intuition about or details of the system, it is close to the level of understanding most machine learning experts have of many of the techniques they use.
Machine Learning as Composting
This comic compares a machine learning system to a compost pile. Composting is the process of taking organic matter, such as food and yard waste, and allowing it to decompose into a form that serves as fertilizer. A common method of composting is to mound the organic matter in a pile with a certain amount of moisture, then "stir" the pile occasionally to move the less-decomposed material from the top into the interior of the pile, where it will decompose faster. In large-scale composting operations, the raw organic matter added to the pile is referred to as "input".
The "Stirring" Process: Randomization and Avoiding Local Minima
The "stirring" has a real counterpart: machine learning methods commonly randomize their starting conditions to avoid getting stuck in local minima. Neural networks typically initialize their edge weights randomly, and randomization is also the first step of many other algorithms: k-means starts from randomly chosen centroid locations, and random forests build each decision tree from a random subset of the data and features.
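The k-means case can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it runs plain k-means from random centroids several times and keeps the run with the lowest total squared distance, which is how random restarts sidestep bad local minima.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=None):
    """One k-means run: centroids start at k randomly chosen data points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    inertia = ((X - centroids[labels]) ** 2).sum()  # total squared distance
    return centroids, labels, inertia

def kmeans_restarts(X, k, n_restarts=10, seed=0):
    """Re-run with different random initializations; keep the best result."""
    return min((kmeans(X, k, seed=seed + r) for r in range(n_restarts)),
               key=lambda result: result[2])
```

A single run can converge to a poor clustering if both initial centroids happen to land in the same cluster; restarting from fresh random positions and keeping the lowest-inertia result is the standard (if brute-force) remedy.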
The Role of Linear Algebra
'Machine learning' algorithms that can be reasonably described as pouring data into linear algebra and stirring until the output looks right include support vector machines, linear regression, logistic regression, and neural networks. Major recent advances in machine learning often amount to 'stacking' the linear algebra up differently, or varying stirring techniques for the compost.
Anyone who's worked with neural networks knows they're still essentially a linear algebra problem, just with nonlinear activation functions.
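A toy forward pass makes this concrete (layer sizes and weights here are illustrative, not from any real network): each layer is just a matrix multiply plus a bias, with a nonlinearity in between. Without the nonlinearity, the "stacked" layers collapse back into a single linear map.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "stacked" layers: each is a matrix multiply plus a bias.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)   # layer 1: 4 inputs -> 3 units
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)   # layer 2: 3 units -> 2 outputs

def relu(z):
    # The nonlinear activation: zero out negative entries.
    return np.maximum(z, 0.0)

def forward(x):
    h = relu(x @ W1 + b1)   # linear map, then nonlinearity
    return h @ W2 + b2      # another linear map

x = rng.normal(size=(5, 4))  # a batch of 5 inputs
y = forward(x)

# Without the activation, two layers are equivalent to one linear map:
W_collapsed, b_collapsed = W1 @ W2, b1 @ W2 + b2
assert np.allclose((x @ W1 + b1) @ W2 + b2, x @ W_collapsed + b_collapsed)
```

The final assertion is the whole point: strip out the activation function and the stack of matrices reduces to a single matrix, which is why the nonlinearity is what separates a neural network from ordinary linear regression.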
Recurrent Neural Networks: Adding a Temporal Dimension
A recurrent neural network is a neural network where the nodes affect one another in cycles, creating feedback loops in the network that allow it to change over time. To put it another way, the neural network has 'state', with the results of previous inputs affecting how each successive input is processed. In the title text, Randall is saying that the machine learning system is technically recurrent because it "changes".
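A minimal sketch of this 'state' (weight shapes and the `tanh` activation are illustrative choices, not from any particular architecture): the hidden state is fed back into the next step, so the same inputs presented in a different order leave the network in a different state.

```python
import numpy as np

rng = np.random.default_rng(1)
W_x = rng.normal(scale=0.5, size=(3, 4))  # input -> hidden
W_h = rng.normal(scale=0.5, size=(4, 4))  # hidden -> hidden: the feedback loop

def rnn_step(h, x):
    """One step: the new state depends on the old state AND the new input."""
    return np.tanh(x @ W_x + h @ W_h)

def run(xs):
    h = np.zeros(4)   # initial state
    for x in xs:      # earlier inputs shape how later ones are processed
        h = rnn_step(h, x)
    return h

xs = rng.normal(size=(6, 3))
h_forward = run(xs)
h_reversed = run(xs[::-1])  # same inputs, different order, different state
```

Because `h` carries information from one step to the next, the network "changes over time" in exactly the sense the title text jokes about.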
The Human Element: Training and Bias
There is also the issue of people "training" intelligent systems from their gut feelings. Suppose, for example, a system is meant to determine whether a person should be promoted into a currently vacant business position. If the system is taught by the humans currently in charge of that very decision, so that it rejects the candidates they would decline and favors the ones they would promote, all these people may be doing is feeding the machine their own irrational biases. One could argue that this, when it happens, is bad usage rather than an inherent issue of machine learning itself, but it comes close to "stirring the pile until the answers look right".
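The mechanism can be demonstrated with a deliberately contrived toy model (the "skill" and "group" features, the biased labels, and the from-scratch logistic regression are all invented for illustration): if past deciders promoted people based on an irrelevant attribute rather than skill, a model trained on their decisions learns that attribute, not skill.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
skill = rng.normal(size=n)                         # the relevant trait
group = rng.integers(0, 2, size=n).astype(float)   # an irrelevant attribute
# Biased "training data": past deciders promoted group 1 regardless of skill.
y = (group == 1).astype(float)

X = np.column_stack([skill, group, np.ones(n)])    # features + intercept
w = np.zeros(3)
for _ in range(2000):
    # Plain gradient descent on the logistic (cross-entropy) loss.
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - y) / n
```

After training, the weight on `group` dominates the weight on `skill`: the model has faithfully reproduced the bias in its training labels, which is the point of the paragraph above.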
Explainability: An Elusive Goal
Explainability of machine learning remains an elusive goal for the most part. Without it, we are just grasping in the dark and are likely to rely on ex-post pattern fitting to explain our results.
CAPTCHAs: Training AI with Human Labor
A typical CAPTCHA might distort a random sequence of letters and numbers and put it in a strange and/or mixed font and ask a user to type it, or it might show a set of pictures and ask the user which ones contain fire hydrants; these tasks are meant to be easy for humans but obscenely difficult for computers. CAPTCHAs run by Google are also used to train artificial intelligences to get better at these difficult tasks, such as reading poorly-scanned text or identifying objects of interest on the road (the latter being the subject of 1897: Self Driving).
This comic jokes about a malicious CAPTCHA which is being used to train an AI to dominate the world. In order to prevent people from taking shelter, the AI uses the CAPTCHA to ask humans like Cueball to tell it places where they would hide. The implication is that during a robot uprising, the AI, on the side of the robots, would then be able to track down humans much more easily.
Sheltering from the Robot Uprising: A Humorous Take on Disaster Preparedness
Sometimes, the best (or least-worst) response to a disaster is to "shelter in place" until the danger is passed, rather than risk getting caught in the open or in traffic. This is commonly advised in response to biological, chemical, or radiological hazards, or in the case of a violent act committed in the community. If the robot uprising is localized, then sheltering at home would be a fine response, because traveling to the other locations would increase the risk of being spotted and attacked by self-driving cars or aerial drones.
If there is a robot uprising, then traveling to a forest or other nature reserve, far away from developed cities and towns, would reduce the risk of being near a hostile piece of technology. Cars offer some shelter and, more importantly, mobility in one convenient package. Most families own at least one, and they are widespread in human-occupied areas, so even if the car is not as suitable as a long-term shelter (depending on how the road and gasoline/power networks survive the uprising) it makes a fine first step in evacuating to a more permanent hiding place - at least until it becomes a more obvious target for either the hostile machines or fellow escapees who desire it for themselves.
Cities offer thorough selections of supplies and tools that may be harder to come by in rural areas, but they are also home to many robots and automated systems that may participate in the uprising, not to mention humans who may be prime targets for the machines. The log with a board leaning on it is an example of an improvised shelter: it could be constructed anywhere from local materials, can be made almost anywhere and easily camouflaged, and would not be marked on any map known to the robots, all of which are positives for surviving the onset of the uprising.
The title text imagines a different malicious CAPTCHA which Randall says is "more likely" than the robot-uprising scenario, in which a company or government asks users to identify "disloyal" members of society. Presumably the company or government would then use this information to eliminate such "disloyal" members, either by firing them (company) or jailing, expelling, or executing them (government).