Artificial intelligence - Part 2 - No mumbo jumbo
What AI is and why it is not magic. In this article, you will learn how AI works.

In the first part, we looked at the use cases of neural networks and realised that they are only tools for a very specific area of application. In this part, we want to explore the technological background in an understandable way.
In "It has long since started", I explained in broad strokes how a neural network is trained. In very simplified terms, I explained that a neural network is fed with large amounts of data, from which you know what the desired result should look like, and "learns" independently which patterns in the data lead to precisely these desired results. The advantage, as already explained in the first part of this series "Artificial intelligence - Part 1 - Looking into the crystal ball", is that the recognition of such patterns no longer has to be done by hand in detective work, but is carried out independently by algorithms.
In this article, I go into more detail about training AI and use examples to show you which parts make up such a neural network and how these parts are connected.

1. The model
The model is the centrepiece of machine learning. It is a structure of so-called neurons, which in turn are grouped into layers. The totality of these layers, their structure and the connections between them form the model. It is not the same as the fully trained neural network, but rather the basic structure that is then trained in a second step. We can visualise the model in a very simplified way as a kind of coordinate system: you define an x-axis and a y-axis as well as a formula, which later forms a graph. In the next step, values can be read from this graph.
Let's take an example: We work for an ice cream parlour and want to find out how expensive our ice cream should be. We are particularly interested in banana ice cream. To be able to predict what the next production of banana ice cream will cost us, we want to have a way of predicting the price of the bananas so that we are no longer surprised by our supplier's invoice. To do this, we design a coordinate system:
A bit empty, the whole thing. So let's add some data to it. Firstly, we need a new parameter to modify the input. The input is the quantity of bananas ordered. The retailer gives us a price per kilo. This results in the following formula: desired quantity × price per kilo. Bananas currently cost around €1.30 per kilogram. If we use this to teach the model, we get the formula:
Input x multiplied by 1.3 gives the predicted price y.
x × 1.3 = y
However, we want to have the bananas delivered and not just buy them at the market stall. To do this, the retailer charges a flat rate of €2 per delivery and we need a new parameter for our formula:
Input x multiplied by the price per kilo (1.3) plus the delivery fee equals the predicted price y.
x × 1.3 + 2 = y
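Written as a tiny Python sketch (nothing more than the formula above):

def predicted_price(kilos):
    price_per_kilo = 1.30   # current banana price in euros per kilogram
    delivery_fee = 2.00     # flat delivery fee in euros
    return kilos * price_per_kilo + delivery_fee

print(predicted_price(6))   # 9.8 euros, the prediction used below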
Very good! According to our prediction, we can expect the next delivery of 6 kg of bananas to cost us €9.80. Our forecast works. But what if prices differ from month to month because bananas are in season? Our retailer also offers a bulk discount. And we want to order apples as well, in which case only part of the delivery fee is attributable to the bananas. We need MORE PARAMETERS! MANY MORE PARAMETERS!
That's exactly what neural networks are for. They can take many more parameters into account for the forecast than normal software and, above all, pick out the values for these parameters themselves.
In a very simplified way, an AI model can be imagined as a kind of graph that receives data as input and then reads the corresponding value or prediction from this graph. In reality, however, a neural network would more likely be a whole series of graphs that are connected to each other (neurons). Each neuron processes the input data and passes its output on to the next neurons until the last layer is reached. After the last layer, the result is interpreted.
However, separate software is required for this input and the subsequent interpretation: the inference code.
2. Inference Code
Once an AI model has been successfully trained, it can be used to make predictions. This use of the neural network to generate predictions is known as inference. The model receives new input data, processes it according to its internal structure and weighting and generates an output based on the knowledge acquired during training. This output is then converted into a usable result.
The layer that delivers the input, e.g. from a user, another programme or a database, to the model and interprets and returns the output is often referred to as inference code.
In our example, this would be the ice-cream maker, who takes the quantity of bananas from the recipe, enters it into the formula and then notes down the result in order to set aside the money for the retailer.
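As a rough Python sketch of this division of labour (the function names are made up for illustration, and our "model" is still just the banana formula rather than a trained neural network):

def run_inference(model, kilos_of_bananas):
    # 1. prepare the input for the model
    model_input = float(kilos_of_bananas)
    # 2. let the model produce its raw output
    raw_output = model(model_input)
    # 3. interpret the output and return a usable result
    return f"Expected price: {raw_output:.2f} EUR"

banana_model = lambda x: x * 1.30 + 2.00   # stand-in for a trained model
print(run_inference(banana_model, 6))      # Expected price: 9.80 EUR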
3. Neuron
Neural networks consist of artificial "neurons", i.e. nerve cells that are simulated using programme code. These artificial neurons have very little in common with real nerve cells; rather, they are all small mathematical formulae, just like our formula for the graph in the example. At the beginning, the parameters are filled with random values. During training, these values are adjusted again and again until the result of the entire network is as close as possible to the desired result. These parameters are usually called weight and bias: the weight is a factor by which the input is multiplied, and the bias is a value that is added to the result.
Depending on its size, a neural network has a whole series of such weight and bias values, which are adjusted during training in order to optimise the model. The training consists of feeding the model with known input-output pairs and adjusting the two values accordingly in order to obtain the most accurate prediction possible. More on this in a moment. Such an artificial neuron could look like this:
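(A minimal Python sketch: the starting values for weight and bias are random, exactly as described above, and the input of 6 kg is just an example.)

import random

def neuron(x, weight, bias):
    # just like the banana formula: input times weight plus bias
    # (real neurons usually also pass this sum through an activation function)
    return x * weight + bias

# at the start of training, weight and bias are simply random values
w = random.random()
b = random.random()
print(neuron(6, w, b))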
4. Sampling
Sampling is used outside the model. It is a method used in language models to select the next word in an output sequence (the subsequent output sentence or text), where the output consists, for example, of probabilities for individual words. The aim of sampling is to soften the effect of pure prediction based on learned patterns, in which the most probable word would always be selected. In natural language, however, the most probable or most frequent word is not always the one used; there is a strong diversification in the choice of words. The most obvious next word in a sentence is not always the one that sounds the most beautiful or interesting. For this reason, a certain random factor is required, which is introduced into the output through sampling.
After the model has calculated the probabilities for an array (list) of words, a random percentage value is generated and the next word is selected according to this value. This step is normally performed by the application or programme in which the model is embedded.
In many implementations, the softmax function[3] is used for sampling. The softmax function transforms the output values of the model into a probability distribution, which can then be used for sampling.
In practice, parameters are also used here to help ensure that unusual words are used or that word repetitions are less likely.
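A small Python sketch of softmax plus sampling (the word list and the raw model scores are invented, and real language models work with tens of thousands of tokens; the "temperature" shown here is one common example of the parameters just mentioned):

import math, random

words = ["banana", "ice", "apple", "price"]   # invented vocabulary
scores = [2.0, 1.2, 0.4, 0.1]                 # invented raw outputs of the model
temperature = 1.0                             # higher values make unusual words more likely

# softmax: turn the raw scores into a probability distribution
exps = [math.exp(s / temperature) for s in scores]
probs = [e / sum(exps) for e in exps]

# sampling: draw the next word at random according to these probabilities
next_word = random.choices(words, weights=probs, k=1)[0]
print(next_word)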
5. Training of neural networks
There is a whole range of paradigms in the field of machine learning. I will explain some of them in more detail below. They all require gigantic amounts of data in order to derive patterns from them and thus "learn" how to master a task.
a. Training data
Training data, often referred to as a corpus or dataset, regularly qualifies as a database: its elements are organised systematically (e.g. alphabetically), individually accessible and typically the product of a labour-intensive and time-consuming development and maintenance phase. Such a collection therefore also constitutes a database within the meaning of copyright law and is protected accordingly. More on this in another article.
In this database, elements that resemble the later input data are linked to the desired results. For the animal-recognition example used below, this would be a database full of animal photos in which each photo is labelled with the type of animal it shows. This process of labelling data is also known as data annotation.
The database itself can obtain its data from all kinds of sources: people can be hired to create it, photos can be bought from photographers, the internet can be scraped, and literature, videos, audio files and the like can be used. This is how Google used unsuspecting internet users to generate training data. Everyone knows the infamous grid of nine photos, some of them difficult to make out, usually combined with a request to click on all the tiles showing, for example, a fire hydrant or traffic lights. Some of the photos were already known to Google's reCaptcha service, but many were not. Users diligently clicked on all the image sections showing hydrants, and once enough people had clicked on the same tile, Google could be relatively certain that a hydrant was actually visible there. This not only provided a human interactive proof (HIP for short), but also generated a large amount of training data for the AI. All kinds of copyright issues arise when generating training data, but these are not specific to AI; they concern the areas of big data and databases in general.
b. Paradigms
1. Supervised learning
When teaching a model, enormous amounts of so-called training data are required. This data consists of the elements that are similar to the input data we want to use later, as well as the desired outputs for this training data. This training data is then fed into the neural network and an output is generated. An output is never a definite result, but rather a kind of probability distribution over all possible results. This probability distribution is then compared with the desired result. If the probability of the desired result is too low, the weight and bias values are adjusted until the result is within the desired range. This is done more or less by approximation and trial and error. As a result, the process takes a very long time and requires a lot of computing power.
This process is controlled by the training code. This differs slightly from the inference code, but is very similar to it. The training code contains optimisation techniques for teaching the model and also a feedback system. The inference code often contains components that filter or interpret the output.
Supervised learning is ideal for tasks such as:
- Classification: spam filtering, image recognition, sentiment analysis
- Regression: price forecasting, weather forecasting, modelling of time series
Process example
Suppose the goal is to develop an AI that can recognise animals in photos. In this case, the training data would consist of animal photos from which it is known which animal can be seen. In the next step, the images are fed into the neural network. This could be a photo of a dog, for example. As a result, the network may spit out 60 % cat, 3 % elephant, 26 % mouse and 11 % dog. So we have to adapt the network. This happens automatically, of course. In the next run, we might get 35 % dog for the same image. We are getting closer. This happens again and again with many images and an unbelievable number of passes until we get a result of 90 % dog for the example image. The network now recognises the dog in the photo quite reliably.
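The comparison between the network's output and the desired result can itself be put into a few lines of Python, using the numbers from the example above (a real training setup would use a proper loss function such as cross-entropy, but the idea is the same):

# the network's guess for the dog photo ...
output = {"cat": 0.60, "elephant": 0.03, "mouse": 0.26, "dog": 0.11}
# ... and what we actually wanted: 100 % dog
desired = {"cat": 0.0, "elephant": 0.0, "mouse": 0.0, "dog": 1.0}

# a simple measure of how far off the network still is
error = sum((output[animal] - desired[animal]) ** 2 for animal in output)
print(error)   # the training code adjusts weights and biases to shrink this number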
The whole thing must also work well with all the other animals that are to be recognised. In practice, several specialised neural networks are often combined for such recognition: one may recognise fish very well, another pets. A third could, for example, decide whether a given image is better handled by the fish network or the pet network. A whole cluster of neural networks may be necessary to achieve reliable results in the end.
Large amounts of data are labelled manually or automatically. For example, animal photos are labelled with the animal's name.
2. Self-supervised learning
Unlabelled data sets are used in self-supervised learning. The model has to find the labels, i.e. the output target, itself. The advantages are obvious: there is no need for data annotation, which is a major cost and time factor.
For example, large amounts of text from the Internet can be made available to a model as data for training. By hiding individual text parts, the model can be trained to find the missing text parts independently.
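A minimal Python sketch of how such training pairs can be produced automatically from raw text (the sentence is made up; real systems do this with billions of sentences and work on tokens rather than whole words):

import random

text = "our banana ice cream is the best in town"
words = text.split()

# hide one randomly chosen word ...
position = random.randrange(len(words))
masked = words.copy()
masked[position] = "[MASK]"

# ... and we get an input-target pair without any manual labelling
training_input = " ".join(masked)
training_target = words[position]
print(training_input, "->", training_target)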
The disadvantage of this method, however, is that a fine-tuning phase is necessary in which the model learns to fulfil exactly one task reliably. In the process, the previously acquired knowledge is refined and utilised.
Self-supervised learning is used in the following areas in particular:
- Image processing: image categorisation, object segmentation, image generation
- Natural language processing: text generation, machine translation, text summarisation
- Time series analysis: anomaly detection, prediction
3. Reinforcement learning
Reinforcement learning should be familiar to anyone who has seen the first-class film "War Games". The system learns through a reward system: before training begins, it is defined which outcome of a decision by the system earns which score.
Example:
A neural network is to learn to play chess. To do this, the network is given a chessboard on which it can move the pieces according to the rules. It then plays against a chess computer. The network receives a certain number of points for each piece it captures: one point for each pawn, three points for a bishop, and so on. The king scores significantly more points than all the other pieces. Conversely, points are deducted from the network for lost pieces. The network then tries to increase its score with each pass. At some point, the network will have "learnt" to checkmate the opponent as efficiently as possible.
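The reward principle can be sketched in a few lines of Python. A heavily simplified "player" repeatedly chooses one of three possible moves, receives the points defined for that move and gradually learns which move earns the most. The moves and point values are invented, and a real chess-playing system is vastly more complex, but the underlying idea is the same:

import random

# invented moves and the points they earn (unknown to the learner at first)
points_for = {"capture pawn": 1, "capture bishop": 3, "lose a piece": -2}

estimate = {move: 0.0 for move in points_for}   # the learner's current guesses
counts = {move: 0 for move in points_for}

for _ in range(1000):
    # mostly pick the move that currently looks best, sometimes explore
    if random.random() < 0.1:
        move = random.choice(list(points_for))
    else:
        move = max(estimate, key=estimate.get)
    reward = points_for[move]
    # nudge the estimate for the chosen move towards the reward received
    counts[move] += 1
    estimate[move] += (reward - estimate[move]) / counts[move]

print(estimate)   # "capture bishop" ends up with the highest estimated value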
Reinforcement learning is used in areas such as
- Robotics: Controlling robots in dynamic environments
- Games: Development of AI players that learn complex game strategies
- Optimisation: Optimisation of resource allocation in complex systems
c. Example
The following graph shows the prices for the last three orders of bananas. You can try to draw a straight line through all three points as accurately as possible. The closer the line is to all three points, the more precise the prediction of the price for more or fewer bananas. In reality, the graph would have a few more dimensions. For example, the current petrol price, as this influences the transport costs. With more parameters, it becomes increasingly difficult to find the perfect line manually.
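A small Python sketch of this fitting process, again by simple trial and error (the three price points are invented and follow the earlier formula of €1.30 per kilo plus €2 delivery; real training code uses more refined optimisation techniques, but the principle of nudging the parameters until the line fits is the same):

import random

# three past orders: (kilos ordered, price paid in euros) - invented values
orders = [(2, 4.6), (4, 7.2), (6, 9.8)]

def error(slope, intercept):
    # how far the line y = slope * x + intercept misses the three points
    return sum((slope * x + intercept - y) ** 2 for x, y in orders)

slope, intercept = random.random(), random.random()
for _ in range(200_000):
    new_slope = slope + random.uniform(-0.01, 0.01)
    new_intercept = intercept + random.uniform(-0.01, 0.01)
    # keep a small random change only if it makes the line fit better
    if error(new_slope, new_intercept) < error(slope, intercept):
        slope, intercept = new_slope, new_intercept

print(round(slope, 2), round(intercept, 2))   # close to 1.3 and 2.0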
6. From model to text
The model input: The user's input, for example with ChatGPT
The input is handed to the neural network by the inference code.
The model output: Probabilities for the next words or characters
The model generates as its output a probability distribution over an array (a list) of words or characters. These probabilities reflect how likely the model considers each word or character to be the next one in the sequence, i.e. in the final output.
Selecting the next word: sampling and greedy approach
Various techniques can be used to determine the final next word or character. One common method is sampling: a word or character is drawn at random according to the probability distribution, so more probable words are chosen more often, but not always. This makes the output language appear more natural, because we humans also formulate sentences differently each time, even when we want to express the same thing.
Alternatively, you can use the greedy approach, which selects the word or character with the highest probability and introduces less randomness into the output.
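Both approaches side by side as a short Python sketch (the words and their probabilities are invented):

import random

words = ["delicious", "cold", "expensive", "yellow"]
probs = [0.5, 0.3, 0.15, 0.05]   # invented probability distribution from the model

# greedy: always the single most probable word
print(words[probs.index(max(probs))])   # always "delicious"

# sampling: drawn according to the probabilities, so the result varies
for _ in range(5):
    print(random.choices(words, weights=probs, k=1)[0])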
Generation of the probability distribution: activations, weighting, bias and softmax
The probability distribution is generated from the activations of the neurons in the model, from their weights and biases, and by applying special mathematical functions such as the softmax function. These operations transform the raw output of the model into a valid probability distribution that is suitable for text generation.
Diversification through probabilities: Several options for the next step
By outputting probabilities over the word array, the model can make the generated texts more flexible and consider several options for the next word. This leads to greater variety and naturalness in the generated texts.
7. No mumbo jumbo
We have now learnt about the entire process leading up to a trained neural network. Anyone who has Excel can take a look inside the black box: a file of a good 1.25 gigabytes can be downloaded from https://github.com/ianand/spreadsheets-are-all-you-need/releases. These Excel spreadsheets contain over 124 million parameters of a predecessor of ChatGPT, to be precise a basic version of GPT-2, so that everyone can see that AI is not mumbo jumbo after all, but simply IT.