# Road to Neuropia

This article describes a C++ programmer’s entry into machine learning through implementing a feed-forward neural network. I started writing it as a blog post, but along the journey I found more aspects I had to cover. It is not an attempt to be any kind of research paper, and it purposefully breaks many golden rules of doing science. In keeping with blog writing practice, the conclusions here are rather subjective opinions, or at least should be taken as such.

The first part introduces the motivation, explains some background behind the task, and then sums up the implementation of the backpropagation algorithm. The second part takes the coding perspective and focuses mostly on performance, since for a coder time is always in limited supply and thus a topic of interest. The last part changes perspective towards the data science field and discusses ways to improve the learning capability of the network.

I would like to thank my colleagues, especially Data Scientist Dr. Henry Joutsijoki, for his effort in commenting on the content during the creation cycles, and others for all their valuable comments and efforts to improve this article.


# Introduction

A few years back my colleague came up with a coding assignment called Neuropia for recruitment assessment. The aim was to implement a simple machine learning solution. The actual evaluated goal, however, was to design an interface for perceptrons and a multilayer feed-forward neural network, and to use that interface with a minimal neural network. The assignee was supposed to use the network to model a binary logic gate with two inputs and one output and produce the expected results. For example, if the network has been trained to act as a NAND gate, then the input array 1, 1 would produce 0, and the other input combinations would produce 1.

Neuropia turned out to be notoriously difficult. In general, it has been pretty tedious to adjust the values by hand so that the network works just as a NAND gate. Implementing even a single gate was quite tricky, and the most straightforward approach was to make the activation function implement the logical gate directly. That is a really simple artificial neuron and therefore fulfills the requirement, but it is not a perceptron. Such an implementation does not address the benefits and purpose of neural networks, so a more comprehensive answer would be nice.

Not so long ago I started to follow the highly recommendable YouTube channel 3Blue1Brown and watched its episodes on neural networks and the backpropagation algorithm. That gave me the idea to implement Neuropia in a way that lets it learn and be applied to more complex tasks than just resolving NAND logic.

# Implementing Learning Network

The Neuropia C++ interface implements two classes called Neuron and Layer. Neurons represent perceptrons and hold weights (a multiplier for each input value), a bias (a value that is added to the weighted sum), and the activation function that modifies the sum before it is passed to the next neurons.
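To make the weights, bias and activation function concrete, here is a minimal sketch of a perceptron. It is not the Neuropia `Neuron` class: the `Perceptron` struct, the `makeNand` helper and the hand-picked values (weights of -2, a bias of 3 and a step activation) are my own assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical minimal perceptron; not the Neuropia Neuron class.
struct Perceptron {
    std::vector<double> weights;               // one multiplier per input
    double bias;                               // added to the weighted sum
    std::function<double(double)> activation;  // shapes the sum into the output

    double feed(const std::vector<double>& inputs) const {
        assert(inputs.size() == weights.size());
        double sum = bias;
        for (std::size_t i = 0; i < inputs.size(); ++i)
            sum += weights[i] * inputs[i];
        return activation(sum);
    }
};

// Hand-picked values that make the perceptron behave as a NAND gate.
inline Perceptron makeNand() {
    return Perceptron{{-2.0, -2.0}, 3.0,
                      [](double sum) { return sum > 0.0 ? 1.0 : 0.0; }};
}
```

Calling `makeNand().feed({1.0, 1.0})` gives 0 and every other binary input pair gives 1. The values were chosen by hand, which is exactly the tedious part that training is meant to replace.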

Layers implement a doubly linked list, each containing a vector of Neurons and a mutable buffer that holds the most recent calculation results of the layer. In this implementation each layer is fully connected, meaning that each neuron in a layer is connected to each neuron in the succeeding layer. Figure 1 shows a simple network that has an input layer of two inputs, a single hidden layer and an output layer with a single output.

The set of layers forms a network. When the network is created, the first layer is the input layer, the last is the output layer, and the layers in between are called hidden layers. The input layer does not modify the data in any way, it just passes the data to the next layer. The neuron properties inside a layer can be set individually or layer-wise.
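The following sketch shows how data could flow through fully connected layers. It is an illustration under my own assumptions, not the Neuropia `Layer` class: `DenseLayer`, `feedXor` and the hand-set weights are hypothetical, and a step activation is used so that the results are exact. The hidden layer computes OR and NAND and the output neuron combines them with AND, which yields XOR, a gate that a single perceptron cannot represent.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fully connected layer: weights[n][i] is the weight from input i to neuron n.
struct DenseLayer {
    std::vector<std::vector<double>> weights;
    std::vector<double> biases;

    std::vector<double> feed(const std::vector<double>& inputs) const {
        assert(weights.size() == biases.size());
        std::vector<double> outputs(weights.size());
        for (std::size_t n = 0; n < weights.size(); ++n) {
            assert(weights[n].size() == inputs.size());
            double sum = biases[n];
            for (std::size_t i = 0; i < inputs.size(); ++i)
                sum += weights[n][i] * inputs[i];
            outputs[n] = sum > 0.0 ? 1.0 : 0.0;  // step activation keeps the example exact
        }
        return outputs;
    }
};

// XOR built by hand: the hidden layer computes OR and NAND, the output layer ANDs them.
inline double feedXor(double a, double b) {
    const DenseLayer hidden{{{1.0, 1.0}, {-1.0, -1.0}}, {-0.5, 1.5}};
    const DenseLayer output{{{1.0, 1.0}}, {-1.5}};
    return output.feed(hidden.feed({a, b})).front();
}
```

The inputs play the role of the input layer here: they are passed on unchanged, and only the hidden and output layers do any calculation.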

## Logistic Gate

It may be interesting to note that resolving an XOR gate requires at least one hidden layer. Here we first create a network of three layers, where the input layer and the hidden layer each have two neurons and the output layer has a single neuron that emits a boolean output value. The network is illustrated in Figure 1. All neurons are initialized to their default values.

```
auto network = Neuropia::Layer(2).join(2).join(1);
```

The `Layer::join` function has a few overloads. The variant used here creates a Layer with the given number of neurons and binds it to the calling layer. The first layer is automatically an input layer, and the last an output layer.

Before training the network, all weights and biases must initially be set to random values.

```
network.randomize();
```

The train function is called repeatedly with input data and the expected output data, aka labels. The third parameter, the learning rate, will be discussed later, but for reference, a value of 0.05 is used here. The `TRAININGS` constant defines the number of training iterations, also called epochs.

```
while(++loops < TRAININGS) {
    // pick a random training sample for each iteration
    const auto train = trainData[random()];
    network.train(train.inputs.begin(), train.labels.begin(), MY_LEARNINGRATE);
}
```

Verify the results and see the magic happen.

```
for(const auto& input : verifyData) {
    std::cout << input << "->" << network.feed(input) << std::endl;
}
```

As a result we will get output for the gates in Figure 2:

From the results, you can see that the network outputs are probabilities close to 0 and 1, as expected. Interestingly, with a certain probability the output is completely incorrect, as the training algorithm is not immune to getting stuck in a local minimum. However, tuning the learning rate and the other hyperparameters, discussed later, helps to configure the network so that this unfortunate case does not occur.

## Backpropagation Implementation

This Neuropia implementation uses the backpropagation algorithm for training. I am not going through the math behind the implementation, since there are more than plenty of mathematical papers, articles, books, YouTube videos, and other sources. I hope the audience here will find it more interesting to dig into the implementation from a programmer’s perspective.

The backpropagation algorithm optimizes the network for the given training data, and if the training data represents the learnable subject well, the results can be applied more generally when previously unseen data is fed into the trained network.

The optimization process uses a gradient descent method that gradually searches for a minimum on a multidimensional surface, the point where the network values are best adapted to the input. Later on, I will use ‘minimum’ to refer to that single point where the network would theoretically reach perfection, i.e. where the neurons have their optimal values.

Before we dive deep into the algorithm, a few words about the Matrix class used. The backpropagation algorithm here could be implemented without matrices, and neither Neuron nor Layer uses any matrices internally. However, the backpropagation algorithm is such a complex beast that matrices are excellent tools for implementing it. Due to the initial design, there is likely some performance hit when data has to be moved between Neurons and matrices, and therefore in any real-world implementation the internal data layout would likely be a matrix. This Matrix class was implemented earlier for a completely different purpose, and that may (or may not) explain a few of its peculiarities. It was originally written purely for simplicity, and performance was not even considered a requirement. However, I analyzed it with Valgrind and similar tools and fixed a few of the most obvious issues. Please note that the ‘*’ and ‘*=’ operators do element-wise multiplication (also known as the Hadamard product) per row and column, in the same way as the ‘+’ and ‘-’ operations. The matrix multiplication itself is done with the `multiply` function.

So, the backpropagation algorithm goes through the network backwards (“hence the name” - as all tutorials keep saying), compares the input and output of each layer, calculates the proportional error contribution of the layer, and gradually tunes the weights and biases to minimize the error between the actual output and the expected output.

I will go through the internal implementation of the training function here. The Neuropia::Layer API function does a forward feed first, so this internal function gets the real network output for the training iteration. Please note that the code found in the repository may differ from the code presented here, as it may have undergone some further development.

```
void train(IteratorItInput inputs, IteratorItOutput expectedOutputs, double learningRate, DerivativeFunction df) {
```

The `Layer::feed` function is the one used for passing values through the network. It takes input iterators to the data as parameters and returns an output vector. Naturally, the input vector size has to match the number of input layer neurons, and the output vector then has the size of the output layer.

```
const auto out = feed(inputs, inputs + m_neurons.size());
ValueVector expectedValues(out.size());
std::copy(expectedOutputs, expectedOutputs + out.size(), expectedValues.begin());
backpropagation(out, expectedValues, learningRate, df);
}
```

The `backpropagation` function takes the output and the expected output as std::vector parameters, and hence they have to be converted to matrices.

```
void Layer::backpropagation
(const std::vector<double>& outValues, const std::vector<double>& expectedValues, double learningRate, DerivativeFunction df) {
const auto expected = Matrix<double>::fromArray(expectedValues, Matrix<double>::VecDir::row);
auto lastValues = Matrix<double>::fromArray(outValues, Matrix<double>::VecDir::row);
```

At first, the error, i.e. the difference between the expected and the output values, is calculated. There are other options for getting the error than just a linear delta between the result and the expected values. The subtraction here plays the role of the cost function, and other approaches may improve the network performance, but to keep things simple, this implementation is fine.

```
auto errors = expected - lastValues;
```

The output layer is set as the initial layer.

```
auto lastLayer = outLayer();
```

Next, we extract the buffered values from the last hidden layer. In other words, the values calculated within the feed function are stored in each Layer object: for each layer, the latest neuron output is cached in the mutable container to be used for backpropagation.

With the error and the data of each Neuron, we know how far off it is and are therefore able to get its gradient. Based on the gradient we then know how to adjust the weights and biases to move towards the minimum on the next round.

```
auto layerData = Matrix<double>::fromArray(previousLayer(lastLayer)->m_outBuffer, Matrix<double>::VecDir::row);
```

So we loop through each layer.

```
for(;;) {
```

For the gradient, the derivative of `lastValues`, i.e. the current layer output, is needed. Neuropia lets you set your own activation function for perceptrons, but for training its derivative function must be given as well. There will be more discussion about different activation functions later in this document - but as an example, we can assume that the sigmoid function is used. The sigmoid function, defined as $\sigma(x) = \frac{1}{1 + e^{-x}}$, is the default activation function for the Neuron class, and its derivative is $\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \sigma(x)(1 - \sigma(x))$. It looks a bit complicated, but the beautiful thing here is that each neuron output is already the value of the activation function, and therefore the sigmoid derivative can be written in terms of the output value as (in C++):

```
DerivativeFunction df = [](double value) -> double { return value * (1.0 - value); };
```

Hence we have an expression that gives the direction in which to change the values from the current Layer data. The Matrix::map function returns a new Matrix where the given function is applied to each Matrix value.

```
const auto gradientDelta = lastValues.map(df);
```
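As a quick sanity check of the derivative shortcut, the sketch below compares the output-based form `y * (1 - y)` with a numeric estimate of the derivative. The function names are my own and not part of Neuropia.

```cpp
#include <cassert>
#include <cmath>

// Sigmoid activation: maps any real number into the range (0, 1).
inline double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Derivative written in terms of the neuron output y = sigmoid(x).
inline double sigmoidDerivativeFromOutput(double y) { return y * (1.0 - y); }

// Central-difference estimate of the derivative, used only for comparison.
inline double numericDerivative(double x, double h = 1e-6) {
    assert(h > 0.0);
    return (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h);
}
```

For any `x`, `sigmoidDerivativeFromOutput(sigmoid(x))` and `numericDerivative(x)` should agree to several decimal places, which is why the cached layer output is enough and the weighted sum does not need to be stored for training.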

As we know the error and its direction, we can construct the gradient scaled with the learning rate. Since the training algorithm moves towards the minimum only on average, it is crucial not to take steps that are too big and thus miss the minimum, nor steps that are too small, in which case the iteration takes too long - or even ends up in an arbitrary local minimum without the ability to find a better one. Note that the learning rate can also be changed dynamically as learning progresses: it can take good leaps in the beginning and end up with tiny steps towards the end to refine the approach to the minimum.

```
const auto gradients = learningRate * errors * gradientDelta;
```

We now have a gradient that lets us adjust the bias and weight values of the layer neurons.

```
for(auto i = 0U; i < gradients.rows(); i++) {
    auto& n = lastLayer->m_neurons[i];
    n.setBias(n.bias() + gradients(0, i));
}
```

The apparent complexity of the expression above is due to the Neuropia API. In matrix notation it can be written simply as “biases += gradients”, but we have to inject the bias values back into the network, hence the loop.

Calculating the weights for a Neuron is a pretty similar operation, even though the weights are stored as a std::vector with one value per input, and therefore the implementation, as well as the math expression, is slightly more complex.

For the matrix multiplication we need the current layer output data transposed.

```
const auto layerDataTransposed = layerData.transpose();
const auto layerDeltas = Matrix<double>::multiply(gradients, layerDataTransposed);
```

Retrieve the weight data as a matrix.

```
const auto weightsData =
    Matrix<double>::fromArray(lastLayer->m_neurons,
                              previousLayer(lastLayer)->size(),
                              [](const Neuron& n, int index) {
                                  return n.weight(index);
                              });
```

If you wonder where that vector size comes from, note that the network is fully connected, and thus the number of weights per neuron is just the number of neurons in the previous layer. For readers not familiar with modern C++ syntax, the `fromArray` function accepts three parameters: the vector the data is retrieved from, its size (there are also iterator overloads in the Matrix class), and a function that maps a Neuron to a weight value.

So, now we have the current weights and can add the adjusting delta to them.

```
const auto weights = weightsData + layerDeltas;
```

The weights are written back to the network.

```
for(auto j = 0U; j < weights.rows(); j++) {
    auto& n = lastLayer->m_neurons[j];
    for(auto i = 0U; i < weights.cols(); i++) {
        n.setWeight(i, weights(i, j));
    }
}
```

OK, that is all for this layer! Then we go on to the next layer, for which we need its errors. As each weight of each neuron in the layer contributes to the current error, we get a new error matrix by multiplying the transposed weight matrix with the current error vector. Please see the details of how that is possible in Backpropagation Step by Step. Personally, I found the explanation reasonably convincing.

```
const auto weightsDataTransposed = weightsData.transpose();
errors = Matrix<double>::multiply(weightsDataTransposed, errors);
```

Now we are ready to go backward to the previous layer - unless it is the input layer, in which case we are done.

```
lastLayer = previousLayer(lastLayer);
if(lastLayer == nullptr || lastLayer->isInput())
    break; // we hit the input layer
```

Then we just get the new buffered values for the coming loop round.

```
lastValues = layerData;
layerData = Matrix<double>::fromArray(previousLayer(lastLayer)->m_outBuffer, Matrix<double>::VecDir::row);
} // end of the layer loop
} // end of Layer::backpropagation
```

And we are done! It was not that bad, right?

But hold on, there is more - the rabbit hole continues! I had a nice implementation and wanted to apply it to a somewhat more complex dataset than just the simplest logic functions. Next, we look at what happens when we use the network to classify a set of 60000 images of handwritten digits, and take a closer look at the training.
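To see the update rule from the walkthrough in isolation, here is a minimal sketch that is not Neuropia code: a single sigmoid neuron with two inputs learns the NAND gate using the same `learningRate * error * derivative` gradient. The names `TinyNeuron` and `trainNand`, the zero initial values, the fixed sweep over the truth table (instead of random sampling), and the default sweep count and learning rate are all my own assumptions.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical single sigmoid neuron with two inputs; not the Neuropia API.
struct TinyNeuron {
    double w0 = 0.0;
    double w1 = 0.0;
    double bias = 0.0;

    double feed(double a, double b) const {
        const double sum = bias + w0 * a + w1 * b;
        return 1.0 / (1.0 + std::exp(-sum));  // sigmoid activation
    }

    // One training step with the same update rule as in the walkthrough.
    void train(double a, double b, double expected, double learningRate) {
        assert(learningRate > 0.0);
        const double out = feed(a, b);
        const double error = expected - out;
        const double gradient = learningRate * error * out * (1.0 - out);
        bias += gradient;    // biases += gradients
        w0 += gradient * a;  // weight delta = gradient * input of that weight
        w1 += gradient * b;
    }
};

// Trains the neuron as a NAND gate by sweeping the truth table repeatedly.
inline TinyNeuron trainNand(int sweeps = 20000, double learningRate = 0.5) {
    const double table[4][3] = {
        {0.0, 0.0, 1.0}, {0.0, 1.0, 1.0}, {1.0, 0.0, 1.0}, {1.0, 1.0, 0.0}};
    TinyNeuron neuron;
    for (int s = 0; s < sweeps; ++s)
        for (const auto& row : table)
            neuron.train(row[0], row[1], row[2], learningRate);
    return neuron;
}
```

After training, `trainNand().feed(1.0, 1.0)` should fall below 0.5 while the other three input pairs stay above it. With more layers the only extra work is passing the error backwards through the transposed weights, as described above.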

# Closer Look to Implementation

The MNIST database of handwritten digits is considered to be the ‘Hello World’ of classifiers. It contains a training set of 60000 images and a test set of 10000 images. The data is stored in the Idx format. I created a simple `IdxReader` class that implements an iterator over a file. The image data is stored as 28 x 28 arrays of 8-bit grayscale values, and therefore the input layer of the neural network has 784 input neurons. The output layer naturally has ten neurons, one for each digit.

The first implementation passed all the training data to the backpropagation algorithm. The approach was not very pragmatic, as the calculation took several hours, so I changed the training to use stochastic gradient descent. It means that an arbitrary number of training data samples are passed to the training algorithm in random order until the result is assumed to be good enough. `IdxRandomReader` can be used to hop randomly over the data to construct random image and label sample sets.

## Hyperparameters

The number of hidden layers and other characteristics of the network have a big impact on how it behaves and performs. These features are called hyperparameters, and getting them right can be tricky. In Neuropia there are several hyperparameters, but at first we will focus on just three of them: iterations, learning rate, and network topology.

- Iterations is the number of epochs the network is executed while training.
- Learning rate defines how big a delta each gradient descent step takes.
- Topology is the structure of the hidden layers. A common approach for topology is a narrowing series of hidden layers. In the notation used here, a [32,16] topology means a network with layers of 784, 32, 16 and 10 neurons.

Table 1 compares a few of those hyperparameter combinations. The learning rate is fixed to 0.01 and the training time is capped at 30 seconds. So instead of defining iterations, I implemented the maximum training time as an alternative hyperparameter. The Iterations field in the table gives the number of epochs completed in 30 seconds. The Time field gives the seconds it takes to feed the 10000 test images through the generated network, i.e. the speed efficiency of the network. The Accuracy field defines how well the generated network then recognizes the verification test data, i.e. how well the network output matches the given label.

Topology | Iterations / 30s | Time (s) | Accuracy % |
---|---|---|---|
[16,16] | 146033 | 0.19 | 88.9 |
[32,16] | 74845 | 0.37 | 91.6 |
[256,128] | 5750 | 3.14 | 89.2 |

To improve these results I applied a linearly decreasing learning rate. The stochastic gradient descent algorithm takes random steps that in general go towards the goal until the result is considered good enough (or a certain number of iterations is done, or time runs out). As I mentioned earlier, if the steps are too small, the minimum found is more likely to be only a local one, and there would be a deeper pit elsewhere (i.e. a more accurate network state). Yet if the step is too big, there is a higher probability of jumping over the minimum and never getting close enough to an optimal solution. In Table 2 the learning rate decreases linearly from 0.3 to 0.01, and we can immediately see some improvement. Neuropia only supports a linear change of the learning rate, even though some logarithmic curve might provide better results, as learning itself is very non-linear.
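A linearly decreasing learning rate can be sketched as below. The helper name `linearLearningRate` and its parameters are hypothetical; the Neuropia implementation may compute the schedule differently.

```cpp
#include <cassert>

// Hypothetical helper: learning rate falling linearly from maxRate to minRate.
inline double linearLearningRate(int iteration, int totalIterations,
                                 double maxRate, double minRate) {
    assert(totalIterations > 0 && iteration >= 0 && iteration <= totalIterations);
    const double progress = static_cast<double>(iteration) / totalIterations;
    return maxRate - progress * (maxRate - minRate);
}
```

With the values used for Table 2, `linearLearningRate(0, n, 0.3, 0.01)` starts at 0.3 and the rate approaches 0.01 as `iteration` reaches `n`. When training is capped by time instead of iterations, `progress` could be derived from the elapsed time in the same way.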

At this point, it is good to note that there is a certain amount of variation in the numbers between test executions, and therefore they are given here with far too much precision. Consider all numbers only vaguely demonstrative. The variation in accuracy between measurements is about 1 percentage point.

Topology | Iterations / 30s | Time (s) | Accuracy % / 15s | Accuracy % / 30s | Accuracy % / 60s | Accuracy % / 120s |
---|---|---|---|---|---|---|
[16] | 155451 | 0.18 | 87.3 | 91.1 | 91.7 | 91.6 |
[32] | 82189 | 0.33 | 88.7 | 91.6 | 92.0 | 93.7 |
[16,16] | 151388 | 0.19 | 90.4 | 91.8 | 92.8 | 92.9 |
[16,16,16] | 146969 | 0.20 | 87.6 | 89.3 | 89.6 | 89.3 |
[32,16] | 78819 | 0.37 | 90.6 | 92.2 | 93.4 | 94.5 |
[128] | 19001 | 1.39 | 86.5 | 89.3 | 91.8 | 93.2 |
[256,128] | 6346 | 3.36 | 88.7 | 90.8 | 93.1 | 94.1 |
[128,32,16] | 18283 | 1.32 | 77.0 | 91.3 | 92.2 | 93.3 |

As we can see, the relation between training time and accuracy is surprisingly stable. It is fairly easy to reach about 93% accuracy using any kind of network topology, but as we will see in the last part of this article, going beyond that requires significantly more effort. There is no easy way to speed up the training just by varying the network topology. Nonetheless, with this data, two hidden layers seem to provide marginally the best result, yet the difference is so small that it really does not matter. However, the verification time, which measures how efficiently the network can be used, varies significantly and benefits from small hidden layers.

## Parallel training

I have eight cores on my laptop, and each core can execute two threads simultaneously. That made me wonder how to utilize that processing power, since the results above were gained using only a single thread. With Mini-Batch Gradient Descent the training data is split into sub-batches. The sub-batches can be executed in parallel, and their outcomes are combined into a single result. Since most of the work is executed in parallel, the overall execution should be faster than using a single thread and a single batch.

I consider here two solutions for implementing parallelism in training. The first one is kind of evolutionary: a random network is copied into threads that run mini-batches simultaneously, each of the networks is verified, and the best one is selected for the next round and then fed with mini-batches again.

The alternative approach is to use the average of the mini-batches. First, the mini-batches are trained in parallel, and then the networks are merged by applying the average weights and biases to the network used for the next round. Later on, these approaches are called Evolutional and Averaging.
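The merge step of the Averaging approach can be sketched as follows. This is an illustration under my own assumptions, not the Neuropia code: each trained network is represented as a hypothetical flat `Parameters` vector holding all of its weights and biases in the same order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat view of a network: all weights and biases in one vector.
using Parameters = std::vector<double>;

// Merges networks trained on separate mini-batches by averaging each parameter.
inline Parameters averageParameters(const std::vector<Parameters>& trained) {
    assert(!trained.empty());
    Parameters merged(trained.front().size(), 0.0);
    for (const auto& params : trained) {
        assert(params.size() == merged.size());  // same topology required
        for (std::size_t i = 0; i < merged.size(); ++i)
            merged[i] += params[i];
    }
    for (auto& value : merged)
        value /= static_cast<double>(trained.size());
    return merged;
}
```

The merged parameters would then be copied back into the network that seeds the next round of parallel mini-batches. The Evolutional approach replaces this function with a selection of the best-scoring network.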

In both approaches, it is assumed that gradient descent runs faster towards the minimum when the network can be adjusted during the training. However, at first the results were not so promising and I had to look closer at my code. The analysis revealed that a big issue was my Matrix class: the implementation used dynamic memory allocation extensively.

Therefore I implemented a dedicated C++ allocator that takes advantage of the assumption that only a limited number of different matrix dimensions are in use. I wrote a free list algorithm that does not free memory upon deallocation, but instead stores the block in a list from which a subsequent allocation of a matrix of the same size can reuse it. Furthermore, as I wanted to avoid any synchronization objects, the free list memory pool is implemented per thread, so that the pool is managed within thread local storage and thus follows the life cycle of the thread. This design should also protect the implementation from certain cache issues that are further elaborated in a Dr. Dobb’s article.
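The free list idea can be sketched roughly as below. This is a simplified illustration and not the allocator found in the repository: the class name `FreeListPool`, its `acquire`/`release` interface and the use of `std::malloc` are my own assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_map>
#include <vector>

// Hypothetical per-thread pool that recycles blocks by size; not the Neuropia allocator.
class FreeListPool {
public:
    ~FreeListPool() {
        // Pooled blocks are returned to the system only when the owning thread ends.
        for (auto& entry : m_free)
            for (void* block : entry.second)
                std::free(block);
    }

    void* acquire(std::size_t bytes) {
        assert(bytes > 0);
        auto& list = m_free[bytes];
        if (!list.empty()) {        // reuse a block of the same size
            void* block = list.back();
            list.pop_back();
            return block;
        }
        return std::malloc(bytes);  // first request of this size
    }

    void release(void* block, std::size_t bytes) {
        if (block != nullptr)
            m_free[bytes].push_back(block);  // keep for later instead of freeing
    }

    // One pool per thread, so no locks are needed.
    static FreeListPool& local() {
        thread_local FreeListPool pool;
        return pool;
    }

private:
    std::unordered_map<std::size_t, std::vector<void*>> m_free;
};
```

A matrix of a given dimension would call `FreeListPool::local().acquire(...)` in its constructor and `release(...)` in its destructor. Because training keeps creating temporaries of the same few sizes, most requests are then served from the list without touching the system allocator.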

For both implementations, I needed two more hyperparameters: the number of concurrent threads and the batch size. In the Evolutional approach I also had a verification batch size. After an extensive amount of trial and error, I was able to find hyperparameters where Averaging was about 2% more accurate than the single-threaded version, as shown in Table 3. The improvement appeared with very small batches of only eight items and just four parallel threads. Nevertheless, I would not rule out that there are more optimal parameters. The same may apply to the Evolutional implementation, which was about 3% less accurate at recognizing MNIST digits. The Matrix optimization above also had some positive impact on the single-thread case.

Training | Accuracy % |
---|---|
Single thread | 91.6 |
Evolutional | 89.6 |
Averaging | 93.3 |

# Improving Network

The previous chapter was mostly about how to improve the speed of the network by adjusting the topology and doing calculations in parallel. However, usually the more important issue is improving the predictive accuracy of the network, i.e. how accurately the network is expected to work.

Since an accuracy of about 93% seemed to be quite easy to achieve, I looked for further ways to improve it. At first my target was set to 95%, but after promising improvements I raised it to 97%. Compared to 93%, that means roughly 57% fewer incorrect results (a 3% error rate instead of 7%), and it can generally be considered a pretty good result for feed-forward networks.

## Ensemble

An ensemble is like crowd knowledge: several results are collected and combined to form an improved output. The Neuropia network layers can be stored to and restored from a file, so I was able to train multiple networks and reload them to form my ensemble results. There are several ways to derive an ensemble output, but the most straightforward are hard and soft voting.

The results of the network output layer are given as a vector of probabilities. For example, with MNIST that is ten values between 0 and 1.0. The index of the highest value is considered to be the network output. In hard voting, the index chosen by the majority of the networks is the ensemble output. In soft voting, the probabilities are summed up per index, and the index with the highest overall value is the ensemble output.
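The two voting schemes can be sketched as follows. The function names and the `Probabilities` alias are hypothetical; the sketch only assumes that every network returns one probability per class.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

using Probabilities = std::vector<double>;  // one network's output layer

// Index of the highest value, i.e. the class a single network predicts.
inline std::size_t argmax(const Probabilities& values) {
    assert(!values.empty());
    return static_cast<std::size_t>(
        std::distance(values.begin(), std::max_element(values.begin(), values.end())));
}

// Hard voting: every network casts one vote for its predicted class.
inline std::size_t hardVote(const std::vector<Probabilities>& outputs) {
    assert(!outputs.empty());
    std::vector<double> votes(outputs.front().size(), 0.0);
    for (const auto& out : outputs)
        votes[argmax(out)] += 1.0;
    return argmax(votes);
}

// Soft voting: probabilities are summed per class before picking the winner.
inline std::size_t softVote(const std::vector<Probabilities>& outputs) {
    assert(!outputs.empty());
    std::vector<double> sums(outputs.front().size(), 0.0);
    for (const auto& out : outputs) {
        assert(out.size() == sums.size());
        for (std::size_t i = 0; i < sums.size(); ++i)
            sums[i] += out[i];
    }
    return argmax(sums);
}
```

The two schemes can disagree: if two networks weakly prefer class 0 and a third is very confident about class 1, hard voting picks class 0 while soft voting may pick class 1. Ties go to the lowest index in this sketch, which is an arbitrary choice.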

Table 4 lists the eight trained networks that form the ensemble. The Accuracy field represents the performance of each network after training.

Training | Accuracy |
---|---|
16 | 91.3% |
32 | 92.2% |
16,16 | 90.8% |
16,16,16 | 86.2% |
32,16 | 92.6% |
128 | 90.6% |
256,128 | 87.0% |
128,32,16 | 84.1% |

After reloading the networks and forming an ensemble we get unified accuracies. Hard voting gives us 91.8% accuracy and soft voting 92.7%. Hard voting was worse than the best network in the ensemble and soft voting was better, but only within the error margin.

The ensemble obviously works, but I suppose the trained networks are too homogeneous, meaning that they agree too well with each other on the result, whether the conclusion is correct or incorrect. The ensemble might shine if the party contained other learners, such as support vector machines or decision trees, and not just feed-forward neural networks with mostly identical hyperparameters, which is all Neuropia implements.

## Linear Unit Functions

The sigmoid function, introduced earlier, is the original activation function for the backpropagation algorithm. The Neuropia API lets you introduce your own activation functions, and I implemented two commonly used alternatives that have been developed since the early days of neural network research: Leaky ReLu and ELu. I first tested plain ReLu, which equals Leaky ReLu with the leaky parameter set to 0, but that too often made my network “die”. The network is dead when further training no longer changes its output values, because the gradients have become zero and can no longer be changed with the activation function in use. The issue is often called the dying ReLu problem and is related to the vanishing gradient problem. I set the leaky parameter to 0.05, as that seems to keep the network alive quite well when testing with MNIST data. Table 5 compares the different activation functions. ELu and ReLu seem to converge faster, but at some point they may explode, with weight values growing beyond the range of 64-bit floating point numbers. Note that a single hidden layer with Leaky ReLu gave me the best result so far.

Activation Function | 30000 iterations, [32,16] topology | 50000 iterations, [32,16] topology | 100000 iterations, [32,16] topology | 200000 iterations, [32,16] topology | 100000 iterations, [500] topology |
---|---|---|---|---|---|
Sigmoid | 87.5% | 89.6% | 92.6% | 92.3% | 94.3% |
ELu | 91.2% | 92.1% | 92.3% | 93.7% | N/A |
Leaky ReLu | 91.3% | 93.0% | 92.4% | N/A | 95.8% |

Both the ReLu and ELu functions improve the network compared to the sigmoid function, but they both make training more unstable, meaning that it is easy to get the network to die or explode during training. Exploding means that values increase beyond the numerical range of the number types used. There are certain ways to mitigate the issue, such as dropout and L2 regularization, which are discussed later.
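For reference, the two alternative activation functions and their derivatives can be written as below. This is a sketch using the textbook definitions in terms of the weighted sum `x`; the function names and the default `alpha` of 1.0 are my assumptions, and the Neuropia versions may be expressed in terms of the neuron output instead, as was done for sigmoid.

```cpp
#include <cassert>
#include <cmath>

// Leaky ReLu: identity for positive sums, a small slope for negative ones.
// A leak of 0 gives plain ReLu.
inline double leakyRelu(double x, double leak = 0.05) {
    assert(leak >= 0.0);
    return x > 0.0 ? x : leak * x;
}

// Derivative of Leaky ReLu; it never drops to zero while leak > 0.
inline double leakyReluDerivative(double x, double leak = 0.05) {
    return x > 0.0 ? 1.0 : leak;
}

// ELu: identity for positive sums, a smooth curve approaching -alpha otherwise.
inline double elu(double x, double alpha = 1.0) {
    return x > 0.0 ? x : alpha * (std::exp(x) - 1.0);
}

// Derivative of ELu.
inline double eluDerivative(double x, double alpha = 1.0) {
    return x > 0.0 ? 1.0 : alpha * std::exp(x);
}
```

With a leak of 0 the derivative is exactly zero for every negative sum, so a neuron that ends up there receives no further updates, which is the "dead" state described above. Neither function is bounded for positive sums, which is one reason why values can grow without limit.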

One thing to note regarding the activation function is the network initialization. Earlier, in Logistic Gate, I was using the Layer::randomize function, which sets the weight and bias values to be uniformly distributed between -1 and 1. However, considering how the network values are supposed to develop during training, that is not optimal. The Neuropia `Layer::initalize` function provides better tuned initial weights and biases to help the network converge faster and, for non-sigmoid activation functions, to make it less prone to dying or exploding to infinity.

## Dropout

Dropout is an interesting concept for improving network training. Dropout turns off random Neurons during training, which makes it harder for the network to overfit. Dropout can also be considered an enormous ensemble, as each epoch constructs a different topology. Since fewer neurons are trained during a single epoch, more training iterations are needed when dropout is applied. The DropoutRate hyperparameter defines the share of neurons in each layer that are shut down, except for the output layer, which is left intact.
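A dropout step could look roughly like this. It is a sketch under my own assumptions rather than the Neuropia implementation: `dropoutMask` and `applyDropout` are hypothetical helpers, the rate is interpreted as the probability that a single neuron is switched off, and the rescaling of the remaining outputs used by many libraries is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: draws a keep/drop flag per neuron for one training iteration.
// dropoutRate is the probability that a neuron is switched off.
inline std::vector<bool> dropoutMask(std::size_t neurons, double dropoutRate,
                                     std::mt19937& rng) {
    assert(dropoutRate >= 0.0 && dropoutRate <= 1.0);
    std::bernoulli_distribution dropped(dropoutRate);
    std::vector<bool> active(neurons);
    for (std::size_t i = 0; i < neurons; ++i)
        active[i] = !dropped(rng);
    return active;
}

// Applies the mask to a layer output: switched-off neurons emit zero.
inline void applyDropout(std::vector<double>& layerOutput,
                         const std::vector<bool>& active) {
    assert(layerOutput.size() == active.size());
    for (std::size_t i = 0; i < layerOutput.size(); ++i)
        if (!active[i])
            layerOutput[i] = 0.0;
}
```

A new mask would be drawn for each hidden layer on every training iteration, and no mask is used when the trained network is verified.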

However, after numerous testing rounds, no network with dropout applied was able to provide improved accuracy. By increasing the number of epochs, the dropout network will eventually reach as good an accuracy as networks without dropout. My explanation is that the topologies used for MNIST are relatively simple, there is no overfitting, and therefore no benefit from using dropout. For example, a simple topology of a 500-neuron hidden layer and 150000 iterations provides 96.8% accuracy without dropout and 93.3% with 50% dropout. Dropout is also assumed to speed up the training, as there are not as many neurons to calculate, but with Neuropia the difference is only 5%. That does not in any way compensate for the number of required iterations, as without dropout the same network reaches 94.8% with just 50000 iterations.

During these tests, the ReLu activation function seemed to climb over the 96% accuracy barrier quite easily, which encouraged me to make one last attempt to pass 97%.

## L2 regularization

L2 regularization adds a damping hyperparameter to the network. It restrains extreme weight values and therefore helps with overfitting. What has been more important for improving MNIST recognition accuracy is that L2 regularization decreases the chance of the network exploding.

There are alternative ways to apply the hyperparameter. In Neuropia the implementation changes the gradient values as follows:

```
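if(lambdaL2 > 0.0) {
    // sum of the squared gradient values divided by the number of rows
    const auto L2 = gradients.reduce<double>(0, [](auto a, auto r){return a + (r * r);}) /
        static_cast<double>(gradients.rows());
    // penalty scaled by the L2 hyperparameter
    const auto l = lambdaL2 * L2;
    // subtract the same penalty from every gradient value
    gradients.mapThis([l](auto v){return v - l;});
}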
```
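For comparison, one of the alternatives is the conventional form of L2 regularization, often called weight decay. The sketch below is not what Neuropia does; `applyWeightDecay` is a hypothetical helper that shows the textbook variant, where each weight is shrunk in proportion to its own value.

```cpp
#include <cassert>
#include <vector>

// Conventional L2 weight decay (not the Neuropia variant):
// each weight is pulled towards zero in proportion to its own size.
inline void applyWeightDecay(std::vector<double>& weights, double learningRate,
                             double lambdaL2) {
    assert(learningRate >= 0.0 && lambdaL2 >= 0.0);
    for (auto& w : weights)
        w -= learningRate * lambdaL2 * w;
}
```

In this variant a large weight is penalized more than a small one, whereas the Neuropia snippet subtracts the same penalty, derived from the gradient values, from every gradient.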

To test L2 regularization I decided to use a slightly more complex network.

- 150000 epochs
- 500,100 topology
- InitStrategy auto
- LearningRateMin 0.00001
- LearningRateMax 0.02

Activation Function | L2 | Accuracy |
---|---|---|
ReLu | 0 | 97.82% |
ReLu | 0.001 | 97.85% |
ReLu, Sigmoid | 0 | 97.82% |
ReLu, Sigmoid | 0.001 | 97.85% |

As seen in Table 6, there is no big difference between the results. Actually, all of them are well inside any error margin. L2 regularization does not seem to have a significant impact, and neither does the extra case where the second hidden layer uses the sigmoid activation function. The training time was a fairly constant 36 minutes on a 2014 MacBook.

However, the results were so close to 98% accuracy that I decided to give it one more shot. And yes! With 200000 epochs, which took a hefty 51 minutes to run, I was able to reach 98.1% accuracy with Neuropia!

What next, towards 99%? I will not go any further here. However, I assume it may be doable with parallel training combined with an ensemble - and of course by researching further hyperparameter alternatives.

# Summary

This Neuropia implementation is not supposed to replace your favorite neural network library. The sole purpose of this article is to offer a different view from the programmer’s perspective, as an alternative to most of the tutorials and introductions, which are written by data scientists and mathematicians. Here I tried to air the topic and focus on issues that matter from a programmer’s perspective - and naturally, just have some fun.

As a result, I was able to implement a feed-forward neural network that is able to analyze MNIST data with 98% accuracy. Presumably, 99% could be possible just by running more epochs and figuring out a set of hyperparameters that does not make the network explode, but in all likelihood that would require the improved learning rate and cost function implementations suggested earlier.

Maybe the lesson learned here is that the universe of hyperparameters is vast and weird, and it is frustratingly difficult to find optimized values for the network. It looks like the secret of neural networks is not in their implementation, which seems to be relatively simple, but in the (black) art of getting all the possible properties and parameters right. The MNIST problem is very simple, and even with that, there seem to be endless possibilities for adjusting the values.

Neuropia is implemented using standard C++14 and has been tested on Clang, GCC, and MSVC17. All the code discussed here (Neuropia, IdxReader, the Matrix class and the training code) is available in my repository on GitHub under the permissive MIT license. There is also a lot of code not explored here, such as parameter handling and the details of applying training. Happy cloning and exploring!