Artificial neural nets (ANNs) are classically composed of algorithms that ‘learn’ to perform specific functions without being explicitly programmed for them. ANNs are, in short, function approximators. The rub is that, because neural nets are built in a layered fashion, scaling up a net has always meant adding on more and more layers, which makes swift scale-up of any sizable magnitude intrinsically difficult. Interestingly, David K. Duvenaud has crafted a theoretical framework for a neural net without any layers, which opens up a number of fascinating potential applications, first and foremost increased scalability. Before we come to that, however, a refresher on the standard models will prove useful to those unfamiliar with the topic (if you are already intimately familiar with ANNs, skip to part 4).
1) Basic Anatomy Of ANNs
“[An artificial neural network is] a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs.” — Maureen Caudill, “Neural Network Primer: Part I,” AI Expert, Feb. 1989.
All neural nets function in the same way: an input vector is modified by a series of weights and thresholds to produce an output, a process analogous to a biological neuron (dendrites, soma and axon). Of course, one neuron alone is not a ‘net,’ and so every neural net must, to be a net, have at least two neurons. Every net is a function of its input vector(s), weight vector(s) and threshold vector(s); to put it another way, z = F(x, w, t), where z is the output of the function F, x is the input, w is the weight and t is the threshold.
z, however, is merely a possible output (a sum of some combination of inputs), not necessarily the desired output. To specify the desired output, a new function can be applied, d = g(x), whose distance from the actual output is then checked through a performance function p = ||d − z||². For our purposes in understanding the base architecture, however, we needn’t delve further into the math.
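The notation above can be made concrete with a small sketch. The names `neuron_output` and `performance` are illustrative choices, not terms from the source material; they simply compute z = F(x, w, t) as a thresholded weighted sum and p = ||d − z||².

```python
import numpy as np

def neuron_output(x, w, t):
    """z = F(x, w, t): the weighted sum of inputs, shifted by the threshold."""
    return np.dot(x, w) - t

def performance(d, z):
    """Squared-error performance p = ||d - z||^2."""
    return np.sum((d - z) ** 2)

x = np.array([0.5, 1.0, 0.2])   # inputs
w = np.array([0.8, -0.4, 0.3])  # weights
t = 0.1                         # threshold

z = neuron_output(x, w, t)
p = performance(1.0, z)         # distance from a desired output d = 1.0
print(z, p)
```

Training a net amounts to adjusting w and t so that p shrinks toward zero.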
2) Perceptron Neurons
Crafted in the 1950s by the American psychologist Frank Rosenblatt, a perceptron combines several binary inputs to produce a single binary output. A simple example of perceptron functionality may be represented formally as x¹ + x² + x³ = y¹, wherein the output is either 1 or 0. Represented graphically, a perceptron input-output function would look like the image below.
In the written and graphic representation, three inputs are shown; however, a perceptron may have more or fewer than three inputs. To compute the 1 or 0 output, the perceptron compares a weighted sum of its inputs against a threshold value: if the sum exceeds the threshold, the output is 1; otherwise, 0. To put the threshold operation concretely, let us consider the following sentences:
- Do you want to radically extend your life?
- Do your friends want to radically extend their lives?
- Is there a way to radically extend human life?
Placing 1, 2 and 3 in correspondence to the binary input variables yields:
- x1 = 1 — if [you want to radically extend your life.]
- x1 = 0 — if [you do not want to radically extend your life.]
- x2 = 1 — if [your friends want to radically extend their lives.]
- x2 = 0 — if [your friends do not want to radically extend their lives.]
- x3 = 1 — if [there is a way to radically extend human life.]
- x3 = 0 — if [there is not a way to radically extend human life.]
*this process would continue in the same fashion regardless of the number of inputs
For ranking, ‘weighted’ variables are introduced, written simply as w. A ‘weight’ gives one variable priority over another, so if one writes:
- w1 = 5
- w2 = 2
- w3 = 1
w1 denotes how much one cares about radical life extension in relation to one’s own life; because it is larger than w2 and w3, radical life extension matters more to you than either your friends’ opinion of it or whether there is yet a way to achieve it. Likewise, w2 indicates that one cares more about whether one’s friends want to radically extend their lives than about whether a way to do so yet exists, though less than one cares about w1. Thus, the larger the number, the “heavier” the “weight,” that is to say, the higher the priority.
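The whole perceptron decision can be sketched in a few lines, using the weights from the text (w1 = 5, w2 = 2, w3 = 1). The threshold of 3 is an illustrative choice of my own, not a value from the original example.

```python
def perceptron(inputs, weights, threshold):
    """Fire (return 1) if the weighted sum of binary inputs exceeds the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0

weights = [5, 2, 1]

# You want radical life extension (x1 = 1), even though your friends do not
# (x2 = 0) and no method yet exists (x3 = 0): 5 > 3, so the neuron fires.
print(perceptron([1, 0, 0], weights, threshold=3))  # 1

# Only your friends want it (x2 = 1): 2 is not greater than 3, so it stays silent.
print(perceptron([0, 1, 0], weights, threshold=3))  # 0
```

Changing the threshold (or the weights) changes which combinations of inputs make the neuron fire, which is exactly the knob-turning that training performs.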
3) Sigmoid Neurons
Though perceptrons are extremely useful, they are quite rigid: it is difficult to change variables within a perceptron network without causing large changes in the output. For example, if a perceptron network were misidentifying a 5 as a 4, one might attempt to modify weights or biases so that the system correctly identifies the 5. The problem is that such changes affect the whole system in ways which (depending on the complexity of the total system) can be extremely difficult to control. This, in turn, makes learning difficult.
To fix this problem, sigmoid neurons are introduced.
Sigmoid neurons are akin to perceptrons: they have weights (w¹, w², w³, … etc.) and a bias (represented as b); however, their weights and bias are such that when changes are made to them, the resulting change in the output is slight (smaller than in perceptrons). It is this tiny difference that allows systems using sigmoid neurons to learn. Just like a perceptron, a sigmoid neuron has inputs x¹, x², x³, …; however, these are not binary. Rather than being only 0 or 1, they can assume any value between 0 and 1, such as 0.001, 0.633 and so forth. The output, likewise, is not 0 or 1 but σ(w ⋅ x + b), wherein σ (sigma) is the sigmoid function, sometimes alternatively written as the logistic function (in which case the neuron itself is referred to as a logistic neuron). This is to say: a sigmoid neuron’s output may be any real number between 0 and 1.
The best way to conceptualize sigmoid functions is as smoothed-out variants of step functions (which give only 0 or 1).
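A minimal sketch of the sigmoid neuron described above (the weight and bias values are arbitrary illustrations):

```python
import math

def sigmoid(z):
    """The sigmoid (logistic) function: a smoothed step, mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_neuron(x, w, b):
    """Output sigma(w . x + b) for inputs x, weights w and bias b."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return sigmoid(z)

print(sigmoid(0))  # 0.5 -- the midpoint of the smoothed step
print(sigmoid_neuron([1.0, 0.0, 1.0], [0.5, 0.2, -0.3], b=0.1))
```

Because small nudges to w or b move the output only slightly, a learning procedure can improve the network gradually rather than flipping outputs wholesale.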
In machine learning systems, perceptron and sigmoid neurons are layered together: the input neurons feed into some number of hidden layers (hidden simply means neither input nor output), and those hidden layers then feed into the output neuron(s). With this arrangement, the more layers there are, the more complex and sophisticated the potential of the system. The models described above, however, are mono-directional: they only feed information forward, from the input layer through the hidden layers to the output layer, and never back. Hence, they are called feedforward artificial neural nets (FANNs or FNNs, if you want a brisk annotation). It is important to remark that feedforward nets are not the only kind, as it is possible to create feedback loops within a system; models which utilize feedback loops are typically described as recurrent neural networks, and it is these models which (at least thus far in the history of machine learning) most closely mimic the human brain.
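The feedforward arrangement can be sketched as a loop over layers, each layer’s output becoming the next layer’s input. The layer sizes (3 inputs, 4 hidden neurons, 1 output) and random weights here are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, weights, biases):
    """Propagate an input vector forward through successive sigmoid layers."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)  # each layer's output feeds the next; never backward
    return a

# 3 inputs -> 4 hidden neurons -> 1 output neuron
weights = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]
biases = [rng.normal(size=4), rng.normal(size=1)]

y = feedforward(np.array([1.0, 0.0, 0.5]), weights, biases)
print(y)  # a single output strictly between 0 and 1
```

Note how adding capacity means appending another (W, b) pair to the lists, i.e. another layer; this is precisely the scaling mechanism that Duvenaud’s model, discussed next, dispenses with.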
4) David Duvenaud’s Layerless Neural ‘Net’
Now that we have satisfied ourselves as to the operation of the two standard neuron models, let us turn our attention to Duvenaud’s model, which differs markedly. What is most immediately remarkable about Duvenaud’s system is that it operates completely without layers. In a standard net, one could theoretically keep adding layers to increase the system’s granularity, but in practice this is untenable: optimal granularity would require an infinite number of layers, which obviously cannot be implemented.
To solve this problem, Duvenaud and his team simply replace the layers with calculus. In this way, technically speaking, the neural net is no longer a net, as there are no interconnected nodes, but rather one continuous whorl of calculation. Thus, in place of ANN, Duvenaud and his co-authors describe their model as an ODE solver, that is, an Ordinary Differential Equation solver. It doesn’t exactly roll off the tongue, but it concisely describes the system.
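To give a flavor of the idea: instead of a discrete stack of layers, the hidden state h is treated as a continuous trajectory obeying dh/dt = f(h, t), and the forward pass becomes numerical integration. The sketch below uses a toy dynamics function and a naive fixed-step Euler integrator; the actual model in Duvenaud et al. (2018) learns f as a neural network and uses sophisticated adaptive solvers.

```python
import numpy as np

def f(h, t, W):
    """Toy dynamics; in a real neural ODE this is a small learned neural net."""
    return np.tanh(W @ h)

def odenet_forward(h0, W, t0=0.0, t1=1.0, steps=100):
    """Integrate dh/dt = f(h, t) from t0 to t1 with fixed-step Euler."""
    h, t = h0, t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        h = h + dt * f(h, t, W)  # one Euler step ~ one infinitesimally thin 'layer'
        t += dt
    return h

W = np.array([[0.0, -1.0], [1.0, 0.0]])  # arbitrary toy dynamics matrix
h1 = odenet_forward(np.array([1.0, 0.0]), W)
print(h1)
```

Raising `steps` refines the trajectory without changing the model itself, which is the sense in which the depth of the ‘net’ becomes continuous rather than fixed.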
At this point, one may be wondering: what is so special about a layerless ‘net’?
Consider a factory where everything is moved around by a host of different robots; then consider another factory wherein the floor is one continuous circuit of sliding panels. The first kind of factory is akin to the two standard neuron models, whereas the second is more akin to Duvenaud’s ODE model; neither is intrinsically better than the other, but each has unique applications. Where the ODE model shines is in training. In a standard ANN, the number of layers must be determined before training begins; because of this, one only finds out how accurate the model is after training is complete. The ODE model flips this: it allows the designer to specify the accuracy first and then lets the training fit that accuracy, rather than the other way round, and it allows the incorporation of information regardless of when it is introduced into the system. The downside is that with a standard ANN the time needed for training is known in advance, whereas with ODEs the training time is unknown.
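The accuracy-first trade-off can be illustrated with any adaptive ODE solver: you hand it a tolerance up front, and it takes however many steps it needs. The step-doubling Euler scheme below is a deliberately simple stand-in for the production-grade adaptive solvers the paper relies on; the dynamics dh/dt = −h is a toy example.

```python
import numpy as np

def adaptive_integrate(f, h0, t0, t1, tol=1e-4):
    """Integrate dh/dt = f(h) to a requested tolerance; cost is decided by the solver."""
    h, t, dt, steps = h0, t0, (t1 - t0) / 10, 0
    while t1 - t > 1e-12:
        dt = min(dt, t1 - t)
        full = h + dt * f(h)                 # one big Euler step
        half = h + dt / 2 * f(h)
        half = half + dt / 2 * f(half)       # two half-size steps
        err = np.max(np.abs(full - half))    # disagreement estimates the local error
        if err < tol:
            h, t = half, t + dt              # accept the step...
            dt *= 1.5                        # ...and try a larger one next
        else:
            dt /= 2                          # reject: refine and retry
        steps += 1
    return h, steps

f = lambda h: -h  # toy dynamics with known solution h(t) = e^(-t)

h_loose, steps_loose = adaptive_integrate(f, np.array([1.0]), 0.0, 1.0, tol=1e-2)
h_tight, steps_tight = adaptive_integrate(f, np.array([1.0]), 0.0, 1.0, tol=1e-6)
print(steps_loose, steps_tight)  # tighter tolerance -> more steps
```

Note that the step counts are only known after the run finishes: accuracy is specified first, and compute time follows from it, which is exactly the inversion (and the unknown-training-time downside) described above.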
Duvenaud’s paper (cited in full below) provides the conceptual structure for just such a system; however, he cautions that it is “not ready for prime time,” at least not yet.
- Ben Goertzel et al. (2007) Artificial General Intelligence. Artificial General Intelligence Research Institute.
- David Duvenaud et al. (2018) Neural Ordinary Differential Equations. Vector Institute.
- Han Yu et al. (2018) Building Ethics Into Artificial Intelligence. Conference paper.
- Ian Goodfellow et al. (2016) Deep Learning. MIT Press.
- Karen Hao. (2018) A Radical New Neural Network Design Could Overcome Big Challenges In AI. MIT Technology Review.
- Prof. Patrick Winston. (2015) 12a: Neural Nets. MIT.
σ = lowercase sigma.
Thanks for reading.