While Linear Algebra is a powerful tool for understanding the world around us, vectors and matrices alone cannot conveniently describe data with more than two dimensions. To overcome this limitation, we need to learn about tensors.
Tensors
For the purposes of Machine Learning, we can think of tensors as multidimensional arrays. They are a generalization of scalars, vectors, and matrices: a scalar is a 0-dimensional tensor, a vector is a 1-dimensional tensor, and a matrix is a 2-dimensional tensor. For instance, an image with 3 color channels can be represented as a 3-dimensional tensor, and thus a dataset of such images can be represented as a 4-dimensional tensor.
Tensors support many of the same operations as vectors and matrices, such as addition, subtraction, and multiplication, and they can also be used to represent more complex mathematical objects - we won't delve into such concepts here. However, since tensors can have more than 2 dimensions, transposes behave a little differently - we need to specify which dimensions we want to swap.
Fortunately, PyTorch can and will handle all the needed tensor operations for us. We can create tensors using the torch.tensor function.
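For example, here is a minimal sketch of creating tensors of different dimensions (the shapes, such as a hypothetical batch of 32 RGB images of size 64x64, are chosen purely for illustration):

import torch

scalar = torch.tensor(3.14)                # 0-dimensional tensor
vector = torch.tensor([1.0, 2.0, 3.0])     # 1-dimensional tensor
matrix = torch.tensor([[1.0, 2.0],
                       [3.0, 4.0]])        # 2-dimensional tensor
images = torch.rand(32, 3, 64, 64)         # hypothetical batch of 32 RGB 64x64 images

print(scalar.size())   # torch.Size([])
print(vector.size())   # torch.Size([3])
print(matrix.size())   # torch.Size([2, 2])
print(images.size())   # torch.Size([32, 3, 64, 64])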
Note that in the code below, the transpose method is used to swap two dimensions of a, producing the tensor b. For tensors with 3 or more dimensions, the transpose method swaps exactly 2 dimensions, while the permute method can be used to reorder all of the dimensions at once.
a = torch.tensor([[[1, 2, 3], [4, 5, 6]]])# Dimensions of aprint("Raw tensor:")print(a)print(a.size())# Swap dimensions 0 and 1b = a.transpose(0, 1)print("\nTranspose 0 and 1:")print(b)print(b.size())# Permute the tensor dimensions to (2, 0, 1)c = a.permute(2, 0, 1)print("\nPermute to (2, 0, 1):")print(c)print(c.size())
If we want to insert a dimension of size 1 into a tensor, we can use the unsqueeze method, and if we want to remove a dimension of size 1, we can use the squeeze method. We can use expand in conjunction with unsqueeze to copy a tensor along a new dimension. Finally, we can use flatten to flatten a tensor into a 1D tensor (a vector) and unflatten to reshape a vector back into a tensor of the desired shape.
a = torch.tensor([[[1, 2, 3], [4, 5, 6]]])# Dimensions of aprint("Raw tensor:")print(a)print(a.size())# Insert a dimension of size 1 at position 0b = a.unsqueeze(0)print("\nUnsqueeze at position 0:")print(b)print(b.size())# Remove the dimension of size 1 at position 0c = a.squeeze(0)print("\nSqueeze at position 0:")print(c)print(c.size())# Expand a tensor along a new dimensiond = a.unsqueeze(0).expand(2, -1, -1, -1) # -1 means the size is unchangedprint("\nExpand along a new dimension:")print(d)print(d.size())# Flatten a tensor into a 1D tensore = a.flatten()print("\nFlatten a tensor into a 1D tensor:")print(e)print(e.size())# Unflatten a vector into a tensor of desired shapef = e.unflatten(0, (1, 2, 3))print("\nUnflatten a vector into a tensor of desired shape:")print(f)print(f.size())
Raw tensor:
tensor([[[1, 2, 3],
         [4, 5, 6]]])
torch.Size([1, 2, 3])

Unsqueeze at position 0:
tensor([[[[1, 2, 3],
          [4, 5, 6]]]])
torch.Size([1, 1, 2, 3])

Squeeze at position 0:
tensor([[1, 2, 3],
        [4, 5, 6]])
torch.Size([2, 3])

Expand along a new dimension:
tensor([[[[1, 2, 3],
          [4, 5, 6]]],
        [[[1, 2, 3],
          [4, 5, 6]]]])
torch.Size([2, 1, 2, 3])

Flatten a tensor into a 1D tensor:
tensor([1, 2, 3, 4, 5, 6])
torch.Size([6])

Unflatten a vector into a tensor of desired shape:
tensor([[[1, 2, 3],
         [4, 5, 6]]])
torch.Size([1, 2, 3])
Multivariable Calculus
As we just saw, any tensor can be flattened into a vector, so we can focus on vector calculus for now (PyTorch will handle all the implementation details for us anyway). Let's start with the basics of multivariable derivatives.
First, let’s define a function of two variables:
\(f(x, y) = x^2 + xy + y^2\)
We can take the derivative of this function with respect to \(x\) and \(y\) respectively, by treating the other variable as a constant:
\(\frac{\partial f}{\partial x} = 2x + y\)
\(\frac{\partial f}{\partial y} = x + 2y\)
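We can sanity-check these formulas with PyTorch's automatic differentiation. The sketch below evaluates the partial derivatives at the arbitrarily chosen point \((x, y) = (1, 2)\):

import torch

# f(x, y) = x^2 + x*y + y^2, evaluated at the sample point (x, y) = (1, 2)
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
f = x**2 + x * y + y**2

f.backward()    # compute the partial derivatives of f
print(x.grad)   # tensor(4.) = 2x + y
print(y.grad)   # tensor(5.) = x + 2y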
We can also take the gradient of this function, which is the vector of its partial derivatives. Note, however, that we write it as a row-vector as opposed to the usual column-vector.
\(\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x} & \frac{\partial f}{\partial y} \end{bmatrix} = \begin{bmatrix} 2x + y & x + 2y \end{bmatrix}\)
We can also package \(x\) and \(y\) into a column-vector, \(\textbf{x} = (x, y)\). Thus, \(f(\textbf{x}) = \textbf{x}_1^2 + \textbf{x}_1\textbf{x}_2 + \textbf{x}_2^2\). Then computing the gradient of \(f\) with respect to \(\textbf{x}\) amounts to computing the partial derivatives of \(f\) with respect to each component of \(\textbf{x}\). The result is still a row-vector.
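The same computation works when the inputs are packaged into a single vector. Here is a minimal sketch, again at the arbitrary point \((1, 2)\):

import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)   # x = (x_1, x_2)
f = x[0]**2 + x[0] * x[1] + x[1]**2

f.backward()
print(x.grad)   # tensor([4., 5.]) = (2*x_1 + x_2, x_1 + 2*x_2)

Note that PyTorch returns the gradient with the same shape as \(\textbf{x}\), so the row-vector versus column-vector distinction is a bookkeeping convention rather than something autograd enforces.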
Next, consider a vector-valued function of a single variable, \(\textbf{f}(x)\). Its derivative is obtained by taking the derivative of each component of \(\textbf{f}\) with respect to \(x\), and the result retains the shape of \(\textbf{f}\).
Now consider a vector-valued function of a vector input, \(\textbf{f}(\textbf{x})\). Its derivative is a matrix called the Jacobian matrix: the matrix of partial derivatives of each component of the output with respect to each component of the input. If the input vector has \(n\) components and the output vector has \(m\) components, the Jacobian matrix has \(m\) rows and \(n\) columns:

\(\frac{\partial \textbf{f}}{\partial \textbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}\)
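PyTorch can compute full Jacobians via torch.autograd.functional.jacobian. Below is a minimal sketch with a made-up function that has 3 inputs and 2 outputs:

import torch
from torch.autograd.functional import jacobian

# A made-up function with a 3-component input and a 2-component output
def f(x):
    return torch.stack([x[0] * x[1], x[1] + x[2]**2])

x = torch.tensor([1.0, 2.0, 3.0])
J = jacobian(f, x)
print(J)          # tensor([[2., 1., 0.],
                  #         [0., 1., 6.]])
print(J.shape)    # torch.Size([2, 3]) -> m = 2 rows, n = 3 columns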
The Jacobian matrix is the key concept of multivariable calculus. With the Jacobian in mind, we can easily extend the chain rule to multivariable functions. Consider the composition of functions \(\textbf{f}(\textbf{g}(\textbf{x}))\). The chain rule says that the Jacobian of the composition is the product of the individual Jacobians:

\(\frac{\partial \textbf{f}}{\partial \textbf{x}} = \frac{\partial \textbf{f}}{\partial \textbf{g}} \frac{\partial \textbf{g}}{\partial \textbf{x}}\)
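We can verify this numerically with autograd. The sketch below uses two made-up functions, \(\textbf{g}\) with 2 inputs and 3 outputs and \(\textbf{f}\) with 3 inputs and 2 outputs:

import torch
from torch.autograd.functional import jacobian

def g(x):   # 2 inputs -> 3 outputs
    return torch.stack([x[0]**2, x[0] * x[1], x[1]**2])

def f(y):   # 3 inputs -> 2 outputs
    return torch.stack([y[0] + y[1], y[1] * y[2]])

x = torch.tensor([1.0, 2.0])

J_composition = jacobian(lambda t: f(g(t)), x)   # Jacobian of f(g(x)), shape (2, 2)
J_chain = jacobian(f, g(x)) @ jacobian(g, x)     # product of the individual Jacobians

print(torch.allclose(J_composition, J_chain))    # True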
We can also have a function that depends on multiple vectors and outputs a single vector; its partial derivative with respect to each input is a Jacobian matrix. Finally, we need to consider dependency chains. Consider the following relationship between vector functions: \(\textbf{u}(\textbf{x}, \textbf{y})\), \(\textbf{v}(\textbf{x}, \textbf{y})\), and \(\textbf{w}(\textbf{u}, \textbf{v})\). Then the partial derivatives are:

\(\frac{\partial \textbf{w}}{\partial \textbf{x}} = \frac{\partial \textbf{w}}{\partial \textbf{u}} \frac{\partial \textbf{u}}{\partial \textbf{x}} + \frac{\partial \textbf{w}}{\partial \textbf{v}} \frac{\partial \textbf{v}}{\partial \textbf{x}}, \qquad \frac{\partial \textbf{w}}{\partial \textbf{y}} = \frac{\partial \textbf{w}}{\partial \textbf{u}} \frac{\partial \textbf{u}}{\partial \textbf{y}} + \frac{\partial \textbf{w}}{\partial \textbf{v}} \frac{\partial \textbf{v}}{\partial \textbf{y}}\)
As \(\textbf{w}\) depends on \(\textbf{x}\) through \(\textbf{u}\) and \(\textbf{v}\), we need to add each contribution to the derivative. The same is true for the derivative with respect to \(\textbf{y}\).
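This bookkeeping can also be checked with autograd. The sketch below uses small made-up functions u, v, and w and compares the chain-rule sum against the Jacobian of the full composition:

import torch
from torch.autograd.functional import jacobian

def u(x, y): return torch.stack([x[0] * y[0], x[1] + y[1]])
def v(x, y): return torch.stack([x[0] + x[1], y[0] * y[1]])
def w(u_, v_): return torch.stack([u_[0] * v_[0], u_[1] + v_[1]])

x = torch.tensor([1.0, 2.0])
y = torch.tensor([3.0, 4.0])
u0, v0 = u(x, y), v(x, y)

# dw/dx assembled from the chain rule: dw/du * du/dx + dw/dv * dv/dx
chained = (jacobian(lambda u_: w(u_, v0), u0) @ jacobian(lambda x_: u(x_, y), x)
           + jacobian(lambda v_: w(u0, v_), v0) @ jacobian(lambda x_: v(x_, y), x))

# dw/dx computed directly on the full composition
direct = jacobian(lambda x_: w(u(x_, y), v(x_, y)), x)

print(torch.allclose(direct, chained))   # True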
As we can see, with the right perspective multivariable derivatives are not that different from the single-variable case. This is just scratching the surface of multivariable calculus, but it is all we'll need.