Deep Learning for Symbolic Mathematics
This blog post was co-authored with Reemet Ammer, Alfred Saidlo, and Jatan Shrestha as a part of the Neural Network’s course project at the University of Tartu. This article attempts to illustrate the idea behind Deep Learning for Symbolic Mathematics and strives to reproduce the results outlined in the paper by Guillaume Lample and François Charton.
Motivation
AlexNet singlehandedly kickstarted the Deep Learning Revolution in 2012 by successfully demonstrating the power of deep neural networks to win the ImageNet image classification challenge by leaving the competitors in the dust. The advancement of Graphical Processing Units (GPUs), large labeled datasets in an era of Big Data, and the rapid improvement of deeper neural architectures led to crucial breakthroughs in a wide range of problems.
Applications of Deep Learning
Deep learning methods are the current state of the art in many applications pertaining to Computer Vision, Speech Recognition, and Natural Language Processing. Deep Learning has exhibited stellar effectiveness in pattern recognition, natural language processing, and machine translation- a symbol manipulation task but has failed to demonstrate any success in Symbolic computations. Deep Neural Networks capable of classifying thousands of images with a frightening degree of accuracy are inferior in menial mathematical tasks like integer multiplication.
But what if we could use the current state-of-the-art architectures used for machine translation to solve symbolic computation. The current architectures for machine translation work with sentences considered as sequences of tokens and need no domain-specific knowledge such as grammar and dictionaries. Moreover, they can perform complex tasks without any supervision. So, a central theme behind this paper is to perceive mathematics as a language, generate large datasets of problems and their solutions, and solve a problem by directly translating it into its solution.
A good representation is all you need
Machine Translation for symbolic mathematics follows the same workflow as any other NLP / Machine translation tasks:
- Represent problems and solutions as sequences to use seq2seq models
- Generate large datasets of problems and solutions
- Train models to translate problems into their solution
So, how do we represent problems and solutions as sequences? The idea is to convert the expressions to trees and then from trees to sequences.
Expressions to trees
Trees to sequences
Now the expressions can be processed by seq2seq models.
Sneak peek into the Data generation process
The paper mainly considers three tasks in Symbolic Mathematics:
- Symbolic Integration
- First-order differential equations
- Second-order differential equations
Now the question remains, how do we generate the datasets? Creating large datasets of problems and their solutions boil down to a Supervised Learning problem. The process of generating a random problem and its solution involves the formation of a random unary-binary tree. The operators are selected at random as their internal nodes, and sample constants or variables are left as their leaves.
There are three different ways to generate data, i.e., a problem and their solution for the symbolic integration.
The first one is the Forward (FWD) approach
- This approach generates a random function ff, computes its antiderivative F using an external symbolic framework ( Sympy, Mathematica, etc. ), and Add (f, F)(f,F) to the training set
- However, this approach is slow and limited to functions the framework can integrate.
The alternative is the Backward (BWD) approach
- This approach generates a function FF, compute its derivative, and Add (f, F)(f,F) to the training set
- BWD leads to long problems with short solutions whereas FWD leads to short problems with longer solutions.
The middle ground is the Integration by parts (IBP) approach
- This approach generates random functions FF and GG, computes their derivatives ff and gg and if f \,* Gf∗G is in the training set, compute the integral of F \, * gF∗g with :
Ordinary Differential Equations (ODE) have a slightly bit more complicated approach, thus they’re omitted from this blog post. Five different datasets were generated using FWD, BWD, IBP, ODE1, and ODE2 approaches. Expressions with up to
- n = 15n=15 internal nodes,
- L = 11L=11 leaf values,
- p_2 = 4p2=4 binary operators (+, -, x, /+,−,x,/) , and
- p_1 = 15p1=15 unary operators (exp, log, sqrt, sin, cos, tan, …., tanh^{-1}exp,log,sqrt,sin,cos,tan,….,tanh−1) were considered.
Attention is all you need
Transformer Model was leveraged to tackle the seq2seq task. The architecture is an example of a non-recurrent neural network with an attention mechanism. It comprises of 6 layers, 8 attention heads, 512 dimensions, and was trained over cross-entropy loss with Adam optimizer.
Transformer Architecture
The Seq2Seq model takes an input sequence of arbitrary length and transforms it into some output sequence of arbitrary length.
Input sequence in German transformed to output sequence in English
Moreover, the beam search is used at inference to look for a token by token generation of the most likely answer. Using Beam search at inference allows the generation of several solutions.
Results
The evaluation process involves
- Testing the trained models on 5000 held-out examples from the same dataset
- No overlap, no test problem has been seen during training
- The solutions are checked with an external framework- SymPy
- Integration: derive the solution and subtract from the problem
- ODE: feed the solution into the equation and check the result
- Uses beam search and if a solution is incorrect, simply try the next one up to 10 or 50 solutions.