Deep Learning for Symbolic Mathematics

Reemet Ammer
5 min read · Jun 2, 2021


This blog post was co-authored with Reemet Ammer, Alfred Saidlo, and Jatan Shrestha as part of the Neural Networks course project at the University of Tartu. This article illustrates the idea behind Deep Learning for Symbolic Mathematics and strives to reproduce the results outlined in the paper by Guillaume Lample and François Charton.

Motivation

AlexNet kickstarted the Deep Learning Revolution in 2012 by winning the ImageNet image classification challenge by a wide margin, demonstrating the power of deep neural networks. The advancement of Graphics Processing Units (GPUs), the availability of large labeled datasets in the era of Big Data, and rapid improvements in deeper neural architectures led to crucial breakthroughs in a wide range of problems.

Applications of Deep Learning

Deep learning methods are the current state of the art in many applications in Computer Vision, Speech Recognition, and Natural Language Processing. Deep Learning has been remarkably effective at pattern recognition and at machine translation, itself a symbol-manipulation task, yet it has shown little success in symbolic computation. Deep neural networks capable of classifying thousands of images with striking accuracy still struggle with seemingly simple mathematical tasks such as integer multiplication.

But what if we could apply the state-of-the-art architectures used for machine translation to symbolic computation? These architectures treat sentences as sequences of tokens and need no domain-specific knowledge such as grammars or dictionaries, and they learn complex transformations directly from examples. So a central theme behind this paper is to perceive mathematics as a language, generate large datasets of problems and their solutions, and solve a problem by directly translating it into its solution.

A good representation is all you need

Machine Translation for symbolic mathematics follows the same workflow as any other NLP / Machine translation tasks:

  • Represent problems and solutions as sequences to use seq2seq models
  • Generate large datasets of problems and solutions
  • Train models to translate problems into their solution

So, how do we represent problems and solutions as sequences? The idea is to convert the expressions to trees and then from trees to sequences.

Expressions to trees

Trees to sequences

Now the expressions can be processed by seq2seq models.
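
The tree-to-sequence step can be sketched in a few lines: serializing an expression tree in prefix (Polish) notation yields an unambiguous token sequence with no parentheses. The `Node` class and token names below are illustrative, not the paper's exact vocabulary.

```python
# Each expression is a tree: internal nodes are operators,
# leaves are constants or variables.
class Node:
    def __init__(self, token, children=()):
        self.token = token
        self.children = list(children)

def to_prefix(node):
    """Depth-first, parent-before-children traversal -> token list."""
    tokens = [node.token]
    for child in node.children:
        tokens.extend(to_prefix(child))
    return tokens

# The expression 2 + 3 * x as a tree, then as a sequence.
expr = Node("+", [Node("2"), Node("*", [Node("3"), Node("x")])])
print(to_prefix(expr))  # ['+', '2', '*', '3', 'x']
```

Because every operator has a fixed arity, the original tree can be recovered from the prefix sequence, so nothing is lost in the conversion.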

Sneak peek into the Data generation process

The paper mainly considers three tasks in Symbolic Mathematics:

  • Symbolic Integration
  • First-order differential equations
  • Second-order differential equations

Now the question remains: how do we generate the datasets? Creating large datasets of problems and their solutions turns the task into a supervised learning problem. Generating a random problem starts with a random unary-binary tree: operators are sampled at random for the internal nodes, and constants or variables are sampled for the leaves.
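
A minimal sketch of that sampling step, assuming small illustrative operator and leaf sets (the paper samples trees uniformly by node count; this simpler depth-bounded recursion just illustrates the idea):

```python
import random

BINARY = ["+", "-", "*", "/"]
UNARY = ["exp", "log", "sin", "cos"]
LEAVES = ["x", "1", "2", "3", "4", "5"]

def random_tree(depth, rng):
    # At depth 0 we must emit a leaf (a constant or variable).
    if depth == 0:
        return rng.choice(LEAVES)
    # Otherwise pick a unary or binary operator for this internal node
    # and recurse for its children.
    arity = rng.choice([1, 2])
    op = rng.choice(UNARY if arity == 1 else BINARY)
    return [op] + [random_tree(depth - 1, rng) for _ in range(arity)]

print(random_tree(3, random.Random(0)))
```

The resulting nested lists are exactly the trees from the previous section, ready to be serialized into prefix sequences.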

There are three different ways to generate data, i.e., a problem and its solution, for symbolic integration.

The first one is the Forward (FWD) approach

  • This approach generates a random function f, computes its antiderivative F using an external symbolic framework (SymPy, Mathematica, etc.), and adds the pair (f, F) to the training set
  • However, this approach is slow and limited to functions the framework can integrate.

The alternative is the Backward (BWD) approach

  • This approach generates a random function F, computes its derivative f, and adds the pair (f, F) to the training set
  • BWD leads to long problems with short solutions, whereas FWD leads to short problems with longer solutions.
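
The BWD direction can be sketched with SymPy in a couple of lines; the function F below is a fixed example, whereas the paper samples F as a random tree:

```python
import sympy as sp

x = sp.symbols('x')

# Backward: pick a function F and differentiate it; (f, F) is a valid
# training pair because F is an antiderivative of f.
F = sp.exp(x) * sp.sin(x)
f = sp.diff(F, x)

# Forward goes the other way: integrate f and keep the pair only when
# the framework succeeds (which is what makes FWD slow and limited).
assert sp.simplify(sp.integrate(f, x) - F) == 0
```

Differentiation always succeeds, which is why BWD can generate pairs far faster than FWD and is not limited to what the symbolic framework can integrate.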

The middle ground is the Integration by parts (IBP) approach

  • This approach generates random functions F and G and computes their derivatives f and g. If fG is already in the training set, the integral of Fg is computed with integration by parts: ∫ F g dx = FG − ∫ f G dx, and the resulting pair is added to the training set.
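
The integration-by-parts identity underlying IBP can be checked with SymPy on a concrete example (the choices of F and G below are illustrative):

```python
import sympy as sp

# Verify  ∫ F g dx = F*G − ∫ f G dx  with  f = F', g = G'.
x = sp.symbols('x')
F, G = x**2, sp.sin(x)
f, g = sp.diff(F, x), sp.diff(G, x)

lhs = sp.integrate(F * g, x)
rhs = F * G - sp.integrate(f * G, x)
assert sp.simplify(lhs - rhs) == 0
```

The appeal of IBP is that it reuses known pairs: whenever ∫ f G dx is already in the training set, ∫ F g dx comes for free from the identity, without calling a full symbolic integrator.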

Ordinary Differential Equations (ODEs) require a slightly more involved generation procedure, so it is omitted from this blog post. Five different datasets were generated using the FWD, BWD, IBP, ODE1, and ODE2 approaches. Expressions with up to

  • n = 15 internal nodes,
  • L = 11 leaf values,
  • p_2 = 4 binary operators (+, −, ×, /), and
  • p_1 = 15 unary operators (exp, log, sqrt, sin, cos, tan, …, tanh⁻¹) were considered.

Attention is all you need

A Transformer model was leveraged to tackle the seq2seq task. The architecture is a non-recurrent neural network built around an attention mechanism. It comprises 6 layers, 8 attention heads, and 512 dimensions, and was trained with a cross-entropy loss and the Adam optimizer.

Transformer Architecture

The Seq2Seq model takes an input sequence of arbitrary length and transforms it into some output sequence of arbitrary length.

Input sequence in German transformed to output sequence in English

Moreover, beam search is used at inference to generate the most likely answer token by token. Using beam search also allows several candidate solutions to be generated for each problem.
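
The token-by-token generation can be sketched as follows; `step_probs` stands in for the trained decoder (here a hand-written toy table), and the token names are illustrative:

```python
import math

def beam_search(step_probs, start, eos, beam_size, max_len):
    """Keep the beam_size highest-scoring partial sequences at each step."""
    beams = [([start], 0.0)]  # (token sequence, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, p in step_probs(tuple(seq)).items():
                new = (seq + [tok], score + math.log(p))
                # Sequences that emit the end token are complete.
                (finished if tok == eos else candidates).append(new)
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if not beams:
            break
    return sorted(finished, key=lambda c: c[1], reverse=True)

# Toy decoder: after "<s>" the model prefers token "a" over "b".
table = {
    ("<s>",): {"a": 0.6, "b": 0.4},
    ("<s>", "a"): {"</s>": 1.0},
    ("<s>", "b"): {"</s>": 1.0},
}
hyps = beam_search(lambda prefix: table[prefix], "<s>", "</s>", 2, 5)
print(hyps[0][0])  # ['<s>', 'a', '</s>']
```

Because the search keeps several hypotheses, every completed sequence in the beam is a candidate solution, which is exactly what the evaluation below exploits.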

Results

The evaluation process involves

  • Testing the trained models on 5,000 held-out examples from the same dataset
  • No overlap: no test problem has been seen during training
  • The solutions are checked with an external framework, SymPy
  • Integration: differentiate the solution and subtract it from the problem; the result should simplify to zero
  • ODE: substitute the solution into the equation and check that it is satisfied
  • With beam search, if a solution is incorrect, the next candidate is simply tried, up to 10 or 50 solutions.
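
The integration check is cheap to express with SymPy; the `candidate` below stands in for a model prediction:

```python
import sympy as sp

# Accept a predicted antiderivative when its derivative minus the
# problem simplifies to zero.
x = sp.symbols('x')
problem = sp.cos(x) * sp.exp(sp.sin(x))
candidate = sp.exp(sp.sin(x))  # hypothetical model output
assert sp.simplify(sp.diff(candidate, x) - problem) == 0
```

This asymmetry is what makes evaluation tractable: verifying a solution only requires differentiation or substitution, which always succeeds, even when finding the solution is hard.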


Summary

Treating symbolic mathematics as a translation task turns out to be remarkably effective: by representing expressions as trees, serializing them into token sequences, and training a standard Transformer on large generated datasets, the model learns to integrate functions and to solve first- and second-order differential equations by directly translating a problem into its solution, with beam search providing several candidate answers that can be cheaply verified with SymPy.
