Functional release update: Multilingual, you?

There’s a Swedish couple that lives in my building, and every now and then the wife will turn to her husband and ask him, “What clock is it?” Turns out “clock” and “the time” are the same word in Swedish, and while this anecdote proves flubbed semantics can be charming, it also reminds you that you can learn a lot about the nature of language processing by taking your headphones off in the elevator. But I digress.

When we opened our offices in the UK and began to expand into European markets, our receipt technology faced the very-human challenge of learning foreign languages.

Our AI-powered receipt management solution learned to “read” English receipts by being exposed to hundreds of thousands of them over many years. This is the neural network approach we’ve talked about here, here, and here. It has allowed us to build a tool that continually improves itself, surpasses competitors in both accuracy and scalability, and, so long as users were uploading receipts from English-speaking countries, guaranteed the most efficient and painless experience.

For all intents and purposes, our solution is a native English speaker. So what happened when it was exposed to receipts in new languages? Much like my neighbor, it got its wires crossed.

How did we fix it?

If we wanted to continue using our neural network approach to teach our machine other languages, we’d have to collect thousands of receipts from every country we wanted to enter, exposing our machine to dozens of new languages, simultaneously, over a sustained period of time, over and over and over again. This approach works well when you have tons of data, but as we entered new markets, short of rummaging through garbage bags for receipts, we wouldn’t be able to collect enough data to rely on it alone.

So, our machine learning team decided to rethink how our solution looked at a receipt. They determined that while there are infinite variations, in general, receipts have more commonalities than differences. With that in mind, they built a new model that moved away from just reading characters, and instead looked at the entire structure.

Let’s dive in deeper…

To explain how it all works, let’s break our receipt-reading process into three components. (Note: this is a highly simplified breakdown, and is not meant to offend any technical folk.)

  1. Input
    This is the receipt data––the individual characters and numbers––that are picked up by our Optical Character Recognition (OCR) system and fed to the neural network for processing.

  2. Neural Network
    This is the “brain” part of the process, which takes the data input and makes sense of it. The neural network is what figures out that the characters T, O, T, A, and L mean the word “TOTAL”. The details of the neural network architecture aren’t necessary to understand here, but you can read a more in-depth explanation in Part II of our machine learning series.

  3. Output
    The output is the receipt information extracted and classified by relevant data fields, such as the merchant name, the date of purchase, the total amount spent, and so on.
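As a rough sketch (the function names and the rule-based “network” are purely illustrative, not our actual code), the three stages fit together like this:

```python
# Toy sketch of the three-stage pipeline. All names and logic are illustrative.

def ocr(receipt_image):
    """Stage 1 - Input: OCR turns pixels into raw characters."""
    # A real OCR engine would go here; we fake its output for the sketch.
    return "ACME MART\nTOTAL $12.99"

def neural_network(raw_text):
    """Stage 2 - the 'brain' that makes sense of the raw characters."""
    # Stand-in for the trained model: here, just a couple of rules.
    fields = {}
    for line in raw_text.splitlines():
        if line.startswith("TOTAL"):
            fields["total"] = line.split("$")[-1]
        else:
            fields.setdefault("merchant", line)
    return fields

def extract(receipt_image):
    """Stage 3 - Output: receipt information classified into data fields."""
    return neural_network(ocr(receipt_image))

print(extract(None))  # {'merchant': 'ACME MART', 'total': '12.99'}
```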

Then and now

Before, our neural network essentially viewed receipts as one long string of characters. It would pick up each character individually, and then put them together into words to figure out the output. This character-based approach meant that the machine would read a receipt one character at a time.
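In toy form (purely illustrative), that character stream looked something like this:

```python
# The character-based view: the whole receipt arrives as a stream of
# individual characters, and the network must reassemble words itself.
receipt = "TOTAL $12.99"
char_stream = list(receipt)
print(char_stream)
# ['T', 'O', 'T', 'A', 'L', ' ', '$', '1', '2', '.', '9', '9']

word = "".join(char_stream[:5])  # the network's job: back to "TOTAL"
```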

In our new structural model, the neural network stays relatively untouched––the architecture of the network is the same. The input, however, is different. Instead of inputting individual characters, we’re inputting entire words (or numbers) along with their contexts or “meaning”.

Where our pure deep-learning model relied on the system to “self-teach” the meaning of a word or number, our new model supplements the network with human domain expertise. We feed the system information about receipts beforehand, signalling it to look for certain features when classifying a piece of receipt information.

For example, as humans, we know that when we see the keywords “HST” or “VAT”, the numbers that follow are taxes. With our new model, we’ve supplied the machine with this knowledge as well, so that when it sees certain keywords, it’s triggered to recognize what follows.
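A minimal sketch of that keyword trigger (the keyword list and the helper function are hypothetical stand-ins, not our production model):

```python
TAX_KEYWORDS = {"HST", "VAT", "GST"}  # hypothetical keyword list

def tag_taxes(tokens):
    """Label a number as a tax amount when it follows a tax keyword."""
    tagged = []
    expect_tax = False
    for token in tokens:
        if token.upper() in TAX_KEYWORDS:
            expect_tax = True
            tagged.append((token, "TAX_KEYWORD"))
        elif expect_tax:
            tagged.append((token, "TAX_AMOUNT"))
            expect_tax = False
        else:
            tagged.append((token, "OTHER"))
    return tagged

print(tag_taxes(["MILK", "3.49", "HST", "0.45"]))
# [('MILK', 'OTHER'), ('3.49', 'OTHER'), ('HST', 'TAX_KEYWORD'), ('0.45', 'TAX_AMOUNT')]
```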

To expand this approach to new geographies, say Spain, we had to do the translating beforehand and inform our machine to be on the “lookout” for Spanish keywords like “impuesto”, which means “tax”.
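Extending the earlier sketch to a new market then amounts to translating the keyword list up front (the per-language entries below are illustrative guesses, not our actual vocabulary):

```python
# Hypothetical per-language tax keywords; "IVA" is the common Spanish
# abbreviation for value-added tax.
TAX_KEYWORDS_BY_LANGUAGE = {
    "en": {"HST", "VAT", "GST"},
    "es": {"IMPUESTO", "IVA"},
}

# The machine watches for any of them, regardless of receipt language.
ALL_TAX_KEYWORDS = set().union(*TAX_KEYWORDS_BY_LANGUAGE.values())
```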

Beyond keywords, we described other features of a data field, such as its location on the receipt, or what percentage of an input is digits or symbols.

For example, we know that receipt totals sit at the bottom of most receipts, and that when the word “Total” is followed by some whitespace, a currency symbol, and a string of digits separated by a period, that figure is our “Total”. So, with our new model, the same receipt is processed structurally rather than character by character.
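A toy sketch of those structural features (the feature names, the regex, and the sample lines are all illustrative assumptions):

```python
import re

def line_features(line, line_index, total_lines):
    """Hand-designed structural features for one receipt line."""
    return {
        # Where the line sits on the receipt: 0 = top, 1 = bottom.
        "relative_position": line_index / max(total_lines - 1, 1),
        # What fraction of the line's characters are digits.
        "digit_fraction": sum(c.isdigit() for c in line) / max(len(line), 1),
        # "Total", whitespace, a dollar sign, then digits with a period.
        "matches_total_pattern": bool(
            re.search(r"Total\s+\$\d+(\.\d{2})?", line, re.IGNORECASE)
        ),
    }

lines = ["ACME MART", "MILK  $3.49", "Total  $3.94"]
feats = [line_features(l, i, len(lines)) for i, l in enumerate(lines)]
print(feats[-1]["matches_total_pattern"])  # True
```

Because the model sees these features rather than the raw characters, the bottom line scores as a likely “Total” even before any language-specific knowledge is applied.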

Because we’re relying on features instead of the language itself, our solution is able to perform well on a wider range of receipts—even if they’re in a language it hasn’t seen before.

So…does it work?

Surprise, it does! We’ve seen improvements in training speed, accuracy, and processing times in multiple languages that we’ve tested it on.

But there’s more.

Not only did the new approach help our machine learn to read receipts in new languages, it also significantly improved the predictive power of our deep learning algorithm on English receipts. Even more amazingly, the new model is 7X faster than our old one, which means we can train our models faster, run more experiments, and ultimately accelerate the pace of our innovation, all while improving processing times for a superior end-user experience.

Supplementing our model with structural resemblances between receipts means that we aren’t reliant on language alone, and we don’t need to train our machine on individual dialects. With this approach, we can continue training our models in English and our machine’s learning will scale across languages.

For us, that means less clock spent hunting for foreign receipts and more clock spent building solutions our users need.