How did the machine read nutritional facts?

Published in Analytics Vidhya · 7 min read · Apr 20, 2021

How it was possible to read nutritional tables with OCR, Tesseract and a lot of computer vision!

Some time ago I was immersed in a project to build a data lake of food data, collecting everything from general product data to the nutritional information of a large number of foods. At a certain point, however, we realized that most of the nutritional information was embedded in images rather than in text, which made it difficult to collect through web scraping with Python and the Scrapy framework.

This problem opened the opportunity to learn about a technology that has been around for a long time and gained prominence with David H. Shepard: Optical Character Recognition (OCR). In fact, the following video is very cool to get a sense of the early days of OCR:

So the challenge was to start from an image and get the data from its nutritional table as text.

Tesseract and PyTesseract

While researching OCR, I came across a tool that is currently among the most widely used and is known for its efficiency. This tool is Tesseract.

Logo for Google’s Tesseract OCR software. Source: Wikipedia.

Tesseract is an open source optical character recognition engine released under the Apache 2.0 license. It was originally developed at HP, with an initial implementation in C, and its development was later sponsored by Google. For more details, feel free to look at the repository on GitHub or refer to the following article, which is also quite thorough on the subject:

Optical Character Recognition (OCR) for Low Resource languages with Tesseract version.

medium.com

However, with the growing use of Python, the community took up this cause and developed a wrapper named Pytesseract, which exposes the features of the original project to Python code. Pytesseract is also an open source project, available on GitHub and ready to install from the Python Package Index (PyPI).

This engine and package were essential for the development of the project, to the point that they became a requirement (which is covered in the project repository).

Theory of Tables

Calories on the New Nutrition Facts Label. Source: FDA.

As mentioned at the beginning, the challenge was focused on reading nutritional tables. Unlike a controlled environment, nutritional tables contain many words that are sometimes disconnected from each other, as well as horizontal and vertical lines that separate the words from the values. So the question is: how can we make them easier for the machine to read? The mission was to remove these lines and leave only the words and values, so that we would not “divert” the machine’s attention to things it did not need to know.

To solve this problem two things were done.

The first was to use horizontal and vertical kernels to identify where these lines were, generating a binarized image. To do this we used OpenCV, a comprehensive and very popular computer vision library with Python bindings. Below is the result of the lines found.

Image of binarized lines. Source: Author.

But once you have these lines identified, how do you delete them from the image?

K-means

The answer was nothing less than K-means, an unsupervised machine learning algorithm that is commonly used for clustering. It needs no labeled data to operate; you only have to choose K, i.e. the number of clusters required for your problem.

But what’s with the K-means after all? What does it have to do with the problem of lines?

As described in its brief introduction, K-means is used for grouping. So instead of deleting the lines, we performed a color clustering on the image and overwrote the lines with the predominant color of the image. In a nutrition table, the main colors usually belong to the background, the lines, the letters and sometimes details, so the predominant one is normally the background.

For this reason, the well-known elbow method was not needed to choose K, because the number of clusters is known in advance.
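As a rough sketch of this idea, the snippet below clusters the pixel colors with scikit-learn's `KMeans`, takes the center of the most populated cluster as the predominant color, and paints it over the pixels flagged by a line mask. The helper names `predominant_color` and `overwrite_lines`, the choice of four clusters and the synthetic image are assumptions for illustration, not the project's own functions.

```python
import numpy as np
from sklearn.cluster import KMeans


def predominant_color(image: np.ndarray, n_clusters: int = 4) -> np.ndarray:
    """Cluster the pixel colors and return the center of the most populated cluster."""
    # Flatten the image into a list of color vectors, one per pixel.
    pixels = image.reshape(-1, image.shape[-1]).astype(np.float64)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(pixels)
    # Count how many pixels fell into each cluster and keep the largest one.
    counts = np.bincount(model.labels_, minlength=n_clusters)
    return model.cluster_centers_[np.argmax(counts)].round().astype(np.uint8)


def overwrite_lines(image: np.ndarray, line_mask: np.ndarray, n_clusters: int = 4) -> np.ndarray:
    """Paint every pixel flagged in line_mask with the predominant color."""
    result = image.copy()
    result[line_mask > 0] = predominant_color(image, n_clusters)
    return result


# Synthetic table with four colors: background, a rule, "letters" and a detail.
image = np.full((40, 60, 3), 255, dtype=np.uint8)   # white background
image[20, :] = (128, 128, 128)                       # gray table rule
image[5:10, 5:15] = (0, 0, 0)                        # black "letters"
image[30:33, 40:50] = (200, 30, 30)                  # colored detail

# Mask marking where the rule is (in practice it would come from the line detection step).
line_mask = np.zeros(image.shape[:2], dtype=np.uint8)
line_mask[20, :] = 255

cleaned = overwrite_lines(image, line_mask)
```

Overwriting with the background color, rather than cutting pixels out, keeps the image the same size and leaves the text untouched, which is what the OCR step needs.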

The result of this process is shown in the image below, as are the functions used to overwrite the previously detected lines. The K-means implementation is documented in detail in the sklearn library, and the complete code is in the project repository.

Table before and after transformation, respectively. Source: Author.

Functions used to remove lines of an image. Source: Author.

EAST — Efficient and Accurate Scene Text Detector

At this stage the machine is still not reading anything. Instead we use EAST, an open source scene text detector whose code is available on GitHub. In this project it was used to make the machine's work even easier, taking away distractions and focusing on the text in the image.

As I said, EAST alone only locates where there is text; it does not read it. We can compare it to an illiterate person, who knows that there is text there but has no idea what is written. Its result can be seen in the image below:

EAST: An Efficient and Accurate Scene Text Detector. Source: YouTube.

There are excellent tutorials and publications on the Internet that teach how to use this kind of technology, of which I can mention:

OpenCV Text Detection (EAST text detector) - PyImageSearch

In this tutorial you will learn how to use OpenCV to detect text in natural scene images using the EAST text detector…

www.pyimagesearch.com

After applying EAST together with a series of morphological filters, we read the words in the Western order, i.e., from top to bottom and from left to right, which is closer to how a human reads, as demonstrated in the following image.
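The reading order described above can be sketched as a small sorting routine over the detected text boxes. This is only an illustration: the `(x, y, width, height)` box format, the function name `sort_reading_order` and the `row_tolerance` parameter are assumptions, not part of EAST or of the project.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height), with the origin at the top left


def sort_reading_order(boxes: List[Box], row_tolerance: float = 0.5) -> List[Box]:
    """Group boxes into rows by vertical center, then sort each row left to right."""
    rows: List[List[Box]] = []
    # Visit boxes from top to bottom, using the vertical center of each box.
    for box in sorted(boxes, key=lambda b: b[1] + b[3] / 2):
        center_y = box[1] + box[3] / 2
        if rows:
            reference = rows[-1][0]
            reference_center = reference[1] + reference[3] / 2
            # Same row if the centers differ by less than a fraction of the box height.
            if abs(center_y - reference_center) <= row_tolerance * reference[3]:
                rows[-1].append(box)
                continue
        rows.append([box])
    # Inside each row, read from left to right.
    return [box for row in rows for box in sorted(row, key=lambda b: b[0])]


# Two rows of a table, given in scrambled order and slightly misaligned vertically.
boxes = [(120, 12, 40, 20), (10, 10, 100, 20), (10, 42, 80, 20), (120, 40, 30, 20)]
ordered = sort_reading_order(boxes)
```

Grouping by rows first matters because detected boxes on the same printed line rarely share the exact same `y` coordinate, so a plain sort by `(y, x)` could interleave a label with the value from another row.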

At this stage, the comparison is no longer with an illiterate person, but with a child who is learning to read and understands a few things. But how can “this child’s” reading be corrected or improved?

SymSpell

SymSpell is a spelling correction library built on the Symmetric Delete spelling correction algorithm. Its author's original 2012 article reports it as 1000x faster than the approach it was compared with, and later implementations are reported to be faster still. It works with a dictionary that is loaded into memory and supports a number of languages. Its repository is on GitHub and can be accessed from the following link.

wolfgarbe/SymSpell

Spelling correction & Fuzzy search: 1 million times faster through Symmetric Delete spelling correction algorithm The…

github.com

Continuing with our analogy, we can say that SymSpell is like a teacher for our reader, applying corrections based on an edit distance between words.
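To give an intuition for the Symmetric Delete idea, here is a small pure-Python sketch. It is not the SymSpell library or its API: the function names and the tiny vocabulary are made up for illustration, plain Levenshtein distance is used, and word frequencies are ignored. The core trick is that only deletions are generated, both for the dictionary words (once, ahead of time) and for the input term (at lookup), and two words become candidates for each other when their deletion sets meet.

```python
from collections import defaultdict
from itertools import combinations


def deletes(word, max_distance):
    """All strings obtained by deleting up to max_distance characters (the word itself included)."""
    variants = {word}
    for k in range(1, min(max_distance, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - k):
            variants.add("".join(word[i] for i in keep))
    return variants


def edit_distance(a, b):
    """Plain Levenshtein distance, used to verify and rank the candidates."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                        # deletion
                               current[j - 1] + 1,                     # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def build_index(vocabulary, max_distance=2):
    """Precompute the deletion variants of every dictionary word (done once, kept in memory)."""
    index = defaultdict(set)
    for word in vocabulary:
        for variant in deletes(word, max_distance):
            index[variant].add(word)
    return index


def correct(term, index, max_distance=2):
    """Return the closest dictionary word, or the term unchanged if nothing is close enough."""
    candidates = set()
    for variant in deletes(term, max_distance):
        candidates |= index.get(variant, set())
    scored = [(edit_distance(term, word), word) for word in candidates]
    scored = [(distance, word) for distance, word in scored if distance <= max_distance]
    return min(scored)[1] if scored else term


# Hypothetical mini-dictionary of words found in nutrition tables.
index = build_index(["calories", "protein", "sodium", "sugars", "fat", "carbohydrate"])
fixed = [correct(word, index) for word in ["calor1es", "protien", "sodum", "xyz"]]
```

Because the expensive part (generating deletions for the dictionary) happens once at load time, each lookup only touches a handful of keys, which is where the speed of this family of algorithms comes from.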

FlowChart Implementation

The whole process of reading the image with the nutrition table until you have the text corrected and ready for use is summarized in the following image.

Furthermore, the entire implementation of this project is available on the Python Package Index (PyPI) and can be used in just three lines, as in the following example.

Example of Nkocr use. Source: Author.

It is worth saying that this project is part of the open source community and welcomes new contributions, including yours! Feel free to open issues and pull requests.

Lucs1590/Nkocr

This is a module to make specifics OCRs at food products and nutritional tables. As a prerequisite of this project, we…

github.com

Logo of the project. Source: Author.

Thank you very much for reading, I hope it has added something to your life. Feel free to contact me for more information!

References

tesseract-ocr/tesseract

This package contains an OCR engine - libtesseract and a command line program - tesseract. Tesseract 4 adds a new…

github.com

sklearn.cluster.KMeans - scikit-learn 0.24.1 documentation

K-Means clustering. Read more in the User Guide. Parameters: n_clusters: int, default=8. The number of clusters to form as well as…

scikit-learn.org

1000x Faster Spelling Correction algorithm (2012)

Update1: An improved SymSpell implementation is now 1,000,000x faster. Update2: SymSpellCompound with Compound aware…

wolfgarbe.medium.com

OpenCV: OpenCV-Python Tutorials


docs.opencv.org

A quick overview of the implementation of a fast spelling correction algorithm

Spellcheckers and autocorrect can feel like magic. They’re at the core of everyday applications — our phones, office…

medium.com

Written by Lucas de Brito Silva