Invoice Extraction Made Easy with LayoutLM and Datasaur

Why Automate Invoice Extraction

Invoices are a fundamental part of any business. They contain crucial information like dates, amounts, and vendor details that need to be recorded accurately. However, manually extracting this information from various invoice formats can be time-consuming and error-prone. Automating this process saves time, reduces mistakes, and allows your team to focus on more important tasks.

Introduction to LayoutLM

Imagine if a computer could read and understand documents much as a person does, only faster. That is the idea behind LayoutLM. Developed by Microsoft, LayoutLM is a pre-trained model that interprets documents by using both the text and where that text sits on the page. It's like giving your computer a pair of eyes and a brain!

The open-source LayoutLM model is available on the Hugging Face Hub and can be fine-tuned on your own dataset.

Making Data Preparation Easy with Datasaur

Before LayoutLM can work its magic, it needs to learn from examples. This is where Datasaur comes in. Datasaur is a user-friendly platform that helps you prepare your documents so that LayoutLM can learn from them. Think of it as teaching a new employee how to do a task by showing them the ropes.

A Simple Guide to Automating Invoice Extraction

Extracting Text from Invoices
Start by turning your scanned invoices into editable text using Optical Character Recognition (OCR) technology. OCR acts like a scanner that converts images of text into actual text that the computer can process.
Labeling Important Information
With Datasaur, you can easily highlight and label key pieces of information on your invoices, such as: (1) Invoice Date, (2) Total Amount, (3) Vendor Name, (4) Itemized Charges. This process is similar to using a highlighter on paper documents to mark important details.
Teaching LayoutLM
Once your invoices are labeled, LayoutLM uses this information to learn what to look for in new invoices. This teaching process is called "training." In general, the more representative examples you provide, the better LayoutLM performs.
Automating the Extraction Process
After training, LayoutLM can automatically extract information from new invoices across a variety of formats. This means you can process large numbers of invoices quickly, with accuracy that depends on how well your training examples cover those formats.
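One detail behind these steps: LayoutLM takes each word's position on the page as well as its text, and the Hugging Face LayoutLM documentation expects bounding boxes scaled to a 0 to 1000 range relative to the page size. Below is a minimal sketch of that normalization. It assumes the OCR step returns pixel boxes as (left, top, right, bottom); the sample words, boxes, and page size are made up for illustration.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def normalize_box(box: Box, page_width: int, page_height: int) -> List[int]:
    """Scale a pixel-space box to the 0-1000 grid LayoutLM expects."""
    left, top, right, bottom = box
    return [
        int(1000 * left / page_width),
        int(1000 * top / page_height),
        int(1000 * right / page_width),
        int(1000 * bottom / page_height),
    ]


# Hypothetical OCR output for a 1240 x 1754 pixel scan (illustrative values).
ocr_words = [
    ("Invoice", (100, 50, 260, 90)),
    ("Date:", (270, 50, 350, 90)),
]
page_width, page_height = 1240, 1754

normalized = [
    (word, normalize_box(box, page_width, page_height))
    for word, box in ocr_words
]
```

Each entry in `normalized` pairs a word with its scaled box, which is the shape of input a LayoutLM tokenizer and model pipeline typically builds on.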

Case Study

This section walks through the document processing workflow, using LayoutLM as the extraction model and Datasaur for labeling. We used an invoice dataset split as follows:

Train Set: 630 invoice documents
Validation Set: 180 invoice documents
Test Set: 90 invoice documents
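As a quick check on these numbers, the snippet below computes the total and each split's share. It uses only the counts listed above.

```python
# Document counts per split, as listed above.
splits = {"train": 630, "validation": 180, "test": 90}

total = sum(splits.values())
shares = {name: round(100 * count / total) for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name}: {count} documents ({shares[name]}% of {total})")
```

This works out to a 70/20/10 split across 900 documents.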

The dataset used the following entity labels:

This tutorial also shows Datasaur’s human-in-the-loop workflow: in the next labeling iteration, the fine-tuned LayoutLM is integrated as a labeling assistant.

Labeling with Datasaur

At Datasaur, we simplify complex data preparation with an efficient workflow—from extracting text transcriptions from scanned documents to producing a dataset ready for training. Here is the required data preparation pipeline, fully supported by Datasaur:

Generate OCR Results: Extract text from scanned invoices using OCR providers supported by Datasaur. If you prefer to use your own OCR, we can also integrate it with the Datasaur app.

Labeling: Use Datasaur’s intuitive interface to annotate the extracted text, associating specific entities with their respective labels. You can also edit the text if the OCR fails to recognize certain characters.

Export: Once labeling is complete, export the labeled data with the File Transformer feature, which converts the data into a format compatible with LayoutLM fine-tuning. File Transformer is a customizable feature that allows pre- and post-processing of labeling project files using a custom TypeScript script. Learn more in our File Transformer documentation.

We currently provide File Transformer templates specifically for exporting invoice labeling projects to the input format required by LayoutLM.

Here is the format of the exported file that is ready for fine-tuning LayoutLM:

Fine-tuning LayoutLM

We fine-tuned the LayoutLM model on data tagged in the Inside–Outside (IO) format. In this format, text that belongs to an entity is tagged with the label assigned during the Datasaur labeling process, while unlabeled text is marked as O. The table below shows a comparison of tagged texts in Datasaur and the corresponding exported file, which is now ready for LayoutLM fine-tuning:
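To make the IO scheme concrete, here is a minimal sketch that turns labeled word spans into one tag per word. The label names (`INVOICE_DATE`, `TOTAL_AMOUNT`), the sample words, and the span representation are assumptions for illustration; they are not the dataset's actual labels or Datasaur's export format. Some IO variants also prefix entity tags with `I-`, which this sketch omits.

```python
from typing import List, Tuple

# A span is (start_word_index, end_word_index_exclusive, label).
Span = Tuple[int, int, str]


def to_io_tags(words: List[str], spans: List[Span]) -> List[str]:
    """Assign one IO tag per word; words outside every span get "O"."""
    tags = ["O"] * len(words)
    for start, end, label in spans:
        for i in range(start, end):
            tags[i] = label
    return tags


# Hypothetical invoice snippet and annotations.
words = ["Invoice", "Date:", "12/03/2024", "Total", "$1,250.00"]
spans = [(2, 3, "INVOICE_DATE"), (4, 5, "TOTAL_AMOUNT")]

tags = to_io_tags(words, spans)

# Integer ids for the classifier head, with "O" fixed at index 0.
labels = ["O"] + sorted({label for _, _, label in spans})
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}
```

Mappings like `label2id` and `id2label` are what a token classification model usually needs to translate between tag names and class indices.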

LayoutLM was then fine-tuned on the exported file using the Trainer from Hugging Face Transformers. The Trainer is a general-purpose training utility that supports a range of NLP tasks, including the token classification setup used with LayoutLM. For more detailed guidance, refer to the Transformers - LayoutLM documentation.
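One preprocessing step worth knowing about: the tokenizer splits words into sub-word tokens, so word-level tags have to be expanded to token level before training. A common convention in Hugging Face token classification examples is to label only the first token of each word and set the rest to -100, the index that PyTorch's cross-entropy loss ignores by default. The sketch below assumes a `word_ids` list like the one a fast tokenizer returns from `word_ids()`; the values here are hand-written for illustration rather than real tokenizer output, and this is not necessarily the exact preprocessing used in the case study.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # skipped by PyTorch's cross-entropy loss by default


def align_labels(
    word_ids: List[Optional[int]], word_label_ids: List[int]
) -> List[int]:
    """Expand word-level label ids to token level.

    The first token of each word keeps the word's label; special tokens
    (word id None) and later sub-word pieces get IGNORE_INDEX.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            aligned.append(word_label_ids[word_id])
        else:
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Hand-written example: [CLS], word 0, word 1 (2 pieces), word 2 (3 pieces), [SEP].
word_ids = [None, 0, 1, 1, 2, 2, 2, None]
word_label_ids = [0, 0, 1]  # e.g. O, O, INVOICE_DATE under a hypothetical label map

token_labels = align_labels(word_ids, word_label_ids)
```

Labeling only the first piece keeps each word counted once in the loss and in the metrics. The same `word_ids` list can be used to repeat each word's bounding box for every one of its tokens.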

Validation performance

Based on this validation result, the fine-tuned LayoutLM fits our dataset well and is ready to be deployed as an endpoint for inference.
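For readers who want to compute comparable numbers on their own validation set, here is a sketch of per-label precision, recall, and F1 over word-level tags. The gold and predicted tags below are invented purely to exercise the function and are not results from this case study. Many evaluations score whole entities instead, for example with a library such as seqeval.

```python
from typing import Dict, List


def per_label_scores(gold: List[str], pred: List[str]) -> Dict[str, Dict[str, float]]:
    """Word-level precision, recall, and F1 for every label except "O"."""
    scores = {}
    labels = sorted((set(gold) | set(pred)) - {"O"})
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Invented tags, for illustration only.
gold = ["O", "INVOICE_DATE", "O", "TOTAL_AMOUNT", "TOTAL_AMOUNT"]
pred = ["O", "INVOICE_DATE", "O", "TOTAL_AMOUNT", "O"]

scores = per_label_scores(gold, pred)
```

Looking at scores per label rather than a single average makes it easier to spot which fields need more labeled examples.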

Assisted Labeling

AI is powerful, but human oversight is essential for verifying accuracy. Datasaur enables this crucial human-in-the-loop approach, allowing you to easily review and refine AI outputs. This synergy combines AI efficiency with human expertise: AI handles the heavy lifting, while your insights ensure accuracy and relevance. The result? Superior outcomes that neither could achieve alone.

When your model is ready, you can use the ML-Assisted Labeling Custom API feature to predict entities for unlabeled documents. Simply create a new project, upload your new samples, build a custom API, and let the model do the first pass of labeling for you. Your task is to review the applied labels and make any necessary corrections. Keeping humans in the loop this way can improve both labeling efficiency and label quality.

Navigate to the ML-Assisted Labeling extension and choose the Custom API provider.

Integrate your own LayoutLM inference endpoint (custom inference scripts with compatible input and output format are required), then click "Predict Labels."
Review and refine the predictions. You can either accept or reject all predictions. This streamlined process not only saves time but also reduces the chance of human error.
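A custom inference script usually has to turn per-word model predictions into entity spans before they can be shown for review. The sketch below groups consecutive words that share a non-O tag into one span. The output dictionary shape is hypothetical and is not Datasaur's Custom API format, which is defined in the Datasaur documentation; the sample words and tags are also made up.

```python
from typing import Dict, List


def group_io_predictions(words: List[str], tags: List[str]) -> List[Dict[str, object]]:
    """Merge runs of consecutive words with the same non-O tag into spans."""
    spans: List[Dict[str, object]] = []
    current = None
    for index, (word, tag) in enumerate(zip(words, tags)):
        if tag == "O":
            current = None  # an O word closes any open span
            continue
        if current is not None and current["label"] == tag:
            # Same label as the previous word: extend the open span.
            current["end"] = index + 1
            current["text"] += " " + word
        else:
            # New label (or first labeled word after O): start a new span.
            current = {"label": tag, "start": index, "end": index + 1, "text": word}
            spans.append(current)
    return spans


# Hypothetical model output for one invoice line.
words = ["Vendor:", "Acme", "Corp", "Total", "$1,250.00"]
tags = ["O", "VENDOR_NAME", "VENDOR_NAME", "O", "TOTAL_AMOUNT"]

spans = group_io_predictions(words, tags)
```

Because IO tags carry no boundary marker, two adjacent entities with the same label are merged into one span. That is one kind of prediction a human reviewer may need to split or correct.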

Why This Approach Works

Saves Time: Automate repetitive tasks so your team can focus on more strategic work.
Reduces Errors: Minimize mistakes that can occur with manual data entry.
Handles Variety: Process invoices in different formats without extra effort.
Scalable Solution: As your invoice volume grows, the same pipeline can handle more documents without a matching increase in manual effort.

Conclusion

Automating invoice extraction doesn't have to be complicated or reserved for tech giants. With Datasaur and LayoutLM, businesses of all sizes can leverage AI to make invoice processing faster, easier, and more accurate.

Ready to transform the way you handle invoices? Reach out to us at sales@datasaur.ai to learn more about how we can help streamline your workflow with cutting-edge AI solutions.