Analyzing PDF and images with deep learning using Amazon Textract

Mrityujay Kumar Singh
14 min read · Jan 11, 2021

In this post, we take a look at how to analyze PDFs and images with deep learning using Amazon Textract.

Introduction

Today, a lot of data is stored in PDFs, images, and other documents, and scanning those documents and converting them into structured data is a tedious process. The standard approach is to use an OCR tool, but OCR output is often error-prone and needs manual correction, which makes the process slow and inefficient.

Amazon has come up with a new AWS service, Amazon Textract, to solve this problem. Amazon Textract uses machine learning to instantly read and process any type of document, accurately extracting printed text, handwriting, forms, tables, and other data without the need for any manual effort or custom code.

With Textract, you can quickly automate manual document activities, enabling you to process millions of document pages in hours. Once the information is captured, you can take action on it within your business applications to initiate the next steps for a loan application, tax document, enrollment form, or medical claims processing. Additionally, you can create smart search indexes, or add in human reviews with Amazon Augmented AI to review nuanced or sensitive data.

Why Use Amazon Textract

The following are common use cases for using Amazon Textract:

  • Creating an intelligent search index — Amazon Textract enables you to create libraries of text that is detected in image and PDF files.
  • Using intelligent text extraction for natural language processing (NLP) — You can use Amazon Textract to extract text into words and lines. It also groups text by table cells if Amazon Textract document table analysis is enabled. Amazon Textract provides you with control over how text is grouped as input for NLP.
  • Accelerating the capture and normalization of data from different sources — Amazon Textract enables text and tabular data extraction from a wide variety of documents, such as financial documents, research reports, and medical notes. With Amazon Textract Analyze Document APIs, you can easily and quickly extract unstructured and structured data from your documents.
  • Automating data capture from forms — Amazon Textract enables structured data to be extracted from forms. With Amazon Textract Analysis APIs, you can build extraction capabilities into existing business workflows so that user data that are submitted through forms can be extracted into a usable format.

With synchronous processing, Amazon Textract can analyze single-page documents for applications where latency is critical. Amazon Textract also provides asynchronous operations to extend support to multipage documents.

Features of Amazon Textract

Key-value pair extraction

Amazon Textract enables you to detect key-value pairs in document images automatically so that you can retain the inherent context of the document without any manual intervention. This makes it easy to import the extracted data into a database or to provide it as a variable into an application.

Table extraction

Amazon Textract preserves the composition of data stored in tables during extraction. This is helpful for documents that are primarily composed of structured data, such as financial reports or medical records that have column names in the top row of the table followed by rows of individual entries.

Bounding boxes

All extracted data is returned with bounding box coordinates. The coordinates make up a polygon frame that encompasses each piece of identified data, such as a single word, a line, or a table. This lets you audit where a word or number came from in the source document. It also helps to guide the user in document search systems that return scans of original documents as the search result.
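As a rough illustration of how these coordinates can be used, the sketch below converts a bounding box into pixel coordinates. It assumes the box values are ratios of the page width and height, which is how Textract reports them; the function name and the sample values are hypothetical.

```python
def bounding_box_to_pixels(bounding_box, page_width, page_height):
    """Convert a Textract-style BoundingBox (ratios in [0, 1]) to pixels.

    Returns (left, top, right, bottom) as integers.
    """
    left = bounding_box["Left"] * page_width
    top = bounding_box["Top"] * page_height
    right = left + bounding_box["Width"] * page_width
    bottom = top + bounding_box["Height"] * page_height
    return round(left), round(top), round(right), round(bottom)


# Hypothetical box for a word found on a 1000 x 2000 pixel scan.
example_box = {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}
print(bounding_box_to_pixels(example_box, 1000, 2000))  # (250, 1000, 750, 1250)
```

The resulting rectangle can be passed to an image library to highlight the matched text on the original scan.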

Confidence scores

When information is extracted from documents, Amazon Textract returns a confidence score for everything it identifies so that you can make an informed decision about how you want to use the results.
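One way to act on these scores is to split results into those you accept automatically and those you send for review. The sketch below assumes the Confidence field is on a 0 to 100 scale, as in the response shown later in this post; the threshold of 90 and the sample blocks are arbitrary, hypothetical choices.

```python
def split_by_confidence(blocks, threshold=90.0):
    """Separate blocks into those at or above the threshold and those below it."""
    accepted, needs_review = [], []
    for block in blocks:
        # Treat a missing score as 0 so the block is routed to review.
        if block.get("Confidence", 0.0) >= threshold:
            accepted.append(block)
        else:
            needs_review.append(block)
    return accepted, needs_review


# Hypothetical WORD blocks, trimmed to the fields used here.
confidence_sample = [
    {"BlockType": "WORD", "Text": "Invoice", "Confidence": 99.2},
    {"BlockType": "WORD", "Text": "1O24", "Confidence": 61.5},
]
accepted_blocks, review_blocks = split_by_confidence(confidence_sample)
print([b["Text"] for b in accepted_blocks])  # words that can be used directly
print([b["Text"] for b in review_blocks])    # words to route to a human reviewer
```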

Benefits of Amazon Textract

Extract data quickly and accurately

Amazon Textract makes it easy to quickly and accurately extract data from documents and forms. Amazon Textract automatically detects a document’s layout and the key elements on the page, understands the data relationships in any embedded forms or tables, and extracts everything with its context intact. This means you can use the extracted data in an application or store it in a database without a lot of complicated code in between.

No code or templates to maintain

With Amazon Textract’s pre-trained machine learning models, you don’t need to write code for data extraction. This is because the models have already been trained on tens of millions of documents from many industries — including invoices, receipts, contracts, tax documents, sales orders, enrollment forms, benefit applications, insurance claims, and policy documents. You no longer need to maintain code for every document or form you might receive, or worry about how page layouts change over time.

Lower document processing costs

Amazon Textract’s text extraction API enables you to process documents for $1.50 per 1,000 pages. Whether you process a few hundred documents a year or millions, Amazon Textract provides OCR and structured data extraction (forms and tables) at a very low cost, and you only pay for what you use. There are no upfront commitments or long-term contracts.

Easily implement human reviews

With the addition of Amazon Augmented AI, you can build in human reviews to manage nuanced or sensitive workflows that require human judgment to get high-confidence predictions or to audit predictions on an ongoing basis.

How It Works

Amazon Textract Working

Amazon Textract enables you to detect and analyze the text in single or multi-page input documents.

Amazon Textract provides operations for detecting text only and operations for analyzing text that finds deeper relationships, such as form data and tables.

Amazon Textract provides synchronous operations for processing small, single-page documents and for getting near real-time responses. Amazon Textract also provides asynchronous operations that you can use to process larger, multipage documents. Asynchronous responses aren’t returned in real time.

When an Amazon Textract operation processes a document, the results are returned in an array of Block objects. A Block object contains information that’s detected about an item, including its location on the document and its relationship to other items on the document.
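To make the relationship idea concrete, here is a minimal sketch that follows CHILD relationships from each LINE block to its WORD blocks. It assumes the block fields shown in the response examples later in this post (Id, BlockType, Text, Relationships); the function name and sample data are hypothetical.

```python
def words_per_line(blocks):
    """Return (line_text, [word_text, ...]) for every LINE block."""
    by_id = {block["Id"]: block for block in blocks}
    result = []
    for block in blocks:
        if block.get("BlockType") != "LINE":
            continue
        # Collect the Ids listed under CHILD relationships, in order.
        child_ids = [
            child_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]
        words = [by_id[cid]["Text"] for cid in child_ids if cid in by_id]
        result.append((block.get("Text", ""), words))
    return result


# Hypothetical blocks, trimmed to the fields used here.
relationship_sample = [
    {"Id": "l1", "BlockType": "LINE", "Text": "Hello world",
     "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Hello"},
    {"Id": "w2", "BlockType": "WORD", "Text": "world"},
]
print(words_per_line(relationship_sample))
```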

Getting Started with Amazon Textract

This section walks through the setup you need before you start using Amazon Textract.

Set Up an AWS Account and Create an IAM User

Before you use Amazon Textract for the first time, complete the following tasks:

  1. Sign Up for AWS
  2. Create an IAM User

Sign Up for AWS

When you sign up for Amazon Web Services (AWS), your AWS account is automatically signed up for all released services in AWS. You’re charged only for the services that you use.

With Amazon Textract, you pay only for the resources you use. For more information about Amazon Textract usage rates, see Amazon Textract pricing. If you’re a new AWS customer, you can get started with Amazon Textract for free. For more information, see AWS Free Usage Tier.

If you don’t have an AWS account, perform the steps in the following procedure to create one.

To create an AWS account

  1. Open https://portal.aws.amazon.com/billing/signup.
  2. Follow the online instructions.

Part of the sign-up procedure involves receiving a phone call and entering a verification code on the phone keypad.

Create an IAM User

Services in AWS, such as Amazon Textract, require that you provide credentials when you access them. This is so that the service can determine whether you have permissions to access the resources owned by that service. The console requires your password. You can create access keys for your AWS account to access the AWS CLI or API.

If you signed up for AWS, but you haven’t created an IAM user for yourself, you can create one by using the IAM console. Follow the procedure to create an IAM user in your account.

To create an IAM user and sign in to the console

  1. Create an IAM user with administrator permissions in your AWS account. For instructions, see Creating Your First IAM User and Administrators Group in the IAM User Guide.
  2. As the IAM user, sign in to the AWS Management Console by using a special URL. For more information, see How Users Sign In to Your Account in the IAM User Guide.

Note

An IAM user with administrator permissions has unrestricted access to the AWS services in your account. The code examples in this guide assume that you have a user with the AmazonTextractFullAccess permissions. AmazonS3ReadOnlyAccess is required for examples that access documents that are stored in an Amazon S3 bucket. Depending on your security requirements, you might want to use an IAM group that’s limited to these permissions. For more information, see Creating IAM Groups.

Set Up the AWS CLI

The following steps show you how to install the AWS Command Line Interface (AWS CLI) that the examples in this documentation use. There are several different ways to authenticate AWS SDK calls. The examples in this guide assume that you’re using a default credentials profile for calling AWS CLI commands and AWS SDK API operations.

To set up the AWS CLI and the AWS SDKs

  1. Download and install the AWS CLI and the AWS SDKs that you want to use. For reference, see the AWS CLI Setup Guide.
  2. Create an access key for the user that you created in Create an IAM User. Steps 3 to 7 walk through this.
  3. Sign in to the AWS Management Console and open the IAM console at https://console.aws.amazon.com/iam/.
  4. In the navigation pane, choose Users.
  5. Choose the name of the user that you created in Create an IAM User.
  6. Choose the Security credentials tab.
  7. Choose Create access key. Then choose Download .csv file to save the access key ID and secret access key to a CSV file on your computer. Store the file in a secure location. You will not have access to the secret access key again after this dialog box closes. After you’ve downloaded the CSV file, choose Close.
  8. Set credentials in the AWS credentials profile file on your local system, located at:
  • ~/.aws/credentials on Linux, macOS, or Unix.
  • C:\Users\USERNAME\.aws\credentials on Windows.

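This file should contain the following lines:

[default]
aws_access_key_id = your_access_key_id
aws_secret_access_key = your_secret_access_key

Substitute your access key ID and secret access key for your_access_key_id and your_secret_access_key.

  9. Set the default AWS Region in the AWS config file on your local system, located at:
  • ~/.aws/config on Linux, macOS, or Unix.
  • C:\Users\USERNAME\.aws\config on Windows.

This file should contain the following lines:

[default]
region = your_aws_region

Substitute the AWS Region you want (for example, “us-west-2”) for your_aws_region.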

Note

If you don’t choose a Region, then us-east-1 is used by default.

Detecting and Analyzing Text in Single-Page Documents

Amazon Textract can detect and analyze the text in single-page documents that are provided as images in JPEG or PNG format. The operations are synchronous and return results in near real-time.

You can use Amazon Textract synchronous operations for the following purposes:

Text detection — You can detect lines and words on a single-page document image by using the DetectDocumentText operation.

Text analysis — You can identify relationships between detected text on a single-page document by using the AnalyzeDocument operation.

Calling Amazon Textract Synchronous Operations

Amazon Textract operations process document images that are stored on a local file system, or document images stored in an Amazon S3 bucket. You specify where the input document is located by using the Document input parameter. The document image can be in either PNG or JPEG format.

Request

Documents Passed as Image Bytes

You can pass a document image to an Amazon Textract operation by passing the image as a base64-encoded byte array. An example is a document image that’s loaded from a local file system. Your code might not need to encode document file bytes if you’re using an AWS SDK to call Amazon Textract API operations.

The image bytes are specified in the Bytes field of the Document input parameter. The following example shows the input JSON for an Amazon Textract operation that passes the image bytes in the Bytes input parameter.

Note

If you’re using the AWS CLI, you can’t pass image bytes to Amazon Textract operations. Instead, you have to reference an image that’s stored in an Amazon S3 bucket.

The following code shows how to load an image from a local file system and call an Amazon Textract operation.
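A minimal Python sketch of that call, assuming the boto3 SDK is installed and a default credentials profile is configured as described earlier. The function names and the file path in the usage note are hypothetical. With boto3 you pass the raw bytes and the SDK takes care of the base64 encoding.

```python
def make_textract_client(region_name=None):
    """Create a Textract client; boto3 is imported lazily so the helper
    below can also be exercised with a stand-in client."""
    import boto3
    return boto3.client("textract", region_name=region_name)


def detect_text_in_local_file(path, client):
    """Read an image from disk and pass its bytes to DetectDocumentText."""
    with open(path, "rb") as image_file:
        image_bytes = image_file.read()
    # The Bytes field of the Document parameter carries the raw image bytes.
    return client.detect_document_text(Document={"Bytes": image_bytes})
```

For example, detect_text_in_local_file("scan.png", make_textract_client()) would return the response dictionary for a local PNG, with the detected items under its Blocks key.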

Documents Stored in an Amazon S3 Bucket

Amazon Textract can analyze document images that are stored in an Amazon S3 bucket. You specify the bucket and file name by using the S3Object field of the Document input parameter. The following example shows the input JSON for an Amazon Textract operation that processes a document stored in an Amazon S3 bucket.

The following example shows how to call an Amazon Textract operation using an image stored in an Amazon S3 bucket.
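As a sketch of both the input JSON and the call, the Python below builds the Document parameter for an S3 object and passes it to DetectDocumentText through a boto3 Textract client. The helper names are hypothetical, and "bucket" and "document" are placeholders for your own bucket and file names.

```python
import json


def build_s3_document(bucket, name):
    """Build the Document parameter that points at an object in S3."""
    return {"S3Object": {"Bucket": bucket, "Name": name}}


def detect_text_in_s3_object(client, bucket, name):
    """Call DetectDocumentText on a document image stored in S3.

    `client` is expected to be a boto3 Textract client, for example
    boto3.client("textract").
    """
    return client.detect_document_text(Document=build_s3_document(bucket, name))


# The request body that the SDK sends for placeholder names:
print(json.dumps({"Document": build_s3_document("bucket", "document")}, indent=2))
```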

Response

The following shows the general shape of the JSON response from a call to DetectDocumentText.

{
  "Blocks": [
    {
      "BlockType": "string",
      "ColumnIndex": number,
      "ColumnSpan": number,
      "Confidence": number,
      "EntityTypes": [ "string" ],
      "Geometry": {
        "BoundingBox": {
          "Height": number,
          "Left": number,
          "Top": number,
          "Width": number
        },
        "Polygon": [
          {
            "X": number,
            "Y": number
          }
        ]
      },
      "Id": "string",
      "Page": number,
      "Relationships": [
        {
          "Ids": [ "string" ],
          "Type": "string"
        }
      ],
      "RowIndex": number,
      "RowSpan": number,
      "SelectionStatus": "string",
      "Text": "string"
    }
  ],
  "DetectDocumentTextModelVersion": "string",
  "DocumentMetadata": {
    "Pages": number
  }
}
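To turn a response like this into readable text, one option is to collect the Text of every LINE block, grouped by page. This is a small sketch under that assumption; the function name and the sample response are hypothetical.

```python
from collections import defaultdict


def lines_by_page(response):
    """Group the text of LINE blocks by page number."""
    pages = defaultdict(list)
    for block in response.get("Blocks", []):
        if block.get("BlockType") == "LINE":
            # Default to page 1 when the Page field is absent.
            pages[block.get("Page", 1)].append(block.get("Text", ""))
    return dict(pages)


# Hypothetical response, trimmed to the fields used here.
response_sample = {
    "Blocks": [
        {"BlockType": "PAGE", "Page": 1},
        {"BlockType": "LINE", "Page": 1, "Text": "Invoice 1024"},
        {"BlockType": "WORD", "Page": 1, "Text": "Invoice"},
        {"BlockType": "LINE", "Page": 1, "Text": "Total: 42.00"},
    ],
    "DocumentMetadata": {"Pages": 1},
}
for page_number, lines in lines_by_page(response_sample).items():
    print(page_number, lines)
```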

Detecting Document Text with Amazon Textract

To detect text in a document, you use the DetectDocumentText operation, and pass a document file as input. DetectDocumentText returns a JSON structure that contains lines and words of detected text, the location of the text in the document, and the relationships between detected text.

You can provide an input document as an image byte array (base64-encoded image bytes) or as an Amazon S3 object. In this procedure, you upload an image file to your S3 bucket and specify the file name.

AWS CLI

This AWS CLI command displays the JSON output for the detect-document-text CLI operation.

Replace the values of Bucket and Name with the names of your Amazon S3 bucket and the document that you uploaded to it.

aws textract detect-document-text \
  --document '{"S3Object":{"Bucket":"bucket","Name":"document"}}'

Analyzing Document Text with Amazon Textract

To analyze the text in a document, you use the AnalyzeDocument operation, and pass a document file as input. AnalyzeDocument returns a JSON structure that contains the analyzed text.

You can provide an input document as an image byte array (base64-encoded image bytes) or as an Amazon S3 object. In this procedure, you upload an image file to your S3 bucket and specify the file name.

AWS CLI

This AWS CLI command displays the JSON output for the analyze-document CLI operation.

Replace the values of Bucket and Name with the names of your Amazon S3 bucket and the document that you uploaded to it.

aws textract analyze-document \
  --document '{"S3Object":{"Bucket":"bucket","Name":"document"}}' \
  --feature-types '["TABLES","FORMS"]'
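Once AnalyzeDocument returns, form fields arrive as KEY_VALUE_SET blocks that have to be stitched together. The sketch below is one way to do that, assuming the usual layout: a KEY block links to its VALUE block through a VALUE relationship, and both link to their WORD blocks through CHILD relationships. The function names and sample blocks are hypothetical.

```python
def _child_text(block, by_id):
    """Join the text of a block's CHILD words (or selection status)."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id.get(child_id, {})
            if child.get("BlockType") == "WORD":
                parts.append(child["Text"])
            elif child.get("BlockType") == "SELECTION_ELEMENT":
                parts.append(child.get("SelectionStatus", ""))
    return " ".join(parts)


def extract_key_values(blocks):
    """Return {key_text: value_text} for every KEY block in the response."""
    by_id = {block["Id"]: block for block in blocks}
    pairs = {}
    for block in blocks:
        if block.get("BlockType") != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        values = [
            _child_text(by_id[value_id], by_id)
            for rel in block.get("Relationships", [])
            if rel["Type"] == "VALUE"
            for value_id in rel["Ids"]
            if value_id in by_id
        ]
        pairs[_child_text(block, by_id)] = " ".join(values)
    return pairs


# Hypothetical blocks for a form field "Name: Jane Doe".
form_sample = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Jane"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
]
print(extract_key_values(form_sample))
```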

Detecting and Analyzing Text in Multipage Documents

Amazon Textract can detect and analyze the text in multipage documents that are in PDF format. Multipage document processing is an asynchronous operation. Asynchronous processing of documents is useful for processing large, multipage documents. For example, a PDF file with over 1,000 pages takes a while to process. Processing the PDF file asynchronously allows your application to complete other tasks while it waits for the process to complete.

Multipage documents must be in PDF format. Single-page documents processed with asynchronous operations can be in JPEG, PNG, or PDF format.

Calling Amazon Textract Asynchronous Operations

Amazon Textract provides an asynchronous API that you can use to process multi-page documents in PDF format. You can also use asynchronous operations to process single-page documents that are in JPEG, PNG, or PDF format.

Amazon Textract asynchronously processes a document that’s stored in an Amazon S3 bucket. You start processing by calling a Start operation, such as StartDocumentTextDetection. The completion status of the request is published to an Amazon Simple Notification Service (Amazon SNS) topic. To get the completion status from the Amazon SNS topic, you can use an Amazon Simple Queue Service (Amazon SQS) queue or an AWS Lambda function. After you have the completion status, you call a Get operation, such as GetDocumentTextDetection, to get the results of the request.

Starting Text Detection

You start an Amazon Textract text detection request by calling StartDocumentTextDetection. The following is an example of a JSON request that’s passed to StartDocumentTextDetection.

{
  "DocumentLocation": {
    "S3Object": {
      "Bucket": "bucket",
      "Name": "image.pdf"
    }
  },
  "ClientRequestToken": "DocumentDetectionToken",
  "NotificationChannel": {
    "SNSTopicArn": "arn:aws:sns:us-east-1:nnnnnnnnnn:topic",
    "RoleArn": "arn:aws:iam::nnnnnnnnnn:role/roleopic"
  },
  "JobTag": "Receipt"
}

The response to the StartDocumentTextDetection operation is a job identifier (JobId). Use JobId to track requests and get the analysis results after Amazon Textract has published the completion status to the Amazon SNS topic. The following is an example:

{"JobId": "270c1cc5e1d0ea2fbc59d97cb69a72a5495da75851976b14a1784ca90fc180e3"}
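In Python, the same request can be sketched with boto3 as shown below. The function name is hypothetical, `client` is assumed to be a boto3 Textract client, and the ARNs are whatever you noted while setting up the SNS topic and service role.

```python
def start_text_detection(client, bucket, name, sns_topic_arn, role_arn,
                         job_tag=None, client_request_token=None):
    """Start an asynchronous text detection job and return its JobId."""
    request = {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": name}},
        "NotificationChannel": {
            "SNSTopicArn": sns_topic_arn,
            "RoleArn": role_arn,
        },
    }
    # Both fields are optional, so only send them when provided.
    if job_tag:
        request["JobTag"] = job_tag
    if client_request_token:
        request["ClientRequestToken"] = client_request_token
    response = client.start_document_text_detection(**request)
    return response["JobId"]
```

Keep the returned JobId: it is what ties the later completion notification and the Get call back to this request.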

Getting the Completion Status of an Amazon Textract Analysis Request

Amazon Textract sends an analysis completion notification to the registered Amazon SNS topic. The notification includes the job identifier and the completion status of the operation in a JSON string. A successful text detection request has a SUCCEEDED status. For example, the following result shows the successful processing of a text detection job.

{
  "JobId": "642492aea78a86a40665555dc375ee97bc963f342b29cd05030f19bd8fd1bc5f",
  "Status": "SUCCEEDED",
  "API": "StartDocumentTextDetection",
  "JobTag": "Receipt",
  "Timestamp": 1543599965969,
  "DocumentLocation": {
    "S3ObjectName": "document",
    "S3Bucket": "bucket"
  }
}
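If you read the notification through an Amazon SQS queue, a polling loop along the following lines is one option. This is a sketch with some assumptions: `sqs` is a boto3 SQS client, the queue is subscribed to the topic without raw message delivery (so each SQS body is an SNS envelope whose Message field holds the JSON above), and the function names and poll limit are hypothetical.

```python
import json


def parse_completion_message(sqs_body):
    """Extract (JobId, Status) from an SNS envelope delivered through SQS."""
    envelope = json.loads(sqs_body)
    notification = json.loads(envelope["Message"])
    return notification["JobId"], notification["Status"]


def wait_for_job(sqs, queue_url, job_id, max_polls=30):
    """Poll the queue until a notification for job_id arrives.

    Returns the job status, or None if nothing arrived within max_polls.
    """
    for _ in range(max_polls):
        response = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling
        )
        for message in response.get("Messages", []):
            found_id, status = parse_completion_message(message["Body"])
            if found_id != job_id:
                continue  # leave notifications for other jobs on the queue
            sqs.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=message["ReceiptHandle"],
            )
            return status
    return None
```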

Getting Amazon Textract Text Detection Results

To get the results of a text detection request, first ensure that the completion status that’s retrieved from the Amazon SNS topic is SUCCEEDED. Then call GetDocumentTextDetection, passing the JobId value that’s returned from StartDocumentTextDetection. The request JSON is similar to the following example:

{
  "JobId": "270c1cc5e1d0ea2fbc59d97cb69a72a5495da75851976b14a1784ca90fc180e3",
  "MaxResults": 10,
  "SortBy": "TIMESTAMP"
}

The GetDocumentTextDetection response JSON is similar to the following. The total number of pages that are detected is available from DocumentMetadata. The detected text is returned in the Blocks array.

{
  "DocumentMetadata": {
    "Pages": 1
  },
  "JobStatus": "SUCCEEDED",
  "Blocks": [
    {
      "BlockType": "PAGE",
      "Geometry": {
        "BoundingBox": {
          "Width": 1.0,
          "Height": 1.0,
          "Left": 0.0,
          "Top": 0.0
        },
        "Polygon": [
          {"X": 0.0, "Y": 0.0},
          {"X": 1.0, "Y": 0.0},
          {"X": 1.0, "Y": 1.0},
          {"X": 0.0, "Y": 1.0}
        ]
      },
      "Id": "64533157-c47e-401a-930e-7ca1bb3ac3fa",
      "Relationships": [
        {
          "Type": "CHILD",
          "Ids": [
            "4297834d-dcb1-413b-8908-3b96866ebbb5",
            "1d85ba24-2877-4d09-b8b2-393833d769e9",
            "193e9c47-fd87-475a-ba09-3fda210d8784",
            "bd8aeb62-961b-4b47-b78a-e4ed9eeecd0f"
          ]
        }
      ],
      "Page": 1
    }
  ]
}
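For longer documents the results come back in pages, with a NextToken in each response that has more to fetch. The sketch below collects all blocks for a job under that assumption; `client` is assumed to be a boto3 Textract client, and the function name is hypothetical.

```python
def get_all_blocks(client, job_id, page_size=1000):
    """Fetch every Block for a finished text detection job."""
    blocks = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id, "MaxResults": page_size}
        if next_token:
            kwargs["NextToken"] = next_token
        response = client.get_document_text_detection(**kwargs)
        if response["JobStatus"] != "SUCCEEDED":
            raise RuntimeError(f"Job {job_id} is {response['JobStatus']}")
        blocks.extend(response.get("Blocks", []))
        # A missing NextToken means this was the last page of results.
        next_token = response.get("NextToken")
        if not next_token:
            return blocks
```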

Configuring Amazon Textract for Asynchronous Operations

The following procedures show you how to configure Amazon Textract for use with an Amazon SNS topic and an Amazon SQS queue.

  1. Set up an AWS account to access Amazon Textract and create an IAM user. Ensure that the user has at least the following permissions:
  • AmazonTextractFullAccess
  • AmazonS3ReadOnlyAccess
  • AmazonSNSFullAccess
  • AmazonSQSFullAccess
  2. Install and configure the required AWS SDK.
  3. Create an Amazon SNS topic. Prepend the topic name with AmazonTextract. Note the topic Amazon Resource Name (ARN). Ensure that the topic is in the same Region as the AWS endpoint that you’re using.
  4. Create an Amazon SQS standard queue by using the Amazon SQS console. Note the queue ARN.
  5. Subscribe the queue to the topic you created in step 3.
  6. Permit the Amazon SNS topic to send messages to the Amazon SQS queue.
  7. Create an IAM service role to give Amazon Textract access to your Amazon SNS topics. Note the Amazon Resource Name (ARN) of the service role.
  8. Add the following inline policy to the IAM user that you created in step 1:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "MySid",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:Service role ARN from step 7"
    }
  ]
}
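The topic, queue, subscription, and queue permission steps can also be scripted. The sketch below is one possible approach with boto3, not the only one: `sns` and `sqs` are assumed to be boto3 SNS and SQS clients, the function names are hypothetical, and the queue policy shown is a common pattern for letting a single topic send messages to a queue.

```python
import json


def build_queue_policy(queue_arn, topic_arn):
    """Policy that lets one SNS topic send messages to one SQS queue."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "sns.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
        }],
    }


def create_topic_and_queue(sns, sqs, topic_name, queue_name):
    """Create the topic and queue, subscribe the queue, and permit delivery."""
    if not topic_name.startswith("AmazonTextract"):
        raise ValueError("Topic name must start with 'AmazonTextract'")
    topic_arn = sns.create_topic(Name=topic_name)["TopicArn"]
    queue_url = sqs.create_queue(QueueName=queue_name)["QueueUrl"]
    queue_arn = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]
    sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)
    sqs.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={"Policy": json.dumps(build_queue_policy(queue_arn, topic_arn))},
    )
    return topic_arn, queue_url, queue_arn
```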

Giving Amazon Textract Access to Your Amazon SNS Topic

Amazon Textract needs permission to send the completion status of an asynchronous operation to your Amazon SNS topic. You use an IAM service role to give Amazon Textract access to the Amazon SNS topic. When you create the Amazon SNS topic, you must prepend the topic name with AmazonTextract — for example, AmazonTextractMyTopicName.

  1. Sign in to the IAM console (https://console.aws.amazon.com/iam).
  2. In the navigation pane, choose Roles.
  3. Choose Create role.
  4. For Select type of trusted entity, choose AWS service.
  5. For Choose the service that will use this role, choose EC2.
  6. Choose Next: Permissions.
  7. For Attach permissions policies, select the check box next to the policy AmazonTextractServiceRole. To display the policy in the list, enter part of the policy name in the Filter policies query filter.
  8. Choose Next: Tags.
  9. You don’t need to add tags, so choose Next: Review.
  10. In the Review section, for Role name, enter a name for the role (for example, TextractRole). In Role description, update the description for the role, and then choose Create role.
  11. Choose the new role to open the role’s details page.
  12. In the Summary, copy the Role ARN value and save it.
  13. Choose Trust relationships.
  14. Choose Edit trust relationship, and update the trust policy as follows:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "textract.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
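If you prefer to script the role instead of using the console, a sketch along these lines creates it with the trust policy already in place. It assumes `iam` is a boto3 IAM client; the function names are hypothetical, and `policy_arn` is a parameter you supply with the ARN of the AmazonTextractServiceRole policy from your account.

```python
import json


def build_textract_trust_policy():
    """Trust policy that lets the Textract service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "textract.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def create_textract_role(iam, role_name, policy_arn):
    """Create the service role, attach the policy, and return the role ARN."""
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(build_textract_trust_policy()),
    )
    iam.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
    return role["Role"]["Arn"]
```

The returned ARN is the value to pass as RoleArn in the NotificationChannel of a Start request.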

Conclusion

Amazon Textract is a strong fit for scenarios like these. Because it is backed by machine learning, it removes most of the manual intervention that document processing otherwise requires, which lets you automate the entire process.
