Form 16 and Amazon Textract

Mani
Jul 22, 2019

It is that time of the year in India: taxes have to be filed by July 31st, 2019! Imagine you receive your Form 16 as a PDF document and want to extract the important data elements from it. How would you do it? Now imagine you are building an application that has to parse hundreds or thousands of scanned documents that users have uploaded. How would you do that?

From https://aws.amazon.com/textract/: “Amazon Textract is a service that automatically extracts text and data from scanned documents. Amazon Textract goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables.”

I managed to get hold of a simple PDF that looks similar to a Form 16:

Sample Form 16 form with dummy data

Amazon Textract is available in US East (N. Virginia and Ohio), US West (Oregon) and EU (Ireland) as of today (July 22nd, 2019). Check your specific data and security compliance requirements to see whether this service meets them.

Like any other AWS service, Textract can be accessed either via the AWS Management Console or by calling the API. Your document must be in JPEG, PNG or PDF format.
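To give a feel for the API route, here is a minimal sketch using boto3, the AWS SDK for Python. It is an illustration and not what the console does internally. The file name `form16-sample.png` and the helper names are my own assumptions, and the sketch assumes a PNG or JPEG image; check the documentation for how each operation handles PDFs.

```python
def analyze_form(client, document_bytes):
    """Send one document image to Textract and return the raw response.

    `client` is expected to be a boto3 Textract client. FORMS and TABLES
    ask for key-value pairs and table cells on top of plain text.
    """
    return client.analyze_document(
        Document={"Bytes": document_bytes},
        FeatureTypes=["FORMS", "TABLES"],
    )


def count_block_types(response):
    """Count how many blocks of each type (PAGE, LINE, WORD, ...) came back."""
    counts = {}
    for block in response.get("Blocks", []):
        block_type = block["BlockType"]
        counts[block_type] = counts.get(block_type, 0) + 1
    return counts


def run_example(path="form16-sample.png", region="us-east-1"):
    """Hypothetical end-to-end call; needs AWS credentials and a real file."""
    import boto3  # imported here so the helpers above work without boto3

    client = boto3.client("textract", region_name=region)
    with open(path, "rb") as handle:
        response = analyze_form(client, handle.read())
    return count_block_types(response)
```

Calling `run_example()` with valid credentials should return a dictionary of block counts for the page. Everything Textract finds comes back as a flat list of blocks, which is why counting them by type is a handy first look.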

When called via the AWS Management Console, the service quickly and accurately extracts data from documents, forms, and tables. It automatically detects a document’s layout and the key elements on the page, understands the data relationships in any embedded forms or tables, and extracts everything with its context intact. From the Amazon Textract documentation: “Amazon Textract’s pre-trained machine learning models eliminate the need to write code for data extraction, because they have already been trained on tens of millions of documents from virtually every industry, including contracts, tax documents, sales orders, enrollment forms, benefit applications, insurance claims, policy documents and many more.”
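Those "data relationships" show up in the API response as links between blocks: a KEY block points to its VALUE block, and both point to the WORD blocks that hold their text. The sketch below walks those links to build a simple dictionary of form fields. The sample response is hand-made dummy data, not real Textract output, and the helper names are my own.

```python
def _child_text(block, blocks_by_id):
    """Join the text of a block's WORD children (or checkbox status)."""
    parts = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] != "CHILD":
            continue
        for child_id in relationship["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
    return " ".join(parts)


def extract_key_values(response):
    """Map each detected form key to the text of its linked value."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        values = []
        for relationship in block.get("Relationships", []):
            if relationship["Type"] == "VALUE":
                for value_id in relationship["Ids"]:
                    values.append(_child_text(blocks_by_id[value_id], blocks_by_id))
        pairs[_child_text(block, blocks_by_id)] = " ".join(v for v in values if v)
    return pairs


# Hand-made dummy response in the shape of Textract's block list.
SAMPLE_RESPONSE = {"Blocks": [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Employee"},
    {"Id": "w2", "BlockType": "WORD", "Text": "PAN:"},
    {"Id": "w3", "BlockType": "WORD", "Text": "ABCDE1234F"},
]}

print(extract_key_values(SAMPLE_RESPONSE))
```

A real response also carries confidence scores and bounding boxes on every block, which this sketch ignores for brevity.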

Amazon Textract — AWS Management Console

There are also several useful utilities out there. One that I found is Textractor: https://github.com/aws-samples/amazon-textract-textractor

When I ran the following command, I obtained the JSON and CSV versions of the document using Amazon Textract. The tool can also use other AWS AI services such as Amazon Comprehend, Amazon Comprehend Medical and Amazon Translate to generate insights or translate the detected text!

You get extracted data and insights (because I selected the -insights option, which uses Amazon Comprehend) for every page.
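To show roughly how a table in the JSON output can become CSV, here is a small sketch. It is not Textractor's implementation, only an illustration of the block structure: a TABLE block lists CELL children, each CELL carries a RowIndex and ColumnIndex, and the cell text lives in WORD children. The function names and the dummy rows in the usage note are my own.

```python
import csv
import io


def table_to_rows(table_block, blocks_by_id):
    """Rebuild a table as a list of rows from its CELL children."""
    cells = {}
    for relationship in table_block.get("Relationships", []):
        if relationship["Type"] != "CHILD":
            continue
        for cell_id in relationship["Ids"]:
            cell = blocks_by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                blocks_by_id[word_id]["Text"]
                for rel in cell.get("Relationships", [])
                if rel["Type"] == "CHILD"
                for word_id in rel["Ids"]
                if blocks_by_id[word_id]["BlockType"] == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based positions in the table.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
    if not cells:
        return []
    row_count = max(row for row, _ in cells)
    column_count = max(column for _, column in cells)
    return [
        [cells.get((row, column), "") for column in range(1, column_count + 1)]
        for row in range(1, row_count + 1)
    ]


def rows_to_csv(rows):
    """Serialise rows to CSV text using the standard csv module."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def first_table_as_csv(response):
    """Return the first TABLE in a response as CSV text, or '' if none."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    for block in response["Blocks"]:
        if block["BlockType"] == "TABLE":
            return rows_to_csv(table_to_rows(block, blocks_by_id))
    return ""
```

For a two-column salary table, `table_to_rows` would produce rows like `["Particulars", "Amount"]` followed by `["Gross Salary", "500000"]` (dummy values), which `rows_to_csv` then writes out one line per row.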

Summary

Bottom line: whether it's Form 16 or other documents, Amazon Textract provides OCR and structured data extraction (forms and tables) at very low cost, using a pay-as-you-go model. By using the APIs, we can easily process millions of documents and leverage other AWS services to create an end-to-end solution!
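For bulk processing, the usual route is Textract's asynchronous operations, which read documents from Amazon S3. The sketch below starts a job, polls until it finishes, and then follows `NextToken` to collect every page of results. The bucket and key you pass in are your own, the helper names are hypothetical, and polling is used here only to keep the example short; a real pipeline would more likely rely on the completion notification that Textract can publish to Amazon SNS.

```python
import time


def start_job(client, bucket, key):
    """Start an asynchronous analysis of a document stored in S3."""
    response = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS", "TABLES"],
    )
    return response["JobId"]


def collect_blocks(client, job_id, poll_seconds=5, sleep=time.sleep):
    """Wait for the job to finish, then gather blocks from all result pages."""
    while True:
        page = client.get_document_analysis(JobId=job_id)
        status = page["JobStatus"]
        if status == "IN_PROGRESS":
            sleep(poll_seconds)
            continue
        if status != "SUCCEEDED":
            raise RuntimeError(f"Textract job {job_id} ended with status {status}")
        break

    blocks = list(page.get("Blocks", []))
    # Large documents are returned in chunks; keep asking until no token is left.
    while page.get("NextToken"):
        page = client.get_document_analysis(JobId=job_id, NextToken=page["NextToken"])
        blocks.extend(page.get("Blocks", []))
    return blocks
```

With a boto3 Textract client, `collect_blocks(client, start_job(client, bucket, key))` would return the full block list for one document, ready to be passed on to other AWS services.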

And remember, July 31st, 2019 is the deadline to file taxes in India ;-)

Written by Mani

Principal Solutions Architect at AWS India, and I blog/post about interesting stuff that I am curious about and which is relevant to developers & customers.
