How To Process PDFs And Documents With AWS Comprehend And GCP

October 10, 2020
Photo by Eric Krull on Unsplash.

Parsing and processing documents can provide a lot of value for almost every department in a company. This is one of the many use cases where natural language processing (or NLP) can come in handy.

NLP is not just for chatbots and Gmail predicting what you are going to write in your email. NLP can also be used to help break down, categorize, and analyze documents automatically. For example, perhaps your company is looking to find relationships across all of your contracts, or you’re trying to categorize what blog posts or movie scripts are about.

This is where some form of NLP could come in very handy. It can help pull subjects, repeated themes, pronouns, and more out of a document.

Now the question is, how do you do it?

Should you develop a custom neural network from scratch that will break down sentences, words, meaning, sentiment, etc.?

This is probably not the best solution, at least not for your initial MVP. Instead, there are lots of libraries and cloud services that can be used to help break down documents.

In this article, we are going to look at three options and how you can implement these tools to analyze documents with Python. We are going to look into AWS Comprehend, GCP Natural Language, and TextBlob.


AWS Comprehend

AWS Comprehend is one of many cloud services from AWS that let your team take advantage of neural networks and other models without the complexity of building your own.

In this case, AWS Comprehend is an NLP API that can make it very easy to process text.

What is great about AWS Comprehend is that it will automatically break down concepts such as the entities, key phrases, and syntax involved in a document. Entities are particularly helpful if you are trying to work out which events, organizations, persons, or products are referenced in a document.

There are plenty of Python libraries that make it easy to break down nouns, verbs, and other parts of speech. However, those libraries aren’t built to label which category each of those nouns falls into.

Let’s look at an example.

For all the code examples in this article, we will be using the text below:

We will take that text example and run it through the code below where the variable plain-text is:
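As a minimal sketch of what such a call can look like, the snippet below sends a string to Comprehend's detect_entities operation through boto3. It is not the article's original code. The function name, the region, and the optional client parameter are assumptions made for illustration, and it assumes AWS credentials are already configured on your machine. The variable plain_text stands in for the example text.

```python
def detect_entities(plain_text, client=None, language_code="en"):
    """Return the entities AWS Comprehend finds in plain_text as simple dicts."""
    if client is None:
        # Imported lazily so the helper can also be run with a stub client.
        import boto3

        # The region is an assumption; use whichever region your account uses.
        client = boto3.client("comprehend", region_name="us-east-1")

    # detect_entities takes the raw text and a language code.
    response = client.detect_entities(Text=plain_text, LanguageCode=language_code)

    # Keep only the fields we care about from each entity in the response.
    return [
        {"text": e["Text"], "type": e["Type"], "score": round(e["Score"], 3)}
        for e in response["Entities"]
    ]


# Example usage (requires AWS credentials, so it is left commented out):
# plain_text = "..."  # your document text goes here
# for entity in detect_entities(plain_text):
#     print(entity)
```

Comprehend limits how much text a single request can carry, so longer documents may need to be split into chunks first. Check the current limit in the AWS documentation.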

 

AWS Comprehend output

Once you run the code above, you will get an output like the one below. This is a shortened version, but it still shows how the entities are labeled. For example, QUANTITY was used to label 30 minutes and five years, both of which are quantities of time:

 

As you can see, AWS Comprehend does a great job of breaking down organizations and other entities. Again, it is not limited to only breaking down entities. However, this feature is one of the more useful ones when attempting to look for relationships between documents.
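One hedged way to start looking for relationships between documents is to compare the entities each one contains. The sketch below assumes you already have two lists of entities in the shape Comprehend returns (dicts with Type, Text, and Score keys). The helper names, the 0.8 score threshold, and the sample entries are all illustrative assumptions, and the sample data is hand-written, not real service output.

```python
def entity_keys(entities, min_score=0.8):
    """Build a set of (type, lowercased text) pairs for confident entities."""
    return {
        (e["Type"], e["Text"].lower())
        for e in entities
        if e["Score"] >= min_score
    }


def shared_entities(entities_a, entities_b, min_score=0.8):
    """Return the entities that appear in both documents, sorted for stable output."""
    return sorted(entity_keys(entities_a, min_score) & entity_keys(entities_b, min_score))


# Hand-written samples in the response shape; the scores are placeholders.
doc_a = [
    {"Type": "QUANTITY", "Text": "30 minutes", "Score": 0.99},
    {"Type": "QUANTITY", "Text": "five years", "Score": 0.95},
]
doc_b = [
    {"Type": "QUANTITY", "Text": "Five years", "Score": 0.91},
]

print(shared_entities(doc_a, doc_b))
```

Lowercasing the text is a simple normalization choice. Real contracts or scripts would likely need more careful matching, for example treating a company's short and long names as the same entity.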


GCP Natural Language

Google has created a very similar NLP cloud service called Cloud Natural Language.

It offers a lot of similar features, including entity detection, custom entity detection, content classification, and more.

Let’s use GCP’s version of natural language processing on a string. The code below shows an example of using GCP to detect entities:
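A rough sketch of entity detection with the google-cloud-language client library might look like the following. It is not the article's original code. It assumes version 2 or later of the library and that application credentials are already set up (for example through the GOOGLE_APPLICATION_CREDENTIALS environment variable). The function names are made up for illustration.

```python
def summarize_entities(entities):
    """Flatten the entity objects from the API response into plain dicts."""
    return [
        {
            "name": entity.name,
            "type": entity.type_.name,  # enum member name, e.g. ORGANIZATION
            "salience": round(entity.salience, 3),
            "metadata": dict(entity.metadata),  # may hold extra related info
        }
        for entity in entities
    ]


def analyze_entities(plain_text, client=None):
    """Send plain_text to Cloud Natural Language and return its entities."""
    # Imported lazily so summarize_entities works without the library installed.
    from google.cloud import language_v1

    if client is None:
        client = language_v1.LanguageServiceClient()

    document = language_v1.Document(
        content=plain_text,
        type_=language_v1.Document.Type.PLAIN_TEXT,
    )
    response = client.analyze_entities(request={"document": document})
    return summarize_entities(response.entities)


# Example usage (requires GCP credentials, so it is left commented out):
# for entity in analyze_entities(plain_text):
#     print(entity)
```

Older releases of the library used a different import style, so adjust the imports if you are pinned to a version from before 2.0.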

GCP Natural Language output

The GCP output is similar to that of AWS Comprehend. However, you will notice that GCP also breaks down similar words and tries to find metadata that is related to the original word:

 

TextBlob And Python

Besides using cloud service providers, there are libraries that can also extract information from documents. In particular, the TextBlob library in Python is very useful. Personally, it was the first library I learned to develop NLP pipelines with.

It is far from perfect. However, it does a great job of parsing through documents.

It offers part-of-speech tagging like AWS Comprehend and GCP Natural Language, as well as sentiment analysis. However, on its own, it won’t categorize what entities exist.

It is still a great tool to break down the basic word types.

Using this library, a developer can break down verbs, nouns, or other parts of speech and then look for patterns. What words are commonly used? Which specific phrases or words are attracting readers? Which words commonly appear alongside particular nouns?
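To make the "what words are commonly used?" question concrete, here is a small sketch that counts the most frequent nouns in a list of (word, tag) pairs, which is the shape a part-of-speech tagger such as TextBlob produces. The function name and the sample pairs are assumptions for illustration, and the tags follow the Penn Treebank convention, where noun tags start with NN.

```python
from collections import Counter


def most_common_by_tag(tagged_words, tag_prefix="NN", top_n=5):
    """Count lowercased words whose tag starts with tag_prefix (NN = nouns)."""
    counts = Counter(
        word.lower()
        for word, tag in tagged_words
        if tag.startswith(tag_prefix)
    )
    return counts.most_common(top_n)


# Hand-written sample pairs, not real tagger output.
tagged = [
    ("Data", "NNS"),
    ("pipelines", "NNS"),
    ("move", "VBP"),
    ("data", "NNS"),
]

print(most_common_by_tag(tagged))
```

Passing tag_prefix="VB" would count verbs instead, which is one way to start comparing how different documents are written.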

There are still a lot of questions you can answer and products you can develop depending on your end goal.

Implementing the TextBlob library is very simple.

No need to connect to an API in this case. All you will need to do is import the library and call a few classes.

This is shown in the code below:
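A minimal sketch of that idea, assuming TextBlob is installed (pip install textblob) and its corpora have been downloaded (python -m textblob.download_corpora), could look like this. It is not the article's original code. The function name is made up, and the blob_cls parameter exists only so the sketch can be tried with a stand-in class when the library is not available.

```python
def analyze_with_textblob(plain_text, blob_cls=None):
    """Return part-of-speech tags, noun phrases, and sentiment for plain_text."""
    if blob_cls is None:
        # Imported lazily; blob_cls lets you pass a stand-in class instead.
        from textblob import TextBlob

        blob_cls = TextBlob

    blob = blob_cls(plain_text)
    return {
        "tags": list(blob.tags),  # (word, part-of-speech tag) pairs
        "noun_phrases": list(blob.noun_phrases),
        "polarity": blob.sentiment.polarity,  # -1.0 (negative) to 1.0 (positive)
        "subjectivity": blob.sentiment.subjectivity,  # 0.0 to 1.0
    }


# Example usage (requires textblob and its corpora, so it is left commented out):
# result = analyze_with_textblob(plain_text)
# print(result["tags"])
# print(result["noun_phrases"])
```

Everything runs locally, so there are no credentials to configure and no per-request cost.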

TextBlob output

Here is the output of TextBlob. You will see many of the same words that AWS and GCP pulled out. However, there isn’t all the extra labeling and metadata that come with the APIs. That’s what you are paying for (amongst a few other helpful features) with both AWS and GCP:

 

And with that, we have covered three different ways you can use NLP on your documents or raw text.


NLP Doesn’t Have to Be Hard — Sort Of

NLP is a great tool to help process documents, categorize them, and look for relationships. Thanks to AWS and GCP, many less technical developers can take advantage of some NLP features.

That being said, there are a lot of hard aspects to NLP. For example, having to develop chatbots that are good at tracking conversations and context isn’t an easy task. In fact, there is a great series here on Medium where Adam Geitgey covers just that. You can read more in the article Natural Language Processing Is Fun.

Good luck with whatever your next NLP project is.

If you would like to read more about data science and data engineering, check out the articles and videos below.

4 SQL Tips For Data Scientists

How To Analyze Data Without Complicated ETLs For Data Scientists

What Is A Data Warehouse And Why Use It

Kafka Vs RabbitMQ

SQL Best Practices — Designing An ETL Video

5 Great Libraries To Manage Big Data With Python