The results of Google Cloud Natural Language analyses are shown in text content asset pages.


What's new

censhare 2020.1: There are no longer any requirements regarding the size of the files to be analyzed. This differs from prior censhare releases.

Prerequisites

Google Cloud Natural Language is a service provided by Google. The use of this Google service can result in additional costs that are invoiced directly by Google. censhare cannot influence or control these costs and therefore cannot be held responsible for them.

Context

"Google Cloud Natural Language" is a cloud service by Google for text analysis. censhare uses this service to analyze text content stored in your system. The service returns the content categories and the entities, such as persons or companies, that are relevant to the text.

Prerequisites

  • For asset automation, the automatic server action for Google Cloud Natural Language must be configured and executed regularly on the system.

  • For manual execution, the manual server action must be configured.

  • To be able to trigger the analyses from the web client, the server action must be enabled.

  • The Google Cloud Natural Language functionality is only available for certain text content assets.

Introduction

Google Cloud Natural Language is a cloud service for text analysis. censhare uses this service to find content categories in text. Google Cloud Natural Language returns a list of known entities that are mentioned in the text. For example, entities can be public persons or companies. If Google finds a Wikipedia web page for an entity, this page is linked accordingly in censhare.

censhare analyzes the following content assets:

  • "Text" assets with the following file content: plain, ICML or XML text

  • "Text" assets with the DOCX file content

  • "Image" assets with PDF MIME-type

  • "PDF" assets with PDF MIME-type

The analysis can be triggered automatically by the server automation or started manually. The results are presented in the corresponding content asset page.

The asset page shows a list of the content categories that were found. It also shows the confidence value that Google calculates for each result. The confidence value indicates how certain Google is that the content category matches the analyzed text.

A similar value is used for the found entities: the salience. Calculated by Google, the salience value represents the importance of the entity for the text.

Depending on the text, Google Cloud Natural Language can find many content categories or entities. Not all of them may be of interest in your case, especially if the respective confidence and salience values are very low. Therefore, censhare applies thresholds for the confidence and salience values and filters out results that are lower than a predefined value. As a result, the content asset page may show fewer results than Google delivered. The confidence and salience thresholds are defined by an administrator. Please contact your administrator with any further questions.
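
The threshold filtering described above can be illustrated with a minimal Python sketch. It is not censhare's implementation. The threshold values, the function name and the dictionary layout are assumptions made for illustration, and the handling of a score that exactly equals the threshold is also an assumption.

```python
# Hypothetical threshold values; in censhare they are defined by an administrator.
CONFIDENCE_THRESHOLD = 0.5
SALIENCE_THRESHOLD = 0.05


def filter_results(categories, entities,
                   min_confidence=CONFIDENCE_THRESHOLD,
                   min_salience=SALIENCE_THRESHOLD):
    """Drop categories and entities whose score is lower than the threshold."""
    kept_categories = [c for c in categories if c["confidence"] >= min_confidence]
    kept_entities = [e for e in entities if e["salience"] >= min_salience]
    return kept_categories, kept_entities


# Made-up analysis results, shaped like simplified service responses.
categories = [
    {"name": "/Arts & Entertainment/Music & Audio", "confidence": 0.91},
    {"name": "/News", "confidence": 0.32},
]
entities = [
    {"name": "Example Corp", "type": "ORGANIZATION", "salience": 0.61},
    {"name": "Monday", "type": "OTHER", "salience": 0.01},
]

kept_categories, kept_entities = filter_results(categories, entities)
```

With these sample values, only the first category and the first entity remain, which is why the asset page can show fewer results than Google delivered.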

To limit the size of the text files to be analyzed (especially with the server automation), censhare can apply a threshold for the text size. This limit exists for performance reasons. If the file of the content asset to be analyzed exceeds this size, the analysis is not executed.

Each entity and content category is stored as an asset in censhare. If an entity or content category is assigned to several text assets, the text assets are linked to the same entity or content category asset. Therefore, if you open such an asset, you will see which content assets share the same content category or mention the same entity.

If you reanalyze a content asset, censhare updates the asset accordingly. This means that content category and entity information is added or removed according to the new results. If the confidence or salience values differ from the results of the prior analysis, censhare also updates the corresponding entries in the content asset. If the threshold values change between two analyses and this changes the number of results, the entries shown in the content asset are updated accordingly.
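
As a rough illustration of what a reanalysis changes, the following Python sketch compares two sets of results. It is only a sketch under assumptions: the function name and the representation of results as a mapping from name to score are hypothetical and do not reflect how censhare stores the data.

```python
def diff_results(previous, current):
    """Compare two analyses, each a dict that maps a result name to its score."""
    # Results that appear only in the new analysis are added.
    added = {name: score for name, score in current.items() if name not in previous}
    # Results that no longer appear are removed.
    removed = sorted(name for name in previous if name not in current)
    # Results present in both analyses with a different score are updated.
    updated = {name: score for name, score in current.items()
               if name in previous and previous[name] != score}
    return added, removed, updated


# Made-up salience values from a first and a second analysis of the same asset.
previous = {"Example Corp": 0.61, "Jane Doe": 0.20}
current = {"Example Corp": 0.55, "Berlin": 0.12}

added, removed, updated = diff_results(previous, current)
```

In this example, "Berlin" is added, "Jane Doe" is removed, and the salience of "Example Corp" is updated.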

Mapping of content categories

Google has a list of content categories that it uses to analyze text. censhare, in turn, has a list of content categories that it supports. These lists can differ because changes on the Google side are not automatically reflected on the censhare side. The content categories in censhare are stored as assets.

When running an analysis, censhare receives a list from Google with all content categories that were found. If a content category found by Google exists as an asset, censhare creates a relation between the content asset and the content category asset. If no content category asset is found for a result from Google, censhare skips the result because it does not create new categories automatically. If a category that Google has found is missing in censhare and you want it added, contact your censhare administrator.
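
The mapping logic described above can be sketched in Python as follows. This is an illustration only: the function name, the lookup table and the asset IDs are hypothetical, and the category names are sample values.

```python
def map_categories(google_categories, category_assets):
    """Relate found categories to existing category assets and skip unknown ones."""
    relations, skipped = [], []
    for category in google_categories:
        asset_id = category_assets.get(category["name"])
        if asset_id is None:
            # No matching content category asset: the result is skipped,
            # because new categories are not created automatically.
            skipped.append(category["name"])
        else:
            relations.append((asset_id, category["confidence"]))
    return relations, skipped


# Hypothetical lookup of existing content category assets (name -> asset ID).
category_assets = {"/Arts & Entertainment/Music & Audio": 1001}

google_categories = [
    {"name": "/Arts & Entertainment/Music & Audio", "confidence": 0.91},
    {"name": "/Hobbies & Leisure", "confidence": 0.64},
]

relations, skipped = map_categories(google_categories, category_assets)
```

Here, one relation is created for the known category, and "/Hobbies & Leisure" is skipped because no matching asset exists.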

The content category assets in censhare have the "Module/Content category" asset type. If you want to see a full list of all supported categories, you can search for this asset type in the "Detail search". Just open the Detail search and select the desired asset type in the "Type" field.

At this time, Google only supports content categories in English.

Create entity assets

censhare creates an entity asset if the salience found is higher than the defined threshold. If an entity asset already exists in censhare, censhare uses the existing entity asset. If there is a Wikipedia web page for an entity, censhare also creates a "Wikipedia web page" asset or uses the existing one.

In censhare, entities can have one of the following asset types:

  • Consumer good

  • Company

  • Event

  • Location

  • Other

  • Person

  • Work of art

Contact your administrator if you want to change or extend that list.
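
The creation and reuse of entity assets can be pictured with the following Python sketch. It rests on several assumptions that are not confirmed by this documentation: the mapping from Google entity types to censhare asset types, the identification of an entity by type and name, the threshold value, and all function and field names are hypothetical.

```python
# Assumed mapping from Google entity types to the asset types listed above.
# Any type that is not listed falls back to "Other".
TYPE_MAP = {
    "CONSUMER_GOOD": "Consumer good",
    "ORGANIZATION": "Company",
    "EVENT": "Event",
    "LOCATION": "Location",
    "PERSON": "Person",
    "WORK_OF_ART": "Work of art",
}


def get_or_create_entity_asset(entity, store, min_salience=0.05):
    """Return the asset for an entity, creating it only if it does not exist yet."""
    if entity["salience"] <= min_salience:
        return None  # Salience is not higher than the threshold: no asset.
    asset_type = TYPE_MAP.get(entity["type"], "Other")
    key = (asset_type, entity["name"])  # Assumed identity of an entity asset.
    if key not in store:
        store[key] = {
            "type": asset_type,
            "name": entity["name"],
            "wikipedia_url": entity.get("wikipedia_url"),
        }
    return store[key]


store = {}  # Stands in for the entity assets that already exist in the system.
entity = {"name": "Example Corp", "type": "ORGANIZATION", "salience": 0.61}

first = get_or_create_entity_asset(entity, store)
second = get_or_create_entity_asset(entity, store)  # Reuses the existing asset.
ignored = get_or_create_entity_asset(
    {"name": "Monday", "type": "DATE", "salience": 0.01}, store)
```

The second call returns the asset created by the first call, and the low-salience entity does not produce an asset.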

Note: Google currently supports 10 languages for entity analysis.


Go to https://cloud.google.com/natural-language/docs/languages to see the full list of languages with the ISO-639-1-Code.

Execute analysis

You can start the Google Cloud Natural Language analysis:

  • from the asset currently open in its asset page

  • from a search result page:

    From here, you can select one or more PDF assets. When you select more than one, you can carry out the action for all of them at once. If one or more of the selected assets do not have the "PDF document" MIME type, the action will not be available.

  1. Select the content asset(s) to be analyzed from a search result list, or open the required asset in an asset page.

  2. From the page actions menu or the asset action menu, select "Others/Google Natural Language API".
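
The availability rule for a selection in a search result page can be expressed as a small Python check. This is a sketch, not censhare code: the function name and the "mime_type" field are hypothetical, and it assumes that the "PDF document" MIME type corresponds to the standard value "application/pdf".

```python
PDF_MIME_TYPE = "application/pdf"  # Assumed value behind the "PDF document" MIME type.


def action_available(selected_assets):
    """The action is offered only if every selected asset is a PDF document."""
    return bool(selected_assets) and all(
        asset["mime_type"] == PDF_MIME_TYPE for asset in selected_assets
    )


# Made-up selection with one PDF asset and one plain text asset.
selection = [
    {"name": "report.pdf", "mime_type": "application/pdf"},
    {"name": "notes.txt", "mime_type": "text/plain"},
]

mixed_ok = action_available(selection)         # Contains a non-PDF asset.
pdf_only_ok = action_available(selection[:1])  # Contains only the PDF asset.
```

A single asset with a different MIME type in the selection is enough to make the action unavailable for the whole selection.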

View results

Content asset

To see the results for an analyzed text, open the asset page of the text asset. There are two property widgets on the Overview tab that show the results:

  • "Mentions" widget

  • "Content category" widget

In the Mentions widget, you see the entities with their categories that Google has found. Google assigns each entity to a category like "Person" or "Event". This allows censhare to group the results by these categories. Next to each entry the salience value is shown that Google has provided.

censhare stores each found entity as an asset in the system. The asset type of this asset is defined by the category that Google assigned. If you click on the name or the type of an entity in the widget, censhare opens the corresponding entity asset. There, you can see whether other text assets share this entity, too.

Note: Google may find more results than the ones presented in the widget. This happens if found entities have a lower salience value than the threshold defined in your system by an administrator.

In the Content category widget, you see the content categories that Google has assigned to the analyzed text asset. Next to each content category title, the widget shows the Confidence value that indicates how much the analyzed text matches the assigned content category.

Each content category is stored as an asset in the system. If you click on the name or the asset type of the content category, censhare opens the corresponding content category asset. There, you can see if there are other text content assets assigned to this content category.

Entity asset

Each entity asset has a "Mentioned in" widget that shows all assets that refer to this entity. This can be a content asset or a Wikipedia web page asset that is about this entity.

Each entry also shows the salience value that the entity has for the analyzed content of the linked asset. To see the related asset, just click on it. censhare will open it in a new asset page.

Content category asset

Each content category asset has a "Content category of" widget. This widget shows all content assets that are assigned to this content category. If you click the asset name or type of an entry, censhare will open this content asset in a new asset page.

Result

The "Google Natural Language" server action is executed for the selected asset(s) and the results are stored in censhare. The results of the analyses are shown in the respective widgets of the current asset.