Overview
The Entity Extraction service of the Text Analysis Entity Extraction package is part of a series of enterprise-grade, natural-language products that eliminate "noise" in unstructured text sources by highlighting salient information. The package's Entity Extraction service identifies an average of 36 entity types per language, including:
People
Name designator (“Ms.”)
Title (“President”)
Person (“Barack Obama”)
People (“Greeks”)
Language (“Greek”)
Places
Address 1 (“53 State Street Floor 16”)
Address 2 (“Boston, MA 02109”)
Locality (“Boston”)
Minor region (“Napa County”)
Major region (“Nevada”)
Country (“Brazil”)
Continent (“South America”)
Geographic feature (“Mount Fuji”)
Geographic area (“Scandinavia”)
Organizations and products
Commercial organization (“AT&T”)
Educational organization (“University of Washington”)
Other organization (“FBI”)
Product (“iPhone”)
Ticker (“NYSE:SAP”)
Times and dates
Day (“Monday”)
Date (“2/14/2016”)
Month (“August”)
Year (“1966”)
Time (“3:47 pm”)
Time period (“3 days”; “from 9 to 5 pm”)
Holiday (“Memorial Day”)
Numbers
Currency (“17 euros”)
Measure (“217 meters”)
Percent (“10%”)
Phone (“610-661-1000”)
National identification number (“555-12-3456”)
Internet
Email address (“john.doe@sap.com”)
IP address (“165.14.2.0”)
URL (“www.sap.com”)
Social media ID (“@SAP”)
Social media topic (“#SAPHANA”)
Use case
An example use of this service is an application that analyzes "hash tag" topics used in social media posts. You crawl social media sites and use simple string matching to save all words that begin with a pound sign (#). After analyzing your database, you realize that you erroneously saved CSS color codes found in HTML pages, for example, #FC3208, as topics. You switch to the Entity Extraction service because it automatically detects HTML pages, removes markup such as CSS syntax, and identifies social media topics. You replace your oversimplified string-matching code with a call to this service, look in the response for entities whose label attribute value is TOPIC_TWITTER, and store the value of each entity's text attribute. See the Python Tutorial for an example of how you might code this.
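The pitfall in this use case can be reproduced in a few lines. This is a minimal sketch and not part of the service: the regular expression and the sample page fragment are illustrative assumptions.

```python
import re

# Hypothetical page fragment: one CSS color code and one real hashtag.
page = '<link rel="mask-icon" color="#262626"><i>#healthylife</i>'

# Naive approach: keep every run of word characters that follows a pound sign.
naive_topics = re.findall(r'#\w+', page)

print(naive_topics)  # ['#262626', '#healthylife']
```

The color code is saved alongside the genuine topic because plain string matching cannot tell markup from text.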
This service supports 14 languages:
- Arabic
- Chinese (Simplified)
- Chinese (Traditional)
- Dutch
- English
- Farsi
- French
- German
- Italian
- Japanese
- Korean
- Portuguese
- Russian
- Spanish
The service accepts input in a wide variety of formats:
- Adobe PDF
- Generic email messages (.eml)
- HTML
- Microsoft Excel
- Microsoft Outlook email messages (.msg)
- Microsoft PowerPoint
- Microsoft Word
- Open Document Presentation
- Open Document Spreadsheet
- Open Document Text
- Plain Text
- Rich Text Format (RTF)
- WordPerfect
- XML
The size of each input file is limited to 1 MB.
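A client can check the size of an input before sending it. The following is a minimal sketch; the helper name fits_size_limit is hypothetical, and it assumes that "1 MB" means 1,048,576 bytes, which this document does not state.

```python
# Assumption: "1 MB" is interpreted as 1024 * 1024 bytes.
MAX_INPUT_BYTES = 1024 * 1024


def fits_size_limit(data):
    """Return True if the raw input bytes are within the assumed 1 MB limit."""
    return len(data) <= MAX_INPUT_BYTES


print(fits_size_limit(b'#strong #healthylife'))  # True
```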
API Reference
/
Identify entities such as people, places, and organizations contained within a document. For a detailed explanation of entity extraction, including which entities are available in which languages, see the Entity Extraction section of the SAP HANA Text Analysis Language Reference Guide.
An empty entities array is normal
It is not an error if the service returns an empty entities array. Not all text contains entities as defined and recognized by the service. For example, the English sentence "It's the end of the world as we know it, and I feel fine" contains no entities, nor do any of these translations:
Spanish: Es el fin del mundo tal como lo conocemos, y me siento bien.
German: Es ist das Ende der Welt, wie wir es kennen, und ich fühle mich gut.
Korean: 우리가 알고있는대로 그것은 세상의 종말이며, 나는 기분이 좋아집니다.
Russian: Это конец света, как мы его знаем, и я чувствую себя прекрасно.
The label and labelPath members
Some of the entity types that the service identifies are given general categories and then subdivided into more specific classifications. For example, URI is a general category with the subtypes EMAIL, IP, and URL. The label member of the entities array is the most specific entity type of an extracted entity. The labelPath member includes the general category and the subtype, separated by a forward slash ("/").
That means, if a document contains the web address "http://www.sap.com" in its text, that string is extracted as an entity with its label attribute's value set to URL and its labelPath set to URI/URL.
If the entity type does not have subtypes, for example, PERSON, the label and labelPath values are identical.
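The relationship between the two members can be illustrated with a small helper. This is a sketch only; the function name split_label_path is hypothetical and not part of the service.

```python
def split_label_path(label_path):
    """Split a labelPath value into (general category, most specific label)."""
    parts = label_path.split('/')
    # For a type without subtypes there is only one part, so both values match.
    return parts[0], parts[-1]


print(split_label_path('URI/URL'))  # ('URI', 'URL')
print(split_label_path('PERSON'))   # ('PERSON', 'PERSON')
```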
Default Language
The default language is either the first value listed in the languageCodes input parameter (see Setting a subset of languages in this topic) or English if the languageCodes input parameter is not specified.
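The rule can be modeled as follows. This is a sketch of the rule only, not of how the parameter is transmitted: the function name default_language is hypothetical, and the two-letter codes are an assumption based on the "language": "en" value in the sample response.

```python
def default_language(language_codes=None):
    """Return the default language implied by a languageCodes list."""
    if language_codes:
        # The first value listed is the default.
        return language_codes[0]
    # No languageCodes given: the default is English.
    return 'en'


print(default_language(['de', 'nl']))  # de
print(default_language())              # en
```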
Setting a subset of languages
You can instruct the service to choose from a specific, reduced set of languages by setting the languageCodes input parameter. This forces the service to choose from one of the languages you supply.
Use this setting with caution. If, for example, you set languageCodes to Danish, German, and Dutch and the input text is in Russian, the service cannot return Russian. It must return the default, which is the first language listed.
Meaning of the textSize value
The returned attribute textSize represents the amount of character data in the input, not the number of bytes. If the input is a plain text file without accented characters, textSize equals the input file's size. However, if the input is a binary file such as a PDF or Microsoft Word document, textSize will probably be much smaller than the file size, especially if the file contains a lot of non-textual data such as an embedded image.
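The difference between characters and bytes can be seen locally, without calling the service. This sketch assumes the text is encoded as UTF-8; it only compares the two counts and does not reproduce how the service computes textSize.

```python
# A sentence with one accented character ("ü").
text = 'Es ist das Ende der Welt, und ich fühle mich gut.'

char_count = len(text)                  # number of characters
byte_count = len(text.encode('utf-8'))  # number of bytes when stored as UTF-8

# In UTF-8, "ü" occupies two bytes, so the byte count is one higher.
print(char_count, byte_count)
```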
Annotated JSON schema
The JSON schema contains the descriptions of the objects and members of the JSON response that the Entity Extraction service returns. To read the schema, click the POST link in the API Reference, then click the RESPONSE tab.
Further references
You can find extensive details on the capabilities and behavior of SAP's entity extraction technology in the Entity Extraction chapter of the SAP HANA Text Analysis Language Reference Guide (PDF).
Python Tutorial
This tutorial mirrors the use case described in the Overview.
Extract "hash tag" topics from social media
In this tutorial, you use the Entity Extraction service to find topic tags, for example, "#Brexit", "#selfie", and "#InternationalWomensDay", in social media posts. You store the tags in a collection that you will analyze later.
Get an access token
To use the service, you must pass an access token in each call. Get the token from the OAuth2 service.
import requests
import json
# Replace the two following values with your client secret and client ID.
client_secret = 'clientSecretPlaceholder'
client_id = 'clientIDPlaceholder'
s = requests.Session()
# Get the access token from the OAuth2 service.
auth_url = 'https://api.beta.yaas.io/hybris/oauth2/v1/token'
r = s.post(auth_url, data={'client_secret': client_secret, 'client_id': client_id, 'grant_type': 'client_credentials'})
access_token = r.json()['access_token']
Call the service
The POST request body for this service includes a single value: the text upon which to perform entity extraction. Your variable socialpost contains a post that your application read from a social platform's API. In some cases, the post is in HTML; in others, it is plain text. You pass it as binary data (the code below sets the content type to application/octet-stream) and let the service automatically determine its format, remove markup if present, and return hashtags found in the remaining text.
# The Entity Extraction service's URL
service_url = 'https://api.beta.yaas.io/sap/ta-entities/v1/'
# HTTP request headers
req_headers = {}
# Set content-type to 'application/octet-stream' to pass the post as raw binary data and let the service detect its format.
req_headers['content-type'] = 'application/octet-stream'
req_headers['Cache-Control'] = 'no-cache'
req_headers['Connection'] = 'keep-alive'
req_headers['Accept-Encoding'] = 'gzip'
req_headers['Authorization'] = 'Bearer {}'.format(access_token)
# Make the REST call to the Entity Extraction service. Pass the binary data in raw form. Do not base64-encode the data.
response = s.post(url = service_url, headers = req_headers, data = socialpost)
Here is a sample, HTML-formatted post:
<!DOCTYPE html>
<!--[if gt IE 8]><!--> <html lang="en" class="no-js logged-in "> <!--<![endif]-->
<head><meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<link href="https://www.foo.com/" rel="alternate" hreflang="x-default" />
<link rel="mask-icon" href="//foo-a.akamaihd.net/images/ico/favicon.svg" color="#262626">
</head>
<body class="">
<b>@morsefit76</b>, <b>@emily.g.davies</b>, <b>@everydayrenee1</b> and <b>@katiaeloera</b> like this<br>
<b>@super_dupes</b> Love the motivation <i>#strong</i> <i>#healthylife</i> <i>#takecareofyourself</i>
</body>
</html>
The first few lines of the JSON response this service would return for that post are:
{
  "mimeType": "text/html",
  "entities": [
    {
      "sentence": 1,
      "text": "@morsefit76",
      "label": "ID_TWITTER",
      "paragraph": 1,
      "offset": 0,
      "normalizedForm": "",
      "id": 1,
      "labelPath": "SOCIAL_MEDIA/ID_TWITTER"
    },
And the last few lines of the response are:
    {
      "sentence": 1,
      "text": "#takecareofyourself",
      "label": "TOPIC_TWITTER",
      "paragraph": 1,
      "offset": 127,
      "normalizedForm": "",
      "id": 8,
      "labelPath": "SOCIAL_MEDIA/TOPIC_TWITTER"
    }
  ],
  "textSize": 148,
  "language": "en"
}
Each entity extracted from the input appears in the response in the "entities" array, in order of appearance. Every entity has eight attributes:
- sentence
- text
- label
- paragraph
- offset
- normalizedForm
- id
- labelPath
For a detailed description of each attribute in the response, see the link to the JSON schema in the Details section of this service.
In your application, you are interested only in the value of the text attribute when the label attribute's value is TOPIC_TWITTER. The function save_hashtag is your sample application's way of storing extracted hashtags.
# Save the hashtags found in the response.
if response.status_code == 200:
    # De-serialize the JSON reply and get the entities list.
    response_dict = json.loads(response.text)
    # If the service returns no entities, it's not an error. For example, "I am not a crook." contains
    # no entities as far as this service is concerned. The service then returns an empty list; the
    # default value of [] only guards against a missing 'entities' member.
    entities = response_dict.get('entities', [])
    for e in entities:
        # Keep only social media topics (hashtags).
        if e.get('label') == 'TOPIC_TWITTER':
            save_hashtag(e['text'])
else:
    print('Error', response.status_code)
    print(response.text)
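The filtering step can also be tried without calling the service. The following is a minimal sketch: the function name extract_hashtags is hypothetical, and the sample dictionary is a shortened version of the response shown earlier, containing only its first and last entities.

```python
def extract_hashtags(response_dict):
    """Return the text of every entity labeled TOPIC_TWITTER, in order of appearance."""
    return [
        entity['text']
        for entity in response_dict.get('entities', [])
        if entity.get('label') == 'TOPIC_TWITTER'
    ]


# Shortened copy of the sample response from this tutorial.
sample_response = {
    'mimeType': 'text/html',
    'entities': [
        {'text': '@morsefit76', 'label': 'ID_TWITTER',
         'labelPath': 'SOCIAL_MEDIA/ID_TWITTER'},
        {'text': '#takecareofyourself', 'label': 'TOPIC_TWITTER',
         'labelPath': 'SOCIAL_MEDIA/TOPIC_TWITTER'},
    ],
    'textSize': 148,
    'language': 'en',
}

print(extract_hashtags(sample_response))   # ['#takecareofyourself']
print(extract_hashtags({'entities': []}))  # []
```

An empty entities array yields an empty list, which matches the service's behavior of treating "no entities" as a normal result.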