Overview
The Enterprise Fact Extraction service of the Text Analysis Fact Extraction package is part of a series of enterprise-grade, natural-language products that find domain-specific relationships between entities in input texts. The service identifies enterprise events to track what’s happening in the business world. Are there any changes in senior management at a company of interest? Who is releasing new products? Which acquisitions are in the works?
The service can monitor these activities by identifying membership information such as a person’s affiliations, management changes, product releases, mergers and acquisitions, and organizational information such as founder, location, or contact information.
Use Case
An example use of the service is an application that scans corporate press releases in multiple formats, such as PDF, Microsoft Word, RTF, and HTML. Your application can extract information about company acquisitions from those documents and plot a graph of which companies acquired others and for how much. You can pass the press release documents to the Enterprise Fact Extraction service, which automatically determines the file format of each document, extracts the text, determines the language in which it is written, and returns any entities of a corporate nature that it contains.
You can identify events to save, for example saving only the acquisition events where a price is mentioned, indicated by entities of the type BuyEvent that have a child entity of type Price. In addition to the price, you can save the names of the companies involved. For example, if a press release contains the sentences "Muttski Enterprises acquired Bow-Wow Technologies for $13bn. Buddy Coddo was promoted to Vice President of Operations.", you can save only the following child entities of the BuyEvent type found in the first sentence:
Text | Entity Type | Description
Muttski Enterprises | OrganizationA | The buyer
Bow-Wow Technologies | OrganizationB | The acquired company
$13bn | Price | The price OrganizationA paid for OrganizationB
The service supports text in English.
The service accepts input in a wide variety of formats:
- Adobe PDF
- Generic email messages (.eml)
- HTML
- Microsoft Excel
- Microsoft Outlook email messages (.msg)
- Microsoft PowerPoint
- Microsoft Word
- Open Document Presentation
- Open Document Spreadsheet
- Open Document Text
- Plain Text
- Rich Text Format (RTF)
- WordPerfect
- XML
The size of each input file is limited to 100 kB.
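A client can check the size of a document before sending it, which avoids a round trip for files the service would reject. The following is a minimal sketch, not part of the service's API: the helper name fits_size_limit is hypothetical, and it assumes that "100 kB" means 100 × 1000 bytes, the more conservative reading, because the documentation does not say whether the limit is decimal or binary.

```python
# Assumption: "100 kB" is read as 100 * 1000 bytes. The documentation does
# not state whether the limit is decimal or binary, so this is the
# stricter of the two interpretations.
MAX_INPUT_BYTES = 100 * 1000


def fits_size_limit(data, limit=MAX_INPUT_BYTES):
    """Return True if the raw input (a bytes object) is within the limit."""
    return len(data) <= limit


small = b"Muttski Enterprises acquired Bow-Wow Technologies for $13bn."
too_big = b"x" * (MAX_INPUT_BYTES + 1)

print(fits_size_limit(small))    # True
print(fits_size_limit(too_big))  # False
```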
API Reference
/
Extract entities related to enterprises (businesses), such as personal affiliations, product releases, and mergers & acquisitions, in English input documents. Detailed explanations of the enterprise entities that text analysis can extract are available in the Enterprise Fact Extraction section of the SAP HANA Text Analysis Language Reference Guide.
An empty entities array is normal
It is not an error if the service returns an empty entities array. Not all text contains entities as defined and recognized by the service. For example, the English sentence "It's the end of the world as we know it, and I feel fine" contains no entities, nor do any of these translations:
Spanish: Es el fin del mundo tal como lo conocemos, y me siento bien.
German: Es ist das Ende der Welt, wie wir es kennen, und ich fühle mich gut.
Korean: 우리가 알고있는대로 그것은 세상의 종말이며, 나는 기분이 좋아집니다.
Russian: Это конец света, как мы его знаем, и я чувствую себя прекрасно.
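Client code can treat an empty entities array as a normal outcome and not as a failure. The sketch below illustrates this with a hypothetical, abbreviated response body; the helper name get_entities and the sample body are assumptions made for illustration.

```python
import json

# Hypothetical, abbreviated response body for a text that contains no
# enterprise entities.
sample_body = '{"entities": [], "language": "en"}'


def get_entities(body):
    """Return the entities list from a JSON response body.

    An empty list is a valid result, not an error.
    """
    return json.loads(body).get('entities', [])


entities = get_entities(sample_body)
if not entities:
    print('No enterprise entities found in the text.')
```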
The label and labelPath members
Some of the entity types that the service identifies are given general categories and then subdivided into more specific classifications. For example, URI is a general category with the subtypes EMAIL, IP, and URL. The label member of the entities array is the most specific entity type of an extracted entity. The labelPath member includes the general category and the subtype, separated by a forward slash ("/").
That means, if a document contains the web address "http://www.sap.com" in its text, that string is extracted as an entity with its label attribute's value set to URL and its labelPath set to URI/URL.
If the entity type does not have subtypes, for example, PERSON, the label and labelPath values are identical.
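The relationship between the two members can be sketched in a few lines. The helper name split_label_path is hypothetical; it relies only on the forward-slash separator described above.

```python
def split_label_path(label_path):
    """Split a labelPath into (general category, most specific type).

    For a type without subtypes, both elements are the same string.
    """
    parts = label_path.split('/')
    return parts[0], parts[-1]


print(split_label_path('URI/URL'))  # ('URI', 'URL')
print(split_label_path('PERSON'))   # ('PERSON', 'PERSON')
```

The second element of the returned pair corresponds to the value of the label member.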
How parent and child entities are linked in the JSON response
Some entities are composed of other entities. This hierarchical relationship is indicated in the JSON response by an extra attribute, parent, that appears only on child entities. It contains an integer value that matches the id value of the parent.
The parent entity always appears before its children in the entities array. The order in which children appear is not guaranteed.
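One way to work with this link is to group child entities by the id of their parent. The sketch below uses a hypothetical, abbreviated entities array (only the id, parent, label, and text members are shown), and the helper name group_children is an assumption. It does not depend on the order of the children.

```python
from collections import defaultdict

# Hypothetical, abbreviated entities array in the shape the service returns.
entities = [
    {'id': 1, 'label': 'BuyEvent',
     'text': 'Muttski Enterprises acquired Bow-Wow Technologies for $13bn'},
    {'id': 2, 'parent': 1, 'label': 'OrganizationA',
     'text': 'Muttski Enterprises'},
    {'id': 3, 'parent': 1, 'label': 'OrganizationB',
     'text': 'Bow-Wow Technologies'},
    {'id': 4, 'parent': 1, 'label': 'Price', 'text': '$13bn'},
]


def group_children(entities):
    """Map each parent id to the list of its child entities."""
    children = defaultdict(list)
    for entity in entities:
        # Only child entities carry the parent attribute.
        if 'parent' in entity:
            children[entity['parent']].append(entity)
    return dict(children)


children_by_parent = group_children(entities)
print(sorted(child['label'] for child in children_by_parent[1]))
```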
Default Language
The default language is either the first value listed in the languageCodes input parameter (see Setting a subset of languages in this topic) or English if the languageCodes input parameter is not specified.
Setting a subset of languages
You can instruct the service to choose from a specific, reduced set of languages by setting the languageCodes input parameter. This forces the service to choose from one of the languages you supply.
Use this setting with caution. If, for example, you set languageCodes to Danish, German, or Dutch and the input text is in Russian, the service cannot return Russian. It must return the default.
Meaning of the textSize value
The returned attribute textSize represents the amount of character data in the input, not the number of bytes. If the input is a plain text file without accented characters, textSize equals the input file's size. However, if the input is a binary file such as a PDF or Microsoft Word document, the textSize will probably be much smaller than the file size, especially if the file contains a lot of non-textual data such as an embedded image.
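The difference between characters and bytes shows up as soon as the text contains an accented character. The following sketch assumes the file is stored as UTF-8 and that textSize counts characters in a way comparable to Python's len(); neither detail is confirmed by the service documentation.

```python
text = 'Es ist das Ende der Welt, wie wir es kennen, und ich fühle mich gut.'

char_count = len(text)                  # number of characters
byte_count = len(text.encode('utf-8'))  # number of bytes when stored as UTF-8

# The umlaut needs one more byte than it has characters in UTF-8, so the
# byte count is one larger than the character count.
print(char_count, byte_count)
```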
Annotated JSON schema
The JSON schema contains the descriptions of the objects and members of the JSON response that the Entity Extraction service returns. To read the schema, click the POST link in the API Reference, then click the RESPONSE tab.
Further references
You can find extensive details on the capabilities and behavior of SAP's entity extraction technology in the Entity Extraction chapter of the SAP HANA Text Analysis Language Reference Guide (PDF).
Python Tutorial
This tutorial mirrors the use case described in the Overview. This tutorial shows how to retrieve and store entities. It does not illustrate the plotting phase of the application described in the use case.
Plot corporate acquisitions
In this tutorial, you use the Enterprise Fact Extraction service to extract corporate acquisitions from press releases in a variety of formats such as PDF, Microsoft Word, RTF, and HTML. The application described in the use case then plots a graph showing who acquired whom and for how much.
Get an access token
To use the service, you must pass an access token in each call. Get the token from the OAuth2 service.
import requests
import json
# Replace the two following values with your client secret and client ID.
client_secret = 'clientSecretPlaceholder'
client_id = 'clientIDPlaceholder'
s = requests.Session()
# Get the access token from the OAuth2 service.
auth_url = 'https://api.beta.yaas.io/hybris/oauth2/v1/token'
r = s.post(auth_url, data={'client_secret': client_secret, 'client_id': client_id, 'grant_type': 'client_credentials'})
# Stop here with an exception if the OAuth2 service rejected the request.
r.raise_for_status()
access_token = r.json()['access_token']
Call the service
The POST request body for this service includes a single value: the text on which to perform enterprise fact extraction. The variable press_release contains the raw binary data read from a press release file.
# The Enterprise Fact Extraction service's URL
service_url = 'https://api.beta.yaas.io/sap/ta-enterprisefacts/v1/'
# HTTP request headers
req_headers = {}
# Set content-type to 'application/octet-stream'. The service automatically
# determines the file format and extracts the text prior to performing
# enterprise fact extraction.
req_headers['content-type'] = 'application/octet-stream'
req_headers['Cache-Control'] = 'no-cache'
req_headers['Connection'] = 'keep-alive'
req_headers['Accept-Encoding'] = 'gzip'
req_headers['Authorization'] = 'Bearer {}'.format(access_token)
# Make the REST call to the Enterprise Fact Extraction service. Pass the binary
# data in raw form. Do not base64-encode the data.
response = s.post(url=service_url, headers=req_headers, data=press_release)
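The press_release variable must hold raw bytes, so the file is opened in binary mode. The sketch below shows one way to fill it. The helper name read_press_release is hypothetical, and the example writes a small temporary stand-in file only so that it runs on its own.

```python
import os
import tempfile


def read_press_release(path):
    """Read a document as raw bytes, suitable for an octet-stream body."""
    with open(path, 'rb') as f:
        return f.read()


# Create a small stand-in file so the sketch is self-contained.
sample = 'Muttski Enterprises acquired Bow-Wow Technologies for $13bn.'
with tempfile.NamedTemporaryFile(suffix='.txt', delete=False) as tmp:
    tmp.write(sample.encode('utf-8'))
    sample_path = tmp.name

press_release = read_press_release(sample_path)
os.remove(sample_path)

print(type(press_release).__name__, len(press_release))
```

Reading in binary mode works the same way for PDF, Microsoft Word, and the other supported formats, because the bytes are passed on unchanged.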
Taking the sample text from the use case presented in the Overview, the JSON response from the service starts with the following:
{
  "mimeType": "text/plain",
  "entities": [
    {
      "sentence": 1,
      "text": "Muttski Enterprises acquired Bow-Wow Technologies for $13bn",
      "label": "BuyEvent",
      "paragraph": 1,
      "offset": 0,
      "normalizedForm": "",
      "id": 1,
      "labelPath": "BuyEvent"
    },
    {
      "parent": 1,
      "sentence": 1,
      "text": "Muttski Enterprises",
      "label": "OrganizationA",
      "paragraph": 1,
      "offset": 0,
      "normalizedForm": "",
      "id": 2,
      "labelPath": "OrganizationA"
    },
The last few lines of the response are:
    {
      "parent": 13,
      "sentence": 2,
      "text": "Operations",
      "label": "Organization",
      "paragraph": 1,
      "offset": 107,
      "normalizedForm": "",
      "id": 18,
      "labelPath": "Organization"
    }
  ],
  "textSize": 119,
  "language": "en"
}
Each entity extracted from the input text appears in the response's entities array in order of appearance. Enterprise entities have a hierarchical (parent←→child) relationship. This diagram illustrates how the entities in this tutorial's example text are related to one another:
BuyEvent: Muttski Enterprises acquired Bow-Wow Technologies for $13bn
- OrganizationA: Muttski Enterprises
- Action: acquired
- OrganizationB: Bow-Wow Technologies
- Price: $13bn
HireEvent: Buddy Coddo was promoted to Vice President of Operations
- Person: Buddy Coddo
- Action: promoted
- Position: Vice President
- Organization: Operations
Some entities have a ninth attribute: parent, which associates child entities with the parents they comprise.
You care only about acquisition events in which a price is mentioned, that is, entities whose label attribute is BuyEvent and that have a child whose label is Price. See the Details section for a link to a list of all Enterprise entity types. You store only the OrganizationA, OrganizationB, and Price child entities of BuyEvents.
Thus, given the example text, you store three of the BuyEvent entity's children, but you do not store the HireEvent entity nor anything under it.
The method store_acquisition is defined in your application; its code is not included here.
# Process the result
if response.status_code == 200:
    # De-serialize the JSON reply and get the entities list.
    response_dict = json.loads(response.text)
    # If the service returns no entities, it is not an error; it just means
    # no enterprise entities are contained in the text. Default to an empty
    # list so that the loop below is simply skipped.
    entities = response_dict.get('entities', [])
    found_one_buyevent = False  # found at least 1 BuyEvent in the response
    curr_buyevent_id = None
    curr_org_a = curr_org_b = curr_price = ''
    for e in entities:
        entity_type = e.get('label')
        # At a BuyEvent? They always precede their children in the
        # entities array.
        if entity_type == 'BuyEvent':
            # If already encountered another BuyEvent in the array,
            # now's the time to save it.
            if found_one_buyevent and curr_price != '':
                store_acquisition(curr_org_a, curr_org_b, curr_price)
            found_one_buyevent = True
            curr_buyevent_id = e.get('id')
            # Reset all fields so that values from a previous BuyEvent
            # do not leak into this one.
            curr_org_a = curr_org_b = curr_price = ''
        elif found_one_buyevent and e.get('parent') == curr_buyevent_id:
            # This is a child entity of the current BuyEvent. Save it if it's
            # either OrganizationA, OrganizationB, or Price.
            if entity_type == 'OrganizationA':
                curr_org_a = e.get('text')
            elif entity_type == 'OrganizationB':
                curr_org_b = e.get('text')
            elif entity_type == 'Price':
                curr_price = e.get('text')
    # End of entities array. If found any, the last one still needs to be
    # stored.
    if found_one_buyevent and curr_price != '':
        store_acquisition(curr_org_a, curr_org_b, curr_price)
else:
    print('Error', response.status_code)
    print(response.text)
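The tutorial leaves store_acquisition to your application. Purely as an illustration, the following hypothetical sketch keeps the acquisitions in an in-memory list; a real application might write them to a database or pass them to the plotting code. The dictionary keys are assumptions chosen for this example.

```python
# Hypothetical in-memory store for the extracted acquisitions.
acquisitions = []


def store_acquisition(org_a, org_b, price):
    """Record one acquisition: the buyer, the acquired company, and the
    price text exactly as it was extracted from the document."""
    acquisitions.append({'buyer': org_a, 'acquired': org_b, 'price': price})


store_acquisition('Muttski Enterprises', 'Bow-Wow Technologies', '$13bn')
print(acquisitions)
```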