Linguistic Analysis

Overview

The Linguistic Analysis service of the Text Analysis Linguistic Analysis package is part of a series of enterprise-grade, natural-language products for gaining new insights from unstructured text sources. This service accepts documents, social media posts, email messages, search queries, and other forms of text, returning individual words and their parts of speech, stems, sentence and paragraph numbers, and other information that applications such as search engines and natural language computation can use.

Some of the capabilities the service provides are:

Tokenization -- separating words and punctuation and discarding spaces: “card-based payment systems.” → “card” “based” “payment” “systems” "."
Stemming -- returning the root or roots of a word: “ran”,"running", "runs" → “run”; “Häuser” → “Haus” (Stems are the root or roots of a word. For example, "run" is the stem of "run", "running", "ran", and "runs")
Part-of-speech tagging -- indicating the sense or function of a word given its context: “quickly” = adjective; “Jeffrey” = proper noun
Noun group identification -- find clusters of words around a NOUN used to give more information about a person or a thing: "The mobile operating system is geared to process text data." → “mobile operating system”, “text data”
Language identification -- determine the dominant language used in a document: "Je suis désolé, Dave. Je crains de ne pas pouvoir faire ça." → "fr"

Use cases

This section describes a few potential uses of the Linguistic Analysis service.

Text search

Assume you are creating a search index for a set of documents in a variety of formats, such as plain text, PDF, Microsoft Word or Excel, in English, German, French, Italian, Portuguese, and Spanish. You apply the common practice of storing the stems of the words that appear in documents.

Later, when a user enters a search query, you search for the stems of the words in the query. Storing and searching for stems finds documents even if the search query contains different forms of a word than appear in the document. For example, if the document contains the sentence "In 2016, Bernie Sanders ran for the presidency." and the search query is "Sanders' run for presidency", the document still matches the query.

In the response from this service, the token attribute contains the word as it originally appeared in the document and stems contains the word's stem(s). However, if the original word is a stem, stems is empty and normalizedToken contains the stem.

You pass your documents and user queries to the Linguistic Analysis service and use the value of the stems or normalizedToken attributes of the response, ignoring the token attribute.

Because this service can process plain text, PDF, Word, Excel, and other formats, you can pass all documents for indexing directly to the service, using the content-type of application/octet-stream in the request header, without converting them to text first. User queries are in plain text, so you pass them with a content-type ofapplication/json.

Because this service supports all six of the languages in your collection, you can pass all documents and queries to the service and know it will return the correct stems.

Experimental information retrieval algorithms Using HANA

You have a collection of documents upon which you want to experiment with some interesting information retrieval algorithms. All of the algorithms require this information for every word in the document collection:

original form as it appears in the document
its stem(s)
its position, to measure how far apart words are from each other
part of speech
sentence number
paragraph number

The collection might contain documents written in languages other than English, but you want to experiment only with those in English.

To experiment with your algorithms, you create a HANA database table containing the words in the document collection.

You open each document file, pass it to the Linguistic Analysis service, and check the language code to make sure it is "en", for English. If it is, you save each token, stems, normalizedToken, offset, partOfSpeech, sentence, and paragraph returned by the service, along with the document's filename, in your HANA database. When your database contains a sufficient number of documents, you begin your experimentation while continuing to grow your database in the background.

This service can identify 33 languages: Arabic, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Farsi, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian (Bokmål), Norwegian (Nynorsk), Polish, Portuguese, Romanian, Russian, Serbian (Cyrillic), Serbian (Latin), Slovak, Slovenian, Spanish, Swedish, Thai, and Turkish.

The service accepts input in a wide variety of formats:

Abobe PDF
Generic email messages (.eml)
HTML
Microsoft Excel
Microsoft Outlook email messages (.msg)
Microsoft PowerPoint
Microsoft Word
Open Document Presentation
Open Document Spreadsheet
Open Document Text
Plain Text
Rich Text Format (RTF)
WordPerfect
XML

The size of each input file is limited to 1 MB.

The tenant parameter is not required because the service is stateless and no data is persisted.

API Reference

/

post

Tokenize input and return the normalized form, part of speech, and stem of each token in addition to positional information. An in-depth description of linguistic analysis is available in the SAP HANA Text Analysis Language Reference Guide.

The partOfSpeech property can be any of the following values:

abbreviation
adjective
adverb
auxiliary verb
conjunction
determiner
interjection
noun
number
particle
preposition
pronoun
proper name
punctuation
verb
unknown

These parts of speech are simplified from finer grained values used internally within text analysis, as explained in the Note at the end of Structure of the $TA Table in the SAP HANA Text Analysis Developer Guide. The internal parts of speech of each language can be found in the Language-Specific Part-of-Speech Tagging Examples section of the SAP HANA Text Analysis Language Reference Guide.

The partOfSpeech property can be any of the following values:

abbreviation
adjective
adverb
auxiliary verb
conjunction
determiner
interjection
noun
number
particle
preposition
pronoun
proper name
punctuation
verb
unknown

Request
Response

Headers

Authorization: required (string)
Used to send a valid OAuth2 access token.
Example:
```
Bearer access_token
```

Query Parameters

languageCodes: (string - default: All supported language codes.)
Comma-separated list containing 2-letter language codes. Include all languages in which input might possibly be written.
Supported language codes:
- ar - Arabic
- ca - Catalan
- cs - Czech
- da - Danish
- de - German
- en - English
- es - Spanish
- fa - Persian
- fr - French
- hr - Croatian
- hu - Hungarian
- id - Indonesian
- it - Italian
- iw - Hebrew
- ja - Japanese
- ko - Korean
- nb - Bokmal
- nl - Dutch
- nn - Nynorsk
- pl - Polish
- pt - Portuguese
- ro - Romanian
- ru - Russian
- sh - Serbo-Croatian
- sk - Slovak
- sl - Slovenian
- sr - Serbian
- sv - Swedish
- th - Thai
- tr - Turkish
- zh - Simplified Chinese
- zf - Traditional Chinese
Example:
```
en
```

Body

Type: application/json

Example:

{
  "text": "The façade of the café is fake."
}

Type: application/octet-stream

Example:

file-as-binary-stream

HTTP status code 200

Body

Type: application/json

Schema:

{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "id": "http://www.sap.com/schemas/json/taaas",
    "title": "Text Analysis as a Service (TAaaS) results",
    "definitions": {
        "output_token": {
            "type": "object",
            "properties": {
                "normalizedToken": {
                    "description": "A normalized representation of the token. Normalization includes converting words to lower case, converting umlauts (ä to ae, for example), and removing diacritics. This value is empty when the partOfSpeech property is \"punctuation\".",
                    "type": "string"
                },
                "offset": {
                    "description": "The offset in characters relative to the beginning of the document. If the document's MIME type is other than text/plain, offset is relative to the document after text analysis converted it to plain text.",
                    "type": "number",
                    "minimum": 0
                },
                   "paragraph": {
                    "description": "The relative paragraph number containing the token (indicates that the nth paragraph contains this token).",
                    "type": "number",
                    "minimum": 1
                },
                "partOfSpeech": {
                    "description": "",
                    "type": "string"
                },
                "sentence": {
                    "description": "The relative sentence number containing the token (indicates that the nth sentence contains this token).",
                    "type": "number",
                    "minimum": 1
                },
                "stems": {
                    "description": "The token's base form(s); i.e., the forms referenced in a dictionary. For example, the singular nominative for nouns or the infinitive for verbs. This property is empty unless the token has a stem that differs from the token.",
                    "type": "array",
                    "minItems": 0,
                    "items": [
                        {
                            "type": "string"
                        }
                    ]
                },
                "token": {
                    "description": "The original, non-normalized form of the word as it appeared in the input.",
                    "type": "string"
                }
            }
        },
        "entity": {
            "type": "object",
            "properties": {
                "id": {
                    "description": "The ordinal position of this entity among other entities found in the input.",
                    "type": "number",
                    "minimum": 1
                },
                "label": {
                    "description": "The linguistic or semantic type of the entity, for instance \"PERSON\" or \"StrongPositiveSentiment\".",
                    "type": "string"
                },
                "labelPath": {
                    "description": "Identical to the label property unless the type is hierarchical; e.g., \"SOCIAL_MEDIA/TOPIC_TWITTER\". In this example, the label property would be TOPIC_TWITTER.",
                    "type": "string"
                },
                "normalizedForm": {
                    "description": "A normalized representation of the entity. For more information, see the description of normalizedToken in this schema.",
                    "type": "string",
                    "minLength": 0
                },
                "offset": {
                    "description": "The offset in characters relative to the beginning of the document. If the document's MIME type is other than text/plain, offset is relative to the document after text analysis converts it to plain text.",
                    "type": "number",
                    "minimum": 0
                },
                "paragraph": {
                    "description": "The relative paragraph number containing the entity (indicates that the nth paragraph contains this entity).",
                    "type": "number",
                    "minimum": 1
                },
                "parent": {
                    "description": "The value of the parent entity's \"id\" property. This property is not included if the token has no parent. Used to indicate that there is a lingustic relationship between two entities.  For example, it is used by sentimentAnalysis to relate topics to their enclosing sentiments.",
                    "type": "number",
                    "minimum": 1
                },
                "sentence": {
                    "description": "The relative sentence number containing the entity (indicates that the nth sentence contains this entity).",
                    "type": "number",
                    "minimum": 1
                },
                "text": {
                    "description": "The original, non-normalized form of the entity as it appeared in the input.",
                    "type": "string"
                }
            }
        }
    },
    "type": "object",
    "properties": {
        "language": {
            "description": "2-letter code indicating the primary language of the input text.",
            "type": "string",
            "minLength": 2,
            "maxLength": 2
        },
        "mimeType": {
            "description": "MIME type of the input.",
            "type": "string"
        },
        "textSize": {
            "description": "Number of characters in the input (after conversion to plain text if mimeType other than text/plain).",
            "type": "number"
        },
        "tokens": {
            "description": "",
            "type": "array",
            "items": [
                {
                    "$ref": "#/definitions/output_token"
                }
            ]
        },
        "entities": {
            "description": "Entities extracted from the input when entityExtraction or a variety of factExtraction invoked, otherwise empty.",
            "type": "array",
            "minItems": 0,
            "items": [
                {
                    "$ref": "#/definitions/entity"
                }
            ]
        }
    },
    "required": ["language","mimeType","textSize"]
}

Example:

{
  "language": "en",
  "mimeType": "text/plain",
  "textSize": 33,
  "tokens": [
    {
      "normalizedToken": "the",
      "offset": 0,
      "paragraph": 1,
      "partOfSpeech": "determiner",
      "sentence": 1,
      "stems": [],
      "token": "The"
    },
    {
      "normalizedToken": "facade",
      "offset": 4,
      "paragraph": 1,
      "partOfSpeech": "noun",
      "sentence": 1,
      "stems": [],
      "token": "fa\u00e7ade"
    },
    {
      "normalizedToken": "of",
      "offset": 11,
      "paragraph": 1,
      "partOfSpeech": "preposition",
      "sentence": 1,
      "stems": [],
      "token": "of"
    },
    {
      "normalizedToken": "the",
      "offset": 14,
      "paragraph": 1,
      "partOfSpeech": "determiner",
      "sentence": 1,
      "stems": [],
      "token": "the"
    },
    {
      "normalizedToken": "cafe",
      "offset": 18,
      "paragraph": 1,
      "partOfSpeech": "noun",
      "sentence": 1,
      "stems": [],
      "token": "caf\u00e9"
    },
    {
      "normalizedToken": "is",
      "offset": 23,
      "paragraph": 1,
      "partOfSpeech": "verb",
      "sentence": 1,
      "stems": [
        "be"
      ],
      "token": "is"
    },
    {
      "normalizedToken": "fake",
      "offset": 26,
      "paragraph": 1,
      "partOfSpeech": "adjective",
      "sentence": 1,
      "stems": [],
      "token": "fake"
    },
    {
      "normalizedToken": ".",
      "offset": 30,
      "paragraph": 1,
      "partOfSpeech": "punctuation",
      "sentence": 1,
      "stems": [],
      "token": "."
    }
  ]
}

HTTP status code 400

Request syntactically incorrect. Any details will be provided within the response payload.

Body

Type: application/json

Schema:

{
  "$schema":"http://json-schema.org/draft-04/schema#",
  "title":"error",
  "description":"Schema for API specified errors.",
  "type":"object",
  "properties":
  {
    "status":
    {
      "type":"integer",
      "description":"original HTTP error code, should be consistent with the response HTTP code",
      "minimum":100,
      "maximum":599
    },
    "type":
    {
      "type":"string",
      "description":"classification of the error type, lower case with underscore eg validation_failure",
      "pattern":"[a-z]+[a-z_]*[a-z]+"
    },
    "message":
    {
      "type":"string",
      "description":"descriptive error message for debugging"
    },
    "moreInfo":
    {
      "type":"string",
      "format":"uri",
      "description":"link to documentation to investigate further and finding support"
    },
    "details":
    {
      "type":"array",
      "description":"list of problems causing this error",
      "items":
      {
        "$schema":"http://json-schema.org/draft-04/schema#",
        "title":"errorDetail",
        "description":"schema for specific error cause",
        "type":"object",
        "properties":
        {
          "field":
          {
            "type":"string",
            "description":"a bean notation expression specifying the element in request data causing the error, eg product.variants[3].name, this can be empty if violation was not field specific"
          },
          "type":
          {
            "type":"string",
            "description":"classification of the error detail type, lower case with underscore eg missing_value, this value must be always interpreted in context of the general error type.",
            "pattern":"[a-z]+[a-z_]*[a-z]+"
          },
          "message":
          {
            "type":"string",
            "description":"descriptive error detail message for debugging"
          },
          "moreInfo":
          {
            "type":"string",
            "format":"uri",
            "description":"link to documentation to investigate further and finding support for error detail"
          }
        },
        "required":["type"]
      }
    }
  },
  "required":["status" , "type" ]
}

Example:

{
  "status": 400,
  "message": "There are validation problems, see details section for more information",
  "moreInfo": "https://api.yaas.io/patterns/errortypes.html",
  "type": "validation_violation",
  "details": [
    {
      "field": "hybris-tenant",
      "message": "size must be between 1 and 36",
      "type": "invalid_header"
    }
  ]
}

HTTP status code 401

Given request is unauthorized. Bad or expired token. Reauthenticate the user. Any details will be provided within the response payload.

Body

Type: application/json

Schema:

{
  "$schema":"http://json-schema.org/draft-04/schema#",
  "title":"error",
  "description":"Schema for API specified errors.",
  "type":"object",
  "properties":
  {
    "status":
    {
      "type":"integer",
      "description":"original HTTP error code, should be consistent with the response HTTP code",
      "minimum":100,
      "maximum":599
    },
    "type":
    {
      "type":"string",
      "description":"classification of the error type, lower case with underscore eg validation_failure",
      "pattern":"[a-z]+[a-z_]*[a-z]+"
    },
    "message":
    {
      "type":"string",
      "description":"descriptive error message for debugging"
    },
    "moreInfo":
    {
      "type":"string",
      "format":"uri",
      "description":"link to documentation to investigate further and finding support"
    },
    "details":
    {
      "type":"array",
      "description":"list of problems causing this error",
      "items":
      {
        "$schema":"http://json-schema.org/draft-04/schema#",
        "title":"errorDetail",
        "description":"schema for specific error cause",
        "type":"object",
        "properties":
        {
          "field":
          {
            "type":"string",
            "description":"a bean notation expression specifying the element in request data causing the error, eg product.variants[3].name, this can be empty if violation was not field specific"
          },
          "type":
          {
            "type":"string",
            "description":"classification of the error detail type, lower case with underscore eg missing_value, this value must be always interpreted in context of the general error type.",
            "pattern":"[a-z]+[a-z_]*[a-z]+"
          },
          "message":
          {
            "type":"string",
            "description":"descriptive error detail message for debugging"
          },
          "moreInfo":
          {
            "type":"string",
            "format":"uri",
            "description":"link to documentation to investigate further and finding support for error detail"
          }
        },
        "required":["type"]
      }
    }
  },
  "required":["status" , "type" ]
}

Example:

{
  "status" : 401,
  "message" : "Authorization: Unauthorized. Bearer TOKEN is invalid",
  "type" : "insufficient_credentials",
  "moreInfo" : "https://api.yaas.io/patterns/errortypes.html"
}

HTTP status code 403

Given authorization scopes are not sufficient and do not match required scopes.

Body

Type: application/json

Schema:

{
  "$schema":"http://json-schema.org/draft-04/schema#",
  "title":"error",
  "description":"Schema for API specified errors.",
  "type":"object",
  "properties":
  {
    "status":
    {
      "type":"integer",
      "description":"original HTTP error code, should be consistent with the response HTTP code",
      "minimum":100,
      "maximum":599
    },
    "type":
    {
      "type":"string",
      "description":"classification of the error type, lower case with underscore eg validation_failure",
      "pattern":"[a-z]+[a-z_]*[a-z]+"
    },
    "message":
    {
      "type":"string",
      "description":"descriptive error message for debugging"
    },
    "moreInfo":
    {
      "type":"string",
      "format":"uri",
      "description":"link to documentation to investigate further and finding support"
    },
    "details":
    {
      "type":"array",
      "description":"list of problems causing this error",
      "items":
      {
        "$schema":"http://json-schema.org/draft-04/schema#",
        "title":"errorDetail",
        "description":"schema for specific error cause",
        "type":"object",
        "properties":
        {
          "field":
          {
            "type":"string",
            "description":"a bean notation expression specifying the element in request data causing the error, eg product.variants[3].name, this can be empty if violation was not field specific"
          },
          "type":
          {
            "type":"string",
            "description":"classification of the error detail type, lower case with underscore eg missing_value, this value must be always interpreted in context of the general error type.",
            "pattern":"[a-z]+[a-z_]*[a-z]+"
          },
          "message":
          {
            "type":"string",
            "description":"descriptive error detail message for debugging"
          },
          "moreInfo":
          {
            "type":"string",
            "format":"uri",
            "description":"link to documentation to investigate further and finding support for error detail"
          }
        },
        "required":["type"]
      }
    }
  },
  "required":["status" , "type" ]
}

Example:

{
  "status": 403,
  "message": "Given request does not have required scopes. It is not authorized to perform this operation.",
  "type": "insufficient_permissions"
}

HTTP status code 500

Body

Type: application/json

Schema:

{
  "$schema":"http://json-schema.org/draft-04/schema#",
  "title":"error",
  "description":"Schema for API specified errors.",
  "type":"object",
  "properties":
  {
    "status":
    {
      "type":"integer",
      "description":"original HTTP error code, should be consistent with the response HTTP code",
      "minimum":100,
      "maximum":599
    },
    "type":
    {
      "type":"string",
      "description":"classification of the error type, lower case with underscore eg validation_failure",
      "pattern":"[a-z]+[a-z_]*[a-z]+"
    },
    "message":
    {
      "type":"string",
      "description":"descriptive error message for debugging"
    },
    "moreInfo":
    {
      "type":"string",
      "format":"uri",
      "description":"link to documentation to investigate further and finding support"
    },
    "details":
    {
      "type":"array",
      "description":"list of problems causing this error",
      "items":
      {
        "$schema":"http://json-schema.org/draft-04/schema#",
        "title":"errorDetail",
        "description":"schema for specific error cause",
        "type":"object",
        "properties":
        {
          "field":
          {
            "type":"string",
            "description":"a bean notation expression specifying the element in request data causing the error, eg product.variants[3].name, this can be empty if violation was not field specific"
          },
          "type":
          {
            "type":"string",
            "description":"classification of the error detail type, lower case with underscore eg missing_value, this value must be always interpreted in context of the general error type.",
            "pattern":"[a-z]+[a-z_]*[a-z]+"
          },
          "message":
          {
            "type":"string",
            "description":"descriptive error detail message for debugging"
          },
          "moreInfo":
          {
            "type":"string",
            "format":"uri",
            "description":"link to documentation to investigate further and finding support for error detail"
          }
        },
        "required":["type"]
      }
    }
  },
  "required":["status" , "type" ]
}

Example:

{
  "status" : 500,
  "message" : "Invalid server settings. Please contact administrator.",
  "type" : "internal_service_error",
  "moreInfo" : "https://api.yaas.io/patterns/errortypes.html"
}

HTTP status code 503

Occasionally, this error occurs during processing. Please consider implementing a retry mechanism in your client application for stable processing.

Body

Type: application/json

Schema:

{
  "$schema":"http://json-schema.org/draft-04/schema#",
  "title":"error",
  "description":"Schema for API specified errors.",
  "type":"object",
  "properties":
  {
    "status":
    {
      "type":"integer",
      "description":"original HTTP error code, should be consistent with the response HTTP code",
      "minimum":100,
      "maximum":599
    },
    "type":
    {
      "type":"string",
      "description":"classification of the error type, lower case with underscore eg validation_failure",
      "pattern":"[a-z]+[a-z_]*[a-z]+"
    },
    "message":
    {
      "type":"string",
      "description":"descriptive error message for debugging"
    },
    "moreInfo":
    {
      "type":"string",
      "format":"uri",
      "description":"link to documentation to investigate further and finding support"
    },
    "details":
    {
      "type":"array",
      "description":"list of problems causing this error",
      "items":
      {
        "$schema":"http://json-schema.org/draft-04/schema#",
        "title":"errorDetail",
        "description":"schema for specific error cause",
        "type":"object",
        "properties":
        {
          "field":
          {
            "type":"string",
            "description":"a bean notation expression specifying the element in request data causing the error, eg product.variants[3].name, this can be empty if violation was not field specific"
          },
          "type":
          {
            "type":"string",
            "description":"classification of the error detail type, lower case with underscore eg missing_value, this value must be always interpreted in context of the general error type.",
            "pattern":"[a-z]+[a-z_]*[a-z]+"
          },
          "message":
          {
            "type":"string",
            "description":"descriptive error detail message for debugging"
          },
          "moreInfo":
          {
            "type":"string",
            "format":"uri",
            "description":"link to documentation to investigate further and finding support for error detail"
          }
        },
        "required":["type"]
      }
    }
  },
  "required":["status" , "type" ]
}

Example:

{
  "status": 503,
  "message": "A temporary service unavailability was detected. Refer to the error details response for a re-attempt strategy.",
  "type": "service_temporarily_unavailable",
  "moreInfo" : "https://api.yaas.io/patterns/errortypes.html"
}

HTTP status code 504

This error occurs if text analysis takes longer than 20 seconds. There could be two reasons for this error:

Processing takes longer because the current service load is high. Please try again later. You could also consider implementing a retry mechanism in your client application for more stable processing.
The text to be processed is too big or too complex. Sometimes, even small texts take a long time to process. Please split your text in smaller chunks and send it separately to the service.

Body

Type: application/json

Schema:

{
  "$schema":"http://json-schema.org/draft-04/schema#",
  "title":"error",
  "description":"Schema for API specified errors.",
  "type":"object",
  "properties":
  {
    "status":
    {
      "type":"integer",
      "description":"original HTTP error code, should be consistent with the response HTTP code",
      "minimum":100,
      "maximum":599
    },
    "type":
    {
      "type":"string",
      "description":"classification of the error type, lower case with underscore eg validation_failure",
      "pattern":"[a-z]+[a-z_]*[a-z]+"
    },
    "message":
    {
      "type":"string",
      "description":"descriptive error message for debugging"
    },
    "moreInfo":
    {
      "type":"string",
      "format":"uri",
      "description":"link to documentation to investigate further and finding support"
    },
    "details":
    {
      "type":"array",
      "description":"list of problems causing this error",
      "items":
      {
        "$schema":"http://json-schema.org/draft-04/schema#",
        "title":"errorDetail",
        "description":"schema for specific error cause",
        "type":"object",
        "properties":
        {
          "field":
          {
            "type":"string",
            "description":"a bean notation expression specifying the element in request data causing the error, eg product.variants[3].name, this can be empty if violation was not field specific"
          },
          "type":
          {
            "type":"string",
            "description":"classification of the error detail type, lower case with underscore eg missing_value, this value must be always interpreted in context of the general error type.",
            "pattern":"[a-z]+[a-z_]*[a-z]+"
          },
          "message":
          {
            "type":"string",
            "description":"descriptive error detail message for debugging"
          },
          "moreInfo":
          {
            "type":"string",
            "format":"uri",
            "description":"link to documentation to investigate further and finding support for error detail"
          }
        },
        "required":["type"]
      }
    }
  },
  "required":["status" , "type" ]
}

Example:

{
  "status": 504,
  "message": "Service is not reachable: Upstream service connection timeout.",
  "type": "service_temporarily_unavailable",
  "moreInfo" : "https://api.yaas.io/patterns/errortypes.html"
}

The stems and normalizedToken members

The stems member is an array because a word can have more than one possible stem. For example, the word "driving" has the stem "drive" in the verb sense, as in "driving a car." However, it is a stem as an adjective, as in "driving rain." If the word appears in the context of a sentence, the service can usually determine which sense is in use. If there is no context or it is ambiguous, the service returns all possible stems.

When a word is used in an input document in its stem form, the stems member is empty and instead the stem appears in the normalizedToken member's value.

Default Language

The default language is either the first value listed in the languageCodes input parameter (see Setting a subset of languages in this topic) or English if the languageCodes input parameter is not specified.

Setting a subset of languages

You can instruct the service to choose from a specific, reduced set of languages by setting the languageCodes input parameter. This forces the service to choose from one of the languages you supply.
Use this setting with caution. If, for example, you set languageCodes to Danish, German, or Dutch and the input text is in Russian, the service cannot return Russian. It must return the default.

Meaning of the textSize value

The returned attribute textSize represents the amount of character data in the input, not the number of bytes. If the input is in plain text file without accented characters, textSize equals the input file's size. However, if the input is a binary file such as a PDF or Microsoft Word document, the textSize will probably be much smaller than the file size, especially if the file contains a lot of non-textual data such as an embedded image.

Annotated JSON schema

The JSON schema contains the descriptions of the objects and members of the JSON response that the service returns. To read the schema, click the POST link in the API Reference then switch to the RESPONSE tab.

Further references

You can find extensive details on the capabilities and behavior of SAP's linguistic analysis technology in the Linguistic Analysis chapter of the SAP HANA Text Analysis Language Reference Guide.

Python Tutorials

Simple example with static data and default options

In this Python tutorial, you have some static text and you want to determine the part of speech for each word.

Get an access token

To use the service, you must pass an access token in each call. Get the token from the OAuth2 service.

import requests
import json

# Replace the two following values with your client secret and client ID.
client_secret = 'clientSecretPlaceholder'
client_id = 'clientIDPlaceholder'

s = requests.Session()

# Get the access token from the OAuth2 service.
auth_url = 'https://api.beta.yaas.io/hybris/oauth2/v1/token'
r = s.post(auth_url, data= {'client_secret':client_secret, 'client_id':client_id,'grant_type':'client_credentials'})
access_token = r.json()['access_token']

Call the service

The POST request body includes a single value: the text upon which to perform linguistic analysis. You could load the contents of a binary file, web page, text stream, or other resource into the request. In this example, you use static text that you typed into your Python source code.

# The Linguistic Analysis service's URL
service_url = 'https://api.beta.yaas.io/sap/ta-linguistics/v1/'

# The example text whose parts of speech you want to see. 
service_text = 'Guten Morgen, Herr Veezner. Wie geht es? Haben Sie etwas zu verzollen?'

# HTTP request headers
req_headers = {}

# Set content-type to 'application/json' to pass plain text to the service. Specify the text's encoding as UTF-8.
req_headers['content-type'] = 'application/json; charset=utf-8'
req_headers['Cache-Control'] = 'no-cache'
req_headers['Connection'] = 'keep-alive'
req_headers['Accept-Encoding'] = 'gzip'
req_headers['Authorization'] = 'Bearer {}'.format(access_token)

# Make the REST call of the Language Identification service
response = s.post(url = service_url,  headers = req_headers, data = json.dumps({'text':service_text}))

Display the returned JSON response on the console. The first 25 lines of output from this tutorial should be:

{ 
    "mimeType": "text/plain",  
    "tokens": [ 
        { 
            "normalizedToken": "guten",  
            "sentence": 1,  
            "stems": [ 
                "gut" 
            ],  
            "token": "Guten",  
            "paragraph": 1,  
            "partOfSpeech": "adjective",  
            "offset": 0 
        },  
        { 
            "normalizedToken": "morgen",  
            "sentence": 1,  
            "stems": [ 
                "Morgen" 
            ],  
            "token": "Morgen",  
            "paragraph": 1,  
            "partOfSpeech": "noun",  
            "offset": 6 
        },

Each token in the input appears in the response, in the "tokens" array, in order of appearance. Every token has seven attributes:

normalizedToken

sentence

stems

token

paragraph

partOfSpeech - This is the attribute you are interested in for this tutorial.

offset

The JSON schema for the response contains a detailed description of each attribute. See the Details section of this service for a link to the schema.

The last few lines of the response are:

        { 
            "normalizedToken": "?",  
            "sentence": 3,  
            "stems": [],  
            "token": "?",  
            "paragraph": 1,  
            "partOfSpeech": "punctuation",  
            "offset": 69 
        } 
    ],  
    "textSize": 70,  
    "language": "de" 
}

The textSize attribute indicates the number of characters (70) in the input text, and the language attribute contains the ISO 639-1 code of the input text. The code for German is "de".

# Print result
if response.status_code == 200:
        # De-serialize the JSON reply and get the tokens list.
        response_dict = json.loads(response.text)
        response_tokens = response_dict.get('tokens','Oops - return is missing the tokens attribute.')
        print 'Input tokens and their parts of speech'
        for t in response_tokens:
                for key, value in t.iteritems():
                        if key == 'token':
                                curr_token = value
                        if key == 'partOfSpeech':
                                curr_POS = value
                print('token: ' + curr_token + '\t\tpart of speech: ' + curr_POS)
else:
    print 'Error', response.status_code
    print response.text

Specify a list of languages in a binary document

In this tutorial, you know that the text must be in one of three languages: English, French, or Spanish. You have a Microsoft Word file, named French.docx, that contains the text "Je suis désolé, Dave. Je crains de ne pas pouvoir faire ça." You will print the part of speech of each token.

This example does not show the required steps to obtain an access token. Please see the first example in this tutorial for those instructions.

Read the contents of the Word file

filename = "French.docx" 
f = open('French.docx', 'rb')
service_binarydata = f.read()

Identify the language as either English, Spanish, or French

This tutorial's POST request includes the languageCodes parameter to force the service to select from three languages.

# Append "?languageCodes=en%2Ces%2Cfr" to the Linguistic Analysis URL to restrict identification to Spanish ('es'), English ('en'), or French ('fr'). Separate each language with a comma, represented as '%2C' within URLs.
service_url = 'https://api.beta.yaas.io/sap/ta-linguistics/v1?languageCodes=es%2Cen%2Cfr'

# HTTP request headers
req_headers = {}
# This example uses a different content-type value to pass binary data.
req_headers['content-type'] = 'application/octet-stream'
req_headers['Cache-Control'] = 'no-cache'
req_headers['Connection'] = 'keep-alive'
req_headers['Accept-Encoding'] = 'gzip'
req_headers['Authorization'] = 'Bearer {}'.format(access_token)

# Make the REST call to the Linguistic Analysis service. Pass the binary data in raw form. Do not base64-encode the data.
response = s.post(url = service_url,  headers = req_headers, data = service_binarydata)

Print each token and its part of speech

As in the first example in this tutorial, the JSON response includes attributes about the document and every token within it:

{ 
    "mimeType": "application/msword", 
    "tokens": [ 
        { 
            "normalizedToken": "je", 
            "sentence": 1, 
            "stems": [], 
            "token": "Je", 
            "paragraph": 1, 
            "partOfSpeech": "pronoun", 
            "offset": 0 
        }, 
        { 
            "normalizedToken": "suis", 
            "sentence": 1, 
            "stems": [ 
                "\u00eatre" 
            ], 
            "token": "suis", 
            "paragraph": 1, 
            "partOfSpeech": "auxiliary verb", 
            "offset": 3 
        }, 
        { 
            "normalizedToken": "desole",

The mimeType attribute shows that the binary data was correctly identified as Microsoft Word.

The last few lines of the service's response are:

        {    
            "normalizedToken": ".", 
            "sentence": 2, 
            "stems": [],  
            "token": ".", 
            "paragraph": 1,   
            "partOfSpeech": "punctuation", 
            "offset": 58 
        }    
    ],   
    "textSize": 64, 
    "language": "fr" 
}

The value returned for textSize is 64, while the text "Je suis désolé, Dave. Je crains de ne pas pouvoir faire ça." contains 59 characters. The conversion of Microsoft Word files to a plain-text equivalent within the Text Analysis services accounts for this apparent discrepancy. The accented characters 'é' and 'ç' are internally represented by multi-byte characters in the service.

Rather than print everything in the response, print just the tokens and their parts of speech.

if response.status_code == 200:
    # De-serialize the JSON reply and get the tokens list.
    response_dict = json.loads(response.text)
    response_tokens = response_dict.get('tokens','Oops - return is missing the tokens attribute.')
    print 'Input tokens and their parts of speech'
    for t in response_tokens:
        for key, value in t.iteritems():
            if key == 'token':
                curr_token = value
            if key == 'partOfSpeech':
                curr_POS = value
        print('token:' + curr_token + '\tpart of speech: ' + curr_POS)
else:
    print 'Error', response.status_code
    print response.text

Create a search index

This final example is a small modification of the previous one. You are creating a search index for a set of documents, as described in the first of the Use Cases in this tutorial.
Your example search index application's Python interface contains a search_index object whose store_term method takes a document identifier, a string (the stem), and the string's offset into the document.

if response.status_code == 200:
    # De-serialize the JSON reply and get the tokens list.
    response_dict = json.loads(response.text)
    response_tokens = response_dict.get('tokens','Oops - return is missing the tokens attribute.')
    for t in response_tokens:
        for key, value in t.iteritems():
            if key == 'stems':
                curr_stems = value
            if key == 'partOfSpeech':
                curr_POS = value
            if key == 'normalizedToken':
                curr_normalizedToken = value
            if key == 'offset':
                curr_offset = value
        if curr_POS != 'punctuation': # don't store punctuation in the index
            # The stem's attribute is empty if the stem form is used in the document.
            # In that case, store the normalizedToken, which is the lowercase form 
            # if the word was capitalized only because it began a sentence, or is the
            # unaccented form if the word contained accented characters.
            if not curr_stems: 
                search_index.store_term(filename,curr_normalizedToken,curr_offset)
            else:
                # A word can have more than one stem. For example, the stem for "driving"
                # in "the driving rain" is "driving", but in "I was driving"
                # the stem is "drive", so use a for loop.
                for s in curr_stems:
                    search_index.store_term(filename,s,curr_offset)
else:
    print 'Error', response.status_code
    print response.text

Send feedback
If you find any information that is unclear or incorrect, please let us know so that we can improve the Dev Portal content.
Get Help
Use our private help channel. Receive updates over email and contact our specialists directly.
hybris Experts
If you need more information about this topic, visit hybris Experts to post your own question and interact with our community and experts.

Overview

Use cases

Text search

Experimental information retrieval algorithms Using HANA

API Reference

/

/

post /

Headers

Query Parameters

Body

HTTP status code 200

Body

HTTP status code 400

Body

HTTP status code 401

Body

HTTP status code 403

Body

HTTP status code 500

Body

HTTP status code 503

Body

HTTP status code 504

Body

The stems and normalizedToken members

Default Language

Setting a subset of languages

Meaning of the textSize value

Annotated JSON schema

Further references

Python Tutorials

Simple example with static data and default options

Get an access token

Call the service

Specify a list of languages in a binary document

Read the contents of the Word file

Identify the language as either English, Spanish, or French

Print each token and its part of speech

Create a search index