Overview
The Language Identification service of the Text Analysis Linguistic Analysis package is part of a series of enterprise-grade natural-language products for gaining new insights from unstructured text sources. This service identifies the language of the text in files such as documents, e-mails, and web pages. For example, given the sentence “Les premiers taxis sans chauffeur ont commencé à circuler jeudi 25 août, à Singapour.”, the service identifies the language as French.
Use Cases
As an example, assume you have a collection of news articles submitted by readers from around the world. The articles are in various formats, such as PDF, web pages, and Microsoft Word. After a user indicates the languages they can read, you want to show them only articles written in those languages. When building your database of articles, you pass each one to this service and tag it with the language this service returns. If your collection contains articles in either English or French but no other languages, you can restrict this service to return only English or French by specifying those two languages in the languageCodes parameter.
In another application, you create a search index for those news articles so that users can search for articles containing specific terms. You first invoke this service to determine the article's language and then call the Linguistic Analysis service to produce the individual tokens you place in your search index. You set the Linguistic Analysis service's languageCodes parameter to the value returned from this service.
This service can identify 33 languages: Arabic, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Farsi, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian (Bokmål), Norwegian (Nynorsk), Polish, Portuguese, Romanian, Russian, Serbian (Cyrillic), Serbian (Latin), Slovak, Slovenian, Spanish, Swedish, Thai, and Turkish.
The service accepts input in a wide variety of formats:
- Adobe PDF
- Generic email messages (.eml)
- HTML
- Microsoft Excel
- Microsoft Outlook email messages (.msg)
- Microsoft PowerPoint
- Microsoft Word
- Open Document Presentation
- Open Document Spreadsheet
- Open Document Text
- Plain Text
- Rich Text Format (RTF)
- WordPerfect
- XML
The size of each input file is limited to 1 MB.
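If you are not sure whether a document fits within this limit, you can check its size before you submit it. The following Python sketch is only an example; the file name is a placeholder, and the limit is taken from the 1 MB figure above.
import os
# The documented limit of 1 MB, expressed in bytes.
MAX_BYTES = 1 * 1024 * 1024
# Placeholder file name; replace it with the document you want to submit.
filename = 'example.docx'
if os.path.getsize(filename) > MAX_BYTES:
    print('File is larger than 1 MB; reduce it before calling the service.')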
API Reference
POST /
Given a list of languages, determines the one in which a document is most likely written. Returns the language of the input, its MIME type, and its length. The accuracy of the result depends on the amount of input text supplied. If fewer than 30 characters are supplied, the first language code in languageCodes is returned, or en if languageCodes is not specified.
Minimum Text Length
The Language Identification service requires at least 30 characters of text to make a reasonably accurate guess of the language. If the input contains fewer than 30 characters, the service does not attempt to make a guess and instead returns the default language.
Default Language
The default language is either the first value listed in the languageCodes input parameter (see Setting a subset of languages in this topic) or English if the languageCodes input parameter is not specified.
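The following sketch restates this fallback rule in Python; it is an illustration of the documented behavior, not code from the service.
# Illustration of the default-language rule for inputs shorter than 30 characters.
def default_language(language_codes=None):
    # language_codes mirrors the languageCodes parameter, for example ['es', 'en', 'fr'].
    return language_codes[0] if language_codes else 'en'

print(default_language(['es', 'en', 'fr']))  # prints 'es'
print(default_language())                    # prints 'en'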
Setting a subset of languages
You can instruct the service to choose from a specific, reduced set of languages by setting the languageCodes input parameter. This forces the service to choose from one of the languages you supply.
Use this setting with caution. If, for example, you set languageCodes to Danish, German, and Dutch and the input text is in Russian, the service cannot return Russian. It must return the default language.
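You pass the subset as a comma-separated list of language codes in the URL's query string, as the second Python tutorial below shows. If you prefer not to encode the comma as %2C by hand, you can let the requests library build the query string, as in this sketch; the codes shown are only an example subset.
import requests
# Let requests URL-encode the languageCodes value instead of writing %2C yourself.
req = requests.Request(
    'POST',
    'https://api.beta.yaas.io/sap/ta-language/v1',
    params={'languageCodes': 'da,de,nl'},  # example subset: Danish, German, Dutch
)
print(req.prepare().url)
# https://api.beta.yaas.io/sap/ta-language/v1?languageCodes=da%2Cde%2Cnl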
Internal Algorithm
The service includes a model for each of the languages it can identify. A model contains data about a language and an algorithm for analyzing text to compute a score, or confidence measure, that answers the question, "Do I have confidence that this language could really be the language of the document?" The language with the best confidence wins and is returned. When confidence in language identification is low, the service returns the default language instead of the identified language.
The algorithm selects a single language, even if the document contains multiple languages. Naturally, it favors the language in which the majority of the document's words are written.
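Conceptually, the selection resembles the following sketch. The scores and the threshold are invented for illustration; the service does not expose its models or confidence values.
# Invented confidence scores for illustration only.
scores = {'de': 0.91, 'nl': 0.42, 'da': 0.37}
CONFIDENCE_THRESHOLD = 0.5   # assumed value, not published by the service
default_language = 'en'

best = max(scores, key=scores.get)
identified = best if scores[best] >= CONFIDENCE_THRESHOLD else default_language
print(identified)  # prints 'de'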
Meaning of the textSize value
The returned attribute textSize represents the amount of character data in the input, not the number of bytes. If the input is a plain text file without accented characters, textSize equals the input file's size. However, if the input is a binary file such as a PDF or Microsoft Word document, textSize will probably be much smaller than the file size, especially if the file contains a lot of non-textual data such as an embedded image.
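You can see the difference between characters and bytes with a short experiment; the sample string below is only an example.
text = 'Je suis désolé, Dave.'
print(len(text))                  # character count, which is what textSize reports
print(len(text.encode('utf-8')))  # byte count, larger because of the accented characters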
Python Tutorials
Simple example with static data and default options
In this simple Python tutorial, you'll pass some static text to the service and print out the JSON response.
Get an access token
To use the service, you must pass an access token in each call. Get the token from the OAuth2 service.
import requests
import json
# Replace the two following values with your client secret and client ID.
client_secret = 'clientSecretPlaceholder'
client_id = 'clientIDPlaceholder'
s = requests.Session()
# Get the access token from the OAuth2 service.
auth_url = 'https://api.beta.yaas.io/hybris/oauth2/v1/token'
r = s.post(auth_url, data= {'client_secret':client_secret, 'client_id':client_id,'grant_type':'client_credentials'})
access_token = r.json()['access_token']
Identify the language
The POST request includes a single item: the text for language identification. You could load the contents of a binary file, web page, text stream, or other resource into the request. In this example, you use static text that you typed into your Python source code.
# The Language Identification service's URL
service_url = 'https://api.beta.yaas.io/sap/ta-language/v1/'
# The example text whose language you want to determine.
service_text = 'Guten Morgen, Herr Veezner. Wie geht es? Haben Sie etwas zu verzollen?'
# HTTP request headers
req_headers = {}
# Set content-type to 'application/json' to pass a plain text document to the service. Specify the plain text's encoding as UTF-8.
req_headers['content-type'] = 'application/json; charset=utf-8'
req_headers['Cache-Control'] = 'no-cache'
req_headers['Connection'] = 'keep-alive'
req_headers['Accept-Encoding'] = 'gzip'
req_headers['Authorization'] = 'Bearer {}'.format(access_token)
# Make the REST call to the Language Identification service
response = s.post(url = service_url, headers = req_headers, data = json.dumps({'text':service_text}))
Finally, display the returned JSON response on your console. For this tutorial, the response is:
{
"mimeType": "text/plain",
"textSize": 70,
"language": "de"
}
where "de" is the ISO 639-1 code for German.
# Print result
if response.status_code == 200:
    print(json.dumps(json.loads(response.text), indent=4))
else:
    print('Error', response.status_code)
    print(response.text)
Specify a list of languages in a binary document
In this example, you know that the text must be in one of three languages: English, French, or Spanish. You have a Microsoft Word file named French.docx, containing the text "Je suis désolé, Dave. Je crains de ne pas pouvoir faire ça."
Read the contents of the Word file
filename = 'French.docx'
with open(filename, 'rb') as f:
    service_binarydata = f.read()
Identify the language as either English, Spanish, or French
This example's POST request includes the languageCodes parameter to force the service to select from three languages.
# Append '?languageCodes=es%2Cen%2Cfr' to the Language Identification service URL to restrict identification to Spanish ('es'), English ('en'), or French ('fr'). Separate the language codes with commas, represented as '%2C' within URLs.
service_url = 'https://api.beta.yaas.io/sap/ta-language/v1?languageCodes=es%2Cen%2Cfr'
# HTTP request headers
req_headers = {}
# This tutorial uses a different content-type value because it passes binary data.
req_headers['content-type'] = 'application/octet-stream'
req_headers['Cache-Control'] = 'no-cache'
req_headers['Connection'] = 'keep-alive'
req_headers['Accept-Encoding'] = 'gzip'
req_headers['Authorization'] = 'Bearer {}'.format(access_token)
# Make the REST call to the Language Identification service. Pass the raw binary data. Do not base64-encode the data.
response = s.post(url = service_url, headers = req_headers, data = service_binarydata)
Print the language on the console
Like the first example in this tutorial, the JSON response includes three attributes: the MIME type of the input, the length of the document's text in characters, and the two-letter ISO 639-1 code of the text's language as determined by the service. The JSON returned in this example is:
{
"mimeType": "application/msword",
"textSize": 64,
"language": "fr"
}
The binary data is correctly identified as Microsoft Word and the language of the text is correctly identified as French. Retrieve the language attribute's value and convert it to its full name.
The value returned for textSize is 64, while the text "Je suis désolé, Dave. Je crains de ne pas pouvoir faire ça." contains 59 characters. The apparent discrepancy is accounted for by the conversion of the Microsoft Word file to a plain-text equivalent within the Text Analysis technology; the accented characters 'é' (which appears twice) and 'ç' are represented internally by multi-byte sequences.
# Translate 2-letter language abbreviations to full English name
language_names = {
'en' : 'English',
'ar' : 'Arabic',
'ca' : 'Catalan',
'cs' : 'Czech',
'da' : 'Danish',
'de' : 'German',
'es' : 'Spanish',
'fa' : 'Persian',
'fr' : 'French',
'hr' : 'Croatian',
'hu' : 'Hungarian',
'id' : 'Indonesian',
'it' : 'Italian',
'iw' : 'Hebrew',
'ja' : 'Japanese',
'ko' : 'Korean',
'nb' : 'Bokmål',
'nl' : 'Dutch',
'nn' : 'Nynorsk',
'pl' : 'Polish',
'pt' : 'Portuguese',
'ro' : 'Romanian',
'ru' : 'Russian',
'sh' : 'Serbo-Croatian',
'sk' : 'Slovak',
'sl' : 'Slovenian',
'sr' : 'Serbian',
'sv' : 'Swedish',
'th' : 'Thai',
'tr' : 'Turkish',
'zh' : 'Simplified Chinese',
'zf' : 'Traditional Chinese'
}
if response.status_code == 200:
    # De-serialize the JSON reply, get the language attribute, and print its full name.
    response_dict = json.loads(response.text)
    response_language_code = response_dict.get('language', 'Oops - return is missing the language attribute.')
    response_language_name = language_names.get(response_language_code, 'Oops - no name for returned language code "' + response_language_code + '".')
    print('The language of the document was identified as "' + response_language_code + '" (' + response_language_name + ').')
else:
    print('Error', response.status_code)
    print(response.text)