Overview
The Language Identification service of the Text Analysis Linguistic Analysis package is part of a series of enterprise-grade natural-language products for gaining new insights from unstructured text sources. This service identifies the language of the text in files such as documents, e-mails, and web pages. For example, given the sentence “Les premiers taxis sans chauffeur ont commencé à circuler jeudi 25 août, à Singapour.”, the service identifies the language as French.
Use Cases
As an example, assume you have a collection of news articles submitted by readers from around the world. The articles are in various formats, such as PDF, web pages, and Microsoft Word. After a user indicates the languages they can read, you want to show them only articles written in those languages. When building your database of articles, you pass each one to this service and tag it with the language this service returns. If your collection contains articles in either English or French but no other languages, you can restrict this service to return only English or French by specifying those two languages in the languageCodes parameter.
In another application, you create a search index for those news articles so that users can search for articles containing specific terms. You first invoke this service to determine the article's language and then call the Linguistic Analysis service to produce the individual tokens you place in your search index. You set the Linguistic Analysis service's languageCodes parameter to the value returned from this service.
This service can identify 33 languages: Arabic, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Farsi, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian (Bokmål), Norwegian (Nynorsk), Polish, Portuguese, Romanian, Russian, Serbian (Cyrillic), Serbian (Latin), Slovak, Slovenian, Spanish, Swedish, Thai, and Turkish.
The service accepts input in a wide variety of formats:
- Adobe PDF
- Generic email messages (.eml)
- HTML
- Microsoft Excel
- Microsoft Outlook email messages (.msg)
- Microsoft PowerPoint
- Microsoft Word
- Open Document Presentation
- Open Document Spreadsheet
- Open Document Text
- Plain Text
- Rich Text Format (RTF)
- WordPerfect
- XML
The size of each input file is limited to 1 MB.
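If you are not sure whether a document fits within this limit, you can check its size before you submit it. The following Python sketch is only an example; the file name is a placeholder, and the limit is taken from the 1 MB figure above.
import os
# The documented limit of 1 MB, expressed in bytes.
MAX_BYTES = 1 * 1024 * 1024
# Placeholder file name; replace it with the document you want to submit.
filename = 'example.docx'
if os.path.getsize(filename) > MAX_BYTES:
    print('File is larger than 1 MB; reduce it before calling the service.')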
API Reference
POST /
Given a list of languages, determines the one in which a document is most likely written. Returns the language of the input, its MIME type, and its length. The accuracy of the result depends on the amount of input text supplied. If fewer than 30 characters are supplied, the first language code in languageCodes is returned, or en if languageCodes is not specified.
Minimum Text Length
The Language Identification service requires at least 30 characters of text to make a reasonably accurate guess of the language. If the input contains fewer than 30 characters, the service does not attempt to make a guess and instead returns the default language.
Default Language
The default language is either the first value listed in the languageCodes input parameter (see Setting a subset of languages in this topic) or English if the languageCodes input parameter is not specified.
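The following sketch restates this fallback rule in Python; it is an illustration of the documented behavior, not code from the service.
# Illustration of the default-language rule for inputs shorter than 30 characters.
def default_language(language_codes=None):
    # language_codes mirrors the languageCodes parameter, for example ['es', 'en', 'fr'].
    return language_codes[0] if language_codes else 'en'

print(default_language(['es', 'en', 'fr']))  # prints 'es'
print(default_language())                    # prints 'en'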
Setting a subset of languages
You can instruct the service to choose from a specific, reduced set of languages by setting the languageCodes input parameter. This forces the service to choose from one of the languages you supply.
Use this setting with caution. If, for example, you set languageCodes to Danish, German, and Dutch and the input text is in Russian, the service cannot return Russian. It must return the default language.
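You pass the subset as a comma-separated list of language codes in the URL's query string, as the second Python tutorial below shows. If you prefer not to encode the comma as %2C by hand, you can let the requests library build the query string, as in this sketch; the codes shown are only an example subset.
import requests
# Let requests URL-encode the languageCodes value instead of writing %2C yourself.
req = requests.Request(
    'POST',
    'https://api.beta.yaas.io/sap/ta-language/v1',
    params={'languageCodes': 'da,de,nl'},  # example subset: Danish, German, Dutch
)
print(req.prepare().url)
# https://api.beta.yaas.io/sap/ta-language/v1?languageCodes=da%2Cde%2Cnl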
Internal Algorithm
The service includes a model for each of the languages it can identify. A model contains data about a language and an algorithm for analyzing text to compute a score, or confidence measure, that answers the question, "Do I have confidence that this language could really be the language of the document?" The language with the best confidence wins and is returned. When confidence in language identification is low, the service returns the default language instead of the identified language.
The algorithm selects a single language, even if the document contains multiple languages. Naturally, it favors the language in which the majority of the document's words are written.
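Conceptually, the selection resembles the following sketch. The scores and the threshold are invented for illustration; the service does not expose its models or confidence values.
# Invented confidence scores for illustration only.
scores = {'de': 0.91, 'nl': 0.42, 'da': 0.37}
CONFIDENCE_THRESHOLD = 0.5   # assumed value, not published by the service
default_language = 'en'

best = max(scores, key=scores.get)
identified = best if scores[best] >= CONFIDENCE_THRESHOLD else default_language
print(identified)  # prints 'de'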
Meaning of the textSize value
The returned attribute textSize represents the amount of character data in the input, not the number of bytes. If the input is a plain text file without accented characters, textSize equals the input file's size. However, if the input is a binary file such as a PDF or Microsoft Word document, textSize will probably be much smaller than the file size, especially if the file contains a lot of non-textual data such as an embedded image.
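You can see the difference between characters and bytes with a short experiment; the sample string below is only an example.
text = 'Je suis désolé, Dave.'
print(len(text))                  # character count, which is what textSize reports
print(len(text.encode('utf-8')))  # byte count, larger because of the accented characters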
Python Tutorials
Simple example with static data and default options
In this simple Python tutorial, you'll pass some static text to the service and print out the JSON response.
Get an access token
To use the service, you must pass an access token in each call. Get the token from the OAuth2 service.
import requests
import json
# Replace the two following values with your client secret and client ID.
client_secret = 'clientSecretPlaceholder'
client_id = 'clientIDPlaceholder'
s = requests.Session()
# Get the access token from the OAuth2 service.
auth_url = 'https://api.beta.yaas.io/hybris/oauth2/v1/token'
r = s.post(auth_url, data= {'client_secret':client_secret, 'client_id':client_id,'grant_type':'client_credentials'})
access_token = r.json()['access_token']
Identify the language
The POST request includes a single item: the text for language identification. You could load the contents of a binary file, web page, text stream, or other resource into the request. In this example, you use static text that you typed into your Python source code.
# The Language Identification service's URL
service_url = 'https://api.beta.yaas.io/sap/ta-language/v1/'
# The example text whose language you want to determine.
service_text = 'Guten Morgen, Herr Veezner. Wie geht es? Haben Sie etwas zu verzollen?'
# HTTP request headers
req_headers = {}
# Set content-type to 'application/json' to pass a plain text document to the service. Specify the plain text's encoding as UTF-8.
req_headers['content-type'] = 'application/json; charset=utf-8'
req_headers['Cache-Control'] = 'no-cache'
req_headers['Connection'] = 'keep-alive'
req_headers['Accept-Encoding'] = 'gzip'
req_headers['Authorization'] = 'Bearer {}'.format(access_token)
# Make the REST call to the Language Identification service
response = s.post(url = service_url, headers = req_headers, data = json.dumps({'text':service_text}))
Finally, display the returned JSON response on your console. For this tutorial, the response is:
{
"mimeType": "text/plain",
"textSize": 70,
"language": "de"
}
where "de" is the ISO 639-1 code for German.
# Print result
if response.status_code == 200:
    print(json.dumps(json.loads(response.text), indent=4))
else:
    print('Error', response.status_code)
    print(response.text)
Specify a list of languages in a binary document
In this example, you know that the text must be in one of three languages: English, French, or Spanish. You have a Microsoft Word file named French.docx, containing the text "Je suis désolé, Dave. Je crains de ne pas pouvoir faire ça."
Read the contents of the Word file
filename = 'French.docx'
with open(filename, 'rb') as f:
    service_binarydata = f.read()
Identify the language as either English, Spanish, or French
This example's POST request includes the languageCodes parameter to force the service to select from three languages.
# Append '?languageCodes=es%2Cen%2Cfr' to the Language Identification service URL to restrict identification to Spanish ('es'), English ('en'), or French ('fr'). Separate the language codes with commas, represented as '%2C' within URLs.
service_url = 'https://api.beta.yaas.io/sap/ta-language/v1?languageCodes=es%2Cen%2Cfr'
# HTTP request headers
req_headers = {}
# This tutorial uses a different content-type value because it passes binary data.
req_headers['content-type'] = 'application/octet-stream'
req_headers['Cache-Control'] = 'no-cache'
req_headers['Connection'] = 'keep-alive'
req_headers['Accept-Encoding'] = 'gzip'
req_headers['Authorization'] = 'Bearer {}'.format(access_token)
# Make the REST call to the Language Identification service. Pass the raw binary data. Do not base64-encode the data.
response = s.post(url = service_url, headers = req_headers, data = service_binarydata)
Print the language on the console
Like the first example in this tutorial, the JSON response includes three attributes: the MIME type of the input, the length of the document's text in characters, and the two-letter ISO 639-1 code of the text's language as determined by the service. The JSON returned in this example is:
{
"mimeType": "application/msword",
"textSize": 64,
"language": "fr"
}
The binary data is correctly identified as Microsoft Word and the language of the text is correctly identified as French. Retrieve the language attribute's value and convert it to its full name.
The value returned for textSize is 64, while the text "Je suis désolé, Dave. Je crains de ne pas pouvoir faire ça." contains 59 characters. The apparent discrepancy is accounted for by the conversion of the Microsoft Word file to a plain-text equivalent within the Text Analysis technology; the accented characters 'é' (which appears twice) and 'ç' are represented internally by multi-byte sequences.
# Translate 2-letter language abbreviations to full English name
language_names = {
'en' : 'English',
'ar' : 'Arabic',
'ca' : 'Catalan',
'cs' : 'Czech',
'da' : 'Danish',
'de' : 'German',
'es' : 'Spanish',
'fa' : 'Persian',
'fr' : 'French',
'hr' : 'Croatian',
'hu' : 'Hungarian',
'id' : 'Indonesian',
'it' : 'Italian',
'iw' : 'Hebrew',
'ja' : 'Japanese',
'ko' : 'Korean',
'nb' : 'Bokmål',
'nl' : 'Dutch',
'nn' : 'Nynorsk',
'pl' : 'Polish',
'pt' : 'Portuguese',
'ro' : 'Romanian',
'ru' : 'Russian',
'sh' : 'Serbo-Croatian',
'sk' : 'Slovak',
'sl' : 'Slovenian',
'sr' : 'Serbian',
'sv' : 'Swedish',
'th' : 'Thai',
'tr' : 'Turkish',
'zh' : 'Simplified Chinese',
'zf' : 'Traditional Chinese'
}
if response.status_code == 200:
    # De-serialize the JSON reply, get the language attribute, and print its full name.
    response_dict = json.loads(response.text)
    response_language_code = response_dict.get('language', 'Oops - return is missing the language attribute.')
    response_language_name = language_names.get(response_language_code, 'Oops - no name for returned language code "' + response_language_code + '".')
    print('The language of the document was identified as "' + response_language_code + '" (' + response_language_name + ').')
else:
    print('Error', response.status_code)
    print(response.text)