Segmentation and tag rules for TMX


censhare Translation with Memory works with a rule-based segmentation process for texts and inline tags. Learn how to customize the rule sets to achieve the desired segmentation.

The segmentation and tag rules are stored in two files. censhare uses two standardized XML formats: SRX (Segmentation Rules eXchange) for the segmentation rules and ITS (Internationalization Tag Set) for the tag rules.

Introduction

The segmentation rules define syntax-based rules for the creation of translation units. For example, each phrase or sentence is considered a translation unit. censhare uses the segmentation rules to detect the beginning and end of a phrase. The rules also contain a set of language-specific abbreviations and special cases to ensure the correct segmentation of a text. In most cases, you do not have to adjust the segmentation rules. Before you edit the segmentation rules, familiarize yourself with the SRX format and the specific rules for the language you want to edit.

Besides the segmentation rules, censhare applies tag rules to create translation units and mark the text to be translated. Tag rules are based on the XML schema of the source document. For each element, the tag rules define whether it is structural or inline, and whether its content is translated. The tag rules also define whether attributes are translatable. The tag rules must be specified for the XML schema of your source files. You have to adjust and properly test the tag rules to get the desired results.

Segmentation rules

The segmentation of a text into translation units requires a set of rules that describe the syntactic construction of phrases and sentences in a given language. Text segmented this way can be exported to and imported from translation tools. For example, the TMX schema contains the segments of a translation memory and meta information for the source and target language. The XLIFF schema contains the segments and structural information of a specific document. Both schemas rely on the segmentation rules. The segmentation rules are stored in an SRX file. SRX is an XML-based schema that contains a rule set for each language. Each rule set takes into account punctuation marks, abbreviations, uppercase and lowercase rules, and the context in which they are used in that language.
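
To illustrate how such a rule set is structured, the following is a minimal sketch of an SRX 2.0 file. It is not taken from the standard censhare file: the rule set name "English", the regular expressions, and the language pattern are assumptions chosen for illustration. Rules are evaluated in order, so exceptions (break="no") are listed before the general break rule.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<srx xmlns="http://www.lisa.org/srx20" version="2.0">
  <header segmentsubflows="yes" cascade="yes"/>
  <body>
    <languagerules>
      <!-- Illustrative rule set; name and patterns are assumptions -->
      <languagerule languagerulename="English">
        <!-- Exception: no segment break after the abbreviation "e.g." -->
        <rule break="no">
          <beforebreak>\be\.g\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <!-- General rule: break after sentence-final punctuation
             when whitespace and an uppercase letter follow -->
        <rule break="yes">
          <beforebreak>[\.\?!]+</beforebreak>
          <afterbreak>\s+\p{Lu}</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
    <maprules>
      <!-- Maps language codes that start with "en" to the rule set above -->
      <languagemap languagepattern="en.*" languagerulename="English"/>
    </maprules>
  </body>
</srx>
```

With these two rules, "Use a tool, e.g. Okapi. Then test." would be split after "Okapi." but not after "e.g.", because the exception matches first.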

Languages

The standard SRX file in censhare contains rule sets for these languages:

  • Breton

  • Chinese

  • Danish

  • German

  • English

  • Esperanto

  • French

  • Galician

  • Greek

  • Icelandic

  • Italian

  • Japanese

  • Catalan

  • Dutch

  • Persian

  • Polish

  • Portuguese

  • Romanian

  • Russian

  • Slovak

  • Slovenian

  • Spanish

  • Tamil

  • Ukrainian

  • Belarusian

(1) Besides these languages, the SRX rules contain a generic rule set. These rules are applicable in most use cases.

(2) You only have to adjust the rules if your organization or company uses terminology or abbreviations that cause unwanted segmentation. For example, if you use abbreviations that end with a dot and do not add them to the list, censhare interprets the dot as the end of a phrase and creates a new segment.

(3) If the desired language is not in the list, you can add a new rule set to the SRX file.
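
As a sketch of both adjustments, the fragments below show a no-break exception for company abbreviations and a rule set for an additional language. All names and patterns are hypothetical: the abbreviations "Abt." and "Art.-Nr.", the rule set names "German" and "Swedish", and the language pattern "sv.*" are assumptions. Check the names and language patterns actually used in your SRX file before you adopt anything like this.

```xml
<!-- 1. Add inside the existing <languagerule> of the affected language
        (here assumed to be named "German"), before its break rules.
        Prevents a segment break after "Abt." and "Art.-Nr." -->
<rule break="no">
  <beforebreak>\b(Abt|Art\.-Nr)\.</beforebreak>
  <afterbreak>\s</afterbreak>
</rule>

<!-- 2. Add inside <languagerules>: hypothetical rule set for a new language -->
<languagerule languagerulename="Swedish">
  <!-- Exception: common abbreviation "t.ex." does not end a segment -->
  <rule break="no">
    <beforebreak>\bt\.ex\.</beforebreak>
    <afterbreak>\s</afterbreak>
  </rule>
  <!-- General break rule -->
  <rule break="yes">
    <beforebreak>[\.\?!]+</beforebreak>
    <afterbreak>\s+\p{Lu}</afterbreak>
  </rule>
</languagerule>

<!-- 3. Add inside <maprules>: map the language code to the new rule set -->
<languagemap languagepattern="sv.*" languagerulename="Swedish"/>
```

Because rules are evaluated in order, place exceptions before the general break rule of the same rule set. Otherwise, the break rule matches first and the exception has no effect.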

Create a custom SRX file

To create a custom SRX file, do the following:

  1. Locate the standard SRX file at /censhare_Server/app/services/babelfish/LanguageTool_SRX.srx.

  2. Copy this file to the /censhare-Custom/censhare_Server/app/services/babelfish/ directory. If the babelfish directory does not exist in this path, create it.

  3. Edit the SRX file and save your changes. Do not change the name of the file.

  4. Restart your server and, if necessary, synchronize the remote servers.

Internationalization tag set (ITS)

Note: The tag rules apply only to Translation with Memory. For the XLIFF exporter/importer, censhare uses a separate tag mapping configuration. For details, see Configure the XLIFF export/import.

The ITS rules describe how translation units (segments) are created on the basis of the XML schema of the source document. Tag rules describe the translation rules for structural elements, inline elements and attributes:

| Type | Examples | Translation rule |
|---|---|---|
| Structural elements | <headline/>, <text/>, <paragraph/> | Translate and create a segment. |
| Inline elements | <bold/>, <italic/>, <link/>, <indexterm/> | Translate, but keep within the parent element and do not create a new segment. |
| Attributes | title, alt, tooltip | Translate. |
| Attributes | id, class, href | Do not translate. |

The following example shows a snippet with different element and attribute types:

CODE
<headline id="toc-1">This is a headline</headline>
<paragraph>This is <bold>bold</bold> text.</paragraph>
<paragraph>This is <italic>italic</italic> text.</paragraph>
<paragraph>And this is a <link href="/home" title="Go back to Homepage">Link</link>.</paragraph>


The ITS file that describes the translation rules for these elements:

CODE
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<its:rules xmlns:its="http://www.w3.org/2005/11/its"
           xmlns:itsx="http://www.w3.org/2008/12/its-extensions"
           xmlns:okp="okapi-framework:xmlfilter-options"
           xmlns:xlink="http://www.w3.org/1999/xlink" version="1.0">

   <!-- Structural elements -->
   <its:translateRule selector="//headline" translate="yes"/>
   <its:translateRule selector="//paragraph" translate="yes"/>

   <!-- Inline elements -->
   <its:translateRule selector="//bold" translate="yes" withinText="yes" />
   <its:translateRule selector="//italic" translate="yes" withinText="yes" />
   <its:translateRule selector="//link" translate="yes" withinText="yes" />

   <!-- Attributes -->
   <its:translateRule selector="//headline/@id" translate="no"/>
   <its:translateRule selector="//link/@title" translate="yes"/>
   <its:translateRule selector="//link/@href" translate="no"/>
</its:rules>

The ITS rules must be adjusted for your document schema. You can use global ITS rules in your system or define domain-specific ITS rules.
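
When you adapt the rules, it can help to compare them with the notation of the W3C ITS specification. There, inline behavior is expressed with a separate its:withinTextRule element, elements are translatable by default, and attributes are not translatable by default. The following is a minimal sketch in that notation for the element names used in the table above. Whether the censhare filter expects this form or the withinText attribute on its:translateRule is not stated here, so treat the sketch as an assumption and test it with your own documents.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">

  <!-- Structural elements: translatable, each one starts its own segment -->
  <its:translateRule selector="//headline | //paragraph" translate="yes"/>

  <!-- Inline elements: remain inside the segment of their parent element -->
  <its:withinTextRule selector="//bold | //italic | //link | //indexterm"
                      withinText="yes"/>

  <!-- Attributes are not translated by default in ITS, so opt in explicitly -->
  <its:translateRule selector="//link/@title" translate="yes"/>

  <!-- Stated explicitly for readability; this matches the ITS default -->
  <its:translateRule selector="//headline/@id | //link/@href" translate="no"/>
</its:rules>
```

Combining selectors with the XPath union operator keeps the file short when many elements share the same behavior.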

Global ITS rules

To configure global ITS rules, do the following:

  1. Locate the standard rule set at /censhare-Server/app/services/babelfish/okf_xml@censhare.fprm.

  2. Copy this file to the /censhare-Custom/censhare_Server/app/services/babelfish/ directory. If the babelfish directory does not exist in this path, create it.

  3. Edit the file and save your changes. Do not change the name of the file.

  4. Restart your server and, if necessary, synchronize the remote servers.

Domain-specific ITS rules

For each domain in which you want to define domain-specific ITS rules, do the following:

  1. Copy the standard rule set /censhare-Server/app/services/babelfish/okf_xml@censhare.fprm to your computer.

  2. Edit the file in an XML editor and save the changes.

  3. Upload the file in censhare Web. censhare creates an asset from the uploaded file.

  4. Open the asset you created and start the Developer mode.

  5. In the Details tab, edit the Status widget.

  6. Select the type Text and select the desired Domain and 2nd domain.

  7. Click OK to close the dialog.

  8. In the Administration tab, edit the Generic asset administration widget.

  9. Enable the resource, enter a resource key and select the Usage "ITS rules".

  10. Click OK to close the dialog.

  11. Save the changes in the asset.
