An index is created for every feature in the censhare database (cdb). A full-text index is created in order for censhare users to search for partial words, individual words, or word groups.

Prerequisites

You must create an asset feature that you want to index. See Create asset feature.

Introduction

The recommended procedure for creating a full-text index is to create a second, so-called full-text index feature for each feature that needs to be indexed. In this full-text index feature, you then configure the feature(s) that you want to index. The full-text index can then be added to the virtual index of the Quick search. This article explains how to create the full-text index feature and how to configure the virtual Quick Search index.

Configuration

Basic feature configuration

To create an index feature for an existing asset feature, do the following:

  1. In the censhare Admin Client, in the Master data/Features table, click the plus icon to create a new feature. We recommend giving this feature the same name as the feature that you want to index, followed by the suffix "(full text)".

  2. As an ID, use the ID of the original feature (the one that you want to index) with a "text" prefix, suffix, or infix. For example: "companyName:text.featureName", where "featureName" is the original name of the indexed feature. The ID must be unique.

  3. In the Resource key field, proceed similarly. For example: "companyName:features.text.featureName", where "featureName" is the original name of the indexed feature. The resource key must be unique.

  4. In the Trait key field, enter "full_text_index". This key is used for all full-text index features.

  5. In the Feature key field, enter the original name of the feature without prefixes, infixes, or suffixes. For example: "featureName". The feature key does not have to be unique.

  6. In the Target object field, select Virtual. Explanation: Full-text index features are not assigned to any object in the censhare data model.

  7. In the Type field, select Asset attribute (internal) or Asset feature (internal). Both options work without restrictions. However, we recommend reserving the Asset feature option for actual asset features.

  8. In the Value type field, select Text (string).

  9. In the Storage field, select Full text. This option is specific to full-text index features.

  10. Disable the Searchable field. Otherwise, index features appear in the search results, which is not desired.
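
The naming pattern from steps 1 to 5 can be sketched as a small helper. This is only an illustration: the function name full_text_feature_fields and the dictionary keys are hypothetical and not part of any censhare API. The values follow the examples given in the steps above.

```python
def full_text_feature_fields(company: str, feature_name: str, display_name: str) -> dict:
    """Derive the suggested field values for a full-text index feature.

    Hypothetical helper: it only mirrors the naming pattern from the steps above.
    """
    return {
        "name": f"{display_name} (full text)",                      # step 1
        "id": f"{company}:text.{feature_name}",                     # step 2, must be unique
        "resource_key": f"{company}:features.text.{feature_name}",  # step 3, must be unique
        "trait_key": "full_text_index",                             # step 4, same for all index features
        "feature_key": feature_name,                                # step 5, need not be unique
        "searchable": False,                                        # step 10
    }


fields = full_text_feature_fields("companyName", "featureName", "Feature name")
print(fields["id"])
print(fields["resource_key"])
```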

Index configuration

In the Index configuration section, configure the full-text index as follows:

  1. Click the plus icon to add an index configuration.

  2. In the censhare default configuration area, in the Type field, select Full-text index. A new Full-text index sub-section displays inside the Custom configuration section.

  3. Configure the full-text index.

  4. In the Features sub-section, click the + icon to add a feature that you want to index.

  5. In the Name field, select the desired feature.

  6. In the Mode field, select what exactly is indexed:

  • Content: Indexes the value of a feature as is. If you selected default_text_index or content_index above, it indexes the content file.
  • Key: Indexes a key/value pair. Use this mode to index a human-readable value if an asset feature stores only a key or an ID. For example, the user ID/user name.
  • Reference: Indexes the name of an asset that is referenced in a feature, instead of the asset ID or asset resource key. This allows users to search for the name to find the respective reference. For example, the Author feature stores a reference to a person asset. To find a person that is referenced as the author of an asset, you must select this option.
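
The difference between the three modes can be illustrated with a short sketch. It assumes two hypothetical lookup tables (key_labels for key/value pairs, asset_names for referenced assets). The function and its parameters are invented for illustration and do not reflect how censhare resolves values internally.

```python
from typing import Mapping


def text_to_index(mode: str, stored_value: str,
                  key_labels: Mapping[str, str],
                  asset_names: Mapping[str, str]) -> str:
    """Return the text that would end up in the index for one feature value."""
    if mode == "content":
        # The stored value is indexed as is.
        return stored_value
    if mode == "key":
        # The feature stores only a key or ID; index its human-readable value.
        return key_labels[stored_value]
    if mode == "reference":
        # The feature stores an asset reference; index the asset name instead.
        return asset_names[stored_value]
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical sample data
key_labels = {"jdoe": "Jane Doe"}
asset_names = {"4711": "John Smith"}

print(text_to_index("content", "Annual report", key_labels, asset_names))
print(text_to_index("key", "jdoe", key_labels, asset_names))
print(text_to_index("reference", "4711", key_labels, asset_names))
```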

Single-feature or multi-feature indexes?

In general, it is best practice to index only a single feature in a full-text index. However, you can add multiple features to one full-text index. This can be helpful if you want to combine multiple similar features in one index. For example, add free-text keywords, keyword references, and hierarchical keyword values to one keyword index.

  • In the Character mapping sub-section, you can define replacements for special characters in the index. For example, replace special characters if you use multiple languages. See also the index parameter

Index properties

Field

Description

Default text index

Select to index the complete asset content (storage item). Warning: Indexing the content of large files can cause a high server load and poor system performance. We do not recommend using this option unless you are sure that the system can handle it. We recommend using the Content fulltext option (see below) to have more control over the indexed content. In any case, make sure that only one index feature indexes the content of your assets. The default feature for this purpose is the Content (full text) feature (resource key: censhare:text.content).

Content fulltext

Select to index part of the asset content (storage item) or to index specific content elements. Warning: Make sure that only one index feature indexes the content of your assets. The default content index feature is the Content (full text) feature (resource key: censhare:text.content). If you add a new content index feature, disable the default feature.

Maximum text file size (only if Content fulltext is selected)

Enter the number of characters that are indexed. Text files are indexed only up to this limit. If a file is larger than the limit, the remaining content is ignored in the index. This option allows you to control the load that content indexing causes on the censhare server. The default value in the Content (full text) feature is 1,000,000. To index the complete content regardless of the size of the text file, select the option Default text index (see above).

XPath to select content from (only if Content fulltext is selected)

Enter an XPath expression to index only specific elements of the content. For example, index only headlines and the teaser of the XML content. This option allows you to generate better indexes (and better matches in the search).

Minimum word length

Specifies the minimum number of characters that users must enter before the Quick search starts. Default is 3.

Use stop words

Stop words have syntactical and grammatical functions in texts and are usually excluded from the index. Examples are articles, prepositions, and auxiliary verbs. To explicitly include stop words in the full-text index, select this field.

For more information, see this Wikipedia article on stop words.
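
The effect of this field can be sketched as follows. The stop word list is a tiny, made-up sample and the function name is hypothetical. The sketch only shows the principle that stop words are dropped unless the option is selected.

```python
# Tiny illustrative stop word list; real lists are language-specific and much longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "is"}


def index_terms(text: str, use_stop_words: bool) -> list[str]:
    """Return the words that would be indexed for a text."""
    words = text.lower().split()
    if use_stop_words:
        # Option selected: stop words are explicitly included in the index.
        return words
    return [word for word in words if word not in STOP_WORDS]


print(index_terms("The cover of the magazine", use_stop_words=False))
print(index_terms("The cover of the magazine", use_stop_words=True))
```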

Use stemming

Enable this field to reduce a word to its root during indexing. Different forms of a word are then stored under the same stem, which reduces the number of word forms in the index. censhare uses algorithmic stemming. To work correctly, a word stem table is required for each language, and the respective asset language must be selected. Algorithmic stemming has its limits, for example, with irregular verbs like "take" (take, took, taken). Note: Algorithmic stemming can lead to undesired results. Therefore, stemming is disabled by default. If you want to use it, we recommend testing the results thoroughly for each language.
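
The following sketch shows why algorithmic stemming works for regular word forms but not for irregular verbs. It uses a deliberately naive suffix-stripping rule that was invented for this example. It is not the stemming algorithm that censhare uses.

```python
SUFFIXES = ("ing", "ed", "s")  # illustrative English suffixes only


def naive_stem(word: str) -> str:
    """Strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


regular = {naive_stem(w) for w in ("print", "prints", "printed", "printing")}
irregular = {naive_stem(w) for w in ("take", "took", "taken")}

print(regular)    # all regular forms collapse to one stem
print(irregular)  # the irregular forms stay separate
```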

Store term frequency

Enable this field to store how often a word occurs in the content. In running texts, the term frequency can be a meaningful indicator of the relevance of a word. Do not use this option in combination with stop words (see above). We also do not recommend using the term frequency for asset metadata. Note: If enabled, the index requires extra memory to store the frequency values for each word.
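
As a minimal illustration of what a term frequency is, the following sketch counts word occurrences in a text. The function name is hypothetical, and the simple whitespace tokenization is an assumption made for the example.

```python
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Count how often each lowercased word occurs in the text."""
    return Counter(text.lower().split())


frequencies = term_frequencies("Index the content and rebuild the index")
print(frequencies.most_common(2))
```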

Strip diacritic marks

Enable this field to remove diacritical marks in the index. For example, the characters "ö", "ç" or "é" are replaced by "o", "c" and "e". Users do not need to spell a word correctly to find the matching occurrences. For example, they can enter "garcon" in the search field to find texts that contain "garçon".

For more information, see this Wikipedia article on diacritical marks.
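
One common way to strip diacritical marks is Unicode decomposition, sketched below with the Python standard library. This illustrates the general technique only. How censhare implements the option internally is not described here.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("garçon"))
print(strip_diacritics("öçé"))
```

Characters without a canonical decomposition, such as "ß" or "ø", are left unchanged by this approach.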

Junction expand operator

Specifies how censhare handles multiple search terms that users enter in the search field. We recommend using the AND_NOT_EMPTY mode. This mode shows better search results if a user enters several search terms and one of these terms does not produce a match. The stricter AND mode only produces search results if all search terms that a user enters produce a match.
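
The difference between the two modes can be sketched with sets of matching asset IDs. The sketch assumes that AND_NOT_EMPTY ignores search terms without any match and intersects the remaining result sets. This reading is derived from the description above and is not a confirmed specification. The function name and the sample data are hypothetical.

```python
def combine_hits(term_hits: dict[str, set[int]], mode: str) -> set[int]:
    """Combine the per-term hit sets according to the junction mode."""
    hit_sets = list(term_hits.values())
    if mode == "AND_NOT_EMPTY":
        # Assumption: terms that produce no match are left out of the junction.
        hit_sets = [hits for hits in hit_sets if hits]
    if not hit_sets:
        return set()
    return set.intersection(*hit_sets)


# Hypothetical asset IDs found for each search term
hits = {"annual": {1, 2, 3}, "report": {2, 3}, "xyzzy": set()}

print(combine_hits(hits, "AND"))
print(combine_hits(hits, "AND_NOT_EMPTY"))
```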

Settings for fuzzy searches

Minimum word length

The fuzzy search only starts when the minimum word length is reached. The default value is 5 characters.

Maximum editing distance

Determines by how many characters the entered word can deviate from the matching word. censhare considers inserted, missing, or replaced characters at any position in a search term as a deviation. The higher the maximum deviation, the higher the degree of fuzziness. This creates more false matches. Values >2 are not recommended.
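
The deviation described here corresponds to the well-known Levenshtein distance, which counts inserted, missing, and replaced characters. The following sketch shows the standard dynamic-programming computation. Whether censhare uses exactly this metric is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # character missing in b
                current[j - 1] + 1,      # character inserted in b
                previous[j - 1] + cost,  # character replaced (or equal)
            ))
        previous = current
    return previous[-1]


print(edit_distance("magazine", "magzine"))
print(edit_distance("kitten", "sitting"))
```

With a maximum editing distance of 2, "magzine" would still be considered a match for "magazine", but "sitting" would not match "kitten".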

Maximum errors

This parameter is similar to the Maximum editing distance setting. It compensates for typos.

N-Gram size

Internal parameter. Do not change! The field specifies the fragment length used when an item of text is broken down into fragments.

For more information, see this Wikipedia article on n-grams.
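
To illustrate the term, the following sketch breaks a word into character n-grams of a given size. It shows the general concept only. How censhare builds its fragments internally (for example, whether it pads word boundaries) is not covered here.

```python
def char_ngrams(word: str, n: int) -> list[str]:
    """Return all overlapping character fragments of length n."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(char_ngrams("index", 3))
```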

Maximum prefix expansion

Specifies the number of hits before the full-text search stops when the search term is used as a prefix. censhare first searches for direct hits on the search term. For a subsequent prefix search, all words that begin with that prefix also count as hits.

Maximum infix expansion

Specifies the maximum number of hits which can occur before the full-text search stops when the search term is used as an infix. As a result, in addition to direct hits and prefix hits, this search also finds infix hits. Any word in which the search term occurs anywhere in the word (infix) counts as a hit here.
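
The three kinds of hits (direct, prefix, infix) and the effect of an expansion limit can be sketched as follows. The word list and the function are hypothetical. The sketch simply cuts off each list at its limit, which is a simplification of the stop behavior described above.

```python
def expand_term(term: str, vocabulary: list[str],
                max_prefix: int, max_infix: int) -> dict[str, list[str]]:
    """Classify indexed words as direct, prefix, or infix hits for a search term."""
    direct = [w for w in vocabulary if w == term]
    prefix = [w for w in vocabulary if w != term and w.startswith(term)]
    infix = [w for w in vocabulary if term in w and not w.startswith(term)]
    return {
        "direct": direct,
        "prefix": prefix[:max_prefix],  # simplified: truncate at the limit
        "infix": infix[:max_infix],
    }


vocabulary = ["print", "printer", "printing", "reprint", "blueprint", "paper"]
result = expand_term("print", vocabulary, max_prefix=1, max_infix=5)
print(result)
```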

Maximum fuzzy expansion

The fuzzy search creates a list of matches on every deviation level. This list contains all possible combinations for the search term. With increasing deviation, the Maximum fuzzy expansion prevents too many matches with low relevance. The fuzzy search stops when the number of hits in the censhare database (cdb) for the current deviation exceeds this value. See also Maximum editing distance above.

Number of hits to skip fuzzy search

Stops the full-text search before the fuzzy search starts if the direct, prefix, or infix matches already produce more hits than indicated in this field.

Number of hits to end fuzzy search

Stops the full-text search if a fuzzy search returns more matches than indicated in this field. This global parameter controls the number of matches of a fuzzy search. Matches with a low editing distance are prioritized.

Minimum non-numeric sequence

Specifies the minimum number of non-numeric characters (letters) that must occur in a search term for it to be interpreted as text. If a search term contains fewer non-numeric characters, the fuzzy search is not applied. For example, this prevents product IDs that contain a combination of numbers and letters from producing false matches.
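
The check can be pictured as follows. The sketch counts all letters in the search term, which is one possible reading of the description. Whether censhare requires the letters to be consecutive is not stated, so treat this as an assumption. The function name is hypothetical.

```python
def fuzzy_search_allowed(term: str, min_non_numeric: int) -> bool:
    """Allow the fuzzy search only if the term contains enough letters."""
    letters = sum(1 for ch in term if ch.isalpha())
    return letters >= min_non_numeric


print(fuzzy_search_allowed("AB12345", 3))   # product ID: too few letters
print(fuzzy_search_allowed("magazine", 3))  # regular word
```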

Scale

If you want to combine a full-text index with other full-text indexes in a virtual index (for example, in the Quick search), this parameter allows you to weight the index and give it a higher or lower relevance (in proportion to the other indexes and their scale values). If you use the full-text index as a single index, this parameter has no relevance.

For more information, see the article Customize the Quick search, search filters, and sorting options.
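
The idea of weighting indexes in proportion to their scale values can be sketched as follows. The index names, the scale values, and the formula (each scale divided by the sum of all scales) are assumptions made for illustration. The actual relevance calculation in censhare may differ.

```python
def relative_weights(scales: dict[str, float]) -> dict[str, float]:
    """Express each index scale as a share of the sum of all scales."""
    total = sum(scales.values())
    return {name: scale / total for name, scale in scales.items()}


def combined_score(scores: dict[str, float], scales: dict[str, float]) -> float:
    """Weight the per-index match scores by the relative scale of each index."""
    weights = relative_weights(scales)
    return sum(weights[name] * scores.get(name, 0.0) for name in weights)


# Hypothetical indexes in a virtual index
scales = {"name": 2.0, "keywords": 1.0, "content": 1.0}

print(relative_weights(scales))
print(combined_score({"name": 1.0}, scales))     # match in the higher-weighted index
print(combined_score({"content": 1.0}, scales))  # match in a lower-weighted index
```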



Virtual full-text index for the Quick search

The Quick search uses a virtual full-text index that indexes all features that are available in the Quick search. You can customize this index, if necessary.

Build and update an index in the censhare database

If you create a new feature and the respective full-text index, you do not have to build the index, as there are no assets with the respective feature. When users start to edit assets and add the new feature, the index is created automatically.

If you create a full-text index for an existing feature, if the indexed feature stores values that are calculated with an XPath expression, or if you change the index configuration of a full-text index, you must (re)build the index in the censhare database.

If you modify a feature that already has a full-text index, you must also rebuild the respective index. This applies, for example, if you change the value type or the XPath expression that calculates the feature.

Proceed as follows:

  1. In the censhare Admin Client, click and select Embedded database rebuild feature index.

  2. First, select the Feature that you want to index, and click OK.

  3. Run the Embedded database rebuild feature index server action again.

  4. Now, select the full-text index of the feature, and click OK.

Note: For new features, you can skip the first two steps. For existing features, always build the indexes in the sequence described above (first the feature index, then the full-text index). This ensures that the feature index is up to date before the full-text index is built. When you execute the server action, censhare deletes the index for the selected feature. Until the index is rebuilt, users cannot search for the feature. The build runs as an asynchronous process during ongoing operations. We recommend planning a maintenance window for index rebuilds whenever possible.

Enable the server action

If the Embedded database rebuild feature index server action is not available, you must enable it. Proceed as follows:

  1. In the censhare Admin Client, open the Configuration/Modules/Administration/Embedded database directory.

  2. Open the Embedded database rebuild feature index configuration.

  3. In the General setup section, select Enabled.

  4. Click OK to save the configuration.

  5. Update the server configuration and, if necessary, synchronize the remote servers.