Text Similarity

Provides multiple text-based similarity algorithms to measure the similarity of input text pairs. The provided algorithms are tuned to measure similarity both in the representation (syntax) and the meaning (semantics) of the text content.

Text Similarity

Performs similarity analysis, either syntactic or semantic, on the given text pair.

Security: api_key
Request
Request Body schema: application/json
required
text1
required
string non-empty

The text content, in UTF-8 encoding

lang1
required
string

The two-letter language code

Enum: "en" "fr" "es"
text2
required
string non-empty

The text content, in UTF-8 encoding

lang2
required
string

The two-letter language code

Enum: "en" "fr" "es"
algo
string

Similarity Algorithms

Syntactic Similarity

The syntactic similarity algorithms focus exclusively on the representational features of text. The most dominant of these features is the set of tokens (characters and words) being used. Different syntactic similarity algorithms exploit these features differently to provide a measure of similarity between an input text pair. The similarity is measured on a scale of 0 to 1, where 1 represents the best possible match and 0 indicates no match. In addition to the base algorithms, we also apply character- and/or word-based shingles to add context and increase similarity accuracy. The following syntactic similarity algorithms are supported:

  1. syn.cosine-with-shingles: This represents the combination of applying character-based shingles to the classic cosine similarity algorithm.
  2. syn.sorensen_dice-shingles: This represents the combination of applying character-based shingles to the classic Sørensen–Dice coefficient algorithm.
  3. syn.jw-shingles: This represents the combination of applying character-based shingles to the classic Jaro–Winkler algorithm that is similar in nature to edit distance based measures.
  4. syn.cosine-word: This represents the combination of applying word-based shingles to the classic cosine similarity algorithm. Compared to syn.cosine-with-shingles, this algorithm produces fewer false positives for larger pieces of text.
  5. syn.simple: A Semantax proprietary algorithm that is optimized for comparison speed and accuracy. It is based on the cosine similarity algorithm and it combines both character and word based shingles.
  6. syn.weighted-word: A Semantax proprietary algorithm that is optimized for comparison speed and accuracy. It is based on the classic Jaro–Winkler algorithm. Both character- and word-based shingles are combined in a weighted capacity to increase the impact of term frequency.
  7. syn.sentence: A Semantax proprietary algorithm derived from the classic cosine similarity algorithm. The main feature of this algorithm is the inclusion of NLP (natural language processing) primitives for higher accuracy of similarity comparisons. NLP processing includes lemmatization/stemming, term normalization etc. This algorithm is best suited for a single sentence, or a couple of short sentences as input.
  8. syn.paragraph: A Semantax proprietary algorithm that extends syn.sentence to compare a pair of input paragraphs (a set of sentences). In addition to the syn.sentence features, this algorithm also includes a weighted Jaccard Similarity score of the overlapping sentences across the input pair.
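To make the shingling approach concrete, the sketch below shows how character shingles can be combined with cosine similarity, in the spirit of syn.cosine-with-shingles. This is an illustrative approximation only, not the service's implementation; the shingle length `k=3` and the lowercasing step are assumptions.

```python
from collections import Counter
from math import sqrt

def char_shingles(text, k=3):
    """Frequency map of character k-shingles (k=3 is an assumed default)."""
    text = text.lower()
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

def cosine_shingle_similarity(a, b, k=3):
    """Cosine similarity over character-shingle frequency vectors, in [0, 1]."""
    va, vb = char_shingles(a, k), char_shingles(b, k)
    dot = sum(va[s] * vb[s] for s in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0
```

Identical inputs score 1.0, disjoint inputs score 0.0, and partial overlaps fall in between; word-based variants follow the same pattern with word n-grams in place of character n-grams.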

Semantic Similarity

The semantic similarity algorithms compare the input text pair based on the main concepts present in the text, regardless of the words used to represent those concepts. Roughly speaking, this is comparing the meaning of the two sentences independent of their wording. Our semantic similarity algorithms are built with modern deep-learning word embeddings trained on an enterprise corpus of sample documents. The models are trained on single sentences and/or short paragraphs as input, and therefore work best for content in that size range.

All of our semantic similarity algorithms support multilingual and cross-lingual scenarios, where the input text pair can be expressed in any of the supported languages (for example en-en, en-fr, en-es, fr-fr, fr-es, etc.). The following semantic similarity algorithms are supported:

  1. sem.ssm: The default semantic similarity algorithm, offering the best combination of speed and accuracy with an emphasis on English-to-English common-language input pairs.
  2. sem.ssm14: This semantic similarity model is trained on data from the government, insurance, and banking industry verticals. The model is optimized for speed but provides a good level of overall accuracy.
  3. sem.ssm20: Similar to sem.ssm14, but built on a much larger input corpus.
  4. sem.ssm28: Builds on the same approach as the previous two models but also includes basic support for higher semantic relationships such as negation.
  5. sem.ssm30: Similar to ssm28, with better similarity score distribution.
Enum: "syn.weighted-word" "syn.simple" "syn.cosine-with-shingles" "syn.sorensen_dice-shingles" "syn.cosine-word" "syn.jw-shingles" "syn.paragraph" "syn.sentence" "sem.ssm" "sem.ssm14" "sem.ssm20" "sem.ssm28" "sem.ssm30"
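The cross-lingual behavior described above rests on mapping sentences in any supported language into one shared vector space and comparing the resulting vectors. The sketch below illustrates that idea with hand-made toy embeddings; the vectors, the sentence-level lookup, and the three-dimensional space are all illustrative stand-ins for a trained model.

```python
from math import sqrt

# Hypothetical toy embeddings -- a real model maps sentences to dense
# vectors learned from a training corpus; these values are illustrative only.
TOY_EMBEDDINGS = {
    "how are you": [0.8, 0.1, 0.3],
    "comment allez-vous": [0.79, 0.12, 0.28],  # French, similar meaning
    "the invoice is overdue": [0.05, 0.9, 0.2],  # unrelated meaning
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_similarity(text1, text2):
    """Cross-lingual comparison: both texts map into one shared vector space."""
    return cosine(TOY_EMBEDDINGS[text1], TOY_EMBEDDINGS[text2])
```

Because "how are you" and "comment allez-vous" land near each other in the shared space, they score high despite sharing almost no tokens, which is exactly the case syntactic algorithms miss.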
Responses
200

200 response

400

Invalid request body

403

The request is forbidden (Please input a valid API key)

POST /text/similarity
Request samples
application/json
{
  "text1": "how are you",
  "lang1": "en",
  "text2": "how old are you",
  "lang2": "en",
  "algo": "syn.cosine-word"
}
Response samples
application/json
{
  "status": { },
  "result": { }
}
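The request above can be issued from the Python standard library as sketched below. The base URL is a placeholder to replace with your deployment's endpoint, and passing the key in an `api_key` header is an assumption -- consult your credentials setup for the exact security scheme.

```python
import json
from urllib import request

# Assumed base URL -- substitute your deployment's actual endpoint.
API_URL = "https://api.example.com/text/similarity"

def build_payload(text1, lang1, text2, lang2, algo="sem.ssm"):
    """Assemble the POST /text/similarity request body."""
    return {
        "text1": text1, "lang1": lang1,
        "text2": text2, "lang2": lang2,
        "algo": algo,
    }

def similarity_request(payload, api_key):
    """Send the request and return the parsed JSON response body."""
    req = request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "api_key": api_key,  # assumed header name for the api_key scheme
        },
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

A 400 response indicates a body that fails the schema (for example, an empty `text1` or an unsupported `lang1`), and a 403 indicates a missing or invalid API key.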