Text Similarity

Provides multiple text-based similarity algorithms to measure the similarity of input text pairs. The provided algorithms are tuned to measure similarity both in the representation (syntax) and the meaning (semantics) of the text content.

Text Similarity

Performs similarity analysis, either syntactic or semantic, on the given text pair.

Security: api_key
Request
Request Body schema: application/json
required
text1
required
string non-empty

The text content, in UTF-8 encoding

lang1
required
string

The two-letter language code

Enum: "en" "fr" "es"
text2
required
string non-empty

The text content, in UTF-8 encoding

lang2
required
string

The two-letter language code

Enum: "en" "fr" "es"
algo
string

Similarity Algorithms

Syntactic Similarity

The syntactic similarity algorithms focus exclusively on the representational features of text. The most dominant of these features is the set of tokens (characters and words) being used. Different syntactic similarity algorithms exploit these features differently to provide a measure of similarity between an input text pair. The similarity is measured on a scale of 0 to 1, where 1 represents the best possible match and 0 indicates no match. In addition to the base algorithms, we also apply character- and/or word-based shingles to add context and increase similarity accuracy. The following syntactic similarity algorithms are supported:

  1. syn.cosine-with-shingles: This represents the combination of applying character-based shingles to the classic cosine similarity algorithm.
  2. syn.sorensen_dice-shingles: This represents the combination of applying character-based shingles to the classic Sørensen–Dice coefficient algorithm.
  3. syn.jw-shingles: This represents the combination of applying character-based shingles to the classic Jaro–Winkler algorithm that is similar in nature to edit distance based measures.
  4. syn.cosine-word: This represents the combination of applying word-based shingles to the classic cosine similarity algorithm. Compared to syn.cosine-with-shingles, this algorithm produces fewer false positives for larger pieces of text.
  5. syn.simple: A Semantax proprietary algorithm that is optimized for comparison speed and accuracy. It is based on the cosine similarity algorithm and it combines both character and word based shingles.
  6. syn.weighted-word: A Semantax proprietary algorithm that is optimized for comparison speed and accuracy. It is based on the classic Jaro–Winkler algorithm. Both character- and word-based shingles are combined in a weighted capacity to increase the impact of term frequency.
  7. syn.sentence: A Semantax proprietary algorithm derived from the classic cosine similarity algorithm. The main feature of this algorithm is the inclusion of NLP (natural language processing) primitives for higher accuracy of similarity comparisons. NLP processing includes lemmatization/stemming, term normalization etc. This algorithm is best suited for a single sentence, or a couple of short sentences as input.
  8. syn.paragraph: A Semantax proprietary algorithm that extends syn.sentence to compare a pair of input paragraphs (a set of sentences). In addition to the syn.sentence features, this algorithm also includes a weighted Jaccard Similarity score of the overlapping sentences across the input pair.
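To make the shingling approach concrete, the sketch below shows how character shingles can be combined with cosine similarity, in the spirit of syn.cosine-with-shingles. This is an illustrative approximation only, not the service's implementation; the shingle length `k=3` and the lowercasing step are assumptions.

```python
from collections import Counter
from math import sqrt

def char_shingles(text, k=3):
    """Frequency map of character k-shingles (k=3 is an assumed default)."""
    text = text.lower()
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

def cosine_shingle_similarity(a, b, k=3):
    """Cosine similarity over character-shingle frequency vectors, in [0, 1]."""
    va, vb = char_shingles(a, k), char_shingles(b, k)
    dot = sum(va[s] * vb[s] for s in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0
```

Identical inputs score 1.0, disjoint inputs score 0.0, and partial overlaps fall in between; word-based variants follow the same pattern with word n-grams in place of character n-grams.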

Semantic Similarity

The semantic similarity algorithms compare the input text pair based on the main concepts present in the text, regardless of the words used to represent those concepts. Roughly speaking, this is comparing the meaning of the two sentences independent of their wording. Our semantic similarity algorithms are built with modern deep-learning word embeddings trained on an enterprise corpus of sample documents. The models are trained on single sentences and/or short paragraphs as input, and therefore work best for content in that size range.

All of our semantic similarity algorithms support multilingual and cross-lingual scenarios, where the input text pair can be expressed in any of the supported languages (for example en-en, en-fr, en-es, fr-fr, fr-es, etc.). The following semantic similarity algorithms are supported:

  1. sem.ssm: The default semantic similarity algorithm, offering the best combination of speed and accuracy with an emphasis on English-to-English common-language input pairs.
  2. sem.ssm14: This semantic similarity model is trained on data from the government, insurance, and banking industry verticals. The model is optimized for speed but provides a good level of overall accuracy.
  3. sem.ssm20: Similar to sem.ssm14, but built on a much larger input corpus.
  4. sem.ssm28: Builds on the same approach as the previous two models but also includes basic support for higher semantic relationships such as negation.
  5. sem.ssm30: Similar to ssm28, with better similarity score distribution.
Enum: "syn.weighted-word" "syn.simple" "syn.cosine-with-shingles" "syn.sorensen_dice-shingles" "syn.cosine-word" "syn.jw-shingles" "syn.paragraph" "syn.sentence" "sem.ssm" "sem.ssm14" "sem.ssm20" "sem.ssm28" "sem.ssm30"
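The cross-lingual behavior described above rests on mapping sentences in any supported language into one shared vector space and comparing the resulting vectors. The sketch below illustrates that idea with hand-made toy embeddings; the vectors, the sentence-level lookup, and the three-dimensional space are all illustrative stand-ins for a trained model.

```python
from math import sqrt

# Hypothetical toy embeddings -- a real model maps sentences to dense
# vectors learned from a training corpus; these values are illustrative only.
TOY_EMBEDDINGS = {
    "how are you": [0.8, 0.1, 0.3],
    "comment allez-vous": [0.79, 0.12, 0.28],  # French, similar meaning
    "the invoice is overdue": [0.05, 0.9, 0.2],  # unrelated meaning
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_similarity(text1, text2):
    """Cross-lingual comparison: both texts map into one shared vector space."""
    return cosine(TOY_EMBEDDINGS[text1], TOY_EMBEDDINGS[text2])
```

Because "how are you" and "comment allez-vous" land near each other in the shared space, they score high despite sharing almost no tokens, which is exactly the case syntactic algorithms miss.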
Responses
200

200 response

400

Invalid request body

403

The request is forbidden (Please input a valid API key)

POST /text/similarity
Request samples
application/json
{
  "text1": "how are you",
  "lang1": "en",
  "text2": "how old are you",
  "lang2": "en",
  "algo": "syn.cosine-word"
}
Response samples
application/json
{
  "status": { },
  "result": { }
}
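The request above can be issued from the Python standard library as sketched below. The base URL is a placeholder to replace with your deployment's endpoint, and passing the key in an `api_key` header is an assumption -- consult your credentials setup for the exact security scheme.

```python
import json
from urllib import request

# Assumed base URL -- substitute your deployment's actual endpoint.
API_URL = "https://api.example.com/text/similarity"

def build_payload(text1, lang1, text2, lang2, algo="sem.ssm"):
    """Assemble the POST /text/similarity request body."""
    return {
        "text1": text1, "lang1": lang1,
        "text2": text2, "lang2": lang2,
        "algo": algo,
    }

def similarity_request(payload, api_key):
    """Send the request and return the parsed JSON response body."""
    req = request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "api_key": api_key,  # assumed header name for the api_key scheme
        },
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

A 400 response indicates a body that fails the schema (for example, an empty `text1` or an unsupported `lang1`), and a 403 indicates a missing or invalid API key.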