Measuring the sentence level similarity

Ercan Canhasi


This article describes a method for calculating the similarity between short English texts, specifically texts of sentence length. The algorithm computes both the semantic similarity and the word order similarity of two sentences, using a structured lexical knowledge base and statistical information from a corpus. The method works well for most sentence pairs, so the implementation can be used for automated sentence similarity measurement and for other text mining problems. We packaged the implementation as a .NET library to simplify the task of calculating sentence similarity for end users.
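The combination of semantic and word order similarity described above can be sketched as follows. This is a minimal illustration, not the article's .NET library: the function names, the weight `delta`, and the formulas are assumptions chosen for the example. In particular, the semantic part uses exact word matching as a stand-in for the lexical knowledge base and corpus statistics, and the word order part uses one common formulation (the normalized difference of word position vectors).

```python
import math


def joint_words(tokens1, tokens2):
    """Distinct words of both sentences, in order of first appearance."""
    seen = []
    for word in tokens1 + tokens2:
        if word not in seen:
            seen.append(word)
    return seen


def semantic_vector(tokens, joint):
    # ASSUMPTION: exact match stands in for a knowledge-base lookup;
    # a real system would score related words between 0 and 1.
    return [1.0 if word in tokens else 0.0 for word in joint]


def order_vector(tokens, joint):
    # 1-based position of each joint word in the sentence, 0 if absent.
    return [tokens.index(word) + 1 if word in tokens else 0 for word in joint]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def order_similarity(r1, r2):
    # 1 - ||r1 - r2|| / ||r1 + r2||
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(r1, r2)))
    total = math.sqrt(sum((x + y) ** 2 for x, y in zip(r1, r2)))
    return 1.0 - diff / total if total else 1.0


def sentence_similarity(s1, s2, delta=0.85):
    """Weighted mix of semantic and word order similarity.

    delta is a hypothetical default weight for the semantic part.
    """
    t1, t2 = s1.lower().split(), s2.lower().split()
    joint = joint_words(t1, t2)
    semantic = cosine(semantic_vector(t1, joint), semantic_vector(t2, joint))
    order = order_similarity(order_vector(t1, joint), order_vector(t2, joint))
    return delta * semantic + (1.0 - delta) * order


if __name__ == "__main__":
    print(sentence_similarity("a dog bites a man", "a man bites a dog"))
```

In this sketch, two sentences with the same words in a different order score below 1 only because of the word order term, which is the reason for measuring order separately from meaning.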


