Reposted from http://en.wikipedia.org/wiki/Dice’s_coefficient
Source: Wikipedia
Dice’s coefficient, named after Lee Raymond Dice[1] and also known as the Dice coefficient, is a similarity measure related to the Jaccard index.
For sets X and Y of keywords used in information retrieval, the coefficient may be defined as twice the size of the intersection (the shared information) divided by the sum of the sizes of the two sets:[2]

$$ s = \frac{2\,|X \cap Y|}{|X| + |Y|} $$
When taken as a string similarity measure, the coefficient may be calculated for two strings, x and y, using bigrams as follows:[3]

$$ s = \frac{2\,n_t}{n_x + n_y} $$

where $n_t$ is the number of character bigrams found in both strings, $n_x$ is the number of bigrams in string x, and $n_y$ is the number of bigrams in string y. For example, to calculate the similarity between:
night and nacht
We would find the set of bigrams in each word:
- {ni, ig, gh, ht}
- {na, ac, ch, ht}
Each set has four elements, and the intersection of these two sets has only one element: ht.
Inserting these numbers into the formula, we calculate s = (2 · 1) / (4 + 4) = 0.25.
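The bigram calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `bigrams` and `dice_coefficient` are made up for this example, bigrams are collected into sets (so a bigram repeated within one word is counted once, as in the worked example), and returning 1.0 for two strings that have no bigrams at all is an assumption the text does not address.

```python
def bigrams(word):
    """Return the set of adjacent character pairs in word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice_coefficient(x, y):
    """Dice's coefficient computed over the bigram sets of two strings."""
    bx, by = bigrams(x), bigrams(y)
    if not bx and not by:
        # Assumption: two strings with no bigrams are treated as identical.
        return 1.0
    # Twice the number of shared bigrams over the total number of bigrams.
    return 2 * len(bx & by) / (len(bx) + len(by))


print(dice_coefficient("night", "nacht"))  # 0.25
```

For "night" and "nacht" the only shared bigram is "ht", so the function reproduces the value 0.25 computed by hand.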
See also
- Jaccard index, which is equivalent: D = 2J / (1 + J) and J = D / (2 − D)
- Levenshtein distance
- Sørensen similarity index
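The equivalence with the Jaccard index stated in the list above, D = 2J / (1 + J) and J = D / (2 − D), can be checked on the bigram sets from the night/nacht example. The sketch below is only a spot check on one pair of sets, not a proof; the helper names `jaccard` and `dice` are illustrative, both assume non-empty sets, and `fractions.Fraction` is used so the comparison is exact rather than subject to floating-point rounding.

```python
from fractions import Fraction


def jaccard(x, y):
    """Jaccard index: size of the intersection over size of the union."""
    return Fraction(len(x & y), len(x | y))


def dice(x, y):
    """Dice's coefficient: twice the intersection over the summed set sizes."""
    return Fraction(2 * len(x & y), len(x) + len(y))


x = {"ni", "ig", "gh", "ht"}  # bigrams of "night"
y = {"na", "ac", "ch", "ht"}  # bigrams of "nacht"

j, d = jaccard(x, y), dice(x, y)
print(j, d)  # 1/7 1/4

# Both conversion formulas hold exactly for this pair of sets.
assert d == 2 * j / (1 + j)
assert j == d / (2 - d)
```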
Notes
- [1] Dice, Lee R. (1945). “Measures of the Amount of Ecologic Association Between Species”. Ecology 26 (3): 297–302. doi:10.2307/1932409.
- [2] van Rijsbergen, Cornelis Joost (1979). Information Retrieval. London: Butterworths.
- [3] Kondrak, Grzegorz; Marcu, Daniel; and Knight, Kevin (2003). “Cognates Can Improve Statistical Translation Models”. Proceedings of HLT-NAACL 2003: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. pp. 46–48.