


COMPAREASSIGNMENTS Computes a quality measure of the similarity between assignments.
[score, scoreIfIndependent] = compareAssignments(assignments1, assignments2)
The inputs, 'assignments1' and 'assignments2', must be two column vectors of
the same length where each row contains an integer category label for the
corresponding sample. The integer labels used in the assignment vectors need
have no intrinsic meaning (in particular, e.g., category 1 in 'assignments1'
has no relationship to category 1 in 'assignments2').
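Labels that arrive as arbitrary integers or as row vectors can be brought into this form before the call. The lines below are a minimal sketch, assuming that the third output of MATLAB's `unique` is an acceptable relabelling. `raw1` and `raw2` are hypothetical inputs, and this is not necessarily what the `sortassignments` helper called in the source does.

```matlab
% Hypothetical raw labels: arbitrary integers, stored as row vectors.
raw1 = [10 10 42 10];
raw2 = [ 7  7 -3 -3];

% The third output of unique maps every label to its rank among the
% distinct values, giving labels 1..K; (:) forces a column vector.
[~, ~, assignments1] = unique(raw1(:));
[~, ~, assignments2] = unique(raw2(:));

disp([assignments1 assignments2]);
```

The relabelled vectors group the samples exactly as the raw ones do, so the score is unaffected; the labels are also positive integers, which makes them usable as matrix indices.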
The first output, 'score', is a scalar between 0 and 1 that measures the
similarity between the two classifications. A 'score' of 1 implies perfect
correspondence, ignoring actual labels. For example, if all samples in
'assignments1' are labelled by 1 and relabelled as 2 in 'assignments2', the
'score' would be 1. Deviations from this correspondence are penalized in a
fashion that recognizes category splitting/merging and penalizes these less
than completely random redistribution.
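To make the behaviour described above concrete, here is a small sketch that scores three variations of one grouping. It uses a brute-force comparison of sample pairs (together in both assignments, or apart in both) rather than the computation in the source; `offDiag`, `pairScore`, and the label vectors are hypothetical names chosen for illustration.

```matlab
% Fraction of ordered sample pairs (i ~= j) that are grouped the same way
% (together in both, or apart in both) by label vectors a and b.
offDiag   = @(M) M(~eye(size(M)));
pairScore = @(a, b) mean(double(offDiag( ...
    bsxfun(@eq, a(:), a(:)') == bsxfun(@eq, b(:), b(:)'))));

base      = [1 1 1 1 2 2 2 2]';
relabeled = base + 5;                 % same grouping, different labels
split     = [1 1 3 3 2 2 2 2]';       % category 1 split in two
shuffled  = [1 2 1 2 1 2 1 2]';       % grouping unrelated to base

fprintf('relabeled: %.3f\n', pairScore(base, relabeled));  % 1.000
fprintf('split:     %.3f\n', pairScore(base, split));      % 0.857
fprintf('shuffled:  %.3f\n', pairScore(base, shuffled));   % 0.429
```

Relabelling leaves the score at 1, splitting one category costs comparatively little, and the interleaved assignment scores far lower.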
The algorithm is motivated by a Chi^2 two-way classification; however, here we
return a similarity score rather than simply testing the hypothesis that the
classifications are independent. The expected score if the classifications
were independent is returned as the second output, 'scoreIfIndependent', with
the standard Chi^2 two-way p-value returned as an optional third output (this
requires the Statistics Toolbox). This p-value is the probability of observing a
Chi^2 statistic at least this large if the two assignments were independent.
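The Chi^2 computation can be sketched on a small example as follows. This is an illustration under stated assumptions, not the source's code: the assignment vectors are hypothetical, and the p-value is obtained from base MATLAB's `gammainc` instead of `chi2cdf` so that the sketch runs without the Statistics Toolbox.

```matlab
% Hypothetical assignments for six samples.
a1 = [1 1 1 2 2 2]';
a2 = [1 1 2 2 2 1]';
N  = numel(a1);

% Observed two-way table of counts, and the counts expected under
% independence (outer product of the marginal counts, divided by N).
observed = accumarray([a1 a2], 1);
expected = sum(observed, 2) * sum(observed, 1) / N;

% Pearson Chi^2 statistic and its degrees of freedom.
X2 = sum((observed(:) - expected(:)).^2 ./ expected(:));
df = (size(observed, 1) - 1) * (size(observed, 2) - 1);

% Upper tail of the Chi^2 distribution via the regularized incomplete
% gamma function; equivalent to 1 - chi2cdf(X2, df).
p = gammainc(X2 / 2, df / 2, 'upper');

fprintf('X2 = %.4f, df = %d, p = %.4f\n', X2, df, p);
```

Every expected count is nonzero in this example. A table with empty rows or columns would produce 0/0 terms, which the source handles by setting them to zero.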
Conceptually (though not computationally), the algorithm considers all N*(N-1)
pairs of data samples and counts pairs that cosegregate, where a pair of samples
is defined as cosegregating if they either share the same category in both
assignments or if they do not share category in either assignment. For example,
consider the following assignments:
    sample #    assignments1    assignments2
       1             1               2
       2             1               2
       3             2               3
       4             1               3
The pairs (1,2) and (1,3) cosegregate while the pair (1,4) does not (since they
share a label in 'assignments1' but not in 'assignments2'). 'score' is the fraction
of pairs that cosegregate between the two assignments.
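The worked example can be checked by listing every pair explicitly. This sketch follows the conceptual definition directly and is not how the function computes the score; it visits each unordered pair once, which gives the same fraction as counting all N*(N-1) ordered pairs.

```matlab
assignments1 = [1 1 2 1]';
assignments2 = [2 2 3 3]';

% Each unordered pair of sample indices, one pair per row.
pairs = nchoosek(1:numel(assignments1), 2);

% Does each pair share a category in assignment 1? In assignment 2?
together1 = assignments1(pairs(:,1)) == assignments1(pairs(:,2));
together2 = assignments2(pairs(:,1)) == assignments2(pairs(:,2));

% A pair cosegregates when both assignments agree on that question.
cosegregates = (together1 == together2);

for k = 1:size(pairs, 1)
    fprintf('(%d,%d): %d\n', pairs(k,1), pairs(k,2), cosegregates(k));
end

score = mean(double(cosegregates));   % 3 of the 6 pairs cosegregate: 0.5
```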
(An optional third boolean input argument 'showTables' (default 0) produces a graphical
output with the contingency table, conditional probabilities and marginals for the
assignments. The 'score' described above is calculated efficiently using these matrices).
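The table-based calculation can be sketched for the four-sample example as follows. This is a simplified restatement, not the source's exact code: it builds the table with `accumarray` and uses sums of squared marginals, which are algebraically equal to the matrix products in the source. `pairFraction` is a hypothetical helper name.

```matlab
assignments1 = [1 1 2 1]';
assignments2 = [2 2 3 3]';
N = numel(assignments1);

% Contingency table as proportions, with its row and column marginals.
joint     = accumarray([assignments1 assignments2], 1) / N;
marginal1 = sum(joint, 2);
marginal2 = sum(joint, 1);

% Fraction of cosegregating pairs expressed through the table J and
% the marginals m1 and m2.
pairFraction = @(J, m1, m2) 1 + (N / (N - 1)) * ...
    (2 * sum(J(:).^2) - sum(m1.^2) - sum(m2.^2));

score              = pairFraction(joint, marginal1, marginal2);
scoreIfIndependent = pairFraction(marginal1 * marginal2, marginal1, marginal2);

fprintf('score = %.4f, scoreIfIndependent = %.4f\n', score, scoreIfIndependent);
```

For this example the score is 0.5, matching the pair count, and the score expected under independence is 1/3.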

function [score, scoreIndep, p] = compareAssignments(assigns1, assigns2, showTables)

% COMPAREASSIGNMENTS Computes a quality measure of the similarity between assignments.
%
% [score, scoreIfIndependent] = compareAssignments(assignments1, assignments2)
% The inputs, 'assignments1' and 'assignments2', must be two column vectors of
% the same length where each row contains an integer category label for the
% corresponding sample. The integer labels used in the assignment vectors need
% have no intrinsic meaning (in particular, e.g., category 1 in 'assignments1'
% has no relationship to category 1 in 'assignments2').
%
% The first output, 'score', is a scalar between 0 and 1 that measures the
% similarity between the two classifications. A 'score' of 1 implies perfect
% correspondence, ignoring actual labels. For example, if all samples in
% 'assignments1' are labelled by 1 and relabelled as 2 in 'assignments2', the
% 'score' would be 1. Deviations from this correspondence are penalized in a
% fashion that recognizes category splitting/merging and penalizes these less
% than completely random redistribution.
%
% The algorithm is motivated by a Chi^2 two-way classification; however, here we
% return a similarity score rather than simply testing the hypothesis that the
% classifications are independent. The expected score if the classifications
% were independent is returned as the second output, 'scoreIfIndependent', with
% the standard Chi^2 two-way p-value returned as an optional third output (this
% requires the Statistics Toolbox). This p-value is the probability of observing a
% Chi^2 statistic at least this large if the two assignments were independent.
%
% Conceptually (though not computationally), the algorithm considers all N*(N-1)
% pairs of data samples and counts pairs that cosegregate, where a pair of samples
% is defined as cosegregating if they either share the same category in both
% assignments or if they do not share category in either assignment. For example,
% consider the following assignments:
%     sample #    assignments1    assignments2
%        1             1               2
%        2             1               2
%        3             2               3
%        4             1               3
% The pairs (1,2) and (1,3) cosegregate while the pair (1,4) does not (since they
% share a label in 'assignments1' but not in 'assignments2'). 'score' is the fraction
% of pairs that cosegregate between the two assignments.
%
% (An optional third boolean input argument 'showTables' (default 0) produces a graphical
% output with the contingency table, conditional probabilities and marginals for the
% assignments. The 'score' described above is calculated efficiently using these matrices).

% Last Modified By: sbm on Thu Jun 2 17:25:54 2005

if ((size(assigns1, 2) > 1) || (size(assigns2, 2) > 1) || (size(assigns1,1) ~= size(assigns2, 1)))
    error('Error in assignment vectors. The first two inputs must be column vectors of equal length.');
end

if ((nargin < 3) || (showTables == 0)) % if we're not doing graphics, this is more memory efficient.
    assigns1 = sortassignments(assigns1);
    assigns2 = sortassignments(assigns2);
    showTables = 0;
end

s = warning('off');

numSamples = size(assigns1, 1);
numCategories1 = length(unique(assigns1));
numCategories2 = length(unique(assigns2));

% Construct classification table and marginals
joint = full(sparse(assigns1, assigns2, 1, max(assigns1), max(assigns2))) ./ numSamples;
marginal1 = sum(joint, 2);
marginal2 = sum(joint, 1);

% The expression below computes the score described above. With N = numSamples
% and joint(i,j) the fraction of samples in category i of assignment 1 and
% category j of assignment 2, the fraction of the N*(N-1) sample pairs that
% cosegregate is
%     1 + N/(N-1) * (2*sum(joint(:).^2) - sum(marginal1.^2) - sum(marginal2.^2)).
% Here sum(sum(joint' * joint)) equals sum(marginal1.^2) and
% sum(sum(joint * joint')) equals sum(marginal2.^2).
score = (2 * joint(:)' * joint(:)) - sum(sum(joint' * joint)) - sum(sum(joint * joint'));
score = 1 + (numSamples / (numSamples - 1)) * score;

% Now get the score expected if the classifications were independent; we do this by
% reconstructing a joint under the assumption of independent classifications (i.e.,
% p(x,y) = p(x)p(y)) and then using the same expression to find the score.
jointIndep = (marginal1 * marginal2);
scoreIndep = (2 * jointIndep(:)' * jointIndep(:)) ...
    - sum(sum(jointIndep' * jointIndep)) - sum(sum(jointIndep * jointIndep'));
scoreIndep = 1 + (numSamples / (numSamples-1)) * scoreIndep;

% if a p-value was requested, compute Chi^2
if (nargout > 2)
    X2 = numSamples .* (((joint - jointIndep).^2) ./ jointIndep); % chi^2
    X2(isnan(X2)) = 0; % (clean up divide by zeros)
    X2 = sum(X2(:));
    df = (numCategories1 - 1) * (numCategories2 - 1); % degrees of freedom
    p = 1 - chi2cdf(X2,df);
end

% Optional graphical output
if (showTables)
    % construct conditional tables
    oneGivenTwo = joint ./ repmat(marginal2, [size(joint,1), 1]);
    oneGivenTwo(isnan(oneGivenTwo)) = 0; % (deal with divide by zeros)
    twoGivenOne = joint ./ repmat(marginal1, [1, size(joint,2)]);
    twoGivenOne(isnan(twoGivenOne)) = 0; % (deal with divide by zeros)

    figure;
    subplot(2,2,1); imagesc(joint);
    title('Two-Way Classification Table'); ylabel('Assignments 1'); xlabel('Assignments 2');
    subplot(2,2,2); imagesc(oneGivenTwo);
    title('Assignments 1 given Assignments 2'); ylabel('Assignments 1'); xlabel('Assignments 2');
    subplot(2,2,3); imagesc(twoGivenOne);
    title('Assignments 2 given Assignments 1'); ylabel('Assignments 1'); xlabel('Assignments 2');
    subplot(4,2,6); bar(marginal1); axis tight;
    title('Assignments 1 Marginal');
    subplot(4,2,8); bar(marginal2); axis tight;
    title('Assignments 2 Marginal');
    pixval on; % Image Processing Toolbox pixel readout (newer releases provide impixelinfo instead)
end

warning(s);