


COMPAREASSIGNMENTS Computes a quality measure of the similarity between assignments.
[score, scoreIfIndependent, p] = compareAssignments(assignments1, assignments2, showTables)
The inputs, 'assignments1' and 'assignments2', must be two column vectors of
the same length where each row contains an integer category label for the
corresponding sample. The integer labels used in the assignment vectors need
have no intrinsic meaning (in particular, e.g., category 1 in 'assignments1'
has no relationship to category 1 in 'assignments2').
The first output, 'score', is a scalar between 0 and 1 that measures the
similarity between the two classifications. A 'score' of 1 implies perfect
correspondence, ignoring the actual labels. For example, if all samples in
'assignments1' are labelled 1 and relabelled 2 in 'assignments2', the
'score' would be 1. Deviations from this correspondence are penalized in a
fashion that recognizes category splitting/merging and penalizes these less
heavily than completely random redistribution.
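The label invariance can be illustrated with a small sketch of the pairwise
score (Python/NumPy for illustration only; 'coseg_score' is a hypothetical
helper, not part of this MATLAB code — it brute-forces the pair count that
the function below computes in closed form):

```python
import numpy as np

def coseg_score(a1, a2):
    # Fraction of ordered pairs of distinct samples that cosegregate:
    # the pair shares a category in both assignments or in neither.
    n = len(a1)
    same1 = np.equal.outer(a1, a1)   # pair shares a label in assignment 1
    same2 = np.equal.outer(a2, a2)   # pair shares a label in assignment 2
    agree = (same1 == same2)
    np.fill_diagonal(agree, False)   # exclude self-pairs
    return agree.sum() / (n * (n - 1))

a1 = np.array([1, 1, 2, 2, 3])
a2 = np.array([2, 2, 3, 3, 1])       # same partition, different labels
print(coseg_score(a1, a2))           # 1.0
```

Because only co-membership of pairs is compared, any one-to-one relabelling
leaves the score at 1.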
The algorithm is motivated by a Chi^2 two-way classification; however, here we
return a similarity score rather than simply testing the hypothesis that the
classifications are independent. The expected score if the classifications
were independent is returned as the second output, 'scoreIfIndependent', with
the standard Chi^2 two-way p-value returned as an optional third output (this
requires the Statistics Toolbox). The p-value is the probability of observing
at least this much association between the two assignments under the
hypothesis that they are independent.
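For reference, the standard two-way Chi^2 test alluded to here can be
reproduced outside MATLAB with SciPy (a sketch, not this function's code
path; the table entries are hypothetical sample counts):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows index categories in assignments1,
# columns index categories in assignments2, entries are sample counts.
table = np.array([[20, 2],
                  [3, 25]])

chi2, p, df, expected = chi2_contingency(table)
# df = (rows - 1) * (cols - 1); a strongly diagonal table gives a tiny p.
```

A small p rejects independence; it does not, by itself, say how similar the
assignments are — that is what 'score' is for.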
Conceptually (though not computationally), the algorithm considers all N*(N-1)
ordered pairs of data samples and counts pairs that cosegregate, where a pair of
samples is defined as cosegregating if it either shares the same category in both
assignments or shares a category in neither assignment. For example,
consider the following assignments:
    sample #    assignments1    assignments2
       1             1               2
       2             1               2
       3             2               3
       4             1               3
The pairs (1,2) and (1,3) cosegregate while the pair (1,4) does not (since they
share a label in 'assignments1' but not in 'assignments2'). 'score' is the fraction
of pairs that cosegregate between the two assignments.
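The equivalence between this conceptual pair count and the contingency-table
formula the code uses can be checked on the example above with a Python/NumPy
sketch (illustrative only; 'score_pairwise' and 'score_table' are hypothetical
names). Writing P for the joint table, m1 and m2 for its marginals, the closed
form is score = 1 + N/(N-1) * (2*sum(P^2) - sum(m1^2) - sum(m2^2)):

```python
import numpy as np

def score_pairwise(a1, a2):
    # Brute force: count ordered pairs of distinct samples that
    # cosegregate, divided by N*(N-1).
    n = len(a1)
    same1 = np.equal.outer(a1, a1)
    same2 = np.equal.outer(a2, a2)
    agree = (same1 == same2)
    np.fill_diagonal(agree, False)
    return agree.sum() / (n * (n - 1))

def score_table(a1, a2):
    # Same quantity from the joint table P and its marginals;
    # the N/(N-1) factor removes the always-cosegregating self-pairs.
    n = len(a1)
    cats1, cats2 = np.unique(a1), np.unique(a2)
    P = np.zeros((len(cats1), len(cats2)))
    for x, y in zip(a1, a2):
        P[np.searchsorted(cats1, x), np.searchsorted(cats2, y)] += 1
    P /= n
    m1, m2 = P.sum(axis=1), P.sum(axis=0)
    return 1 + n / (n - 1) * (2 * (P**2).sum()
                              - (m1**2).sum() - (m2**2).sum())

a1 = np.array([1, 1, 2, 1])
a2 = np.array([2, 2, 3, 3])          # the example from the help text
print(score_pairwise(a1, a2))        # 0.5
print(score_table(a1, a2))           # 0.5
```

Three of the six unordered pairs in the example cosegregate, so both routes
give 0.5.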
(An optional third boolean input argument 'showTables' (default 0) produces a graphical
output with the contingency table, conditional probabilities and marginals for the
assignments. The 'score' described above is calculated efficiently using these matrices).

function [score, scoreIndep, p] = compareAssignments(assigns1, assigns2, showTables)
% COMPAREASSIGNMENTS Computes a quality measure of the similarity between assignments.

if ((size(assigns1, 2) > 1) || (size(assigns2, 2) > 1) || (size(assigns1,1) ~= size(assigns2, 1)))
    error('Error in assignment vectors.  The first two inputs must be column vectors of equal length.');
end

if ((nargin < 3) || (showTables == 0))   % if we're not doing graphics, this is more memory efficient
    assigns1 = sortassignments(assigns1);   % (toolbox helper: compacts the label values)
    assigns2 = sortassignments(assigns2);
    showTables = 0;
end

s = warning('off', 'MATLAB:divideByZero');   % save warning state; restored below

numSamples = size(assigns1, 1);
numCategories1 = length(unique(assigns1));
numCategories2 = length(unique(assigns2));

% Construct classification table and marginals
joint = full(sparse(assigns1, assigns2, 1, max(assigns1), max(assigns2))) ./ numSamples;
marginal1 = sum(joint, 2);
marginal2 = sum(joint, 1);

% The expression below computes the cosegregation score described above.
% Writing P for the joint table, note that sum(sum(P'*P)) = sum(marginal1.^2)
% and sum(sum(P*P')) = sum(marginal2.^2), so this evaluates
%     score = 1 + N/(N-1) * (2*sum(P(:).^2) - sum(marginal1.^2) - sum(marginal2.^2)),
% the fraction of ordered pairs of distinct samples that cosegregate; the
% N/(N-1) factor removes the self-pairs, which always cosegregate.
score = (2 * joint(:)' * joint(:)) - sum(sum(joint' * joint)) - sum(sum(joint * joint'));
score = 1 + (numSamples / (numSamples - 1)) * score;

% Now get the score expected if the classifications were independent; we do this by
% reconstructing a joint under the assumption of independent classifications (i.e.,
% p(x,y) = p(x)p(y)) and then using the same expression to find the score.
jointIndep = (marginal1 * marginal2);
scoreIndep = (2 * jointIndep(:)' * jointIndep(:)) ...
               - sum(sum(jointIndep' * jointIndep)) - sum(sum(jointIndep * jointIndep'));
scoreIndep = 1 + (numSamples / (numSamples - 1)) * scoreIndep;

% If a p-value was requested, compute Chi^2
if (nargout > 2)
    X2 = numSamples .* (((joint - jointIndep).^2) ./ jointIndep);   % chi^2 terms
    X2(isnan(X2)) = 0;                                              % (clean up divide by zeros)
    X2 = sum(X2(:));
    df = (numCategories1 - 1) * (numCategories2 - 1);               % degrees of freedom
    p = 1 - chi2cdf(X2, df);                                        % requires the Statistics Toolbox
end

% Optional graphical output
if (showTables)
    % construct conditional tables
    oneGivenTwo = joint ./ repmat(marginal2, [size(joint,1), 1]);
    oneGivenTwo(isnan(oneGivenTwo)) = 0;    % (deal with divide by zeros)
    twoGivenOne = joint ./ repmat(marginal1, [1, size(joint,2)]);
    twoGivenOne(isnan(twoGivenOne)) = 0;    % (deal with divide by zeros)

    figure;
    subplot(2,2,1);  imagesc(joint);
    title('Two-Way Classification Table');  ylabel('Assignments 1');  xlabel('Assignments 2');
    subplot(2,2,2);  imagesc(oneGivenTwo);
    title('Assignments 1 given Assignments 2');  ylabel('Assignments 1');  xlabel('Assignments 2');
    subplot(2,2,3);  imagesc(twoGivenOne);
    title('Assignments 2 given Assignments 1');  ylabel('Assignments 1');  xlabel('Assignments 2');
    subplot(4,2,6);  bar(marginal1);  axis tight;
    title('Assignments 1 Marginal');
    subplot(4,2,8);  bar(marginal2);  axis tight;
    title('Assignments 2 Marginal');
    pixval on;
end

warning(s);