Abstract
Clustering of ranking data aims to identify groups of subjects with homogeneous, common preference behavior. In everyday life, people naturally rank objects such as shops, places to live, occupations, singers, and football teams according to their preferences. More generally, ranking data arise when a number of subjects are asked to rank a list of objects according to their personal order of preference. The input to cluster analysis is a dissimilarity matrix quantifying the differences between the rankings of each pair of subjects. The choice of dissimilarity strongly affects the classification outcome, so computing an appropriate dissimilarity matrix is a central issue. Several distance measures have been proposed for ranking data. We propose generalizations of these distances based on copulas and adapted to the case of missing data. We consider the case of extreme lists, where only the top-k and/or bottom-k ranks are known. We discuss an optimistic and a pessimistic imputation of the missing values and show their effect on the classification. These generalizations provide a more flexible instrument for modeling different types of dependence structure in the data and for handling different situations in the classification process. Simulated and real data are used to illustrate the performance and the importance of our proposal.