Dataset for research paper similarity evaluation

The dataset available here is intended to support experiments on assessing research paper similarity. It provides a ground truth based on annotations made by 8 experts on 220 documents related to artificial intelligence. For each of the 220 papers, the 30 most similar papers in a collection of 16597 papers were found using tf-idf with cosine similarity. Each of those 30 papers was then manually tagged by an expert sufficiently familiar with it as either similar or dissimilar.
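
The candidate selection step described above can be illustrated with a minimal sketch. It assumes documents are already tokenized and uses a plain tf-idf weighting (raw term frequency times log(N/df)); the exact weighting variant and preprocessing used to build the dataset are not stated here, and the function names (tfidf_vectors, cosine, top_k_similar) are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse tf-idf vectors.

    docs: dict mapping a document id to its list of tokens.
    Returns a dict mapping each id to a {term: weight} dict.
    Assumed weighting: raw term count * log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = {}
    for doc_id, tokens in docs.items():
        term_freq = Counter(tokens)
        vectors[doc_id] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_similar(query_id, vectors, k=30):
    """Return the ids of the k documents most similar to query_id."""
    scored = [
        (cosine(vectors[query_id], vec), other_id)
        for other_id, vec in vectors.items()
        if other_id != query_id
    ]
    # Highest score first; ties are broken by id for reproducibility.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other_id for _, other_id in scored[:k]]
```

With k=30, top_k_similar would return a candidate list of the same size as the one annotated for each of the 220 papers.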

The dataset is formed by the following three documents:

The structure of a record in this last document is given by the following tags:

To evaluate the performance of a research paper similarity method, each paper p in the testids.txt file must be compared against the 30 papers indicated for it in evaluation.txt, some of which are tagged as similar. Note that some papers are compared against fewer than 30 documents: candidates whose similarity the annotators could not judge with complete certainty were ignored.
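
One possible way to score a method against this ground truth is sketched below. It is an illustration under stated assumptions, not a prescribed protocol: the metric (mean average precision over the ranked candidates) is a choice made for this example, parsing of testids.txt and evaluation.txt is omitted because it depends on the record format, and the names ground_truth and similarity are hypothetical. The ground truth is assumed to be already loaded as a dict mapping each test paper id to a {candidate id: is_similar} dict, which naturally handles papers with fewer than 30 candidates.

```python
def average_precision(ranked_labels):
    """Average precision of a ranked list of booleans (True = similar)."""
    hits = 0
    precision_sum = 0.0
    for rank, is_similar in enumerate(ranked_labels, start=1):
        if is_similar:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ground_truth, similarity):
    """Mean average precision of a similarity function.

    ground_truth: dict mapping a test paper id to {candidate_id: bool}.
    similarity: callable (paper_id, candidate_id) -> float, the method
        under evaluation (hypothetical interface).
    Papers with no candidate tagged as similar are skipped, which is an
    assumption of this sketch.
    """
    scores = []
    for paper_id, candidates in ground_truth.items():
        if not any(candidates.values()):
            continue
        # Rank this paper's candidates by decreasing predicted similarity.
        ranked = sorted(
            candidates,
            key=lambda cand: similarity(paper_id, cand),
            reverse=True,
        )
        scores.append(average_precision([candidates[c] for c in ranked]))
    return sum(scores) / len(scores) if scores else 0.0
```

Because every paper is ranked only against its own annotated candidates, methods can be compared on the same candidate lists regardless of how they would rank the full collection.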