The dataset available here is intended to support experiments on the assessment of research paper similarity. It provides a ground truth based on annotations made by 8 experts on 220 documents related to artificial intelligence. Using tf-idf with cosine similarity, the 30 most similar papers within a 16,597-paper collection were retrieved for each of the 220 papers. Each of those 30 papers was then manually tagged by an expert sufficiently familiar with it as either similar or dissimilar.
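As an illustration of this retrieval step, the sketch below computes the 30 nearest neighbours of a paper under tf-idf cosine similarity. It is only a sketch under assumptions: scikit-learn is one possible toolkit (the source does not say which was used), and corpus is a hypothetical list holding the text of each paper in the collection.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def top_k_similar(corpus, query_index, k=30):
        """Indices of the k papers most similar to corpus[query_index]."""
        tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
        sims = cosine_similarity(tfidf[query_index], tfidf).ravel()
        sims[query_index] = -1.0  # exclude the query paper itself
        return sims.argsort()[::-1][:k]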
The dataset is formed by the following three documents:

- testids.txt, which lists the identifiers of the annotated papers;
- evaluation.txt, whose records take one of two forms: "doc_id1 POS doc_id2", when the document with identifier doc_id2 was tagged as similar to the document with identifier doc_id1, and "doc_id1 NEG doc_id2", when the document with identifier doc_id2 was tagged as not similar to the document with identifier doc_id1;
- a third document containing the records of the papers themselves.

The structure of a record in this last document is given by the following tags:
- DOC, followed by the unique identifier of the document within the dataset
- TIT, followed by the title of the paper
- AUT, followed by the author(s) of the paper
- JOU, followed by the journal in which the paper was originally published
- YEA, followed by the year of publication of the paper
- DOI, followed by the DOI of the paper (this field is optional, as not all the papers had an available DOI)
- END, to indicate the end of the record
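A minimal parser for these records could look like the sketch below. Note the assumptions: the source does not name this third file (hence the generic path argument), and the layout of one "TAG value" pair per line is inferred from the tag descriptions above, not stated in the source.

    def parse_records(path):
        """Parse tagged paper records; assumes one 'TAG value' pair per line."""
        records, current = [], {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                tag, _, value = line.strip().partition(" ")
                if tag == "END":  # close the current record
                    records.append(current)
                    current = {}
                elif tag in ("DOC", "TIT", "AUT", "JOU", "YEA", "DOI"):
                    current[tag] = value
        return records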
To evaluate the performance of any developed research paper similarity method, each paper p in the testids.txt file must thus be compared against the 30 others indicated in evaluation.txt, some of which are tagged as similar. (Note: in some cases a paper is compared against fewer than 30 documents, because the annotators were not fully certain about the similarity of some candidates, and those candidates were simply discarded.)
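One possible way to run such an evaluation is sketched below: the ground truth is read from evaluation.txt (assuming whitespace-separated fields), and a candidate method, represented by the placeholder function similarity(a, b), is scored per paper by precision at the number of positively tagged candidates. This is only one reasonable protocol, not one prescribed by the dataset.

    def load_ground_truth(eval_path):
        """Map each test paper id to its {candidate_id: True/False} labels."""
        truth = {}
        with open(eval_path, encoding="utf-8") as f:
            for line in f:
                d1, tag, d2 = line.split()
                truth.setdefault(d1, {})[d2] = (tag == "POS")
        return truth

    def mean_precision(truth, similarity):
        """Rank each paper's tagged candidates with `similarity` (the
        method under test) and average precision at the number of
        positively tagged candidates."""
        scores = []
        for paper, labels in truth.items():
            n_pos = sum(labels.values())
            ranked = sorted(labels, key=lambda c: similarity(paper, c),
                            reverse=True)
            hits = sum(labels[c] for c in ranked[:n_pos])
            scores.append(hits / n_pos if n_pos else 0.0)
        return sum(scores) / len(scores)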