The dataset available here is intended to support experiments on the assessment of research paper similarity. It provides a ground truth based on annotations made by 8 experts on 220 documents related to artificial intelligence. Using tf-idf with cosine similarity, the 30 most similar papers within a 16,597-paper collection were retrieved for each of the 220 papers. Each of those 30 papers was then manually tagged by an expert sufficiently familiar with it as either similar or dissimilar.
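As an illustration of this retrieval step, the sketch below computes the 30 nearest neighbours of a paper under tf-idf cosine similarity. It is only a sketch under assumptions: scikit-learn is one possible toolkit (the source does not say which was used), and corpus is a hypothetical list holding the text of each paper in the collection.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def top_k_similar(corpus, query_index, k=30):
        """Indices of the k papers most similar to corpus[query_index]."""
        tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
        sims = cosine_similarity(tfidf[query_index], tfidf).ravel()
        sims[query_index] = -1.0  # exclude the query paper itself
        return sims.argsort()[::-1][:k]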
The dataset is formed by the following three documents:

- testids.txt, which lists the identifiers of the annotated papers;
- evaluation.txt, whose records take one of two forms: "doc_id1 POS doc_id2", when the document with identifier doc_id2 was tagged as similar to the document with identifier doc_id1, and "doc_id1 NEG doc_id2", when the document with identifier doc_id2 was tagged as not similar to the document with identifier doc_id1;
- a third document containing the records of the papers themselves.

The structure of a record in this last document is given by the following tags:
- DOC, followed by the unique identifier of the document within the dataset
- TIT, followed by the title of the paper
- AUT, followed by the author(s) of the paper
- JOU, followed by the journal in which the paper was originally published
- YEA, followed by the year of publication of the paper
- DOI, followed by the DOI of the paper (this field is optional, as not all the papers had an available DOI)
- END, to indicate the end of the record
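A minimal parser for these records could look like the sketch below. Note the assumptions: the source does not name this third file (hence the generic path argument), and the layout of one "TAG value" pair per line is inferred from the tag descriptions above, not stated in the source.

    def parse_records(path):
        """Parse tagged paper records; assumes one 'TAG value' pair per line."""
        records, current = [], {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                tag, _, value = line.strip().partition(" ")
                if tag == "END":  # close the current record
                    records.append(current)
                    current = {}
                elif tag in ("DOC", "TIT", "AUT", "JOU", "YEA", "DOI"):
                    current[tag] = value
        return records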
To evaluate the performance of any developed research paper similarity method, each paper p in the testids.txt file must thus be compared against the 30 others indicated in evaluation.txt, some of which are tagged as similar. (Note: in some cases a paper is compared against fewer than 30 documents, because the annotators were not fully certain about the similarity of some candidates, and those candidates were simply discarded.)
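One possible way to run such an evaluation is sketched below: the ground truth is read from evaluation.txt (assuming whitespace-separated fields), and a candidate method, represented by the placeholder function similarity(a, b), is scored per paper by precision at the number of positively tagged candidates. This is only one reasonable protocol, not one prescribed by the dataset.

    def load_ground_truth(eval_path):
        """Map each test paper id to its {candidate_id: True/False} labels."""
        truth = {}
        with open(eval_path, encoding="utf-8") as f:
            for line in f:
                d1, tag, d2 = line.split()
                truth.setdefault(d1, {})[d2] = (tag == "POS")
        return truth

    def mean_precision(truth, similarity):
        """Rank each paper's tagged candidates with `similarity` (the
        method under test) and average precision at the number of
        positively tagged candidates."""
        scores = []
        for paper, labels in truth.items():
            n_pos = sum(labels.values())
            ranked = sorted(labels, key=lambda c: similarity(paper, c),
                            reverse=True)
            hits = sum(labels[c] for c in ranked[:n_pos])
            scores.append(hits / n_pos if n_pos else 0.0)
        return sum(scores) / len(scores)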