Can We Automate Scientific Reviewing?

Weizhe Yuan, Pengfei Liu, Graham Neubig

Introduction

The number of published papers is growing exponentially tabah1999literature; de2009bibliometrics; bornmann2015growth. While this may be positively viewed as indicating acceleration of scientific progress, it also poses great challenges for researchers, both in reading and synthesizing the relevant literature for one’s own benefit, and in performing peer review of papers to vet their correctness and merit. With respect to the former, a large body of existing work explores automatic summarization of a paper or a set of papers for automatic survey generation mohammad-etal-2009-using; jha-etal-2013-system; jha-etal-2015-content; DBLP:conf/aaai/JhaCR15; DBLP:conf/aaai/YasunagaKZFLFR19; cohan-etal-2018-discourse; xing-etal-2020-automatic. However, despite the fact that peer review is an important but laborious part of our scientific process, automatic systems to aid in the peer review process remain relatively underexplored. bartoli2016your investigated the feasibility of generating reviews by surface-level term replacement and sentence reordering, and wang2020reviewrobot (contemporaneously and independently) propose a two-stage information extraction and summarization pipeline to generate paper reviews. However, neither work extensively evaluates the quality or features of the generated review text.

In this work, we are concerned with providing at least a preliminary answer to the ambitious over-arching question: can we automate scientific reviewing? Given the complexity of understanding and assessing the merit of scientific contributions, we do not expect an automated system to be able to match a well-qualified and meticulous human reviewer at this task any time soon. However, some degree of review automation may assist reviewers in their assessments, or provide guidance to junior reviewers who are just learning the ropes of the reviewing process. Towards this goal, we examine two concrete research questions, the answers to which are prerequisites to building a functioning review assistant:

Q1: What are the desiderata of a good automatic reviewing system, and how can we quantify them for evaluation? Before developing an automatic review system, we first must quantify what constitutes a good review in the first place. The challenge of answering this question is that a review commonly involves both objective (e.g. “lack of details necessary to replicate the experimental protocol”) and subjective aspects (e.g. “lack of potential impact”). Due to this subjectivity, defining a “good” review is itself somewhat subjective.

As a step towards tackling this challenge, we argue that it is possible to view review generation as a task of aspect-based scientific paper summarization, where the summary not only tries to summarize the core idea of a paper, but also assesses specific aspects of that paper (e.g. novelty or potential impact). We evaluate review quality from multiple perspectives, claiming that a good review should not only summarize a paper well but also consist of factually correct and fair comments on diverse aspects, together with informative evidence.

To operationalize these concepts, we build a dataset of reviews from the machine learning domain, named ASAP-Review (ASpect-enhAnced Peer Review dataset), and make fine-grained annotations of aspect information for each review, which provides the possibility for a richer evaluation of generated reviews.

Q2: Using state-of-the-art NLP models, to what extent can we realize these desiderata? We provide an initial answer to this question by using the aforementioned dataset to train state-of-the-art summarization models to generate reviews from scientific papers, and evaluate the output according to our evaluation metrics described above. We propose different architectural designs for this model, which we dub ReviewAdvisor (§LABEL:sec:model), and comprehensively evaluate them, interpreting their relative advantages.

Lastly, we highlight our main observations and conclusions:

(1) What are review generation systems (not) good at? Most importantly, we find the constructed automatic review system generates non-factual statements regarding many aspects of the paper assessment, which is a serious flaw in a high-stakes setting such as reviewing. However, there are some bright points as well. For example, it can often precisely summarize the core idea of the input paper, which can be either used as a draft for human reviewers or help them (or general readers) quickly understand the main idea of the paper to be reviewed (or pre-print papers). It can also generate reviews that cover more aspects of the paper’s quality than those created by humans, and provide evidence sentences from the paper. These could potentially provide a preliminary template for reviewers and help them quickly identify salient information in making their assessment.

(2) Will the system generate biased reviews? Yes. We present methods to identify and quantify potential biases in reviews (§LABEL:sec:bias-analysis), and find that both human and automatic reviewers exhibit varying degrees of bias. (i) Regarding native vs. non-native English speakers: papers by native English speakers tend to obtain higher scores on “Clarity” from human reviewers than papers by non-native English speakers, but the automatic review generators narrow this gap. (Whether this actually qualifies as “bias” is perhaps arguable: papers written by native English speakers may be clearer because they lack confusing grammatical errors, but a paper may also be perfectly clear and only give the impression of being unclear because of grammatical errors.) Additionally, system reviewers are harsher than human reviewers when commenting on the “Originality” of papers by non-native English speakers. (ii) Regarding anonymous vs. non-anonymous submissions: both human reviewers and system reviewers favor non-anonymous papers, which have been posted on non-blind preprint servers such as arXiv (https://arxiv.org/) before the review period, over anonymous papers in all aspects.

Based on the above-mentioned issues, we claim that a review generation system cannot replace human reviewers at this time; instead, it may be helpful as part of a machine-assisted human review process. Our research also sheds light on what comes next in pursuing better methods for automatic review generation or assistance, and we summarize eight challenges that can be explored as future directions in §LABEL:sec:challenges.

What Makes a Good Peer Review?

Although peer review has been adopted by most journals and conferences to identify important and relevant research, its effectiveness is being continuously questioned (Smith2006PeerRA; langford2015arbitrariness; tomkins2017reviewer; gao-etal-2019-rebuttal; rogers-augenstein-2020-improve).

As concluded by jefferson2002measuring: “Until we have properly defined the objectives of peer-review, it will remain almost impossible to assess or improve its effectiveness.” Therefore we first discuss the possible objectives of peer review.

A research paper is commonly first reviewed by several committee members who usually assign one or several scores and give detailed comments. The comments, and sometimes scores, cover diverse aspects of the paper (e.g. “clarity,” “potential impact”; detailed in §3.2.1), and these aspects are often directly mentioned in review forms of scientific conferences or journals (one example, from ACL, can be found at https://acl2018.org/downloads/acl_2018_review_form.html).

Then a senior reviewer will often make a final decision (i.e., “reject” or “accept”) and provide comments summarizing the decision (i.e., a meta-review).

After going through many review guidelines (https://icml.cc/Conferences/2020/ReviewerGuidelines, https://NeurIPS.cc/Conferences/2020/PaperInformation/ReviewerGuidelines, https://iclr.cc/Conferences/2021/ReviewerGuide) and resources about how to write a good review (https://players.brightcove.net/3806881048001/rFXiCa5uY_default/index.html?videoId=4518165477001, https://soundcloud.com/nlp-highlights/77-on-writing-quality-peer-reviews-with-noah-a-smith, https://www.aclweb.org/anthology/2020.acl-tutorials.4.pdf, https://2020.emnlp.org/blog/2020-05-17-write-good-reviews), we summarize some of the most frequently mentioned desiderata below:

Decisiveness: A good review should take a clear stance, selecting high-quality submissions for publication and suggesting others not be accepted (jefferson2002effects; Smith2006PeerRA).

Comprehensiveness: A good review should be well-organized, typically starting with a brief summary of the paper’s contributions, then following with opinions gauging the quality of a paper from different aspects. Many review forms explicitly require evaluation of different aspects to encourage comprehensiveness.

Justification: A good review should provide specific reasons for its assessment, particularly whenever it states that the paper is lacking in some aspect. This justification also makes the review more constructive (another oft-cited desideratum of reviews), as these justifications provide hints about how the authors could improve problematic aspects in the paper (xiong-litman-2011-automatically).

Accuracy: A review should be factually correct, with the statements contained therein not being demonstrably false.

Kindness: A good review should be kind and polite in language use.

Based on the above desiderata, we take a first step towards the evaluation of reviews for scientific papers and characterize a “good” review from multiple perspectives.

2 Multi-Perspective Evaluation

The advent of the Open Peer Review system (https://openreview.net/) makes it possible to access review data for analysis or model training/testing. Previous work by kang18naacl collected reviews from several prestigious publication venues, including the Conference of the Association for Computational Linguistics (ACL) and the International Conference on Learning Representations (ICLR). However, there were not nearly as many reviews accumulated in OpenReview at that time (there were no reviews of ICLR from 2018 to 2020, nor reviews of NeurIPS from 2018 to 2019), and other private reviews only accounted for a few hundred. Therefore, we decided to collect our own dataset, Aspect-enhanced Peer Review (ASAP-Review).

We crawled ICLR papers from 2017-2020 through OpenReview (https://openreview.net) and NeurIPS papers from 2016-2019 through NeurIPS Proceedings (http://papers.NeurIPS.cc). For each paper’s review, we keep as much metadata as possible. Specifically, for each paper, we include the following metadata that we can obtain from the review web page:

Reference reviews, which are written by a committee member.

Meta reviews, which are commonly written by an area chair (senior committee member).

Decision, which denotes a paper’s final “accept” or “reject” decision.

Other information like url, title, author, etc.

We used AllenAI Science-parse (https://github.com/allenai/science-parse) to parse the PDF of each paper and keep the structured textual information (e.g., titles, authors, section content and references). The basic statistics of our ASAP-Review dataset are shown in Tab. 2.

2 Aspect-enhanced Review Dataset

Reviews exhibit internal structure: for example, as shown in Fig. LABEL:fig:multi-view, they commonly start with a paper summary, followed by opinions on different aspects, together with evidence. In practice, however, this useful structural information cannot be obtained directly. Considering that fine-grained information about the various aspects touched on by the review plays an essential role in review evaluation, we conduct aspect annotation of those reviews. To this end, we first (i) introduce an aspect typology and then (ii) perform human annotation.

We define a typology that contains 8 aspects, which follows the ACL review guidelines (https://acl2018.org/downloads/acl_2018_review_form.html) with small modifications. We manually inspected several review guidelines from ML conferences and found the typology in the ACL review guideline both general and comprehensive. The aspects are Summary (SUM), Motivation/Impact (MOT), Originality (ORI), Soundness/Correctness (SOU), Substance (SUB), Replicability (REP), Meaningful Comparison (CMP) and Clarity (CLA). The detailed elaborations of each aspect can be found in Supplemental Material B.1. The abbreviations in parentheses are how we refer to each aspect for brevity. To take polarity into account, we also mark whether each comment is positive or negative for every aspect (except Summary).

2.2 Aspect Annotation

Overall, the data annotation involves four steps that are shown in Fig. 1.

To manually annotate aspects in reviews, we first set up a data annotation platform using Doccano (https://github.com/doccano/doccano). We asked 6 students from ML/NLP backgrounds to annotate the dataset by tagging text spans that indicate a specific aspect. For example, “The results are new ${}_{\text{[Positive Originality]}}$ and important to this field ${}_{\text{[Positive Motivation]}}$ ”. The detailed annotation guideline can be found in Supplemental Material B.1. Each review is annotated by two annotators, and the lowest pair-wise Cohen’s kappa is 0.653, which indicates substantial agreement. In the end, we obtained 1,000 human-annotated reviews in total. The aspect statistics in this dataset are shown in Fig. 2-(a).
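For reference, Cohen’s kappa compares the observed agreement $p_o$ between two annotators with the agreement $p_e$ expected by chance, $\kappa=(p_o-p_e)/(1-p_e)$. The following is a minimal sketch, assuming agreement is measured over per-token labels; the unit of comparison, the tag names, and the toy data are illustrative assumptions rather than details given in this section.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical token-level tags from two annotators ("O" = no aspect).
annotator_1 = ["ORI+", "ORI+", "O", "MOT+", "MOT+", "O", "CLA-", "O"]
annotator_2 = ["ORI+", "ORI+", "O", "MOT+", "O", "O", "CLA-", "CLA-"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Values between 0.61 and 0.80 are conventionally read as substantial agreement, which is the band the reported 0.653 falls into.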

Since there are over 20,000 reviews in our dataset, using human labor to annotate them all is unrealistic. Therefore, we use the annotated data we do have to train an aspect tagger and use it to annotate the remaining reviews. The basic architecture of our aspect tagger contains a pre-trained model BERT devlin2019bert and a multi-layer perceptron. The training details can be found in Appendix LABEL:app:train-tagger.

However, after inspecting the automatically labeled dataset, we found some common problems, such as interleaving of different aspects and inappropriate boundaries. To address those problems, we used seven heuristic rules, executed sequentially, to refine the prediction results. The detailed heuristics can be found in Appendix LABEL:sec:heuristics. An example of our model prediction after applying heuristic rules is shown in Appendix LABEL:sec:example-annotate. Fig. 2-(b) shows the distribution of all reviews over different aspects. As can be seen, the relative number of different aspects and the ratio of positive to negative are very similar across human and automatic annotation.

To evaluate the data quality of reviews’ aspects, we conduct a human evaluation. Specifically, we measure both aspect precision and aspect recall for our 15 defined aspect labels (Summary, plus the positive and negative variants of the other seven aspects).

We randomly chose 300 samples from our automatically annotated dataset and assigned each sample to three different annotators to judge the annotation quality. As before, these annotators are all from ML/NLP backgrounds.

The detailed calculation for aspect precision and aspect recall can be found in Appendix LABEL:app:asp-pre-and-asp-recall. Under these criteria, we achieved $92.75\%$ aspect precision and $85.19\%$ aspect recall. The fine-grained aspect precision and aspect recall for each aspect are shown in Tab. LABEL:tab:asp-cov-rec. The aspect recall for positive replicability is low because there are very few mentions of positive replicability: in our human evaluation, the system identified one out of two, which results in $50\%$. Other than that, the precision and recall are much higher. The recall numbers for negative aspects are lower than those for positive aspects. However, we argue that this will not affect the fidelity of our analysis much because (i) we observe that the imperfect recall is mostly (over $85\%$) caused by partial recognition of the same negative aspect in a review rather than an inability to recognize at least one occurrence, which will not affect our calculation of Aspect Coverage and Aspect Recall very much; and (ii) the imperfect recall will slightly pull up the Aspect Score (discussed in §LABEL:sec:measure_bias), but the trend will remain the same.

Note also that our evaluation criterion is very strict, so the reported numbers act as lower bounds for these two metrics.

However, we found that none of these adjustments could generate satisfying fluent and coherent texts according to our experiments. Common problems include interchanges between first and third person narration (They… Our model…), contradiction between consecutive sentences, more descriptive texts and fewer opinions, etc.

A.9 CE Extraction Details

The basic sentence statistics of our ASAP-Review dataset are listed in Tab. 12.

We use two steps to extract salient sentences from a source document: (i) keyword filtering and (ii) the cross-entropy method.

We predefined 48 keywords, and in the first stage we select sentences containing those keywords or their inflections. The 48 keywords are shown in Tab. LABEL:table:_keywords. After applying keyword filtering, the statistics of selected sentences are shown in Tab. 13.
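The filtering stage can be sketched as follows. This is only an illustration: the five keywords are hypothetical stand-ins for the 48 listed in the keyword table, and inflections are approximated with a crude suffix rule rather than whatever morphological matching was actually used.

```python
import re

# Hypothetical stand-ins for the 48 predefined keywords.
KEYWORDS = {"propose", "introduce", "outperform", "result", "experiment"}
SUFFIXES = ("", "s", "es", "d", "ed", "ing")

def inflected_forms(keyword):
    """Crude inflection generator: the keyword plus common English suffixes."""
    stems = {keyword}
    if keyword.endswith("e"):
        stems.add(keyword[:-1])  # e.g. "propose" -> "propos" + "ing"
    return {stem + suffix for stem in stems for suffix in SUFFIXES}

# All surface forms that count as a keyword hit.
VOCAB = set().union(*(inflected_forms(k) for k in KEYWORDS))

def filter_sentences(sentences):
    """Keep sentences that contain at least one keyword form (case-insensitive)."""
    kept = []
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        if any(token in VOCAB for token in tokens):
            kept.append(sentence)
    return kept

candidates = [
    "We propose a new model.",
    "The weather is nice.",
    "Our method outperforms strong baselines.",
]
print(filter_sentences(candidates))
```

The suffix rule over-generates some non-words, which is harmless for matching; a lemmatizer would be the more careful choice.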

A.9.2 Cross Entropy Method

Following the approach of 10.1145/3077136.3080690 to unsupervised summarization, we formalize the sentence extraction problem as a combinatorial optimization problem. Specifically, we define the performance function $R$ as below.

Where $S$ represents the concatenation of selected sentences, $\text{Len}(S)$ represents the number of words in $S$ while $\text{Count}(w)$ represents the number of times $w$ appears in $S$ . The intuition behind this performance function is that we want to select sentences that can cover more diverse words. Note that when calculating $R(S)$ , we do preprocessing steps (i.e. lowercasing, removing punctuation, removing stop words etc.).
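A performance function consistent with these definitions is the entropy of the unigram distribution over $S$, i.e. $R(S)=-\sum_{w\in S}p_{S}(w)\log p_{S}(w)$ with $p_{S}(w)=\text{Count}(w)/\text{Len}(S)$. The sketch below assumes this form; the stop-word list and tokenizer are small illustrative placeholders.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the one used in the paper).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "is", "are", "we"}

def preprocess(text):
    """Lowercase, strip punctuation, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

def performance(sentences):
    """Assumed form of R(S): unigram entropy of the concatenated selection."""
    words = preprocess(" ".join(sentences))
    length = len(words)          # Len(S), counted after preprocessing
    if length == 0:
        return 0.0
    counts = Counter(words)      # Count(w) for every distinct w in S
    return -sum((c / length) * math.log(c / length) for c in counts.values())

repetitive = ["The model is a model.", "The model."]
diverse = ["We propose a tagger.", "Experiments show gains."]
print(performance(repetitive), performance(diverse))
```

Under this assumed form, a selection that repeats the same words scores lower than one that spreads its mass over many distinct words, matching the stated intuition.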

For each paper containing $n$ sentences, we aim to find a binary vector $p=(p_{1},\cdots,p_{n})$ in which $p_{i}$ indicates whether the $i$ -th sentence is selected, such that the combination of selected sentences achieves the highest performance score and also contains fewer than 30 sentences. (This number is chosen according to our empirical observations: we need to extract sentences that can fit BART’s input length (1024).) We did this using the Cross Entropy Method rubinstein2013cross. The algorithm is shown below.

1. For each paper containing $n$ sentences, we first assume that each sentence is equally likely to be selected. We start with $p_{0}=(1/2,1/2,...,1/2)$ . Let $t:=1$ .

2. Draw a sample $X_{1},\cdots,X_{N}$ of Bernoulli vectors with success probability vector $p_{t-1}$ . For each vector, concatenate the sentences selected and get $N$ sequences $S_{1},\cdots,S_{N}$ . Calculate the performance scores $R(S_{i})$ for all $i$ , and order them from smallest to biggest, $R_{(1)}\leq R_{(2)}\leq\cdots\leq R_{(N)}$ . Let $\gamma_{t}$ be the $(1-\rho)$ sample quantile of the performances: $\gamma_{t}=R_{(\lceil(1-\rho)N\rceil)}$ .

3. Use the same sample to calculate $\hat{p_{t}}=(\hat{p}_{t,1},\cdots,\hat{p}_{t,n})$ via

$$\hat{p}_{t,j}=\frac{\sum_{i=1}^{N}I_{\{R(S_{i})\geq\gamma_{t}\}}I_{\{X_{ij}=1\}}}{\sum_{i=1}^{N}I_{\{R(S_{i})\geq\gamma_{t}\}}},\quad j=1,\cdots,n,$$

where $X_{ij}$ is the $j$ -th component of $X_{i}$ and $I_{\{c\}}$ takes the value 1 if $c$ is satisfied, otherwise 0.

4. If the value of $\gamma_{t}$ hasn’t changed for 3 iterations, then stop. Otherwise, set $t:=t+1$ and return to step 2.

The elements of $p_{t}$ converge to values very close to either 0 or 1, and we can sample from the converged $p_{t}$ to get our extraction.

We chose $N=1000$ , $\rho=0.05$ and $\alpha=0.7$ when we ran this algorithm. If we happen to select more than 30 sentences in a sample, we drop this sample. Note that we slightly decrease the initial probability when there are more than 90 sentences after filtering, to ensure that enough samples survive in the first few iterations.
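The procedure can be sketched in a few lines of Python. Two points are assumptions rather than statements from the text above: $\alpha$ is used as the usual cross-entropy smoothing weight, $p_{t}=\alpha\hat{p}_{t}+(1-\alpha)p_{t-1}$, and the final extraction thresholds the converged probabilities at 0.5 instead of sampling from them. The function and parameter names are hypothetical.

```python
import math
import random

def cross_entropy_select(n, score_fn, num_samples=1000, rho=0.05, alpha=0.7,
                         max_selected=30, init_prob=0.5, patience=3,
                         max_iters=100, seed=0):
    """Return a 0/1 list marking selected sentences, found by the CE method."""
    rng = random.Random(seed)
    p = [init_prob] * n                      # step 1: equal initial probabilities
    gamma_history = []
    for _ in range(max_iters):
        # Step 2: draw Bernoulli vectors and drop over-long selections.
        samples = []
        for _ in range(num_samples):
            x = [1 if rng.random() < p_j else 0 for p_j in p]
            if sum(x) <= max_selected:
                samples.append((score_fn(x), x))
        if not samples:
            break
        samples.sort(key=lambda pair: pair[0])
        # gamma_t is the (1 - rho) sample quantile of the scores.
        idx = max(math.ceil((1 - rho) * len(samples)) - 1, 0)
        gamma = samples[idx][0]
        # Step 3: selection frequency of each sentence among elite samples.
        elite = [x for score, x in samples if score >= gamma]
        p_hat = [sum(x[j] for x in elite) / len(elite) for j in range(n)]
        # Assumed smoothing update with weight alpha.
        p = [alpha * ph + (1 - alpha) * pj for ph, pj in zip(p_hat, p)]
        # Step 4: stop once gamma has not changed for `patience` iterations.
        gamma_history.append(gamma)
        recent = gamma_history[-patience:]
        if len(recent) == patience and len(set(recent)) == 1:
            break
    # Simplification: threshold instead of sampling from the converged p.
    return [1 if p_j > 0.5 else 0 for p_j in p]

# Toy usage: the score is the unigram entropy of the selected sentences.
toy = ["we propose a new model", "we propose a new model",
       "experiments show strong gains", "the model is new"]

def toy_score(x):
    words = " ".join(s for s, keep in zip(toy, x) if keep).split()
    if not words:
        return 0.0
    return -sum(words.count(w) / len(words) * math.log(words.count(w) / len(words))
                for w in set(words))

print(cross_entropy_select(len(toy), toy_score, num_samples=200))
```

Passing a lower `init_prob` corresponds to decreasing the initial probability for long documents, so that fewer samples exceed the sentence limit in early iterations.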

A.10 Detailed Analysis and Case Study

We use our aspect-enhanced model with CE extraction to conduct a case study. Tab. 16 lists five examples for each aspect the model mentions. It can be seen that the language of the generated reviews is quite close to that of real reviewers.

For the aspect-enhanced model, it would also be interesting to trace back to the evidence when the model generates a specific aspect. To do that, we inspect where the model attends when it generates a specific aspect by looking at the attention values with respect to the source input. (We always aggregate attention values by taking the maximum, whether aggregating tokens into a word, aggregating different attention heads, or aggregating words into an aspect.)

Interestingly, we found that the model attends to reasonable places when it generates a specific aspect. Fig. 9 presents the attention heatmap of several text segments; the bottom of the figure shows the aspects the model generates. We found some common patterns when we examined the attention values between the source input and the output.

When the model generates summary, it attends to sentences that contain strong indicators like “we propose” or “we introduce”.

When it generates originality, it attends to the previous-work part as well as places describing the contributions of this work.

When it generates substance, it attends to experiment settings and the number of experiments conducted.

When it generates meaningful comparison, it attends to places that contain “et al.”
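The max-based aggregation used to produce these attention patterns can be sketched as below. The tensor layout (heads, then output tokens of one aspect span, then source tokens) and the token-to-word mapping are assumptions made for illustration; the function name is hypothetical.

```python
def aggregate_attention(attn, token_to_word, n_words):
    """Max-pool attention over heads, output tokens, and source sub-tokens.

    attn[h][t][s] is the attention weight from head h, at output token t of
    the aspect span, to source token s. token_to_word[s] is the index of the
    source word that source token s belongs to.
    Returns one score per source word.
    """
    scores = [0.0] * n_words  # attention weights are non-negative
    for head in attn:
        for output_token in head:
            for s, value in enumerate(output_token):
                word = token_to_word[s]
                scores[word] = max(scores[word], value)
    return scores

# Two heads, one output token, three source tokens forming two source words.
attention = [[[0.1, 0.2, 0.7]], [[0.5, 0.1, 0.4]]]
print(aggregate_attention(attention, token_to_word=[0, 0, 1], n_words=2))
```

Taking the maximum at every level keeps a source word highlighted if any head or any sub-token attends to it strongly, at the cost of ignoring how consistently it is attended to.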

A.11 Calculation of Aspect Score

For accepted (rejected) papers, we calculate the average aspect score for each aspect.

The aspect score of a review is calculated as follows.

If an aspect does not appear in a review, then we count the score for this aspect as 0.5 (which stands for neutral).

If an aspect appears in a review, we denote its occurrences as $\mathcal{O}=\{o_{1},o_{2},\cdots,o_{n}\}$ where $n$ is the total number of occurrences. And we denote the positive occurrences of this aspect as $\mathcal{O}_{p}=\{o_{p_{1}},o_{p_{2}},\cdots,o_{p_{n}}\}$ where $p_{n}$ is the total number of positive occurrences. The aspect score is calculated using Formula 12.
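The scoring rule can be sketched as follows, assuming the referenced formula is the fraction of positive occurrences, $|\mathcal{O}_{p}|/|\mathcal{O}|$, which is consistent with 0.5 standing for neutral. The data layout (a dictionary from aspect name to a list of polarity flags) is hypothetical.

```python
def aspect_score(polarities):
    """Score of one aspect in one review.

    polarities holds one boolean per occurrence of the aspect
    (True = positive, False = negative).
    """
    if not polarities:
        return 0.5  # aspect absent from the review: neutral
    # Assumed formula: share of positive occurrences, |O_p| / |O|.
    return sum(polarities) / len(polarities)

def average_aspect_score(reviews, aspect):
    """Mean aspect score over a group of reviews (e.g. accepted papers)."""
    scores = [aspect_score(review.get(aspect, [])) for review in reviews]
    return sum(scores) / len(scores)

accepted_reviews = [
    {"CLA": [True, True, False], "ORI": [True]},
    {"ORI": [False]},
]
print(average_aspect_score(accepted_reviews, "CLA"))
print(average_aspect_score(accepted_reviews, "ORI"))
```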

A.12 Bias Analysis for All Models

Here, following the methods we proposed in §LABEL:sec:measure_bias, we list the bias analysis for all models in Fig. 10, Fig. 11, Tab. 14, Tab. 15.

The annotation guideline for annotating aspects in reviews can be found at https://github.com/neulab/ReviewAdvisor/blob/main/materials/AnnotationGuideline.pdf