Matching Paper with Reviewers: Report from Area Chairs

Dear readers,

My personal nightmare is rejecting interesting innovative papers in favor of safe incremental pieces on the topic du jour. Given the size of our submission pool (1400+), Min and I will be unable to read the vast majority of the submissions. This means that we have to rely on (very many) others to make the right choices. The only thing that is in our control is the mechanism for paper assignment, which ideally would match every paper with the most qualified reviewers and provide adequate oversight from area chairs.

In one of our previous blog entries, we discussed how papers were distributed among technical areas. In this entry, we want to focus on reviewer assignment within different areas. On our side, we implemented two changes. First, we increased the number of area chairs (ACs) for each area, and introduced the notion of a meta AC who coordinates efforts within each area. The idea here was to reduce the number of papers that each AC handles, thereby giving them the opportunity to closely coordinate their papers. In addition, we added scores from the Toronto Paper Matching System (TPMS) to help coordinate paper assignments. We believed that TPMS would be particularly useful for large areas where ACs may not necessarily know the research profile of all of the reviewers they work with.

While we provided general guidance on how to complete paper assignment within each area (e.g., balancing senior and junior reviewers for each submission), ACs had a free hand in this process. To inform you about this process, we asked several ACs to contribute a short summary of how they implemented the matching process. After reading their summaries, I was impressed by the diligence and thought they put into the process. I hope reading these summaries will be reassuring for those who submitted to ACL 2017: your papers were handled with all the attention they deserve. Also, for those of you who are interested in being an AC in the future, reading these descriptions will give you an idea of the amount of effort this job involves.

We thank Jason Eisner, Emily Pitler, Preslav Nakov, Alessandro Moschitti, and Bonnie Webber for their writing contributions, and the rest of the ACs for their hard work!

Phonology, Morphology and Word Segmentation (ACs: Jason Eisner, Hinrich Schütze; submitted by Jason Eisner)

As ACs, we’ll ultimately have to rank the papers and make recommendations. That’s both important and hard, so naturally we hope for input that will help us make wise decisions without reading the papers too closely ourselves! We’d like every paper to be read by a team of reviewers who, collectively, can identify its strengths and weaknesses.

Here’s how Hinrich Schütze and I proceeded in the area we’re co-chairing this year:

1. We manually organized the 40+ submissions into 8 clusters. The largest of these (unsupervised discovery / generative modeling) had 6 subclusters corresponding loosely to areas of linguistics.

2. We placed each reviewer into one or more of these (sub)clusters, sometimes also placing them alongside specific papers on which they seemed particularly qualified to comment. This was tricky because this year we’d been assigned some unfamiliar reviewers. We both spent quite a few hours looking up reviewers’ own work online, and in a database of relevant papers that I’d earlier scraped from the ACL Anthology on behalf of SIGMORPHON.

3. We invited extra reviewers for clusters where we were shorthanded. We ended up with a surplus, which made subsequent assignment easier.

4. Hinrich manually made an initial rough draft assignment in the START system. Then I went through each cluster in turn and assigned 3 reviewers to each paper, taking suggestions from the following 4 sources (in decreasing order of importance):
(a) our list of reviewers associated with that paper or its subcluster and cluster,
(b) the reviewers who had bid “yes,”
(c) the reviewers who scored as high matches according to the Toronto Paper Matching System, and
(d) the draft assignment.
START’s user interface color-codes each reviewer by their bid, so it was easy to ensure that almost all assignments were “yes” or “maybe” bids. The Toronto scores were harder to access and didn’t always match our intuitions, so we didn’t consult those frequently.

5. We revised the assignment until it was right. Reviewer assignment is a combinatorial optimization challenge that requires a few passes to smooth out. There are many interactions: No reviewer should have too high a load. (Nor just 1 paper, because that makes it hard to compare.) A paper that touches on multiple areas of expertise should have a correspondingly diverse review panel. A paper should not have more than one junior reviewer. Papers that are similar should share a reviewer.

Most important, each of a paper’s 3 reviewers should be appropriate! We were mainly interested in whether a reviewer is experienced with the technical methods or the linguistic phenomena, or knows the past literature on the task, or brings a special lens to a paper.
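
Several of the interactions in step 5 can be checked mechanically between passes. The following is only a minimal sketch, not the procedure the ACs actually used: the function name `check_assignment`, the dictionary format, and the `max_load` threshold are assumptions made for illustration. Only the panel size of 3, the "not just 1 paper" rule, and the "at most one junior reviewer" rule come from the description above.

```python
from collections import Counter

def check_assignment(assignment, junior, min_load=2, max_load=5, panel_size=3):
    """Return a list of human-readable constraint violations.

    assignment: dict mapping paper id -> list of reviewer names (hypothetical format)
    junior: set of reviewer names considered junior
    max_load: illustrative threshold, not a value taken from the post
    """
    problems = []
    # Count how many papers each reviewer has been given.
    load = Counter(r for reviewers in assignment.values() for r in reviewers)

    for paper, reviewers in sorted(assignment.items()):
        distinct = set(reviewers)
        if len(distinct) != panel_size:
            problems.append(f"{paper}: expected {panel_size} distinct reviewers, got {len(distinct)}")
        n_junior = sum(1 for r in distinct if r in junior)
        if n_junior > 1:
            problems.append(f"{paper}: {n_junior} junior reviewers")

    for reviewer, n in sorted(load.items()):
        if n > max_load:
            problems.append(f"{reviewer}: load {n} exceeds {max_load}")
        elif n < min_load:
            problems.append(f"{reviewer}: load {n} is below {min_load}")
    return problems

# Toy data, invented for this example.
example = {
    "P1": ["ann", "bob", "cy"],
    "P2": ["ann", "bob", "dee"],
    "P3": ["ann", "cy", "dee"],
}
violations = check_assignment(example, junior={"cy", "dee"}, max_load=2)
```

A check like this only covers the countable constraints. Whether a panel is diverse enough, or whether similar papers share a reviewer, still needs the human judgment described above.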

Tagging, Chunking, Syntax and Parsing  (ACs: Emily Pitler, Barbara Plank, Yue Zhang, Hai Zhao; submitted by Emily Pitler)

We did not use any of the automatic assignment features in our area. Both the bids and Toronto Matching scores were used, but only as input features for our human judgments. Each of the four ACs assigned about 1/4 of the papers in our area. Below is the process I used for the papers I handled.

I went through the process with three simultaneous tabs open: one tab to do the actual assignments in START, another START tab for keeping track of how many papers were currently assigned to each reviewer, and a shared spreadsheet we had created ahead of time that showed the conflicts of interest, seniority status, and paper load quota for each reviewer.

When I got to a paper, I first looked at the set of people who bid ‘yes’ on it, and tried to select reviewers from that set.  Based on the title and abstract, I considered which of these reviewers had the most appropriate expertise for the paper; within that group, I tried to select three reviewers who would bring distinct viewpoints from one another.  I used the Toronto Matching Scores to help me figure out who might have relevant expertise among the reviewers I did not know personally.  

As a group, we made sure that each paper was assigned at most one student reviewer.  We also were able to assign exclusively ‘yes’ bidding reviewers in a large majority of the cases.  Finally, we did a fine-tuning pass where we evened out the load, reassigning papers from those who were initially assigned too many to those who were initially assigned too few. 
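
The final load-evening pass was done by hand, but its logic can be illustrated with a simple greedy sketch. Everything here is an assumption made for the example: the function name `even_out_load`, the data layout, and the rule of picking the least-loaded 'yes' bidder. Every assigned reviewer is assumed to have an entry in `quota`.

```python
def even_out_load(assignment, quota, yes_bids):
    """Move papers from over-quota reviewers to under-quota 'yes' bidders.

    assignment: dict paper id -> list of reviewer names (modified in place)
    quota: dict reviewer name -> maximum number of papers
    yes_bids: dict paper id -> set of reviewers who bid 'yes' on that paper
    Returns the list of (paper, old_reviewer, new_reviewer) moves made.
    """
    load = {r: 0 for r in quota}
    for reviewers in assignment.values():
        for r in reviewers:
            load[r] = load.get(r, 0) + 1

    moves = []
    for paper in sorted(assignment):
        reviewers = assignment[paper]
        for i, old in enumerate(reviewers):
            if load[old] <= quota[old]:
                continue  # this reviewer is within quota
            # Candidates: 'yes' bidders with spare capacity who are not already on the paper.
            candidates = [r for r in sorted(yes_bids.get(paper, ()))
                          if r not in reviewers and load.get(r, 0) < quota.get(r, 0)]
            if not candidates:
                continue  # no suitable swap; leave this one for a human
            new = min(candidates, key=lambda r: load.get(r, 0))
            reviewers[i] = new
            load[old] -= 1
            load[new] = load.get(new, 0) + 1
            moves.append((paper, old, new))
    return moves

# Toy data, invented for this example.
assignment = {
    "P1": ["ann", "bob", "cy"],
    "P2": ["ann", "bob", "dee"],
    "P3": ["ann", "bob", "cy"],
}
quota = {"ann": 2, "bob": 3, "cy": 2, "dee": 2, "eve": 2}
yes_bids = {"P1": {"ann", "eve"}, "P2": {"ann"}, "P3": {"dee", "eve"}}
moves = even_out_load(assignment, quota, yes_bids)
```

A greedy pass like this ignores expertise and the one-student-per-paper rule, so its output would at most be a starting point for the manual fine-tuning described above.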

Semantics (Manaal Faruqui, Hannaneh Hajishirzi, Anna Korhonen, Preslav Nakov, Mehrnoosh Sadrzadeh, Aline Villavicencio; submitted by Preslav Nakov)

For the Semantics track, assigning papers to reviewers was greatly facilitated by the many great reviewers whom we managed to recruit for our area. Rather than simply reusing reviewer lists from previous years, we invested a lot of time and effort in putting together a long list of potential reviewers whom we knew and trusted, and this paid off, as many of them eventually joined our track. The process was further facilitated by TPMS, even though not all of our reviewers had put their profiles there. We had the second largest area this year, with 159 papers, but we had six co-chairs, so we had fewer than 30 papers per chair.

We first asked START to generate automatic assignments, which we then refined manually, making sure each paper was assigned at least one senior reviewer who was an expert in the paper’s topic and who had bid for the paper. We also did our best to match each paper with reviewers who had the necessary expertise: for this, we relied on suggestions from TPMS and on our knowledge of the background of the individual reviewers. We also tried to assign reviewers who had actually bid on the paper. All of this was subject to constraints, such as that no reviewer should have more than four papers assigned (in fact, most of our reviewers got three papers, and some got two), and we tried to avoid any conflict of interest we could foresee.

Balancing all these objectives and constraints was hard and took us several iterations and many hours of work over three days and across different time zones. We also wrote some scripts to perform automatic checks for some constraints, such as reviewer load. Overall, we are quite happy with the final result.
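
The scripts mentioned above are not shown in this post. As a hedged illustration of what one such check might look like, the sketch below flags papers that lack a senior reviewer who bid on them. The function name `papers_missing_senior_bidder` and the data format are hypothetical, and topical expertise is not checked, since it cannot be read off the bids.

```python
def papers_missing_senior_bidder(assignment, senior, bids):
    """Return papers with no assigned reviewer who is both senior and bid on the paper.

    assignment: dict paper id -> list of reviewer names (hypothetical format)
    senior: set of reviewer names considered senior
    bids: dict paper id -> set of reviewers who bid on that paper
    """
    missing = []
    for paper, reviewers in sorted(assignment.items()):
        bidders = bids.get(paper, set())
        if not any(r in senior and r in bidders for r in reviewers):
            missing.append(paper)
    return missing

# Toy data, invented for this example.
semantics_assignment = {"S1": ["ann", "bob", "cy"], "S2": ["bob", "cy", "dee"]}
flagged = papers_missing_senior_bidder(
    semantics_assignment,
    senior={"ann", "dee"},
    bids={"S1": {"ann", "cy"}, "S2": {"cy"}},
)
```

Here "S2" is flagged because its only senior reviewer did not bid on it, and its only bidder is not senior.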

Information Extraction and Retrieval, Question Answering, Text Mining, Document Analysis and NLP Applications (Eugene Agichtein, Chia-Hui Chang, Jing Jiang, Sarvnaz Karimi, Zornitsa Kozareva, Kang Liu, Tie-Yan Liu, Mausam, Alessandro Moschitti, Smaranda Muresan; submitted by Alessandro Moschitti)

The paper assignment for ACL 2017 had some challenges due to:

  1. the high number of reviewers (280) and papers (307), which is not typical for a standard track;
  2. the need to coordinate the work of 10 Area Chairs in different time zones;
  3. the fact that the same reviewers were serving both the short and long tracks, while the tracks were set up as different conferences; consequently, Softconf could not automatically assign reviews while respecting each reviewer’s paper load constraint.

To solve this last problem, we came up with two competing approaches:

a. make two automatic assignments for short and long paper tracks, allowing violation of some load constraints; and

b. set two assignment quotas for each reviewer for the two tracks, respectively, such that their sum satisfies the desired reviewer load.

Option a. would have required a manual check of the reviewer load, whereas Option b. might have generated two assignment problems with possibly no feasible solutions and, in any case, its two solutions would probably have been worse than the one achievable by Option a. Thus, we opted for Option a.

Meta-Area Chair one (MAC1) made the automatic assignments for the long and short tracks, and set up the environment needed for TPMS. Due to the high number of reviewers, this solution produced very few violations.

Meta-Area Chair two (MAC2) automatically scraped the current reviewer quota and the current assignments for each reviewer from the short and long track data in Softconf. MAC2 wrote an automated script to verify the current reviewers’ load, and listed all reviewers whose quota was violated and needed adjustment.
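
MAC2's script is not reproduced here. The following is a minimal sketch of the kind of check described, assuming the per-track assignments and quotas have already been exported into plain dictionaries. The function name `quota_violations` and the data format are hypothetical, and the scraping from Softconf is not shown.

```python
from collections import Counter

def quota_violations(long_track, short_track, quota):
    """List reviewers whose combined load over both tracks exceeds their quota.

    long_track, short_track: dict paper id -> list of reviewer names (hypothetical export format)
    quota: dict reviewer name -> maximum total number of papers;
           a reviewer missing from quota raises KeyError rather than being skipped silently
    Returns (reviewer, combined_load, quota) tuples, largest excess first.
    """
    load = Counter()
    for track in (long_track, short_track):
        for reviewers in track.values():
            load.update(reviewers)
    over = [(r, n, quota[r]) for r, n in load.items() if n > quota[r]]
    # Sort by size of the excess (descending), then by name for a stable order.
    return sorted(over, key=lambda t: (-(t[1] - t[2]), t[0]))

# Toy data, invented for this example.
long_papers = {"L1": ["ann", "bob", "cy"], "L2": ["ann", "bob", "dee"]}
short_papers = {"S1": ["ann", "cy", "dee"], "S2": ["ann", "bob", "dee"]}
quotas = {"ann": 3, "bob": 3, "cy": 2, "dee": 2}
overloaded = quota_violations(long_papers, short_papers, quotas)
```

Summing the two tracks before comparing against the quota is the point of the check: each track can look fine on its own while the combined load is too high.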

After that, some volunteers from the Area Chairs (ACs) manually resolved the violations by assigning the most suitable reviewers from TPMS (indeed, knowing the characteristics of all 280 reviewers was basically impossible). The main challenge was coordinating the eight ACs, since they were in different time zones and could only work on the assignment one at a time; otherwise they would have ended up overriding each other’s assignments.

When we started running out of time, MAC2 took over the process and finalized all missing assignments by manually going over all reviewers with exceeded load and finding replacements among the reviewers whose expertise and TPMS score matched the paper and who could still take on additional review load. In the meantime, all the ACs were asked to check for COIs or wrong assignments, drawing on their specific knowledge of the reviewers.

After this step, MAC1 carried out a final check over all assignments and fixed one more small problem, a mismatch between the sets of reviewers in the two tracks (3 reviewers were missing in the short track).

After this last step, the review phase was ready to start and emails were sent to all reviewers.

Discourse and Pragmatics (Yangfeng Ji, Sujian Li, Bonnie Webber; submitted by Bonnie Webber)

The Discourse & Pragmatics ACs did not do anything complicated:

 – We reviewed CoIs and, for people who had recently moved to new positions, added CoIs with colleagues at their previous institution;

 – Reviewers who put too many constraints on what they were willing to review (i.e., these and only these N papers) were gently told that they would be better members of the Community if they would accept a broader selection of papers compatible with their known expertise;

 – After running an initial assignment on Softconf, we adjusted it so that every paper would have at least one reviewer, if not two, who had ASKED to review the paper. We thought it best to have reviewers who had actually expressed interest in learning more about the work being reported.

 – We carried out the same process for both long and short papers.