Spotting Spurious Data with Neural Networks

Automatic identification of spurious instances (those with potentially wrong labels) can improve the quality of existing language resources, especially when annotations are obtained through crowdsourcing or generated automatically from coded rankings. We have developed an effective approach, inspired by queueing theory and the psychology of learning, that automatically identifies spurious instances in datasets. It discriminates instances by their "difficulty to learn," as determined by a downstream learner, and can be applied to any dataset for which a neural network model of the target task exists.

Leitner System

  • Suppose we have n queues {q_0, q_1, ..., q_{n-1}}.
  • Initially, place all instances in the first queue, q_0.
  • The Leitner scheduler trains the network on the instances of q_i every 2^i iterations.
  • During training, if an instance from q_i is correctly classified by the network, it is "promoted" to q_{i+1}; otherwise, it is "demoted" to the first queue, q_0.
  • As the network trains, higher queues accumulate easier instances, while lower queues carry either hard or potentially spurious instances.
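
The queue bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: the names `leitner_schedule` and `train_and_check` are hypothetical, iterations are counted from 1, the neural network is abstracted into a callback that trains on a batch and reports which instances it classified correctly, and instances already in the last queue are assumed to stay there when classified correctly (the steps above do not specify this case).

```python
from typing import Callable, Dict, Hashable, List


def leitner_schedule(
    instances: List[Hashable],
    train_and_check: Callable[[List[Hashable]], Dict[Hashable, bool]],
    n_queues: int = 5,
    n_iterations: int = 16,
) -> Dict[Hashable, int]:
    """Run Leitner-style scheduling and return each instance's final queue index.

    `train_and_check` is a hypothetical callback: it trains the model on the
    given instances and returns {instance: classified_correctly}.
    """
    # Initially, every instance sits in the first queue, q_0.
    queue_of = {x: 0 for x in instances}

    for iteration in range(1, n_iterations + 1):
        # q_i is scheduled every 2^i iterations.
        due = [x for x, q in queue_of.items() if iteration % (2 ** q) == 0]
        if not due:
            continue
        correct = train_and_check(due)
        for x in due:
            if correct[x]:
                # Promote; capping at the last queue is an assumption.
                queue_of[x] = min(queue_of[x] + 1, n_queues - 1)
            else:
                # Demote to the first queue, q_0.
                queue_of[x] = 0
    return queue_of


def toy_learner(batch: List[Hashable]) -> Dict[Hashable, bool]:
    """Stand-in for a real model: never gets the instance named 'noisy' right."""
    return {x: x != "noisy" for x in batch}


final_queues = leitner_schedule(["a", "b", "noisy"], toy_learner, n_queues=3)
suspects = [x for x, q in final_queues.items() if q == 0]
```

With a real model, `train_and_check` would run one training pass over the batch and compare predictions with the given labels. Instances that remain in the lowest queues at the end of training are the candidates to inspect, since they are either hard or potentially spurious.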






[Paper], [Poster], [Code].