datasets | Oshin Agarwal

KELM Corpus

The KELM corpus consists of the entire English Wikidata Knowledge Graph expressed as natural language text. This synthetic corpus can be used to augment the pre-training data of language models. The method for generating the corpus and the results of incorporating it in pre-training are described at length in our NAACL 2021 paper, "Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training". The corpus is available here.

Entity-switched NER

This dataset consists of sentences from the English CoNLL'03 NER dataset in which the original entities are programmatically replaced with ethnically diverse ones, with some manual annotation for the replacement of organizations. It can be used to evaluate how robustly named entity recognition systems recognize a diverse set of entities. The dataset is available here, and more details about it can be found in "Entity-Switched Datasets: An Approach to Auditing the In-Domain Robustness of Named Entity Recognition Models".

Embeddings Personality Bias

This dataset consists of words describing nationalities, professions or a common noun description of a hypothetical person, along with human ratings of the Big Five personality traits formed solely from this information. Analysis of the data reveals a large number of statistically significant stereotypes held by people, which can also be found in word embeddings. The dataset and analysis are described in our \*SEM 2019 paper, "Word Embeddings (Also) Encode Human Personality Stereotypes". The dataset is available here.

EBM-NLP

This corpus consists of abstracts of clinical trials, annotated for phrases describing the patients, interventions and outcomes of the trial. As part of our NAACL 2019 paper, "Predicting Annotation Difficulty to Improve Task Routing and Model Performance for Biomedical Information Extraction", we also generated difficulty scores for labeling these abstracts, to determine when an abstract can be annotated by crowd workers and when it needs to be annotated by medical experts. The dataset is available here.
