Comparing different feature representations, we observed similar results: although TF-IDF models might capture more repetitive patterns (higher accuracy), contextualized models help us identify patterns of terms used in different contexts, which reduces the accuracy. TF-IDF is a statistical measure that evaluates how relevant a word is to an item in a collection of items. The item metadata (patient characteristics) give us the actual age and gender for each item in a cluster. Do particular question categories and topics dominate within each cluster of similar item stems with respect to these patient characteristics? Every vignette’s text contained a reference to the patient’s age and gender, consistent with the exam style guidelines, and every item contained metadata with categorical values representing age (grouped into categories) and gender. We also removed high-frequency non-medical words after consulting an internal medical advisor who has expertise in the content areas represented by the exam blueprints, in the item-writing guidelines, and in training other physicians to create content for these exams. In this study, we propose to explore such bias on the part of physicians writing items for medical licensure exams, where “bias” is defined as the use of repeated, stereotypical language patterns with respect to patient characteristics.
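To make the TF-IDF weighting mentioned above concrete, the following is a minimal sketch of the unsmoothed textbook form, tf × log(N / df). The function name `tfidf` and the toy stems are hypothetical and are not taken from the exam data; the text does not state which TF-IDF variant was used, and libraries such as scikit-learn apply smoothed variants by default.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses the unsmoothed textbook form: (count / doc length) * log(N / df).
    This is an illustrative assumption, not the variant used in the study.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy item stems, for illustration only.
stems = [
    "woman presents with chest pain".split(),
    "man presents with chest pain".split(),
    "woman presents with pelvic pain".split(),
]
weights = tfidf(stems)
```

In this toy collection, a term that occurs in every stem (such as "pain") receives a weight of zero, while a term confined to one stem (such as "pelvic") receives the largest weight in that stem. This is the sense in which TF-IDF rewards terms that are distinctive for an item rather than merely frequent.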
We applied topic modeling to the clean item stems with correctly predicted gender and age ranges to discover the hidden themes (language patterns indicative of specific patient characteristics) in a cluster. Because we intended to identify language patterns indicative of both patient characteristics, we removed age- and gender-indicative terms. Language models provide context to distinguish between words and phrases that sound similar. Because SciBERT and BioClinicalBERT are trained on different datasets, their results differed: SciBERT achieved higher accuracy in two clusters, while BioClinicalBERT performed best in three others. Because Logistic Regression performed best in all clusters, we selected it for comparison with models trained on other representations, as shown in Table III. Taking Fig. 4 as an example, in the subsection “Preliminary experiment settings and model tuning”, the Gram matrix with five sample data is shown below. As with gender, accuracy here indicates the presence of patterns related to the age-range patient characteristic, so a model should ideally have lower accuracy. We observed that training our model end-to-end is critical for improved registration accuracy (Table 2). We experimented with pre-training the Appearance model, Flow model, and Style encoder separately using the losses defined in Sec.
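The removal of age- and gender-indicative terms described in this section can be sketched as a simple preprocessing step. The term list, the regular expressions, and the function name `strip_patient_terms` are assumptions made for illustration; the text does not enumerate the terms that were actually removed.

```python
import re

# Hypothetical term list; the source does not list the removed terms.
GENDER_TERMS = [
    "woman", "man", "female", "male", "girl", "boy",
    "she", "he", "her", "his", "him",
]

# Matches phrases such as "45-year-old", "45 year old", or "6-month-old".
AGE_PATTERN = re.compile(
    r"\b\d+[- ](?:year|month|week|day)s?[- ]old\b", re.IGNORECASE
)

# Whole-word match only, so "human" or "other" are left untouched.
GENDER_PATTERN = re.compile(
    r"\b(?:" + "|".join(GENDER_TERMS) + r")\b", re.IGNORECASE
)


def strip_patient_terms(stem):
    """Remove age and gender mentions, then collapse leftover whitespace."""
    stem = AGE_PATTERN.sub(" ", stem)
    stem = GENDER_PATTERN.sub(" ", stem)
    return " ".join(stem.split())


example = strip_patient_terms("A 45-year-old woman presents with chest pain.")
```

Masking these terms before training means that a classifier can only predict age or gender from the remaining language, which is the signal of interest when looking for stereotypical patterns.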