OBJECTIVE: To assess whether an integrated Advanced Modular Manikin (AMM) provides an improved participant experience compared with the use of peripheral simulators alone during a standardized trauma team scenario.
BACKGROUND: Simulation-based team training has been shown to improve team performance. To address limitations of existing manikin simulators, the AMM platform was created to enable the interconnection, interoperability, and integration of multiple simulators ("peripherals") into an adaptable, comprehensive training system.
METHODS: A randomized, single-blinded, crossover study with two conditions was employed to assess differences in learner experience when using the integrated AMM platform versus peripheral simulators alone. First responders, anesthesiologists, and surgeons rated their experience and workload under both conditions in a 3-scene standardized trauma scenario. Participant ratings were compared, and focus groups were conducted to obtain insight into participant experience.
RESULTS: Fourteen teams (n=42) participated. Team experience ratings were higher for the integrated AMM condition than for peripherals alone (Cohen's d = .25, p = .016). Participant experience varied by background, with surgeons and first responders rating their experience significantly higher than anesthesiologists (p < .001). Higher workload ratings were observed with the integrated AMM condition (Cohen's d = .35, p = .014), driven primarily by anesthesiologist ratings. Focus groups revealed that participants preferred the integrated AMM condition for its increased realism, physiologic responsiveness, and the feedback provided on their interventions.
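The paired effect sizes reported here can be illustrated with a short sketch. One common paired formulation of Cohen's d for a crossover design (often written d_z) divides the mean within-team difference between conditions by the standard deviation of those differences; the rating values below are hypothetical, not the study's data, and the abstract does not specify which d variant was used.

```python
# Hypothetical sketch: paired Cohen's d (d_z) for crossover ratings.
# The rating values are illustrative, NOT the study's data.
import statistics

def cohens_d_paired(cond_a, cond_b):
    """d_z: mean of the paired differences / SD of the differences."""
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    return statistics.mean(diffs) / statistics.stdev(diffs)

amm_ratings        = [4.2, 3.9, 4.5, 4.0, 4.3]  # integrated condition (made up)
peripheral_ratings = [4.0, 3.8, 4.1, 3.9, 4.0]  # peripherals-only (made up)
d = cohens_d_paired(amm_ratings, peripheral_ratings)
```

With these made-up ratings the sketch yields a large d; the study's much smaller d = .25 reflects condition differences that were modest relative to their spread.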
CONCLUSION: This first comprehensive evaluation suggests that integration with the AMM platform provides benefits over individual peripheral simulators and has the potential to expand simulation-based learning opportunities and enhance learner experience, especially for surgeons.
Background: The transition to residency marks a significant shift in the financial circumstances of medical trainees. Despite existing resources, residents still cite uncertainty in this domain. A personal finance curriculum is needed to close this educational gap and improve the financial well-being of trainees.
Methods: The curriculum was developed using Kern's framework. Two needs assessments informed the consensus development of goals and objectives, educational strategies, and assessments. Course material was hosted online for asynchronous review and complemented by two 1-hour webinars. The curriculum was piloted at one institution. Participants completed (1) knowledge assessments before and after the intervention, (2) a survey of reactions to the curriculum, and (3) an assessment of financial behavioral changes after the intervention.
Results: Thirty-seven residents (37/49, 76%) enrolled in the curriculum. Among participants, 20 (20/37, 54%) completed the curriculum. Most participants agreed or strongly agreed that the content was relevant (20/20, 100%) and clearly presented (19/20, 95%) and that they would recommend the curriculum to other residents (20/20, 100%). Performance on the knowledge assessment improved 21% after the intervention (mean ± SD = pretest 57% ± 17%, posttest = 78% ± 12%; p < 0.001). Most residents (17/20, 85%) also reported behavioral changes including setting new financial goals (12/20, 60%), taking new action toward financial planning (11/20, 55%), and changing financial habits (6/20, 30%). There were no direct financial costs incurred in the implementation of this pilot.
Conclusions: This is a successful pilot of a virtual personal finance curriculum with positive outcomes data. Addressing this problem at scale will require buy-in from educators around the country to deliver this information to residents who may not otherwise seek it out. Future study should assess curricular outcomes in other settings and the durability of acquired knowledge and behavioral changes over time.
CONTEXT: The Japan Residency Matching Program (JRMP) launched in 2003 and is now a significant event for graduating medical students and postgraduate residency hospitals. The environment surrounding the JRMP has changed with Japanese health policy, resulting in an increase in the number of unmatched students. Beyond policy issues, we suspected that there were also common characteristics among the students who fail to match with residency hospitals.
METHODS: In total, 237 of 321 students who graduated from The University of Tokyo Faculty of Medicine from 2018 to 2020 participated in the study. The students answered the questionnaire and gave written consent for the use of their personal information, including their JRMP placement, scores on the pre-clinical clerkship (CC) Objective Structured Clinical Examinations (OSCE), the Computer-Based Test (CBT), and the National Board Examination (NBE), and domestic scores. The collected data were statistically analyzed.
RESULTS: JRMP placement was correlated with some of the pre-CC OSCE factors/stations and/or total scores/global scores. Above all, the result of the neurological examination station showed the strongest correlation with JRMP placement. On the other hand, the CBT result had no correlation with the JRMP results. The CBT results did, however, correlate significantly with the NBE results.
CONCLUSIONS: Our data suggest that the pre-clinical clerkship OSCE score and the CBT score, both obtained before the clinical clerkship, predict important outcomes including the JRMP and the NBE. These results also suggest that educational resources should be concentrated on students who score poorly on the pre-clinical clerkship OSCE and the CBT, to help them avoid failure in the JRMP and the NBE.
BACKGROUND: Opioid misuse, overprescribing, dependency, and overdose remain significant concerns in the United States. A quality improvement study was conducted at the University of Illinois Hospital & Health Sciences System to determine the effect of standardizing the default orders for hydrocodone-acetaminophen products, implemented on June 22, 2016.
METHODS: Prior to the intervention, default orders had variable dose tablet numbers (1 or 2) and dosing frequencies (every 4 or 6 hours), and no default dispense quantity. Defaults were modified to 1 tablet every 6 hours as needed for pain and dispense quantities of 3 and 5 days' supply were added. Number of tablets per order, dosing frequency, and days' supply prescribed between January 1, 2016, and June 21, 2016, were compared to those placed between June 22, 2016, and December 31, 2016. Opioid doses were converted into morphine milligram equivalents (MME). Analyses were performed to determine the effect of the intervention on daily opioid dose and number of days' supply prescribed.
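The dose-standardization arithmetic described above can be sketched as follows. The CDC's published conversion factor for hydrocodone is 1.0 (1 mg hydrocodone = 1 MME); the specific order parameters below are illustrative, not taken from the study data.

```python
# Sketch of the morphine-milligram-equivalent (MME) arithmetic described
# above. Order parameters are illustrative, not taken from the study data.
HYDROCODONE_MME_FACTOR = 1.0  # CDC conversion factor: 1 mg hydrocodone = 1 MME

def daily_mme(dose_mg_per_tablet, tablets_per_dose, doses_per_day, factor):
    """Daily opioid dose expressed in morphine milligram equivalents."""
    return dose_mg_per_tablet * tablets_per_dose * doses_per_day * factor

# Worst-case pre-intervention default: 2 tablets of 5 mg every 4 hours (6 doses/day)
pre = daily_mme(5, 2, 6, HYDROCODONE_MME_FACTOR)   # 60 MME/day
# Post-intervention default: 1 tablet of 5 mg every 6 hours (4 doses/day)
post = daily_mme(5, 1, 4, HYDROCODONE_MME_FACTOR)  # 20 MME/day
```

Note how the worst-case pre-intervention default exceeds the 50 MME/day threshold tracked in the results, while the standardized default falls well below it.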
RESULTS: A total of 22,052 orders were included in this study. Following the intervention, the number of tablets prescribed was reduced by an average of 19,832 tablets per month. Every-6-hours dosing (as opposed to every 4 hours) increased by 21.52 percentage points. Prescriptions with ≥ 50 MME/day dropped by 5.8 percentage points, and prescriptions with > 3 days' supply decreased by 2.54 percentage points. Logistic regression demonstrated an increase in opioid prescriptions with < 50 MME/day (odds ratio [OR] = 1.72, p < 0.001) and ≤ 3 days' supply (OR = 1.27, p < 0.001).
CONCLUSION: Default electronic health record settings strongly influence prescribing patterns.
PURPOSE: Assessment of the Core Entrustable Professional Activities for Entering Residency (Core EPAs) requires direct observation of learners in the workplace to support entrustment decisions. The purpose of this study was to examine the internal structure validity evidence of the Ottawa Surgical Competency Operating Room Evaluation (O-SCORE) scale when used to assess medical student performance in the Core EPAs across clinical clerkships.
METHOD: During the 2018-2019 academic year, the Virginia Commonwealth University School of Medicine implemented a mobile-friendly, student-initiated workplace-based assessment (WBA) system to provide formative feedback on the Core EPAs across all clinical clerkships. Students were required to request a specified number of Core EPA assessments in each clerkship. A modified O-SCORE scale (1 = "I had to do" to 4 = "I needed to be in the room just in case") was used to rate learner performance. Generalizability theory was applied to assess the generalizability (or reliability) of the assessments. Decision studies were then conducted to determine the number of assessments needed to achieve reasonable reliability.
RESULTS: A total of 10,680 WBAs were completed on 220 medical students. The majority of ratings were completed for EPA 1 (history and physical) (n = 3,129; 29%) and EPA 6 (oral presentation) (n = 2,830; 26%). Mean scores were similar (3.5-3.6 out of 4) across EPAs. Variance due to the student ranged from 3.5% to 8%, with the majority of the variation due to the rater (29.6%-50.3%) and other unexplained factors. Between 25 and 63 assessments were required to achieve reasonable reliability (Phi > 0.70).
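A decision (D-) study of the kind described here projects reliability for a chosen number of assessments per student from the G-study variance components: Phi(n) = var_student / (var_student + var_error / n). A minimal sketch, treating all non-student variance as error and using variance shares that mirror the 3.5%-8% student-variance range reported above (illustrative, not the study's exact components):

```python
# Sketch of a decision (D-) study projection from G-study variance
# components. Variance shares mirror the ranges reported above
# (student 3.5%-8%; remainder treated as undifferentiated error).
def phi(var_student, var_error, n):
    """Projected Phi coefficient with n assessments per student."""
    return var_student / (var_student + var_error / n)

def assessments_needed(var_student, var_error, target=0.70, max_n=500):
    """Smallest n whose projected Phi exceeds the target reliability."""
    for n in range(1, max_n + 1):
        if phi(var_student, var_error, n) > target:
            return n
    return None

n_hi = assessments_needed(0.08, 0.92)    # student variance share 8%
n_lo = assessments_needed(0.035, 0.965)  # student variance share 3.5%
```

With these simplified inputs the projection lands in the same neighborhood as the 25-63 assessments reported, showing how small student-variance shares drive up the number of observations needed.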
CONCLUSIONS: The O-SCORE demonstrated modest reliability when used across clerkships. These findings highlight specific challenges for implementing WBAs for the Core EPAs including the process for requesting WBAs, rater training, and application of the O-SCORE scale in medical student assessment.
BACKGROUND: The American College of Surgeons (ACS)/Association of Program Directors in Surgery (APDS) Resident Skills Curriculum includes validated task-specific checklists and global rating scales (GRS) for Objective Structured Assessment of Technical Skills (OSATS). However, it does not include instructions on use of these assessment tools. Since consistency of ratings is a key feature of assessment, we explored rater reliability for two skills.
METHODS: Surgical faculty assessed hand-sewn bowel and vascular anastomoses in real time using the OSATS GRS. The OSATS performances were videotaped and independently evaluated by a research resident and a surgical attending. Rating consistency was estimated using intraclass correlation coefficients (ICC) and generalizability analysis.
RESULTS: Three-rater ICC coefficients across 24 videos ranged from 0.12 to 0.75. Generalizability reliability coefficients ranged from 0.55 to 0.8. Percent variance attributable to raters ranged from 2.7% to 32.1%. Pairwise agreement showed considerable inconsistency for both tasks.
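Multi-rater consistency of this kind is often computed as a two-way intraclass correlation from a subjects-by-raters score matrix. Below is a minimal sketch of one common variant, ICC(2,1) (two-way random effects, absolute agreement, single rater); the rating matrices are illustrative, not study data, and the abstract does not specify which ICC form was used.

```python
# Sketch: ICC(2,1) from a subjects-by-raters matrix (illustrative data).
def icc_2_1(ratings):
    """Two-way random effects, absolute agreement, single-rater ICC."""
    n = len(ratings)        # subjects (videos)
    k = len(ratings[0])     # raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]
    msr = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)   # subjects
    msc = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)   # raters
    sse = sum((ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
              for i in range(n) for j in range(k))
    mse = sse / ((n - 1) * (k - 1))                                 # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

perfect    = icc_2_1([[1, 1, 1], [2, 2, 2], [3, 3, 3]])  # raters agree fully
rater_only = icc_2_1([[1, 2, 3], [1, 2, 3], [1, 2, 3]])  # pure rater effect
```

The two toy matrices bracket the extremes: full agreement yields an ICC of 1, while ratings that differ only by rater (no subject signal) yield 0, which is the kind of rater-driven variance flagged in these results.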
CONCLUSIONS: The variability of ratings for these two skills indicates the need for rater training to increase scoring agreement and decrease rater variability in technical skill assessments.
PURPOSE: To investigate the learning curve of robot-assisted vitreoretinal surgery compared to manual surgery in a simulated setting.
METHODS: The study was designed as a randomized controlled longitudinal study. Eight ophthalmic trainees in the 1st or 2nd year of their specialization were included. The participants were randomized to either manual or robot-assisted surgery. Participants completed repetitions of a test consisting of three vitreoretinal modules on the Eyesi virtual reality simulator. The primary outcome measure was time to learning-curve plateau (minutes) for the total test score. The secondary outcome measures were instrument movement (mm), tissue treatment (mm²), and time with instruments inserted (seconds).
RESULTS: There was no significant difference in time to learning-curve plateau for robot-assisted vitreoretinal surgery compared with manual surgery. Robot-assisted vitreoretinal surgery was associated with less instrument movement (i.e., improved precision), -0.91 standard deviation (SD) units (p < 0.001). Furthermore, robot-assisted vitreoretinal surgery was associated with less tissue damage than manual surgery, -0.94 SD units (p = 0.002). Lastly, robot-assisted vitreoretinal surgery was slower than manual surgery, 0.93 SD units (p < 0.001).
CONCLUSIONS: There was no significant difference between the lengths of the learning curves for robot-assisted vitreoretinal surgery compared to manual surgery. Robot-assisted vitreoretinal surgery was more precise, associated with less tissue damage, and slower.
BACKGROUND: Most medical schools offer medical Spanish education to teach patient-physician communication skills for the growing Spanish-speaking population. Medical Spanish courses that lack basic standards of curricular structure, faculty educators, learner assessment, and institutional credit may increase student confidence without sufficiently improving skills, inadvertently exacerbating communication problems with linguistic minority patients.
OBJECTIVE: To conduct a national environmental scan of US medical schools' medical Spanish educational efforts, examine to what extent existing efforts meet basic standards, and identify next steps in improving the quality of medical Spanish education.
DESIGN: Data were collected from March to November 2019 using an IRB-exempt online 6-item primary and 14-item secondary survey.
PARTICIPANTS: All deans of the Association of American Medical Colleges member US medical schools were invited to complete the primary survey. If a medical Spanish educator or leader was identified, that person was sent the secondary survey.
MAIN MEASURES: The presence of medical Spanish educational programs and, when present, whether the programs met four basic standards: formal curricular structure, faculty educator, learner assessment, and course credit.
KEY RESULTS: Seventy-nine percent of medical schools (125 of 158) responded to the primary survey, the secondary survey, or both. Among participating schools, 78% (98/125) offered medical Spanish programming; of those, 21% (21/98) met all basic standards. The likelihood of meeting all basic standards did not significantly differ by location, school size, or funding type. Fifty-four percent (53/98) reported formal medical Spanish curricula, 69% (68/98) had faculty instructors, 57% (56/98) included post-course assessment, and 31% (30/98) provided course credit.
CONCLUSIONS: Recommended next steps for medical schools include formalizing medical Spanish courses as electives or required curricula; hiring and/or training faculty educators; incorporating learner assessment; and granting credit for student course completion. Future studies should evaluate implementation strategies to establish best practice recommendations beyond basic standards.
Introduction: While many medical schools provide opportunities in medical Spanish for medical students, schools often struggle with identifying a structured curriculum. The purpose of this module was to provide a flexible, organ system-based approach to teaching and learning musculoskeletal and dermatologic Spanish terminology, patient-centered communication skills, and sociocultural health contexts.
Methods: An 8-hour educational module for medical students was created to teach musculoskeletal and dermatologic medical communication skills in Spanish within the Hispanic/Latinx cultural context. Participants included 47 fourth-year medical students at an urban medical school with a starting minimum Spanish proficiency at the intermediate level. Faculty provided individualized feedback on speaking, listening, and writing performance of medical Spanish skills, and learners completed a written pre- and postassessment testing skills pertaining to communication domains of vocabulary, grammar, and comprehension as well as self-reported confidence levels.
Results: Students demonstrated improvement in vocabulary, grammar, comprehension, and self-confidence of musculoskeletal and dermatologic medical Spanish topics. While students with overall lower starting proficiency levels (intermediate) scored lower on the premodule assessment compared to higher proficiency students (advanced/native), the postmodule assessment did not show significant differences in skills performance among these groups.
Discussion: An intermediate Spanish level prerequisite for this musculoskeletal and dermatologic module can result in skills improvement for all learners despite starting proficiency variability. Future study should evaluate learner clinical performance and integration of this module into other educational settings such as graduate medical education (e.g., orthopedic, rehabilitation, and dermatology residency programs) and other health professions (e.g., physical therapy and nursing).
Background: Simulator-assisted arthroscopy education traditionally consists of initial training of basic psychomotor skills before advancing to more complex procedural tasks.
Purpose: To explore and compare the effects of basic psychomotor skills training versus procedural skills training on novice surgeons' subsequent simulated knee arthroscopy performance.
Study Design: Controlled laboratory study.
Methods: Overall, 22 novice orthopaedic surgeons and 11 experienced arthroscopic surgeons participated in this study, conducted from September 2015 to January 2017. Novices received a standardized introductory lesson on knee arthroscopy before being randomized into a basic skills training group or a procedural skills training group. Each group performed 2 sessions on a computer-assisted knee arthroscopy simulator: The basic skills training group completed 1 session consisting of basic psychomotor skills modules and 1 session of procedural modules (diagnostic knee arthroscopy and meniscal resection), whereas the procedural skills training group completed 2 sessions of procedural modules. Performance of the novices was compared with that of the experienced surgeons to explore evidence of validity for the basic psychomotor training skills modules and the procedural modules. The effect of prior basic psychomotor skills training and procedural skills training was explored by comparing pre- and posttraining performances of the randomized groups using a mixed-effects regression model.
Results: Validity evidence was found for the procedural modules, as test results were reliable and experienced surgeons significantly outperformed novices. We found no evidence of validity for the basic psychomotor skills modules, as test scores were unreliable and there was no difference in performance between the experienced surgeons and novices. We found no statistical effect of basic psychomotor skills training as compared with no training (P = .49). We found a statistically significant effect of prior procedural skills training (P < .001) and a significantly larger effect of procedural skills training as compared with basic psychomotor skills training (P = .019).
Conclusion: Procedural skills training was significantly more effective than basic psychomotor skills training regarding improved performance in diagnostic knee arthroscopy and meniscal resection on a knee arthroscopy simulator. Furthermore, the basic psychomotor skills modules lacked validity evidence.
Clinical Relevance: On the basis of these results, we suggest that future competency-based curricula focus their training on full knee arthroscopy procedures. This could improve future education programs.
BACKGROUND: Examining the predictors of summative assessment performance is important for improving educational programs and structuring appropriate learning environments for trainees. However, predictors of certification examination performance in pediatric postgraduate education have not been comprehensively investigated in Japan.
METHODS: The Pediatric Board Examination database in Japan, which includes 1578 postgraduate trainees from 2015 to 2016, was analyzed. The examinations included multiple-choice questions (MCQs), case summary reports, and an interview, and the predictors for each of these components were investigated by multiple regression analysis.
RESULTS: The number of examination attempts and the training duration were significant negative predictors of the scores for the MCQ, case summary, and interview. Employment at a community hospital or private university hospital were negative predictors of the MCQ and case summary score, respectively. Female sex and the number of academic presentations positively predicted the case summary and interview scores. The number of research publications was a positive predictor of the MCQ score, and employment at a community hospital was a positive predictor of the case summary score.
CONCLUSION: This study found that delayed and repeated examination taking were negative predictors, while the scholarly activity of trainees was a positive predictor, of pediatric board certification examination performance.
Medical Spanish education aims to reduce linguistic barriers in healthcare and has historically been led by Hispanic/Latinx students and faculty, often without formal training or institutional support. We surveyed 158 US medical schools about their medical Spanish programs. We then examined national trends in Underrepresented in Medicine and Hispanic/Latinx faculty and students as factors associated with meeting medical Spanish basic standards for curricula, educators, assessment, and course credit. We received responses from 125 schools (79%), of which 98 (78%) reported offering some form of medical Spanish. Schools with greater racial/ethnic diversity were more likely to have medical Spanish required courses (P-values < 0.01) but not curricular electives. Overall, likelihood of meeting all basic standards did not differ by diversity characteristics. High-quality medical Spanish requires more than recruitment of diverse students and faculty. Institutions should prioritize meaningful inclusion by supporting evidence-based curricula and faculty educators.
Competency-based medical education (CBME) is being implemented worldwide. In CBME, residency training is designed around the competencies required for unsupervised practice and uses entrustable professional activities (EPAs) as workplace "units of assessment". Well-designed workplace-based assessment (WBA) tools are required to document the competence of trainees in authentic clinical environments. In this study, we developed a WBA instrument to assess residents' performance of intra-operative pathology consultations and conducted a validity investigation. The entrustment-aligned pathology assessment instrument for intra-operative consultations (EPA-IC) was developed through a national iterative consultation, and clinical supervisors used it to assess residents' performance in an anatomical pathology program. Psychometric analyses and focus groups were conducted to explore the sources of evidence described by modern validity theory: content, response process, internal structure, relations to other variables, and consequences of assessment. The content was considered appropriate, the assessment was feasible and acceptable to residents and supervisors, and it had a positive educational impact by improving the performance of intra-operative consultations and feedback to learners. The results had low reliability, which seemed to be related to assessment biases, and supervisors were reluctant to fully entrust trainees due to cultural issues. With CBME implementation, new workplace-based assessment tools are needed in pathology. In this study, we showcased the development of the first instrument for assessing residents' performance of a prototypical entrustable professional activity in pathology using modern education principles and validity theory.
OBJECTIVES: In 2015, the National Academy of Medicine (formerly the Institute of Medicine) estimated that 12 million patients were misdiagnosed annually. This suggests that, despite prolonged training in medical school and residency, there remains a need to improve diagnostic reasoning education. This study evaluates a new approach.
METHODS: A total of 285 medical students were enrolled in this 8-center, IRB-approved trial. Students were randomized to receive training in either abdominal pain (AP) or loss of consciousness (LOC). Baseline diagnostic accuracy for the two symptoms was assessed with a multiple-choice question (MCQ) examination and virtual patient encounters. Following a structured educational intervention, including a lecture on the diagnostic approach to the assigned symptom and three virtual patient practice cases, each student was re-assessed.
RESULTS: The change in diagnostic accuracy on virtual patient encounters was compared (1) between baseline and post-intervention and (2) post-intervention, between students trained in the prescribed symptom and those trained in the alternate symptom (controls). The completeness of the students' differential diagnoses was also compared. Comparisons of proportions were conducted using χ²-tests. Mixed-effects regressions were used to examine differences accounting for case and repeated measures. Compared with baseline, both the AP and LOC groups had marked post-intervention improvements in obtaining a correct final diagnosis: a 27% absolute improvement in the AP group (p<0.001) and a 32% absolute improvement in the LOC group (p<0.001). Compared with controls (the groups trained in the alternate symptom), the rate of correct diagnoses increased by 13%, but this was not statistically significant (p=0.132). The completeness and efficiency of the differential diagnoses increased by 16% (β=0.37, p<0.001) and 17% (β=0.45, p<0.001), respectively.
CONCLUSIONS: The study showed that a virtual patient platform combined with a diagnostic reasoning framework can be used for education and diagnostic assessment, and that it improved the rate of correct diagnosis compared with baseline performance in a simulated setting.
PURPOSE: Competency-based education relies on the validity and reliability of assessment scores. Generalizability (G) theory is well suited to explore the reliability of assessment tools in medical education but has only been applied to a limited extent. This study aimed to systematically review the literature using G-theory to explore the reliability of structured assessment of medical and surgical technical skills and to assess the relative contributions of different factors to variance.
METHOD: In June 2020, 11 databases, including PubMed, were searched from inception through May 31, 2020. Eligible studies included the use of G-theory to explore reliability in the context of the assessment of medical and surgical technical skills. Descriptive information on the study, assessment context, assessment protocol, participants being assessed, and G-analyses was extracted. These data were used to map the use of G-theory and to explore variance-components analyses. A meta-analysis was conducted to synthesize the extracted data on the sources of variance and reliability.
RESULTS: Forty-four studies were included; of these, 39 had sufficient data for meta-analysis. The total pool included 35,284 unique assessments of 31,496 unique performances of 4,154 participants. Person variance had a pooled effect of 44.2% (95% confidence interval [CI] [36.8%-51.5%]). Only assessment tool type (Objective Structured Assessment of Technical Skills-type vs task-based checklist-type) had a significant effect on person variance. The pooled reliability (G-coefficient) was .65 (95% CI [.59-.70]). Most studies included D-studies (39, 89%) and generally seemed to have higher ratios of performances to assessors to achieve a sufficiently reliable assessment.
CONCLUSIONS: G-theory is increasingly being used to examine reliability of technical skills assessment in medical education but more rigor in reporting is warranted. Contextual factors can potentially affect variance components and thereby reliability estimates and should be considered, especially in high-stakes assessment. Reliability analysis should be a best practice when developing assessment of technical skills.
OBJECTIVE: Surgeons provide patient care in complex health care systems and must be able to participate in improving both personal performance and the performance of the system. The Accreditation Council for Graduate Medical Education (ACGME) Vascular Surgery Milestones are utilized to assess vascular surgery fellows' (VSF) achievement of graduation targets in the competencies of Systems Based Practice (SBP) and Practice Based Learning and Improvement (PBLI). We investigate the predictive value of semiannual milestones ratings for final achievement within these competencies at the time of graduation.
METHODS: National ACGME milestones data were utilized for the analysis. All trainees entering 2-year vascular surgery fellowship programs in July 2016 were included (n=122). Predictive probability values (PPVs) were obtained for each SBP and PBLI sub-competency by biannual review period to estimate the probability of VSFs not reaching the recommended graduation target based on their previous milestones ratings.
RESULTS: The rate of non-achievement of the graduation target level of 4.0 on the SBP and PBLI sub-competencies at the time of graduation ranged from 13.1% to 25.4% among VSFs. At the first assessment time point, 6 months into the fellowship program, the PPVs of the SBP and PBLI milestones for non-achievement of level 4.0 upon graduation ranged from 16.3% to 60.2%. Six months prior to graduation, the PPVs across the 6 sub-competencies ranged from 14.6% to 82.9%.
CONCLUSIONS: A significant percentage of VSFs do not achieve the ACGME Vascular Surgery Milestone targets for graduation in the competencies of SBP and PBLI, suggesting a need to improve curricula and assessment strategies in these domains across vascular surgery fellowship programs. Reported milestones levels across all time points are predictive of ultimate achievement upon graduation and should be utilized to provide targeted feedback and individualized learning plans to ensure graduates are prepared to engage in personal and health care system improvement once in unsupervised practice.
PURPOSE: In undergraduate medical education (UME), competency-based medical education has been operationalized through the thirteen Core Entrustable Professional Activities for Entering Residency (Core EPAs). Direct observation in the workplace using rigorous, valid, reliable measures is required to inform summative decisions about graduates' readiness for residency. The purpose of this study is to investigate the validity evidence of two proposed workplace-based entrustment scales.
METHOD: The authors of this multisite, randomized, experimental study used structured vignettes and experienced raters to examine validity evidence of the Ottawa scale and the UME supervisory tool (Chen scale) in 2019. The authors used a series of 8 cases (6 developed de novo) depicting learners at pre-entrustable (less-developed) and entrustable (more-developed) skill levels across 5 Core EPAs. Participants from Core EPA pilot institutions rated learner performance using either the Ottawa or Chen scale. The authors used descriptive statistics and analysis of variance to examine data trends and compare ratings, conducted inter-rater reliability and generalizability studies to evaluate consistency among participants, and performed a content analysis of narrative comments.
RESULTS: Fifty clinician-educators from 10 institutions participated, yielding 579 discrete EPA assessments. Both the Ottawa and Chen scales differentiated between less- and more-developed skill levels (P < .001). The intraclass correlation was good to excellent for all EPAs using the Ottawa scale (range = .68-.91) and fair to excellent using the Chen scale (range = .54-.83). Generalizability analysis revealed substantial variance in ratings attributable to the learner-EPA interaction (59.6% for Ottawa; 48.9% for Chen), suggesting that variability in ratings was appropriately associated with performance on individual EPAs.
CONCLUSIONS: In a structured setting, both the Ottawa and Chen scale distinguished between pre-entrustable and entrustable learners; however, the Ottawa scale demonstrated more desirable characteristics. These findings represent a critical step forward in developing valid, reliable instruments to measure learner progression toward entrustment for the Core EPAs.
BACKGROUND: Objective Structured Clinical Examinations (OSCEs) are used in a variety of high-stakes examinations. The primary goal of this study was to examine factors influencing the variability of assessment scores for mock OSCEs administered to senior anesthesiology residents.
METHODS: Using the American Board of Anesthesiology (ABA) OSCE Content Outline as a blueprint, scenarios were developed for 4 of the ABA skill types: (1) informed consent, (2) treatment options, (3) interpretation of echocardiograms, and (4) application of ultrasonography. Eight residency programs administered these 4 OSCEs to CA3 residents during a 1-day formative session. A global score and checklist items were used for scoring by faculty raters. We used a statistical framework called generalizability theory, or G-theory, to estimate the sources of variation (or facets), and to estimate the reliability (ie, reproducibility) of the OSCE performance scores. Reliability provides a metric on the consistency or reproducibility of learner performance as measured through the assessment.
RESULTS: Of the 115 total eligible senior residents, 99 participated in the OSCE; the remaining residents were unavailable. Overall, residents correctly performed 84% (standard deviation [SD] 16%, range 38%-100%) of the 36 total checklist items for the 4 OSCEs. On global scoring, the pass rate for the informed consent station was 71%, for treatment options was 97%, for interpretation of echocardiograms was 66%, and for application of ultrasonography was 72%. The estimate of reliability expressing the reproducibility of examinee rankings equaled 0.56 (95% confidence interval [CI], 0.49-0.63), which is reasonable for normative assessments that aim to compare a resident's performance relative to other residents, because over half of the observed variation in total scores is due to variation in examinee ability. Phi coefficient reliability of 0.42 (95% CI, 0.35-0.50) indicates that criterion-based judgments (eg, pass-fail status) cannot be made. Phi expresses the absolute consistency of a score and reflects how closely the assessment is likely to reproduce an examinee's final score. Overall, the greatest (14.6%) variance was due to the person by item by station interaction (3-way interaction), indicating that specific residents did well on some items but poorly on others. The variance (11.2%) due to residency programs across case items suggests moderate variability in resident performance during the OSCEs among residency programs.
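The two reliability coefficients reported above arise directly from the variance components estimated in a generalizability analysis: the relative (G) coefficient uses only error that affects examinee rankings, while the Phi coefficient also charges item difficulty against the score, which is why Phi is always the smaller of the two. The sketch below illustrates the standard G-theory formulas for a persons-crossed-with-items design; the variance components used are hypothetical round numbers for illustration only, not the study's actual estimates.

```python
# Illustrative G-theory reliability sketch (hypothetical variance
# components, NOT the study's actual estimates).
# Design: persons (p) crossed with items (i), n_i items per examinee.

def g_coefficient(var_p: float, var_pi: float, n_i: int) -> float:
    """Relative (norm-referenced) reliability: consistency of rankings.
    Only person-by-item interaction error affects relative standing."""
    relative_error = var_pi / n_i
    return var_p / (var_p + relative_error)

def phi_coefficient(var_p: float, var_i: float, var_pi: float, n_i: int) -> float:
    """Absolute (criterion-referenced) reliability: pass/fail decisions.
    Item-difficulty variance also counts as error for absolute scores."""
    absolute_error = var_i / n_i + var_pi / n_i
    return var_p / (var_p + absolute_error)

# Hypothetical variance components for a 36-item assessment.
var_person, var_item, var_person_item = 0.02, 0.10, 0.40
n_items = 36

g = g_coefficient(var_person, var_person_item, n_items)
phi = phi_coefficient(var_person, var_item, var_person_item, n_items)
print(f"G = {g:.2f}, Phi = {phi:.2f}")  # Phi <= G, since absolute error adds var_i/n_i
```

With these hypothetical inputs the pattern mirrors the study's finding: a G coefficient adequate for ranking examinees can coexist with a Phi coefficient too low to support pass-fail judgments.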
CONCLUSIONS: Since many residency programs aim to develop their own mock OSCEs, this study provides evidence that it is possible for programs to create a meaningful mock OSCE experience that is statistically reliable for separating resident performance.
The philosophy of science is concerned with what science is, its conceptual framing and underlying logic, and its ability to generate meaningful and useful knowledge. To that end, concepts such as ontology (what exists and in what way), epistemology (the knowledge we use or generate), and axiology (the value of things) are important if somewhat neglected topics in health professions education scholarship. In an attempt to address this gap, Academic Medicine has published a series of Invited Commentaries on topics in the philosophy of science germane to health professions educational science. This Invited Commentary concludes the Philosophy of Science series by providing a summary of the key concepts that were elucidated over the course of the series, highlighting the intent of the series and the principles of ontology, epistemology, axiology, and methodology. The authors conclude the series with a discussion of the benefits and challenges of cross-paradigmatic research.
Entrustable Professional Activities (EPAs) describe the core tasks health professionals must be competent to perform before promotion and/or moving into unsupervised practice. When used for learner assessment, they serve as gateways to increased responsibility and autonomy. It follows that identifying and describing EPAs is a high-stakes form of work analysis aiming to describe the core work of a profession. However, hasty creation and adoption of EPAs without rigorous attention to content threatens the quality of judgments subsequently made using EPA-based assessment tools. There is a clear need for approaches to identify validity evidence for EPAs themselves prior to their deployment in workplace-based assessment. For EPAs to realize their potential in health professions education, they must first be constructed to accurately reflect the work of that profession or specialty. If the EPAs fail to do so, they cannot predict a graduate's readiness for, or future performance in, professional practice. Evaluating the methods used for identification, description, and adoption of EPAs through a construct validity lens gives leaders and stakeholders of EPA development confidence that the EPAs constructed are, in fact, an accurate representation of the profession's work. Application of a construct validity lens to EPA development affects all five commonly followed steps in EPA development: selection of experts; identification of candidate EPAs; iterative revision; evaluation of proposed EPAs; and formal adoption of EPAs into curricula. It allows curricular developers to avoid pitfalls, bias, and common mistakes. Further, construct validity evidence for EPA development provides assurance that the EPAs adopted are appropriate for use in workplace-based assessment and entrustment decision making.