Integrating Technology into Evaluation
Linda Raftree, BA in Anthropology
Founder
MERL Tech, New York, United States
Kerry Bruce, DrPH (she/her/hers)
CEO
North End Consulting, United States
Paul Jasper, PhD (he/him/his)
Senior Consultant - Research & Evaluation
Oxford Policy Management, United Kingdom
Stephanie Coker, MPA; BA in Political Science and Government
Senior Manager, Strategic Learning and Evaluation
CVS Health
New York, New York, United States
Location: Room 104
Abstract Information: The session will posit that AI language models bring incredible opportunities to scale qualitative analysis and tell powerful, representative stories of social change, but that bias, the opacity of their coding, and the rules governing AI software and tools make them imperfect at best and dangerous at worst. Before moving into breakout groups, discussants will provide a brief overview of the basics of the newest AI language models and how they work, and offer advice and cautions to evaluators on applying these models in their work. Discussants will orient session participants to popular Large Language Model (LLM) tools (e.g., ChatGPT, Bard, LLaMa) and relevant applications of AI language models in the field of monitoring, evaluation, and learning (e.g., sentiment analysis, theme identification and aggregation in qualitative data). After brief remarks from the discussants, we will draw on attendees’ insights and experiences to reflect on the risks and opportunities of applying AI language models to evaluation practice and evidence-informed storytelling. Breakout groups will cover key questions such as ‘What are the best uses for LLMs in evaluation?’, ‘What are the key challenges with LLMs for evaluation?’, and ‘How can evaluators address the ethical issues of LLMs?’. Groups will share key insights back in plenary before hearing final remarks from the discussants. Much of the session will draw on the early lessons and explorations of a cross-regional, multidisciplinary Community of Practice of more than 200 members working to understand the needs, opportunities, and risks of emerging types of machine learning and AI language models in evaluation practice.
Relevance Statement: Hundreds of thousands of evaluation reports are written every year to document and review change and causal relationships across development, humanitarian, and human rights programs and projects. A typical evaluation report runs 35 to 80 pages, with dozens more pages of appendices, and still often reflects only a fraction of the data and insights gathered through the evaluation’s surveys, desk research, interviews, and focus groups. Evaluators often spend a huge share of their time processing, cleaning, and managing collected data, which cuts into the time and budget available for translating findings into insights, prioritizing use of those findings, and telling powerful stories for change. Time and resource constraints also tend to limit whose voices and reflections end up included and highlighted in analyses. Until recently, most of the tools needed to make better use of this data have been expensive, clunky, and difficult for the average practitioner to use. With the emergence of AI language models, and especially the launch of several consumer-facing tools, these capabilities are becoming increasingly accessible; yet few formal efforts have taken shape to equip evaluators with the new skills and capacities needed to deploy AI language models in service of evaluations that advance social change. This session will focus on early insights from evaluators and data scientists who are working together to advance the ethical application of AI language models for monitoring, evaluation, and learning.