Assessing the use case
The need for a nuanced understanding of domain-specific language guided the process through which DevelopMetrics built each use case for their tool. Working with ITR, they co-developed a framework for understanding the work the department did, exploring questions like:
What are the most important concepts to our work and how do we define them?
What are the relationships between these concepts, and how does that relate to what we’re trying to achieve?
What’s the best way to categorise different aspects of our work?
This co-creation process draws together everyone's technical expertise to build something reflective and inclusive of different views and perspectives. As such, the framework gave the team a solid foundation from which to build a tool that was sensitive to the department's work, and to the way it used key concepts.
Additionally, the record of this process gave them a useful resource for evaluating the tool and educating users on the best strategies for using it. For example, if users felt the tool wasn't capturing certain aspects of what they meant by a concept like digital development, they could return to the framework and consider why. This gives the tool a level of transparency, because the details of how it performs can be traced back to a source.
Data collection
The value of a tool like this is that it allows organisations to understand the information stored in their databases. The data collection process involves consulting those organisations to understand which databases are most relevant to the use case, and how they want those resources to be pulled together. For many organisations, understanding the information in just one of their databases can be incredibly useful for guiding their work. Triangulating that data with other sources adds another level of richness to the information the tool provides.
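As a toy illustration of this kind of triangulation, the sketch below joins programme records from one database with evaluation findings from another, using a shared programme ID. All data, names, and fields are invented for illustration; none come from DevelopMetrics.

```python
# Illustrative triangulation: enrich programme records from one database
# with evaluation findings from another, joined on a shared programme ID.
# Both datasets and all field names are invented for this sketch.
programmes = {
    "PRG-001": {"name": "Digital literacy training", "country": "Ghana"},
    "PRG-002": {"name": "Open data portal", "country": "Nepal"},
}
evaluations = {
    "PRG-001": {"outcome": "Completion rates rose well above baseline."},
}

def triangulate(programmes, evaluations):
    """Combine programme metadata with any matching evaluation findings."""
    combined = {}
    for prog_id, meta in programmes.items():
        # Merge the two records; programmes without an evaluation keep
        # their metadata only.
        combined[prog_id] = {**meta, **evaluations.get(prog_id, {})}
    return combined

print(triangulate(programmes, evaluations)["PRG-001"])
```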
Data processing and training
It is useful to distinguish between the training phase for DevelopMetrics' base model, DEELM, and the way the team trains the tool for specific use cases.
The process of building DEELM involved five years of consultation with experts at the UN and USAID, and a great deal of manual tagging. The team went through development reports and manually tagged specific sections of those reports with relevant concepts. For example, they would look through a report on the humanitarian sector and highlight the sections of the document which related to resilience. They would then engage sector experts and academics to verify that the tagging process accurately captured the nuances of what the text was discussing. This tagging was then peer-reviewed by other experts to ensure it was comprehensive.
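A minimal sketch of what one such labelled record might look like, assuming a simple dictionary-style schema. The field names here are our invention, not DevelopMetrics' actual format.

```python
# Illustrative only: a hypothetical record structure for one tagged passage.
# DevelopMetrics' actual schema is not public; all field names are assumptions.
tagged_passage = {
    "report_id": "hum-2021-014",          # hypothetical report identifier
    "section_text": (
        "Community savings groups allowed households to absorb the "
        "shock of the flood without selling productive assets."
    ),
    "tags": ["resilience"],                # concept(s) assigned by the tagger
    "tagged_by": "analyst_07",             # original tagger
    "verified_by": ["sector_expert_02"],   # expert verification step
    "peer_reviewed": True,                 # peer-review step described above
}
```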
This is a supervised learning approach which, like many of the examples we've explored in this report, involves training an AI system on labelled data so that it learns the patterns linking that data to the categories used to classify it.
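For readers unfamiliar with the technique, here is a generic supervised-learning sketch using scikit-learn to train a small multi-label text classifier on expert-tagged passages. It illustrates the general approach only; it is not DEELM, whose architecture and training pipeline are not described here.

```python
# Generic supervised-learning sketch (not DEELM itself): train a simple
# multi-label classifier on expert-tagged passages.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy labelled data: each passage carries one or more concept tags.
passages = [
    "Savings groups helped households absorb the flood shock.",
    "The ministry published open budget data for all districts.",
    "Drought-tolerant seed varieties stabilised harvest yields.",
]
labels = [["resilience"], ["governance"], ["food security", "resilience"]]

# Convert tag lists into a binary indicator matrix for multi-label training.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

# TF-IDF features feed a one-vs-rest logistic regression per concept.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(passages, y)

# Predict concept tags for an unseen passage.
pred = model.predict(["Cash transfers cushioned the drought's impact."])
print(mlb.inverse_transform(pred))
```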
The challenge was ensuring DEELM could map relationships between the original training data and its generic categories (e.g., resilience, governance, food security), and the new, use-case-specific categories designed for a particular project. By fine-tuning DEELM’s ability to discern and apply these relationships, the team ensured the model could accurately classify and retrieve the most relevant insights within specific development contexts.
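One simplified way to picture this mapping step is a lookup from generic categories to project-specific ones. In practice the relationships are learned during fine-tuning rather than hand-written, and all the category names below are hypothetical.

```python
# Hypothetical mapping from generic categories to the use-case-specific
# categories of a digital development project. A real fine-tuning run
# would learn these relationships from data; this table only illustrates
# the idea of relating one category scheme to another.
GENERIC_TO_PROJECT = {
    "resilience":    ["digital early-warning systems"],
    "governance":    ["e-government services", "digital ID"],
    "food security": ["agri-tech platforms"],
}

def project_categories(generic_tags: list[str]) -> list[str]:
    """Translate generic tags into project-specific categories."""
    mapped = []
    for tag in generic_tags:
        mapped.extend(GENERIC_TO_PROJECT.get(tag, []))
    return sorted(set(mapped))

print(project_categories(["governance", "resilience"]))
# ['digital ID', 'digital early-warning systems', 'e-government services']
```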
Building the model
The main challenge is getting the first phases right. With a representative, well-curated dataset grounded in a clear conceptual framework, it is much easier to build the other parts of the solution. Having done this with ITR's programme data, the team were able to design an interface through which users could search the information categorised by the LLM, and to add further functions, like the auto-generation of briefs on specific topics.
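As a rough sketch of the kind of search such an interface might support over LLM-categorised records, consider the following; the schema, data, and function names are invented for illustration.

```python
# Toy search over LLM-categorised records; the schema is hypothetical.
records = [
    {"text": "Mobile money expanded rural credit access.",
     "categories": ["digital financial inclusion"], "country": "Kenya"},
    {"text": "A national digital ID scheme streamlined benefit delivery.",
     "categories": ["digital ID"], "country": "India"},
]

def search(records, category=None, keyword=None):
    """Filter categorised records by category and/or free-text keyword."""
    results = records
    if category:
        results = [r for r in results if category in r["categories"]]
    if keyword:
        results = [r for r in results if keyword.lower() in r["text"].lower()]
    return results

print(search(records, category="digital ID"))
```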
Testing and iterating
One of the use cases developed for ITR was assessing the success of different interventions. With limited budgets, it is incredibly important that resources in international development are used to sustainably impact communities for the better.
Testing for efficacy therefore demands a high level of rigour, to ensure that the tool accurately captures the expertise of the professionals who helped create it. Testing an LLM-driven tool presents unique challenges in this context: if a user asks "Where has blockchain had the greatest impact in Southeast Asia?", there is no single correct answer to benchmark against. Instead, testing must focus on:
Identifying blind spots in how the tool retrieves and processes information.
Evaluating biases in how the model ranks or prioritises certain types of evidence.
Ensuring alignment with how experts in the field interpret and categorise interventions.
To test the tool, the team gave it to digital development specialists and had them ask it various questions, collected qualitative feedback on the accuracy and relevance of its responses, and used the experts' insights to refine the tool's data sources, retrieval processes, and weighting mechanisms.
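One plausible way to structure and aggregate that qualitative feedback is a simple scoring rubric, as sketched below. The dimensions, scores, and notes are hypothetical, not DevelopMetrics' actual evaluation instrument.

```python
# Hypothetical rubric for expert feedback on individual tool responses.
# Scores run 1-5; the dimensions mirror the testing focuses listed above.
feedback = [
    {"question": "Where has blockchain had the greatest impact in Southeast Asia?",
     "expert": "specialist_A", "accuracy": 4, "relevance": 3,
     "notes": "Missed two well-known regional pilots."},
    {"question": "Where has blockchain had the greatest impact in Southeast Asia?",
     "expert": "specialist_B", "accuracy": 3, "relevance": 4,
     "notes": "Over-weights donor reports relative to local evaluations."},
]

def mean_scores(feedback, dimensions=("accuracy", "relevance")):
    """Average expert ratings per dimension to spot weak areas."""
    return {d: sum(f[d] for f in feedback) / len(feedback) for d in dimensions}

print(mean_scores(feedback))  # {'accuracy': 3.5, 'relevance': 3.5}
```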
For example, suppose the tool overemphasised technological aspects of innovation while neglecting social and behavioural dimensions. The team might then take several steps, the last of which is sketched after this list:
Identify gaps in training data, particularly in reports focusing on human-centred innovation.
Incorporate new datasets that provide richer context on behavioural and systemic change.
Adjust the training process to better capture holistic innovation approaches.
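A toy version of that last adjustment might re-weight evidence categories at retrieval time, boosting the dimensions the tool was neglecting. All the weights, scores, and field names here are invented for illustration.

```python
# Toy re-weighting: boost evidence categories the tool was neglecting.
# Base scores and category weights are invented for this sketch.
CATEGORY_WEIGHTS = {
    "technological": 1.0,   # previously over-emphasised
    "behavioural":   1.4,   # boosted after expert feedback
    "systemic":      1.3,
}

def rerank(results):
    """Re-rank retrieved evidence by weighted relevance score."""
    return sorted(
        results,
        key=lambda r: r["score"] * CATEGORY_WEIGHTS.get(r["category"], 1.0),
        reverse=True,
    )

results = [
    {"doc": "IoT sensor pilot", "category": "technological", "score": 0.80},
    {"doc": "Community adoption study", "category": "behavioural", "score": 0.62},
]
print([r["doc"] for r in rerank(results)])
# ['Community adoption study', 'IoT sensor pilot'] -- 0.62 * 1.4 > 0.80
```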
“One challenge with the testing process in this context was that there was no specific answer they expected the tool to give”
Impact
Through these iterations, the team developed a tool which provided ITR with a contextually relevant, information-rich overview of their work in digital development. This overview saved an immense amount of time in finding and preparing evidence documents, such as briefs to Congress. With that time saved, the different teams within ITR could attend to the higher-level task of evaluating how to use that evidence to inform their work, and better meet the challenge of building technology use cases which improve people's lives.