The technical challenges behind building an LLM platform for international development

The first of a three-part blog by Olivier Mills of Baobab Tech, a Frontier Tech Implementing Partner

Pilot: Using LLMs as a tool for International Development professionals


In the context of the "LLM for International Development" pilot described by Seb Mhatre in this blog, we want to delve deeper into our technical learnings and share them to foster discussion and build a community of practice.

This account of our journey building an LLM application for international development is written for a technical audience familiar with the basics of building LLM-based applications. You might be a developer, or interested in testing a similar prototype on a related topic in the development and humanitarian sector - if so, please also consider joining the community of practice we are setting up. Follow this series on the FT blog, and get in touch with the team by contacting jenny.prosser@dt-global.com.

Our technical insights will be presented in three parts.

A front page screenshot of the International Development Knowledge Assistant

Part 1: Technical environment, our approach and data preparation

Overview

International development organizations publish and fund tens of thousands of documents annually, including strategy papers, policy documents, program reports, evaluations, and research. While this transparency is crucial for accountability and collaboration, a significant challenge emerges: most development professionals can only read a fraction of the potentially relevant documents due to time constraints.

Our pilot aims to address this issue by creating a practical tool for development professionals. By leveraging Large Language Models (LLMs), we're working to make it easier for users to access and synthesize knowledge from the vast array of published documents. The goal is to support more informed decision-making by providing quick insights into ongoing activities, past outcomes, stakeholder involvement, and lessons learned.

However, working with datasets like the International Aid Transparency Initiative (IATI) presents unique challenges. Traditional data management methods struggle with the necessarily complex data structures and fail to provide contextualized reasoning or synthesis. Users often need to export the data and work with it separately, outside these tools, to gain specific sub-thematic or geographic insights. Our pilot aims to overcome these limitations by harnessing the power of LLMs to provide more nuanced and accessible analysis of development data.

1. The technical environment

Data scientists vs. app developers

The LLM revolution has sparked an intriguing collision between the worlds of data science and application development. Since the release of GPT-3.5's API, we've witnessed Machine Learning experts venturing into application development, while web and app developers rush to grasp NLP and LLM concepts. This cross-pollination has led to some growing pains.

Web developers, primarily versed in JavaScript/TypeScript, are scrambling to learn Python. Conversely, ML experts face the challenge of translating their Jupyter/Colab notebooks into production-ready applications. This learning curve has driven the popularity of tools like Streamlit and Gradio in the Python ecosystem. Meanwhile, frameworks like LangChain and LlamaIndex have expanded to include JavaScript/TypeScript versions, catering to the dominant app developer ecosystem.

LLMs as application components

Integrating LLMs into applications requires a significant mental shift. These models introduce a semi- to non-deterministic element into the application stack, often behaving like a black box that demands hours of prompt engineering and rethinking of data flows. While tools like DSPy (from Stanford NLP) attempt to address this complexity, developers still grapple with how LLMs affect various aspects of their applications, from data flows and UI state to inference speed and cost.
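
To make the non-determinism concrete, one common defensive pattern is to treat the LLM as an unreliable component whose output is validated against a schema and retried a bounded number of times. The sketch below is illustrative, not our production code - `call_llm` stands in for any chat-completion client, and the `ExtractedMetadata` schema is hypothetical:

```python
# Illustrative pattern: wrap a non-deterministic LLM call in schema
# validation and bounded retries. `call_llm` is a placeholder for any
# chat-completion client; the ExtractedMetadata schema is hypothetical.
import json
from typing import Callable
from pydantic import BaseModel, ValidationError

class ExtractedMetadata(BaseModel):
    sectors: list[str]
    countries: list[str]

def extract_with_retries(
    call_llm: Callable[[str], str], prompt: str, max_retries: int = 3
) -> ExtractedMetadata:
    for _ in range(max_retries):
        raw = call_llm(prompt)  # may return malformed or off-schema text
        try:
            return ExtractedMetadata(**json.loads(raw))
        except (json.JSONDecodeError, ValidationError, TypeError):
            prompt += "\nReturn ONLY valid JSON with keys: sectors, countries."
    raise RuntimeError("LLM output never matched the expected schema")
```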

LLM application frameworks

To help us navigate this new landscape, frameworks like LangChain and LlamaIndex have emerged. These tools aim to simplify the process of building LLM-powered applications in production environments. More recently, Vercel has entered the game with their AI SDK, leveraging their experience with Next.js and serverless infrastructure to appeal to JavaScript developers.

Building the plane while flying it

Developing LLM applications during these early days often feels like constructing an aircraft mid-flight while simultaneously swapping out its engines. The rapid pace of innovation means that just as we perfect one approach, a new breakthrough renders it obsolete. For instance, OpenAI's introduction of JSON mode made much of the complex prompt engineering previously needed for structured outputs unnecessary, while growing context windows (like Claude's 200k tokens) and nimble embedding models open up new possibilities for application design and chunking techniques.
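
For example, where getting reliable structured output once meant elaborate prompting plus post-hoc repair, JSON mode reduces it to a single request parameter. A small sketch using the openai Python client (the model name and prompts are illustrative):

```python
# Sketch: requesting syntactically valid JSON via OpenAI's JSON mode.
# Model name and prompts are illustrative.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # any model that supports JSON mode
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": 'Extract the sectors mentioned, as JSON: {"sectors": [...]}'},
        {"role": "user", "content": "A programme supporting primary education and WASH in Malawi."},
    ],
)
print(response.choices[0].message.content)  # e.g. {"sectors": ["Education", "WASH"]}
```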

As of June 2024, many key tools in the LLM ecosystem are still in early versions (LangChain v0.2, LlamaIndex v0.3), and using cutting-edge features often requires enabling experimental flags. This constant flux means we have to frequently rewrite and adapt our application stack.

The distractions: noise + hype

Amidst genuine advancements, the LLM space is rife with hype and noise. Claims of new models "outperforming" established ones, or of RAG recipes that cure all ills, are commonplace. We had to maintain a critical eye and thoroughly test new techniques before implementation, as the reality often falls short of the grandiose claims made by influencers and AI startup founders.

Generative UI

A promising development in LLM application design is Vercel's implementation of generative UI - yes, UI as in User Interface, not generative AI. This approach uses LLMs to generate data (e.g., JSON objects) that dynamically create UI components, such as follow-up question checklists or information cards. This capability goes beyond simply displaying LLM-generated data, allowing for the creation of highly customized and responsive user interfaces.

Unlike traditional hard-coded, rule-based UIs, generative UI offers an opportunity for more dynamic and session-contextualized user interfacing. It enables the presentation of information through flexible components tailored to each user's specific session, topic, and stage, rather than relying on rigid dashboard layouts. While still in its early stages, this technology is poised to revolutionize how we build flexible, AI-driven applications, opening up new possibilities for enhanced functionality and user experience.
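
The gist, reduced to a sketch: instead of prose, the model emits a typed payload and the frontend maps each payload type to a component. The payload shapes below are purely illustrative, not Vercel's actual format:

```python
# Illustrative payloads only: the LLM returns typed data rather than prose,
# and the frontend maps "type" to a rendered component.
followup_payload = {
    "type": "followup_questions",  # rendered as a checklist of clickable questions
    "questions": [
        "Which countries should the comparison cover?",
        "Do you want completed or active programmes?",
    ],
}
card_payload = {
    "type": "info_card",  # rendered as a summary card
    "title": "Education programmes in Malawi",
    "fields": {"active_programmes": 12, "total_budget": "GBP 45m"},  # invented values
}
```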

2. Our approach

In constructing our LLM-powered application for international development, we're guided by a utility-first principle, ensuring every aspect is tailored to meet the specific needs of international development professionals.

Our focus is on building for utility and specific use-cases, recognizing that a simple chatbot won't suffice for the complex needs in this field. We're also building with scale in mind, carefully considering the costs of LLM inference to avoid skyrocketing expenses as we scale.

Our system incorporates Language Models at various stages:

  1. Embedding generation for dense vector search (see the sketch after this list)

  2. Summarization, entity extraction, and metadata generation for our LLM-friendly version of IATI data

  3. Live inference for contextualized summaries or extracts specific to user queries
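
As an illustration of the first stage, a minimal embedding sketch - the model named here (all-MiniLM-L6-v2 via sentence-transformers) is purely illustrative, not necessarily what we deploy:

```python
# Sketch: generate dense vectors for activity descriptions so they can be
# stored in a vector index. The model choice is illustrative.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
descriptions = [
    "Strengthening primary health care systems in Malawi.",
    "Climate-resilient agriculture support for smallholder farmers.",
]
embeddings = model.encode(descriptions, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384) for this model
```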

Our development process focuses on three key areas:

  1. Data preparation: optimizing our dataset for LLM processing

  2. Retrieval techniques: implementing advanced methods for quick and accurate information retrieval

  3. User Interface: designing an intuitive interface that presents complex information effectively

Throughout this process, we maintain a user-centric approach, continually asking: Who is using this application, and for what purpose? This guides our decision-making at every step, ensuring we're building a tool that truly serves the needs of international development professionals.

3. Data aggregation and preparation

The foundation of our (and hopefully any other!) LLM-powered application lies in effective data aggregation and preparation. We've focused our initial efforts on a subset of IATI data, namely FCDO data, but we're building with scalability in mind. Our framework is designed to eventually encompass all IATI data and potentially expand to other international development open datasets like the OECD CRS and USAID DEC.

3.1 Collecting and analyzing IATI data

Working with IATI datasets presented several challenges:

  1. The data structure isn't LLM-friendly, often using codes instead of labels, spread across multiple complex schemas.

  2. The datasets are vast, containing a significant amount of non-essential information.

  3. Data quality is inconsistent, with varying levels of detail in descriptions, sector information, and associated documents - largely because adding data to IATI is an administrative burden with limited perceived value (we hope to change that).

  4. Metadata limitations, particularly in geographic and thematic components, often led to incomplete information.

For data collection, we used post-processed IATI data sources such as the DevTracker website (https://devtracker.fcdo.gov.uk/) and the fcdo.iati.cloud API.
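
As a rough illustration of the collection step (the endpoint path, parameters, and response shape below are placeholders, not the documented fcdo.iati.cloud interface):

```python
# Rough sketch of paging through activity records. The URL, parameters and
# response shape are hypothetical placeholders, not the real API surface.
import requests

BASE_URL = "https://fcdo.iati.cloud/api/activities"  # placeholder path

def fetch_activities(page: int = 1, page_size: int = 100) -> list[dict]:
    resp = requests.get(
        BASE_URL, params={"page": page, "page_size": page_size}, timeout=30
    )
    resp.raise_for_status()
    return resp.json().get("results", [])
```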

3.2 Data structuring for retrieval

Given the limitations of direct IATI API usage for live inference, we developed our own streamlined datasets with augmented metadata. This approach is optimized for various retrieval tasks and includes:

  1. A schema encompassing activities (programmes) and documents, including a chunked-document table (chunktable).

  2. Metadata extraction and standardization, focusing on essential IATI data augmented with geographic information.

  3. Conversion of codes to human-readable labels for partner roles, status codes, and country codes, improving keyword and vector searches (sketched below).
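
Item 3 is mostly lookup-table work against the published IATI codelists. A trimmed sketch (only a few codes shown; the field names on the record are illustrative):

```python
# Convert IATI codes to human-readable labels before indexing, so keyword
# and vector searches match words users actually type. The mappings are a
# small subset of the published IATI codelists.
ACTIVITY_STATUS = {
    "1": "Pipeline/identification",
    "2": "Implementation",
    "3": "Finalisation",
    "4": "Closed",
    "5": "Cancelled",
    "6": "Suspended",
}
ORGANISATION_ROLE = {"1": "Funding", "2": "Accountable", "3": "Extending", "4": "Implementing"}

def humanize(activity: dict) -> dict:
    """Replace code fields with label fields on a raw activity record."""
    activity["status"] = ACTIVITY_STATUS.get(str(activity.pop("activity_status", "")), "Unknown")
    for org in activity.get("participating_orgs", []):  # field name illustrative
        org["role"] = ORGANISATION_ROLE.get(str(org.get("role", "")), "Unknown")
    return activity
```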

Our database schema optimized for RAG and LLM application use

Here's a sample of the metadata field we generated for a single activity/programme in the activities table. For reference, this activity record can be found in full on its devTracker page here, in its d-portal version here, and as JSON closer to the original IATI data here:
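
To give a feel for the shape of these records, the example below is entirely invented - field names and values are placeholders illustrating the structure, not the real record linked above:

```python
# Invented example showing the general shape of our simplified metadata;
# not the actual record for the activity linked above.
activity_metadata = {
    "iati_identifier": "GB-GOV-1-XXXXXX",  # placeholder identifier
    "title": "Example programme title",
    "status": "Implementation",            # human-readable label, not a code
    "countries": ["Malawi"],
    "sectors": ["Basic education"],
    "reporting_org": "FCDO",
    "partners": [{"name": "Example NGO", "role": "Implementing"}],
    "summary": "A one-paragraph, LLM-generated summary of the activity...",
}
```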

As you can see, we simplified it and made it more human- and language-model-friendly compared with the original records linked above.

3.3 Document processing

Our document processing pipeline handles various formats, including PDFs and ODTs. We implemented section-aware chunking for ODTs, leveraging their XML structure, and used paged chunking for PDFs (which constitute less than 10% of our documents).
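
A condensed sketch of the ODT side, using only the standard library (an .odt file is a zip archive whose content.xml marks headings as text:h and paragraphs as text:p); this is illustrative rather than our full pipeline:

```python
# Sketch: section-aware chunking of an ODT file by grouping paragraphs
# under the most recent heading in content.xml.
import zipfile
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

def chunk_odt_by_section(path: str) -> list[dict]:
    with zipfile.ZipFile(path) as z:
        root = ET.fromstring(z.read("content.xml"))
    chunks, current = [], {"heading": "Preamble", "text": []}
    for el in root.iter():
        text = "".join(el.itertext()).strip()
        if el.tag == f"{{{TEXT_NS}}}h" and text:    # a new section starts
            if current["text"]:
                chunks.append(current)
            current = {"heading": text, "text": []}
        elif el.tag == f"{{{TEXT_NS}}}p" and text:  # paragraph in current section
            current["text"].append(text)
    if current["text"]:
        chunks.append(current)
    return [{"heading": c["heading"], "text": "\n".join(c["text"])} for c in chunks]
```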

To generate comprehensive summaries, we employed an 8-billion-parameter model to summarize each section, then created a summary of these summaries. This approach provides a more complete overview of each document.
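
In code terms this is a simple map-reduce; the sketch below assumes a summarize helper wrapping whatever small model is available (the helper, like the model choice, is a placeholder):

```python
# Map-reduce summarization: summarize each section, then summarize the
# summaries. `summarize` is a placeholder wrapping any small LLM call.
from typing import Callable

def summarize_document(sections: list[str], summarize: Callable[[str], str]) -> str:
    section_summaries = [summarize(s) for s in sections]  # map step
    return summarize("\n\n".join(section_summaries))      # reduce step
```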

Crucially, we've established a relational database that connects documents to their respective activity/programme IDs. This structure enables complex document searches that can return related programmes, or alternatively, retrieve all documents (or their summaries) for a specific programme. This flexibility supports deep-dive queries and analysis that users might need to perform.
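
With documents keyed by their activity/programme ID, both query directions reduce to simple joins. A sqlite sketch with hypothetical table and column names:

```python
# Sketch of the two retrieval directions; table and column names are
# hypothetical, and the identifier is a placeholder.
import sqlite3

conn = sqlite3.connect("iati.db")

# 1) All document summaries for a specific programme:
docs = conn.execute(
    "SELECT title, summary FROM documents WHERE activity_id = ?",
    ("GB-GOV-1-XXXXXX",),
).fetchall()

# 2) Programmes related to documents matching a keyword:
programmes = conn.execute(
    """SELECT DISTINCT a.id, a.title
       FROM activities a JOIN documents d ON d.activity_id = a.id
       WHERE d.summary LIKE ?""",
    ("%climate resilience%",),
).fetchall()
```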

We can’t emphasize enough how critical the data structuring, augmentation, and preparation stage is in ensuring that our LLM can efficiently and accurately retrieve and process information, ultimately providing valuable insights to international development professionals.

Stay tuned for the next part, coming soon, on retrieval systems, model choice, and prompt engineering.


If you’d like to dig in further…

🚀 Explore this pilot’s profile page

📚 Learn more about the idea behind the pilot here

📚 Read part two of the blog here

📚 Read part three of the blog here
