Data collection
As we’ve explained previously, one of the most important enablers for developing AI use-cases is the availability of high-quality, relevant data. Many hospitals have specific sources of information that they consider credible and helpful to their patients, produced either in-house or by others. These may include factsheets, images and possibly videos which outline best medical practice in a variety of areas, from eating well to managing specific conditions, as well as links to external sources of data validated by the hospital. Care teams can upload the resources they consider most helpful for informing their patients into their own proprietary knowledge base, which then drives the bespoke conversational interaction with that hospital’s patients.
In addition to this body of information, the hospitals have care plans for each patient. As explained above, these care plans are a unique set of guidelines for how patients should manage their diseases. When a user sets up an account on Avatr, their care plan is loaded into the chatbot, so that its responses draw both on the database of verified medical information and on the specifics of the care plan. The use of this data ensures that the chatbot generates responses which are personalised to the user's needs.
Data pre-processing
How do you integrate these data sources with an LLM, such that it is able to draw on that information to provide useful responses?
To build a quality retrieval-augmented generation (RAG) process, you want to make sure you are retrieving the most relevant information from your data source for the specific user query. You could perform a basic keyword search, which would match words or word stems, but keyword search is limited, especially for larger datasets. Retrieving data that is not relevant to the question can easily confuse the LLM and make it provide an answer that is incorrect or confusing to the user. For example, if the search includes the keyword “bank”, do we mean “river bank” or “financial bank”? The solution to this is to use vectors, more specifically “dense vectors”, so your database would include the text but also the vectorised version of that text. Vectorising or embedding a sentence means translating it into a set of numbers in a multidimensional space that represents its semantic “meaning”, so the phrases “river bank” and “financial bank” would be quite far away from each other in this vector space.
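The idea of distance in this vector space can be made concrete with cosine similarity, the standard measure of how closely two embeddings point in the same direction. The sketch below uses made-up three-dimensional vectors for illustration only; real embedding models produce vectors with hundreds or thousands of dimensions, learned from data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means similar meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings" (illustrative values, not from a real model).
# Imagine the dimensions loosely capturing [water, money, land].
embeddings = {
    "river bank":     [0.9, 0.10, 0.8],
    "financial bank": [0.1, 0.95, 0.2],
    "stream edge":    [0.85, 0.05, 0.7],
}

query = embeddings["river bank"]
near = cosine_similarity(query, embeddings["stream edge"])      # high: related meaning
far = cosine_similarity(query, embeddings["financial bank"])    # low: different meaning
print(near > far)
```

Note that “river bank” and “stream edge” score as close neighbours even though they share no keywords, which is exactly what keyword search misses.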
One of the first steps for the team in developing the solution was to create a vector database to hold the knowledge base (factsheets, images etc.) as embeddings, or vectors. The embeddings allow the system to measure the similarity of a request to different pieces of information in the knowledge base, and pick out the most relevant resources to share with the patient. By restricting that search to a specific body of information, the hospital team can ensure that the information they provide is verified medical guidance.
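A minimal sketch of that retrieval step, under simplifying assumptions: the “database” here is just an in-memory list of snippet–vector pairs with toy three-dimensional embeddings, whereas a production system would embed each factsheet with a real embedding model and store the vectors in a dedicated vector database.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical knowledge base: each entry pairs a verified snippet with a
# precomputed embedding (illustrative 3-d values, not real model output).
knowledge_base = [
    ("Eat five portions of fruit and vegetables a day.", [0.9, 0.1, 0.1]),
    ("Check blood glucose before each meal.",            [0.1, 0.9, 0.2]),
    ("Take a 30-minute walk most days of the week.",     [0.7, 0.2, 0.3]),
]

def retrieve(query_vector, k=2):
    """Return the k snippets whose embeddings are closest to the query."""
    ranked = sorted(knowledge_base,
                    key=lambda item: cosine(query_vector, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

# A question about diet embeds close to the first toy dimension:
results = retrieve([0.95, 0.05, 0.1], k=2)
print(results)
```

Because the search only ranks entries already in the knowledge base, the chatbot can only ever surface content the care team has verified, which is the safety property the passage describes.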
Creating the model
The creation of the model differs from many of the others which we have explored throughout the module, because the development process does not include a fine-tuning step. You might have noticed that we haven’t been referring to the data in the system as training data. The resources such as factsheets and images aren’t being fed into the LLM to train it to better understand the nuances of medical language, in the way that images of sunsets and fires helped the YOLO algorithm distinguish between the two.
Rather, the LLM has been pre-trained to process general language and generate human-like responses, and the retrieval step supplies it with specific information drawn from the knowledge base. As such there is no fine-tuning step, but rather a process of iterating on and refining the retrieval process and prompts to make sure the system is producing information-rich, context-relevant responses to user requests that reflect the knowledge a well-trained doctor would provide.
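One part of that prompt-refinement loop is how the retrieved snippets and the patient's care plan are assembled into a single prompt for the pre-trained LLM. The sketch below is purely illustrative: the function name, wording and structure are assumptions, as the actual Avatr prompt design is not described here.

```python
def build_prompt(question, care_plan, retrieved_snippets):
    """Combine verified guidance and the patient's care plan into one LLM prompt.

    Hypothetical example of a RAG prompt template; a real system would iterate
    on this wording to reduce off-topic or unsupported answers.
    """
    context = "\n".join(f"- {s}" for s in retrieved_snippets)
    return (
        "You are a patient-support assistant. Answer ONLY using the verified "
        "guidance and care plan below. If the answer is not covered, say so "
        "and suggest contacting the care team.\n\n"
        f"Verified guidance:\n{context}\n\n"
        f"Patient care plan:\n{care_plan}\n\n"
        f"Patient question: {question}\n"
    )

prompt = build_prompt(
    question="How often should I check my blood sugar?",
    care_plan="Type 2 diabetes: check glucose before meals; metformin twice daily.",
    retrieved_snippets=["Check blood glucose before each meal."],
)
print(prompt)
```

Iterating here means adjusting the instructions, the amount of retrieved context and the fallback behaviour, then checking the generated answers against what a clinician would say, rather than retraining any model weights.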
Impact
Initial testing of Avatr technology has shown its potential to support doctors by providing them with data on how people are interacting with their care plans, giving doctors a more informed basis for interventions to encourage adherence. In resource-constrained hospitals, where doctors are overstretched, these resources can help them meet the needs of their patients and deliver faster, better care. For patients, initial testing has already shown signs of reducing readmission rates: access to high-quality medical information has empowered outpatients to better follow their care plans and manage their diseases on their own terms.
Avatr GenAI aims to deliver even higher levels of adherence, leading to health and economic improvements. The Avatr system generates granular data at personal and population levels to measure and evaluate adherence to care plans. Aligned with the Health Belief Model, impact measures include changes in beliefs and behaviours leading to health and economic improvements. The evidence informs further iterative development of the Avatr GenAI system, specifically optimisation of the embedding, retrieval and generation steps. There will also be various complexities in transitioning from a development environment to a production environment suitable for scaling up the solution.