Getting the right data  

The first step is getting the right data to train the system. Your training data is the data fed into the system so that it can learn the patterns of pixels in the images you want it to classify.

Online, through resources such as ImageNet, you can download huge databases of images to train these kinds of systems. If you want your training data to create a system which is effective in the real world, this dataset needs to be representative. This means that the data in the training set needs to reflect the complexities and nuances of the real-world phenomena which the system is designed to classify.  

For LUMS, the challenge was that many available images of fire and smoke were taken from people’s phones at ground level, but the cameras the team were deploying were mounted on towers. This gave the cameras a completely different perspective from the existing ground-level images. As such, training the system on ground-level fire images would not have produced a system suited to the real-world use case they were exploring.

The challenge of unrepresentative datasets is common when applying AI in international development. By its nature, data only provides a resemblance to the real-world systems it is curated to represent. The real world is messy, complex, and full of variation. As Michel explains, this means that the data which trains AI systems needs to be “closely aligned with the purpose for which the AI system is being developed” such that it has the same properties as the real-world phenomena it models (Michel, 2023, p.17).

In international development contexts, this challenge is particularly acute. As Sekara et al. note, many of the international datasets collating information about global health and livelihoods, which could be useful in an international development context, are curated for decision-making, not for training AI. This means that a lot of work must be done to make them processable by an algorithm. Moreover, these datasets are spread across multiple national and international organisations with complicated licensing rules, which limits access (Sekara et al., 2023).

The challenge is not insurmountable – in many cases, there may be the option to collect more data or to adapt existing datasets to make them better suited to the use case you’re developing. The solution the pilot team found was to look for images drawn not from real life but from videogames. Red Dead Redemption and Grand Theft Auto allow players to create forest fires and to position a camera view of those fires similar to that of the cameras LUMS was deploying in real life. As such, they could use these games to gather a large number of images, similar enough to real-life fires, to effectively train the computer vision system to classify images and detect fires.

This is known as creating a synthetic dataset. It’s a common strategy for addressing the challenge of an unrepresentative dataset. You can find out more about synthetic data by reading Synthetic data for speed, security and scale (Lucini, 2021).
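To make the idea of a synthetic dataset concrete, here is a minimal sketch in Python. It is not the team’s pipeline: instead of capturing game footage, it generates toy “camera frames” as arrays, painting a bright patch into half of them to stand in for fire, and pairs each frame with a label. The point is only to show how synthetic generation gives you as many labelled examples as you need.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_frame(with_fire: bool, size: int = 32) -> np.ndarray:
    """Generate a toy greyscale 'camera frame'; optionally paint a bright patch standing in for fire."""
    frame = rng.uniform(0.0, 0.3, (size, size))      # dim forest background
    if with_fire:
        r, c = rng.integers(4, size - 4, 2)          # random location for the "fire"
        frame[r - 3:r + 3, c - 3:c + 3] += 0.6       # bright blob, clearly above background
    return np.clip(frame, 0.0, 1.0)

# Build a balanced synthetic training set with labels (1 = fire, 0 = no fire).
X = np.stack([synthetic_frame(i % 2 == 1) for i in range(200)])
y = np.array([i % 2 for i in range(200)])
```

In the real pilot the same principle applies, except the frames come from game engines that render fire far more faithfully than a bright square, and the labels come from how each scene was set up.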

Training the system

With an initial dataset, the team set about training the system. They used an off-the-shelf convolutional neural network (CNN) called YOLO (You Only Look Once), which achieves state-of-the-art results on many computer vision tasks. Artificial neural networks (ANNs) are processing systems consisting of multiple nodes, organised in layers, which learn from input data to output a desired value. Each node performs a certain function on its input, producing an output which is then passed through subsequent layers to the final output layer. Think of an ANN as a very complicated algorithm whose multiple steps are encoded in the structure of the nodes, taking input data and producing a desired output. Deep learning refers to a neural network which has multiple hidden layers between the input and the output. CNNs are a specific type of ANN used mostly for computer vision tasks (O'Shea and Nash, 2015).
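To give a feel for what those layers of nodes actually compute, here is a heavily simplified sketch of one convolutional layer in plain NumPy. It is nothing like the scale of YOLO: a single hand-written kernel stands in for weights that a real CNN would learn, and the final “fire score” is just an average rather than a trained output layer.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a kernel over the image; each output node applies the same weights to one patch."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

relu = lambda x: np.maximum(x, 0)          # node activation: keep positive responses only
image = np.random.rand(8, 8)               # toy greyscale input
edge_kernel = np.array([[1.0, -1.0]])      # in a real CNN, these weights would be learned
hidden = relu(conv2d(image, edge_kernel))  # one hidden layer of feature-detecting nodes
score = hidden.mean()                      # crude stand-in for an output layer
```

A real CNN stacks many such layers, each with many learned kernels, so that early layers detect edges and textures while later layers respond to whole objects such as flames or smoke plumes.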

However, as we’ve explained previously, there is still a challenge around tailoring the algorithm to the unique requirements of the real-world use case. Tailoring YOLO to a specific use case involves feeding the model your training set and using it to adjust the model’s parameters so that it is better suited to the use case you’re exploring.

This fine-tuned version of YOLO allowed the LUMS team to identify both the presence of fire in an image and the specific region of the image containing it.
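The principle behind fine-tuning can be sketched with a toy model. In this illustration (which is a simplification: fine-tuning YOLO in practice typically updates many of the network’s layers, not just one) a fixed random “backbone” stands in for a pretrained feature extractor, and gradient descent adjusts only a small task-specific head on new labelled data.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Stand-in for a pretrained backbone: a fixed feature extractor whose weights stay frozen.
backbone = 0.2 * rng.normal(size=(16, 4))
extract = lambda x: np.tanh(x @ backbone)

def loss(w, X, y):
    """Cross-entropy of the task head's predictions."""
    p = sigmoid(extract(X) @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fine_tune(w, X, y, lr=0.5, epochs=200):
    """Gradient descent on the task head only; the backbone never changes."""
    feats = extract(X)
    for _ in range(epochs):
        p = sigmoid(feats @ w)
        w = w - lr * feats.T @ (p - y) / len(y)
    return w

X = rng.normal(size=(100, 16))
y = (X[:, 0] > 0).astype(float)   # toy label standing in for "contains fire"
w0 = np.zeros(4)
w1 = fine_tune(w0, X, y)          # loss on the new task falls as the head adapts
```

The key idea the sketch preserves is that the expensive general-purpose representation is reused, and the new training data only has to teach the model the specifics of the task at hand.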

Deploying, testing, and adjusting to the real world

Fine-tuning is not a one-off step, but a continuous process of gradual improvements to a model over time. After deploying the cameras in the forest, the team found that several challenges were still confounding the system. Edge cases are instances where an AI system is presented with a distinction or detection challenge which it hasn’t encountered in training (iMerit, 2022). This is a common issue with self-driving cars, where the computer vision system doesn’t know how to identify a strange object on the road and cannot determine how to act in response to it. In the context of the pilot, the team identified the following challenges:

During sunset, the computer vision system misidentified the orange sky as fire

Fog and haze were often misidentified as smoke 

During relatively dry periods, the system would misidentify dusty and barren terrain as smoke

The system had not been trained to make nuanced distinctions such as that between morning fog and smoke from a fire. As such, it needed to be refined to handle the real-world distinctions between different features of the environment. To do this, the team retrained their model on each edge case. They curated datasets with extensive images of fires alongside sunsets, smoke alongside haze, and smoke alongside dusty terrain, and fed these into the algorithm. Through this retraining, the parameters of the model were adjusted, and the system became better able to distinguish fires and smoke from other features in the environment.

Future directions

Since the breakthrough in deep learning in 2012, it has been common practice to use models trained on ImageNet, adapted as appropriate. In recent years, however, large language models (LLMs), and specifically large multimodal models (LMMs), have shown better performance on a wide variety of tasks compared with classical ImageNet-based methods. The team is now harnessing LLMs to further improve the reliability of the system (we’ll provide a technical introduction to LLMs later in the module).

For example, the team is employing LLMs to improve the binary classification of images into ‘normal’ or ‘containing smoke or fire’. At the initial classification step, three LLM-based binary classifications are performed to improve accuracy, such that only images classified as containing fire or smoke are passed to the object detector. Both the classification and the localisation information are then updated on the dashboard. Even if the fire or smoke is not localised, the classification is still recorded on the dashboard as an event of interest. This has helped reduce false events to almost zero during the daytime.
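The flow of that screening cascade can be sketched as follows. This is a hypothetical illustration, not the team’s code: the classifiers, detector, and dashboard are all stubs, and the majority-vote rule for combining the three classifications is an assumption on our part (the source says only that three classifications are performed).

```python
# Hypothetical sketch of the screening cascade: three independent binary
# classifiers vote on each frame, and only frames judged to contain fire or
# smoke are passed on to the (more expensive) object detector.
def cascade(frame, classifiers, detector, dashboard):
    votes = [clf(frame) for clf in classifiers]   # each stub returns True/False
    if sum(votes) < 2:                            # assumed policy: majority vote
        return                                    # normal frame: nothing to report
    event = {"frame": frame, "alert": "fire/smoke"}
    boxes = detector(frame)                       # may fail to localise the fire
    if boxes:
        event["boxes"] = boxes                    # add localisation when available
    dashboard.append(event)                       # event logged either way

# Toy run with stub components standing in for the LLMs and the detector.
classifiers = [lambda f: "smoke" in f,
               lambda f: "smoke" in f or "haze" in f,
               lambda f: "smoke" in f]
detector = lambda f: [(10, 10, 50, 50)] if "smoke" in f else []
dashboard = []
cascade("frame with smoke", classifiers, detector, dashboard)
cascade("clear frame", classifiers, detector, dashboard)
```

Gating the detector behind cheap classifications is a common design: most frames are normal, so the expensive localisation step runs only on the small fraction of frames likely to contain an event.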


Keep reading to discover more about some of the non-technical challenges this team faced.