Cost-effective monitoring with active learning

EAI worked together with MIT to create a groundbreaking active learning approach that saves money while improving air quality

Overview

Air quality, health, and ultimately, the quality of life, are directly connected. With the rapid growth of urban populations worldwide, concern for global health is increasing too.

The World Health Organization lists air pollution as a major environmental risk to health, as it has a direct relationship to the incidence of diseases such as lung cancer, and chronic and acute respiratory diseases.

Recent advances in sensor technology and the internet of things have made the deployment of large-scale sensor networks possible, but not yet easy.

When it comes to measuring air quality, sensors are either expensive or less accurate, and when it comes to IoT, networking needs special care in order to minimize reliability issues.

MIT has developed a unique distributed sensor network to measure air quality across MIT’s campus. The aim was to deploy a cost-effective solution that leverages active learning and Gaussian processes to optimize the placement of sensors in locations that maximize information gain while minimizing costs.

Air quality data for two sensors

Boston,
United States

We have worked together with MIT to find a solution that would require fewer sensors but still achieve satisfactory air quality data collection using machine learning, and more specifically, active learning, together with moveable sensors.

Rather than rely only on fixed sensors, we started from the assumption that the distribution of air quality is not random: it depends on events, seasonality, and weekly and daily schedules, and readings from sensors adjacent to each other are highly correlated.

Based on this assumption, we designed a system that requires fewer sensors to produce the same results while also providing recommendations for the permanent or temporary placement of new sensors, and where these should be located.

Our approach demonstrates the potential of active learning and sensor optimization in environmental sensing applications and has significant implications for future research and monitoring efforts.

Challenges

Deploying sensor networks to measure air quality is a complex and challenging task that requires careful consideration of various factors. One of the major challenges is making sure that the data collected is accurate and useful. Things like missing data, wrong IDs, and noisy signals can make it hard to get reliable and actionable insights.

The cost of putting up high-quality sensors is also a big problem. It can be too expensive to buy sensors that can measure pollutants accurately, which makes it hard to set up large networks of sensors in many places. This can make the spatial resolution and accuracy of the data collected much worse, making it harder to learn anything useful from the data.

Also, when setting up sensor networks, it's important to think carefully about where to place the sensors. The readings from the sensors can be affected by things like the way the wind blows, the shape of the land, and pollution sources close by. To get accurate and reliable data, it is important to make sure that the sensors are put in the best places that accurately reflect the air quality of the surrounding areas.

To deal with these problems, you need a multifaceted plan that takes into account things like sensor quality, placement, and data processing methods. It is possible to solve these problems and get accurate and reliable information about air quality by coming up with new ways to use advanced technologies and methods.

Testing our assumptions

Visualizing the air quality data

After visualizing the data for the first two sensors, we could immediately see that their readings were correlated. This is good news: a high correlation could mean that one of these sensors is unnecessary if we can use the other sensor, or a combination of other sensors, to predict its readings.

The following image also shows several gaps of missing data (notice the white areas along the x-axis, which shows the date).

Air quality data for two sensors

Multivariate Analysis

Multivariate analysis is a statistical technique used to analyze multiple variables simultaneously. In this project, it is important because it allows us to examine the relationships between different environmental factors and their impact on air quality.

Correlations

We started by plotting a correlation matrix, which showed a high correlation between different sensors. This is good news, as correlation between sensors means that we can very likely reduce the number of required sensors.

Air quality correlation
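As an illustration of this step, here is a minimal sketch of how a correlation matrix can flag sensors that may be redundant. The data is synthetic, and the sensor names, the 0.8 threshold, and the `redundant_pairs` helper are assumptions made for the example, not details of the MIT deployment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for two weeks of hourly PM2.5 readings (illustrative only):
# two sensors share a daily cycle, the third is unrelated noise.
hours = np.arange(24 * 14)
shared = 12 + 5 * np.sin(2 * np.pi * hours / 24)
readings = {
    "sensor_a": shared + rng.normal(0, 1.0, hours.size),
    "sensor_b": shared + rng.normal(0, 1.0, hours.size),
    "sensor_c": rng.normal(12, 5.0, hours.size),
}

names = list(readings)
data = np.vstack([readings[n] for n in names])  # shape: (sensors, samples)
corr = np.corrcoef(data)                        # Pearson correlation matrix


def redundant_pairs(corr, names, threshold=0.8):
    """Return sensor pairs whose correlation is at or above the threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if corr[i, j] >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


pairs = redundant_pairs(corr, names)
for first, second, value in pairs:
    print(f"{first} and {second}: r = {value:.2f}")
```

Note that `np.corrcoef` does not skip missing values, so gaps like the ones visible in the real data would need to be dropped or filled before this calculation.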

Autocorrelations

Autocorrelation measures how strongly a time series correlates with lagged copies of itself, and autoregression builds on it by modeling the relationship between an observation and a number of lagged observations. Both are important for this project because they help us find patterns and cycles in the air quality data that happen over time.

By modeling the relationship between past and present air quality measurements, we can make predictions about future air quality levels and identify potential sources of pollution. Autoregression can also be used to detect anomalies or sudden changes in air quality that may require further investigation.

In short, autoregression is a powerful tool for finding patterns and trends in air quality data and for making good decisions about how to manage the environment. If there are cyclic patterns in the data, a specific sensor's data might be easier to predict, which might help us reduce the number of samples needed to model it.

Air quality autocorrelation
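To make the idea of cyclic patterns concrete, the sketch below computes the sample autocorrelation of a synthetic hourly series and looks for the lag with the strongest repeat. The series, the `autocorrelation` helper, and the lag range searched are assumptions for illustration only.

```python
import numpy as np


def autocorrelation(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[: x.size - k], x[k:]) / denom for k in range(max_lag + 1)]
    )


rng = np.random.default_rng(1)

# Three weeks of synthetic hourly readings with a daily (24-hour) cycle.
hours = np.arange(24 * 21)
series = 10 + 4 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.5, hours.size)

acf = autocorrelation(series, max_lag=48)

# Skip short lags, where neighbouring samples are trivially similar,
# and search lags 12..36 for the strongest repeat.
best_lag = 12 + int(np.argmax(acf[12:37]))
print(f"strongest cycle near lag {best_lag} hours")
```

For a series with a daily cycle, the strongest lag should land at or very near 24 hours. A clear peak like this suggests the series is predictable from its own history.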

Cross-correlations

Cross-correlation is a mathematical technique used to measure the similarity between two sets of data, often used to identify patterns and relationships between them. 

We used cross-correlation in this project to compare the data from different air quality sensors around the MIT campus. By calculating the cross-correlation between the data collected by different sensors, we can identify if the air quality patterns observed in one area are also observed in another area. 

This allows us to identify if there are time lags between air quality patterns observed in different sensor locations, which can be used to infer the direction of air pollution transport.

Cross-correlation can also help us figure out if the sources of air pollution are spread out evenly or if they are mostly in certain places.

Overall, cross-correlation plays a key role in helping us to understand the complex spatial and temporal patterns of air pollution around the MIT campus.

Air quality crosscorrelations
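The time-lag idea can be sketched as follows: correlate one sensor's series against shifted copies of another and pick the shift with the highest correlation. The signals here are synthetic, and the function names, the 3-sample delay, and the lag window are hypothetical choices for the example.

```python
import numpy as np


def lagged_correlation(a, b, lag):
    """Pearson correlation between a[t] and b[t + lag]."""
    if lag >= 0:
        x, y = a[: a.size - lag], b[lag:]
    else:
        x, y = a[-lag:], b[: b.size + lag]
    return float(np.corrcoef(x, y)[0, 1])


def best_lag(a, b, max_lag):
    """Return the lag in [-max_lag, max_lag] with the highest correlation."""
    scores = {lag: lagged_correlation(a, b, lag) for lag in range(-max_lag, max_lag + 1)}
    return max(scores, key=scores.get), scores


rng = np.random.default_rng(2)
n, true_delay = 500, 3

# A smoothed random signal; the downstream sensor sees it true_delay samples later.
base = np.convolve(rng.normal(size=n + true_delay), np.ones(5) / 5, mode="same")
upstream = base[true_delay:]
downstream = base[:n] + rng.normal(0, 0.05, n)

lag, scores = best_lag(upstream, downstream, max_lag=10)
print(f"downstream trails upstream by {lag} samples (r = {scores[lag]:.2f})")
```

A positive lag means the second series trails the first, which is the kind of evidence that can hint at the direction of pollution transport between two locations.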

Solution

The deployed network of IoT sensors measured PM2.5 and PM10 particles, CO, and NO. The network was built from Raspberry Pis and sensors, which were connected to the campus Wi-Fi.

Data was collected, but because the sensor data was noisy, we had to apply software noise reduction algorithms to clean it up.
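The specific noise reduction algorithms are not described here. As one common option, assumed purely for illustration, a centered rolling median removes isolated spikes while tolerating short gaps. The window size and the sample readings below are made up.

```python
import numpy as np


def rolling_median(values, window=5):
    """Centered rolling median that ignores NaN gaps inside each window."""
    values = np.asarray(values, dtype=float)
    half = window // 2
    out = np.empty_like(values)
    for i in range(values.size):
        chunk = values[max(0, i - half): i + half + 1]
        # A window made up only of gaps stays a gap.
        out[i] = np.nan if np.all(np.isnan(chunk)) else np.nanmedian(chunk)
    return out


# Illustrative readings with one spike (95.0) and one missing sample (NaN).
raw = np.array([12.0, 12.5, 11.8, 95.0, 12.2, np.nan, 12.4, 12.1, 12.6])
clean = rolling_median(raw, window=5)
print(np.round(clean, 2))
```

One side effect of this choice is that a single missing sample is filled from its neighbours. Whether that is desirable depends on how the cleaned data is used downstream.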

To optimize sensor placement, we employed active learning techniques. This allowed us to know where a new sensor would be needed and for how long it would be needed in order to have satisfactory predictions of the air quality in that location.

Impact

Our air quality system had a significant impact on the number of sensors necessary to measure air quality around the campus. By accurately measuring air quality, we were able to provide valuable information that could help people make informed decisions about their health and well-being. For example, if the air quality in a particular area was poor, people could avoid that area or take measures to protect themselves. This data was provided on a real-time map hosted at clairity.mit.edu, which is unfortunately now offline.

The system also provided researchers with a wealth of data that could be used to understand the patterns of air quality around the campus. This information could be used to develop policies and interventions to improve air quality and protect public health.

Furthermore, the system we developed using cheaper sensors and active learning techniques could be replicated in other cities and communities, helping to address the global problem of air pollution. Ultimately, by using innovative technologies and methodologies, we were able to provide a cost-effective and reliable solution that had a positive impact on public health and well-being.

Gaussian Processes and Active Learning

Active learning played a critical role in the success of our air quality system.

In order to identify where a sensor would be needed, we would predict the air quality for a number of possible areas using Gaussian processes.

Using Gaussian processes for the predictions gave us not only the predicted value but also the confidence in that prediction.

This confidence metric is key for the active learning system, as that is what guides it and allows us to reduce the number of required sensors.

If an area is predicted with high confidence, it’s unlikely that a new sensor is needed there. However, if that same area is predicted with low confidence, the system would probably benefit from the installation of a sensor there.

As the system got more information, it would update its predictions and confidence levels and make any necessary changes to where the sensors were placed.

This is what made the system cheaper to maintain while still giving accurate information about the air quality.
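The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the deployed system: the Gaussian process is written out in NumPy with a squared-exponential kernel and fixed hyperparameters, the campus is reduced to a hypothetical 5×5 grid, `true_field` is an invented pollution surface used only to simulate readings, and "low confidence" is taken to mean the highest predictive standard deviation.

```python
import numpy as np


def rbf_kernel(a, b, length_scale):
    """Squared-exponential kernel between two sets of 2-D points."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)


def gp_predict(x_train, y_train, x_query, length_scale=1.5, noise=1e-2):
    """Posterior mean and standard deviation of a GP with a constant mean."""
    y_mean = y_train.mean()
    k_tt = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    k_tq = rbf_kernel(x_train, x_query, length_scale)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train - y_mean))
    mean = y_mean + k_tq.T @ alpha
    v = np.linalg.solve(chol, k_tq)
    var = 1.0 - np.sum(v**2, axis=0)  # prior variance is 1 for this kernel
    return mean, np.sqrt(np.clip(var, 0.0, None))


def true_field(points):
    """Hypothetical pollution surface, used only to simulate sensor readings."""
    return 10 + 3 * np.sin(points[:, 0]) + 2 * np.cos(points[:, 1])


# Candidate locations: a 5x5 grid standing in for the campus.
grid = np.array([[x, y] for x in range(5) for y in range(5)], dtype=float)

placed = [0, 24]  # start with sensors at two opposite corners
max_std_history = []
for _ in range(6):
    x_train = grid[placed]
    y_train = true_field(x_train)
    mean, std = gp_predict(x_train, y_train, grid)
    max_std_history.append(float(std.max()))
    # Place the next sensor where the prediction is least confident.
    placed.append(int(np.argmax(std)))

print("sensor grid indices:", placed)
print("max predictive std per round:", np.round(max_std_history, 3))
```

With fixed hyperparameters, each added sensor can only lower the predictive uncertainty, so the maximum standard deviation shrinks from round to round. In practice, the loop would stop once that uncertainty is acceptable everywhere, which is how fewer sensors can cover the same area.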

Note: Different query strategies can be used to guide active learning. We used least confidence, but we could have used margin or entropy instead.

Future work

The air quality system that we developed using IoT and active learning has significant potential for future applications in environmental monitoring.

By combining low-cost sensors with active learning techniques, it is possible to create systems that can monitor a wide range of environmental factors, such as water quality, temperature, and noise levels.

Also, these systems could be used to monitor data in a variety of contexts, not just environmental ones: factories (Industry 4.0), homes, office spaces, and so on.

Using low-cost sensors and active learning techniques can make the deployment of sensor networks like this much easier for a range of people and organizations, such as individuals and small businesses.
