Note: this post is still being updated
Our objective was to generate a building fire risk score for every address in Baton Rouge, LA. Starting with a database listing every address in Baton Rouge, we matched addresses with urban planning records, tax assessor’s records, police department crime reports, and fire department incident reports. We used only datasets publicly available through city and parish open data portals. Although matching data from one dataset to another presented challenges, we were able to generate risk scores for 86% of known addresses (over 120,000 in total).
We constructed our dataset such that fires from the years 2012-2015 were visible to the algorithm as it learned to identify patterns in the property-level data. We reserved 2016 fires to test how well its predictions would perform. We designed the algorithm to assign “highest risk” to the top 2% of its predictions (around 2,500 addresses) and “elevated risk” to the top 20% (around 25,000 addresses). When we tested these predictions against actual 2016 fires, 19% of fires occurred at addresses assigned a “highest risk” prediction and 63% occurred at addresses assigned an “elevated risk” prediction. These predictions, therefore, performed far better than random chance.
Our approach is modeled on an Atlanta Fire and Rescue Department project that generated risk scores for all known commercial properties in the City of Atlanta. To our knowledge, ours is the first published project to generate risk scores for residential properties. We have demonstrated that algorithmic fire prediction does not require extensive resources: ours used existing data (collected by virtually every city) and free software, and amounted to a few weeks’ work.
Our future work points in two directions: 1) improving the accuracy of predictions by incorporating additional property-level datasets; and 2) working with cities to determine how these predictions can best contribute to their operations. We consider fire departments to be important users, but not the only ones: other agencies, such as building inspection departments, may be well positioned to monitor high-risk homes.
The datasets and variables used for this analysis appear in the table below. These were all available through the city’s official open data portal.
Our objective was to analyze risk for every address in Baton Rouge. We started with the full listing of street addresses (n=140,625), and attempted to merge each subsequent dataset to this master list. The property information database includes entries for each municipal lot, which is a different unit of analysis than a street address. Some addresses contain multiple lots, while some lots contain multiple addresses. For each address in the master list, we generated variables summarizing the number of addresses sharing the same lot and the number of lots sharing the same address. We decided to drop the 14% of street addresses that did not match entries in the property information database, on the assumption that these were likely to contain many locations where no building existed.
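This lot-to-address bookkeeping can be sketched in pandas. The DataFrames and column names below are hypothetical stand-ins for the Baton Rouge datasets, not the actual schemas:

```python
import pandas as pd

# Hypothetical stand-ins for the master address list and the lot-level property database.
addresses = pd.DataFrame({"address": ["100 Main St", "102 Main St", "200 Oak Ave", "300 Elm St"]})
lots = pd.DataFrame({
    "lot_id":  ["A1", "A1", "B7", "B8", "C3"],
    "address": ["100 Main St", "102 Main St", "200 Oak Ave", "200 Oak Ave", "999 Gone Rd"],
})

# Count lot entries per address, and distinct addresses per lot.
lots_per_address = (lots.groupby("address").size()
                    .rename("n_lots_at_address").reset_index())
addresses_per_lot = lots.groupby("lot_id")["address"].nunique().rename("n_addresses_at_lot")
lots = lots.join(addresses_per_lot, on="lot_id")

merged = addresses.merge(lots_per_address, on="address", how="left")
merged = merged.merge(lots.groupby("address")["n_addresses_at_lot"].max().reset_index(),
                      on="address", how="left")

# Drop addresses with no entry in the lot database (assumed to be non-buildings).
matched = merged.dropna(subset=["n_lots_at_address"])
```

A left merge keeps unmatched addresses as NaN rather than discarding them, which is what makes it possible to identify, and then drop, the addresses with no lot entry.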
We performed similar merges with datasets for the tax roll, crime reports, and fire incidents, without dropping addresses that did not find matches. For addresses with multiple tax roll entries, we took the sum of the relevant value, assuming these were addresses containing multiple tax parcels. For the 36% of addresses without a corresponding entry in the tax roll, we modeled the missing market value and building improvement value using a Random Forest model (10 estimators, no maximum depth, maximum features = sqrt(n_features)). In 10-fold cross-validation on known values, its mean r-squared was around 0.4, significantly better than simpler imputation methods but far from perfect.
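As a rough sketch of this imputation step with scikit-learn: the hyperparameters match those described above, but the features and values here are synthetic, not the post’s actual data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in: predict market value from other property-level features.
n = 500
X = rng.normal(size=(n, 4))  # e.g. lot size, location, use-category codes
y = 50_000 + 20_000 * X[:, 0] + 10_000 * X[:, 1] + rng.normal(scale=15_000, size=n)

known = rng.random(n) > 0.36  # ~36% of values treated as missing, as in the post
model = RandomForestRegressor(
    n_estimators=10, max_depth=None, max_features="sqrt", random_state=0
)

# Cross-validate on the known values, then fit and impute the missing ones.
scores = cross_val_score(model, X[known], y[known], cv=10, scoring="r2")
model.fit(X[known], y[known])
imputed = model.predict(X[~known])
```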
We included past crime in our model, under the belief that a property with a history of crime incidents might be more likely to experience a fire. We assigned each address a score of “1” if the crime database contained at least one entry at the same address. These indicators were coded by time period to ensure that only prior crimes factored into fire predictions.
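A minimal sketch of these leakage-safe crime flags, again with made-up addresses and years:

```python
import pandas as pd

# Hypothetical crime reports with a year column.
crimes = pd.DataFrame({
    "address": ["100 Main St", "100 Main St", "200 Oak Ave", "300 Elm St"],
    "year":    [2011, 2014, 2016, 2013],
})
addresses = pd.DataFrame(
    {"address": ["100 Main St", "200 Oak Ave", "300 Elm St", "400 Pine St"]}
)

# Only crimes before the 2016 test period may inform predictions.
prior = crimes[crimes["year"] <= 2015]
addresses["crime_2011"] = addresses["address"].isin(
    prior.loc[prior["year"] == 2011, "address"]).astype(int)
addresses["crime_2012_2015"] = addresses["address"].isin(
    prior.loc[prior["year"].between(2012, 2015), "address"]).astype(int)
```

Filtering on the year before building the indicator is what keeps a 2016 crime (like the one at the fictional 200 Oak Ave above) from leaking into a 2016 fire prediction.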
| Variable | Details |
| --- | --- |
| U.S. Census block group land area | Derived from lat/long |
| Businesses: any registered at lot | Yes/no |
| Current use category | |
| Future use category | |
| Design level category | |
| No. of lot database entries at address | Dropped 19,588 addresses without a corresponding lot entry |
| No. of addresses at same lot entry | |
| Fair market value | |
| Value of improvements | |
| No. of tax parcel entries at address | |
| 1+ crime occurrence, 2011 | Yes/no |
| 1+ crime occurrence, 2012-2015 | Yes/no |
| 1+ fire occurrence, 2012-2015 | Yes/no |
| 1+ fire occurrence, 2016 | Yes/no; reserved for testing |
To generate predictions, we tested three model types: logistic regression, Random Forest, and gradient-boosted trees, using the Python scikit-learn package. For Random Forest and gradient-boosted trees, we tuned the parameters of each estimator using a grid search with 5-fold cross-validation. We trained the models on 2012-2015 fire incidents, while reserving 2016 incidents for out-of-sample testing.
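The tuning setup can be approximated with scikit-learn’s GridSearchCV. The parameter grids below are illustrative (chosen to bracket the winning values reported below), and the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the property-level training data; fires are rare,
# so the positive class is heavily imbalanced.
X_train, y_train = make_classification(
    n_samples=2000, n_features=10, weights=[0.95], random_state=0
)

searches = {
    "random_forest": GridSearchCV(
        RandomForestClassifier(random_state=0),
        {"max_depth": [4, 8], "n_estimators": [50, 150]},
        cv=5, scoring="roc_auc",
    ),
    "boosted_trees": GridSearchCV(
        GradientBoostingClassifier(random_state=0),
        {"learning_rate": [0.1, 0.5], "n_estimators": [30, 60]},
        cv=5, scoring="roc_auc",
    ),
}
for name, search in searches.items():
    search.fit(X_train, y_train)  # 5-fold CV over every parameter combination
```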
While the best gradient-boosted trees model (learning rate = 0.5, 60 estimators) performed slightly better by area under the curve (AUC) in 5-fold cross-validation than the best Random Forest model (maximum depth = 8, 150 estimators), Random Forest performed better in our 2016 sample, with an AUC of 0.81 versus 0.79 for boosted trees:
As discussed above, at a threshold with a false positive rate (FPR) of 2%, the model correctly predicted 18.5% of addresses with a 2016 fire; at an FPR of 20%, it correctly predicted 63% of addresses with a fire.
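Recall at a fixed FPR can be read directly off the ROC curve. A sketch with simulated labels and scores (the real model’s outputs are not reproduced here):

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)

# Hypothetical held-out labels (rare fires) and model scores correlated with them.
y_true = (rng.random(1000) < 0.05).astype(int)
scores = y_true * 0.4 + rng.random(1000)

fpr, tpr, thresholds = roc_curve(y_true, scores)

def recall_at_fpr(target_fpr):
    """Share of actual fires caught when flagging addresses at a given false positive rate."""
    idx = np.searchsorted(fpr, target_fpr, side="right") - 1  # last point with fpr <= target
    return tpr[idx]

highest_risk_recall = recall_at_fpr(0.02)   # "highest risk" threshold
elevated_risk_recall = recall_at_fpr(0.20)  # "elevated risk" threshold
```

Because the ROC curve is monotone, loosening the threshold (a higher allowed FPR) can only catch more of the actual fires, which is why the elevated-risk tier covers a larger share of 2016 fires than the highest-risk tier.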
The most important variables were latitude and longitude, market value and value of improvements, and lot acreage:
The Atlanta Firebird model found that its most important features were associated with the density of people present at a given property. We suspected that our model might simply have learned to identify larger properties. However, our high-risk estimates were not concentrated solely at the higher end of these variables; the model predicted many fires at the lower end as well:
The high importance of latitude and longitude suggests that the model homed in on spatial patterns in the data. Additionally, the model used ALAND (the square mileage of the Census block group associated with the property), apparently a proxy for neighborhood population density.
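Importance rankings like these come from a fitted forest’s `feature_importances_` attribute (mean decrease in impurity averaged over the trees). A sketch with synthetic data and the feature names mentioned above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative feature names; the real model used the full variable table above.
feature_names = [
    "latitude", "longitude", "market_value", "improvement_value", "lot_acreage",
]
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3, random_state=0)

model = RandomForestClassifier(n_estimators=150, max_depth=8, random_state=0).fit(X, y)

# Rank features by their share of the forest's total impurity reduction.
ranked = sorted(zip(feature_names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
```

One caveat worth noting: impurity-based importances can overstate high-cardinality continuous features such as coordinates, so permutation importance is a useful cross-check.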
More coming here…