Areas of Multnomah County, Oregon grouped into 5 clusters based on patterns of building age. Each building is plotted individually using the colour assigned to its own cluster. (Data from City of Portland Open Data Initiative; background OSM.)
My basic idea was to do something along the following lines:
- Partition the buildings into contiguous, equal-sized groups
- Create some kind of time-series for each group reflecting building history
- Run a clustering algorithm against the latter data
- See if it produced interesting results
In practice the approach I took was as follows:
- Work only with the centroids of buildings (see the sketch after this list)
- Create a Delaunay triangulation of the building centroid nodes
- Convert this into a graph with buildings as nodes and the sides of the Delaunay triangles as edges.
- Partition the graph into equal-sized subgraphs (target size was 500-1000 buildings)
- Create histograms of the number of buildings built in each decade from 1840 to 2010 (decade starting year)
- Run a hierarchical clustering algorithm (Ward's method, chosen at least in part because of the nifty work done by Prof. Dr. Werner Bätzing on populations in Alpine villages (warning: PDF))
- Visualise the results (see main image above).
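The centroid step mentioned in the first item is a one-liner in PostGIS. A minimal sketch, assuming the source buildings sit in a table buildings with primary key uid and polygon geometry geom (all three names are my own illustrations):

create table building_centroids as
select uid, st_centroid(geom) as geom  -- one point node per building
from buildings;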
Delaunay triangulation of 600 buildings from Portland, Oregon Open Data (the large triangles at the edges must be removed by clipping)
The triangulation output indexes the input points from zero, so each centroid first needs a matching zero-based index:

select uid, row_number() over (order by uid) - 1 as rownum
from building_centroids;  -- uid is the primary key of the table
Every value in the Delaunay triangulation can then be unambiguously mapped back to the source data: in the QGIS output these indices appear as the POINTA, POINTB and POINTC attributes of each triangle.
With the original building IDs reassigned to the nodes of each Delaunay triangle, it is then possible to convert this into a straightforward representation as a graph. Firstly, any triangles which are outside the original target area should be removed simply by clipping with the original bounding box (in practice I used Multnomah county to do this, reducing the data size as well; see the sketch after the query below). This ensures that buildings which are not close together are not accidentally linked. For basic graph partitioning we only really need the connections between pairs of nodes, i.e. all unique occurrences of (POINTA, POINTB), (POINTB, POINTC) and (POINTC, POINTA). I extracted these by importing the Delaunay data into PostGIS, dumping the nodes, and applying a window function to the result (no doubt this SQL could be tidied a bit!):
select *
from (select gid, idx, refer ref1, lead(refer, 1) over w ref2
      from (select gid, (dp).path[2] as idx
                 -- pick the building ID held in the column matching this vertex;
                 -- the 4th point closes the triangle's ring, so it maps back to "POINTA"
                 , case (dp).path[2] when 1 then "POINTA"
                                     when 2 then "POINTB"
                                     when 3 then "POINTC"
                                     else "POINTA" end refer
                 , (dp).geom
            from (select *, st_dumppoints(geom) dp
                  from delaunay) a
           ) b
      window w as (partition by gid order by idx)
     ) c
where c.ref2 is not null;  -- lead() pairs consecutive vertices, giving one edge per row
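The clipping step mentioned above can be done in PostGIS before any of this. A minimal sketch, assuming the county outline sits in a table counties with a name column (both names are my own, invented for illustration):

-- drop triangles that are not wholly inside the Multnomah county polygon
delete from delaunay d
using counties c
where c.name = 'Multnomah'
  and not st_within(d.geom, c.geom);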
The data needs to be cleaned to remove duplicate edges (pairs of ref1, ref2): most edges are shared by two triangles and so turn up twice, often in opposite orientations.
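A minimal sketch of this clean-up, assuming the (ref1, ref2) pairs from the query above were saved to a table pbx_edges_raw (a name invented here). Listing every edge from both endpoints before deduplicating also gives the symmetric adjacency that METIS expects:

create table pbx_graph as
select distinct ref1, ref2
from (select ref1, ref2 from pbx_edges_raw
      union all
      select ref2, ref1 from pbx_edges_raw  -- each edge seen from both endpoints
     ) e
where ref1 <> ref2;  -- guard against degenerate self-edges

The cleaned edges can then be aggregated into one row per node, giving a listing of the graph suitable for input to an open-source graph-partitioning package (I used METIS under Cygwin), as follows: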
create table pbx_graph_mitis as
select ref1, array_agg(ref2 order by ref2) as ref2_arr  -- each node with its sorted neighbour list
from pbx_graph
group by ref1;

This data was exported from PostGIS and run through METIS with a simple number-of-partitions parameter. The result file was then pulled back into PostGIS and the partition number added to every building centroid row.
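The METIS graph format itself is simple: a header line giving the vertex and edge counts, then one line per vertex listing its 1-based neighbours, with each edge appearing in both endpoints' lines. A tiny illustrative file for a path graph 1-2-3-4, and the corresponding call (the partition count of 64 is just an example; gpmetis is the METIS 5 name, earlier versions shipped pmetis and kmetis instead):

% 4 vertices, 3 edges
4 3
2
1 3
2 4
3

gpmetis pbx.graph 64    # writes partition numbers, one per vertex, to pbx.graph.part.64

Thereafter it was a simple case of creating histogram data for each partition: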
-- pivot: one row per graph partition, one column of building counts per decade
select part_no
, sum(case when decade_built = 1840 then cnt else 0 end) b1840
, sum(case when decade_built = 1850 then cnt else 0 end) b1850
, sum(case when decade_built = 1860 then cnt else 0 end) b1860
, sum(case when decade_built = 1870 then cnt else 0 end) b1870
, sum(case when decade_built = 1880 then cnt else 0 end) b1880
, sum(case when decade_built = 1890 then cnt else 0 end) b1890
, sum(case when decade_built = 1900 then cnt else 0 end) b1900
, sum(case when decade_built = 1910 then cnt else 0 end) b1910
, sum(case when decade_built = 1920 then cnt else 0 end) b1920
, sum(case when decade_built = 1930 then cnt else 0 end) b1930
, sum(case when decade_built = 1940 then cnt else 0 end) b1940
, sum(case when decade_built = 1950 then cnt else 0 end) b1950
, sum(case when decade_built = 1960 then cnt else 0 end) b1960
, sum(case when decade_built = 1970 then cnt else 0 end) b1970
, sum(case when decade_built = 1980 then cnt else 0 end) b1980
, sum(case when decade_built = 1990 then cnt else 0 end) b1990
, sum(case when decade_built = 2000 then cnt else 0 end) b2000
, sum(case when decade_built = 2010 then cnt else 0 end) b2010
from (
select 10*floor(year_built/10) decade_built, part_no, count(*) cnt
from pbx_partitioned
group by 10*floor(year_built/10), part_no) a
group by part_no
order by 1;
Once again this data was exported as a CSV file.
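One way to do the export from psql. A minimal sketch, assuming the histogram query above has first been wrapped in a view called pbx_histograms (view and file names invented here):

\copy (select * from pbx_histograms) to 'partition_histograms.csv' with csv header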
I used a totally naive approach to doing Ward clustering in R: essentially I followed the cookbook as described here.
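In outline the R side only needs a handful of lines. A minimal sketch, assuming the exported CSV holds part_no in the first column followed by the decade counts (the column handling is my own illustration, not necessarily what the cookbook does):

hists <- read.csv("partition_histograms.csv")
d <- dist(hists[, -1])                   # Euclidean distances on the decade counts, dropping part_no
fit <- hclust(d, method = "ward.D")      # Ward's method ("ward" in older versions of R)
plot(fit)                                # draw the dendrogram
groups <- cutree(fit, k = 5)             # assign each partition to one of 5 clusters
rect.hclust(fit, k = 5, border = "red")  # highlight the 5 clusters on the plot

This produced the clustering diagram below: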
Cluster dendrogram output from R; the five clusters used are highlighted in red
Methodologically this approach is rather flaky; you do not have to be a mathematician to spot various problems. The obvious issues which occur to me are:
- I did not exclude buildings without a building date
- I did include small buildings such as sheds and garages, which are much more likely to be ephemeral than houses and other larger buildings
- Newly built sections will have values of 0 for all prior decades, but I hope to cluster such sections with older sections where the original buildings are largely intact.
- The clustering may just reflect either landuse (residential versus commercial) or the age of an area when construction started; in both cases the clustering would provide only trivial information.
- The choice of clustering method was predicated on finding a workable example rather than on any deep statistical principles.
- The number of clusters chosen (5) was totally arbitrary and influenced by the example I was copying.
- There is no control data to assess whether the clusters reflect anything meaningful (not helped by my complete lack of any feel for the geography of Portland).
Finally, the whole point of looking at persistence is to gain insights into ways in which we can map cities both as they are and as they have been. Having a large, reasonably complete dataset such as Portland's buildings enables us to learn what sort of information is useful, and also how others have solved data problems. In this case two aspects of the data were critical: building start date and building type. It is not important that the start date be accurate to the year; indeed, it is often easy to place a building in a given decade merely by inspection, whereas assigning a year may require detailed research. Building type is important so as to be able to exclude things like garages, and it may have further importance in a more elaborate clustering technique. Both these tags exist and are used in OSM: this analysis suggests that they are important and valuable attributes for buildings.
Ideally I would like to run tree-based classification models for this type of segmentation or clustering, as they lend themselves to better testing, and the rules themselves can often be quickly sense-checked by eye. I'll return to this theme shortly using real OSM data.