Wednesday, 24 July 2013

Persistence in the Urban Environment : 2 Portland Oregon Buildings and experimental clustering

Areas of the Mutlnomah County, Oregon grouped into 5 clusters based on patterns of building age.
Each building is plotted individually using the colour assigned to its own cluster.
(data from City of Portland Open Data Initiative, background OSM)
A few months ago I wrote about how different parts of cities have radically different development patterns. I illustrated this with examples from three cities which I am familiar with, but all my interpretation was subjective. I was therefore very interested in a blog post by MapBox about the Open Data building footprints for the metro area of Portland, Oregon. Apart from the fact that this is a comprehensive data set the most interesting thing is that most (around 87%) of the buildings have a data associated with them. It therefore seemed like a perfect data set to see if one could classify areas of a city in terms of the profile of building activity.

My basic idea was to do something along the following lines:
  • Partition the buildings into contiguous equal sized groups
  • Create some kind of time-series for each group reflecting building history
  • Run a clustering algorithm against the latter data
  • See if it produced interesting results

In practice the approach I took was as follows:
  • Work only with centroids of buildings
  • Create a Delaunay triangulation of the building centroid nodes
  • Convert this into a graph with buildings as nodes and the sides of the Delaunay triangles as edges.
  • Partition the graph into equal sized subgraphs (target size was 500-1000 buildings)
  • Create histograms of numbers of buildings built in each decade from 1840 to 2010 (starting year)
  • Run a hierarchical clustering algoirthm (Ward's method chosen at least in part because of the nifty work done by Prof. Dr. Werner Bätzing on populations in Alpine villages (warning PDF))
  • Visualise the results (see main image above).
There were a number of technical issues arising in this approach, most of which concerned steps 2 and 3. Unfortunately Delaunay triangulation is not yet available in PostGIS, so I had to carry it out in Quantum GIS. The latter step worked fine (although it took a long time), but the triangular polygons produced did not have any references to the original IDs of the building centroids, instead everything related to the row-number of the data pulled into PostGIS. I had not realised the problem at the time & it was quite a bit of work to re-build these references. The approach works but a few additional steps would have saved me a lot of grief.

Delaunay triangulation of 600 buildings
from Portland Oregon Open Data.
(the large triangles at the edges must be removed by clipping)
 Firstly, as a SQL query does not guarantee row order it is important to sort the data when pulling it into QGIS. This ensures that row order is likely to be constant if the step is repeated. Using the DB Manager tool enables one to do this. A further protection is to number the result set rows in a manner consistent with QGIS (first row is 0). I have done this by adding to my standard select statement the additional column:

row_number()-1 (over order by uid) as rownum
# uid is the primary key of table

Every value in the Delaunay triangulation can then be unambiguously back to the source data. From QGIS these are POINTA, POINTB, and POINTC in the result attributes.

With the Delaunay triangles with original building IDs reassigned to the nodes of the triangle it is then possible to convert this into a straight representation as a graph. Firstly any triangles which are outside the original target area should be removed simply by clipping with the original bounding box (in practice I used Multnomah county to do this, reducing the data size as well). This ensures that buildings which are not close together are not accidentally linked. For basic graph partitioning we only really need the connections between pairs of nodes: i.e, all unique occurrences of (POINTA,POINTB), (POINTB,POINTC),(POINTC,POINTA). I found by importing the Delaunay data imported into PostGIS, dumping the nodes and applying a window function to the result:

select * from (
  select gid, idx, refer ref1, lead(refer,1) over w ref2  

  from (select gid, (dp).path[2] as idx
          , case (dp).path[2] when 1 then "POINTA"
                 when 2 then "POINTB"
                 when 3 then  "POINTC"
                 else "POINTA" end refer,
        from (select *, st_dumppoints(geom) dp
              from delaunay) a
        ) b
  window  w as (partition by gid order by idx) ) c
where c.ref2 is not null;
No doubt this SQL could be tidied a bit!

The data needs to be cleaned to remove duplicate edges (pairs of ref1,ref2), but can then be used to generate a listing of the graph suitable for input to an open-source graph partition package (I used metis under Cygwin), as follows:
create table pbx_graph_mitis as
select ref1, array_agg(xref2 order by xref2) xref2_arr from
(select * from pbx_graph order by ref1,ref2) a
group by ref1, ref2;
This data was exported from PostGIS ran through metis with a simple no of partitions parameter. The result file was then pulled back into PostGIS and all building centroid rows had the partition number added. Thereafter it was a simple case of creating histogram data for each partition:

select part_no
, sum(case when decade_built = 1840 then cnt else 0 end) b1840
, sum(case when decade_built = 1850 then cnt else 0 end) b1850
, sum(case when decade_built = 1860 then cnt else 0 end) b1860
, sum(case when decade_built = 1870 then cnt else 0 end) b1870
, sum(case when decade_built = 1880 then cnt else 0 end) b1880
, sum(case when decade_built = 1890 then cnt else 0 end) b1890
, sum(case when decade_built = 1900 then cnt else 0 end) b1900
, sum(case when decade_built = 1910 then cnt else 0 end) b1910
, sum(case when decade_built = 1920 then cnt else 0 end) b1920
, sum(case when decade_built = 1930 then cnt else 0 end) b1930
, sum(case when decade_built = 1940 then cnt else 0 end) b1940
, sum(case when decade_built = 1950 then cnt else 0 end) b1950
, sum(case when decade_built = 1960 then cnt else 0 end) b1960
, sum(case when decade_built = 1970 then cnt else 0 end) b1970
, sum(case when decade_built = 1980 then cnt else 0 end) b1980
, sum(case when decade_built = 1990 then cnt else 0 end) b1990
, sum(case when decade_built = 2000 then cnt else 0 end) b2000
, sum(case when decade_built = 2010 then cnt else 0 end) b2010
from (
select 10*floor(year_built/10) decade_built, part_no, count(*) cnt
from pbx_paritioned
group by 10*floor(year_built/10), part_no) a
group by part_no
order by 1

Once again this data was exported as a CSV file.

I used a totally naive approach to doing Ward Clustering in R. Essentially I followed the cookbook as described here. This is the resulting clustering diagram:

Cluster dendrogram output from R, the five clusters used are highlighted in red

Methodologically this approach is rather flakey. You do not have to be a mathematician to spot various problems. The obvious issues which occur to me are:
  • I did not exclude buildings without a building date
  • I did include small buildings such as sheds and garages which are much more likely to be ephemeral than houses and other larger buildings
  • Newly built sections will have values of 0 for all prior decades, but I hope to cluster such sections with older sections where the original buildings are largely intact.
  • The clustering may just reflect either: landuse (residential versus commercial) or age of an area when construction started. In both cases the clustering would provide trivial information
  • The choice of clustering method was predicated on finding a workable example rather than any deep statistical principles.
  • The number of clusters chosen (5) was totally arbitrary and influenced by the example I was copying.
  • There is no control data to assess whether the clusters reflect anything meaningful (hampered by my complete absence of any feel for the geography of Portland).
On the other hand, I think the basic approach is valid, and given additional data sets with suitable control data could be extended

Finally, the whole point of looking at persistence is to gain insights into ways in which we can map both cities as they are and as they have been. Having a large reasonably complete dataset such as Portland's buildings enables us to learn what sort of information is useful, and also learn how others have solved data problems. In this case two aspects of the data were critical: building start date, and building type. It is not important that start date be accurate to year, indeed it is often easy to place a building in a give decade merely by inspection, whereas assigning a year may require detailed research. Building type is important so as to be able to exclude things like garages. It may have further importance in a more elaborate clustering technique. Both these tags exist and are used in OSM: this analysis suggests that they are important and valuable attributes for buildings.

Ideally I would like to run tree-based classification models for this type of segmentation or clustering as they lend themselves to better testing and the rules themselves can often be quickly sense-checked by eye. I'll return to this theme shortly using real OSM data.

No comments:

Post a Comment