Weighting meinGrün Zielgeometrien from geolocated Social Media posts for measuring greenspace popularity

This notebook contains the complete process of applying weights, derived from LBSM data (geolocated social media posts), to meinGrün Zielgeometrien.

Intersecting points with polygons is not complex in itself. However, there are several potential pitfalls in the process.

Pitfalls:

  • to improve representativity, we're using different data sources (Twitter, Instagram, Flickr)
    • all of these have different compositions, biases, and sources of noise
  • for coordinate mapping, a primary source of noise is the differing geoaccuracy (gps/'latlng' vs. 'place')
    • Flickr has the highest geoaccuracy, while Twitter and Instagram mostly have geoaccuracy 'place'
    • this must be taken into account when assigning coordinates to different shapes (we'll be using a SEARCH_RADIUS of 25 m and 50 m)
  • the overall quantity of photos is spread quite differently
    • Instagram makes up about 80% of the data, Flickr about 20%, and Twitter less than 1%
    • normalization of the three data sources is necessary so that each source has a balanced impact on the final result
    • Twitter may introduce biases due to its low amount of data, which can be taken into account by reducing Twitter's overall impact
  • normalization on area is necessary, otherwise large shapes would dominate because of their higher likelihood of intersecting with many coordinates
    • area must be calculated for different search distances used per origin:
      • 50m buffer area for Twitter/Instagram coordinates
      • 25m buffer area for Flickr coordinates
  • long tail: typically, Social Media posts peak at a few places; but we're also interested in the 'long tail' of the data (the other 80% of less frequented places)
    • we can apply log or sqrt to flatten this peak and improve visibility of less frequented places
  • finally we need to classify and label data into arbitrary classes
    • after normalization, absolute numbers are meaningless, therefore we can 'stretch' results per shape to a 1 to 1000 range to improve comparability across runs (see the sketch after this list)
    • based on the 1 to 1000 range, we can apply natural_breaks or other classifiers to label bins of common classes (e.g. 6 bins are used herein)
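The following short sketch illustrates the core of this weighting chain with toy numbers only (hypothetical values, not the real data); the full implementation, including the per-origin handling, follows later in this notebook.

In [ ]:
# toy sketch of the weighting chain: area normalization, sqrt flattening,
# and stretching to a 1 to 1000 range (hypothetical values, for illustration only)
import numpy as np

post_counts = np.array([1200, 310, 35, 4, 0])          # hypothetical posts per shape
areas_skm = np.array([0.80, 0.60, 0.05, 0.01, 0.30])   # hypothetical shape areas (km²)

weights = np.sqrt(post_counts / areas_skm)             # normalize by area, flatten the peak
# stretch to 1..1000 so absolute numbers become comparable across runs
weights = np.interp(weights, (weights.min(), weights.max()), (1, 1000))
print(weights.round(1))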

Let's start by importing the necessary dependencies. Make sure you have these packages installed (use miniconda with packages from the conda-forge and pyviz channels!), then execute the following code block.

In [1]:
import fiona
import geoviews as gv
import geoviews.feature as gf
from pathlib import Path
from shapely.geometry import shape
import geopandas as gp
import pandas as pd
from shapely.geometry import Point
import holoviews as hv
import numpy as np
from geopandas.tools import sjoin
# enable shapely.speedups, which makes some of the spatial queries run faster
import shapely.speedups
from cartopy import crs as ccrs
from collections import defaultdict
from collections import namedtuple
hv.notebook_extension('bokeh')
shapely.speedups.enable()

Because we're going to display pretty large maps, graphics, and interactive visualizations in this notebook, the following line may improve Jupyter's responsiveness.

In [81]:
%config Completer.use_jedi = False

Define Input Parameters

This section contains the only manual code edits that are needed to run this notebook.

  • ZIELGEOMETRIEN_DATA: the GeoJSON containing the Zielgeometrien (polygons); social media posts will be used to calculate weights for each polygon in this GeoJSON
  • LBSM_INPUT_DATA: the social media data used to calculate weights, i.e. a collection of post records with lat-lng coordinates. This file was generated using the tagmaps package, which applies filters and cleans up noisy social media raw data

The files should be placed in a subfolder called "./01_Input/" (relative to this notebook).

In [84]:
TARGETSHAPE_VERSION = 'v4.1'

# Level 0:
# LEVEL = "L0"
# TARGET_SHAPE_ID_COL = "TARGET_ID"
# GREEN_TARGET_COL = "TARGET_green_target"
# ZIELGEOMETRIEN_DATA = "GeoJSON_Dresden_4.1/targets.json"
# ZIELGEOMETRIEN_DATA = "GeoJSON_Heidelberg/targets.json"

# Level 1 (Wege):
# LEVEL = "L1"
# TARGET_SHAPE_ID_COL = "WAY_ID"
# GREEN_TARGET_COL = "PARENT_TARGET_green_target"
# ZIELGEOMETRIEN_DATA = "GeoJSON_Dresden_4.1/targetParts_Wege.json"
# ZIELGEOMETRIEN_DATA = "GeoJSON_Heidelberg/targets_Wege.json"

# Level 2 (Wege LC):
LEVEL = "L2"
TARGET_SHAPE_ID_COL = "LULC_ID"
GREEN_TARGET_COL = "PARENT_TARGET_green_target"
ZIELGEOMETRIEN_DATA = "GeoJSON_Dresden_4.1/targetParts_Wege_LULC.json"
# ZIELGEOMETRIEN_DATA = "GeoJSON_Heidelberg/targets_Wege.json"

# LBSM_INPUT_DATA = "Output_cleaned_Dresden_FlickrOnly.csv"
# LBSM_INPUT_DATA = "Output_cleaned_Heidelberg_FlickrOnly.csv"
# LBSM_INPUT_DATA = "2019-09-03_meinGruen_Heidelberg.csv"
LBSM_INPUT_DATA = "2019-09-02_meinGruen_Dresden.csv"
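Optional sanity check (a small sketch, assuming the folder layout described above): verify that both input files are present before running the rest of the notebook.

In [ ]:
# optional: check that the configured input files exist in ./01_Input/
from pathlib import Path

input_folder = Path.cwd() / "01_Input"
for file_name in (ZIELGEOMETRIEN_DATA, LBSM_INPUT_DATA):
    file_path = input_folder / file_name
    print(f'{file_path.name} exists: {file_path.exists()}')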
In [3]:
import datetime as dt
def get_str_formatted_today():
    """Returns date as string (YYYY-mm-dd)"""
    today = dt.date.today()
    today_str = today.strftime("%Y-%m-%d")
    return today_str
TODAY = get_str_formatted_today()

Define manual boundaries for map zooms

Define the initial extent the map should be zoomed to. If empty, the map will show the full boundary of the provided target geometries (which may result in slow performance of the final map). To select boundary lat/lng coordinates, use (online tool link).

In [4]:
# Dresden Altstadt
LIM_LNG_MAX_1 = 13.750784
LIM_LNG_MIN_1 = 13.730367
LIM_LAT_MAX_1 = 51.055349
LIM_LAT_MIN_1 = 51.048746

# Dresden Center
LIM_LNG_MAX_2 = 13.788357
LIM_LNG_MIN_2 = 13.706689
LIM_LAT_MAX_2 = 51.060386
LIM_LAT_MIN_2 = 51.033973

# Dresden Neustadt:
LIM_LNG_MAX_3 = 13.782692
LIM_LNG_MIN_3 = 13.701024
LIM_LAT_MAX_3 = 51.087890
LIM_LAT_MIN_3 = 51.061492

# Heidelberg Center
LIM_LNG_MAX_4 = 8.73
LIM_LNG_MIN_4 = 8.68
LIM_LAT_MAX_4 = 49.420
LIM_LAT_MIN_4 = 49.404

# Total Bounds
LIM_LNG_MAX_5 = 14.0876945
LIM_LNG_MIN_5 = 13.46475261
LIM_LAT_MAX_5 = 51.24305035
LIM_LAT_MIN_5 = 50.90449772

Define the maximum search radius (in meters) to be used when intersecting polygons with points, and provide a city name (used when writing data):

In [5]:
SEARCH_RADIUS_MAX =  50
CITY_NAME = "Dresden"
# CITY_NAME = "Heidelberg"

Load Shape Data

Load GeoJSON

In [6]:
geom_input_filepath = Path.cwd() / "01_Input" / ZIELGEOMETRIEN_DATA
In [7]:
%%time
meingruen_targetgeom_shapes = gp.read_file(geom_input_filepath)
total_bounds = meingruen_targetgeom_shapes.total_bounds
print(f'Total bounds: {total_bounds}')
print(f'Number of shapes: {len(meingruen_targetgeom_shapes)}')
Total bounds: [13.46475261 50.90449772 14.0876945  51.24305035]
Number of shapes: 290015
CPU times: user 56.2 s, sys: 3.66 s, total: 59.9 s
Wall time: 1min

Assign bounds to global variables so we can use them later.

In [8]:
LIM_LNG_MIN, LIM_LAT_MIN, LIM_LNG_MAX, LIM_LAT_MAX = total_bounds
  • Project from WGS1984 (4326) to World Mercator (epsg:3395)
In [9]:
gdf_targetgeom_shapes = meingruen_targetgeom_shapes.to_crs({'init': 'epsg:3395'})

Preview Shape Table

In [10]:
gdf_targetgeom_shapes.head()
Out[10]:
OBJECTID PARENT_TARGET_ID PARENT_TARGET_green_target PARENT_TARGET_gemeinde_DD PARENT_WAY_ID LULC_ID LULC_origin LULC_access LULC_name LULC_type ... classNoData_ratio class0_area class1_area class2_area class3_area class4_area classNoData_area SHAPE_Length SHAPE_Area geometry
0 1 dd_OSM_57149 False False dd_OSM_57149_5 dd_OSM_57149_5_22 osm None grass ... 0.721358 162.244025 166.413835 NaN NaN 23.679440 910.435348 0.003540 1.617494e-07 (POLYGON ((1500542.089541753 6590588.524965676...
1 283517 dd_EBK-TSP-OSM_27937 False True dd_EBK-TSP-OSM_27937_2 dd_EBK-TSP-OSM_27937_2_3 osm_building None building ... NaN 77.572044 NaN NaN NaN 10.155274 NaN 0.000570 1.124700e-08 (POLYGON ((1513460.81741509 6597924.052811448,...
2 116638 dd_OSM_54537 False False dd_OSM_54537_2 dd_OSM_54537_2_1 osm None building ... NaN 0.174807 28.345594 NaN NaN 1.379752 NaN 0.000265 3.828652e-09 (POLYGON ((1522592.089413918 6587890.020932667...
3 73544 dd_EBK-TSP-OSM_30836 False True dd_EBK-TSP-OSM_30836_1 dd_EBK-TSP-OSM_30836_1_19 osm_building None building ... NaN 81.507395 NaN NaN NaN 10.156085 NaN 0.000464 1.175166e-08 (POLYGON ((1543222.089401244 6597951.100413561...
4 192364 dd_EBK-TSP-OSM_31363 False True dd_EBK-TSP-OSM_31363_18 dd_EBK-TSP-OSM_31363_18_1 osm_landuse Kleingartenverein Spitzwegstraße 1 e.V. allotments ... NaN 430.625513 NaN NaN NaN 49.888683 NaN 0.001968 6.154865e-08 (POLYGON ((1531773.928959356 6590565.064079593...

5 rows × 29 columns

Remove unnecessary columns

There are many columns in this dataset that are of no interest to the computation in this notebook. Have a look at what's there:

In [11]:
list(gdf_targetgeom_shapes.columns.values)
Out[11]:
['OBJECTID',
 'PARENT_TARGET_ID',
 'PARENT_TARGET_green_target',
 'PARENT_TARGET_gemeinde_DD',
 'PARENT_WAY_ID',
 'LULC_ID',
 'LULC_origin',
 'LULC_access',
 'LULC_name',
 'LULC_type',
 'LULC_type_ger',
 'LULC_rank',
 'LULC_AREA',
 'LULC_LENGTH',
 'class0_ratio',
 'class1_ratio',
 'class2_ratio',
 'class3_ratio',
 'class4_ratio',
 'classNoData_ratio',
 'class0_area',
 'class1_area',
 'class2_area',
 'class3_area',
 'class4_area',
 'classNoData_area',
 'SHAPE_Length',
 'SHAPE_Area',
 'geometry']

Drop columns that are not needed:

In [12]:
def try_drop_cols_df(gdf, drop_cols):
    """Drop columns from pandas dataframe with try..except"""
    for drop_col in drop_cols:
        # apply drop
        try:
            gdf.drop(
                columns=[drop_col], inplace=True)
        except KeyError:
            pass
        
drop_list = ['class0_ratio', 'class1_ratio', 'class2_ratio', 'class3_ratio',
             'class4_ratio', 'classNoData_ratio', 'class0_area', 'class1_area',
             'class2_area', 'class3_area', 'class4_area', 'classNoData_area',
             'SHAPE_Length', 'SHAPE_Area', 'WAY_AREA', 'WAY_LENGTH',
             'PARENT_TARGET_ID', 'PARENT_TARGET_gemeinde_DD', 'PARENT_WAY_ID',
             'LULC_origin', 'LULC_access', 'LULC_name', 'LULC_type',
             'LULC_type_ger', 'LULC_rank', 'LULC_AREA', 'LULC_LENGTH']

# apply drop
try_drop_cols_df(gdf_targetgeom_shapes, drop_list)

Check the new list of remaining columns

In [18]:
list(gdf_targetgeom_shapes.columns.values)
Out[18]:
['OBJECTID', 'PARENT_TARGET_green_target', 'LULC_ID', 'geometry']

Optional: Preview Plot with Geopandas directly

In [13]:
gdf_targetgeom_shapes.geometry.plot(figsize=(20,10))
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe9aa232c50>

Print bounds of target shapes (World Mercator projection)

In [14]:
gdf_targetgeom_shapes.geometry.total_bounds
Out[14]:
array([1498889.40453656, 6571235.69297612, 1568234.9781912 ,
       6631058.34569445])

Projections UTM

There are several projections used in this notebook:

  • WGS1984 (epsg:4326) for input/output data in decimal degrees
  • World Mercator (epsg:3395) for map display in Geoviews/Bokeh
  • UTM projected coordinate system for local distance calculations

For accurate distance calculations in meters, we need to project the data to UTM. The tagmaps package has a method that returns the best UTM projection for a given lat-lng coordinate:

In [15]:
from tagmaps.classes.utils import Utils
In [16]:
from shapely.geometry import Point, Polygon
import pyproj
def get_poly_from_bounds(bounds):
    '''Construct a bounding-box polygon from the given bounds'''
    xmin, ymin, xmax, ymax = bounds
    bbox_polygon = Polygon([[xmin, ymax], [xmax, ymax], [xmax, ymin], [xmin, ymin]])
    return bbox_polygon

bound_poly = get_poly_from_bounds(meingruen_targetgeom_shapes.geometry.total_bounds)
print(f'input_lon_center: {bound_poly.centroid.x}')
print(f'input_lat_center: {bound_poly.centroid.y}')
epsg_code = Utils._convert_wgs_to_utm(
            bound_poly.centroid.x, bound_poly.centroid.y)
crs_proj = pyproj.Proj(init=f'epsg:{epsg_code}')
crs_wgs = pyproj.Proj(init=f'epsg:4326')

# report
print(f'Target EPSG Code: {epsg_code}')
print(f'Target projection: {crs_proj}')
input_lon_center: 13.776223556499998
input_lat_center: 51.07377403249999
Target EPSG Code: 32633
Target projection: Proj('+proj=utm +zone=33 +datum=WGS84 +units=m +no_defs', preserve_units=True)
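If tagmaps is not available, the UTM EPSG code can also be derived directly from the centroid coordinates. The following is a minimal sketch of the standard 6-degree zone formula (not the tagmaps implementation, and ignoring the special zones around Norway/Svalbard):

In [ ]:
# fallback sketch: derive the UTM EPSG code from a WGS84 lon/lat pair
import math

def wgs_to_utm_epsg(lon, lat):
    """Return the EPSG code of the UTM zone containing lon/lat (WGS84)."""
    zone = int(math.floor((lon + 180) / 6) + 1)
    # EPSG 326xx = northern hemisphere, 327xx = southern hemisphere
    return (32600 if lat >= 0 else 32700) + zone

print(wgs_to_utm_epsg(bound_poly.centroid.x, bound_poly.centroid.y))  # 32633 for Dresden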

Optional: Plot Preview in Geoviews with Bokeh Backend

Convert to Geoviews layer for plotting

In [17]:
import cartopy
from shapely.geometry import Polygon, MultiPolygon
from shapely.geometry.collection import GeometryCollection
# project to the best-fit UTM coordinate system (epsg_code)
gdf_targetgeom_shapes.to_crs({'init': f'epsg:{epsg_code}'}, inplace=True)
# convert to Geoviews Shapes
gv_shapes = gv.Polygons([gv.Shape(feat) for feat in gdf_targetgeom_shapes.geometry if isinstance(feat, (Polygon, MultiPolygon))], crs=cartopy.crs.epsg(epsg_code))
# Alternative does not work:
# gv_shapes = gv.Path(
#    osm_highway_shapes.head(50).geometry, kdims=['x', 'y'], vdims=['@osmid', 'highway']).opts(tools=['hover'])

Plot Rasterized static with datashade

  • use datashade, faster display but lower quality
In [15]:
from holoviews.operation.datashader import datashade, shade, dynspread, rasterize
In [16]:
datashade(gv_shapes).opts(width=800, height=480)
<string>:6: RuntimeWarning: Converting input from bool to <class 'numpy.uint8'> for compatibility.
Out[16]:

Plot interactive rasterized

In [ ]:
%%opts Image [colorbar=True clipping_colors={'NaN': (0, 0, 0, 0)}]
# %%opts Image [clipping_colors={'NaN': 'transparent'}]
# see https://anaconda.org/jbednar/datashade_vs_rasterize/notebook

original = rasterize(gv_shapes, precompute=True)
# clipped  = original.redim.range(z=(0.1, 1)).opts(clipping_colors={'min': 'transparent', 'NaN': 'transparent'})

# transparent alternative (a little bit slower)
from holoviews.operation.datashader import rasterize, dynspread, datashade
clipped = dynspread(datashade(gv_shapes))

gv.tile_sources.EsriImagery * \
clipped.opts(
    width=800,
    height=480,
    alpha=0.8)

Intersection

For assigning points to polygons, there are several options available:

  • use geopandas.sjoin (unfortunately, no distance-based search radius available; see the sketch after this list)
  • use spatial rtree and best fit custom method
  • use from shapely.ops import nearest_points

Since it is the fastest and most versatile, we'll explore the second option.
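For reference, option one would look roughly like the sketch below (assuming the point GeoDataFrame lbsm_data_gdf that is built further down, reprojected to the same CRS as the target shapes). It only yields exact point-in-polygon matches and offers no per-origin search radius, which is why we use the rtree approach instead.

In [ ]:
# rough sketch of option 1 (geopandas.sjoin), shown for comparison only;
# assumes lbsm_data_gdf (built below) and gdf_targetgeom_shapes share one CRS
joined = sjoin(lbsm_data_gdf, gdf_targetgeom_shapes, how='inner', op='intersects')
# posts per target shape (exact intersection, no search radius)
posts_per_shape = joined.groupby('index_right').size()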

1. Load LBSM Data

Load the CSV of LBSM points into a pandas dataframe; two formats are available:

  • Raw data (Instagram, Flickr, Twitter)
  • Cleaned format merged based on UPL (Flickr only, because Instagram/Twitter are not available at the highest geo-accuracy).

First, detect the format.

In [19]:
import csv
file_input_path = Path.cwd() / "01_Input" / LBSM_INPUT_DATA
with open(file_input_path, 'r') as infile:
    reader = csv.DictReader(infile)
    fieldnames = reader.fieldnames
# there are two different formats of input data:
# cleaned_posts (UPL)
# and raw data (UP)
# cleaned posts have a slightly different format
# which is detected here based on the CSV's header
if 'latitude' in fieldnames:
    raw_format = True
    usecols = ['origin_id', 'latitude', 'longitude', 'user_guid', 'post_create_date', 'post_publish_date']
    dtypes={'origin_id': str,'latitude': float, 'longitude': float, 'user_guid': str, 'post_create_date': str, 'post_publish_date': str}
elif 'lat' in fieldnames:
    raw_format = False
    usecols = ['origin_id', 'lat', 'lng', 'user_guid', 'post_create_date', 'post_publish_date']
    dtypes={'origin_id': str,'lat': float, 'lng': float, 'user_guid': str, 'post_create_date': str, 'post_publish_date': str}
else:
    print("Unknown Format")

Load data in dataframe df

In [20]:
%%time
df = pd.read_csv(file_input_path, usecols=usecols, dtype=dtypes, encoding='utf-8')
print(f'Loaded {len(df)} posts.')
Loaded 991117 posts.
CPU times: user 2.69 s, sys: 609 ms, total: 3.3 s
Wall time: 3.32 s

For Flickr, both post_create_date and post_publish_date are available; for Instagram and Twitter, only post_publish_date is available.

  • Combine both, with a preference for post_create_date
  • wrapped in try..except so the cell can be executed multiple times
  • rename final columns
In [71]:
try:
    df['post_create_date'].fillna(df['post_publish_date'], inplace=True)
    df.drop(columns=['post_publish_date'], inplace=True)
    df.rename(columns={'post_create_date': 'post_date'}, inplace=True)
except KeyError:
    pass
In [72]:
# df.iloc[: , [5, 13]].head()
df.head()
Out[72]:
origin_id latitude longitude user_guid post_date geometry
0 3 51.040557 13.731987 2de2c7b4c046fb9c2a7fcfeaa479356d 2016-06-10 10:46:41 POINT (411103.140159386 5655099.9591291)
1 1 51.091362 13.697633 145301fdbcf5d5261ff46d63e0962586 2015-10-09 16:11:39 POINT (408794.6639901432 5660791.445413328)
2 1 51.042050 13.797690 df8244ec7295c142668fbcb530b6c2cc 2018-02-06 00:41:07 POINT (415711.9591703562 5655188.755580138)
3 1 51.102810 13.686440 98d6841fdbfef53a01b6247d70bae2bc 2018-03-11 18:48:22 POINT (408033.5643395127 5662078.377559571)
4 1 51.044881 13.735635 5832633cf2de965477d171bdca251eb5 2016-05-16 09:56:44 POINT (411367.1581114109 5655576.364487658)

Optional: Get some statistics for origin

  • 1 = Instagram
  • 2 = Flickr
  • 3 = Twitter
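For readable labels, the numeric origin_id can be mapped to the service names from the list above. This small optional helper (an assumption added for convenience, not used elsewhere in the notebook) prints counts per service name:

In [ ]:
# optional: map numeric origin_id to service names for readable output
ORIGIN_NAMES = {"1": "Instagram", "2": "Flickr", "3": "Twitter"}
print(df['origin_id'].map(ORIGIN_NAMES).value_counts())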
In [21]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams.update({'xtick.labelsize': 'small'})
plt.rcParams.update({'ytick.labelsize': 'small'})
plt.rc('legend', fontsize=10)

sns.countplot(x='origin_id', hue='origin_id', data=df)
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe905386518>
In [22]:
size_count = df.groupby('origin_id').size()
total_count = len(df)
for key, value in size_count.items():
    print(f'Origin ID {key} with {value} posts, total percentage is: {value/(total_count/100):.2f}%')
Origin ID 1 with 806181 posts, total percentage is: 81.34%
Origin ID 2 with 183249 posts, total percentage is: 18.49%
Origin ID 3 with 1687 posts, total percentage is: 0.17%

2. Convert pandas dataframe to geopandas geodataframe

In [77]:
from shapely.geometry import Point
    
def make_point(row):
    if raw_format:
        return Point(row.longitude, row.latitude)
    return Point(row.lng, row.lat)

# Go through every row, and make a point out of its lat and lon
points = df.apply(make_point, axis=1)

# Make a new GeoDataFrame
# using the data from our old df
# but also adding in the geometry we just made
lbsm_data_gdf = gp.GeoDataFrame(df, geometry=points)
# set input projection
lbsm_data_gdf.crs = {'init' :'epsg:4326'}
# project to UTM
lbsm_data_gdf.to_crs({'init': f'epsg:{epsg_code}'}, inplace=True)
In [79]:
lbsm_data_gdf.head()
Out[79]:
origin_id latitude longitude user_guid post_date geometry
0 3 51.040557 13.731987 2de2c7b4c046fb9c2a7fcfeaa479356d 2016-06-10 10:46:41 POINT (411103.140159386 5655099.9591291)
1 1 51.091362 13.697633 145301fdbcf5d5261ff46d63e0962586 2015-10-09 16:11:39 POINT (408794.6639901432 5660791.445413328)
2 1 51.042050 13.797690 df8244ec7295c142668fbcb530b6c2cc 2018-02-06 00:41:07 POINT (415711.9591703562 5655188.755580138)
3 1 51.102810 13.686440 98d6841fdbfef53a01b6247d70bae2bc 2018-03-11 18:48:22 POINT (408033.5643395127 5662078.377559571)
4 1 51.044881 13.735635 5832633cf2de965477d171bdca251eb5 2016-05-16 09:56:44 POINT (411367.1581114109 5655576.364487658)

Optional: Preview Intersection for a Single Target Geometry Polygon

Let's test the intersection for a small sample:

  • 1 polygon from the meinGrün Target Geometries
  • 1 LBSM post

a) Get one polygon from meinGrün Target Geometries

  • modify the number to get different samples
In [47]:
# Get row for specific ID
# id = 9
# id = 20
# id = 26
id = 29
gdf_targetgeom_shapes.iloc[id].geometry
Out[47]:

It is possible to get the bounds of the selected polygon by:

In [48]:
gdf_targetgeom_shapes.iloc[id].geometry.bounds
Out[48]:
(416826.3844851471, 5654901.461591735, 417052.89840409183, 5655199.953776324)

b) Prepare poly geoviews layer

In [53]:
gv_sample_poly = gv.Shape(
    gdf_targetgeom_shapes.iloc[id].geometry,
    label=f'meinGrün Target Polygon', crs=cartopy.crs.epsg(epsg_code))
gv_sample_poly_buf = gv.Shape(
    gdf_targetgeom_shapes.iloc[id].geometry.buffer(50, resolution=3, cap_style=1, join_style=1, mitre_limit=5.0)
    ,
    label=f'meinGrün Target Polygon', crs=cartopy.crs.epsg(epsg_code))

c) Optional: Select sample points by bounding box (slow)

In [50]:
%%time
from shapely.geometry import Point, Polygon

def add_buffer(geom, buffer_num=50, return_poly=False):
    '''Add small buffer around poly bounds
    Buffer defaults to: 50m
    ''' 
    xmin, ymin, xmax, ymax = geom.bounds
    xmin = xmin - buffer_num
    ymin = ymin - buffer_num
    xmax = xmax + buffer_num
    ymax = ymax + buffer_num
    if return_poly:
        bbox_polygon = Polygon(
            [[xmin, ymax],
             [xmax, ymax],
             [xmax, ymin],
             [xmin, ymin]])
        return bbox_polygon
    return xmin, ymin, xmax, ymax

# adjust number for sample polygon
sample_polygon = gdf_targetgeom_shapes.iloc[id]
xmin, ymin, xmax, ymax = add_buffer(
    sample_polygon.geometry)

# intersect/clip points based on bbox input
# this is very slow! we'll use an improved version for mass intersection
sample_point_gdf = lbsm_data_gdf.cx[xmin:xmax, ymin:ymax]

print(f'Intersections found for id {id}: {len(sample_point_gdf)} posts.')

# for automatic selection of polygon with many intersections
# use code block below

# max_intersect = 0
# max_id_ref = None
# for id in range(1,30):
#     # adjust number for sample polygon
#     sample_polygon = gdf_targetgeom_shapes.iloc[id]
#     xmin, ymin, xmax, ymax = add_buffer(
#         sample_polygon.geometry)
#     
#     # intersect/clip points based on bbox input
#     sample_point_temp = lbsm_data_gdf.cx[xmin:xmax, ymin:ymax]
#     number_of_intersections = len(sample_point_temp)
#     if number_of_intersections > max_intersect:
#         max_id_ref = id
#         max_intersect = number_of_intersections
#         sample_point_gdf = sample_point_temp
#         
# # report
# print(f'Max Intersections found for id {max_id_ref}: {len(sample_point_gdf)} posts.')
Intersections found for id 29: 7 posts.
CPU times: user 12.7 s, sys: 62.5 ms, total: 12.7 s
Wall time: 12.7 s

d) Prepare Geoviews Layer for selected Sample Points

In [51]:
sample_point_selection = gv.Points(sample_point_gdf, crs=cartopy.crs.epsg(epsg_code))

e) Output sample selection map

In [57]:
import pyproj
def set_active_tool(plot, element):
    # enable wheel_zoom by default
    plot.state.toolbar.active_scroll = plot.state.tools[0]

# the Overlay will automatically extend to the data dimensions,
# which cover the whole city area;
# since we want the initial zoom to be zoomed in to the selected sample geometry
# we need to supply xlim and ylim
# The Bokeh renderer only supports the Web Mercator projection,
# we therefore need to project the sample bounds first
# from UTM to Web Mercator
sample_bottomleft = pyproj.transform(
    pyproj.Proj(init=f'EPSG:{epsg_code}'), pyproj.Proj(init='EPSG:3857'), xmin, ymin)
sample_topright = pyproj.transform(
    pyproj.Proj(init=f'EPSG:{epsg_code}'), pyproj.Proj(init='EPSG:3857'), xmax, ymax)
# to show 200m radius
# centre_line = pyproj.transform(
#     pyproj.Proj(init=f'EPSG:{epsg_code}'),
#     pyproj.Proj(init='EPSG:3857'),
#     gdf_targetgeom_shapes.iloc[max_id_ref].geometry.centroid.x,
#     gdf_targetgeom_shapes.iloc[max_id_ref].geometry.centroid.y)

hv.Overlay(
    gv.tile_sources.EsriImagery * \
    # gv_sample_poly_buf.opts(line_width=0, fill_alpha=0.5, fill_color='grey') * \
    gv_sample_poly.opts(line_color='red', line_width=3, fill_alpha=0.5, tools=['hover']) * \
    # hv.Ellipse(centre_line[0], centre_line[1], 200, label='200m Buffer').opts(line_color='red', line_width=3, show_legend=True) * \
    sample_point_selection.opts(tools=['hover'], size=5, color='yellow') # * \
    # gv_testpoint
    ).opts(
        width=800,
        height=480,
        finalize_hooks=[set_active_tool],
        xlim=(sample_bottomleft[0], sample_topright[0]),
        ylim=(sample_bottomleft[1], sample_topright[1])
    )
Out[57]:

Optional: Intersect test using rtree

We'll create an rtree with bounding boxes for each polygon in the target geometries. This will be the most expensive operation; afterwards, we can use Shapely's distance function to assign points to a small number of candidate polygons. First, construct the RTree index for the target geometry shapes. We'll extend each bounding box by a small buffer so that we also catch points slightly outside a polygon.

  • adjust buffer_num to your needs; the minimum buffer_num should be the maximum SEARCH_RADIUS being used
In [58]:
%%time
from rtree import index

# Populate R-tree index with bounds of polygons
idx = index.Index()
for pos, poly in enumerate(gdf_targetgeom_shapes.geometry):
    idx.insert(pos, add_buffer(poly, buffer_num=50))
CPU times: user 34.7 s, sys: 0 ns, total: 34.7 s
Wall time: 34.7 s

Intersect our testpoint with the rtree

Next, we intersect our testpoint with the rtree and select those polygons that are near the testpoint. We can define a search radius. This is very fast because the heavy lifting is done by the spatial rtree intersection (idx.intersection(testpoint.coords[0])), while the expensive distance calculation from shapely (gdf_targetgeom_shapes.geometry[i].distance(testpoint)) is only done for the few polygons that are preselected based on their bounding box.

  • Important: Modify SEARCH_RADIUS (m) to test output for various distances

Since Instagram and Twitter have lower geoaccuracies on average, we use a larger search radius of 50 m, whereas for Flickr we'll use 25 m.

In [59]:
SEARCH_RADIUS_ORIGIN = {
    "1":50,"2":25,"3":50
}
In [76]:
lbsm_data_gdf.head()
Out[76]:
origin_id latitude longitude user_guid post_create_date post_publish_date geometry
0 3 51.040557 13.731987 2de2c7b4c046fb9c2a7fcfeaa479356d NaN 2016-06-10 10:46:41 POINT (411103.140159386 5655099.9591291)
1 1 51.091362 13.697633 145301fdbcf5d5261ff46d63e0962586 NaN 2015-10-09 16:11:39 POINT (408794.6639901432 5660791.445413328)
2 1 51.042050 13.797690 df8244ec7295c142668fbcb530b6c2cc NaN 2018-02-06 00:41:07 POINT (415711.9591703562 5655188.755580138)
3 1 51.102810 13.686440 98d6841fdbfef53a01b6247d70bae2bc NaN 2018-03-11 18:48:22 POINT (408033.5643395127 5662078.377559571)
4 1 51.044881 13.735635 5832633cf2de965477d171bdca251eb5 NaN 2016-05-16 09:56:44 POINT (411367.1581114109 5655576.364487658)
In [60]:
%%time
# Query the testpoint to see which polygons it is in or near,
# using the Rtree index first, then Shapely's distance/intersects
# SEARCH_RADIUS =  25
# to look for other points here, override id:
# id = 20
lbsm_post = lbsm_data_gdf.iloc[id]
SEARCH_RADIUS = SEARCH_RADIUS_ORIGIN.get(lbsm_post.origin_id)

testpoint = lbsm_post.geometry
gv_testpoint = gv.Points([testpoint], crs=cartopy.crs.epsg(epsg_code), label='Testpoint').opts(tools=['hover'], size=5, color='yellow')
# calculate boundary to show
xmin, ymin, xmax, ymax = add_buffer(lbsm_post.geometry)
sample_bottomleft = pyproj.transform(pyproj.Proj(init=f'EPSG:{epsg_code}'), pyproj.Proj(init='EPSG:3857'), xmin, ymin)
sample_topright = pyproj.transform(pyproj.Proj(init=f'EPSG:{epsg_code}'), pyproj.Proj(init='EPSG:3857'), xmax, ymax)

# with rtree only
poly_idx_rtree = [i for i in idx.intersection((testpoint.coords[0]))]

# with search radius
poly_idx = [i for i in idx.intersection((testpoint.coords[0]))
            if gdf_targetgeom_shapes.geometry[i].distance(testpoint) < SEARCH_RADIUS]
            #if gdf.geometry[i].interpolate(gdf.geometry[i].project(testpoint)).distance(testpoint) > search_radius]

# without search radius, exact intersection
poly_idx_exact = [i for i in idx.intersection((testpoint.coords[0]))
                   if gdf_targetgeom_shapes.geometry[i].intersects(testpoint)]

# report
print(f'{len(poly_idx_rtree)} preselected polygons through rtree')
print(f'{len(poly_idx)} potentially relevant polygons found (within search radius {SEARCH_RADIUS}).')
print(f'{len(poly_idx_exact)} exact intersection polygons found.')
29 preselected polygons through rtree
13 potentially relevant polygons found (within search radius 50).
1 exact intersection polygons found.
CPU times: user 62.5 ms, sys: 62.5 ms, total: 125 ms
Wall time: 243 ms
In [61]:
print(poly_idx)
[264362, 175021, 260431, 227910, 81690, 68946, 75593, 222158, 57426, 139542, 75968, 84505, 31346]
  • Select subset of relevant polygons based on id's
In [62]:
gdf_targetgeom_shapes.iloc[poly_idx].head()
Out[62]:
OBJECTID PARENT_TARGET_green_target LULC_ID geometry
264362 207751 False dd_EBK-TSP-OSM_27773_1_4 (POLYGON ((410493.7919960411 5655589.181288132...
175021 222906 False dd_EBK-TSP-OSM_20278_1_4 (POLYGON ((410417.471424504 5655661.089929336,...
260431 200887 False dd_EBK-TSP-OSM_27773_2_2 (POLYGON ((410507.1696264762 5655643.86359268,...
227910 257301 False dd_EBK-TSP-OSM_27773_1_7 (POLYGON ((410493.7919960411 5655589.181288132...
81690 60140 False dd_EBK-TSP-OSM_27773_4_1 (POLYGON ((410424.0985141604 5655687.181977004...
  • prepare geoviews layer for preview
In [63]:
testpoly_sel = gv.Polygons(
    [gv.Shape(feat) for feat in gdf_targetgeom_shapes.iloc[poly_idx].geometry],
    crs=cartopy.crs.epsg(epsg_code),
    label=f'Found {len(poly_idx)} Target Polygons (search radius {SEARCH_RADIUS}m)')
testpoly_sel_exact = gv.Polygons(
    [gv.Shape(feat) for feat in gdf_targetgeom_shapes.iloc[poly_idx_exact].geometry],
    crs=cartopy.crs.epsg(epsg_code),
    label=f'Found {len(poly_idx_exact)} Target Polygons (exact intersection)')
  • output map (careful: might be slow)
In [64]:
hv.Overlay(
    gv.tile_sources.EsriImagery * \
    # gv_shapes.opts(line_color='white', line_width=1) * \
    testpoly_sel.opts(line_color='yellow', line_width=3, fill_alpha=0, show_legend=True) * \
    testpoly_sel_exact.opts(line_color='red', line_width=3, fill_alpha=0, show_legend=True, line_dash='dotted') * \
    gv_testpoint.opts(tools=['hover'], size=10, color='red', show_legend=True)
    #hv.Ellipse(gv_testpoint.x, gv_testpoint.y, 35, label='200m Buffer').opts(line_color='yellow', line_width=1, show_legend=True)
    ).opts(
        width=800,
        height=480,
        xlim=(sample_bottomleft[0], sample_topright[0]),
        ylim=(sample_bottomleft[1], sample_topright[1]),
        title_format=f"Selected Target Geometries for Testpoint and {SEARCH_RADIUS}m Search Radius",
    )
Out[64]:

Intersect all points with the target geometries

We execute this intersection for each point in our LBSM dataset and update all counts for nearby polygons from the meinGrün Target Geometries.

1. Add statistics field to input geometries

In [65]:
gdf_targetgeom_shapes.head()
Out[65]:
OBJECTID PARENT_TARGET_green_target LULC_ID geometry
0 1 False dd_OSM_57149_5_22 (POLYGON ((393349.4458630043 5652514.374407471...
1 283517 False dd_EBK-TSP-OSM_27937_2_3 (POLYGON ((401577.7054682793 5656972.97522185,...
2 116638 False dd_OSM_54537_2_1 (POLYGON ((407212.8660183108 5650545.754251033...
3 73544 False dd_EBK-TSP-OSM_30836_1_19 (POLYGON ((420314.3206534541 5656666.771566805...
4 192364 False dd_EBK-TSP-OSM_31363_18_1 (POLYGON ((413028.7688842474 5652131.050560708...

Initialize count structures

There are several things we want to measure per meinGrün target shape:

  • User_Posts: Number of social media posts in total
  • User_Days: Counting each user once per day (post timestamp)
  • User_Post_Locations: Counting each user-location coordinate only once (post latlng)
  • InstagramCount: Number of Instagram posts in total
  • FlickrCount: Number of Flickr posts in total
  • TwitterCount: Number of Twitter posts in total
In [73]:
# init count structures
gdf_targetgeom_shapes['User_Posts'] = 0
gdf_targetgeom_shapes['User_Days'] = [set() for _ in range(len(gdf_targetgeom_shapes))]
gdf_targetgeom_shapes['User_Post_Locations'] = [set() for _ in range(len(gdf_targetgeom_shapes))]
gdf_targetgeom_shapes['InstagramCount'] = 0
gdf_targetgeom_shapes['FlickrCount'] = 0
gdf_targetgeom_shapes['TwitterCount'] = 0
gdf = gdf_targetgeom_shapes
In [74]:
gdf_targetgeom_shapes.head()
Out[74]:
OBJECTID PARENT_TARGET_green_target LULC_ID geometry User_Posts User_Days User_Post_Locations InstagramCount FlickrCount TwitterCount
0 1 False dd_OSM_57149_5_22 (POLYGON ((393349.4458630043 5652514.374407471... 0 {} {} 0 0 0
1 283517 False dd_EBK-TSP-OSM_27937_2_3 (POLYGON ((401577.7054682793 5656972.97522185,... 0 {} {} 0 0 0
2 116638 False dd_OSM_54537_2_1 (POLYGON ((407212.8660183108 5650545.754251033... 0 {} {} 0 0 0
3 73544 False dd_EBK-TSP-OSM_30836_1_19 (POLYGON ((420314.3206534541 5656666.771566805... 0 {} {} 0 0 0
4 192364 False dd_EBK-TSP-OSM_31363_18_1 (POLYGON ((413028.7688842474 5652131.050560708... 0 {} {} 0 0 0
In [80]:
%%time
from IPython.display import clear_output

def get_userday(post):
    """Remove time from date-time string, e.g. 2019-01-01"""
    if isinstance(post.post_date, str) and len(post.post_date) >= 10:
        return f'{post.user_guid}:{post.post_date[:10]}'
    return f'{post.user_guid}:1900-01-01'
def get_location(post_geom):
    """Concat lat:lng to string"""
    return f'{post_geom.y}:{post_geom.x}'
def concat_userlocation(user_guid, location_str):
    """Concat user:lat:lng to string"""
    return f'{user_guid}:{location_str}'

    
x = 0
x_i = 0
# get ids from column names
column_id_up = gdf.columns.get_loc("User_Posts")
column_id_ud = gdf.columns.get_loc("User_Days")
column_id_upl = gdf.columns.get_loc("User_Post_Locations")
col_id_instagram = gdf.columns.get_loc('InstagramCount')
col_id_flickr = gdf.columns.get_loc('FlickrCount')
col_id_twitter = gdf.columns.get_loc('TwitterCount')
origin_col_dict = {
    "1":col_id_instagram, "2":col_id_flickr, "3":col_id_twitter}
# initialize dict lookup for location-poly-idx
latlng_dict_polyidx = dict()

total_records = len(lbsm_data_gdf)
for lbsm_record in lbsm_data_gdf.itertuples():
    # print(lbsm_record.geometry)
    # break
    # update status
    x += 1
    msg_text = (
        f'Processed records: {x} ({x/(total_records/100):.2f}%). '
        f'Intersection found: {x_i}')
    if x % 100 == 0:
        clear_output(wait=True)
        print(msg_text)
        
    # get search radius based on the current post's origin
    SEARCH_RADIUS = SEARCH_RADIUS_ORIGIN.get(
        lbsm_record.origin_id)
    # we save some processing time by remembering
    # repeated locations, sacrificing some RAM
    # (spatial intersection is more expensive than a dict lookup)
    location_str = get_location(lbsm_record.geometry)
    if location_str in latlng_dict_polyidx:
        poly_idx = latlng_dict_polyidx[location_str]
    else:
        poly_idx = [i for i in idx.intersection((lbsm_record.geometry.coords[0]))
                if gdf.geometry[i].distance(lbsm_record.geometry) < SEARCH_RADIUS]
        # add to dict for improving speed for repetitive locations
        latlng_dict_polyidx[location_str] = poly_idx
    # count intersection
    if len(poly_idx) > 0:
        x_i += 1
        userday = get_userday(lbsm_record)
        userlocation = concat_userlocation(lbsm_record.user_guid, location_str)
        # update poly count column
        # the iloc indexer for Pandas Dataframe is used for integer-
        # location based indexing / selection by position.
        # this may update/increment multiple rows/records
        origin_col = origin_col_dict.get(lbsm_record.origin_id)
        gdf.iloc[poly_idx, origin_col] += 1
        gdf.iloc[poly_idx, column_id_up] += 1
        # the following 2 operations are quite expensive on memory/cpu
        # disable if not needed
        # gdf.iloc[poly_idx, column_id_ud].apply(lambda x: x.add(userday))
        # gdf.iloc[poly_idx, column_id_upl].apply(lambda x: x.add(userlocation))
    
# final status
clear_output(wait=True)
print(msg_text)
Processed records: 991117 (100.00%). Intersection found: 892445
CPU times: user 1h 3min 14s, sys: 42.2 s, total: 1h 3min 56s
Wall time: 1h 4min 51s

Optional: store or load intermediate aggregated dataframe results

Store:

In [85]:
gdf.to_pickle(f"gdf_{CITY_NAME}_{LEVEL}_{TARGETSHAPE_VERSION}.pickle")

Load:

In [35]:
gdf = pd.read_pickle(f"gdf_{CITY_NAME}_{LEVEL}_{TARGETSHAPE_VERSION}.pickle")

Preview results table

In [86]:
gdf[[TARGET_SHAPE_ID_COL, 'User_Posts', 'User_Days', 'User_Post_Locations', 'InstagramCount', 'FlickrCount', 'TwitterCount']].head()
Out[86]:
LULC_ID User_Posts User_Days User_Post_Locations InstagramCount FlickrCount TwitterCount
0 dd_OSM_57149_5_22 0 {} {} 0 0 0
1 dd_EBK-TSP-OSM_27937_2_3 0 {} {} 0 0 0
2 dd_OSM_54537_2_1 0 {} {} 0 0 0
3 dd_EBK-TSP-OSM_30836_1_19 0 {} {} 0 0 0
4 dd_EBK-TSP-OSM_31363_18_1 0 {} {} 0 0 0

Convert Sets to unique counts

In [87]:
gdf['UD'] = gdf.apply(lambda row: len(row.User_Days), axis=1)
gdf['UPL'] = gdf.apply(lambda row: len(row.User_Post_Locations), axis=1)
In [88]:
gdf[[TARGET_SHAPE_ID_COL, 'User_Posts', 'User_Days', 'User_Post_Locations', 'UD', 'UPL', 'InstagramCount', 'FlickrCount', 'TwitterCount']].head()
Out[88]:
LULC_ID User_Posts User_Days User_Post_Locations UD UPL InstagramCount FlickrCount TwitterCount
0 dd_OSM_57149_5_22 0 {} {} 0 0 0 0 0
1 dd_EBK-TSP-OSM_27937_2_3 0 {} {} 0 0 0 0 0
2 dd_OSM_54537_2_1 0 {} {} 0 0 0 0 0
3 dd_EBK-TSP-OSM_30836_1_19 0 {} {} 0 0 0 0 0
4 dd_EBK-TSP-OSM_31363_18_1 0 {} {} 0 0 0 0 0

Remove the set columns, since they're not needed anymore:

In [89]:
gdf.drop(columns=['User_Post_Locations', 'User_Days'], inplace=True)
gdf.rename(columns={'User_Posts': 'UP'}, inplace=True)
gdf.rename(columns={'InstagramCount': 'UP_Instagram'}, inplace=True)
gdf.rename(columns={'FlickrCount': 'UP_Flickr'}, inplace=True)
gdf.rename(columns={'TwitterCount': 'UP_Twitter'}, inplace=True)

Our final GeoDataFrame with aggregated measurements per green target shape:

In [92]:
gdf[[TARGET_SHAPE_ID_COL, 'UP', 'UD', 'UPL', 'UP_Instagram', 'UP_Flickr', 'UP_Twitter']].head()
Out[92]:
LULC_ID UP UD UPL UP_Instagram UP_Flickr UP_Twitter
0 dd_OSM_57149_5_22 0 0 0 0 0 0
1 dd_EBK-TSP-OSM_27937_2_3 0 0 0 0 0 0
2 dd_OSM_54537_2_1 0 0 0 0 0 0
3 dd_EBK-TSP-OSM_30836_1_19 0 0 0 0 0 0
4 dd_EBK-TSP-OSM_31363_18_1 0 0 0 0 0 0
# remove duplicate columns (optional):
# gdf = gdf.loc[:,~gdf.columns.duplicated()]

Get Histogram

The histogram shows the typical long tail pattern: most shapes have few posts, but a few shapes have a lot of posts.

In [93]:
np.sqrt(np.sqrt(gdf['UP'])).hist(bins=20, alpha=0.5)
Out[93]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe922ea30b8>

Normalize for visualization

Absolute counts per shape are not very meaningful, since large shapes have a higher likelihood of intersecting with many posts.

  • We can normalize using the area (in square kilometres) of each target shape
  • Furthermore, we can "flatten" the long tail a bit by applying a square root function
  • these normalized numbers are only meaningful relative to each other, therefore we can stretch results to a 1 to 1000 range for comparability across different runs

Calculate buffered Area for search distances

In [94]:
# Get row for specific ID
id = 0
gdf.iloc[id].geometry
Out[94]:
In [95]:
gdf.iloc[id].geometry.buffer(50, resolution=3, cap_style=1, join_style=1, mitre_limit=5.0)
Out[95]:
In [96]:
print(f'{gdf.iloc[id].geometry.area:.2f} m²')
print(f'50m Search radius: {gdf.iloc[id].geometry.buffer(50, resolution=16, cap_style=1, join_style=1, mitre_limit=5.0).area:.2f} m²')
print(f'25m Search radius: {gdf.iloc[id].geometry.buffer(25, resolution=16, cap_style=1, join_style=1, mitre_limit=5.0).area:.2f} m²')
1262.12 m²
50m Search radius: 22494.49 m²
25m Search radius: 10003.08 m²
In [97]:
def m2_to_km2(m_num):
    """Convert square meters to square kilometres"""
    return m_num/1000000

print(f'{m2_to_km2(gdf.iloc[id].geometry.area):.4f} km²')
print(f'50m Search radius: {m2_to_km2(gdf.iloc[id].geometry.buffer(50, resolution=16, cap_style=1, join_style=1, mitre_limit=5.0).area):.4f} km²')
print(f'25m Search radius: {m2_to_km2(gdf.iloc[id].geometry.buffer(25, resolution=16, cap_style=1, join_style=1, mitre_limit=5.0).area):.4f} km²')
0.0013 km²
50m Search radius: 0.0225 km²
25m Search radius: 0.0100 km²

Calculate km²

  • for all geometries in the dataset
  • for all 25m buffered geometries in dataset
  • for all 50m buffered geometries in dataset
In [98]:
def shape_buf(geom, buf_dist_m):
    """Buffer geometry based on m distance"""
    buf_geom = geom.buffer(buf_dist_m, resolution=16, cap_style=1, join_style=1, mitre_limit=5.0)
    return buf_geom

gdf['area_skm'] = gdf['geometry'].apply(lambda x: m2_to_km2(x.area))
# same result:
# gdf['poly_area'] = gdf.geometry.area / 10**6
gdf['area_skm_SR25'] = gdf['geometry'].apply(lambda x: m2_to_km2(shape_buf(x, 25).area))
gdf['area_skm_SR50'] = gdf['geometry'].apply(lambda x: m2_to_km2(shape_buf(x, 50).area))
In [100]:
# gdf[[TARGET_SHAPE_ID_COL, 'UP', 'UD', 'UPL', 'UP_Instagram', 'UP_Flickr', 'UP_Twitter', 'area_skm', 'area_skm_SR25', 'area_skm_SR50']].head()
gdf[[TARGET_SHAPE_ID_COL, 'UP', 'UP_Instagram', 'UP_Flickr', 'UP_Twitter', 'area_skm', 'area_skm_SR25', 'area_skm_SR50']].head()
Out[100]:
LULC_ID UP UP_Instagram UP_Flickr UP_Twitter area_skm area_skm_SR25 area_skm_SR50
0 dd_OSM_57149_5_22 0 0 0 0 0.001262 0.010003 0.022494
1 dd_EBK-TSP-OSM_27937_2_3 0 0 0 0 0.000088 0.003105 0.010027
2 dd_OSM_54537_2_1 0 0 0 0 0.000030 0.002544 0.008978
3 dd_EBK-TSP-OSM_30836_1_19 0 0 0 0 0.000092 0.003036 0.009900
4 dd_EBK-TSP-OSM_31363_18_1 0 0 0 0 0.000480 0.007524 0.018478

Normalize

In [101]:
import math
def stretch_norm_dfcolumn(df, column, min_stretch, max_stretch):
    """Interpolates a Geodataframe column's values to a new range"""
    min_val = df[column].min()
    max_val = df[column].max()
    # replace values in-place
    # the lambda function is applied to all values from _column_
    # df.loc[:, column] means: replace values of _column_ in-place (modifying the original dataframe)
    # .loc[row_indexer,col_indexer]
    df.loc[:, column] = df[column].apply(
        lambda x: np.interp(x, (min_val, max_val), (min_stretch, max_stretch)))
    # similar but not recommended:
    # df[column] = np.interp(df[column], (min_val, max_val), (min_stretch, max_stretch))

# it is possible to 'flatten' the unequal distribution of values
# apply sqrt once for a bit of flattening, or twice for more flattening
gdf['UP_Norm'] = np.sqrt(gdf['UP']/gdf['area_skm'])
# gdf['UP_Norm'] = np.sqrt(np.sqrt(gdf['UP']/gdf['poly_area']))

# interpolate to 1 to 1000 range
stretch_norm_dfcolumn(gdf, 'UP_Norm', 1, 1000)

# print
gdf[[TARGET_SHAPE_ID_COL, 'UP', 'UP_Norm', 'UP_Instagram', 'UP_Flickr', 'UP_Twitter']].sort_values('UP_Norm', ascending=False).head()
# gdf[[TARGET_SHAPE_ID_COL, 'UP', 'UP_Norm', 'UD', 'UPL', 'UP_Instagram', 'UP_Flickr', 'UP_Twitter']].sort_values('UP_Norm', ascending=False).head()
Out[101]:
LULC_ID UP UP_Norm UP_Instagram UP_Flickr UP_Twitter
132445 dd_EBK-TSP-OSM_26382_9_2 38453 1000.000000 38147 306 0
285693 dd_EBK-TSP-OSM_22142_1_4 34606 932.013209 34476 130 0
160705 dd_EBK-TSP-OSM_34121_20_5 13197 740.291677 12779 413 5
33544 dd_EBK-TSP-OSM_34121_32_3 12599 738.057831 12318 269 12
49720 dd_EBK-TSP-OSM_34121_20_3 13110 722.994056 12779 326 5

Normalize Origin Influence

It becomes obvious that Instagram dominates the results. We want to generate a more even result across the 3 data sources.

Our next goal therefore is to allow only

  • 40% Influence by Instagram (reduced because Instagram has low geoaccuracy)
  • 50% Influence by Flickr (increased because Flickr features the highest geoaccuracy)
  • 10% Influence by Twitter (reduced because Twitter features sampling noise)

We can do this by using the information from UP_Instagram, UP_Flickr, and UP_Twitter columns.

Since we've used a Search_Radius of 50m for Instagram and Twitter and a Search_Radius of 25m for Flickr,
we need to normalize count values separately based on origin and buffered area.

Remember the original percentage of the three datasources:

In [102]:
origin_perc_total = dict()
for key, value in size_count.items():
    total_perc = value/(total_count/100)
    print(f'Origin ID {key} with {value} posts, total percentage is: {total_perc:.2f}%')
    origin_perc_total[key] = total_perc
Origin ID 1 with 806181 posts, total percentage is: 81.34%
Origin ID 2 with 183249 posts, total percentage is: 18.49%
Origin ID 3 with 1687 posts, total percentage is: 0.17%

Thus, we need to perform a two-step normalization:

  • first, normalize based on the total percentage of each data source
  • second, apply the manual influence weights (40%, 50%, 10%); see the toy calculation below
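A toy calculation (hypothetical counts, ignoring the per-origin buffer areas and the square root used in the real implementation below) illustrates how this rebalancing boosts underrepresented sources:

In [ ]:
# toy illustration of the two-step rebalancing (hypothetical counts for one shape)
counts = {"1": 80, "2": 20, "3": 1}               # posts per origin (Instagram, Flickr, Twitter)
global_share = {"1": 0.81, "2": 0.18, "3": 0.01}  # approx. share of each origin in the full dataset
target_weight = {"1": 0.4, "2": 0.5, "3": 0.1}    # manual influence weights

rebalanced = {k: counts[k] / global_share[k] * target_weight[k] for k in counts}
print(rebalanced)  # Flickr's relative contribution is boosted, Instagram's reduced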
In [103]:
def norm_ip_origin(row):
    total_up = row['UP']
    # skip shapes without any posts
    if total_up == 0:
        return 0
    # get total counts per origin
    # normalize per search radius
    # flatten curve using square root
    instagram_up = np.sqrt(row['UP_Instagram']/row['area_skm_SR50'])
    flickr_up = np.sqrt(row['UP_Flickr']/row['area_skm_SR25'])
    twitter_up = np.sqrt(row['UP_Twitter']/row['area_skm_SR50'])
    total_up = instagram_up + flickr_up + twitter_up
    # calc perc per origin percentage
    # and apply manual weights
    instagram_up_norm = instagram_up / (origin_perc_total.get("1") / 100) * 0.4
    flickr_up_norm = flickr_up  / (origin_perc_total.get("2") / 100) * 0.5
    twitter_up_norm = twitter_up / (origin_perc_total.get("3") / 100) * 0.1
    sum_norm = instagram_up_norm + flickr_up_norm + twitter_up_norm
    return sum_norm
    
gdf['UP_Norm_Origin'] = gdf.apply(lambda row: norm_ip_origin(row), axis=1)
# gdf['UP_Norm_Origin'] = np.sqrt(gdf['UP_Norm_Origin']/gdf['poly_area'])
stretch_norm_dfcolumn(gdf, 'UP_Norm_Origin', 1, 1000)

Compare difference of UP_Norm and UP_Norm_Origin

In [104]:
# gdf[[TARGET_SHAPE_ID_COL, 'UP', 'UP_Norm', 'UP_Norm_Origin', 'UD', 'UPL', 'UP_Instagram', 'UP_Flickr', 'UP_Twitter']].sort_values('UP', ascending=False).head(20)
gdf[[TARGET_SHAPE_ID_COL, 'UP', 'UP_Norm', 'UP_Norm_Origin', 'UP_Instagram', 
     'UP_Flickr', 'UP_Twitter', 'area_skm', 'area_skm_SR25', 'area_skm_SR50']].sort_values('UP', ascending=False).head(20)
Out[104]:
LULC_ID UP UP_Norm UP_Norm_Origin UP_Instagram UP_Flickr UP_Twitter area_skm area_skm_SR25 area_skm_SR50
248618 dd_EBK-TSP-OSM_26382_2_1 57004 53.461750 369.938689 55207 1788 9 0.002721 0.013830 0.028080
182090 dd_EBK-TSP-OSM_26382_9_1 55367 84.977281 480.061435 52127 3235 5 0.001031 0.008031 0.018173
235404 dd_EBK-TSP-OSM_26382_9_4 53160 125.470855 442.073712 51912 1243 5 0.000451 0.005870 0.014814
112 dd_EBK-TSP-OSM_26382_2_13 52506 115.769344 411.428651 51646 855 5 0.000524 0.005867 0.014979
72017 dd_EBK-TSP-OSM_26382_2_7 52147 202.599161 427.210134 51646 496 5 0.000169 0.004202 0.012148
105777 dd_EBK-TSP-OSM_26382_7_4 52075 82.163515 498.159155 48868 3200 7 0.001039 0.008076 0.018265
119927 dd_EBK-TSP-OSM_30690_1_2 51810 150.307755 367.894779 51095 714 1 0.000305 0.004075 0.011761
156847 dd_EBK-TSP-OSM_26382_1_17 49804 113.144364 438.841253 48865 932 7 0.000520 0.005953 0.015088
2329 dd_EBK-TSP-OSM_26382_8_2 45420 77.333068 341.792179 41986 3434 0 0.001024 0.008096 0.018273
33693 dd_EBK-TSP-OSM_32865_1_1 45287 41.873304 818.249290 39994 5210 83 0.003561 0.011325 0.023009
135025 dd_EBK-TSP-OSM_20688_1_1 41149 53.517037 775.020907 38049 3022 78 0.001960 0.009784 0.021526
166968 dd_EBK-TSP-OSM_26382_2_12 40919 69.227221 365.229054 39538 1377 4 0.001155 0.008230 0.018734
244689 dd_EBK-TSP-OSM_23723_1_12 40605 105.151400 803.560374 38400 2138 67 0.000492 0.007030 0.017120
238384 dd_EBK-TSP-OSM_26382_3_3 40524 69.195812 306.329142 39171 1352 1 0.001145 0.008151 0.018600
102554 dd_EBK-TSP-OSM_23723_1_14 40450 40.883099 722.758891 37875 2497 78 0.003341 0.011319 0.023157
205135 dd_EBK-TSP-OSM_26382_8_1 40117 105.993831 290.257789 38674 1443 0 0.000478 0.005924 0.014877
264364 dd_EBK-TSP-OSM_26382_2_5 39607 205.066103 299.205147 38929 678 0 0.000125 0.003245 0.010274
139719 dd_EBK-TSP-OSM_26382_9_3 38538 553.253864 287.336545 38147 391 0 0.000017 0.002444 0.008791
132445 dd_EBK-TSP-OSM_26382_9_2 38453 1000.000000 280.882430 38147 306 0 0.000005 0.002209 0.008333
15067 dd_EBK-TSP-OSM_20261_1_1 38239 42.428103 447.089538 36116 2106 17 0.002927 0.011816 0.024618
  • rows with a more equal distribution of posts across Flickr, Twitter, and Instagram are increased in ranking/weight
  • rows with a distribution skewed towards a single data source (e.g. Instagram) are reduced in importance
In [105]:
gdf['UP_Norm_Origin'].hist(bins=20, alpha=0.5)
Out[105]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe91f1ed3c8>
In [106]:
# get top poly
top_norm = gdf.sort_values('UP_Norm_Origin', ascending=False).iloc[0]
top_norm_count = gdf.sort_values('UP', ascending=False).iloc[0]

# normalized
top_norm_poly = top_norm.geometry
top_norm_value = top_norm.UP_Norm_Origin

# raw
top_raw_poly_count = top_norm_count.geometry
top_raw_value_count = top_norm_count.UP

# construct geoviews layer
gv_sample_poly_top_norm = gv.Shape(
    top_norm_poly,
    label=f'meinGrün Top Target Polygon ({top_norm_value} Normalized Post Count based on Area and Service Origin)',
    crs=cartopy.crs.epsg(epsg_code))

# construct geoviews layer
gv_sample_poly_top_count = gv.Shape(
    top_raw_poly_count,
    label=f'meinGrün Top Target Polygon ({top_raw_value_count} Total Post Count)',
    crs=cartopy.crs.epsg(epsg_code))

# calculate boundary to show
xmin, ymin, xmax, ymax = add_buffer(top_norm_poly)
sample_bottomleft = pyproj.transform(
    pyproj.Proj(init=f'EPSG:{epsg_code}'), pyproj.Proj(init='EPSG:3857'), xmin, ymin)
sample_topright = pyproj.transform(
    pyproj.Proj(init=f'EPSG:{epsg_code}'), pyproj.Proj(init='EPSG:3857'), xmax, ymax)
In [107]:
hv.Overlay(
    gv.tile_sources.EsriImagery * \
    gv_sample_poly_top_count.opts(line_color='yellow', line_width=3, fill_alpha=0.2, show_legend=True) * \
    gv_sample_poly_top_norm.opts(line_color='red', line_width=3, fill_alpha=0.2, show_legend=True)
    ).opts(
        width=800,
        height=480,
        xlim=(sample_bottomleft[0], sample_topright[0]),
        ylim=(sample_bottomleft[1], sample_topright[1]),
        title_format=f"Selected Target Geometries for Testpoint and {SEARCH_RADIUS}m/50m Search Radius",
    )
Out[107]:

Visualize Results

Now we can plot the GeoDataFrame and color each polygon by the chosen measure:

Plot Sample Areas

In [108]:
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.patches import Patch

gdf_meingruen_official_targets = gdf.query(f'{GREEN_TARGET_COL} == 1')

# make a legend using a dictionary to assign colors
cmap = matplotlib.cm.get_cmap('Greens')
popularity_legend = {
    'very high': cmap(1000/1000), 
    'high': cmap(450/1000),
    'average': cmap(165/1000),
    'low': cmap(56/1000), 
    'very low': cmap(14/1000), 
    'no data': cmap(1)}

def add_patches(legend):
    ax = legend.axes
    handles, labels = ax.get_legend_handles_labels()
    labels.append("LBSM Popularity")
    handles.append(Patch(facecolor="none", fill=False, edgecolor='none', linewidth=0))
    for key in popularity_legend:
            data_key = Patch(edgecolor='none', facecolor=popularity_legend[key])
            handles.append(data_key)
            labels.append(key)
    labels.append("Official green target")
    handles.append(Patch(facecolor="none", fill=False, edgecolor='red', linewidth=0.7))

    legend._legend_box = None
    legend._init_legend_box(handles, labels)
    legend._set_loc(legend._loc)
    legend.set_title(legend.get_title().get_text())
    
def get_map(lim_lng_min, lim_lat_min, lim_lng_max, lim_lat_max, title, gdf, column, show_green_targets):
    """Plot area in matplotlib (pyplot).
    Note: Natural Breaks algorithm is used here to classify colors, remove to see raw values"""
    bb_bottomleft = pyproj.transform(pyproj.Proj(init='EPSG:4326'), pyproj.Proj(init=f'EPSG:{epsg_code}'), lim_lng_min, lim_lat_min)
    bb_topright = pyproj.transform(pyproj.Proj(init='EPSG:4326'), pyproj.Proj(init=f'EPSG:{epsg_code}'), lim_lng_max, lim_lat_max)
    fig, ax = plt.subplots()
    fig.set_size_inches(18.5, 10.5, forward=True)
    ax.set_title(f'LBSM Target Geometry Weights for {title}', fontsize=20)
    ax.set_xlim([bb_bottomleft[0],bb_topright[0]])
    ax.set_ylim([bb_bottomleft[1],bb_topright[1]])
    
    # plt.legend(handles=patch_list)
    plt.savefig('legend.png', bbox_inches='tight')
    base = gdf.plot(column=column, figsize=(20,10), cmap='Greens', ax=ax, scheme="natural_breaks", label='Normalized User Post Locations (UPL) based on Area', legend = True)
    if show_green_targets:
        lgd = ax.legend(loc='upper right')
        add_patches(lgd)
        gdf_meingruen_official_targets.plot(ax=base, facecolor="none", edgecolor='red', lw=0.7, label='Official green target', legend = True)
        # plot the data

def get_map_b(lim_lng_min, lim_lat_min, lim_lng_max, lim_lat_max, title, gdf, column, show_green_targets=False):
    """Plot area in matplotlib (pyplot).
    Note: Natural Breaks algorithm is used here to classify colors, remove to see raw values"""
    bb_bottomleft = pyproj.transform(pyproj.Proj(init='EPSG:4326'), pyproj.Proj(init=f'EPSG:{epsg_code}'), lim_lng_min, lim_lat_min)
    bb_topright = pyproj.transform(pyproj.Proj(init='EPSG:4326'), pyproj.Proj(init=f'EPSG:{epsg_code}'), lim_lng_max, lim_lat_max)
    axes = plt.gca()
    fig = plt.gcf()
    fig.set_size_inches(18.5, 10.5, forward=True)
    # fig.suptitle('LBSM Target Geometry Weights for Dresden Altstadt', fontsize=20)
    axes.set_title(f'LBSM Target Geometry Weights for {title}', fontsize=20)
    axes.set_xlim([bb_bottomleft[0],bb_topright[0]])
    axes.set_ylim([bb_bottomleft[1],bb_topright[1]])
    base = gdf.plot(column=column, figsize=(20,10), cmap='Greens', ax=axes, scheme="natural_breaks", label='Normalized User Post Locations (UPL) based on Area', legend = True)
    if show_green_targets:
        gdf_meingruen_official_targets.plot(ax=base, facecolor="none", edgecolor='red', lw=0.7, label='Official green target', legend = True)
    
    #gdf.to_crs({'init': 'epsg:3395'}).plot(column='UP_Norm', figsize=(20,10), cmap='Reds', ax=axes)
In [109]:
get_map(LIM_LNG_MIN, LIM_LAT_MIN, LIM_LNG_MAX, LIM_LAT_MAX, f'{CITY_NAME} {LEVEL} {TARGETSHAPE_VERSION}', gdf, 'UP_Norm_Origin', show_green_targets=False)
get_map(LIM_LNG_MIN_4, LIM_LAT_MIN_4, LIM_LNG_MAX_4, LIM_LAT_MAX_4, f"Heidelberg Center {LEVEL} {TARGETSHAPE_VERSION}", gdf, 'UP_Norm_Origin', show_green_targets=True)
In [110]:
get_map(LIM_LNG_MIN_1, LIM_LAT_MIN_1, LIM_LNG_MAX_1, LIM_LAT_MAX_1, f"Dresden Altstadt {LEVEL} {TARGETSHAPE_VERSION}", gdf, 'UP_Norm_Origin', show_green_targets=False)
In [111]:
get_map(LIM_LNG_MIN_2, LIM_LAT_MIN_2, LIM_LNG_MAX_2, LIM_LAT_MAX_2, f"Dresden Center {LEVEL} {TARGETSHAPE_VERSION}", gdf, 'UP_Norm_Origin', show_green_targets=False)
In [112]:
get_map(LIM_LNG_MIN_3, LIM_LAT_MIN_3, LIM_LNG_MAX_3, LIM_LAT_MAX_3, f"Dresden Neustadt {LEVEL} {TARGETSHAPE_VERSION}", gdf, 'UP_Norm_Origin', show_green_targets=False)

To save the figures above as SVG, run the following cell after each figure is generated:

In [113]:
plt.savefig(f"{CITY_NAME.lower()}_{LEVEL}_{TARGETSHAPE_VERSION}_A1.svg", format="svg")
<Figure size 432x288 with 0 Axes>

Optional: Plot Green Areas

gdf_filtered = gdf.query(f'{GREEN_TARGET_COL} == 1').copy()
# stretch_norm_dfcolumn(gdf_filtered, 'UP_Norm', 1, 1000)
# get_map(LIM_LNG_MIN_5, LIM_LAT_MIN_5, LIM_LNG_MAX_5, LIM_LAT_MAX_5, "Dresden", gdf_filtered, 'UP_Norm')
get_map_b(LIM_LNG_MIN_2, LIM_LAT_MIN_2, LIM_LNG_MAX_2, LIM_LAT_MAX_2, "Dresden Center", gdf_filtered, 'UP_Norm')

Natural Breaks Classification

Using pysal, it is possible to update our dataframe using a Natural Breaks classification.

  • see the pysal mapclassify documentation for other available classifiers
In [114]:
import pysal.viz.mapclassify as mc
# Define the number of classes
n_classes = 6
# Create a Natural Breaks classifier
classifier = mc.NaturalBreaks.make(k=n_classes)
# classifier = mc.HeadTailBreaks.make()
# Classify the data
classifications = gdf[['UP_Norm']].apply(classifier)
gdf['UP_Norm_nb'] = classifications
classifications = gdf[['UP_Norm_Origin']].apply(classifier)
gdf['UP_Norm_Origin_nb'] = classifications

We'll also add bin labels for describing classes:

In [115]:
classes_ref_5 = {
    5 : 'very high', 
    4 : 'high',
    3 : 'average',
    2 : 'low', 
    1 : 'very low',
    0 : 'no data'}

gdf['popularity'] = gdf['UP_Norm_Origin_nb'].map(classes_ref_5)
# make sure that the bin "no data" is only assigned to records without any posts
gdf.loc[(gdf.UP_Norm_Origin > 1) & (gdf.UP_Norm_Origin_nb == 0), 'popularity'] = 'very low'
gdf.loc[(gdf.UP_Norm > 1) & (gdf.UP_Norm_nb == 0), 'UP_Norm_nb'] = 1
gdf.loc[(gdf.UP_Norm_Origin > 1) & (gdf.UP_Norm_Origin_nb == 0), 'UP_Norm_Origin_nb'] = 1

Have a look at the values and where we came from (left to right):

In [118]:
print(gdf[['UP', 'UP_Norm', 'UP_Norm_nb', 'UP_Norm_Origin', 'UP_Norm_Origin_nb', 'popularity']].head(20))
print(gdf[['UP', 'UP_Norm', 'UP_Norm_nb', 'UP_Norm_Origin', 'UP_Norm_Origin_nb', 'popularity']].sort_values('UP_Norm', ascending=False).head(20))
     UP    UP_Norm  UP_Norm_nb  UP_Norm_Origin  UP_Norm_Origin_nb popularity
0     0   1.000000           0        1.000000                  0    no data
1     0   1.000000           0        1.000000                  0    no data
2     0   1.000000           0        1.000000                  0    no data
3     0   1.000000           0        1.000000                  0    no data
4     0   1.000000           0        1.000000                  0    no data
5     0   1.000000           0        1.000000                  0    no data
6     0   1.000000           0        1.000000                  0    no data
7     0   1.000000           0        1.000000                  0    no data
8     0   1.000000           0        1.000000                  0    no data
9     0   1.000000           0        1.000000                  0    no data
10    0   1.000000           0        1.000000                  0    no data
11  314   6.663382           1       30.494954                  1   very low
12    0   1.000000           0        1.000000                  0    no data
13  611  34.448285           2      208.307908                  3    average
14    0   1.000000           0        1.000000                  0    no data
15    0   1.000000           0        1.000000                  0    no data
16    8   4.492373           1        2.895190                  1   very low
17    1   2.587174           1        1.657375                  1   very low
18    0   1.000000           0        1.000000                  0    no data
19    3   3.399718           1        8.652027                  1   very low
           UP      UP_Norm  UP_Norm_nb  UP_Norm_Origin  UP_Norm_Origin_nb  \
132445  38453  1000.000000           5      280.882430                  4   
285693  34606   932.013209           5      223.947961                  3   
160705  13197   740.291677           5      440.145088                  4   
33544   12599   738.057831           5      516.452076                  5   
49720   13110   722.994056           5      422.241568                  4   
179427  12683   712.914202           5      498.342430                  5   
77199   13147   690.194638           5      428.560687                  4   
287744  11050   687.528537           5      412.914615                  4   
167696  12464   682.206098           5      489.713721                  5   
138028  10947   681.225994           5      412.120998                  4   
243684  12432   662.464682           5      411.260696                  4   
68385   12419   650.699024           5      471.793570                  5   
17454   12585   646.904871           5      508.041114                  5   
266015  12854   624.653644           5      447.129402                  4   
275376  12852   624.533082           5      447.117643                  4   
122762  12776   620.041834           5      440.655294                  4   
119574  14248   609.883898           5      161.575280                  3   
244894  11751   599.507929           5      593.967505                  5   
139719  38538   553.253864           5      287.336545                  4   
283010  10520   533.475642           5      407.724199                  4   

       popularity  
132445       high  
285693    average  
160705       high  
33544   very high  
49720        high  
179427  very high  
77199        high  
287744       high  
167696  very high  
138028       high  
243684       high  
68385   very high  
17454   very high  
266015       high  
275376       high  
122762       high  
119574    average  
244894  very high  
139719       high  
283010       high  
In [119]:
import seaborn as sns
sns.countplot(x='UP_Norm_Origin_nb', hue='UP_Norm_Origin_nb', data=gdf)
Out[119]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe90d044dd8>

Filter relevant target geometries

Create a geoviews layer for the final display. We'll filter segments by a threshold, so that only shapes with a minimum of popularity are shown (a minimal sketch for this follows after the next cell).

  • We can also filter based on the column TARGET_green_target
In [120]:
# on mixed-type input (e.g. 'Polygon', 'GeometryCollection'),
# apply the filter below to avoid a Geopandas exception
from shapely.geometry import Polygon, MultiPolygon

gdf_temp = gdf.loc[gdf['geometry'].apply(lambda x: isinstance(x, (Polygon, MultiPolygon)))]

gdf_filtered = gdf_temp.query(f'{GREEN_TARGET_COL} == 1').copy()
print(f'Number of total segments: {len(gdf)}. Number of filtered segments: {len(gdf_filtered)}')
Number of total segments: 290015. Number of filtered segments: 51645
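
The cell above filters on geometry type and the green-target column only. If you also want to apply the popularity threshold mentioned above, a minimal sketch (the threshold value is an arbitrary example, not taken from this notebook):

In [ ]:
# Optional sketch: drop shapes below a minimum popularity class.
# MIN_CLASS is an arbitrary example threshold (class bins range from 0 to 5).
MIN_CLASS = 1
gdf_filtered = gdf_filtered.loc[gdf_filtered['UP_Norm_Origin_nb'] >= MIN_CLASS].copy()
print(f'Segments above threshold: {len(gdf_filtered)}')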
In [121]:
gv_shapes_weighted = gv.Polygons(
    gdf_filtered, vdims=['UP_Norm_Origin_nb'], crs=ccrs.epsg(epsg_code))

Final Output:

  • if you want to output to an html file with a full-width map, uncomment:
    • %%output filename="dresden_streets" and
    • responsive=True
  • if you want to preview inside this jupyter notebook, leave the two lines above commented out and use the fixed dimensions:
    • width=800
    • height=480
In [ ]:
#%%output filename=f"{TODAY}_{CITY_NAME}_weighted_green_targets_{LEVEL}_{TARGETSHAPE_VERSION}"
from holoviews import dim
def set_active_tool(plot, element):
    # enable wheel_zoom by default
    plot.state.toolbar.active_scroll = plot.state.tools[0]

# the Overlay will automatically extend to the data dimensions;
# since we want the initial view to be zoomed in to the city center,
# we need to supply xlim and ylim.
# The Bokeh renderer only supports the Web Mercator projection,
# so we first need to project the bounding-box coordinates
# from WGS84 decimal degrees to Web Mercator

bb_bottomleft = pyproj.transform(pyproj.Proj(init='EPSG:4326'), pyproj.Proj(init='EPSG:3857'), LIM_LNG_MIN, LIM_LAT_MIN)
bb_topright = pyproj.transform(pyproj.Proj(init='EPSG:4326'), pyproj.Proj(init='EPSG:3857'), LIM_LNG_MAX, LIM_LAT_MAX)

hv.Overlay(
    gv.tile_sources.EsriImagery.opts(alpha=0.65) * \
    gv_shapes_weighted.opts(line_width=0.1, cmap='Greens', color_index='UP_Norm_Origin_nb', colorbar=True, fill_alpha=0.9, color_levels=5)
    # gv_shapes_weighted.opts(line_width=1+np.sqrt(dim('UP_Sqr')), cmap='Reds', color_index='UP_Norm', colorbar=True)
    ).opts(
        title_format=f"Weighted meinGrün Target Shapes {TARGETSHAPE_VERSION} for {CITY_NAME} and {LEVEL} based on LBSM user-post-location proximity (based on {len(df)} posts from Flickr, Instagram, and Twitter)",
        #responsive=True,
        finalize_hooks=[set_active_tool],
        width=800,
        height=480,
        xlim=(bb_bottomleft[0],bb_topright[0]),
        ylim=(bb_bottomleft[1],bb_topright[1])
    )
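
Note that the Proj(init=...) / pyproj.transform style used above is deprecated in recent pyproj releases. If you run a newer pyproj, an equivalent sketch using the Transformer API (always_xy=True keeps the lng/lat argument order) would be:

In [ ]:
# Sketch for newer pyproj versions: project the bounding box from
# WGS84 (EPSG:4326) to Web Mercator (EPSG:3857) with the Transformer API.
from pyproj import Transformer

transformer = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)
bb_bottomleft = transformer.transform(LIM_LNG_MIN, LIM_LAT_MIN)
bb_topright = transformer.transform(LIM_LNG_MAX, LIM_LAT_MAX)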

Export Results (shapefile, CSV)

Have a look at the header first:

In [123]:
gdf.head()
Out[123]:
OBJECTID PARENT_TARGET_green_target LULC_ID geometry UP UP_Instagram UP_Flickr UP_Twitter UD UPL area_skm area_skm_SR25 area_skm_SR50 UP_Norm UP_Norm_Origin UP_Norm_nb UP_Norm_Origin_nb popularity
0 1 False dd_OSM_57149_5_22 (POLYGON ((393349.4458630043 5652514.374407471... 0 0 0 0 0 0 0.001262 0.010003 0.022494 1.0 1.0 0 0 no data
1 283517 False dd_EBK-TSP-OSM_27937_2_3 (POLYGON ((401577.7054682793 5656972.97522185,... 0 0 0 0 0 0 0.000088 0.003105 0.010027 1.0 1.0 0 0 no data
2 116638 False dd_OSM_54537_2_1 (POLYGON ((407212.8660183108 5650545.754251033... 0 0 0 0 0 0 0.000030 0.002544 0.008978 1.0 1.0 0 0 no data
3 73544 False dd_EBK-TSP-OSM_30836_1_19 (POLYGON ((420314.3206534541 5656666.771566805... 0 0 0 0 0 0 0.000092 0.003036 0.009900 1.0 1.0 0 0 no data
4 192364 False dd_EBK-TSP-OSM_31363_18_1 (POLYGON ((413028.7688842474 5652131.050560708... 0 0 0 0 0 0 0.000480 0.007524 0.018478 1.0 1.0 0 0 no data

Select columns to export

In [124]:
sel_cols = [TARGET_SHAPE_ID_COL, 'geometry', 'UP', 'UP_Flickr', 'UP_Twitter', 'UP_Instagram', 'UD', 'UPL', 'UP_Norm', 'UP_Norm_Origin', 'popularity']

Write to Shapefile

In [125]:
# UD and UPL are not part of the shapefile schema defined below; drop them if present
if 'UD' in sel_cols:
    sel_cols.remove('UD')
if 'UPL' in sel_cols:
    sel_cols.remove('UPL')
print(sel_cols)
gdf_filtered_columns = gdf_temp[sel_cols].copy()

# define column types manually
# gdf_filtered_columns = gdf[sel_cols].copy()
# note that for clarity we rename UP to UPL (UserPostLocation) here
# gdf_filtered_columns.rename(columns = {'UP':'UPL'}, inplace = True)
filename = f"{TODAY}_meingruenshapes_{TARGETSHAPE_VERSION}_weighted_{CITY_NAME.lower()}_{LEVEL}"
schema = {'geometry': 'Polygon',
          'properties': {
              TARGET_SHAPE_ID_COL: 'str',
              'UP': 'int',
              'UP_Flickr': 'int',
              'UP_Twitter': 'int',
              'UP_Instagram': 'int',
              # 'UD': 'int',
              # 'UPL': 'int',
              'UP_Norm': 'float',
              'UP_Norm_Origin': 'float',
              'popularity': 'str'}
         }
gdf_filtered_columns.to_crs({'init': 'epsg:4326'}).to_file(driver = 'ESRI Shapefile', schema=schema, filename= f"{filename}.shp")
['LULC_ID', 'geometry', 'UP', 'UP_Flickr', 'UP_Twitter', 'UP_Instagram', 'UP_Norm', 'UP_Norm_Origin', 'popularity']
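
Keep in mind that the Shapefile format truncates attribute names to 10 characters, so columns such as UP_Norm_Origin and UP_Instagram will be shortened in the exported .dbf. If that is an issue, a sketch exporting to GeoPackage instead (no manual schema needed, column names are preserved):

In [ ]:
# Optional sketch: export to GeoPackage to avoid the 10-character
# field-name limit of the Shapefile format.
gdf_filtered_columns.to_crs({'init': 'epsg:4326'}).to_file(
    f"{filename}.gpkg", driver="GPKG")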

Write to CSV

In [93]:
if 'geometry' in sel_cols:
    sel_cols.remove('geometry')
sel_headers = [TARGET_SHAPE_ID_COL,'UP', 'UP_Flickr', 'UP_Twitter', 'UP_Instagram', 'UD', 'UPL', 'UP_Norm', 'UP_Norm_Origin', 'popularity']

# keep the header list aligned with the exported columns
if 'UD' not in sel_cols:
    sel_headers.remove('UD')
if 'UPL' not in sel_cols:
    sel_headers.remove('UPL')
gdf.to_csv(f'{filename}.csv',
           columns=sel_cols, header=sel_headers,
           index=False)
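
As a final optional check (sketch), you can read the CSV back and compare the popularity class counts with the GeoDataFrame that was exported:

In [ ]:
# Optional sketch: re-read the exported CSV and compare class counts.
csv_check = pd.read_csv(f'{filename}.csv')
print(csv_check['popularity'].value_counts())
print(gdf['popularity'].value_counts())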
In [ ]: