Weighting meinGrün Zielgeometrien from geolocated Social Media posts for measuring greenspace popularity

This notebook contains the complete process of applying weights, derived from LBSM data (geolocated social media posts), to meinGrün Zielgeometrien.

Intersecting points with polygons is not complex in itself. However, there are several potential pitfalls in the process.

Pitfalls:

  • to improve representativity, we're using different data sources (Twitter, Instagram, Flickr)
    • all of these have different compositions, biases, and sources of noise
  • for coordinate mapping, a primary source of noise is the differing geoaccuracy (gps/'latlng' vs. 'place')
    • Flickr has the highest geoaccuracy, while Twitter and Instagram mostly have geoaccuracy 'place'
    • this must be taken into account when assigning coordinates to different shapes (we'll be using a SEARCH_RADIUS of 25 m and 50 m)
  • the overall quantity of photos is spread quite differently
    • Instagram makes up about 80% of the data, Flickr about 20%, and Twitter less than 1%
    • normalization of the three data sources is necessary so that each source has a balanced impact on the final result
    • Twitter may introduce biases due to its low amount of data, which can be taken into account by reducing Twitter's overall impact
  • normalization on area is necessary, otherwise large shapes would dominate because of their higher likelihood of intersecting with many coordinates
    • area must be calculated for different search distances used per origin:
      • 50m buffer area for Twitter/Instagram coordinates
      • 25m buffer area for Flickr coordinates
  • long tail: typically, Social Media posts peak at a few places; but we're also interested in the 'long tail' of the data (the other 80% of less frequented places)
    • we can apply log or sqrt to flatten this peak and improve visibility of less frequented places
  • finally we need to classify and label data into arbitrary classes
    • after normalization, absolute numbers are meaningless, therefore we can 'stretch' results per shape to a 1 to 1000 range to improve comparability across runs (see the sketch after this list)
    • based on the 1 to 1000 range, we can apply natural_breaks or other classifiers to label bins of common classes (e.g. 6 bins are used herein)
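The following short sketch illustrates the core of this weighting chain with toy numbers only (hypothetical values, not the real data); the full implementation, including the per-origin handling, follows later in this notebook.

In [ ]:
# toy sketch of the weighting chain: area normalization, sqrt flattening,
# and stretching to a 1 to 1000 range (hypothetical values, for illustration only)
import numpy as np

post_counts = np.array([1200, 310, 35, 4, 0])          # hypothetical posts per shape
areas_skm = np.array([0.80, 0.60, 0.05, 0.01, 0.30])   # hypothetical shape areas (km²)

weights = np.sqrt(post_counts / areas_skm)             # normalize by area, flatten the peak
# stretch to 1..1000 so absolute numbers become comparable across runs
weights = np.interp(weights, (weights.min(), weights.max()), (1, 1000))
print(weights.round(1))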

Let's start by importing the necessary dependencies. Make sure you have these packages installed (use miniconda with packages from the conda-forge and pyviz channels!), then execute the following code block.

In [1]:
import fiona
import geoviews as gv
import geoviews.feature as gf
from pathlib import Path
from shapely.geometry import shape
import geopandas as gp
import pandas as pd
from shapely.geometry import Point
import holoviews as hv
import numpy as np
from geopandas.tools import sjoin
# enable shapely.speedups, which makes some of the spatial queries run faster
import shapely.speedups
from cartopy import crs as ccrs
from collections import defaultdict
from collections import namedtuple
hv.notebook_extension('bokeh')
shapely.speedups.enable()

Because we're going to display pretty large maps, graphics, and interactive visualizations in this notebook, the following line may improve Jupyter's responsiveness.

In [81]:
%config Completer.use_jedi = False

Define Input Parameters

This section contains the only manual code edits that are needed to run this notebook.

  • ZIELGEOMETRIEN_DATA: the GeoJSON containing the Zielgeometrien (polygons); social media posts will be used to calculate weights for each polygon in this GeoJSON
  • LBSM_INPUT_DATA: the social media data used to calculate weights, i.e. a collection of post records with lat-lng coordinates. This file was generated using the tagmaps package, which applies filters and cleans up noisy social media raw data

The files should be placed in a subfolder called "./01_Input/" (relative to this notebook).

In [84]:
TARGETSHAPE_VERSION = 'v4.1'

# Level 0:
# LEVEL = "L0"
# TARGET_SHAPE_ID_COL = "TARGET_ID"
# GREEN_TARGET_COL = "TARGET_green_target"
# ZIELGEOMETRIEN_DATA = "GeoJSON_Dresden_4.1/targets.json"
# ZIELGEOMETRIEN_DATA = "GeoJSON_Heidelberg/targets.json"

# Level 1 (Wege):
# LEVEL = "L1"
# TARGET_SHAPE_ID_COL = "WAY_ID"
# GREEN_TARGET_COL = "PARENT_TARGET_green_target"
# ZIELGEOMETRIEN_DATA = "GeoJSON_Dresden_4.1/targetParts_Wege.json"
# ZIELGEOMETRIEN_DATA = "GeoJSON_Heidelberg/targets_Wege.json"

# Level 2 (Wege LC):
LEVEL = "L2"
TARGET_SHAPE_ID_COL = "LULC_ID"
GREEN_TARGET_COL = "PARENT_TARGET_green_target"
ZIELGEOMETRIEN_DATA = "GeoJSON_Dresden_4.1/targetParts_Wege_LULC.json"
# ZIELGEOMETRIEN_DATA = "GeoJSON_Heidelberg/targets_Wege.json"

# LBSM_INPUT_DATA = "Output_cleaned_Dresden_FlickrOnly.csv"
# LBSM_INPUT_DATA = "Output_cleaned_Heidelberg_FlickrOnly.csv"
# LBSM_INPUT_DATA = "2019-09-03_meinGruen_Heidelberg.csv"
LBSM_INPUT_DATA = "2019-09-02_meinGruen_Dresden.csv"
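Optional sanity check (a small sketch, assuming the folder layout described above): verify that both input files are present before running the rest of the notebook.

In [ ]:
# optional: check that the configured input files exist in ./01_Input/
from pathlib import Path

input_folder = Path.cwd() / "01_Input"
for file_name in (ZIELGEOMETRIEN_DATA, LBSM_INPUT_DATA):
    file_path = input_folder / file_name
    print(f'{file_path.name} exists: {file_path.exists()}')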
In [3]:
import datetime as dt
def get_str_formatted_today():
    """Returns date as string (YYYY-mm-dd)"""
    today = dt.date.today()
    today_str = today.strftime("%Y-%m-%d")
    return today_str
TODAY = get_str_formatted_today()

Define manual boundaries for map zooms

Define the initial extent the map should be zoomed to. If empty, the map will show the full boundary of the provided target geometries (which may result in slow performance of the final map). To select boundary lat/lng coordinates, use (online tool link).

In [4]:
# Dresden Altstadt
LIM_LNG_MAX_1 = 13.750784
LIM_LNG_MIN_1 = 13.730367
LIM_LAT_MAX_1 = 51.055349
LIM_LAT_MIN_1 = 51.048746

# Dresden Center
LIM_LNG_MAX_2 = 13.788357
LIM_LNG_MIN_2 = 13.706689
LIM_LAT_MAX_2 = 51.060386
LIM_LAT_MIN_2 = 51.033973

# Dresden Neustadt:
LIM_LNG_MAX_3 = 13.782692
LIM_LNG_MIN_3 = 13.701024
LIM_LAT_MAX_3 = 51.087890
LIM_LAT_MIN_3 = 51.061492

# Heidelberg Center
LIM_LNG_MAX_4 = 8.73
LIM_LNG_MIN_4 = 8.68
LIM_LAT_MAX_4 = 49.420
LIM_LAT_MIN_4 = 49.404

# Total Bounds
LIM_LNG_MAX_5 = 14.0876945
LIM_LNG_MIN_5 = 13.46475261
LIM_LAT_MAX_5 = 51.24305035
LIM_LAT_MIN_5 = 50.90449772

Define the maximum search radius (in meters) to be used when intersecting polygons with points, and provide a city name (used when writing data):

In [5]:
SEARCH_RADIUS_MAX =  50
CITY_NAME = "Dresden"
# CITY_NAME = "Heidelberg"

Load Shape Data

Load GeoJSON

In [6]:
geom_input_filepath = Path.cwd() / "01_Input" / ZIELGEOMETRIEN_DATA
In [7]:
%%time
meingruen_targetgeom_shapes = gp.read_file(geom_input_filepath)
total_bounds = meingruen_targetgeom_shapes.total_bounds
print(f'Total bounds: {total_bounds}')
print(f'Number of shapes: {len(meingruen_targetgeom_shapes)}')
Total bounds: [13.46475261 50.90449772 14.0876945  51.24305035]
Number of shapes: 290015
CPU times: user 56.2 s, sys: 3.66 s, total: 59.9 s
Wall time: 1min

Assign bounds to global variables so we can use them later.

In [8]:
LIM_LNG_MIN, LIM_LAT_MIN, LIM_LNG_MAX, LIM_LAT_MAX = total_bounds
  • Project from WGS1984 (4326) to World Mercator (epsg:3395)
In [9]:
gdf_targetgeom_shapes = meingruen_targetgeom_shapes.to_crs({'init': 'epsg:3395'})

Preview Shape Table

In [10]:
gdf_targetgeom_shapes.head()
Out[10]:
OBJECTID PARENT_TARGET_ID PARENT_TARGET_green_target PARENT_TARGET_gemeinde_DD PARENT_WAY_ID LULC_ID LULC_origin LULC_access LULC_name LULC_type ... classNoData_ratio class0_area class1_area class2_area class3_area class4_area classNoData_area SHAPE_Length SHAPE_Area geometry
0 1 dd_OSM_57149 False False dd_OSM_57149_5 dd_OSM_57149_5_22 osm None grass ... 0.721358 162.244025 166.413835 NaN NaN 23.679440 910.435348 0.003540 1.617494e-07 (POLYGON ((1500542.089541753 6590588.524965676...
1 283517 dd_EBK-TSP-OSM_27937 False True dd_EBK-TSP-OSM_27937_2 dd_EBK-TSP-OSM_27937_2_3 osm_building None building ... NaN 77.572044 NaN NaN NaN 10.155274 NaN 0.000570 1.124700e-08 (POLYGON ((1513460.81741509 6597924.052811448,...
2 116638 dd_OSM_54537 False False dd_OSM_54537_2 dd_OSM_54537_2_1 osm None building ... NaN 0.174807 28.345594 NaN NaN 1.379752 NaN 0.000265 3.828652e-09 (POLYGON ((1522592.089413918 6587890.020932667...
3 73544 dd_EBK-TSP-OSM_30836 False True dd_EBK-TSP-OSM_30836_1 dd_EBK-TSP-OSM_30836_1_19 osm_building None building ... NaN 81.507395 NaN NaN NaN 10.156085 NaN 0.000464 1.175166e-08 (POLYGON ((1543222.089401244 6597951.100413561...
4 192364 dd_EBK-TSP-OSM_31363 False True dd_EBK-TSP-OSM_31363_18 dd_EBK-TSP-OSM_31363_18_1 osm_landuse Kleingartenverein Spitzwegstraße 1 e.V. allotments ... NaN 430.625513 NaN NaN NaN 49.888683 NaN 0.001968 6.154865e-08 (POLYGON ((1531773.928959356 6590565.064079593...

5 rows × 29 columns

Remove unnecessary columns

There are many columns in this dataset that are of no interest to the computation in this notebook. Have a look at what's there:

In [11]:
list(gdf_targetgeom_shapes.columns.values)
Out[11]:
['OBJECTID',
 'PARENT_TARGET_ID',
 'PARENT_TARGET_green_target',
 'PARENT_TARGET_gemeinde_DD',
 'PARENT_WAY_ID',
 'LULC_ID',
 'LULC_origin',
 'LULC_access',
 'LULC_name',
 'LULC_type',
 'LULC_type_ger',
 'LULC_rank',
 'LULC_AREA',
 'LULC_LENGTH',
 'class0_ratio',
 'class1_ratio',
 'class2_ratio',
 'class3_ratio',
 'class4_ratio',
 'classNoData_ratio',
 'class0_area',
 'class1_area',
 'class2_area',
 'class3_area',
 'class4_area',
 'classNoData_area',
 'SHAPE_Length',
 'SHAPE_Area',
 'geometry']

Drop columns that are not needed:

In [12]:
def try_drop_cols_df(gdf, drop_cols):
    """Drop columns from pandas dataframe with try..except"""
    for drop_col in drop_cols:
        # apply drop
        try:
            gdf.drop(
                columns=[drop_col], inplace=True)
        except KeyError:
            pass
        
drop_list = ['class0_ratio', 'class1_ratio', 'class2_ratio', 'class3_ratio',
             'class4_ratio', 'classNoData_ratio', 'class0_area', 'class1_area',
             'class2_area', 'class3_area', 'class4_area', 'classNoData_area',
             'SHAPE_Length', 'SHAPE_Area', 'WAY_AREA', 'WAY_LENGTH',
             'PARENT_TARGET_ID', 'PARENT_TARGET_gemeinde_DD', 'PARENT_WAY_ID',
             'LULC_origin', 'LULC_access', 'LULC_name', 'LULC_type',
             'LULC_type_ger', 'LULC_rank', 'LULC_AREA', 'LULC_LENGTH']

# apply drop
try_drop_cols_df(gdf_targetgeom_shapes, drop_list)

Check the new list of remaining columns

In [18]:
list(gdf_targetgeom_shapes.columns.values)
Out[18]:
['OBJECTID', 'PARENT_TARGET_green_target', 'LULC_ID', 'geometry']

Optional: Preview Plot with Geopandas directly

In [13]:
gdf_targetgeom_shapes.geometry.plot(figsize=(20,10))
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe9aa232c50>

Print bounds of target shapes (World Mercator projection)

In [14]:
gdf_targetgeom_shapes.geometry.total_bounds
Out[14]:
array([1498889.40453656, 6571235.69297612, 1568234.9781912 ,
       6631058.34569445])

Projections UTM

There are several projections used in this notebook:

  • WGS1984 (epsg:4326) for input/output data in decimal degrees
  • World Mercator (epsg:3395) for map display in Geoviews/Bokeh
  • UTM projected coordinate system for local distance calculations

For accurate distance calculations in meters, we need to project the data to UTM. The tagmaps package has a method that returns the best UTM projection for a given lat-lng coordinate:

In [15]:
from tagmaps.classes.utils import Utils
In [16]:
from shapely.geometry import Point, Polygon
import pyproj
def get_poly_from_bounds(bounds):
    '''Construct a bounding-box polygon from the given bounds'''
    xmin, ymin, xmax, ymax = bounds
    bbox_polygon = Polygon([[xmin, ymax], [xmax, ymax], [xmax, ymin], [xmin, ymin]])
    return bbox_polygon

bound_poly = get_poly_from_bounds(meingruen_targetgeom_shapes.geometry.total_bounds)
print(f'input_lon_center: {bound_poly.centroid.x}')
print(f'input_lat_center: {bound_poly.centroid.y}')
epsg_code = Utils._convert_wgs_to_utm(
            bound_poly.centroid.x, bound_poly.centroid.y)
crs_proj = pyproj.Proj(init=f'epsg:{epsg_code}')
crs_wgs = pyproj.Proj(init=f'epsg:4326')

# report
print(f'Target EPSG Code: {epsg_code}')
print(f'Target projection: {crs_proj}')
input_lon_center: 13.776223556499998
input_lat_center: 51.07377403249999
Target EPSG Code: 32633
Target projection: Proj('+proj=utm +zone=33 +datum=WGS84 +units=m +no_defs', preserve_units=True)
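If tagmaps is not available, the UTM EPSG code can also be derived directly from the centroid coordinates. The following is a minimal sketch of the standard 6-degree zone formula (not the tagmaps implementation, and ignoring the special zones around Norway/Svalbard):

In [ ]:
# fallback sketch: derive the UTM EPSG code from a WGS84 lon/lat pair
import math

def wgs_to_utm_epsg(lon, lat):
    """Return the EPSG code of the UTM zone containing lon/lat (WGS84)."""
    zone = int(math.floor((lon + 180) / 6) + 1)
    # EPSG 326xx = northern hemisphere, 327xx = southern hemisphere
    return (32600 if lat >= 0 else 32700) + zone

print(wgs_to_utm_epsg(bound_poly.centroid.x, bound_poly.centroid.y))  # 32633 for Dresden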

Optional: Plot Preview in Geoviews with Bokeh Backend

Convert to Geoviews layer for plotting

In [17]:
import cartopy
from shapely.geometry import Polygon, MultiPolygon
from shapely.geometry.collection import GeometryCollection
# project to the best-fit UTM coordinate system (epsg_code)
gdf_targetgeom_shapes.to_crs({'init': f'epsg:{epsg_code}'}, inplace=True)
# convert to Geoviews Shapes
gv_shapes = gv.Polygons([gv.Shape(feat) for feat in gdf_targetgeom_shapes.geometry if isinstance(feat, (Polygon, MultiPolygon))], crs=cartopy.crs.epsg(epsg_code))
# Alternative does not work:
# gv_shapes = gv.Path(
#    osm_highway_shapes.head(50).geometry, kdims=['x', 'y'], vdims=['@osmid', 'highway']).opts(tools=['hover'])

Plot Rasterized static with datashade

  • use datashade, faster display but lower quality
In [15]:
from holoviews.operation.datashader import datashade, shade, dynspread, rasterize
In [16]:
datashade(gv_shapes).opts(width=800, height=480)
<string>:6: RuntimeWarning: Converting input from bool to <class 'numpy.uint8'> for compatibility.
Out[16]:

Plot interactive rasterized

In [ ]:
%%opts Image [colorbar=True clipping_colors={'NaN': (0, 0, 0, 0)}]
# %%opts Image [clipping_colors={'NaN': 'transparent'}]
# see https://anaconda.org/jbednar/datashade_vs_rasterize/notebook

original = rasterize(gv_shapes, precompute=True)
# clipped  = original.redim.range(z=(0.1, 1)).opts(clipping_colors={'min': 'transparent', 'NaN': 'transparent'})

# transparent alternative (a little bit slower)
from holoviews.operation.datashader import rasterize, dynspread, datashade
clipped = dynspread(datashade(gv_shapes))

gv.tile_sources.EsriImagery * \
clipped.opts(
    width=800,
    height=480,
    alpha=0.8)

Intersection

For assigning points to polygons, there are several options available:

  • use geopandas.sjoin (unfortunately, no distance-based search radius available; see the sketch after this list)
  • use spatial rtree and best fit custom method
  • use from shapely.ops import nearest_points

Since it is the fastest and most versatile, we'll explore the second option.
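For reference, option one would look roughly like the sketch below (assuming the point GeoDataFrame lbsm_data_gdf that is built further down, reprojected to the same CRS as the target shapes). It only yields exact point-in-polygon matches and offers no per-origin search radius, which is why we use the rtree approach instead.

In [ ]:
# rough sketch of option 1 (geopandas.sjoin), shown for comparison only;
# assumes lbsm_data_gdf (built below) and gdf_targetgeom_shapes share one CRS
joined = sjoin(lbsm_data_gdf, gdf_targetgeom_shapes, how='inner', op='intersects')
# posts per target shape (exact intersection, no search radius)
posts_per_shape = joined.groupby('index_right').size()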

1. Load LBSM Data

Load the CSV of LBSM points into a pandas dataframe; two formats are available:

  • Raw data (Instagram, Flickr, Twitter)
  • Cleaned format merged based on UPL (Flickr only, because Instagram/Twitter are not available at the highest geo-accuracy).

First, detect the format.

In [19]:
import csv
file_input_path = Path.cwd() / "01_Input" / LBSM_INPUT_DATA
with open(file_input_path, 'r') as infile:
    reader = csv.DictReader(infile)
    fieldnames = reader.fieldnames
# there are two different formats of input data:
# cleaned_posts (UPL)
# and raw data (UP)
# cleaned posts have a slightly different format
# which is detected here based on the CSV's header
if 'latitude' in fieldnames:
    raw_format = True
    usecols = ['origin_id', 'latitude', 'longitude', 'user_guid', 'post_create_date', 'post_publish_date']
    dtypes={'origin_id': str,'latitude': float, 'longitude': float, 'user_guid': str, 'post_create_date': str, 'post_publish_date': str}
elif 'lat' in fieldnames:
    raw_format = False
    usecols = ['origin_id', 'lat', 'lng', 'user_guid', 'post_create_date', 'post_publish_date']
    dtypes={'origin_id': str,'lat': float, 'lng': float, 'user_guid': str, 'post_create_date': str, 'post_publish_date': str}
else:
    print("Unknown Format")

Load data in dataframe df

In [20]:
%%time
df = pd.read_csv(file_input_path, usecols=usecols, dtype=dtypes, encoding='utf-8')
print(f'Loaded {len(df)} posts.')
Loaded 991117 posts.
CPU times: user 2.69 s, sys: 609 ms, total: 3.3 s
Wall time: 3.32 s

For Flickr, both post_create_date and post_publish_date are available; for Instagram and Twitter, only post_publish_date is available.

  • Combine both, with a preference for post_create_date
  • wrapped in try..except so the cell can be executed multiple times
  • rename final columns
In [71]:
try:
    df['post_create_date'].fillna(df['post_publish_date'], inplace=True)
    df.drop(columns=['post_publish_date'], inplace=True)
    df.rename(columns={'post_create_date': 'post_date'}, inplace=True)
except KeyError:
    pass
In [72]:
# df.iloc[: , [5, 13]].head()
df.head()
Out[72]:
origin_id latitude longitude user_guid post_date geometry
0 3 51.040557 13.731987 2de2c7b4c046fb9c2a7fcfeaa479356d 2016-06-10 10:46:41 POINT (411103.140159386 5655099.9591291)
1 1 51.091362 13.697633 145301fdbcf5d5261ff46d63e0962586 2015-10-09 16:11:39 POINT (408794.6639901432 5660791.445413328)
2 1 51.042050 13.797690 df8244ec7295c142668fbcb530b6c2cc 2018-02-06 00:41:07 POINT (415711.9591703562 5655188.755580138)
3 1 51.102810 13.686440 98d6841fdbfef53a01b6247d70bae2bc 2018-03-11 18:48:22 POINT (408033.5643395127 5662078.377559571)
4 1 51.044881 13.735635 5832633cf2de965477d171bdca251eb5 2016-05-16 09:56:44 POINT (411367.1581114109 5655576.364487658)

Optional: Get some statistics for origin

  • 1 = Instagram
  • 2 = Flickr
  • 3 = Twitter
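For readable labels, the numeric origin_id can be mapped to the service names from the list above. This small optional helper (an assumption added for convenience, not used elsewhere in the notebook) prints counts per service name:

In [ ]:
# optional: map numeric origin_id to service names for readable output
ORIGIN_NAMES = {"1": "Instagram", "2": "Flickr", "3": "Twitter"}
print(df['origin_id'].map(ORIGIN_NAMES).value_counts())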
In [21]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams.update({'xtick.labelsize': 'small'})
plt.rcParams.update({'ytick.labelsize': 'small'})
plt.rc('legend', fontsize=10)

sns.countplot(x='origin_id', hue='origin_id', data=df)
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe905386518>
In [22]:
size_count = df.groupby('origin_id').size()
total_count = len(df)
for key, value in size_count.items():
    print(f'Origin ID {key} with {value} posts, total percentage is: {value/(total_count/100):.2f}%')
Origin ID 1 with 806181 posts, total percentage is: 81.34%
Origin ID 2 with 183249 posts, total percentage is: 18.49%
Origin ID 3 with 1687 posts, total percentage is: 0.17%

2. Convert pandas dataframe to geopandas geodataframe

In [77]:
from shapely.geometry import Point
    
def make_point(row):
    if raw_format:
        return Point(row.longitude, row.latitude)
    return Point(row.lng, row.lat)

# Go through every row, and make a point out of its lat and lon
points = df.apply(make_point, axis=1)

# Make a new GeoDataFrame
# using the data from our old df
# but also adding in the geometry we just made
lbsm_data_gdf = gp.GeoDataFrame(df, geometry=points)
# set input projection
lbsm_data_gdf.crs = {'init' :'epsg:4326'}
# project to UTM
lbsm_data_gdf.to_crs({'init': f'epsg:{epsg_code}'}, inplace=True)
In [79]:
lbsm_data_gdf.head()
Out[79]:
origin_id latitude longitude user_guid post_date geometry
0 3 51.040557 13.731987 2de2c7b4c046fb9c2a7fcfeaa479356d 2016-06-10 10:46:41 POINT (411103.140159386 5655099.9591291)
1 1 51.091362 13.697633 145301fdbcf5d5261ff46d63e0962586 2015-10-09 16:11:39 POINT (408794.6639901432 5660791.445413328)
2 1 51.042050 13.797690 df8244ec7295c142668fbcb530b6c2cc 2018-02-06 00:41:07 POINT (415711.9591703562 5655188.755580138)
3 1 51.102810 13.686440 98d6841fdbfef53a01b6247d70bae2bc 2018-03-11 18:48:22 POINT (408033.5643395127 5662078.377559571)
4 1 51.044881 13.735635 5832633cf2de965477d171bdca251eb5 2016-05-16 09:56:44 POINT (411367.1581114109 5655576.364487658)

Optional: Preview Intersection for a Single Target Geometry Polygon

Let's test the intersection for a small sample:

  • 1 polygon from the meinGrün Target Geometries
  • 1 LBSM post

a) Get one polygon from meinGrün Target Geometries

  • modify the number to get different samples
In [47]:
# Get row for specific ID
# id = 9
# id = 20
# id = 26
id = 29
gdf_targetgeom_shapes.iloc[id].geometry
Out[47]:

It is possible to get the bounds of the selected polygon by:

In [48]:
gdf_targetgeom_shapes.iloc[id].geometry.bounds
Out[48]:
(416826.3844851471, 5654901.461591735, 417052.89840409183, 5655199.953776324)

b) Prepare poly geoviews layer

In [53]:
gv_sample_poly = gv.Shape(
    gdf_targetgeom_shapes.iloc[id].geometry,
    label=f'meinGrün Target Polygon', crs=cartopy.crs.epsg(epsg_code))
gv_sample_poly_buf = gv.Shape(
    gdf_targetgeom_shapes.iloc[id].geometry.buffer(50, resolution=3, cap_style=1, join_style=1, mitre_limit=5.0)
    ,
    label=f'meinGrün Target Polygon', crs=cartopy.crs.epsg(epsg_code))

c) Optional: Select sample points by bounding box (slow)

In [50]:
%%time
from shapely.geometry import Point, Polygon

def add_buffer(geom, buffer_num=50, return_poly=False):
    '''Add small buffer around poly bounds
    Buffer defaults to: 50m
    ''' 
    xmin, ymin, xmax, ymax = geom.bounds
    xmin = xmin - buffer_num
    ymin = ymin - buffer_num
    xmax = xmax + buffer_num
    ymax = ymax + buffer_num
    if return_poly:
        bbox_polygon = Polygon(
            [[xmin, ymax],
             [xmax, ymax],
             [xmax, ymin],
             [xmin, ymin]])
        return bbox_polygon
    return xmin, ymin, xmax, ymax

# adjust number for sample polygon
sample_polygon = gdf_targetgeom_shapes.iloc[id]
xmin, ymin, xmax, ymax = add_buffer(
    sample_polygon.geometry)

# intersect/clip points based on bbox input
# this is very slow! we'll use an improved version for mass intersection
sample_point_gdf = lbsm_data_gdf.cx[xmin:xmax, ymin:ymax]

print(f'Intersections found for id {id}: {len(sample_point_gdf)} posts.')

# for automatic selection of polygon with many intersections
# use code block below

# max_intersect = 0
# max_id_ref = None
# for id in range(1,30):
#     # adjust number for sample polygon
#     sample_polygon = gdf_targetgeom_shapes.iloc[id]
#     xmin, ymin, xmax, ymax = add_buffer(
#         sample_polygon.geometry)
#     
#     # intersect/clip points based on bbox input
#     sample_point_temp = lbsm_data_gdf.cx[xmin:xmax, ymin:ymax]
#     number_of_intersections = len(sample_point_temp)
#     if number_of_intersections > max_intersect:
#         max_id_ref = id
#         max_intersect = number_of_intersections
#         sample_point_gdf = sample_point_temp
#         
# # report
# print(f'Max Intersections found for id {max_id_ref}: {len(sample_point_gdf)} posts.')
Intersections found for id 29: 7 posts.
CPU times: user 12.7 s, sys: 62.5 ms, total: 12.7 s
Wall time: 12.7 s

d) Prepare Geoviews Layer for selected Sample Points

In [51]:
sample_point_selection = gv.Points(sample_point_gdf, crs=cartopy.crs.epsg(epsg_code))

e) Output sample selection map

In [57]:
import pyproj
def set_active_tool(plot, element):
    # enable wheel_zoom by default
    plot.state.toolbar.active_scroll = plot.state.tools[0]

# the Overlay will automatically extend to the data dimensions,
# which cover the whole city area;
# since we want the initial zoom to be zoomed in to the selected sample geometry
# we need to supply xlim and ylim
# The Bokeh renderer only supports the Web Mercator projection,
# we therefore need to project the sample bounds first
# from UTM to Web Mercator
sample_bottomleft = pyproj.transform(
    pyproj.Proj(init=f'EPSG:{epsg_code}'), pyproj.Proj(init='EPSG:3857'), xmin, ymin)
sample_topright = pyproj.transform(
    pyproj.Proj(init=f'EPSG:{epsg_code}'), pyproj.Proj(init='EPSG:3857'), xmax, ymax)
# to show 200m radius
# centre_line = pyproj.transform(
#     pyproj.Proj(init=f'EPSG:{epsg_code}'),
#     pyproj.Proj(init='EPSG:3857'),
#     gdf_targetgeom_shapes.iloc[max_id_ref].geometry.centroid.x,
#     gdf_targetgeom_shapes.iloc[max_id_ref].geometry.centroid.y)

hv.Overlay(
    gv.tile_sources.EsriImagery * \
    # gv_sample_poly_buf.opts(line_width=0, fill_alpha=0.5, fill_color='grey') * \
    gv_sample_poly.opts(line_color='red', line_width=3, fill_alpha=0.5, tools=['hover']) * \
    # hv.Ellipse(centre_line[0], centre_line[1], 200, label='200m Buffer').opts(line_color='red', line_width=3, show_legend=True) * \
    sample_point_selection.opts(tools=['hover'], size=5, color='yellow') # * \
    # gv_testpoint
    ).opts(
        width=800,
        height=480,
        finalize_hooks=[set_active_tool],
        xlim=(sample_bottomleft[0], sample_topright[0]),
        ylim=(sample_bottomleft[1], sample_topright[1])
    )
Out[57]:

Optional: Intersect test using rtree

We'll create an rtree with bounding boxes for each polygon in the target geometries. This will be the most expensive operation; afterwards, we can use Shapely's distance function to assign points to a small number of candidate polygons. First, construct the RTree index for the target geometry shapes. We'll extend each bounding box by a small buffer so that we also catch points slightly outside a polygon.

  • adjust buffer_num to your needs; the minimum buffer_num should be the maximum SEARCH_RADIUS being used
In [58]:
%%time
from rtree import index

# Populate R-tree index with bounds of polygons
idx = index.Index()
for pos, poly in enumerate(gdf_targetgeom_shapes.geometry):
    idx.insert(pos, add_buffer(poly, buffer_num=50))
CPU times: user 34.7 s, sys: 0 ns, total: 34.7 s
Wall time: 34.7 s

Intersect our testpoint with the rtree

Next, we intersect our testpoint with the rtree and select those polygons that are near the testpoint. We can define a search radius. This is very fast because the heavy lifting is done by the spatial rtree intersection (idx.intersection(testpoint.coords[0])), while the expensive distance calculation from shapely (gdf_targetgeom_shapes.geometry[i].distance(testpoint)) is only done for the few polygons that are preselected based on their bounding box.

  • Important: Modify SEARCH_RADIUS (m) to test output for various distances

Since Instagram and Twitter have lower geoaccuracies on average, we use a larger search radius of 50 m, whereas for Flickr we'll use 25 m.

In [59]:
SEARCH_RADIUS_ORIGIN = {
    "1":50,"2":25,"3":50
}
In [76]:
lbsm_data_gdf.head()
Out[76]:
origin_id latitude longitude user_guid post_create_date post_publish_date geometry
0 3 51.040557 13.731987 2de2c7b4c046fb9c2a7fcfeaa479356d NaN 2016-06-10 10:46:41 POINT (411103.140159386 5655099.9591291)
1 1 51.091362 13.697633 145301fdbcf5d5261ff46d63e0962586 NaN 2015-10-09 16:11:39 POINT (408794.6639901432 5660791.445413328)
2 1 51.042050 13.797690 df8244ec7295c142668fbcb530b6c2cc NaN 2018-02-06 00:41:07 POINT (415711.9591703562 5655188.755580138)
3 1 51.102810 13.686440 98d6841fdbfef53a01b6247d70bae2bc NaN 2018-03-11 18:48:22 POINT (408033.5643395127 5662078.377559571)
4 1 51.044881 13.735635 5832633cf2de965477d171bdca251eb5 NaN 2016-05-16 09:56:44 POINT (411367.1581114109 5655576.364487658)
In [60]:
%%time
# Query the testpoint to see which polygons it is in or near,
# using the Rtree index first, then Shapely's distance/intersects
# SEARCH_RADIUS =  25
# to look for other points here, override id:
# id = 20
lbsm_post = lbsm_data_gdf.iloc[id]
SEARCH_RADIUS = SEARCH_RADIUS_ORIGIN.get(lbsm_post.origin_id)

testpoint = lbsm_post.geometry
gv_testpoint = gv.Points([testpoint], crs=cartopy.crs.epsg(epsg_code), label='Testpoint').opts(tools=['hover'], size=5, color='yellow')
# calculate boundary to show
xmin, ymin, xmax, ymax = add_buffer(lbsm_post.geometry)
sample_bottomleft = pyproj.transform(pyproj.Proj(init=f'EPSG:{epsg_code}'), pyproj.Proj(init='EPSG:3857'), xmin, ymin)
sample_topright = pyproj.transform(pyproj.Proj(init=f'EPSG:{epsg_code}'), pyproj.Proj(init='EPSG:3857'), xmax, ymax)

# with rtree only
poly_idx_rtree = [i for i in idx.intersection((testpoint.coords[0]))]

# with search radius
poly_idx = [i for i in idx.intersection((testpoint.coords[0]))
            if gdf_targetgeom_shapes.geometry[i].distance(testpoint) < SEARCH_RADIUS]
            #if gdf.geometry[i].interpolate(gdf.geometry[i].project(testpoint)).distance(testpoint) > search_radius]

# without search radius, exact intersection
poly_idx_exact = [i for i in idx.intersection((testpoint.coords[0]))
                   if gdf_targetgeom_shapes.geometry[i].intersects(testpoint)]

# report
print(f'{len(poly_idx_rtree)} preselected polygons through rtree')
print(f'{len(poly_idx)} potentially relevant polygons found (within search radius {SEARCH_RADIUS}).')
print(f'{len(poly_idx_exact)} exact intersection polygons found.')
29 preselected polygons through rtree
13 potentially relevant polygons found (within search radius 50).
1 exact intersection polygons found.
CPU times: user 62.5 ms, sys: 62.5 ms, total: 125 ms
Wall time: 243 ms
In [61]:
print(poly_idx)
[264362, 175021, 260431, 227910, 81690, 68946, 75593, 222158, 57426, 139542, 75968, 84505, 31346]
  • Select subset of relevant polygons based on id's
In [62]:
gdf_targetgeom_shapes.iloc[poly_idx].head()
Out[62]:
OBJECTID PARENT_TARGET_green_target LULC_ID geometry
264362 207751 False dd_EBK-TSP-OSM_27773_1_4 (POLYGON ((410493.7919960411 5655589.181288132...
175021 222906 False dd_EBK-TSP-OSM_20278_1_4 (POLYGON ((410417.471424504 5655661.089929336,...
260431 200887 False dd_EBK-TSP-OSM_27773_2_2 (POLYGON ((410507.1696264762 5655643.86359268,...
227910 257301 False dd_EBK-TSP-OSM_27773_1_7 (POLYGON ((410493.7919960411 5655589.181288132...
81690 60140 False dd_EBK-TSP-OSM_27773_4_1 (POLYGON ((410424.0985141604 5655687.181977004...
  • prepare geoviews layer for preview
In [63]:
testpoly_sel = gv.Polygons(
    [gv.Shape(feat) for feat in gdf_targetgeom_shapes.iloc[poly_idx].geometry],
    crs=cartopy.crs.epsg(epsg_code),
    label=f'Found {len(poly_idx)} Target Polygons (search radius {SEARCH_RADIUS}m)')
testpoly_sel_exact = gv.Polygons(
    [gv.Shape(feat) for feat in gdf_targetgeom_shapes.iloc[poly_idx_exact].geometry],
    crs=cartopy.crs.epsg(epsg_code),
    label=f'Found {len(poly_idx_exact)} Target Polygons (exact intersection)')
  • output map (careful: might be slow)
In [64]:
hv.Overlay(
    gv.tile_sources.EsriImagery * \
    # gv_shapes.opts(line_color='white', line_width=1) * \
    testpoly_sel.opts(line_color='yellow', line_width=3, fill_alpha=0, show_legend=True) * \
    testpoly_sel_exact.opts(line_color='red', line_width=3, fill_alpha=0, show_legend=True, line_dash='dotted') * \
    gv_testpoint.opts(tools=['hover'], size=10, color='red', show_legend=True)
    #hv.Ellipse(gv_testpoint.x, gv_testpoint.y, 35, label='200m Buffer').opts(line_color='yellow', line_width=1, show_legend=True)
    ).opts(
        width=800,
        height=480,
        xlim=(sample_bottomleft[0], sample_topright[0]),
        ylim=(sample_bottomleft[1], sample_topright[1]),
        title_format=f"Selected Target Geometries for Testpoint and {SEARCH_RADIUS}m Search Radius",
    )
Out[64]:

Intersect all points with the target geometries

We execute this intersection for each point in our LBSM dataset and update all counts for nearby polygons from the meinGrün Target Geometries.

1. Add statistics field to input geometries

In [65]:
gdf_targetgeom_shapes.head()
Out[65]:
OBJECTID PARENT_TARGET_green_target LULC_ID geometry
0 1 False dd_OSM_57149_5_22 (POLYGON ((393349.4458630043 5652514.374407471...
1 283517 False dd_EBK-TSP-OSM_27937_2_3 (POLYGON ((401577.7054682793 5656972.97522185,...
2 116638 False dd_OSM_54537_2_1 (POLYGON ((407212.8660183108 5650545.754251033...
3 73544 False dd_EBK-TSP-OSM_30836_1_19 (POLYGON ((420314.3206534541 5656666.771566805...
4 192364 False dd_EBK-TSP-OSM_31363_18_1 (POLYGON ((413028.7688842474 5652131.050560708...

Initialize count structures

There are several things we want to measure per meinGrün target shape:

  • User_Posts: Number of social media posts in total
  • User_Days: Counting each user once per day (post timestamp)
  • User_Post_Locations: Counting each user-location coordinate only once (post latlng)
  • InstagramCount: Number of Instagram posts in total
  • FlickrCount: Number of Flickr posts in total
  • TwitterCount: Number of Twitter posts in total
In [73]:
# init count structures
gdf_targetgeom_shapes['User_Posts'] = 0
gdf_targetgeom_shapes['User_Days'] = [set() for _ in range(len(gdf_targetgeom_shapes))]
gdf_targetgeom_shapes['User_Post_Locations'] = [set() for _ in range(len(gdf_targetgeom_shapes))]
gdf_targetgeom_shapes['InstagramCount'] = 0
gdf_targetgeom_shapes['FlickrCount'] = 0
gdf_targetgeom_shapes['TwitterCount'] = 0
gdf = gdf_targetgeom_shapes
In [74]:
gdf_targetgeom_shapes.head()
Out[74]:
OBJECTID PARENT_TARGET_green_target LULC_ID geometry User_Posts User_Days User_Post_Locations InstagramCount FlickrCount TwitterCount
0 1 False dd_OSM_57149_5_22 (POLYGON ((393349.4458630043 5652514.374407471... 0 {} {} 0 0 0
1 283517 False dd_EBK-TSP-OSM_27937_2_3 (POLYGON ((401577.7054682793 5656972.97522185,... 0 {} {} 0 0 0
2 116638 False dd_OSM_54537_2_1 (POLYGON ((407212.8660183108 5650545.754251033... 0 {} {} 0 0 0
3 73544 False dd_EBK-TSP-OSM_30836_1_19 (POLYGON ((420314.3206534541 5656666.771566805... 0 {} {} 0 0 0
4 192364 False dd_EBK-TSP-OSM_31363_18_1 (POLYGON ((413028.7688842474 5652131.050560708... 0 {} {} 0 0 0
In [80]:
%%time
from IPython.display import clear_output

def get_userday(post):
    """Remove time from date-time string, e.g. 2019-01-01"""
    if isinstance(post.post_date, str) and len(post.post_date) >= 10:
        return f'{post.user_guid}:{post.post_date[:10]}'
    return f'{post.user_guid}:1900-01-01'
def get_location(post_geom):
    """Concat lat:lng to string"""
    return f'{post_geom.y}:{post_geom.x}'
def concat_userlocation(user_guid, location_str):
    """Concat user:lat:lng to string"""
    return f'{user_guid}:{location_str}'

    
x = 0
x_i = 0
# get ids from column names
column_id_up = gdf.columns.get_loc("User_Posts")
column_id_ud = gdf.columns.get_loc("User_Days")
column_id_upl = gdf.columns.get_loc("User_Post_Locations")
col_id_instagram = gdf.columns.get_loc('InstagramCount')
col_id_flickr = gdf.columns.get_loc('FlickrCount')
col_id_twitter = gdf.columns.get_loc('TwitterCount')
origin_col_dict = {
    "1":col_id_instagram, "2":col_id_flickr, "3":col_id_twitter}
# initialize dict lookup for location-poly-idx
latlng_dict_polyidx = dict()

total_records = len(lbsm_data_gdf)
for lbsm_record in lbsm_data_gdf.itertuples():
    # print(lbsm_record.geometry)
    # break
    # update status
    x += 1
    msg_text = (
        f'Processed records: {x} ({x/(total_records/100):.2f}%). '
        f'Intersection found: {x_i}')
    if x % 100 == 0:
        clear_output(wait=True)
        print(msg_text)
        
    # get search radius based on the current post's origin
    SEARCH_RADIUS = SEARCH_RADIUS_ORIGIN.get(
        lbsm_record.origin_id)
    # we save some processing time by remembering
    # repeated locations, sacrificing some RAM
    # (spatial intersection is more expensive than a dict lookup)
    location_str = get_location(lbsm_record.geometry)
    if location_str in latlng_dict_polyidx:
        poly_idx = latlng_dict_polyidx[location_str]
    else:
        poly_idx = [i for i in idx.intersection((lbsm_record.geometry.coords[0]))
                if gdf.geometry[i].distance(lbsm_record.geometry) < SEARCH_RADIUS]
        # add to dict for improving speed for repetitive locations
        latlng_dict_polyidx[location_str] = poly_idx
    # count intersection
    if len(poly_idx) > 0:
        x_i += 1
        userday = get_userday(lbsm_record)
        userlocation = concat_userlocation(lbsm_record.user_guid, location_str)
        # update poly count column
        # the iloc indexer for Pandas Dataframe is used for integer-
        # location based indexing / selection by position.
        # this may update/increment multiple rows/records
        origin_col = origin_col_dict.get(lbsm_record.origin_id)
        gdf.iloc[poly_idx, origin_col] += 1
        gdf.iloc[poly_idx, column_id_up] += 1
        # the following 2 operations are quite expensive on memory/cpu
        # disable if not needed
        # gdf.iloc[poly_idx, column_id_ud].apply(lambda x: x.add(userday))
        # gdf.iloc[poly_idx, column_id_upl].apply(lambda x: x.add(userlocation))
    
# final status
clear_output(wait=True)
print(msg_text)
Processed records: 991117 (100.00%). Intersection found: 892445
CPU times: user 1h 3min 14s, sys: 42.2 s, total: 1h 3min 56s
Wall time: 1h 4min 51s

Optional: store or load intermediate aggregated dataframe results

Store:

In [85]:
gdf.to_pickle(f"gdf_{CITY_NAME}_{LEVEL}_{TARGETSHAPE_VERSION}.pickle")

Load:

In [35]:
gdf = pd.read_pickle(f"gdf_{CITY_NAME}_{LEVEL}_{TARGETSHAPE_VERSION}.pickle")

Preview results table

In [86]:
gdf[[TARGET_SHAPE_ID_COL, 'User_Posts', 'User_Days', 'User_Post_Locations', 'InstagramCount', 'FlickrCount', 'TwitterCount']].head()
Out[86]:
LULC_ID User_Posts User_Days User_Post_Locations InstagramCount FlickrCount TwitterCount
0 dd_OSM_57149_5_22 0 {} {} 0 0 0
1 dd_EBK-TSP-OSM_27937_2_3 0 {} {} 0 0 0
2 dd_OSM_54537_2_1 0 {} {} 0 0 0
3 dd_EBK-TSP-OSM_30836_1_19 0 {} {} 0 0 0
4 dd_EBK-TSP-OSM_31363_18_1 0 {} {} 0 0 0

Convert Sets to unique counts

In [87]:
gdf['UD'] = gdf.apply(lambda row: len(row.User_Days), axis=1)
gdf['UPL'] = gdf.apply(lambda row: len(row.User_Post_Locations), axis=1)
In [88]:
gdf[[TARGET_SHAPE_ID_COL, 'User_Posts', 'User_Days', 'User_Post_Locations', 'UD', 'UPL', 'InstagramCount', 'FlickrCount', 'TwitterCount']].head()
Out[88]:
LULC_ID User_Posts User_Days User_Post_Locations UD UPL InstagramCount FlickrCount TwitterCount
0 dd_OSM_57149_5_22 0 {} {} 0 0 0 0 0
1 dd_EBK-TSP-OSM_27937_2_3 0 {} {} 0 0 0 0 0
2 dd_OSM_54537_2_1 0 {} {} 0 0 0 0 0
3 dd_EBK-TSP-OSM_30836_1_19 0 {} {} 0 0 0 0 0
4 dd_EBK-TSP-OSM_31363_18_1 0 {} {} 0 0 0 0 0

Remove the set columns, since they're not needed anymore:

In [89]:
gdf.drop(columns=['User_Post_Locations', 'User_Days'], inplace=True)
gdf.rename(columns={'User_Posts': 'UP'}, inplace=True)
gdf.rename(columns={'InstagramCount': 'UP_Instagram'}, inplace=True)
gdf.rename(columns={'FlickrCount': 'UP_Flickr'}, inplace=True)
gdf.rename(columns={'TwitterCount': 'UP_Twitter'}, inplace=True)

Our final GeoDataFrame with aggregated measurements per green target shape:

In [92]:
gdf[[TARGET_SHAPE_ID_COL, 'UP', 'UD', 'UPL', 'UP_Instagram', 'UP_Flickr', 'UP_Twitter']].head()
Out[92]:
LULC_ID UP UD UPL UP_Instagram UP_Flickr UP_Twitter
0 dd_OSM_57149_5_22 0 0 0 0 0 0
1 dd_EBK-TSP-OSM_27937_2_3 0 0 0 0 0 0
2 dd_OSM_54537_2_1 0 0 0 0 0 0
3 dd_EBK-TSP-OSM_30836_1_19 0 0 0 0 0 0
4 dd_EBK-TSP-OSM_31363_18_1 0 0 0 0 0 0
# remove duplicate columns (optional):
# gdf = gdf.loc[:,~gdf.columns.duplicated()]

Get Histogram

The histogram shows the typical long tail pattern: most shapes have few posts, but a few shapes have a lot of posts.

In [93]:
np.sqrt(np.sqrt(gdf['UP'])).hist(bins=20, alpha=0.5)
Out[93]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe922ea30b8>

Normalize for visualization

Absolute counts per shape are not very meaningful, since large shapes have a higher likelihood of intersecting with many posts.

  • We can normalize using the area (in square kilometres) of each target shape
  • Furthermore, we can "flatten" the long tail a bit by applying a square root function
  • these normalized numbers are only meaningful relative to each other, therefore we can stretch results to a 1 to 1000 range for comparability across different runs

Calculate buffered Area for search distances

In [94]:
# Get row for specific ID
id = 0
gdf.iloc[id].geometry
Out[94]:
In [95]:
gdf.iloc[id].geometry.buffer(50, resolution=3, cap_style=1, join_style=1, mitre_limit=5.0)
Out[95]:
In [96]:
print(f'{gdf.iloc[id].geometry.area:.2f} m²')
print(f'50m Search radius: {gdf.iloc[id].geometry.buffer(50, resolution=16, cap_style=1, join_style=1, mitre_limit=5.0).area:.2f} m²')
print(f'25m Search radius: {gdf.iloc[id].geometry.buffer(25, resolution=16, cap_style=1, join_style=1, mitre_limit=5.0).area:.2f} m²')
1262.12 m²
50m Search radius: 22494.49 m²
25m Search radius: 10003.08 m²
In [97]:
def m2_to_km2(m_num):
    """Convert square meters to square kilometres"""
    return m_num/1000000

print(f'{m2_to_km2(gdf.iloc[id].geometry.area):.4f} km²')
print(f'50m Search radius: {m2_to_km2(gdf.iloc[id].geometry.buffer(50, resolution=16, cap_style=1, join_style=1, mitre_limit=5.0).area):.4f} km²')
print(f'25m Search radius: {m2_to_km2(gdf.iloc[id].geometry.buffer(25, resolution=16, cap_style=1, join_style=1, mitre_limit=5.0).area):.4f} km²')
0.0013 km²
50m Search radius: 0.0225 km²
25m Search radius: 0.0100 km²

Calculate km²

  • for all geometries in the dataset
  • for all 25m buffered geometries in dataset
  • for all 50m buffered geometries in dataset
In [98]:
def shape_buf(geom, buf_dist_m):
    """Buffer geometry based on m distance"""
    buf_geom = geom.buffer(buf_dist_m, resolution=16, cap_style=1, join_style=1, mitre_limit=5.0)
    return buf_geom

gdf['area_skm'] = gdf['geometry'].apply(lambda x: m2_to_km2(x.area))
# same result:
# gdf['poly_area'] = gdf.geometry.area / 10**6
gdf['area_skm_SR25'] = gdf['geometry'].apply(lambda x: m2_to_km2(shape_buf(x, 25).area))
gdf['area_skm_SR50'] = gdf['geometry'].apply(lambda x: m2_to_km2(shape_buf(x, 50).area))
In [100]:
# gdf[[TARGET_SHAPE_ID_COL, 'UP', 'UD', 'UPL', 'UP_Instagram', 'UP_Flickr', 'UP_Twitter', 'area_skm', 'area_skm_SR25', 'area_skm_SR50']].head()
gdf[[TARGET_SHAPE_ID_COL, 'UP', 'UP_Instagram', 'UP_Flickr', 'UP_Twitter', 'area_skm', 'area_skm_SR25', 'area_skm_SR50']].head()
Out[100]:
LULC_ID UP UP_Instagram UP_Flickr UP_Twitter area_skm area_skm_SR25 area_skm_SR50
0 dd_OSM_57149_5_22 0 0 0 0 0.001262 0.010003 0.022494
1 dd_EBK-TSP-OSM_27937_2_3 0 0 0 0 0.000088 0.003105 0.010027
2 dd_OSM_54537_2_1 0 0 0 0 0.000030 0.002544 0.008978
3 dd_EBK-TSP-OSM_30836_1_19 0 0 0 0 0.000092 0.003036 0.009900
4 dd_EBK-TSP-OSM_31363_18_1 0 0 0 0 0.000480 0.007524 0.018478

Normalize

In [101]:
import math
def stretch_norm_dfcolumn(df, column, min_stretch, max_stretch):
    """Interpolates a Geodataframe column's values to a new range"""
    min_val = df[column].min()
    max_val = df[column].max()
    # replace values in-place
    # the lambda function is applied to all values from _column_
    # df.loc[:, column] means: replace values of _column_ in-place (modifying the original dataframe)
    # .loc[row_indexer,col_indexer]
    df.loc[:, column] = df[column].apply(
        lambda x: np.interp(x, (min_val, max_val), (min_stretch, max_stretch)))
    # similar but not recommended:
    # df[column] = np.interp(df[column], (min_val, max_val), (min_stretch, max_stretch))

# it is possible to 'flatten' the unequal distribution of values
# apply sqrt once for a bit of flattening, or twice for more flattening
gdf['UP_Norm'] = np.sqrt(gdf['UP']/gdf['area_skm'])
# gdf['UP_Norm'] = np.sqrt(np.sqrt(gdf['UP']/gdf['poly_area']))

# interpolate to 1 to 1000 range
stretch_norm_dfcolumn(gdf, 'UP_Norm', 1, 1000)

# print
gdf[[TARGET_SHAPE_ID_COL, 'UP', 'UP_Norm', 'UP_Instagram', 'UP_Flickr', 'UP_Twitter']].sort_values('UP_Norm', ascending=False).head()
# gdf[[TARGET_SHAPE_ID_COL, 'UP', 'UP_Norm', 'UD', 'UPL', 'UP_Instagram', 'UP_Flickr', 'UP_Twitter']].sort_values('UP_Norm', ascending=False).head()
Out[101]:
LULC_ID UP UP_Norm UP_Instagram UP_Flickr UP_Twitter
132445 dd_EBK-TSP-OSM_26382_9_2 38453 1000.000000 38147 306 0
285693 dd_EBK-TSP-OSM_22142_1_4 34606 932.013209 34476 130 0
160705 dd_EBK-TSP-OSM_34121_20_5 13197 740.291677 12779 413 5
33544 dd_EBK-TSP-OSM_34121_32_3 12599 738.057831 12318 269 12
49720 dd_EBK-TSP-OSM_34121_20_3 13110 722.994056 12779 326 5

Normalize Origin Influence

It becomes obvious that Instagram dominates the results. We want to generate a more even result across the 3 data sources.

Our next goal therefore is to allow only

  • 40% Influence by Instagram (reduced because Instagram has low geoaccuracy)
  • 50% Influence by Flickr (increased because Flickr features the highest geoaccuracy)
  • 10% Influence by Twitter (reduced because Twitter features sampling noise)

We can do this by using the information from UP_Instagram, UP_Flickr, and UP_Twitter columns.

Since we've used a Search_Radius of 50m for Instagram and Twitter and a Search_Radius of 25m for Flickr,
we need to normalize count values separately based on origin and buffered area.

Remember the original percentage of the three datasources:

In [102]:
origin_perc_total = dict()
for key, value in size_count.items():
    total_perc = value/(total_count/100)
    print(f'Origin ID {key} with {value} posts, total percentage is: {total_perc:.2f}%')
    origin_perc_total[key] = total_perc
Origin ID 1 with 806181 posts, total percentage is: 81.34%
Origin ID 2 with 183249 posts, total percentage is: 18.49%
Origin ID 3 with 1687 posts, total percentage is: 0.17%

Thus, we need to perform a two-step normalization:

  • first, normalize based on the total percentage of each data source
  • second, apply the manual influence weights (40%, 50%, 10%); see the toy calculation below
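A toy calculation (hypothetical counts, ignoring the per-origin buffer areas and the square root used in the real implementation below) illustrates how this rebalancing boosts underrepresented sources:

In [ ]:
# toy illustration of the two-step rebalancing (hypothetical counts for one shape)
counts = {"1": 80, "2": 20, "3": 1}               # posts per origin (Instagram, Flickr, Twitter)
global_share = {"1": 0.81, "2": 0.18, "3": 0.01}  # approx. share of each origin in the full dataset
target_weight = {"1": 0.4, "2": 0.5, "3": 0.1}    # manual influence weights

rebalanced = {k: counts[k] / global_share[k] * target_weight[k] for k in counts}
print(rebalanced)  # Flickr's relative contribution is boosted, Instagram's reduced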
In [103]:
def norm_ip_origin(row):
    total_up = row['UP']
    # skip shapes without any posts
    if total_up == 0:
        return 0
    # get total counts per origin
    # normalize per search radius
    # flatten curve using square root
    instagram_up = np.sqrt(row['UP_Instagram']/row['area_skm_SR50'])
    flickr_up = np.sqrt(row['UP_Flickr']/row['area_skm_SR25'])
    twitter_up = np.sqrt(row['UP_Twitter']/row['area_skm_SR50'])
    total_up = instagram_up + flickr_up + twitter_up
    # calc perc per origin percentage
    # and apply manual weights
    instagram_up_norm = instagram_up / (origin_perc_total.get("1") / 100) * 0.4
    flickr_up_norm = flickr_up  / (origin_perc_total.get("2") / 100) * 0.5
    twitter_up_norm = twitter_up / (origin_perc_total.get("3") / 100) * 0.1
    sum_norm = instagram_up_norm + flickr_up_norm + twitter_up_norm
    return sum_norm
    
gdf['UP_Norm_Origin'] = gdf.apply(lambda row: norm_ip_origin(row), axis=1)
# gdf['UP_Norm_Origin'] = np.sqrt(gdf['UP_Norm_Origin']/gdf['poly_area'])
stretch_norm_dfcolumn(gdf, 'UP_Norm_Origin', 1, 1000)

Compare difference of UP_Norm and UP_Norm_Origin

In [104]:
# gdf[[TARGET_SHAPE_ID_COL, 'UP', 'UP_Norm', 'UP_Norm_Origin', 'UD', 'UPL', 'UP_Instagram', 'UP_Flickr', 'UP_Twitter']].sort_values('UP', ascending=False).head(20)
gdf[[TARGET_SHAPE_ID_COL, 'UP', 'UP_Norm', 'UP_Norm_Origin', 'UP_Instagram', 
     'UP_Flickr', 'UP_Twitter', 'area_skm', 'area_skm_SR25', 'area_skm_SR50']].sort_values('UP', ascending=False).head(20)
Out[104]:
LULC_ID UP UP_Norm UP_Norm_Origin UP_Instagram UP_Flickr UP_Twitter area_skm area_skm_SR25 area_skm_SR50
248618 dd_EBK-TSP-OSM_26382_2_1 57004 53.461750 369.938689 55207 1788 9 0.002721 0.013830 0.028080
182090 dd_EBK-TSP-OSM_26382_9_1 55367 84.977281 480.061435 52127 3235 5 0.001031 0.008031 0.018173
235404 dd_EBK-TSP-OSM_26382_9_4 53160 125.470855 442.073712 51912 1243 5 0.000451 0.005870 0.014814
112 dd_EBK-TSP-OSM_26382_2_13 52506 115.769344 411.428651 51646 855 5 0.000524 0.005867 0.014979
72017 dd_EBK-TSP-OSM_26382_2_7 52147 202.599161 427.210134 51646 496 5 0.000169 0.004202 0.012148
105777 dd_EBK-TSP-OSM_26382_7_4 52075 82.163515 498.159155 48868 3200 7 0.001039 0.008076 0.018265
119927 dd_EBK-TSP-OSM_30690_1_2 51810 150.307755 367.894779 51095 714 1 0.000305 0.004075 0.011761
156847 dd_EBK-TSP-OSM_26382_1_17 49804 113.144364 438.841253 48865 932 7 0.000520 0.005953 0.015088
2329 dd_EBK-TSP-OSM_26382_8_2 45420 77.333068 341.792179 41986 3434 0 0.001024 0.008096 0.018273
33693 dd_EBK-TSP-OSM_32865_1_1 45287 41.873304 818.249290 39994 5210 83 0.003561 0.011325 0.023009
135025 dd_EBK-TSP-OSM_20688_1_1 41149 53.517037 775.020907 38049 3022 78 0.001960 0.009784 0.021526
166968 dd_EBK-TSP-OSM_26382_2_12 40919 69.227221 365.229054 39538 1377 4 0.001155 0.008230 0.018734
244689 dd_EBK-TSP-OSM_23723_1_12 40605 105.151400 803.560374 38400 2138 67 0.000492 0.007030 0.017120
238384 dd_EBK-TSP-OSM_26382_3_3 40524 69.195812 306.329142 39171 1352 1 0.001145 0.008151 0.018600
102554 dd_EBK-TSP-OSM_23723_1_14 40450 40.883099 722.758891 37875 2497 78 0.003341 0.011319 0.023157
205135 dd_EBK-TSP-OSM_26382_8_1 40117 105.993831 290.257789 38674 1443 0 0.000478 0.005924 0.014877
264364 dd_EBK-TSP-OSM_26382_2_5 39607 205.066103 299.205147 38929 678 0 0.000125 0.003245 0.010274
139719 dd_EBK-TSP-OSM_26382_9_3 38538 553.253864 287.336545 38147 391 0 0.000017 0.002444 0.008791
132445 dd_EBK-TSP-OSM_26382_9_2 38453 1000.000000 280.882430 38147 306 0 0.000005 0.002209 0.008333
15067 dd_EBK-TSP-OSM_20261_1_1 38239 42.428103 447.089538 36116 2106 17 0.002927 0.011816 0.024618
  • rows with a more equal distribution of posts across Flickr, Twitter, and Instagram are increased in ranking/weight
  • rows with a distribution skewed towards a single data source (e.g. Instagram) are reduced in importance
In [105]:
gdf['UP_Norm_Origin'].hist(bins=20, alpha=0.5)
Out[105]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe91f1ed3c8>
In [106]:
# get top poly
top_norm = gdf.sort_values('UP_Norm_Origin', ascending=False).iloc[0]
top_norm_count = gdf.sort_values('UP', ascending=False).iloc[0]

# normalized
top_norm_poly = top_norm.geometry
top_norm_value = top_norm.UP_Norm_Origin

# raw
top_raw_poly_count = top_norm_count.geometry
top_raw_value_count = top_norm_count.UP

# construct geoviews layer
gv_sample_poly_top_norm = gv.Shape(
    top_norm_poly,
    label=f'meinGrün Top Target Polygon ({top_norm_value} Normalized Post Count based on Area and Service Origin)',
    crs=cartopy.crs.epsg(epsg_code))

# construct geoviews layer
gv_sample_poly_top_count = gv.Shape(
    top_raw_poly_count,
    label=f'meinGrün Top Target Polygon ({top_raw_value_count} Total Post Count)',
    crs=cartopy.crs.epsg(epsg_code))

# calculate boundary to show
xmin, ymin, xmax, ymax = add_buffer(top_norm_poly)
sample_bottomleft = pyproj.transform(
    pyproj.Proj(init=f'EPSG:{epsg_code}'), pyproj.Proj(init='EPSG:3857'), xmin, ymin)
sample_topright = pyproj.transform(
    pyproj.Proj(init=f'EPSG:{epsg_code}'), pyproj.Proj(init='EPSG:3857'), xmax, ymax)
In [107]:
hv.Overlay(
    gv.tile_sources.EsriImagery * \
    gv_sample_poly_top_count.opts(line_color='yellow', line_width=3, fill_alpha=0.2, show_legend=True) * \
    gv_sample_poly_top_norm.opts(line_color='red', line_width=3, fill_alpha=0.2, show_legend=True)
    ).opts(
        width=800,
        height=480,
        xlim=(sample_bottomleft[0], sample_topright[0]),
        ylim=(sample_bottomleft[1], sample_topright[1]),
        title_format=f"Selected Target Geometries for Testpoint and {SEARCH_RADIUS}m/50m Search Radius",
    )
Out[107]:

Visualize Results

Now we can plot the GeoDataFrame and color each polygon by the chosen measure:

Plot Sample Areas

In [108]:
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.patches import Patch

gdf_meingruen_official_targets = gdf.query(f'{GREEN_TARGET_COL} == 1')

# make a legend using a dictionary to assign colors
cmap = matplotlib.cm.get_cmap('Greens')
popularity_legend = {
    'very high': cmap(1000/1000), 
    'high': cmap(450/1000),
    'average': cmap(165/1000),
    'low': cmap(56/1000), 
    'very low': cmap(14/1000), 
    'no data': cmap(1)}

def add_patches(legend):
    ax = legend.axes
    handles, labels = ax.get_legend_handles_labels()
    labels.append("LBSM Popularity")
    handles.append(Patch(facecolor="none", fill=False, edgecolor='none', linewidth=0))
    for key in popularity_legend:
            data_key = Patch(edgecolor='none', facecolor=popularity_legend[key])
            handles.append(data_key)
            labels.append(key)
    labels.append("Official green target")
    handles.append(Patch(facecolor="none", fill=False, edgecolor='red', linewidth=0.7))

    legend._legend_box = None
    legend._init_legend_box(handles, labels)
    legend._set_loc(legend._loc)
    legend.set_title(legend.get_title().get_text())
    
def get_map(lim_lng_min, lim_lat_min, lim_lng_max, lim_lat_max, title, gdf, column, show_green_targets):
    """Plot area in matplotlib (pyplot).
    Note: Natural Breaks algorithm is used here to classify colors, remove to see raw values"""
    bb_bottomleft = pyproj.transform(pyproj.Proj(init='EPSG:4326'), pyproj.Proj(init=f'EPSG:{epsg_code}'), lim_lng_min, lim_lat_min)
    bb_topright = pyproj.transform(pyproj.Proj(init='EPSG:4326'), pyproj.Proj(init=f'EPSG:{epsg_code}'), lim_lng_max, lim_lat_max)
    fig, ax = plt.subplots()
    fig.set_size_inches(18.5, 10.5, forward=True)
    ax.set_title(f'LBSM Target Geometry Weights for {title}', fontsize=20)
    ax.set_xlim([bb_bottomleft[0],bb_topright[0]])
    ax.set_ylim([bb_bottomleft[1],bb_topright[1]])
    
    # plt.legend(handles=patch_list)
    plt.savefig('legend.png', bbox_inches='tight')
    base = gdf.plot(column=column, figsize=(20,10), cmap='Greens', ax=ax, scheme="natural_breaks", label='Normalized User Post Locations (UPL) based on Area', legend = True)
    if show_green_targets:
        lgd = ax.legend(loc='upper right')
        add_patches(lgd)
        gdf_meingruen_official_targets.plot(ax=base, facecolor="none", edgecolor='red', lw=0.7, label='Official green target', legend = True)
        # plot the data

def get_map_b(lim_lng_min, lim_lat_min, lim_lng_max, lim_lat_max, title, gdf, column, show_green_targets=False):
    """Plot area in matplotlib (pyplot).
    Note: Natural Breaks algorithm is used here to classify colors, remove to see raw values"""
    bb_bottomleft = pyproj.transform(pyproj.Proj(init='EPSG:4326'), pyproj.Proj(init=f'EPSG:{epsg_code}'), lim_lng_min, lim_lat_min)
    bb_topright = pyproj.transform(pyproj.Proj(init='EPSG:4326'), pyproj.Proj(init=f'EPSG:{epsg_code}'), lim_lng_max, lim_lat_max)
    axes = plt.gca()
    fig = plt.gcf()
    fig.set_size_inches(18.5, 10.5, forward=True)
    # fig.suptitle('LBSM Target Geometry Weights for Dresden Altstadt', fontsize=20)
    axes.set_title(f'LBSM Target Geometry Weights for {title}', fontsize=20)
    axes.set_xlim([bb_bottomleft[0],bb_topright[0]])
    axes.set_ylim([bb_bottomleft[1],bb_topright[1]])
    base = gdf.plot(column=column, figsize=(20,10), cmap='Greens', ax=axes, scheme="natural_breaks", label='Normalized User Post Locations (UPL) based on Area', legend = True)
    if show_green_targets:
        gdf_meingruen_official_targets.plot(ax=base, facecolor="none", edgecolor='red', lw=0.7, label='Official green target', legend = True)
    
    #gdf.to_crs({'init': 'epsg:3395'}).plot(column='UP_Norm', figsize=(20,10), cmap='Reds', ax=axes)
In [109]:
get_map(LIM_LNG_MIN, LIM_LAT_MIN, LIM_LNG_MAX, LIM_LAT_MAX, f'{CITY_NAME} {LEVEL} {TARGETSHAPE_VERSION}', gdf, 'UP_Norm_Origin', show_green_targets=False)
get_map(LIM_LNG_MIN_4, LIM_LAT_MIN_4, LIM_LNG_MAX_4, LIM_LAT_MAX_4, f"Heidelberg Center {LEVEL} {TARGETSHAPE_VERSION}", gdf, 'UP_Norm_Origin', show_green_targets=True)
In [110]:
get_map(LIM_LNG_MIN_1, LIM_LAT_MIN_1, LIM_LNG_MAX_1, LIM_LAT_MAX_1, f"Dresden Altstadt {LEVEL} {TARGETSHAPE_VERSION}", gdf, 'UP_Norm_Origin', show_green_targets=False)
In [111]:
get_map(LIM_LNG_MIN_2, LIM_LAT_MIN_2, LIM_LNG_MAX_2, LIM_LAT_MAX_2, f"Dresden Center {LEVEL} {TARGETSHAPE_VERSION}", gdf, 'UP_Norm_Origin', show_green_targets=False)
In [112]:
get_map(LIM_LNG_MIN_3, LIM_LAT_MIN_3, LIM_LNG_MAX_3, LIM_LAT_MAX_3, f"Dresden Neustadt {LEVEL} {TARGETSHAPE_VERSION}", gdf, 'UP_Norm_Origin', show_green_targets=False)

To save the figures above as SVG, run the following cell after each figure is generated:

In [113]:
plt.savefig(f"{CITY_NAME.lower()}_{LEVEL}_{TARGETSHAPE_VERSION}_A1.svg", format="svg")
<Figure size 432x288 with 0 Axes>

Optional: Plot Green Areas

gdf_filtered = gdf.query(f'{GREEN_TARGET_COL} == 1').copy()
# stretch_norm_dfcolumn(gdf_filtered, 'UP_Norm', 1, 1000)
# get_map(LIM_LNG_MIN_5, LIM_LAT_MIN_5, LIM_LNG_MAX_5, LIM_LAT_MAX_5, "Dresden", gdf_filtered, 'UP_Norm')
get_map_b(LIM_LNG_MIN_2, LIM_LAT_MIN_2, LIM_LNG_MAX_2, LIM_LAT_MAX_2, "Dresden Center", gdf_filtered, 'UP_Norm')

Natural Breaks Classification

Using pysal, it is possible to update our dataframe using a Natural Breaks classification.

  • see the pysal mapclassify documentation for other available classifiers
In [114]:
import pysal.viz.mapclassify as mc
# Define the number of classes
n_classes = 6
# Create a Natural Breaks classifier
classifier = mc.NaturalBreaks.make(k=n_classes)
# classifier = mc.HeadTailBreaks.make()
# Classify the data
classifications = gdf[['UP_Norm']].apply(classifier)
gdf['UP_Norm_nb'] = classifications
classifications = gdf[['UP_Norm_Origin']].apply(classifier)
gdf['UP_Norm_Origin_nb'] = classifications

We'll also add bin labels for describing classes:

In [115]:
classes_ref_5 = {
    5 : 'very high', 
    4 : 'high',
    3 : 'average',
    2 : 'low', 
    1 : 'very low',
    0 : 'no data'}

gdf['popularity'] = gdf['UP_Norm_Origin_nb'].map(classes_ref_5)
# make sure that the bin "no data" is only assigned to records without any posts
gdf.loc[(gdf.UP_Norm_Origin > 1) & (gdf.UP_Norm_Origin_nb == 0), 'popularity'] = 'very low'
gdf.loc[(gdf.UP_Norm > 1) & (gdf.UP_Norm_nb == 0), 'UP_Norm_nb'] = 1
gdf.loc[(gdf.UP_Norm_Origin > 1) & (gdf.UP_Norm_Origin_nb == 0), 'UP_Norm_Origin_nb'] = 1

Have a look at the values and where we came from (left to right):

In [118]:
print(gdf[['UP', 'UP_Norm', 'UP_Norm_nb', 'UP_Norm_Origin', 'UP_Norm_Origin_nb', 'popularity']].head(20))
print(gdf[['UP', 'UP_Norm', 'UP_Norm_nb', 'UP_Norm_Origin', 'UP_Norm_Origin_nb', 'popularity']].sort_values('UP_Norm', ascending=False).head(20))
     UP    UP_Norm  UP_Norm_nb  UP_Norm_Origin  UP_Norm_Origin_nb popularity
0     0   1.000000           0        1.000000                  0    no data
1     0   1.000000           0        1.000000                  0    no data
2     0   1.000000           0        1.000000                  0    no data
3     0   1.000000           0        1.000000                  0    no data
4     0   1.000000           0        1.000000                  0    no data
5     0   1.000000           0        1.000000                  0    no data
6     0   1.000000           0        1.000000                  0    no data
7     0   1.000000           0        1.000000                  0    no data
8     0   1.000000           0        1.000000                  0    no data
9     0   1.000000           0        1.000000                  0    no data
10    0   1.000000           0        1.000000                  0    no data
11  314   6.663382           1       30.494954                  1   very low
12    0   1.000000           0        1.000000                  0    no data
13  611  34.448285           2      208.307908                  3    average
14    0   1.000000           0        1.000000                  0    no data
15    0   1.000000           0        1.000000                  0    no data
16    8   4.492373           1        2.895190                  1   very low
17    1   2.587174           1        1.657375                  1   very low
18    0   1.000000           0        1.000000                  0    no data
19    3   3.399718           1        8.652027                  1   very low
           UP      UP_Norm  UP_Norm_nb  UP_Norm_Origin  UP_Norm_Origin_nb  \
132445  38453  1000.000000           5      280.882430                  4   
285693  34606   932.013209           5      223.947961                  3   
160705  13197   740.291677           5      440.145088                  4   
33544   12599   738.057831           5      516.452076                  5   
49720   13110   722.994056           5      422.241568                  4   
179427  12683   712.914202           5      498.342430                  5   
77199   13147   690.194638           5      428.560687                  4   
287744  11050   687.528537           5      412.914615                  4   
167696  12464   682.206098           5      489.713721                  5   
138028  10947   681.225994           5      412.120998                  4   
243684  12432   662.464682           5      411.260696                  4   
68385   12419   650.699024           5      471.793570                  5   
17454   12585   646.904871           5      508.041114                  5   
266015  12854   624.653644           5      447.129402                  4   
275376  12852   624.533082           5      447.117643                  4   
122762  12776   620.041834           5      440.655294                  4   
119574  14248   609.883898           5      161.575280                  3   
244894  11751   599.507929           5      593.967505                  5   
139719  38538   553.253864           5      287.336545                  4   
283010  10520   533.475642           5      407.724199                  4   

       popularity  
132445       high  
285693    average  
160705       high  
33544   very high  
49720        high  
179427  very high  
77199        high  
287744       high  
167696  very high  
138028       high  
243684       high  
68385   very high  
17454   very high  
266015       high  
275376       high  
122762       high  
119574    average  
244894  very high  
139719       high  
283010       high  
In [119]:
import seaborn as sns
sns.countplot(x='UP_Norm_Origin_nb', hue='UP_Norm_Origin_nb', data=gdf)
Out[119]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe90d044dd8>

Filter relevant target geometries

Create a geoviews layer for the final display. We'll filter segments by a threshold, so that only shapes with a minimum of popularity are shown (a minimal sketch for this follows after the next cell).

  • We can also filter based on the column TARGET_green_target
In [120]:
# on mixed-type input (e.g. 'Polygon', 'GeometryCollection'),
# apply the filter below to avoid a Geopandas exception
from shapely.geometry import Polygon, MultiPolygon

gdf_temp = gdf.loc[gdf['geometry'].apply(lambda x: isinstance(x, (Polygon, MultiPolygon)))]

gdf_filtered = gdf_temp.query(f'{GREEN_TARGET_COL} == 1').copy()
print(f'Number of total segments: {len(gdf)}. Number of filtered segments: {len(gdf_filtered)}')
Number of total segments: 290015. Number of filtered segments: 51645
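
The cell above filters on geometry type and the green-target column only. If you also want to apply the popularity threshold mentioned above, a minimal sketch (the threshold value is an arbitrary example, not taken from this notebook):

In [ ]:
# Optional sketch: drop shapes below a minimum popularity class.
# MIN_CLASS is an arbitrary example threshold (class bins range from 0 to 5).
MIN_CLASS = 1
gdf_filtered = gdf_filtered.loc[gdf_filtered['UP_Norm_Origin_nb'] >= MIN_CLASS].copy()
print(f'Segments above threshold: {len(gdf_filtered)}')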
In [121]:
gv_shapes_weighted = gv.Polygons(
    gdf_filtered, vdims=['UP_Norm_Origin_nb'], crs=ccrs.epsg(epsg_code))

Final Output:

  • if you want to output to an html file with a full-width map, uncomment:
    • %%output filename="dresden_streets" and
    • responsive=True
  • if you want to preview inside this jupyter notebook, leave the two lines above commented out and use the fixed dimensions:
    • width=800
    • height=480
In [ ]:
#%%output filename=f"{TODAY}_{CITY_NAME}_weighted_green_targets_{LEVEL}_{TARGETSHAPE_VERSION}"
from holoviews import dim
def set_active_tool(plot, element):
    # enable wheel_zoom by default
    plot.state.toolbar.active_scroll = plot.state.tools[0]

# the Overlay will automatically extend to the data dimensions;
# since we want the initial view to be zoomed in to the city center,
# we need to supply xlim and ylim.
# The Bokeh renderer only supports the Web Mercator projection,
# so we first need to project the bounding-box coordinates
# from WGS84 decimal degrees to Web Mercator

bb_bottomleft = pyproj.transform(pyproj.Proj(init='EPSG:4326'), pyproj.Proj(init='EPSG:3857'), LIM_LNG_MIN, LIM_LAT_MIN)
bb_topright = pyproj.transform(pyproj.Proj(init='EPSG:4326'), pyproj.Proj(init='EPSG:3857'), LIM_LNG_MAX, LIM_LAT_MAX)

hv.Overlay(
    gv.tile_sources.EsriImagery.opts(alpha=0.65) * \
    gv_shapes_weighted.opts(line_width=0.1, cmap='Greens', color_index='UP_Norm_Origin_nb', colorbar=True, fill_alpha=0.9, color_levels=5)
    # gv_shapes_weighted.opts(line_width=1+np.sqrt(dim('UP_Sqr')), cmap='Reds', color_index='UP_Norm', colorbar=True)
    ).opts(
        title_format=f"Weighted meinGrün Target Shapes {TARGETSHAPE_VERSION} for {CITY_NAME} and {LEVEL} based on LBSM user-post-location proximity (based on {len(df)} posts from Flickr, Instagram, and Twitter)",
        #responsive=True,
        finalize_hooks=[set_active_tool],
        width=800,
        height=480,
        xlim=(bb_bottomleft[0],bb_topright[0]),
        ylim=(bb_bottomleft[1],bb_topright[1])
    )
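
Note that the Proj(init=...) / pyproj.transform style used above is deprecated in recent pyproj releases. If you run a newer pyproj, an equivalent sketch using the Transformer API (always_xy=True keeps the lng/lat argument order) would be:

In [ ]:
# Sketch for newer pyproj versions: project the bounding box from
# WGS84 (EPSG:4326) to Web Mercator (EPSG:3857) with the Transformer API.
from pyproj import Transformer

transformer = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)
bb_bottomleft = transformer.transform(LIM_LNG_MIN, LIM_LAT_MIN)
bb_topright = transformer.transform(LIM_LNG_MAX, LIM_LAT_MAX)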

Export Results (shapefile, CSV)

Have a look at the header first:

In [123]:
gdf.head()
Out[123]:
OBJECTID PARENT_TARGET_green_target LULC_ID geometry UP UP_Instagram UP_Flickr UP_Twitter UD UPL area_skm area_skm_SR25 area_skm_SR50 UP_Norm UP_Norm_Origin UP_Norm_nb UP_Norm_Origin_nb popularity
0 1 False dd_OSM_57149_5_22 (POLYGON ((393349.4458630043 5652514.374407471... 0 0 0 0 0 0 0.001262 0.010003 0.022494 1.0 1.0 0 0 no data
1 283517 False dd_EBK-TSP-OSM_27937_2_3 (POLYGON ((401577.7054682793 5656972.97522185,... 0 0 0 0 0 0 0.000088 0.003105 0.010027 1.0 1.0 0 0 no data
2 116638 False dd_OSM_54537_2_1 (POLYGON ((407212.8660183108 5650545.754251033... 0 0 0 0 0 0 0.000030 0.002544 0.008978 1.0 1.0 0 0 no data
3 73544 False dd_EBK-TSP-OSM_30836_1_19 (POLYGON ((420314.3206534541 5656666.771566805... 0 0 0 0 0 0 0.000092 0.003036 0.009900 1.0 1.0 0 0 no data
4 192364 False dd_EBK-TSP-OSM_31363_18_1 (POLYGON ((413028.7688842474 5652131.050560708... 0 0 0 0 0 0 0.000480 0.007524 0.018478 1.0 1.0 0 0 no data

Select columns to export

In [124]:
sel_cols = [TARGET_SHAPE_ID_COL, 'geometry', 'UP', 'UP_Flickr', 'UP_Twitter', 'UP_Instagram', 'UD', 'UPL', 'UP_Norm', 'UP_Norm_Origin', 'popularity']

Write to Shapefile

In [125]:
# UD and UPL are not part of the shapefile schema defined below; drop them if present
if 'UD' in sel_cols:
    sel_cols.remove('UD')
if 'UPL' in sel_cols:
    sel_cols.remove('UPL')
print(sel_cols)
gdf_filtered_columns = gdf_temp[sel_cols].copy()

# define column types manually
# gdf_filtered_columns = gdf[sel_cols].copy()
# note that for clarity we rename UP to UPL (UserPostLocation) here
# gdf_filtered_columns.rename(columns = {'UP':'UPL'}, inplace = True)
filename = f"{TODAY}_meingruenshapes_{TARGETSHAPE_VERSION}_weighted_{CITY_NAME.lower()}_{LEVEL}"
schema = {'geometry': 'Polygon',
          'properties': {
              TARGET_SHAPE_ID_COL: 'str',
              'UP': 'int',
              'UP_Flickr': 'int',
              'UP_Twitter': 'int',
              'UP_Instagram': 'int',
              # 'UD': 'int',
              # 'UPL': 'int',
              'UP_Norm': 'float',
              'UP_Norm_Origin': 'float',
              'popularity': 'str'}
         }
gdf_filtered_columns.to_crs({'init': 'epsg:4326'}).to_file(driver = 'ESRI Shapefile', schema=schema, filename= f"{filename}.shp")
['LULC_ID', 'geometry', 'UP', 'UP_Flickr', 'UP_Twitter', 'UP_Instagram', 'UP_Norm', 'UP_Norm_Origin', 'popularity']
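
Keep in mind that the Shapefile format truncates attribute names to 10 characters, so columns such as UP_Norm_Origin and UP_Instagram will be shortened in the exported .dbf. If that is an issue, a sketch exporting to GeoPackage instead (no manual schema needed, column names are preserved):

In [ ]:
# Optional sketch: export to GeoPackage to avoid the 10-character
# field-name limit of the Shapefile format.
gdf_filtered_columns.to_crs({'init': 'epsg:4326'}).to_file(
    f"{filename}.gpkg", driver="GPKG")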

Write to CSV

In [93]:
if 'geometry' in sel_cols:
    sel_cols.remove('geometry')
sel_headers = [TARGET_SHAPE_ID_COL,'UP', 'UP_Flickr', 'UP_Twitter', 'UP_Instagram', 'UD', 'UPL', 'UP_Norm', 'UP_Norm_Origin', 'popularity']

# keep the header list aligned with the exported columns
if 'UD' not in sel_cols:
    sel_headers.remove('UD')
if 'UPL' not in sel_cols:
    sel_headers.remove('UPL')
gdf.to_csv(f'{filename}.csv',
           columns=sel_cols, header=sel_headers,
           index=False)
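
As a final optional check (sketch), you can read the CSV back and compare the popularity class counts with the GeoDataFrame that was exported:

In [ ]:
# Optional sketch: re-read the exported CSV and compare class counts.
csv_check = pd.read_csv(f'{filename}.csv')
print(csv_check['popularity'].value_counts())
print(gdf['popularity'].value_counts())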
In [ ]: