This notebook demonstrates basic operations on privacy-aware HLL data: calculating cardinalities, unions, and intersections of HLL sets.
The data is provided as CSV files in the privacy-aware hll structure format.
The hll structure is a convention for systematically converting the raw lbsn structure to a privacy-aware format based on HLL.
The convention introduces two basic components: bases and overlays. A base defines the context of measurement, for example a hashtag, a coordinate, or a date.
The overlay defines what is measured, which can be, for example, the number of users (user_hll), the number of posts (post_hll), or the number of userdays (date_hll). Overlays are stored as HyperLogLog (HLL) sets, which contain statistically approximated versions of the original data. An approximation of counts can be obtained by calculating the cardinality of an HLL set.
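As a minimal illustration of the concept, the following sketch uses the datasketch package (an assumption for demonstration only; it is not part of this notebook's toolchain, which relies on the Postgres hll extension instead). An HLL set accepts elements and returns an approximate distinct count, while the original elements cannot be read back:
from datasketch import HyperLogLog

hll = HyperLogLog()
for user_id in ["user_a", "user_b", "user_a"]:
    # elements are added as bytes; duplicates do not increase the count
    hll.update(user_id.encode("utf8"))

# approximate number of distinct elements (close to 2); the original
# user ids cannot be recovered from the structure
print(hll.count())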
The data in this notebook consists of:
- bases
- overlays
import os
import geoviews as gv
import holoviews as hv
import pandas as pd
import numpy as np
import warnings
from typing import List
with warnings.catch_warnings():
# prevent missing glyph warnings
warnings.simplefilter("ignore")
import matplotlib.pyplot as plt
from pathlib import Path
from collections import namedtuple
from geoviews import opts
from holoviews import dim
from cartopy import crs as ccrs
from IPython.display import display, Markdown
# init bokeh
from modules import preparations
preparations.init_imports()
pd.set_option('display.max_colwidth', 50)
There are four CSV files in this directory:
display(pd.DataFrame(Path('.').glob('*.csv')))
Size of a sample CSV file:
hll_size_mb = Path("sunrise-hashtag-ranking.csv").stat().st_size / (1024*1024)
print(f"Size: {hll_size_mb:.2f} MB")
from modules.tools import display_header_stats, print_file_list
display(Markdown("**Data Files:**"))
files = list(Path('.').glob('*.csv'))
print_file_list(files)
display(Markdown("**Content preview:**"))
display_header_stats(files)
To perform HLL calculations, we connect to a Postgres database with the citus hll extension installed:
Define connection parameters:
db_user = "postgres"
db_pass = os.getenv('POSTGRES_PASSWORD')
db_host = "hlldb"
db_port = "5432"
db_name = "hlldb"
Connect:
import psycopg2
db_connection = psycopg2.connect(
host=db_host,
port=db_port,
dbname=db_name,
user=db_user,
password=db_pass
)
db_connection.set_session(readonly=True)
Wrap the connection in a helper class for easier access:
from modules import tools
db_conn = tools.DbConn(db_connection)
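For reference, the DbConn helper can be thought of as a thin wrapper around pandas.read_sql; this is a rough sketch, not the actual implementation from modules/tools.py:
import pandas as pd

class DbConnSketch:
    """Rough sketch of the DbConn wrapper (hypothetical)."""
    def __init__(self, db_connection):
        self.db_connection = db_connection

    def query(self, sql_query: str) -> pd.DataFrame:
        """Run a SQL query and return the result as a DataFrame."""
        return pd.read_sql(sql_query, con=self.db_connection)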
Test connection:
db_query = """
SELECT 1;
"""
db_conn.query(db_query)
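As an optional sanity check (not part of the original workflow), we can verify that the hll extension is available by querying the Postgres system catalog:
# optional: confirm that the hll extension is installed
db_query = """
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'hll';
"""
db_conn.query(db_query)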
Let's have a look at the first two records in sunrise-hashtag-ranking.csv. We'll only use the columns hashtag, postcount and post_hll.
df = pd.read_csv(
"sunrise-hashtag-ranking.csv", nrows=2, usecols=["hashtag", "postcount", "post_hll"])
display(df)
postcount is the cardinality (the number of distinct elements) of the HLL set (post_hll). This column was precalculated while creating the CSV. We can calculate the same number from the HLL sets in our CSV:
db_query = f"""
SELECT hll_cardinality(post_hll)::int as postcount
FROM (VALUES
('{df["post_hll"][0]}'::hll),
('{df["post_hll"][1]}'::hll)
) s(post_hll)
"""
db_conn.query(db_query)
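These estimates should match the precomputed postcount column we loaded from the CSV:
# compare with the precomputed cardinalities from the CSV
display(df[["hashtag", "postcount"]])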
Union
Above, the function hll_cardinality(hll) is used to calculate the count of distinct elements. The simple sum of these two post counts is 4855660 + 1450241 = 6305901.
Another function, hll_union_agg(), aggregates HLL sets into their union. Below, we union the two sets of posts that used either the hashtag nature or sunrise.
db_query = f"""
SELECT hll_cardinality(hll_union_agg(s.post_hll))::int
FROM (VALUES
('{df["post_hll"][0]}'::hll),
('{df["post_hll"][1]}'::hll)
) s(post_hll)
"""
df = db_conn.query(db_query)
df.head()
Every post is counted only once. Since a post can carry both the hashtag "nature" and the hashtag "sunrise", the union is smaller than the simple sum of the two counts; the difference is the estimated intersection. Sunrise and nature were used together in about 1,415,249 posts (22%).
Intersection
This is already a demonstration of intersection, as is emphasized in the documentation of the citus hll implementation:

> Using the inclusion-exclusion principle and the union function, you can also estimate the intersection of sets represented by hlls.
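Expressed as a formula, the estimated intersection of two HLL sets A and B is |A ∩ B| = |A| + |B| − |A ∪ B|. A minimal sketch in plain Python (union_count is a placeholder for the union cardinality returned by the previous query):
def estimate_intersection(card_a: int, card_b: int, card_union: int) -> int:
    """Estimate intersection cardinality via the inclusion-exclusion principle."""
    return card_a + card_b - card_union

# e.g. with the two post counts from above:
# estimate_intersection(4855660, 1450241, union_count)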
The following graphic illustrates this ability:
We can make this calculation process a little more convenient.
Let's look at another CSV file, instagram-sunrise-dataset_hashtag_latlng.csv, which contains the bases hashtag and lat/lng. This time, we look at usercount instead of postcount.
df = pd.read_csv(
"instagram-sunrise-dataset_hashtag_latlng.csv", nrows=2, usecols=["hashtag", "latitude", "longitude", "usercount", "user_hll"])
display(df)
Plot points on a map:
points_lonlat = gv.Points(
df,
kdims=['longitude', 'latitude'], vdims=['usercount', 'hashtag'])
tools.get_point_map(
points_lonlat,
dim_col='usercount',
title='Estimate User Count per coordinate and hashtag')
The two locations are the highest-ranked locations for sunrise in our dataset, based on total usercount.
Intersection question: how many users visited both of these locations and used the hashtag sunrise?
db_query = f"""
SELECT hll_cardinality(hll_union_agg(s.user_hll))::int
FROM (VALUES
('{df["user_hll"][0]}'::hll),
('{df["user_hll"][1]}'::hll)
) s(user_hll)
"""
df_1 = db_conn.query(db_query)
df_1.head()
db_query = f"""
SELECT
hll_cardinality('{df["user_hll"][0]}')::int +
hll_cardinality('{df["user_hll"][1]}')::int -
hll_cardinality(hll_union('{df["user_hll"][0]}', '{df["user_hll"][1]}'))::int as intersectioncount,
hll_cardinality(hll_union('{df["user_hll"][0]}', '{df["user_hll"][1]}'))::int as unioncount,
hll_cardinality('{df["user_hll"][0]}')::int as la_count,
hll_cardinality('{df["user_hll"][1]}')::int as ny_count
"""
df_2 = db_conn.query(db_query)
df_2["intersection_percentage"] = df_2.intersectioncount / df_2.unioncount * 100
display(df_2.head())
display(Markdown((
f'<div class="alert alert-success"> <b> {df_2["intersection_percentage"][0]:,.2f} % '
f'of users visited both NY and LA and shared a picture with the hashtag sunrise'
f'</b></div>')))
Since such a low overlap was to be expected, let's have a look at another example intersection. We'll load the top-ranked hashtag/location records and select only those that appeared within a bounding box around location -8.646196,115.180664 (Bali).
df = pd.read_csv(
"instagram-sunrise-dataset_hashtag_latlng.csv",
nrows=20, usecols=["hashtag", "latitude", "longitude", "usercount", "user_hll"])
tools.filter_df_bbox(df, bbox=(114.884171,-8.801279,115.405678,-8.552843))
display(df)
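For reference, the bounding box filter can be imagined roughly like the sketch below (hypothetical; the actual helper lives in modules/tools.py). The bbox is assumed to be given as (lng_min, lat_min, lng_max, lat_max), matching the call above:
import pandas as pd

def filter_df_bbox_sketch(df: pd.DataFrame, bbox: tuple):
    """Hypothetical sketch of a bounding box filter that drops rows in place."""
    lng_min, lat_min, lng_max, lat_max = bbox
    mask = (
        df["longitude"].between(lng_min, lng_max)
        & df["latitude"].between(lat_min, lat_max)
    )
    # drop rows outside the bbox in place, mirroring how tools.filter_df_bbox
    # is used above (its return value is not assigned)
    df.drop(df[~mask].index, inplace=True)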
points_lonlat = gv.Points(
df,
kdims=['longitude', 'latitude'], vdims=['usercount', 'hashtag'])
tools.get_point_map(
points_lonlat,
dim_col='usercount',
title='Estimate User Count per coordinate and hashtag',
global_extent=True)
Two different hashtags were used at the same location by a number of users. How many users used both of these hashtags?
def compile_intersection_union_sql(hll_set_a, hll_set_b, record_a_name: str, record_b_name: str):
    """Compile SQL that estimates the intersection, union and individual counts
    of two HLL sets via the inclusion-exclusion principle."""
    db_query = f"""
    SELECT
        hll_cardinality('{hll_set_a}')::int +
        hll_cardinality('{hll_set_b}')::int -
        hll_cardinality(hll_union('{hll_set_a}', '{hll_set_b}'))::int as intersectioncount,
        hll_cardinality(hll_union('{hll_set_a}', '{hll_set_b}'))::int as unioncount,
        hll_cardinality('{hll_set_a}')::int as {record_a_name}_count,
        hll_cardinality('{hll_set_b}')::int as {record_b_name}_count
    """
    return db_query
db_query = compile_intersection_union_sql(
hll_set_a=df["user_hll"][0],
hll_set_b=df["user_hll"][1],
record_a_name=df["hashtag"][0],
record_b_name=df["hashtag"][1])
df_2 = db_conn.query(db_query)
df_2["intersection_percentage"] = df_2.intersectioncount / df_2.unioncount * 100
display(df_2.head())
display(Markdown((
f'<div class="alert alert-success"> <b>{df_2["intersection_percentage"][0]:,.2f} % '
f'of users used both hashtags, "sunrise" and "bali", at location '
f'{df["latitude"][0]} {df["longitude"][0]}'
f'</b></div>')))