Privacy-aware topical explorations (emoji/hashtag) using Sunrise-dataset (Instagram)

This notebook demonstrates:

  • visualizations of topical data across spatial dimension
  • how to perform intersections of hll sets
  • how to union hll sets

The data is providedas CSV files in the privacy-aware hll structure format.

The hll structure is more or less a convention for systematically converting the raw lbsn structure to a privacy-aware format that is based on HLL.

The convention introduces two basic components:

  1. the overlay and
  2. the base

The overlay defines what is measured, which can be, for example, number of users (user_hll), number of posts (post_hll), or number of userdays (date_hll). Overlays are stored as HyperLogLog (HLL) sets, which contain statistically approximated versions of original data. It is possible to get an approximation of counts by calculating the cardinality of a HLL Set.

The data in this notebook consists of:

bases

  • emoji
  • hashtag
  • hashtag_latlng
  • emoji_latlng

overlay

  • postcount
  • userdays
  • usercount

Preparations

Load dependencies

In [1]:
import os
import geoviews as gv
import holoviews as hv
import pandas as pd
import numpy as np
import warnings
from typing import List
with warnings.catch_warnings(): 
    # prevent missing glyph warnings
    warnings.simplefilter("ignore")
    import matplotlib.pyplot as plt
from pathlib import Path
from collections import namedtuple
from geoviews import opts
from holoviews import dim
from cartopy import crs as ccrs
from IPython.display import display, Markdown
# init bokeh
from modules import preparations
preparations.init_imports()
pd.set_option('display.max_colwidth', 50)

Load data from CSV

There're 4 CSV files in this directory:

In [4]:
display(pd.DataFrame(Path('.').glob('*.csv')))
0
0 sunrise-hashtag-ranking.csv
1 instagram-sunrise-dataset_emoji_latlng.csv
2 instagram-sunrise-dataset_hashtag_latlng.csv
3 instagram-sunrise-dataset-totalstats.csv

Size of a sample csv:

In [5]:
hll_size_mb = Path("sunrise-hashtag-ranking.csv").stat().st_size / (1024*1024)
print(f"Size: {hll_size_mb:.2f} MB")
Size: 51.83 MB

Preview CSV contents

Bases (the context for metrics) are highlighted in green

</br>

The cardinality (the count of distinct elemements) of the hll sets

</br>

Overlays (hll metrics) are highlighted in yellow
In [6]:
from modules.tools import display_header_stats, print_file_list
display(Markdown("**Data Files:**"))
files = list(Path('.').glob('*.csv'))
print_file_list(files)
display(Markdown("**Content preview:**"))
display_header_stats(files)

Data Files:

.
├── sunrise-hashtag-ranking.csv
├── instagram-sunrise-dataset_emoji_latlng.csv
├── instagram-sunrise-dataset_hashtag_latlng.csv
└── instagram-sunrise-dataset-totalstats.csv

Content preview:


sunrise-hashtag-ranking.csv
(10,000 total records)

hashtag postcount userdays usercount post_hll date_hll user_hll
0 sunrise 4855660 804368 529073 \x148b7f7358c6b1909b56d6a \x148b7f3ada95a54b535294a \x148b7f529696a18849cc74a
1 nature 1450241 213420 124831 \x148b7f4b58c6b12a9a56952 \x148b7f3a5082a4e65352949 \x148b7f51d2969908394a741
2 sun 1238720 188092 119973 \x148b7f4ad4c6b14a6256952 \x148b7f3ad092a4e7534c93a \x148b7f39d69398c839cc639
3 sky 1302061 192389 115208 \x148b7f4b58c6ad506256962 \x148b7f3a4a7324e6434c939 \x148b7f51d69698c8390c639
4 sunset 1282107 198308 106364 \x148b7f5b58c5b18a6256952 \x148b7f3a4e8324e75352939 \x148b7f399093990539cc739


instagram-sunrise-dataset_emoji_latlng.csv
(10,000 total records)

emoji latitude longitude postcount userdays usercount post_hll date_hll user_hll
0 🌅 40.714200 -74.006400 239 239 199 \x138b7f00a101c202c203c20 \x138b7f02610286030306e10 \x138b7f016102c2034204610
1 ❤️ 40.714200 -74.006400 172 161 142 \x138b7f006400a100c101a10 \x138b7f002300c2014202610 \x128b7f807ad7620d72ec8e8
2 🗽 40.714200 -74.006400 156 155 142 \x128b7f83d14ae4e22925de8 \x128b7f80338dae8f1576158 \x128b7f80ff8bbbe3a268ad8
3 ☀️ 40.714200 -74.006400 179 177 140 \x138b7f004100c1012304010 \x138b7f00610101038105a20 \x128b7f805a966bb06d58168
4 🌅 -33.888910 151.260640 183 177 130 \x138b7f00c1014301e102210 \x138b7f00e10241062106e40 \x128b7f807cfb8ebb71b0df8


instagram-sunrise-dataset_hashtag_latlng.csv
(10,000 total records)

hashtag latitude longitude postcount userdays usercount post_hll date_hll user_hll
0 sunrise 40.714200 -74.006400 5464 4724 3319 \x148b7f1086410c4430cc718 \x148b7f0900118ca22886208 \x148b7f11041084a21044100
1 sunrise 34.052200 -118.243000 3499 3168 2273 \x148b7f104c61084018c0201 \x148b7f188400044008c0118 \x148b7f084a3000a22040008
2 sunrise -8.646196 115.180664 2880 2525 1917 \x148b7f000a11940018c2008 \x148b7f00822104011004000 \x148b7f00823008401802000
3 nyc 40.714200 -74.006400 2648 2421 1742 \x148b7f088641004400c0618 \x148b7f09001188410846008 \x148b7f08421084a21040000
4 sunrise 41.383300 2.183330 2237 2036 1650 \x148b7f1000200c000982008 \x148b7f10c00010450804100 \x148b7f08440000220084108


instagram-sunrise-dataset-totalstats.csv
(1 total records)

id postcount userdays usercount post_hll date_hll user_hll
0 1 5032545 832129 553813 \x148b7f735cc6b1909b56d6a \x148b7f3ada95a54b535294a \x148b7f529696a18849cc74a

HLL unions

Connect to hll worker

To be able to perform hll calculations, we connect to a postgres db with citus hll extension installed:

Define connection parameters:

In [7]:
db_user = "postgres"
db_pass = os.getenv('POSTGRES_PASSWORD')
db_host = "hlldb"
db_port = "5432"
db_name = "hlldb"

connect:

In [8]:
import psycopg2
db_connection = psycopg2.connect(
        host=db_host,
        port=db_port,
        dbname=db_name,
        user=db_user,
        password=db_pass
)
db_connection.set_session(readonly=True)

Store in a method for easier access:

In [9]:
from modules import tools
db_conn = tools.DbConn(db_connection)

Test connection:

In [10]:
db_query = """
    SELECT 1;
"""
db_conn.query(db_query)
Out[10]:
?column?
0 1

Example union

Lets have a look at the first two records in sunrise-hashtag-ranking.csv. We'll only use columns hashtag, postcount and post_hll.

In [11]:
df = pd.read_csv(
    "sunrise-hashtag-ranking.csv", nrows=2, usecols=["hashtag", "postcount", "post_hll"])
display(df)
hashtag postcount post_hll
0 sunrise 4855660 \x148b7f7358c6b1909b56d6a9eb5ad8c5ad6f4b5cb731...
1 nature 1450241 \x148b7f4b58c6b12a9a5695212b4a98942d6f4a96a529...

postcount is the cardinality (the number of distinct elements) of the hll set (post_hll).

This column precalculated while creating the CSV. We can calculate the same number from the hll sets from our CSV:

In [12]:
db_query = f"""
SELECT hll_cardinality(post_hll)::int as postcount 
FROM (VALUES
          ('{df["post_hll"][0]}'::hll),
          ('{df["post_hll"][1]}'::hll)
        ) s(post_hll)
"""
db_conn.query(db_query)
Out[12]:
postcount
0 4855660
1 1450241

Union

Above, the function hll_cardinality(hll) is used to calculate the count of distinct elements. The simple sum of these posts is 4855660+1450241=6305901

Another function is to union hll sets: hll_union_agg()

Below, we union the two sets for posts that used either nature or sunrise.

In [13]:
db_query = f"""
SELECT hll_cardinality(hll_union_agg(s.post_hll))::int
FROM (VALUES
          ('{df["post_hll"][0]}'::hll),
          ('{df["post_hll"][1]}'::hll)
        ) s(post_hll)
"""
df = db_conn.query(db_query)
df.head()
Out[13]:
hll_cardinality
0 4890652

Every post will be counted only once. Since a post can have both the hashtag "nature" and the hashtag "sunrise", the sum is smaller than simply adding numbers. Sunrise and nature were used together in about 1,415,249 posts (22%).

In other words: in 22% of posts on Instagram where "sunrise" is used, "nature" is also used. This can be seen as a significant contextual correlation between these words.

Intersection

This is already a demonstration of intersection, as is emphasized in the citus hll implementation:

Using the inclusion-exclusion principle and the union function, you can also estimate the intersection of sets represented by hlls.

The following graphic illustrates this ability:

We can make this calculation process a little bit more conventient.

Lets look at another CSV file, instagram-sunrise-dataset_hashtag_latlng.csv, which contains the bases hashtag and lat/lng.

This time, we look at usercount, not postcount.

In [38]:
df = pd.read_csv(
    "instagram-sunrise-dataset_hashtag_latlng.csv", nrows=2, usecols=["hashtag", "latitude", "longitude", "usercount", "user_hll"])
display(df)
hashtag latitude longitude usercount user_hll
0 sunrise 40.7142 -74.0064 3319 \x148b7f11041084a2104410002218400280601804008c...
1 sunrise 34.0522 -118.2430 2273 \x148b7f084a3000a2204000880128c200040800401008...

Plot points on a map:

In [39]:
points_lonlat = gv.Points(
    df,
    kdims=['longitude', 'latitude'], vdims=['usercount', 'hashtag']) 
tools.get_point_map(
    points_lonlat,
    dim_col='usercount',
    title='Estimate User Count per coordinate and hashtag')
Out[39]:

The two locations are the highest ranked locations for sunrise in our dataset, based on total usercount.

Intersection question: How many users visited both these locations and referenced the "sunrise"?

In [40]:
db_query = f"""
SELECT hll_cardinality(hll_union_agg(s.post_hll))::int
FROM (VALUES
          ('{df["user_hll"][0]}'::hll),
          ('{df["user_hll"][1]}'::hll)
        ) s(post_hll)
"""
df_1 = db_conn.query(db_query)
df_1.head()
Out[40]:
hll_cardinality
0 5512
In [41]:
db_query = f"""
SELECT 
    hll_cardinality('{df["user_hll"][0]}')::int + 
            hll_cardinality('{df["user_hll"][1]}')::int - 
            hll_cardinality(hll_union('{df["user_hll"][0]}', '{df["user_hll"][1]}'))::int as IntersectionCount,
    hll_cardinality(hll_union('{df["user_hll"][0]}', '{df["user_hll"][1]}'))::int as unionCount,
    hll_cardinality('{df["user_hll"][0]}')::int as LA_count,
    hll_cardinality('{df["user_hll"][1]}')::int as NY_count
"""
df_2 = db_conn.query(db_query)
df_2["intersection_percentage"] = df_2.intersectioncount / (df_2.unioncount / 100)
display(df_2.head())
display(Markdown((
    f'<div class="alert alert-success"> <b> {df_2["intersection_percentage"][0]:,.2f} % '
    f'of users visited both NY and LA and shared picture with the hashtag sunrise'
    f'</b></div>')))
intersectioncount unioncount la_count ny_count intersection_percentage
0 80 5512 3319 2273 1.451379
1.45 % of users visited both NY and LA and shared picture with the hashtag sunrise

Since such a low correlation was to be expected, lets have a look at another example intersection. We'll filter the top 100 hashtags, and select only those that appeared at location -8.646196,115.180664 (Bali).

In [42]:
df = pd.read_csv(
    "instagram-sunrise-dataset_hashtag_latlng.csv",
    nrows=20, usecols=["hashtag", "latitude", "longitude", "usercount", "user_hll"])
tools.filter_df_bbox(df, bbox=(114.884171,-8.801279,115.405678,-8.552843))
display(df)
hashtag latitude longitude usercount user_hll
0 sunrise -8.646196 115.180664 1917 \x148b7f00823008401802000024080811040120021080...
1 bali -8.646196 115.180664 1360 \x148b7f00823000400000000024080811000120020080...
In [43]:
points_lonlat = gv.Points(
    df,
    kdims=['longitude', 'latitude'], vdims=['usercount', 'hashtag']) 
tools.get_point_map(
    points_lonlat,
    dim_col='usercount',
    title='Estimate User Count per coordinate and hashtag',
    global_extent=True)
Out[43]:

Two different hashtags used at the same location by a number of users. How many users used both of these hastags?

In [44]:
def compile_intersection_union_sql(hll_set_a, hll_set_b, record_a_name: str, record_b_name: str):
    db_query = f"""
    SELECT 
        hll_cardinality('{hll_set_a}')::int + 
                hll_cardinality('{hll_set_b}')::int - 
                hll_cardinality(hll_union('{hll_set_a}', '{hll_set_b}'))::int as IntersectionCount,
        hll_cardinality(hll_union('{hll_set_a}', '{hll_set_b}'))::int as unionCount,
        hll_cardinality('{hll_set_a}')::int as {record_a_name}_count,
        hll_cardinality('{hll_set_b}')::int as {record_b_name}_count
    """
    return db_query
In [46]:
db_query = compile_intersection_union_sql(
    hll_set_a=df["user_hll"][0],
    hll_set_b=df["user_hll"][1],
    record_a_name=df["hashtag"][0],
    record_b_name=df["hashtag"][1])
df_2 = db_conn.query(db_query)
df_2["intersection_percentage"] = df_2.intersectioncount / (df_2.unioncount / 100)
display(df_2.head())
display(Markdown((
    f'<div class="alert alert-success"> <b>{df_2["intersection_percentage"][0]:,.2f} % '
    f'of users used both hashtags, "sunrise" and "bali", at location '
    f'{df["latitude"][0]} {df["longitude"][0]}'
    f'</b></div>')))
intersectioncount unioncount sunrise_count bali_count intersection_percentage
0 1326 1951 1917 1360 67.965146
67.97 % of users used both hashtags, "sunrise" and "bali", at location -8.6461956812 115.1806640625
In [ ]: