A Privacy-aware data structure: Introduction with YFCC100m

Introduction

Based on data from the YFCC100m dataset, this Notebook introduces the basics and usage of the privacy-aware data structure. The data used here was converted from the raw lbsn structure to its privacy-aware version, which is based, at its core, on the HyperLogLog (HLL) algorithm.

The raw lbsn structure was developed to systematically characterize LBSN data aspects in a common scheme that enables privacy-by-design for connected software, data handling and information visualization. The components of the raw structure are documented here.

The hll structure is, more or less, a convention for systematically converting raw structure components to a privacy-aware format based on HLL. A possible data scheme is proposed here. It is critical to note that not all components of the privacy-aware structure need to be available for all visualizations. Rather, an analyst will need to consider, on a per-visualization basis, which components are needed. Therefore, the proposed structure is not exhaustive and should be treated as a demonstrator that can be extended.

In this Notebook, we introduce the basic process of working with the privacy-aware hll structure. This is the first notebook in a tutorial series of three notebooks:

Additional notes:

Use Shift+Enter to walk through the Notebook

Note: The three notebooks are complete, working examples. However, they are in active development and will be refined as new knowledge becomes available.

Preparations

Load dependencies

In [14]:
import os
import psycopg2 # Postgres API
import geoviews as gv
import holoviews as hv
import pandas as pd
import numpy as np
import warnings
with warnings.catch_warnings(): 
    # prevent missing glyph warnings
    warnings.simplefilter("ignore")
    import matplotlib.pyplot as plt
from collections import namedtuple
from geoviews import opts
from holoviews import dim
from cartopy import crs as ccrs
from IPython.display import display, Markdown
# init bokeh
from modules import preparations
preparations.init_imports()
pd.set_option('display.max_colwidth', 25)

Connect to database

The password is loaded from the .env file specified in the hlldb container setup.

The docker stack contains a full backup of the YFCC database converted to the privacy-aware data structure. In this Notebook, we're only working with a small part of the data, from the table spatial.latlng.

Define credentials as environment variables

In [2]:
db_user = "postgres"
db_pass = os.getenv('POSTGRES_PASSWORD')
# set connection variables
db_host = "hlldb"
db_port = "5432"
db_name = "hlldb"

Connect to an empty Postgres database running the HLL extension. Note that only read-only privileges are needed.

In [3]:
db_connection = psycopg2.connect(
        host=db_host,
        port=db_port,
        dbname=db_name,
        user=db_user,
        password=db_pass
)
db_connection.set_session(readonly=True)

Test connection:

In [4]:
db_query = """
    SELECT 1;
"""
# create pandas DataFrame from database data
df = pd.read_sql_query(db_query, db_connection)
display(df.head())
?column?
0 1
In [5]:
from modules import tools
db_conn = tools.DbConn(db_connection)
db_conn.query("SELECT 1;")
Out[5]:
?column?
0 1

Privacy-aware data introduction

The privacy-aware structure is based on two basic components: 1) the overlay and 2) the base. The overlay defines what is measured, which can be, for example, the number of users (user_hll), the number of posts (post_hll), or the number of userdays (date_hll). Overlays are stored as HyperLogLog (HLL) sets, which contain statistically approximated versions of the original data. It is possible to get an approximation of counts by calculating the cardinality of an HLL set.

Conversely, the base is the context for which measurements are analyzed. The base can be terms, location-terms, dates, coordinates, places, or any combination or more complex query. Herein, our base is lat-lng coordinates. Bases also have privacy implications, particularly when they're accurate enough that only one person is referenced (e.g. a latitude-longitude pair with high accuracy).
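
As an illustration, here is a minimal sketch of how base and overlay appear together in a query against the spatial.latlng table introduced above (column names as used throughout this Notebook):

In [ ]:
# base: the latitude/longitude pair; overlay: the user_hll set
# the cardinality of the overlay yields the estimated user count per base
db_query = """
    SELECT latitude, longitude,             -- base
           hll_cardinality(user_hll)::int   -- overlay metric
    FROM spatial.latlng
    LIMIT 5;
"""
display(db_conn.query(db_query))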

The core aim of the privacy-aware structure is to support visual analytics where no original data is needed. By removing the need for original data, entire process pipelines (such as monitoring) can be converted to privacy-aware versions, without losing quality or validity for interpretation.

For calculating with HLL sets, it is necessary to connect to a Postgres database which runs the Citus HLL extension. The database itself can be empty; it is only needed for calculation. A docker container for an empty Postgres database with the Citus HLL extension is made available here.

First, some statistics for the data we're working with.

In [6]:
%%time
db_query = """
    SELECT count(*) FROM spatial.latlng t1;
"""

display(Markdown(
    f"There're "
    f"<strong style='color: red;'>"
    f"{db_conn.query(db_query)['count'][0]:,.0f}</strong> "
    f"distinct coordinates in this table."))

There're 12,692,723 distinct coordinates in this table.

CPU times: user 3.11 ms, sys: 3.91 ms, total: 7.02 ms
Wall time: 4.19 s

The Flickr YFCC 100M dataset includes 99,206,564 photos and 793,436 videos from 581,099 different photographers, and 48,469,829 of those are geotagged [1].

In the privacy-aware HLL data we're working with, distinct users, posts and dates have been grouped by various bases, such as distinct latitude/longitude pairs. The number of distinct coordinates can be calculated from the table spatial.latlng.

In [9]:
db_query = tools.get_stats_query("spatial.latlng")
stats_df = db_conn.query(db_query)
stats_df["bytes/ct"] = stats_df["bytes/ct"].fillna(0).astype('int64')
display(stats_df)
metric bytes/ct bytes_pretty bytes_per_row
0 core_relation_size 1340825600 1279 MB 105.0
1 visibility_map 0 0 bytes 0.0
2 free_space_map 352256 344 kB 0.0
3 table_size_incl_toast 1341374464 1279 MB 105.0
4 indexes_size 400359424 382 MB 31.0
5 total_size_incl_toast... 1741733888 1661 MB 137.0
6 live_rows_in_text_rep... 1677544434 1600 MB 132.0
7 ---------------------... 0 None NaN
8 row_count 12692723 None NaN
9 live_tuples 0 None NaN
10 dead_tuples 0 None NaN

Data structure preview (get random 10 records):

In [7]:
db_query = "SELECT * FROM spatial.latlng ORDER BY RANDOM() LIMIT 10;"
first_10_df = db_conn.query(db_query)
display(first_10_df)
user_hll post_hll date_hll utl_hll latitude longitude latlng_geom
0 \x138b406c86 \x138b40ae42 \x138b408d41 None -33.890000 151.204333 0101000020E6100000CB0...
1 \x138b403601 \x138b40e362 \x138b40a641 None 39.088408 -74.818696 0101000020E61000006D5...
2 \x138b406725 \x138b409ec1 \x138b405ee1 None 41.020820 28.989486 0101000020E610000090D...
3 \x138b401861 \x138b40e201 \x138b40de02 None 53.358158 -2.561702 0101000020E610000047C...
4 \x138b409202 \x138b4052e2 \x138b40b8e1 None 49.457266 11.077488 0101000020E610000037A...
5 \x138b40e361 \x138b402da1bc41e5a2 \x138b40cda1 None 9.945221 -83.966651 0101000020E610000053E...
6 \x138b405ee1 \x138b403dc145035961b... \x138b401b42 None 60.434695 22.216072 0101000020E6100000D19...
7 \x138b401ca17ee2 \x138b409a26c6c1 \x138b407161dae1 None 37.786666 -122.406834 0101000020E6100000A83...
8 \x138b40a381 \x138b4028813ee347427... \x138b4004411c6147a173e2 None -7.008897 15.308074 0101000020E6100000ED7...
9 \x138b40e461 \x138b401ea2 \x138b40e6e1 None 51.508347 -0.075917 0101000020E6100000DFD...

Did you know that you can view the docstring of a defined function by appending a question mark (?)? This helps us remember what a function does:

In [8]:
db_conn.query?
Signature: db_conn.query(sql_query: str) -> pandas.core.frame.DataFrame
Docstring: Execute Calculation SQL Query with pandas
File:      /home/jovyan/work/alex_dev/yfcc_raw_hll_intro/modules/tools.py
Type:      method

Test the cardinality calculation: hll_add_agg() is used to create an HLL set consisting of 3 distinct sample ids; hll_cardinality() is then applied to retrieve the cardinality of the set (the number of distinct elements).

In [9]:
db_query = """
    SELECT hll_cardinality(hll_add_agg(hll_hash_text(a)))
    FROM (
    VALUES ('24399297324604'), ('238949132532500'), ('269965843254300')
    ) s(a);
"""
# create pandas DataFrame from database data
df = db_conn.query(db_query)
display(df.head())
hll_cardinality
0 3.0

Example Cardinality

Have a look at the 10 random coordinates from our YFCC spatial.latlng table retrieved above:

In [15]:
HllRecord = namedtuple(
    'Hll_record', 'lat, lng, user_hll, post_hll, date_hll')

def strip_item(item, strip: bool):
    if not strip:
        return item
    if len(item) > 120:
        item = item[:120] + '..'
    return item
    

def get_hll_record(record, strip: bool = False):
    """Create a HllRecord (namedtuple) from a DataFrame row"""
    lat = record.get('latitude')
    lng = record.get('longitude')
    user_hll = strip_item(record.get('user_hll'), strip)
    post_hll = strip_item(record.get('post_hll'), strip)
    date_hll = strip_item(record.get('date_hll'), strip)
    return HllRecord(lat, lng, user_hll, post_hll, date_hll)

hll_sample_records = []
for ix, hll_record in first_10_df.iterrows():
    hll_record = get_hll_record(hll_record)
    print(
        f'Coordinate: {hll_record.lat}, {hll_record.lng}, '
        f'Post Count (Hll-Set): {hll_record.post_hll}, '
        f'User Count (Hll-Set): {hll_record.user_hll}')
    hll_sample_records.append(hll_record)
Coordinate: -33.89, 151.204333, Post Count (Hll-Set): \x138b40ae42, User Count (Hll-Set): \x138b406c86
Coordinate: 39.088408, -74.818696, Post Count (Hll-Set): \x138b40e362, User Count (Hll-Set): \x138b403601
Coordinate: 41.02082, 28.989486, Post Count (Hll-Set): \x138b409ec1, User Count (Hll-Set): \x138b406725
Coordinate: 53.358158, -2.561702, Post Count (Hll-Set): \x138b40e201, User Count (Hll-Set): \x138b401861
Coordinate: 49.457266, 11.077488, Post Count (Hll-Set): \x138b4052e2, User Count (Hll-Set): \x138b409202
Coordinate: 9.945221, -83.966651, Post Count (Hll-Set): \x138b402da1bc41e5a2, User Count (Hll-Set): \x138b40e361
Coordinate: 60.434695, 22.216072, Post Count (Hll-Set): \x138b403dc145035961bec1c0a8ee01, User Count (Hll-Set): \x138b405ee1
Coordinate: 37.786666, -122.406834, Post Count (Hll-Set): \x138b409a26c6c1, User Count (Hll-Set): \x138b401ca17ee2
Coordinate: -7.008897, 15.308074, Post Count (Hll-Set): \x138b4028813ee347427221b9a5b9e1c123cec2e481e781f122, User Count (Hll-Set): \x138b40a381
Coordinate: 51.508347, -0.075917, Post Count (Hll-Set): \x138b401ea2, User Count (Hll-Set): \x138b40e461

Let's calculate the cardinality (the estimated count) of these sets:

In [16]:
db_query = f"""
SELECT hll_cardinality(user_hll)::int as usercount 
FROM (VALUES
          ('{hll_sample_records[0].user_hll}'::hll),
          ('{hll_sample_records[1].user_hll}'::hll),
          ('{hll_sample_records[2].user_hll}'::hll),
          ('{hll_sample_records[3].user_hll}'::hll),
          ('{hll_sample_records[4].user_hll}'::hll),
          ('{hll_sample_records[5].user_hll}'::hll),
          ('{hll_sample_records[6].user_hll}'::hll),
          ('{hll_sample_records[7].user_hll}'::hll),
          ('{hll_sample_records[8].user_hll}'::hll),
          ('{hll_sample_records[9].user_hll}'::hll)
        ) s(user_hll)
ORDER BY usercount DESC;
"""

df = db_conn.query(db_query)
# combine with base data; note that this relies on the rows being returned
# in the same order as hll_sample_records (the ORDER BY may reorder them)
df["lat"] = [record.lat for record in hll_sample_records]
df["lng"] = [record.lng for record in hll_sample_records]
df.set_index(['lat', 'lng'])
Out[16]:
usercount
lat lng
-33.890000 151.204333 2
39.088408 -74.818696 1
41.020820 28.989486 1
53.358158 -2.561702 1
49.457266 11.077488 1
9.945221 -83.966651 1
60.434695 22.216072 1
37.786666 -122.406834 1
-7.008897 15.308074 1
51.508347 -0.075917 1

Let's take this apart:

  • the f'' format refers to Python's f-strings, where variables can be substituted within curly brackets, e.g. {hll_sample_records[0].user_hll}, which inserts an HLL set as a string from our sample records
  • by surrounding the HLL set with single quotes, e.g. 'HLL-Set', it is treated as a string in SQL
  • adding ::hll after the string labels the value as the special type hll. This is called type casting, and it doesn't change the content of the value
  • hll_cardinality(user_hll)::int means: for each user_hll, calculate the cardinality (the approximated count of distinct values in the set)
  • since this approximation is a float, we cast it to an int using ::int

The result shows that, except for the first one, each of these coordinates was frequented by only 1 user from the YFCC dataset.

Calculating with HLL Sets

What is interesting here: HLL sets can be combined or intersected. For example, a union of two sets:

In [17]:
db_query = f"""
SELECT hll_cardinality(hll_union_agg(s.a))::int
FROM (
    VALUES ('{hll_sample_records[1].user_hll}'::hll),
           ('{hll_sample_records[2].user_hll}'::hll)
) s(a)
"""
df = db_conn.query(db_query)
df.head()
Out[17]:
hll_cardinality
0 2

Obviously, the users who frequented the two coordinates were different.
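
Intersections are not provided as a native operation by the hll extension. As a rough sketch (added here for illustration, and keeping in mind that the estimation error can be substantial), an intersection count can be derived from unions via the inclusion-exclusion principle, |A ∩ B| ≈ |A| + |B| - |A ∪ B|:

In [ ]:
# a sketch: estimate the intersection of two hll sets via inclusion-exclusion;
# hll_union() is the two-argument (scalar) union function of the extension
db_query = f"""
SELECT hll_cardinality(a)::int
     + hll_cardinality(b)::int
     - hll_cardinality(hll_union(a, b))::int AS intersection_count
FROM (
    VALUES ('{hll_sample_records[1].user_hll}'::hll,
            '{hll_sample_records[2].user_hll}'::hll)
) s(a, b)
"""
db_conn.query(db_query)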

HLL Modes

The Postgres implementation by Citus has several modes, for reasons of performance (a sketch for inspecting the mode of a given set follows the list of modes below):

A hll is a combination of different set/distinct-value-counting algorithms that can be thought of as a hierarchy, along with rules for moving up that hierarchy. In order to distinguish between said algorithms, we have given them names:

EMPTY A constant value that denotes the empty set.

EXPLICIT An explicit, unique, sorted list of integers in the set, which is maintained up to a fixed cardinality.

SPARSE A 'lazy', map-based implementation of HyperLogLog, a probabilistic set data structure. Only stores the indices and values of non-zero registers in a map, until the number of non-zero registers exceeds a fixed cardinality.

FULL A fully-materialized, list-based implementation of HyperLogLog. Explicitly stores the value of every register in a list ordered by register index.
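
A minimal sketch for inspecting which of these modes a set is currently stored in, using hll_print() (with the session still at its default settings, a freshly created set with only two ids should remain in explicit mode):

In [ ]:
# a sketch: hll_print() reveals the storage mode of a set; with default
# settings, a set with only two ids stays in explicit mode, i.e. the
# (murmur-hashed) ids themselves are still stored
db_query = """
    SELECT hll_print(hll_add_agg(hll_hash_text(a)))
    FROM (
        VALUES ('24399297324604'), ('238949132532500')
    ) s(a);
"""
db_conn.query(db_query)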

If used in a privacy-aware context, explicit mode does not provide any improvement, since unique ids are stored fully, which can be reversed. Sparse mode, according to Desfontaines, Lochbihler, & Basin (2018), may be more vulnerable to reverse engineering attacks, since it provides more accurate information on each mapped unique id.

It is possible to tune the hll mode with SELECT hll_set_defaults(log2m, regwidth, expthresh, sparseon); The default values are:

  • DEFAULT_LOG2M 11
  • DEFAULT_REGWIDTH 5
  • DEFAULT_EXPTHRESH -1 - explicit mode on (promotion threshold determined automatically)
  • DEFAULT_SPARSEON 1 - sparse mode enabled

Disable the explicit mode with the following command for the current connection (Note: the output will return the settings before the change):

In [18]:
df = db_conn.query("SELECT hll_set_defaults(11,5, 0, 1);")
df.head()
Out[18]:
hll_set_defaults
0 (11,5,-1,1)

It is also possible to define the exact number of ids in an hll set at which promotion to a full hll set occurs. By default (-1), this is determined automatically based on storage efficiency. Our observation was that promotion to full mode typically occurs at between 150 and 250 unique ids in an hll set.

Set it manually with SELECT hll_set_max_sparse(int);, e.g. to 25 unique IDs:

In [19]:
df = db_conn.query("SELECT hll_set_max_sparse(25);")
df.head()
Out[19]:
hll_set_max_sparse
0 -1

However, note that these settings won't affect any calculations in this notebook, since we're loading already created hll sets from CSV. It is possible to detect with which mode an hll set was created. Below, the first hll set (for the null island) was created using FULL; the other two, containing only one entry, were created using SPARSE because the explicit mode was disabled.

This time, we'll construct the query using Python's list comprehension, which significantly reduces the length of our code:

In [20]:
pd.set_option('display.max_colwidth', 150)
db_query = f"""
SELECT hll_print(s.a)
FROM (
    VALUES {",".join([f"('{record.user_hll}'::hll)" for record in hll_sample_records])}
) s(a)
"""
df = db_conn.query(db_query)
df["lat"] = [record.lat for record in hll_sample_records]
df["lng"] = [record.lng for record in hll_sample_records]
df.set_index(['lat', 'lng']).head()
Out[20]:
hll_print
lat lng
-33.890000 151.204333 COMPRESSED, 1 filled nregs=2048, nbits=5, expthresh=0, sparseon=1:\n 0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
39.088408 -74.818696 COMPRESSED, 1 filled nregs=2048, nbits=5, expthresh=0, sparseon=1:\n 0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
41.020820 28.989486 COMPRESSED, 1 filled nregs=2048, nbits=5, expthresh=0, sparseon=1:\n 0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
53.358158 -2.561702 COMPRESSED, 1 filled nregs=2048, nbits=5, expthresh=0, sparseon=1:\n 0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
49.457266 11.077488 COMPRESSED, 1 filled nregs=2048, nbits=5, expthresh=0, sparseon=1:\n 0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...

HLL Questions

To anticipate some questions or assumptions:

If I consider the HLL string as a unique ID, it is no different than any encrypted string and can be used to re-identify users (e.g. rainbow tables!)

This is not true. Of course, you can treat the binary representation of an hll set as a unique string. However, this string will always refer to a group of users. Who is in this group is defined by two things: first, the re-distribution of characters of each unique ID through the Murmur hash function. Since Murmur is not a cryptographic hash function, this could be reversed. Second, however, HLL removes about 97% of the information from the unique Murmur hash strings; the remaining 3% is then used to extrapolate the likely number of distinct ids in the original set. Thus, the binary representation of an hll set cannot be linked back to original ids.

Desfontaines et al. (2018) point out a possible attack vector: if someone has both the hll set (e.g. access to the database) and the original data, they could attempt to add a unique ID to an existing set; if the set doesn't change, they could assume (e.g. with 60% accuracy) that this user is already in the set. Since we're working with publicly available social media data, such an attack vector appears implausible; using the original data is easier in this case. Still, we've added a measure to impede such an approach by applying a cryptographic hashing function before the Murmur hashing (see the sketch below). Without knowing the parameters of this cryptographic hashing function, someone could not derive the encrypted id that would be needed for such an attack.
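
A minimal sketch of that idea, assuming the pgcrypto extension is available on the connected database; the concrete hashing function and salt used in the actual workflow are not part of this notebook, and 'REPLACE-WITH-SECRET-SALT' is a placeholder:

In [ ]:
# a sketch only: apply a keyed cryptographic hash (HMAC, from pgcrypto) to
# each id before hll_hash_text() applies the (non-cryptographic) murmur hash;
# 'REPLACE-WITH-SECRET-SALT' is a placeholder, not the actual salt
db_query = """
    SELECT hll_cardinality(hll_add_agg(hll_hash_text(
        encode(hmac(s.a, 'REPLACE-WITH-SECRET-SALT', 'sha256'), 'hex'))))
    FROM (
        VALUES ('24399297324604'), ('238949132532500'), ('269965843254300')
    ) s(a);
"""
# requires CREATE EXTENSION pgcrypto on the connected database
# db_conn.query(db_query)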

How were the HLL sets created that are loaded herein from CSV?

By adding ids consecutively to hll sets. In reality, this is a multi-step process done using a number of tools. The process can be summarized in the following steps (a minimal sketch of the consecutive adding follows the list):

  1. Create a HLL database with a fairly broad number of default structures present (see our HLL DB Docker Template).
  2. Write a (Python) script to connect to the respective Social Media API,
  3. retrieve json data for geotagged posts and transform it to the RAW-LBSN structure and, from that, to the HLL-LBSN structure, and
  4. submit it to the HLLdb. This is done in-memory using the package lbsntransform. Thus, for streaming the HLL updates, no original data needs to be stored. The CSV with bases and HLL sets is then exported from the HLL database.
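
As a hedged illustration of the consecutive adding, the following sketch builds an hll set step by step with hll_empty() and hll_add() (both part of the citus hll extension); in the actual process this is handled by lbsntransform:

In [ ]:
# a sketch of consecutive adding: start from an empty hll set and add
# hashed ids one by one; the ids below are arbitrary examples
db_query = """
    SELECT hll_cardinality(
        hll_add(
            hll_add(
                hll_empty(),
                hll_hash_text('user-id-1')),
            hll_hash_text('user-id-2')))::int;
"""
db_conn.query(db_query)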

Why do I need to connect to a database when I only want to perform HLL calculations?

The original algorithm developed by Flajolet et al. (2007) exists in several implementations. After some consideration, we decided on the Postgres implementation by Citus. While there exist a number of ports (Java, JS and Go), none is currently available for Python. This is the main reason why we created the hll-worker database, an empty Postgres container with the HLL extension installed, which can be used for hll calculations independently of the language used. The connection is based on a read-only user, which only allows in-memory calculations.

There are several advantages to such a setup. From a privacy perspective, the main one is based on the separation-of-concerns principle: the HLL worker doesn't need to know the 'bases' (e.g. the coordinates) for which distinct ids are aggregated. In the calculation below (union_hll()), coordinates are replaced by ascending integers; a sketch of this replacement follows this paragraph. Thus, even if someone compromised the hll worker, no contextual information is available to make inferences about, e.g., which places have been visited by particular users. Conversely, the hll worker can apply a cryptographic hash function to ids, using a salt that is only available in the worker, not to the one who is sending the information. This means that neither the user nor the hll worker has all the information necessary to re-construct the final hll sets. Only if both the one who is sending raw data and the hll worker are compromised is it possible to re-create hll sets.
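
A minimal, hypothetical sketch of this replacement: the coordinate bases are swapped for ascending integers on the client side before hll sets are sent to the worker:

In [ ]:
# a hypothetical sketch: replace coordinate bases by ascending integers
# before sending hll sets to the hll worker; the worker only ever sees
# the integer keys, never the coordinates
coordinate_hll = {
    (record.lat, record.lng): record.user_hll
    for record in hll_sample_records}
# payload for the worker: integer key -> hll set
worker_payload = dict(enumerate(coordinate_hll.values()))
# the mapping back to coordinates stays on the client side only
key_to_coordinate = dict(enumerate(coordinate_hll.keys()))
print(list(worker_payload.items())[0])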

This is perhaps beyond the security measures that are necessary when working with data that is publicly available anyway, such as the data used herein. However, these measures come at nearly no cost, which is why we consider this the default setup suited to most users.

While using HLL for the overlay (distinct IDs), aren't the bases (coordinates, for example) also of relevance to privacy?

Yes, to some degree. The privacy-aware structure makes a clear distinction between distinct ids, as overlays, which directly reference single users. Conversely, bases may or may not refer to single users (we call this contextual referencing). However, for such contextual referencing, additional ('contextual') data must be taken into account to allow re-identification of users. For example, one may look up a single coordinate pair in the original Flickr YFCC dataset to find out who used it to geotag their photographs. Some bases can be formatted to make contextual referencing harder, such as by coarsening the granularity of information (a sketch follows below). Furthermore, since relations are dissolved in the privacy-aware structure, no additional information or query is available beyond the count metrics.
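
A minimal sketch of such coarsening (added here for illustration, not part of the original pipeline), rounding the coordinate base of the sample records retrieved earlier:

In [ ]:
# a sketch: coarsen the coordinate base by rounding to two decimal places
# (roughly 1 km at the equator), making contextual referencing of single
# users via exact coordinates harder
df_coarse = first_10_df.copy()
df_coarse["latitude"] = df_coarse["latitude"].round(2)
df_coarse["longitude"] = df_coarse["longitude"].round(2)
display(df_coarse[["latitude", "longitude"]].head())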

This means that the hll structure offers very limited options for analyzing data in ways other than those it was created for. This prevents re-use of datasets in other contexts, such as appears possible when data is compromised by someone with different aims. On the other hand, Social Media data is often publicly available anyway, and the hll structure cannot prevent attacks that compromise people's privacy by using the original data from Social Media, which is sometimes still available.

RAW structure components

[Add Section] - see overview of raw structure components at lbsn.vgiscience.org/structure/

HLL structure components

[Add Section] - see overview of hll structure components at gitlab.vgiscience.de/lbsn/structure/hlldb/

topical facet

[Update Section with basic visualization examples]

term base

In [21]:
facet = "topical"
base = "term"
samples = 10

Get data from topical.term:

In [22]:
def get_samples(facet: str, base: str, samples: int):
    """Get samples from db"""
    db_query = f"""
        SELECT * FROM {facet}.{base} LIMIT {samples};
    """
    df = db_conn.query(db_query)
    return df
df = get_samples(facet=facet, base=base, samples=samples)
display(df.head())
user_hll post_hll latlng_hll place_hll date_hll term
0 \x128b7fcfda7373d4ac49ac0dc04b57f3a2953e1197fd4a8c8c7cfa69f87203a5e54731 \x128b7ff6a9ae084cd38677f9b66415b1fefd1e0a32e14d478c9d716980577330517b7a \x128b7f858faf642a3d1598 \x118b7f \x128b7f8d6a7b8a762b330298e6603e14a06863de6a68d629709a487eba89d057233bfb r0011144jpg
1 \x128b7f82ead4d62107b3e496c727fdc751343f998f82329fed0f309e32606bca6d9906a3609f3d06de85ddc4debc22aecc2650d272bce88ceb25aad442aa50b05a4e26d72acb8d2b... \x128b7f8780924442400ade87a46a0d384e40aba0c495da538dd0efaac703f9f7bd6d4fb4e61d7a097b464aba8c789fd285951abd054871db998892c1540ffdffb9c940c570d8d961... \x128b7f858faf642a3d1598 \x118b7f \x128b7f868cd90a7e9be030899f3446cbd759828c31dcccaaa35d008d56dab004e5bc6d966b996512694ef69af238b5ee0705669fc3f774ae36adefabe261c0ab55e277ac6dc45cca... citaat
2 \x128b7fb378865552b5a53f \x128b7fb665d40e41b91f80f78342f3d70a96991ce1fa28153b957d1fb9d34f4898b6612cde510260083dd67f58214b2194e32a \x128b7f858faf642a3d1598 \x118b7f \x128b7f1787aeb543911236 invarveli
3 \x128b7fa01caf2822618f05e7f2ace926fa85ab273f1482ecd92952 \x128b7f83c819e6c570c4a59477ed9877a211eedf8c3afa2b8d7a802c5756273ea608964d2682781a10af8350f7461b872f709868b29ee7081f0ecc6f9489966d300b98 \x128b7f858faf642a3d1598 \x118b7f \x128b7fb068cb6585110d4cdd90e10bdc6f541c7631cd169b9326fc goccod
4 \x128b7fb89cc948f47f1880bb6b639233157ec5bdbbfd3cab69344cc05302382fb2a40734a1ffd679d080c6619b4c2f43b0e35272e76a1f42a96e2b \x128b7fc15f46a9af06a08ac972c80f640e0c9fd99039a62726c8180df39974bb4194e21f2c53e19c01697e492616cb083a03cb6861737e6f4cfcd7 \x128b7f858faf642a3d1598 \x118b7f \x128b7fde6a68d629709a48eaaebf65752b176e112c6bddb11df8dd245891399ead4e35619c463460b29d286e23ebf2496d3e9e p5050069
In [23]:
def list_components(df, facet: str = None, base: str = None):
    """Formatted display of structure components (overlays and bases)"""
    comp = [
        '**Base**',
        '**Overlay(s) (hll metrics)**',
        '**Additional (supplementary base information)**']
    hll_cols = []
    if base is None:
        bases = []
    else:
        bases = [base]
    additional = []
    for col in df.columns:
        if col.endswith("_hll"):
            hll_cols.append(col)
            continue
        if base is None:
            # no base given: treat all non-hll columns as bases
            bases.append(col)
        elif col not in bases:
            additional.append(col)
    display(Markdown(f'### <u>{facet}.{base}</u>'))
    for ix, li in enumerate([bases, hll_cols, additional]):
        if not li:
            continue
        display(Markdown(f'{comp[ix]}: {", ".join([f"*{li_}*" for li_ in li])}'))

list_components(df, facet=facet, base=base)        

topical.term

Base: term

Overlay(s) (hll metrics): user_hll, post_hll, latlng_hll, place_hll, date_hll

In [24]:
def get_cardinality(
    facet: str, base: str, metric: str, samples: int):
    """Get cardinality of samples from db"""
    db_query = f"""
    SELECT hll_cardinality({metric})::int, {base}
    FROM {facet}.{base} LIMIT {samples};
    """
    df = db_conn.query(db_query)
    df = df.set_index([base])
    df.sort_values(
        by="hll_cardinality", ascending=False, inplace=True)
    return df
In [25]:
df = get_cardinality(
    metric="user_hll", facet=facet, base=base, samples=samples)
df.head()
Out[25]:
hll_cardinality
term
citaat 25
p5050069 7
citaten 5
r0011144jpg 4
goccod 3
In [26]:
def plot_df_bar(
    df: pd.DataFrame, title: str, ylabel: str, xlabel: str):
    """Generate barplot from df"""
    plt.figure(figsize=(15,10))
    cmap = plt.cm.tab10
    colors = cmap(np.arange(len(df)) % cmap.N)
    df["hll_cardinality"].plot.bar(color=colors)
    plt.xticks(rotation=50)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.show()
plot_df_bar(
    df, title=f"{samples} sample records from {facet}.{base}",
    ylabel="Number of distinct users (estimated)", xlabel=f"{base}")

hashtag base

In [27]:
base = "hashtag"
In [28]:
df = get_samples(facet=facet, base=base, samples=samples)
display(df.head())
user_hll post_hll latlng_hll place_hll date_hll hashtag
0 \x128b7f78de33c8e92a36a7 \x128b7f6fc8bdda303509f5 \x128b7f858faf642a3d1598 \x118b7f \x128b7f10a38f1ce7e29cc4 leidsuplein
1 \x128b7f86546a53c0322dfca48c61f45c80f281bb2e743e52c44c140b052cfeaf0b2dd8394e81ae1c9380814a25c4786414a701 \x128b7f836ef086840b137d88ccb44dd2cc850d926a49da47dce97694c0434d75a909fe962e0e65b59d5661a1685523a86d6582adbfdecd078d9b9bb0da3db1506a7756ba7008af66... \x128b7f858faf642a3d1598 \x118b7f \x128b7fad057c5c0b0a3dcfc797103bbf1ca6d6c88466743740f206cbfbf30c9cbf2e50100351d70d0246f138317a69ab04f30e39d7fe8aa062e4f2696d0456d4c7fa1d 2008!
2 \x128b7f359bc6dc91c0f831 \x128b7fbd9c0e9643fe8ce0c1d90de0daafcb0b \x128b7f858faf642a3d1598 \x118b7f \x128b7f6bd2bf83e5a47dea pig-iron
3 \x128b7fecf2131cca152a8c \x128b7f1b01fd7d447b7bde \x128b7f858faf642a3d1598 \x118b7f \x128b7ff4b9a7a66d02c196 tunturihotel
4 \x128b7f98d9908bb75563f106472c3997eb37db08829a39e43f04e3245ca1477a015b953c5a4eeba15ac5ad \x128b7f9ce44e7463e61a67b6966ef52569fbf5be7eefee3c71e84abf8585f3afd60bb5cb3c4a2818dc74a0cfcc244cbed84297d18b43bc0c977ffed419907b5993e53bdad20b4eec... \x128b7f858faf642a3d1598 \x118b7f \x128b7fb018bd604c767ee0d05117a5acaa764be12178f9bf9de900eaa48eed9ef98f6beb962b8de919823d02fb1b734b3a439305a0fbf27a9e92b51267d85b62b87a9b31e376b7ec... voulkos
In [29]:
df = get_cardinality(
    metric="user_hll", facet=facet, base=base, samples=samples)
In [30]:
df.head()
Out[30]:
hll_cardinality
hashtag
whitebox 14
2008! 6
voulkos 5
luluzinhacamprj 2
leidsuplein 1
In [31]:
plot_df_bar(
    df, title=f"{samples} sample records from {facet}.{base}",
    ylabel="Number of distinct users (estimated)", xlabel=f"{base}")

temporal facet

date base

In [32]:
facet = "temporal"
base = "date"
samples = 10
In [33]:
df = get_samples(facet=facet, base=base, samples=samples)
display(df.head())
user_hll post_hll date name
0 \x148b7f110642804108c2620c2128061208811084010c81214a52104111041210242882310443308a038460208221882119043110c90084220c21088a038482088243844208c82104... \x148b7f214a6210c97148721d2539483390e4390c7390a431463495062148819485414ac29ca5310662192831ca321065298843108439c84294a5224a4414e621883418e32a0c730c... 2013-11-24 None
1 \x128b7fee86ce32fc96323d7f90a98d21b158b4 \x128b7fd64146406325d6f74a3656d5cf9c7f3b 1920-08-01 None
2 \x148b7f10843008820946421821214c518c8010c41198263148220c631102421c6128c8719821288a238485210240844310c2431063108a418841108431a04200c631084520842104... \x148b7f210c6310e5294e4294a428ce421925314a629067310a431ca5314e6294a631504314853908721885410c639c8921884294c639ce62948531485318a9214a82206939ce4210... 2012-02-19 None
3 \x148b7f29c6310882188c310c42390a008866190a111042204402a04418ca209461108a720c6208c623804020845184662088321cc1088422886108ca60888130c0111463208a330c... \x148b7f218c8318a641886294c65ac88294c739c8631104194c6290a421884694a64908751c8520ca54a484310c4210c5394c551486314a6318c430ce6290c619c85320c53106a210... 2008-06-15 None
4 \x128b7fc3ca722da09fadb761d45c08794653bf \x128b7f93fde0c05f8a1087ac1cf543968ebad3afa22ccae743a018bf76e50955b68300ce9b5701a148ef65e1392d10cd1a17f4fc529a74ad6b9b50098d0222c12dafda35c0fab8e3... 2015-07-22 None
In [34]:
df = get_cardinality(
    metric="user_hll", facet=facet, base=base, samples=samples)
display(df.head())
hll_cardinality
date
2010-01-01 10093
2012-02-04 7832
2008-06-15 7772
2010-03-14 7451
2012-02-19 7420
In [35]:
plot_df_bar(
    df, title=f"{samples} sample records from {facet}.{base}",
    ylabel="Number of distinct users (estimated)", xlabel=f"{base}")

spatial facet

latlng base

In [36]:
facet = "spatial"
base = "latlng"

Get sample data for Dresden:

In [37]:
db_query = f"""
    SELECT latitude, longitude, hll_cardinality(post_hll)::int
    FROM {facet}.{base} t1
    WHERE t1.latlng_geom
        && -- intersects
        ST_MakeEnvelope (
         13.46475261, 50.90449772, -- bounding 
         14.0876945, 51.24305035, -- box limits (Dresden)
         4326)
"""
df = db_conn.query(db_query)

display(df.head())
latitude longitude hll_cardinality
0 50.905266 14.044046 1
1 50.905567 13.667014 1
2 50.905739 14.067306 1
3 50.905982 13.665329 1
4 50.906065 13.942826 1

Create geoviews point layer:

In [46]:
points_lonlat = gv.Points(
    df,
    kdims=['longitude', 'latitude'], vdims=['hll_cardinality']) 

Plot points on map:

In [51]:
def set_active_tool(plot, element):
    """Enable wheel_zoom in bokeh plot by default"""
    plot.state.toolbar.active_scroll = plot.state.tools[0]

hv.Overlay(
    gv.tile_sources.EsriImagery * \
    points_lonlat.opts(
        tools=['hover'],
        size=1+dim('hll_cardinality')*0.1,
        line_color='black',
        line_width=0.1,
        fill_alpha=0.8,
        fill_color='yellow')
    ).opts(
        width=800,
        height=480,
        hooks=[set_active_tool],
        title='Estimate Post Count for distinct yfcc coordinates',
        projection=ccrs.GOOGLE_MERCATOR,
    )
Out[51]:
In [ ]: