Based on data from the YFCC100m dataset, this notebook introduces the basics and usage of the privacy-aware data structure. The data used here was converted from the raw lbsn structure to its privacy-aware version, which is based, at its core, on the HyperLogLog (HLL) algorithm.
The raw lbsn structure was developed to systematically characterize LBSN data aspects in a common scheme that enables privacy-by-design for connected software, data handling and information visualization. The components of the raw structure are documented here.
The hll structure is more or less a convention for systematically converting raw structure components to a privacy-aware format that is based on HLL. A possible data scheme is proposed here. It is critical to note that not all components of the privacy-aware structure need to be available for all visualizations. Rather, an analyst will need to consider, on a per visualization basis, which components are needed. Therefore, the proposed structure is not exhaustive and should be treated as a demonstrator that can be extended.
In this Notebook, we introduce the basic process of working with the privacy-aware hll structure. This is the first notebook in a tutorial series of three notebooks:
Additional notes:
Use Shift+Enter to walk through the Notebook
import os
import psycopg2 # Postgres API
import geoviews as gv
import holoviews as hv
import pandas as pd
import numpy as np
import warnings
with warnings.catch_warnings():
# prevent missing glyph warnings
warnings.simplefilter("ignore")
import matplotlib.pyplot as plt
from collections import namedtuple
from geoviews import opts
from holoviews import dim
from cartopy import crs as ccrs
from IPython.display import display, Markdown
# init bokeh
from modules import preparations
preparations.init_imports()
pd.set_option('display.max_colwidth', 25)
Define credentials as environment variables
db_user = "postgres"
db_pass = os.getenv('POSTGRES_PASSWORD')
# set connection variables
db_host = "hlldb"
db_port = "5432"
db_name = "hlldb"
Connect to an empty Postgres database running the HLL extension. Note that only readonly privileges are needed.
db_connection = psycopg2.connect(
host=db_host,
port=db_port,
dbname=db_name,
user=db_user,
password=db_pass
)
db_connection.set_session(readonly=True)
Test connection:
db_query = """
SELECT 1;
"""
# create pandas DataFrame from database data
df = pd.read_sql_query(db_query, db_connection)
display(df.head())
from modules import tools
db_conn = tools.DbConn(db_connection)
db_conn.query("SELECT 1;")
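The DbConn class from modules.tools is a small convenience wrapper around the psycopg2 connection. A minimal sketch of what such a wrapper might look like, as an assumption for illustration only (the actual implementation in modules.tools may differ):
class DbConn:
    """Simplified wrapper: run a query and return the result as a DataFrame"""
    def __init__(self, db_connection):
        self.db_connection = db_connection
    def query(self, sql_query: str) -> pd.DataFrame:
        # delegate to pandas, which executes the query on the (read-only) connection
        return pd.read_sql_query(sql_query, self.db_connection)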
The privacy-aware structure is based on two basic components, 1) the overlay and 2) the base. The overlay defines what is measured, which can be, for example, the number of users (_userhll), the number of posts (_posthll), or the number of userdays (_datehll). Overlays are stored as HyperLogLog (HLL) sets, which contain statistically approximated versions of the original data. It is possible to get an approximation of counts by calculating the cardinality of an HLL set.
Conversely, the base is the context for which measurements are analyzed. The base can be terms, location-terms, dates, coordinates, places, or any combination or more complex query. Herein, our base is lat-lng coordinates. Bases also have some privacy implications, particularly when they are accurate enough that only one person is referenced (e.g. a latitude-longitude pair with high accuracy).
The core aim of the privacy-aware structure is to support visual analytics where no original data is needed. By removing the need for original data, entire process pipelines (such as Monitoring) can be converted to privacy-aware versions, without losing quality or validity for interpretation.
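To make the overlay/base distinction concrete: in a raw table, counting users per coordinate requires access to the original user ids, while in the privacy-aware table the same metric is estimated from the user_hll overlay alone. The raw table and column names below (topical.post, user_guid) are hypothetical and shown for comparison only; they are not part of the data used in this notebook:
# raw structure (hypothetical table): exact count, requires original user ids
raw_query = """
SELECT latitude, longitude, COUNT(DISTINCT user_guid) AS usercount
FROM topical.post
GROUP BY latitude, longitude;
"""
# privacy-aware structure (spatial.latlng, used below): approximate count
# from the hll overlay, no original ids are stored or needed
hll_query = """
SELECT latitude, longitude, hll_cardinality(user_hll)::int AS usercount_est
FROM spatial.latlng;
"""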
For calculating with HLL sets, it is necessary to connect to a Postgres database which runs the Citus HLL extension. The database itself can be empty; it is only needed for the calculations. A Docker container for an empty Postgres database with the Citus HLL extension is made available here.
First, some statistics for the data we're working with.
%%time
db_query = """
SELECT count(*) FROM spatial.latlng t1;
"""
display(Markdown(
f"There're "
f"<strong style='color: red;'>"
f"{db_conn.query(db_query)['count'][0]:,.0f}</strong> "
f"distinct coordinates in this table."))
The Flickr YFCC100M dataset includes 99,206,564 photos and 793,436 videos (100 million media objects in total) from 581,099 different photographers; 48,469,829 of these are geotagged [1].
In the privacy-aware HLL data we're working with, distinct users, posts and dates have been grouped by various means, such as by distinct latitude/longitude pairs. The number of distinct coordinates can be calculated from the table spatial.latlng.
db_query = tools.get_stats_query("spatial.latlng")
stats_df = db_conn.query(db_query)
stats_df["bytes/ct"] = stats_df["bytes/ct"].fillna(0).astype('int64')
display(stats_df)
Data structure preview (get 10 random records):
db_query = "SELECT * FROM spatial.latlng ORDER BY RANDOM() LIMIT 10;"
first_10_df = db_conn.query(db_query)
display(first_10_df)
Did you know that you can see the docstring of defined functions by appending a question mark (?)? This helps us remember what functions are doing:
db_conn.query?
Test cardinality calculation: hll_add_agg() is used to create a HLL set consisting of 3 distinct sample ids; hll_cardinality() is then applied to retrieve the cardinality of the set (the number of distinct elements).
db_query = """
SELECT hll_cardinality(hll_add_agg(hll_hash_text(a)))
FROM (
VALUES ('24399297324604'), ('238949132532500'), ('269965843254300')
) s(a);
"""
# create pandas DataFrame from database data
df = db_conn.query(db_query)
display(df.head())
Have a look at a number of random coordinates from our YFCC spatial.latlng table:
HllRecord = namedtuple(
'Hll_record', 'lat, lng, user_hll, post_hll, date_hll')
def strip_item(item, strip: bool):
if not strip:
return item
if len(item) > 120:
item = item[:120] + '..'
return item
def get_hll_record(record, strip: bool = None):
"""Concatenate topic info from post columns"""
lat = record.get('latitude')
lng = record.get('longitude')
user_hll = strip_item(record.get('user_hll'), strip)
post_hll = strip_item(record.get('post_hll'), strip)
date_hll = strip_item(record.get('date_hll'), strip)
return HllRecord(lat, lng, user_hll, post_hll, date_hll)
hll_sample_records = []
for ix, hll_record in first_10_df.iterrows():
hll_record = get_hll_record(hll_record)
print(
f'Coordinate: {hll_record.lat}, {hll_record.lng}, '
f'Post Count (Hll-Set): {hll_record.post_hll}, '
f'User Count (Hll-Set): {hll_record.user_hll}')
hll_sample_records.append(hll_record)
Let's calculate the cardinality (the estimated count) of these sets:
db_query = f"""
SELECT hll_cardinality(user_hll)::int as usercount
FROM (VALUES
('{hll_sample_records[0].user_hll}'::hll),
('{hll_sample_records[1].user_hll}'::hll),
('{hll_sample_records[2].user_hll}'::hll),
('{hll_sample_records[3].user_hll}'::hll),
('{hll_sample_records[4].user_hll}'::hll),
('{hll_sample_records[5].user_hll}'::hll),
('{hll_sample_records[6].user_hll}'::hll),
('{hll_sample_records[7].user_hll}'::hll),
('{hll_sample_records[8].user_hll}'::hll),
('{hll_sample_records[9].user_hll}'::hll)
) s(user_hll)
ORDER BY usercount DESC;
"""
df = db_conn.query(db_query)
# combine with base data
df["lat"] = [record.lat for record in hll_sample_records]
df["lng"] = [record.lng for record in hll_sample_records]
df.set_index(['lat', 'lng'])
Let's take this apart:
- f'' refers to Python's f-string format, where variables can be substituted within curly brackets, e.g. {hll_sample_records[0].user_hll}, which inserts an HLL set as a string from our CSV.
- 'HLL-Set': wrapped in single quotes, the value is treated as a string in SQL.
- ::hll after the string labels the record as the special type hll. This is called type casting, and it doesn't change the content of the value.
- hll_cardinality(user_hll)::int means: for each user_hll, calculate the cardinality (the approximated count of distinct values in the set) and cast the result to an integer (::int).
The result shows that all of these coordinates were frequented by only 1 user from the YFCC dataset.
What is interesting here: HLL sets can be combined or intersected. For example, the union of two of the sets above:
db_query = f"""
SELECT hll_cardinality(hll_union_agg(s.a))::int
FROM (
VALUES ('{hll_sample_records[1].user_hll}'::hll),
('{hll_sample_records[2].user_hll}'::hll)
) s(a)
"""
df = db_conn.query(db_query)
df.head()
Obviously, the users who frequented the two coordinates were different (the union of the two single-user sets has a cardinality of 2).
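Besides the aggregate hll_union_agg(), the extension also provides a two-argument hll_union(). There is no direct intersection function; an intersection can, however, be estimated with the inclusion-exclusion principle, |A ∩ B| = |A| + |B| - |A ∪ B|. A minimal sketch for the two sample sets used above (note that such estimates inherit the error of all three cardinalities and become unreliable for small overlaps):
db_query = f"""
SELECT
    hll_cardinality(s.a)::int
    + hll_cardinality(s.b)::int
    - hll_cardinality(hll_union(s.a, s.b))::int AS intersection_estimate
FROM (
    VALUES (
        '{hll_sample_records[1].user_hll}'::hll,
        '{hll_sample_records[2].user_hll}'::hll)
) s(a, b)
"""
df = db_conn.query(db_query)
df.head()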
The Postgres implementation by Citus has several modes, for reasons of performance:
A hll is a combination of different set/distinct-value-counting algorithms that can be thought of as a hierarchy, along with rules for moving up that hierarchy. In order to distinguish between said algorithms, we have given them names:
EMPTY A constant value that denotes the empty set.
EXPLICIT An explicit, unique, sorted list of integers in the set, which is maintained up to a fixed cardinality.
SPARSE A 'lazy', map-based implementation of HyperLogLog, a probabilistic set data structure. Only stores the indices and values of non-zero registers in a map, until the number of non-zero registers exceeds a fixed cardinality.
FULL A fully-materialized, list-based implementation of HyperLogLog. Explicitly stores the value of every register in a list ordered by register index.
If used in a privacy-aware context, explicit mode does not provide any improvement, since unique ids are stored fully, which can be reversed. Sparse mode, according to Desfontaines, Lochbihler, & Basin (2018), may be more vulnerable to reverse engineering attacks, since it provides more accurate information on each mapped unique id.
It is possible to tune the hll mode with SELECT hll_set_defaults(log2m, regwidth, expthresh, sparseon);. The default values are:
- DEFAULT_LOG2M 11
- DEFAULT_REGWIDTH 5
- DEFAULT_EXPTHRESH -1 (explicit mode on)
- DEFAULT_SPARSEON 1 (auto sparse mode)
Disable the explicit mode with the following command for the current connection (Note: the output will return the settings before the change):
df = db_conn.query("SELECT hll_set_defaults(11,5, 0, 1);")
df.head()
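As a rough orientation for what these defaults mean: with log2m register address bits, a full hll set uses m = 2^log2m registers, and the expected relative error of the cardinality estimate is approximately 1.04/sqrt(m):
import math
log2m = 11                       # default number of register address bits
m = 2 ** log2m                   # 2048 registers in full mode
rel_error = 1.04 / math.sqrt(m)  # standard HLL error estimate
print(f"expected relative error: {rel_error:.1%}")  # about 2.3%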
It is also possible to define the exact number of ids in a hll set at which promotion to a full hll set occurs. By default (-1), this is determined automatically based on storage efficiency. Our observation was that promotion to full mode typically occurs between 150 and 250 unique IDs in a hll set.
Set it manually with SELECT hll_set_max_sparse(int);, e.g. to 25 unique IDs:
df = db_conn.query("SELECT hll_set_max_sparse(25);")
df.head()
However, note that these settings won't affect any calculations in this notebook, since we're loading already created hll sets from CSV. It is possible to detect with which mode a hll set was created. Below, the first hll set (for the null island) was created using FULL; the other two, containing only one entry each, were created using SPARSE, because the explicit mode was disabled.
This time, we'll construct the query using Python's list comprehension, which significantly reduces the length of our code:
pd.set_option('display.max_colwidth', 150)
db_query = f"""
SELECT hll_print(s.a)
FROM (
VALUES {",".join([f"('{record.user_hll}'::hll)" for record in hll_sample_records])}
) s(a)
"""
df = db_conn.query(db_query)
df["lat"] = [record.lat for record in hll_sample_records]
df["lng"] = [record.lng for record in hll_sample_records]
df.set_index(['lat', 'lng']).head()
To anticipate some questions or assumptions:
If I consider the HLL string as a unique ID, it is no different than any encrypted string and can be used to re-identify users (e.g. rainbow tables!)
This is not true. Of course, you can treat the binary representation of a hll set as a unique string. However, this string will always refer to a group of users. Who is in this group is defined by two things: first, the re-distribution of characters of each unique ID through the Murmur hash function. Since Murmur is not a cryptographic hash function, this step could be reversed. Secondly, however, HLL removes about 97% of the information from the unique Murmur hash strings; the remaining 3% is then used to extrapolate the likely number of distinct ids in the original set. Thus, the binary representation of a hll set cannot be linked back to original ids.
Desfontaines et al. (2017) point out a possible attack vector: someone who has both the hll set (e.g. access to the database) and the original data could attempt to add a unique ID to an existing set and, if the set doesn't change, infer (e.g. with 60% accuracy) whether this user is already in the set. Since we're working with publicly available social media data, such an attack vector appears implausible - using the original data would be easier in this case. Still, we've added a measure to impede such an approach by applying a cryptographic hashing function before the Murmur hashing. Without knowing the parameters of this cryptographic hashing function, someone could not derive the encrypted id that would be needed for such an attack.
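To illustrate this last measure, the following is a minimal sketch of such a keyed cryptographic pre-hashing step. The function name, key handling and parameters are assumptions for demonstration only and do not reflect the actual implementation used to create the data in this notebook:
import hashlib
import hmac

# secret key ('salt'), e.g. provided as an environment variable (assumption)
crypt_key = os.getenv('CRYPT_KEY', 'replace-me').encode()

def pseudonymize(user_guid: str) -> str:
    """Keyed hash of an original id; only this value would ever be
    passed on to hll_hash_text(), never the original user_guid."""
    return hmac.new(crypt_key, user_guid.encode(), hashlib.sha256).hexdigest()

pseudonymize('24399297324604')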
How were the HLL sets created that are loaded herein from CSV?
By adding ids consecutively to hll sets. In reality, this is a multi-step process done using a number of tools. The process can be summarized in the following steps:
- Create a HLL database with a fairly broad number of default structures present (see our HLL DB Docker Template).
- Write a (Python) script to connect to the respective Social Media API,
- retrieve JSON data for geotagged posts, transform it to the RAW-LBSN structure and from that to the HLL-LBSN structure, and
- submit it to the HLLdb. This is done in-memory using the package lbsntransform. Thus, for streaming the HLL updates, no original data needs to be stored. The CSV with bases and HLL sets is then exported from the HLL database.
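To illustrate the last step, the SQL issued during such a streaming update could look roughly like the following upsert. This is only a sketch under assumptions (a unique constraint on latitude/longitude, a single user_hll overlay); the statements actually generated by lbsntransform may differ, and running it would require write privileges, which the read-only connection used here intentionally lacks:
# not executed here (read-only connection); shown for illustration only
example_upsert = """
INSERT INTO spatial.latlng (latitude, longitude, user_hll)
VALUES (51.05, 13.74, hll_add(hll_empty(), hll_hash_text('pseudonymized-user-id')))
ON CONFLICT (latitude, longitude) DO UPDATE SET
    user_hll = hll_add(spatial.latlng.user_hll, hll_hash_text('pseudonymized-user-id'));
"""
print(example_upsert)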
Why do I need to connect to a database when I only want to perform HLL calculations?
The original algorithm developed by Flajolet et al. (2007) exists in several implementations. After some consideration, we decided on the Postgres implementation by Citus. While there are a number of ports (Java, JS and Go), none is currently available for Python. This is the main reason why we created the hll-worker database, an empty Postgres container with the HLL extension installed, which can be used for hll calculations independent of the language used. The connection is based on a read-only user, which only allows in-memory calculations.
There are several advantages to such a setup. From a privacy perspective, the main one is based on the separation of concerns principle: the HLL worker doesn't need to know the 'bases' (e.g. the coordinates) for which distinct ids are aggregated. In the calculation below (union_hll()), coordinates are replaced by ascending integers. Thus, even if someone compromised the hll worker, no contextual information is available to make inferences, e.g. about which places have been visited by particular users. Conversely, the hll worker can apply a cryptographic hash function to ids using a salt that is only available in the worker, not to the one who is sending information. This means that neither the user nor the hll worker has all the information necessary to re-construct the final hll sets. Only if both the one who is sending raw data and the hll worker are compromised is it possible to re-create hll sets. This is perhaps beyond the security measures that are necessary when working with data that is publicly available anyway, such as the data used herein. However, these measures come at nearly no cost, which is why we considered this the default setup suited to most users.
While using HLL for the overlay (distinct IDs), aren't the bases (coordinates, for example) also of relevance to privacy?
Yes, to some degree. The privacy-aware structure makes a clear distinction between distinct ids, as overlays, that directly reference single users, and bases, which may or may not refer to single users (we call this contextual referencing). However, for such contextual referencing, additional ('contextual') data must be taken into account to allow re-identification of users. For example, one may look up a single coordinate pair in the original Flickr YFCC dataset to find out who used it to geotag their photographs. Some bases can be formatted to make contextual referencing harder, such as by coarsening the granularity of information. Furthermore, since relations are dissolved in the privacy-aware structure, no additional information or query is available beyond the count metrics.
This means that the hll structure offers very limited options for analyzing data in ways other than those it was created for. This prevents re-use of datasets in other contexts, such as appears possible when data is compromised by someone with different aims. On the other hand, Social Media data is often publicly available anyway, and the hll structure cannot prevent attacks compromising people's privacy that use the original data from Social Media, which is sometimes still available.
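As an example of formatting bases to make contextual referencing harder (mentioned above), coordinate precision can simply be reduced before the base is stored, e.g. by rounding to a coarser grid. This is only a sketch; GeoHash or other spatial aggregation approaches would serve the same purpose:
def coarsen_coordinate(lat: float, lng: float, decimals: int = 2):
    """Round coordinates to a coarser grid (about 1 km at 2 decimals),
    so that a base no longer points to an exact individual location."""
    return round(lat, decimals), round(lng, decimals)

coarsen_coordinate(51.050407, 13.737262)  # -> (51.05, 13.74)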
[Add Section] - see overview of raw structure components at lbsn.vgiscience.org/structure/
[Add Section] - see overview of hll structure components at gitlab.vgiscience.de/lbsn/structure/hlldb/
[Update Section with basic visualization examples]
facet = "topical"
base = "term"
samples = 10
Get data from topical.term:
def get_samples(facet: str, base: str, samples: int):
"""Get samples from db"""
db_query = f"""
SELECT * FROM {facet}.{base} LIMIT {samples};
"""
df = db_conn.query(db_query)
return df
df = get_samples(facet=facet, base=base, samples=samples)
display(df.head())
def list_components(df, facet: str = None, base: str = None):
"""Formatted display of structure components (overlays and bases)"""
comp = [
'**Base**',
'**Overlay(s) (hll metrics)**',
'**Additional (supplementary base information)**']
hll_cols = []
if base is None:
bases = []
else:
bases = [base]
additional = []
for col in df.columns:
if col.endswith("_hll"):
hll_cols.append(col)
continue
if base is None:
bases.append(col)
else:
if col not in bases:
additional.append(col)
display(Markdown(f'### <u>{facet}.{base}</u>'))
for ix, li in enumerate([bases, hll_cols, additional]):
if not li:
continue
display(Markdown(f'{comp[ix]}: {", ".join([f"*{li_}*" for li_ in li])}'))
list_components(df, facet=facet, base=base)
def get_cardinality(
facet: str, base: str, metric: str, samples: int):
"""Get cardinality of samples from db"""
db_query = f"""
SELECT hll_cardinality({metric})::int, {base}
FROM {facet}.{base} LIMIT {samples};
"""
df = db_conn.query(db_query)
df = df.set_index([base])
df.sort_values(
by="hll_cardinality", ascending=False, inplace=True)
return df
df = get_cardinality(
metric="user_hll", facet=facet, base=base, samples=samples)
df.head()
def plot_df_bar(
df: pd.DataFrame, title: str, ylabel: str, xlabel: str):
"""Generate barplot from df"""
plt.figure(figsize=(15,10))
cmap = plt.cm.tab10
colors = cmap(np.arange(len(df)) % cmap.N)
df["hll_cardinality"].plot.bar(color=colors)
plt.xticks(rotation=50)
plt.title(title)
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.show()
plot_df_bar(
df, title=f"{samples} sample records from {facet}.{base}",
ylabel="Number of distinct users (estimated)", xlabel=f"{base}")
base = "hashtag"
df = get_samples(facet=facet, base=base, samples=samples)
display(df.head())
df = get_cardinality(
metric="user_hll", facet=facet, base=base, samples=samples)
df.head()
plot_df_bar(
df, title=f"{samples} sample records from {facet}.{base}",
ylabel="Number of distinct users (estimated)", xlabel=f"{base}")
facet = "temporal"
base = "date"
samples = 10
df = get_samples(facet=facet, base=base, samples=samples)
display(df.head())
df = get_cardinality(
metric="user_hll", facet=facet, base=base, samples=samples)
display(df.head())
plot_df_bar(
df, title=f"{samples} sample records from {facet}.{base}",
ylabel="Number of distinct users (estimated)", xlabel=f"{base}")
facet = "spatial"
base = "latlng"
Get sample data for Dresden:
db_query = f"""
SELECT latitude, longitude, hll_cardinality(post_hll)::int
FROM {facet}.{base} t1
WHERE t1.latlng_geom
&& -- intersects
ST_MakeEnvelope (
13.46475261, 50.90449772, -- bounding
14.0876945, 51.24305035, -- box limits (Dresden)
4326)
"""
df = db_conn.query(db_query)
display(df.head())
Create geoviews point layer:
points_lonlat = gv.Points(
df,
kdims=['longitude', 'latitude'], vdims=['hll_cardinality'])
Plot points on map:
def set_active_tool(plot, element):
"""Enable wheel_zoom in bokeh plot by default"""
plot.state.toolbar.active_scroll = plot.state.tools[0]
hv.Overlay(
gv.tile_sources.EsriImagery * \
points_lonlat.opts(
tools=['hover'],
size=1+dim('hll_cardinality')*0.1,
line_color='black',
line_width=0.1,
fill_alpha=0.8,
fill_color='yellow')
).opts(
width=800,
height=480,
hooks=[set_active_tool],
title='Estimated Post Count for distinct YFCC coordinates',
projection=ccrs.GOOGLE_MERCATOR,
)