Privacy Test for Alex

Introduction

Herein, we explore two (unlikely but possible) scenarios of how privacy can be compromised using the HLL data. We distinguish two attack vectors:

a) an internal attack ("sandy"), where the attacker has full access to the HLL database. This is based on the full HLL data collected in Notebook 2.
b) an external attack ("robert"), where the attacker has access to the published benchmark data. This is based on the published HLL data, limited to grid cells with usercount > 100.

In both scenarios, an attacker would need additional information such as

  • information about the system such as exact HLL parameters etc.,
  • the secret key used for cryptographic hashing of ids,
  • contextual information, such as from raw data available elsewhere (e.g. on Social Media).

The two examples herein use what Desfontaines et al. (2018) describe as an "intersection attack". Intersection attacks do not provide absolute certainty, but under certain circumstances they can confirm a suspicion or significantly increase an attacker's knowledge, which may ultimately compromise privacy.
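The mechanics of such an attack can be illustrated with a minimal, self-contained HyperLogLog sketch in Python. The parameters and hash choice below are hypothetical, not those used in this guide: the point is only that if unioning an element's own HLL into a set's HLL leaves every register unchanged, the element was likely already a member.

```python
import hashlib

# Hypothetical parameters for illustration -- not the parameters
# used for the published HLL data.
P = 8            # register-address bits
M = 1 << P       # 256 registers

def _hash64(value: str) -> int:
    """Derive a 64-bit hash from SHA-256 (illustrative choice)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

def hll_add(registers: list, value: str) -> None:
    """Record a value: the first P hash bits select a register, which
    keeps the maximum leading-zero count (+1) of the remaining bits."""
    h = _hash64(value)
    idx = h >> (64 - P)
    rest = h & ((1 << (64 - P)) - 1)
    rho = (64 - P) - rest.bit_length() + 1
    registers[idx] = max(registers[idx], rho)

def probably_member(hll: list, value: str) -> bool:
    """Intersection test: if unioning the value's own HLL into hll
    leaves every register unchanged, the value was likely a member."""
    single = [0] * M
    hll_add(single, value)
    return [max(a, b) for a, b in zip(hll, single)] == hll

hll = [0] * M
for user in (f"user_{i}" for i in range(50)):
    hll_add(hll, user)

print(probably_member(hll, "user_7"))    # True: members always pass
print(probably_member(hll, "stranger"))  # usually False, but false
                                         # positives are possible
```

A true member can never fail this test, because adding it again cannot raise any register above its current value; non-members only pass when their register happens to be dominated already, which is why the attack yields suspicion rather than certainty.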

We demonstrate such intersection attacks with two scenarios, "Sandy" and "Robert", both based on a user named Alex.

Alex is an actual user included in the YFCC100M dataset. He is one of the authors of this guide and published images from 2008 to 2012 under Creative Commons licenses on Flickr; 120 of these images are geotagged. While Alex is an actual user, the scenarios "Sandy" and "Robert" are purely fictional. They illustrate two specific examples of how Alex's privacy could become compromised.

Parameters

In [104]:
# Select scenario
# SCENARIO = "robert"
SCENARIO = "sandy"

Preparations

In [105]:
import sys
from pathlib import Path
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)

from _03_yfcc_gridagg_hll import *

Additional imports

In [106]:
import matplotlib.patches as mpatches
from matplotlib.lines import Line2D

Load RAW data

Connect to raw db

In [107]:
db_user = "postgres"
db_pass = os.getenv('POSTGRES_PASSWORD')
# set connection variables
db_host = "rawdb"
db_port = "5432"
db_name = "rawdb"

db_connection = psycopg2.connect(
        host=db_host,
        port=db_port,
        dbname=db_name,
        user=db_user,
        password=db_pass
)
db_connection.set_session(readonly=True)
db_conn = tools.DbConn(db_connection)
db_conn.query("SELECT 1;")
Out[107]:
?column?
0 1
In [108]:
alex_user_id = '96117893@N05'
alex_userday = '2012-05-09'
In [109]:
sql_query = f"""
    SELECT
        t1.user_guid,
        t1.post_guid,
        to_char(t1.post_create_date, 'yyyy-MM-dd') as "post_create_date",
        ST_Y(ST_PointFromGeoHash(ST_GeoHash(t1.post_latlng, 5), 5)) As "latitude", 
        ST_X(ST_PointFromGeoHash(ST_GeoHash(t1.post_latlng, 5), 5)) As "longitude"
    FROM topical.post t1
    WHERE user_guid = '{alex_user_id}'
    AND post_geoaccuracy IN ('place', 'latlng', 'city');
"""

This can take a while:

In [110]:
%%time
pickle_path = OUTPUT / "pickles" / "alex_raw_locations.pkl"
if pickle_path.exists():
    alex_raw = pd.read_pickle(pickle_path)
else:
    alex_raw = db_conn.query(sql_query)
    alex_raw.to_pickle(pickle_path)   
CPU times: user 4.22 ms, sys: 642 µs, total: 4.86 ms
Wall time: 4.53 ms
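The cell above uses a simple cache-or-query pattern: load the pickled result if it exists, otherwise run the query and store it. A generic sketch of the pattern (the `get_or_query` helper and the toy frame are illustrative stand-ins, not part of the guide's module):

```python
import tempfile
from pathlib import Path

import pandas as pd

calls = {"n": 0}

def expensive_query() -> pd.DataFrame:
    """Stand-in for the slow database query."""
    calls["n"] += 1
    return pd.DataFrame({"post_guid": ["a", "b"], "latitude": [37.6, 37.6]})

def get_or_query(pickle_path: Path, query) -> pd.DataFrame:
    """Return the pickled result if present, else run the query and cache it."""
    if pickle_path.exists():
        return pd.read_pickle(pickle_path)
    df = query()
    df.to_pickle(pickle_path)
    return df

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "alex_raw_locations.pkl"
    df_first = get_or_query(path, expensive_query)   # runs the query
    df_second = get_or_query(path, expensive_query)  # read from the pickle
print(calls["n"])  # 1 -- the query ran only once
```

This is why the wall time above is in milliseconds: after the first run, only the pickle is read.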
In [111]:
print(len(alex_raw))
120
In [112]:
if SCENARIO == 'sandy':
    alex_raw.query(
        f"post_create_date == '{alex_userday}'",
        inplace=True)
In [113]:
alex_raw.head()
Out[113]:
user_guid post_guid post_create_date latitude longitude
1 96117893@N05 8974106438 2012-05-09 37.63916 -122.365723
2 96117893@N05 8972920203 2012-05-09 37.63916 -122.365723
3 96117893@N05 8972875223 2012-05-09 37.63916 -122.365723
8 96117893@N05 8973965140 2012-05-09 37.63916 -122.365723
17 96117893@N05 8974036854 2012-05-09 37.63916 -122.365723
In [114]:
geoseries_alex_locations = gp.GeoSeries(
    [Point(alex_location.longitude, alex_location.latitude)
     for _, alex_location in alex_raw.iterrows()], crs=CRS_WGS)
In [115]:
geoseries_alex_locations_proj = geoseries_alex_locations.to_crs(CRS_PROJ)
In [116]:
geoseries_alex_locations_proj.head()
Out[116]:
0    POINT (-10609916.322 4523845.434)
1    POINT (-10609916.322 4523845.434)
2    POINT (-10609916.322 4523845.434)
3    POINT (-10609916.322 4523845.434)
4    POINT (-10609916.322 4523845.434)
dtype: geometry

Visualize raw locations of Alex

In [117]:
world = gp.read_file(
    gp.datasets.get_path('naturalearth_lowres'),
    crs=CRS_WGS)
world = world.to_crs(CRS_PROJ)

Add annotation layer for location labels:

In [118]:
sanfrancisco_coords = (Point(-122.5776844, 37.7576171), "San Francisco")
berlin_coords = (Point(13.1445531, 52.5065133), "Berlin")
caboverde_coords = (Point(-23.0733155, 16.7203123), "Cabo Verde")
In [119]:
df = pd.DataFrame([sanfrancisco_coords, berlin_coords, caboverde_coords], columns=["geometry", "name"])
In [120]:
df.head()
Out[120]:
geometry name
0 POINT (-122.5776844 37.7576171) San Francisco
1 POINT (13.1445531 52.5065133) Berlin
2 POINT (-23.0733155 16.7203123) Cabo Verde
In [121]:
gdf = gp.GeoDataFrame(
        df.drop(
            columns=["geometry"]),
            geometry=df.geometry)
gdf.crs = CRS_WGS
gdf = gdf.to_crs(CRS_PROJ)
gdf['coords'] = gdf['geometry'].apply(lambda x: x.representative_point().coords[:])
gdf['coords'] = [coords[0] for coords in gdf['coords']]
In [122]:
label_off = {
    "San Francisco":(5500000, 1000000),
    "Berlin":(4500000, 1000000),
    "Cabo Verde":(4500000, -1000000)}
label_rad = {
    "San Francisco":0.1,
    "Berlin":0.5,
    "Cabo Verde":-0.3}

def annotate_locations(
    gdf: gp.GeoDataFrame, label_off: dict = label_off,
    label_rad: dict = label_rad):
    """Annotate map based on a GeoDataFrame of named locations"""
    for idx, row in gdf.iterrows():
        plt.annotate(
            text=row['name'], 
            xy=row['coords'],
            xytext=np.subtract(row['coords'], label_off.get(row['name'])),
            horizontalalignment='left',
            arrowprops=dict(
                arrowstyle='->', 
                connectionstyle=f'arc3,rad={label_rad.get(row["name"])}',
                color='red'))
In [123]:
fig, ax = plt.subplots(1, 1, figsize=(11, 14))
geoseries_alex_locations_proj.buffer(500000).plot(
    ax=ax,
    facecolor="none",
    edgecolor='red', 
    linewidth=0.2,
    alpha=0.9,
    label='Alex, actual locations (RAW)')
ax.axis('off')
# combine with world geometry
world.plot(
    ax=ax, color='none', edgecolor='black', linewidth=0.3)
annotate_locations(gdf=gdf)

Load HLL data

In 03_yfcc_gridagg_hll.ipynb, aggregate grid data was stored to yfcc_all_est_benchmark.csv, including hll sets with cardinality > 100. Load this data first, using functions from previous notebooks.

Read the benchmark data, loading only the bin coordinates and the HLL column for the chosen metric.
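grid_agg_fromcsv comes from the shared module; the column-restricted load it performs can be sketched with plain pandas. The CSV stub and column names below are illustrative:

```python
import io

import pandas as pd

# Illustrative stand-in for the benchmark CSV (not the real file)
csv_data = io.StringIO(
    "xbin,ybin,usercount_est,usercount_hll,userdays_est,userdays_hll\n"
    "-15340096,2779952,120,\\x148b40,2041,\\x148b41\n")

metric = "usercount"
df = pd.read_csv(
    csv_data,
    usecols=["xbin", "ybin", f"{metric}_hll"],  # load only needed columns
    index_col=["xbin", "ybin"])
print(df.columns.tolist())  # ['usercount_hll']
```

Restricting `usecols` keeps memory use low, since the unused HLL column of the other metric is never parsed.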

In [124]:
benchmark_data_published = "yfcc_all_est_benchmark.csv"
benchmark_data_internal = "yfcc_all_est_benchmark_internal.csv"

Select userdays or usercount, based on the chosen scenario:

In [125]:
if SCENARIO == "robert":
    metric = "usercount"
else:
    metric = "userdays"
In [126]:
load_opts = {
    "columns":["xbin", "ybin", f"{metric}_hll"],
    "metrics":[f"{metric}_est"],
    "grid_size":GRID_SIZE_METERS
}
grid_internal = grid_agg_fromcsv(
    OUTPUT / "csv" / benchmark_data_internal,
    **load_opts)
grid_published = grid_agg_fromcsv(
    OUTPUT / "csv" / benchmark_data_published,
    **load_opts)
datasets = [
    grid_published,
    grid_internal
]
In [127]:
grid_published[grid_published[f"{metric}_est"]>5].head()
Out[127]:
geometry userdays_est userdays_hll
xbin ybin
-15340096 2779952 POLYGON ((-15340096.000 2779952.000, -15240096... 2041.0 \x148b40084220880008002100012004000ca409020080...
2679952 POLYGON ((-15340096.000 2679952.000, -15240096... 585.0 \x138b400061008100e30121014101610201028202e503...
-15240096 2779952 POLYGON ((-15240096.000 2779952.000, -15140096... 955.0 \x148b4008402180000002200000204400000200000000...
2679952 POLYGON ((-15240096.000 2679952.000, -15140096... 11564.0 \x148b40108622a46410cc5210a44886220c452086230c...
-15140096 2679952 POLYGON ((-15140096.000 2679952.000, -15040096... 3566.0 \x148b4010c4210c24284600104009025090c218c22314...

Intersection attack

Connect to hll worker and test intersection

Connect to hll worker:

In [128]:
db_user = "postgres"
db_pass = os.getenv('POSTGRES_PASSWORD')
# set connection variables
db_host = "hlldb"
db_port = "5432"
db_name = "hlldb"

db_connection = psycopg2.connect(
        host=db_host,
        port=db_port,
        dbname=db_name,
        user=db_user,
        password=db_pass
)
db_connection.set_session(readonly=True)
db_conn = tools.DbConn(db_connection)
db_conn.query("SELECT 1;")
Out[128]:
?column?
0 1

For an intersection attack, we need either the HLL set or the hash. If the hash is known, the HLL set can be recreated. Hashes can be known if the secret key is compromised, or if an attacker observes internal memory states while values are streamed and converted to HLL. We do not publish the secret key that was used to generate the HLL sets.
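Why the secret key matters can be sketched with a keyed hash. HMAC-SHA256 is used here as an illustrative choice, and the key and values are hypothetical, not those used in this guide: without the key, an attacker cannot reproduce the hash for a known user id.

```python
import hashlib
import hmac

def keyed_hash(user_id: str, secret_key: bytes) -> str:
    """Keyed cryptographic hash of an id (illustrative: HMAC-SHA256)."""
    return hmac.new(secret_key, user_id.encode(), hashlib.sha256).hexdigest()

# Hypothetical key -- not the key used in this guide
secret = b"server-side-secret"
h1 = keyed_hash("96117893@N05", secret)
h2 = keyed_hash("96117893@N05", b"attacker-guess")

print(len(h1))   # 64 hex characters (SHA-256)
print(h1 == h2)  # False: a wrong key yields a different hash
```

With the correct key, hashing is deterministic, so a compromised key lets an attacker recreate any user's hash and, from it, the corresponding HLL set.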

In [129]:
if SCENARIO == "robert":
    # alex cryptographic hash
    alex_hash = 'fcad382c10535ad1bfdec19651eb7ec93d6d7b9bac7566503b38f5a4f8be56e6' 
else:
    # alex @ date (2012-05-09) cryptographic hash
    alex_hash = 'cfc0d9890bfdd66728e179c25b243b867122998bc493bd4f41739bce857d7682' 

Test intersection attack using a single HLL value:

In [130]:
hll_val = grid_published[grid_published[f"{metric}_est"]>=1].iloc[1][f"{metric}_hll"]

Adjust the hll defaults (log2m=11, regwidth=5, expthresh=0, sparseon=1) to match the parameters used when the HLL sets were created:

In [131]:
db_conn.query("SELECT hll_set_defaults(11, 5, 0, 1);")
Out[131]:
hll_set_defaults
0 (11,5,-1,1)

Use the hll_eq() function (reference) to test whether the union of Alex's HLL and the original hll equals the original hll:

In [132]:
alex_hll = f"""
    hll_add_agg(
        hll_hash_text(
            '{alex_hash}'))
"""

sql_query = f"""
    
SELECT 
    hll_eq(
        hll_union(
            {alex_hll}, '{hll_val}'::hll),
        '{hll_val}'::hll) as hll_equal;
"""

result = db_conn.query(
    sql_query)
In [133]:
result.hll_equal[0]
Out[133]:
False

Intersection attack for all grid cells

Repeat intersection attack for all HLL sets per grid, store results in a separate column:

In [134]:
for grid in datasets:
    grid["alex"] = None

Test all grid cells

In [135]:
for dataset_id, grid in enumerate(datasets):
    usermetric_series = grid[f"{metric}_est"].dropna()
    bins_found = 0
    for idx, __ in usermetric_series[usermetric_series > 0].iteritems():
        hll_val = grid.loc[idx][f"{metric}_hll"]
        sql_query = f"""
        SELECT hll_eq(
            hll_union(
                {alex_hll}::hll, '{hll_val}'::hll),
            '{hll_val}'::hll) as hll_equal;
        """
        result = db_conn.query(
            sql_query)
        if result.hll_equal[0]:
            bins_found += 1
            clear_output(wait=True)
            print(
                f"Dataset {dataset_id+1} "
                f"- Number of positive bins found: "
                f"{bins_found}. Last positive bin index: {idx}")
            grid.loc[idx, "alex"] = True
Dataset 2 - Number of positive bins found: 143. Last positive bin index: (15159904, -4420048)

All grid cells that have been marked positive for the intersection attack contain the HLL patterns of the given hash.

In [136]:
grid_published[grid_published["alex"] == True]
Out[136]:
geometry userdays_est userdays_hll alex
xbin ybin
-10640096 4579952 POLYGON ((-10640096.000 4579952.000, -10540096... 204279.0 \x148b40424e73952751ce94292a31ce7420e839ce5424... True
4179952 POLYGON ((-10640096.000 4179952.000, -10540096... 72295.0 \x148b402a0c62a928214c6318e9414c54112541d0739d... True
-10540096 4079952 POLYGON ((-10540096.000 4079952.000, -10440096... 13863.0 \x148b40208e2114843148618c462210429045210a2288... True
-10440096 4179952 POLYGON ((-10440096.000 4179952.000, -10340096... 6406.0 \x148b4018424008421042218c6518c6210c85110421a0... True
4079952 POLYGON ((-10440096.000 4079952.000, -10340096... 2467.0 \x148b4000842088200880018803104021004108002000... True
... ... ... ... ... ...
12159904 4279952 POLYGON ((12159904.000 4279952.000, 12259904.0... 2219.0 \x148b4000c00088a110862084640840601c6200440008... True
12459904 -4520048 POLYGON ((12459904.000 -4520048.000, 12559904.... 25263.0 \x148b40198862148430c843988718d23414e329d26490... True
13359904 -4020048 POLYGON ((13359904.000 -4020048.000, 13459904.... 2689.0 \x148b402084000c211004340006184213106110461400... True
13459904 -4020048 POLYGON ((13459904.000 -4020048.000, 13559904.... 38365.0 \x148b4051c8631484490a519103294c4318c92aca5314... True
15159904 -4420048 POLYGON ((15159904.000 -4420048.000, 15259904.... 8800.0 \x148b40190431088620c8520cc210c6510c4112482180... True

124 rows × 4 columns

Check how many positive grid cells have a higher likelihood of being true positives than Cabo Verde (usercount: 56)

In [137]:
if SCENARIO == "robert":
    grid_published[grid_published["alex"] == True].sort_values(by=['usercount_est'], ascending=True).head()

Visualize results

In [138]:
color_raw = "red"
color_published = "#810f7c"
color_internal = "#fc4f30"

Get positive grid cells that have only been detected with internal data (usercount < 100)

In [139]:
internal_additional = pd.concat(
    [grid_internal[grid_internal["alex"] == True], grid_published[grid_published["alex"] == True]]
    ).drop_duplicates(keep=False)
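The concat + drop_duplicates(keep=False) idiom above removes every row that occurs in both frames; since the published positives are a subset of the internal positives, what remains are the internal-only cells. A toy sketch with illustrative data:

```python
import pandas as pd

internal = pd.DataFrame({"xbin": [1, 2, 3], "alex": [True, True, True]})
published = pd.DataFrame({"xbin": [2], "alex": [True]})

# Rows occurring in both frames appear twice after concat and are
# dropped entirely by keep=False, leaving the internal-only rows.
internal_only = pd.concat([internal, published]).drop_duplicates(keep=False)
print(internal_only["xbin"].tolist())  # [1, 3]
```

Note that this computes a symmetric difference: a row present only in the second frame would also survive, so the idiom relies on the subset relationship holding here.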
In [140]:
internal_additional.plot()
Out[140]:
<AxesSubplot:>

Check how many of these additional positive grid cells have a higher likelihood of being true positives than Cabo Verde (usercount: 56)

In [141]:
if SCENARIO == "robert":
    internal_additional[internal_additional["alex"] == True].sort_values(by=['usercount_est'], ascending=True).head(30)

Combine layers into single graphic, annotate:

In [142]:
fig, ax = plt.subplots(1, 1, figsize=(11, 14))
internal_additional.centroid.buffer(250000).plot(
    ax=ax,
    facecolor=color_internal,
    edgecolor=color_internal, 
    linewidth=1,
    alpha=0.9)
grid_published[grid_published["alex"] == True].plot(
    ax=ax,
    facecolor=color_published,
    edgecolor=color_published, 
    linewidth=1,
    alpha=0.9)
geoseries_alex_locations_proj.buffer(500000).plot(
    ax=ax,
    facecolor="none",
    edgecolor=color_raw, 
    linewidth=0.5,
    alpha=0.9
)
if SCENARIO == 'sandy':
    label_text_raw = f"on {alex_userday}"
    drop_rows_idx = gdf.index[gdf['name'] == "Cabo Verde"].tolist()
    location_label_gdf = gdf.drop(drop_rows_idx)
else:
    label_text_raw = "actual locations"
    drop_rows_idx = gdf.index[gdf['name'].isin(["San Francisco", "Berlin"])].tolist()
    location_label_gdf = gdf.drop(drop_rows_idx)

legend_entry_raw = Line2D(
        [0], [0],
        markeredgecolor="red",
        linestyle="None",
        linewidth=0.5,
        marker='o',
        markerfacecolor='None',
        markersize=15,
        label=f"Alex, {label_text_raw} (RAW)")

external_patch = mpatches.Patch(
    color=color_published,
    label='Query results \non published data \n(usercount > 100)')

legend_entry_internal = Line2D(
        [0], [0],
        markeredgecolor=color_internal,
        linestyle="None",
        linewidth=0.5,
        marker='o',
        markerfacecolor=color_internal,
        markersize=15,
        label='Additional query results \nwith direct database access')
legend_entries = [
    external_patch,
    legend_entry_internal,
    legend_entry_raw
    ]
plt.legend(
    handles=legend_entries, loc='lower left',
    frameon=False, prop={'size': 16})

# combine with world geometry
world.plot(
    ax=ax, color='none', edgecolor='black', linewidth=0.3)
# fig.patch.set_visible(False)
ax.axis('off')
ax.add_artist(ax.patch)
ax.patch.set_zorder(-1)
fig.tight_layout()
annotate_locations(gdf=location_label_gdf)
fig.savefig(
    OUTPUT / "figures" / f"Alex_privacy_example_{SCENARIO}.png", dpi=300, bbox_inches = 'tight',
    pad_inches = 0)
plt.show()

Finalize notebook:

In [101]:
db_connection.close()

Convert notebook to HTML

In [102]:
!jupyter nbconvert --to html_toc \
    --output-dir=../out/html ./Privacy_test_alex.ipynb \
    --template=../nbconvert.tpl \
    --ExtractOutputPreprocessor.enabled=False # create single output file
[NbConvertApp] Converting notebook ./Privacy_test_alex.ipynb to html_toc
[NbConvertApp] Writing 525427 bytes to ../out/html/Privacy_test_alex.html
In [ ]: