In this post, we’ll learn how to run MaxMind Geo Lookups directly on raw data as it is ingested and add these lookups as columns to our Hydrolix Table.
We’ll leverage Hydrolix Dictionaries (for in memory lookups), Hydrolix Data Enrichment, and custom functions to achieve this.
Table of Contents
Install hdxcli tool
To load dictionary file and their associated configuration the fastest solution is to leverage our CLI tool:
https://github.com/hydrolix/hdxcli
You can use the following command to install the latest version:
1 |
pip install --index-url https://test.pypi.org/simple hdxcli==1.0rc31 |
Once you have install hdxcli tool you can create a profile ~/.hdx_cli/config.toml
the configuration file contains section like the following:
1 2 3 4 5 6 7 8 9 |
[default] username = "username@hydrolix.io" hostname = "hydrolix.company.net" projectname = "myproject" tablename = "mytable" [prod] username = "username@hydrolix.io" hostname = "hydrolix-prod.company.net" |
The configuration can contains multiple profile environment, if nothing is specified the CLI will use the default one. To use the prod environment you can use hdxcli --profile prod
For this post we are creating a new project called maxmind
where we’ll put all dictionary and function.
1 |
hdxcli project create maxmind |
Load maxmind files
There are 2 main steps required to setup geolookup, the first step is to actually load the CSV files into Hydrolix cluster.
The default CSV files GeoLite2-City-Locations-en.csv doesn’t load in Hydrolix because of single quotes in the CSV file, you can create a new file in python with double quote everywhere:
1 2 3 4 5 6 7 8 |
import csv with open('GeoLite2-City-Locations-en.csv', 'r') as csvfilein: with open('GeoLite2-City-Locations-en-fixed.csv', 'w') as csvfileout: csvreader = csv.DictReader(csvfilein, delimiter=',') writer = csv.DictWriter(csvfileout, fieldnames=csvreader.fieldnames, quoting=csv.QUOTE_ALL) writer.writeheader() for row in csvreader: writer.writerow(row) |
After fixing the CSV file you can start loading those into Hydrolix.
1 2 3 4 5 6 7 8 |
hdxcli --project maxmind dictionary files upload GeoLite2-Country-Locations-en.csv geoip_country_locations_en -t verbatim hdxcli --project maxmind dictionary files upload GeoLite2-Country-Blocks-IPv6.csv geoip_country_blocks_ipv6 -t verbatim hdxcli --project maxmind dictionary files upload GeoLite2-ASN-Blocks-IPv4.csv geoip_asn_blocks_ipv4 -t verbatim hdxcli --project maxmind dictionary files upload GeoLite2-City-Blocks-IPv6.csv geoip_city_blocks_ipv6 -t verbatim hdxcli --project maxmind dictionary files upload GeoLite2-ASN-Blocks-IPv6.csv geoip_asn_blocks_ipv6 -t verbatim hdxcli --project maxmind dictionary files upload GeoLite2-City-Blocks-IPv4.csv geoip_city_blocks_ipv4 -t verbatim hdxcli --project maxmind dictionary files upload GeoLite2-City-Locations-en-fixed.csv geoip_city_locations_en -t verbatim hdxcli --project maxmind dictionary files upload GeoLite2-Country-Blocks-IPv4.csv geoip_country_blocks_ipv4 -t verbatim |
The second step is to actually create dictionary definition for those, this is the example for the geoip_country_locations_en.json
:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 |
{ "name": "geoip_country_locations_en", "settings": { "filename": "geoip_country_locations_en", "layout": "hashed", "lifetime_seconds": 5, "output_columns": [ { "name": "geoname_id", "datatype": { "type": "uint64", "denullify": true } }, { "name": "locale_code", "datatype": { "type": "string", "denullify": true } }, { "name": "continent_code", "datatype": { "type": "string", "denullify": true } }, { "name": "continent_name", "datatype": { "type": "string", "denullify": true } }, { "name": "country_iso_code", "datatype": { "type": "string", "denullify": true } }, { "name": "country_name", "datatype": { "type": "string", "denullify": true } }, { "name": "subdivision_1_iso_code", "datatype": { "type": "string", "denullify": true } }, { "name": "subdivision_1_name", "datatype": { "type": "string", "denullify": true } }, { "name": "subdivision_2_iso_code", "datatype": { "type": "string", "denullify": true } }, { "name": "subdivision_2_name", "datatype": { "type": "string", "denullify": true } }, { "name": "city_name", "datatype": { "type": "string", "denullify": true } }, { "name": "metro_code", "datatype": { "type": "uint32", "denullify": true } }, { "name": "time_zone", "datatype": { "type": "string", "denullify": true } }, { "name": "is_in_european_union", "datatype": { "type": "uint8", "denullify": true } } ], "primary_key": [ "geoname_id" ], "format": "CSVWithNames" } } |
All the dictionary definition are available in our github:
https://github.com/hydrolix/transforms/tree/dev/MaxMind/dictionary
You can checkout this repository and then use the following command to create the dictionary into Hydrolix:
1 2 3 4 5 6 7 8 |
hdxcli --project maxmind dictionary create geoip_asn_blocks_ipv4.json geoip_asn_blocks_ipv4 hdxcli --project maxmind dictionary create geoip_asn_blocks_ipv6.json geoip_asn_blocks_ipv6 hdxcli --project maxmind dictionary create geoip_city_blocks_ipv4.json geoip_city_blocks_ipv4 hdxcli --project maxmind dictionary create geoip_city_blocks_ipv6.json geoip_city_blocks_ipv6 hdxcli --project maxmind dictionary create geoip_city_locations_en.json geoip_city_locations_en hdxcli --project maxmind dictionary create geoip_country_blocks_ipv4.json geoip_country_blocks_ipv4 hdxcli --project maxmind dictionary create geoip_country_blocks_ipv6.json geoip_country_blocks_ipv6 hdxcli --project maxmind dictionary create geoip_country_locations_en.json geoip_country_locations_en |
You should now be able to do SQL query using the MaxMind Geo ip dictionary freshly loaded into Hydrolix:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
SELECT ip, dictGetUInt32('maxmind_geoip_city_blocks_ipv4', 'geoname_id', tuple(IPv4StringToNum(ip))) AS geoname_id, dictGetString('maxmind_geoip_city_blocks_ipv4', 'postal_code', tuple(IPv4StringToNum(ip))) AS postcode, dictGetFloat64('maxmind_geoip_city_blocks_ipv4', 'latitude', tuple(IPv4StringToNum(ip))) AS latitude, dictGetFloat64('maxmind_geoip_city_blocks_ipv4', 'longitude', tuple(IPv4StringToNum(ip))) AS longitude, dictGetUInt32('maxmind_geoip_city_blocks_ipv4', 'accuracy_radius', tuple(IPv4StringToNum(ip))) AS accuracy_radius, dictGetString('maxmind_geoip_city_locations_en', 'locale_code', toUInt64(geoname_id)) AS locale_code, dictGetString('maxmind_geoip_city_locations_en', 'continent_code', toUInt64(geoname_id)) AS continent_code, dictGetString('maxmind_geoip_city_locations_en', 'continent_name', toUInt64(geoname_id)) AS continent_name, dictGetString('maxmind_geoip_city_locations_en', 'country_iso_code', toUInt64(geoname_id)) AS country_iso_code, dictGetString('maxmind_geoip_city_locations_en', 'country_name', toUInt64(geoname_id)) AS country_name, dictGetString('maxmind_geoip_city_locations_en', 'subdivision_1_iso_code', toUInt64(geoname_id)) AS subdivision_1_iso_code, dictGetString('maxmind_geoip_city_locations_en', 'subdivision_1_name', toUInt64(geoname_id)) AS subdivision_1_name, dictGetString('maxmind_geoip_city_locations_en', 'subdivision_2_iso_code', toUInt64(geoname_id)) AS subdivision_2_iso_code, dictGetString('maxmind_geoip_city_locations_en', 'subdivision_2_name', toUInt64(geoname_id)) AS subdivision_2_name, dictGetString('maxmind_geoip_city_locations_en', 'city_name', toUInt64(geoname_id)) AS city_name, dictGetUInt32('maxmind_geoip_city_locations_en', 'metro_code', toUInt64(geoname_id)) AS metro_code, dictGetString('maxmind_geoip_city_locations_en', 'time_zone', toUInt64(geoname_id)) AS time_zone, dictGetUInt8('maxmind_geoip_city_locations_en', 'is_in_european_union', toUInt64(geoname_id)) AS is_in_european_union FROM ( SELECT arrayJoin(['129.45.17.12', '173.194.112.139', '77.88.55.66', '2.28.228.0', '95.47.254.1', '62.35.172.0']) AS ip ) |
Simplify your query with function
As you can see above the query can be complex, and here we don’t even deal with IPv6 and IPv4 which are using different dictionary files.
This is an example which is looking if the IP is v4 then load the proper dictionary maxmind_geoip_city_blocks_ipv4
and if it’s v6 then maxmind_geoip_city_blocks_ipv6
.
From the dictionary we load the geoname_id
use for subsequent lookup.
1 2 3 4 5 6 |
(ip) -> toUInt64(multiIf( isIPv4String(ip), dictGetUInt32('maxmind_geoip_city_blocks_ipv4', 'geoname_id', tuple(IPv4StringToNumOrDefault(toString(assumeNotNull(ip))))), isIPv6String(ip), dictGetUInt32('maxmind_geoip_city_blocks_ipv6', 'geoname_id', tuple(IPv6StringToNumOrDefault(toString(assumeNotNull(ip))))), 0 )) |
All the functions are available into our github:
https://github.com/hydrolix/transforms/tree/dev/MaxMind/function
And to load those function with hdxcli you can use:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
hdxcli --project maxmind function create -f accuracy_radius.json accuracy_radius hdxcli --project maxmind function create -f asn_name.json asn_name hdxcli --project maxmind function create -f asn_org.json asn_org hdxcli --project maxmind function create -f city_name.json city_name hdxcli --project maxmind function create -f continent_code.json continent_code hdxcli --project maxmind function create -f continent_name.json continent_name hdxcli --project maxmind function create -f country_iso_code.json country_iso_code hdxcli --project maxmind function create -f country_name.json country_name hdxcli --project maxmind function create -f geo.json geo hdxcli --project maxmind function create -f geoname_id.json geoname_id hdxcli --project maxmind function create -f is_in_european_union.json is_in_european_union hdxcli --project maxmind function create -f latitude.json latitude hdxcli --project maxmind function create -f locale_code.json locale_code hdxcli --project maxmind function create -f longitude.json longitude hdxcli --project maxmind function create -f metro_code.json metro_code hdxcli --project maxmind function create -f subdivision_1_iso_code.json subdivision_1_iso_code hdxcli --project maxmind function create -f subdivision_1_name.json subdivision_1_name hdxcli --project maxmind function create -f subdivision_2_iso_code.json subdivision_2_iso_code hdxcli --project maxmind function create -f subdivision_2_name.json subdivision_2_name hdxcli --project maxmind function create -f time_zone.json time_zone |
This allow you to simplify the query to something like the following:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
SELECT ip, maxmind_geo('asn', ip) AS asn, maxmind_geo('org', ip) AS org, maxmind_geo('country', ip) AS country, maxmind_geo('subdivision1', ip) AS subdivision1, maxmind_geo('state', ip) AS state, maxmind_geo('subdivision2', ip) AS subdivision2, maxmind_geo('city', ip) AS city FROM ( SELECT arrayJoin(['129.45.17.12', '173.194.112.139', '77.88.55.66', '2.28.228.0', '95.47.254.1', '62.35.172.0', '2001:4860:4860::8888', '2607:f8b0:4001:c24::65', '2606:4700:4700::1111', '2600:9000:254a:6000:7:49a5:5fd2:2221']) AS ip ) |
Adding GeoLookup at Ingest
Hydrolix allows data enrichment at ingest and we can setup geo lookup here.
This is an example you can use in transform SQL where the lookup is done on the column cliIP
:
1 2 3 4 5 6 7 |
SELECT maxmind_geo('asn', cliIP) AS cliIP_asn, maxmind_geo('org', cliIP) AS cliIP_org, maxmind_geo('country', cliIP) AS cliIP_country, maxmind_geo('state', cliIP) AS cliIP_state, maxmind_geo('city', cliIP) AS cliIP_city, * FROM {STREAM} |
The transform needs to include the virtual column:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 |
{ "name": "cliIP_asn", "datatype": { "type": "uint32", "source": { "from_input_field": "sql_transform" } } }, { "name": "cliIP_org", "datatype": { "type": "string", "source": { "from_input_field": "sql_transform" } } }, { "name": "cliIP_country", "datatype": { "type": "string", "source": { "from_input_field": "sql_transform" } } }, { "name": "cliIP_state", "datatype": { "type": "string", "source": { "from_input_field": "sql_transform" } } }, { "name": "cliIP_city", "datatype": { "type": "string", "source": { "from_input_field": "sql_transform" } } } |
The data will now contain the new column:
- cliIP_asn -> AS number
- cliIP_org -> AS name
- cliIP_country -> Country
- cliIP_state -> State
- cliIP_city -> City
Obviously this is an example and you can rename the column to whatever you like!