Geographic IP database using PERL, PHP and MySQL — UPDATE: September 16 2008

Updated September 16 2008 to reflect the numerous changes that have resulted since the original article was posted in 2005 – smp


Targeting Web site content to the specific visitors who view the site is a very important marketing advantage. Being able to track incoming visitors by the country that they originate from is an additional item that can assist companies in ensuring that visitors are presented with relevant content. This may seem like a daunting task, but it can be achieved with a high degree of accuracy using publicly accessible data, and Open Source software.

IP to Country Mapping

The idea for IP to Country mapping is one that has started to appear more frequently on the Internet in recent months. All GeoIP systems do warn users that they are not 100% accurate. The accuracy of GeoIP mapping can be affected by things such as large corporate or ISP networks where traffic is routed out a small number of public access points, regardless of the traffic’s point of origin.

Making IP Addresses Searchable

The first issue that needs to be addressed is how to determine if an IP Address is in one of the ranges that is defined as originating from a distinct country. The simplest way to range-match IP Addresses is to abandon the dotted-quad notation we are all familiar with, and convert the IP Address to an IP Number.
All IP Addresses can be converted into decimal numbers that fall into a known range between 0 and (2^32)-1 (4294967295). In reality, the range is even smaller than that, as public IP Addresses fall between 0.0.0.0 (IP Number: 0) and 223.255.255.255 (IP Number: 3758096383).
A quick search of the Web showed me that there is a way to create functions to convert IP Addresses to their numerical (IP Number) equivalent and reverse the process.

sub ip2long {
	return sprintf("%u",unpack("l*", pack("l*", unpack("N*", inet_aton(shift)))));
}
sub long2ip {
	return inet_ntoa(pack("N*", shift));
}
<--snip-->
if ($start_ip =~ /\d+\.\d+\.\d+\.\d+/) {
         ## my $ip_address = shift($start_ip);
         chomp($start_ip);
         $long_start = ip2long($start_ip);
         ## print "$ip_address converts to $ip_number\n";
 }

If you are using PHP in your applications, this conversion process is made even easier by native function calls.

Convert IP Address to IP Number
$ip_number = sprintf("%u",ip2long($ip_address));

IP Address Location Data

Now that we have settled on a format for the IP data to be used in the database, we now have to find IP data that allows us to map IP Addresses to countries. This is easier than it sounds, as this data is centrally held by the 5 regional IP Registries — ARIN, RIPE, APNIC, LACNIC, and AFRINIC. After poking around in the depths of their Web sites, I found that they actually provide text formatted versions of the allocated and assigned IP ranges that they are responsible for. All of the registries use the same format, which makes parsing the these files a simple process.
As of March, 2005, the distribution of IPV4 networks in the database by registry is:

+----------+--------------+
| registry | num_networks |
+----------+--------------+
| arin     |        36073 |
| ripencc  |        14813 |
| apnic    |        10474 |
| lacnic   |         1460 |
| afrinic  |          443 |
+----------+--------------+
5 rows in set (0.52 sec)

I chose to use the PERL module WWW::CURL to retrieve the files. You could re-write the application to use LWP or some other method on systems where cURL is not supported, as it is a simple file download over FTP. I update the data once a day, which may at first appear excessive. However, I have seen upwards of 40-50 new rows added to the database in a single day.
Some may ask why I chose to write the downloaded files to a file rather than immediately inserting them into the database. Using this two-step process gives me the ability to manually rollback to an older database if there is a problem retrieving one of the registry files. I have set an arbitrary limit of 75,000 lines for the entire aggregated file; if the file is less than that, the remainder of the process is aborted.
The data retrieved from the registries is in the following format.
Registry Raw Data Format

<snip>
apnic|CN|ipv4|202.127.4.0|256|19950610|assigned
apnic|BN|ipv4|202.160.0.0|2048|19950610|allocated
apnic|NP|asn|4613|1|19950611|allocated
apnic|LK|ipv4|203.143.0.0|1024|19950612|allocated
apnic|MO|asn|4609|1|19950615|allocated
apnic|KR|asn|4670|1|19950616|allocated
apnic|SB|ipv4|202.63.254.0|512|19950618|assigned
apnic|JP|ipv4|202.232.0.0|262144|19950618|allocated
apnic|SG|ipv4|203.127.192.0|8192|19950618|allocated
apnic|PK|asn|4615|1|19950629|allocated
apnic|HK|asn|4614|1|19950704|allocated
</snip>

The fields are all “|” (pipe-character) separated, and are described below.

COLUMN		VALUES
---------------------------------------------------------------------
REGISTRY:	apnic,arin,ripencc,lacnic,iana
COUNTRY_CODE:	One of 240 unique 2-character country codes or "*"
ADDRESS_TYPE:	asn,ipv4,ipv6
ADDRESS:	Either the starting IP Address or AS Number or "*"
NUMBER:		Number of IPs in range or "1" if ADDRESS_TYPE is "asn"
DATE:		Date IP range or AS Number was added to database or "*"
RANGE_TYPE:	"allocated" -> borrowed; "assigned" -> owned

Storing the Data in MySQL

To store the data, I created a two-table MySQL database named “ip_registry”, using the script below.
Database creation statement for ‘ip_registry’

CREATE DATABASE ip_registry;

Table structure for table ‘country_code’

CREATE TABLE ip_registry.country_code (
code char(2) default NULL,
country varchar(50) default NULL,
UNIQUE KEY code (code)
);

Table structure for table ‘ip_map’

CREATE TABLE ip_registry.ip_map (
code char(2) default NULL,
registry char(10) default NULL,
ip_from double default NULL,
ip_to double default NULL,
UNIQUE KEY registry (registry,ip_from,ip_to)
);

As of March 2005 September 2008, the “ip_map” data table for my system runs to 63,263 88,000+ rows. This value will change daily, and may decrease suddenly at times. The registries make an effort to aggregate as many IP networks as possible into the largest possible contiguous block, and this aggregation process will reduce the number of individual entries by 2,000 – 3,000 rows in a single day.
The recognized standard for country codes is ISO 3166. In this standard, each nation is assigned a unique, two-character code. The ONLY exception I found to this rule is that, for historical reasons, the IP registries have entries for the United Kingdom listed with two country codes (GB and UK). I could have corrected this in the Perl script by standardizing on a single country code, but I preferred the solution of adding another row to the “country_code” table.
From the raw Registry data, I determined that only four of the fields useful for the project that I was working on: REGISTRY, COUNTRY_CODE, ADDRESS, and NUMBER. I then wrote PERL code to read the raw IP Registry data from the data file I created previously, convert the starting IP address to a number, use this starting IP Number that to generate the end IP Number, and then insert the rows into a database.
PERL: IP Number conversion and database insert

<--snip-->
if ($line_count >= 115000) {
## Establish Database Connection
print "\n\nOpening database connection";
my $dbh = DBI->connect("DBI:mysql:host=[host];database=logger","[username]","[password]",{PrintError=>0});
## Remove existing values
my $sth = $dbh->do("TRUNCATE TABLE ip_map");
print " --> Data from ip_map table dropped";
$sth = $dbh->prepare("INSERT into ip_map values (?,?,?,?,?,?)");
$count = 0;
print " --> Completed\n";
## Insert Data Into Database
print "Inserting data into the database";
open (PROCESS, "<$file");
  while ($line =
) {
        chomp ($line);
        if (($line =~ m/\|ipv4\|/) and ($line !~ m/\|\*\|/)) {
                ($registrar,$country_code,$item_type,$start_ip,$num_ip,$entry_date,$registry_type) = split(/\|/, $line);
                $long_start = 0;
                if ($start_ip =~ /\d+\.\d+\.\d+\.\d+/) {
                        ## my $ip_address = shift($start_ip);
                        chomp($start_ip);
                        $long_start = ip2long($start_ip);
                        ## print "$ip_address converts to $ip_number\n";
                        $long_end = $long_start + ($num_ip-1);
                        $count += $sth->execute($country_code,$registrar,$long_start,$long_end,$num_ip,$start_ip);
                }
        }
  }
}
<--snip-->

The “TRUNCATE” statement in the script has the affect of dropping the table and re-creating it using the column names and types defined in the initial create statement. It is easier to rebuild the data table each time new data is inserted to ensure that duplicates and overlaps do not enter into the database.
Why is the value of “$long_end” defined by “$long_start + ($num_ip-1)”? The IP address ranges delivered by the registries count the starting value as one of the items in the set — i.e. counting using ordinal numbers.

START_IP:   12.236.236.0
END_IP:     12.236.236.255
NUMBER_IP:  256

If cardinal numbering is used to calculate the address range, incorrect IP addresses will be generated.

IP Number Calculations
WRONG!	216853760 = 216853504 + 256 	-> END_IP = 12.236.237.0
RIGHT!	216853759 = 216853504 + (256-1)	-> END_IP = 12.236.236.255

I have also added a sanity-check that stops the insertion process if the number of lines in the data file is less than 75,0000 115,000. This would prevents the creation of a truncated database if one of the registries does not update their data files or the script is unable to retrieve the data files. The value of 75,000 115,000 appears high, but the data files that the data is extracted from also contain autonomous system and IPV6 data, as well as the IPV4 data that is inserted into the database. Currently, the data file runs to over 90,000 130,000 lines, so the 75,000 115,000 line barrier seems very reasonable to prevent inserting a broken dataset.

Querying the Database

Now that the database is constructed, we can start to run queries against it.

mysql> select ip.code,ip.registry, ip.ip_from, ip.ip_to, co.country
-> from ip_map ip, country_code co
-> where (ip.code = 'IS') and (ip.code = co.code);
+------+----------+------------+------------+---------+
| code | registry | ip_from    | ip_to      | country |
+------+----------+------------+------------+---------+
| IS   | ripencc  | 1049722880 | 1049731071 | ICELAND |
| IS   | ripencc  | 1359937536 | 1359970303 | ICELAND |
| IS   | ripencc  | 1383088128 | 1383096319 | ICELAND |
| IS   | ripencc  | 1385447424 | 1385455615 | ICELAND |
| IS   | ripencc  | 1390215168 | 1390280703 | ICELAND |
| IS   | ripencc  | 1403846656 | 1403863039 | ICELAND |
| IS   | ripencc  | 1433681920 | 1433690111 | ICELAND |
| IS   | ripencc  | 1439023104 | 1439039487 | ICELAND |
| IS   | ripencc  | 1440481280 | 1440514047 | ICELAND |
| IS   | ripencc  | 2644312064 | 2644377599 | ICELAND |
| IS   | ripencc  | 3238264832 | 3238330367 | ICELAND |
| IS   | ripencc  | 3245150208 | 3245154303 | ICELAND |
| IS   | ripencc  | 3261718528 | 3261726719 | ICELAND |
| IS   | ripencc  | 3264217088 | 3264282623 | ICELAND |
| IS   | ripencc  | 3556884480 | 3556892671 | ICELAND |
| IS   | ripencc  | 3558785024 | 3558793215 | ICELAND |
| IS   | ripencc  | 3565084672 | 3565092863 | ICELAND |
| IS   | ripencc  | 3584524288 | 3584532479 | ICELAND |
| IS   | ripencc  | 3585114112 | 3585122303 | ICELAND |
| IS   | ripencc  | 3585433600 | 3585441791 | ICELAND |
| IS   | ripencc  | 3586023424 | 3586031615 | ICELAND |
| IS   | ripencc  | 3587538944 | 3587547135 | ICELAND |
| IS   | ripencc  | 3587981312 | 3587997695 | ICELAND |
| IS   | ripencc  | 3641278464 | 3641282559 | ICELAND |
| IS   | ripencc  | 3642535936 | 3642540031 | ICELAND |
| IS   | ripencc  | 3650592768 | 3650596863 | ICELAND |
| IS   | ripencc  | 3650596864 | 3650600959 | ICELAND |
| IS   | ripencc  | 3651915776 | 3651919871 | ICELAND |
+------+----------+------------+------------+---------+
28 rows in set (0.09 sec)

So the database structure is sound. It is important to build the file using all four five registries; even though Iceland is now covered by RIPE, older IP allocations and assignments were been handled by both RIPE and ARIN.
Having the registry information helps build in the flexibility to add a WHOIS functionality using this database, something that I have done on for my own IP Tracking and logging tool. This allows for further drilldowns on the data, beyond the scope of this article.
The main item that will be of interest to most Web site administrators is that they can now build dynamic pages using a data source which tracks their visitors’ announced IP address to the country of origin with a high degree of accuracy. This is particulary useful if you are attempting to distribute users to geographically diverse mirror sites. You can also do fun things, such as displaying the flag of the country that the visitor is coming from.
A generic copy of the registry retrieval and database insertion script can be yours just by contacting me.

22 thoughts on “Geographic IP database using PERL, PHP and MySQL — UPDATE: September 16 2008”

  1. When attempting to connect to the FTP sites above, I am prompted for login credentials. Anonymous login does not seem to work for me. Is the data publicly available or do I need a subscription of some sort to get the registry data from the FTP sites?
    Thanks,
    JW

  2. When attempting to connect to the FTP sites above, I am prompted for login credentials. Anonymous login does not seem to work for me. Is the data publicly available or do I need a subscription of some sort to get the registry data from the FTP sites?Thanks,JW

  3. Thank you man… That was a step by step procedure to a complete ip-tracker code!!! great work…

  4. Thank you man… That was a step by step procedure to a complete ip-tracker code!!! great work…

  5. I am doing something simliar to this to generate whitelist IP-list based on country of origin. I am troubled by the fact that I have 4.751 rows in my database, which have a country code of EU. Anyone know of a way of determining what country it really is?

  6. Jakob:From just the registrar data, you will unlikely be able to do this. However, there are a number of commercial products that may be able to assist you in narrowing these down.smp

  7. Well yes, but the idea was to make a do-it-yourself solution using public and free information. I have all the pieces in place, except for the one “country code” that does not conform to the standard ISO 3166.

  8. Unfortunately, all the data does exist, but it is in the whois data for the blocks in question. This is a manual process that I chose not to undertake due to its complexity and my lack of programming skils.

  9. Well yes, but the idea was to make a do-it-yourself solution using public and free information. I have all the pieces in place, except for the one “country code” that does not conform to the standard ISO 3166.

  10. Unfortunately, all the data does exist, but it is in the whois data for the blocks in question. This is a manual process that I chose not to undertake due to its complexity and my lack of programming skils.

Leave a Reply

Your email address will not be published. Required fields are marked *