As a member of CKAN’s technical team I often wonder what versions are still in use and which extensions are popular. Since we don’t have any analytics built into CKAN the only way we can find out is by going out and “asking” each CKAN site. This information can be very handy, since we can use it to decide when fixes are no longer worth the effort to backport, or which “core” extensions should recieve focus.

All CKAN instances by default have a handy API endpoint commonly found at /api/action/status_show that provides some interesting information. From this endpoint we can identify the version of CKAN being run, the site title, a list of all enabled plugins, a contact email (usually not set in practice) and the default language. Really this is everything we were hoping for.

Finding CKAN

Fanstatic is an old asset-management library that is used by CKAN to managed CSS and JavaScript files. It lost the popularity wars to webassets so only a few sites other than CKAN will be using it. We’ll query the fantastic shodan search engine which hits the entire IPv4 address space for pages that are on port 80 that contain fanstatic/:vendor:, since it’s used as part of the resource path on a generated page.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import requests


def get_shodan_results(key, query, max_pages=None):
    page = 1
    total = 0

    while True:
        r = requests.get(
            'https://api.shodan.io/shodan/host/search',
            params={
                'key': sys.argv[1],
                'query': query,
                'page': page,
                'l': 100
            }
        )

        j = r.json()

        for item in j['matches']:
            total += 1
            yield item

        if total >= j['total'] or page == max_pages:
            return

        if not j['matches']:
            return

        page += 1


if __name__ == '__main__':
    results = get_shodan_results(
        key=sys.argv[1],
        query='http.html:fanstatic/:vendor: port:"80"'
    )

    for result in results:
        if 'http' not in result:
            continue

        print('{scheme}://{host}:{port}'.format(
            scheme='https' if 'ssl' in result else 'http',
            host=result['http']['host'],
            port=result['port']
        ))

With this we have a list of around 269 URLs we can hit.

Querying CKAN

For each host we’ve discovered we’ll make a couple of API requests:

API Method Fields
/api/action/status_show Version, extensions, title, language
/api/action/package_search Date of first dataset, date of last dataset, # of datasets, # of resource formats
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import json
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
import requests
import requests.exceptions
import unicodecsv as csv

if __name__ == '__main__':
    writer = csv.DictWriter(sys.stdout, [
        'host',
        'version',
        'title',
        'locale',
        'extensions',
        'formats',
        'count',
        'first_date',
        'last_date'
    ])
    writer.writeheader()

    for host in sys.stdin:
        host = host.strip()

        try:
            r = requests.get(
                '{host}/api/action/status_show'.format(
                    host=host
                ),
                verify=False
            )
        except requests.exceptions.ConnectionError:
            continue

        if r.status_code != 200:
            continue

        try:
            j = r.json()['result']
        except ValueError:
            continue

        try:
            pkg_last = requests.get(
                '{host}/api/action/package_search'.format(
                    host=host
                ),
                params={
                    'sort': 'metadata_created desc',
                    'rows': 1,
                    'facet.field': json.dumps([
                        'res_format'
                    ])
                },
                verify=False
            ).json()['result']

            last_date = ''
            if pkg_last['results']:
                last_date = pkg_last['results'][0]['metadata_created']
        except (KeyError, TypeError, ValueError):
            pkg_last = {}

        try:
            pkg_first = requests.get(
                '{host}/api/action/package_search'.format(
                    host=host
                ),
                params={
                    'sort': 'metadata_created asc',
                    'rows': 1
                },
                verify=False
            ).json()['result']

            first_date = ''
            if pkg_first['results']:
                first_date = pkg_first['results'][0]['metadata_created']
        except (KeyError, TypeError, ValueError):
            pkg_first = {}

        writer.writerow({
            'host': host,
            'version': j['ckan_version'],
            'title': j['site_title'],
            'locale': j['locale_default'],
            'extensions': json.dumps(j['extensions']),
            'formats': json.dumps(
                pkg_last.get('facets', {}).get('res_format')
            ),
            'count': pkg_first.get('count', 0),
            'first_date': first_date,
            'last_date': last_date
        })

And put it all together:

$ python find_ckan.py | python query_ckan.py > ckan_results.csv

In the end we get 250 results from the 269 we started with. Lets throw our results into postgres so we can do some simple queries against them:

$ createdb ckan_results

Then create our table:

CREATE TABLE "public"."ckan_results" (
    "host" varchar(255),
    "version" varchar(15),
    "title" text,
    "locale" varchar(10),
    "extensions" jsonb,
    "formats" jsonb,
    "count" int,
    "first_date" timestamp with time zone,
    "last_date" timestamp with time zone,
    PRIMARY KEY ("host")
);

And then load our CSV:

COPY ckan_results FROM 'ckan_results.csv' DELIMITER ',' CSV HEADER;

Results

Lets take a look at our some of our results. You can grab a copy of the results CSV yourself here. Lets start by taking a look at version popularity, which was the original question that started this whole thing:

SELECT
  version,
  COUNT(1) AS count,
  string_agg(host, ', ')
FROM (SELECT DISTINCT ON (title, count) * FROM ckan_results) AS ckan_results
GROUP BY version
ORDER BY count DESC
LIMIT 10;
version count hosts
2.5.2 35 http://134.1.2.23:80, http://96.127.58.74:80, http://52.222.14.1:80, http://143.107.158.91:80, http://138.91.29.173:80, http://140.134.48.48:80, http://140.110.240.59:80, http://88.197.53.132:80, http://150.162.6.68:80, http://34.251.42.47:80, http://193.205.161.51:80, http://177.67.68.32:80, http://149.202.179.38:80, http://dati.trentino.it:80, http://162.243.161.199:80, http://84.39.40.151:80, http://92.222.232.25:80, http://103.43.46.53:80, http://195.220.224.113:80, http://13.75.156.29:80, http://mail.prebi.unlp.edu.ar:80, http://37.186.220.182:80, http://60.10.25.187:80, http://52.53.251.238:80, http://146.185.135.123:80, http://13.69.151.171:80, http://46.31.137.66:80, http://52.35.87.14:80, http://52.30.146.224:80, http://136.172.30.53:80, http://rds.ua.edu:80, http://196.44.161.89:80, http://54.66.131.32:80, http://54.252.174.240:80, http://211.109.9.119:80
2.6.2 22 http://213.128.251.124:80, http://148.88.72.40:80, http://52.28.62.35:80, http://52.56.108.184:80, http://210.115.182.219:80, http://52.14.57.54:80, http://35.186.144.159:80, http://34.252.66.235:80, http://34.212.183.41:80, http://34.211.183.146:80, http://40.114.240.248:80, http://lwbin-datahub.ad.umanitoba.ca:80, http://data.gss.stonybrook.edu:80, http://77.66.12.195:80, http://103.9.227.26:80, http://opendata.cs.vt.edu:80, http://34.211.119.229:80, http://192.103.41.232:80, http://146.48.123.151:80, http://84.39.36.193:80, http://14.225.2.228:80, http://60.10.25.235:80
2.6.0 20 http://capulet.bcu.ac.uk:80, http://217.58.92.91:80, http://84.39.34.228:80, http://84.39.48.156:80, http://169.239.252.29:80, http://52.27.39.170:80, http://200.198.212.122:80, http://81.208.35.117:80, http://194.135.94.23:80, http://staging.dati.toscana.it:80, http://dati.toscana.it:80, http://212.145.253.91:80, http://91.244.248.28:80, http://177.12.72.177:80, http://193.137.203.42:80, http://5.53.104.200:80, http://bennu.magic.ubc.ca:80, http://hub.urbanopus.net:80, http://139.162.65.209:80, http://103.39.193.205:80
2.5.1 17 http://52.72.107.99:80, http://52.87.102.194:80, http://160.40.63.111:80, http://186.33.211.187:80, http://158.49.245.76:80, http://52.31.131.69:80, http://52.192.227.186:80, http://shiny.bo.isac.cnr.it:80, http://srv-drdsi-ext.jrc.it:80, http://180.250.189.45:80, http://104.131.110.37:80, http://52.29.76.59:80, http://145.253.163.170:80, http://data.deutschebahn.com:80, http://176.124.252.42:80, http://5.196.95.43:80, http://222.158.210.6:80
2.2 16 http://b2find.eudat.eu:80, http://54.228.230.194:80, http://84.38.48.220:80, http://46.30.246.78:80, http://40.83.120.36:80, http://mbmggin.mtech.edu:80, http://216.109.20.7:80, http://datos.arona.org:80, http://ofisl.ugr.es:80, http://128.140.233.15:80, http://200.3.193.35:80, http://54.228.181.101:80, http://191.239.109.218:80, http://139.217.10.108:80, http://52.68.36.26:80, http://52.197.46.57:80
2.4.1 14 http://202.214.212.68:80, http://150.162.114.59:80, http://5.189.169.245:80, http://161.148.242.38:80, http://www.mkecommunitydata.com:80, http://213.226.229.30:80, http://223.27.48.21:80, http://6st.lancs.ac.uk:80, http://159.203.88.200:80, http://139.217.17.163:80, http://163.29.98.131:80, http://42.159.244.35:80, http://210.189.171.49:80, http://210.189.171.17:80
2.3 13 http://173.226.100.112:80, http://130.206.126.145:80, http://174.140.171.251:80, http://45.55.253.172:80, http://props.getjive.com:80, http://130.75.87.3:80, http://157.193.231.112:80, http://157.193.231.70:80, http://ckan-ds0.cloudapp.net:80, http://fllwbp34.fl.ulaval.ca:80, http://54.193.76.39:80, http://103.224.137.178:80, http://118.97.66.109:80
2.2.1 13 http://193.146.227.183:80, http://194.48.80.125:80, http://www.civicdata.com:80, http://128.135.112.101:80, http://186.28.253.85:80, http://net180.environ.umanitoba.ca:80, http://datos.labcd.mx:80, http://195.45.76.147:80, http://ckan-dev.snap.uaf.edu:80, http://194.140.149.252:80, http://ubdc-mac-2.css.gla.ac.uk:80, http://104.46.233.143:80, http://210.189.171.126:80
2.6.1 9 http://fosckaprd01.cer.auckland.ac.nz:80, http://51.15.143.57:80, http://35.167.220.103:80, http://180.250.53.125:80, http://52.27.149.196:80, http://148.251.121.162:80, http://54.247.88.62:80, http://45.64.96.188:80, http://80.93.233.124:80
2.5.3 7 http://opendata.corkcity.ie:80, http://89.238.75.196:80, http://data.cityofsydney.nsw.gov.au:80, http://52.200.53.203:80, http://200.20.168.26:80, http://181.209.63.190:80, http://210.236.15.29:80

How about (default) language popularity?

SELECT
  locale,
  COUNT(1) AS count
FROM (SELECT DISTINCT ON (title, count) * FROM ckan_results) AS ckan_results
GROUP BY locale
ORDER BY count DESC
LIMIT 10;
locale count
en 116
es 21
ja 14
it 11
pt_BR 8
de 6
zh_TW 5
zh_CN 5
fr 5
en_GB 4

How about the top-20 extensions?

SELECT
  jsonb_array_elements(extensions) AS ext,
  COUNT(1)                         AS count
FROM (SELECT DISTINCT ON (title, count) * FROM ckan_results) AS ckan_results
GROUP BY ext
ORDER BY count DESC
LIMIT 20;
ext count
stats 214
text_view 169
recline_view 163
datastore 161
image_view 157
datapusher 111
resource_proxy 89
harvest 55
recline_preview 52
pdf_view 52
ckan_harvester 51
text_preview 50
recline_grid_view 48
spatial_metadata 46
spatial_query 44
geo_view 42
geojson_view 38
recline_map_view 38
recline_graph_view 37
googleanalytics 34

What are the top-20 most common file formats for linked resources?

SELECT
  lower(format.key)                AS format_type,
  SUM(format.value :: TEXT :: INT) AS total
-- The odds of a site having the same title & package
-- count are pretty low, we'll consider them identical sites.
FROM (SELECT DISTINCT ON (title, count) * FROM ckan_results) AS ckan_results,
     jsonb_each(formats) AS format
WHERE formats :: TEXT NOT IN ('null', '{}') AND
      -- This host is a little messed up, we're excluding
      -- it to make the numbers a bit saner.
      host != 'http://210.115.182.219:80'
GROUP BY format_type
ORDER BY total DESC
LIMIT 20;
format_type total
html 133151
pdf 120068
  98324
csv 66160
json 48434
zip 46946
xml 44949
wms 36239
xls 21607
originator data format 20429
bin 18607
shp 16120
wfs 15539
rdf 10736
tiff 10538
xlsx 9721
mrsid 9625
kml 9498
txt 6378
xyz 6129