CKAN is a data management system that provides tools for publishing and sharing data. The data can be metadata, text, audio, or video. How much data can CKAN handle? This is the question we tried to answer in the EUDAT project at the Language Archive Group. We found that it performs well with tens of thousands of datasets (records) but has issues with millions of records.
CKAN is used by a number of organizations and governments, including the UK, Canadian and US governments. The UK Government (data.gov.uk) uses CKAN to provide central access to government data with the objective of making data “easy to find, easy to license, and easy to re-use”. The Canadian government (data.gc.ca) uses CKAN to provide one-stop access to the Government of Canada’s searchable open data with the objective of enhancing transparency and accountability. Similarly, the US Federal Government (data.gov) uses CKAN to provide a single portal where data from different portals, sources and catalogs (over 200 publishing organizations) is displayed in a standardized user interface, allowing users to search, filter and facet through thousands of datasets.
Many of these public CKAN installations have datasets in the thousands, not in the millions. The Canadian national portal (data.gc.ca) has fewer than 200,000 datasets. The US government portal (data.gov) has fewer than 100,000 datasets, and the UK government portal (data.gov.uk) has fewer than 20,000 datasets. Because we plan to use CKAN as a single portal for many publishers with potentially millions of datasets, we ran performance and scalability tests.
Can CKAN handle millions?
No, with default configurations
Yes, with performance tuning
Default configurations – if CKAN is installed without any changes other than to make it work, importing millions of datasets will take a long time. CKAN is designed and configured primarily for thousands of datasets. To make it work for millions of datasets, performance optimization is necessary.
Hence, we carried out performance tuning at three levels. First, we changed the CKAN configuration file. The changes here involve delaying Solr indexing/committing and stopping activity streaming. Second, we changed some of the design of the PostgreSQL database tables (based on tips from ) and our observations. The changes here involve removing constraints and adding or removing database indexes. Third, we changed a few PostgreSQL settings (postgresql.conf) to take advantage of the available memory and CPU. With these changes, we imported 2 million datasets into CKAN in less than 2 weeks. Without these changes, it would have taken over a year. We estimated this from the trend seen while importing 150,000 datasets on a machine with 8 GB RAM and a 2.67 GHz dual-core CPU.
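To put the two import times side by side, here is a small back-of-the-envelope calculation in Python. It is only a sketch: it uses the figures quoted above (2 million datasets, under 2 weeks, over a year), and it treats "2 weeks" as exactly 14 days and "a year" as exactly 365 days, which are our simplifying assumptions. The results are therefore bounds, not measured rates.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def datasets_per_second(datasets: int, days: float) -> float:
    """Average import rate needed to load `datasets` records in `days` days."""
    return datasets / (days * SECONDS_PER_DAY)


TOTAL_DATASETS = 2_000_000  # figure quoted in the text
TUNED_DAYS = 14             # "less than 2 weeks": upper bound on the tuned import time
UNTUNED_DAYS = 365          # "over a year": lower bound on the untuned import time

# Tuned import finished in under 14 days, so its real rate is at least this.
tuned_rate = datasets_per_second(TOTAL_DATASETS, TUNED_DAYS)

# Untuned import would need more than 365 days, so its real rate is at most this.
untuned_rate = datasets_per_second(TOTAL_DATASETS, UNTUNED_DAYS)

# Ratio of the two durations gives a lower bound on the speedup.
speedup = UNTUNED_DAYS / TUNED_DAYS

print(f"tuned:   at least {tuned_rate:.2f} datasets/s")
print(f"untuned: at most  {untuned_rate:.3f} datasets/s")
print(f"speedup: at least {speedup:.0f}x")
```

Under these assumptions, the tuned setup sustained at least about 1.65 datasets per second on average, while the default setup would have managed at most about 0.06, a difference of at least 26 times.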