====================
Pattrn data packages
====================

In the :ref:`getting-started` tutorial we have seen how to package
a dataset directly with the source code of the Pattrn app and how
to publish them together on the web.

As the Pattrn code itself will mostly stay the same while the
dataset for a Pattrn instance may be updated frequently, in most
cases it is advisable to keep Pattrn code and data (along with
metadata and settings) separate, so that Editors only need to
focus on curating the dataset.

In order to do so, Pattrn supports the use of *Pattrn data packages*:
these are `NPM <https://npmjs.org/>`_ packages that contain data,
metadata and settings for a Pattrn instance.

A consistent way to package Pattrn data also allows Pattrn editors
to make interesting datasets available for others to reuse and
analyse in their own Pattrn instances, thereby facilitating
collaboration around datasets.

*A future aim for the Pattrn project is to switch to using* `Frictionless
Data Packages <http://frictionlessdata.io/data-packages/>`_, *as these
provide a standard way to package data for reuse: if you are a
data scientist or developer interested in adding support for Frictionless
Data Packages to Pattrn, please get in touch! (see* :ref:`pattrn-gitter-room` *).*

In this section of the Pattrn manual, you will learn:

* how to :ref:`install <installing-a-pattern-data-package>`
  a Pattrn data package and use it for a Pattrn instance
* how to :ref:`prepare and publish <developing-a-pattern-data-package>`
  a Pattrn data package

Prerequisites
-------------

In order to *install* Pattrn data packages you will need to use
`Git <https://git-scm.com/>`_ and `Node.js <https://nodejs.org/>`_
(version 6.10 or later) on your computer. You will therefore
need at least basic proficiency with the command line,
although no specific experience with Git, Node.js or JavaScript in general
is required.

Additionally, in order to *create* Pattrn data packages, you will
need to be able to edit JSON files, to handle basic Git operations
(committing, branching, pushing to remotes) and to use a code collaboration
platform (we recommend hosting Pattrn data packages on
`GitLab <https://gitlab.com/>`_).

.. _installing-a-pattern-data-package:

Installing a Pattrn data package
--------------------------------

In order to use Pattrn with a Pattrn data package, we will be
*building* the Pattrn app from its source code (rather than
directly using the pre-built app as you may have done if you
followed the :ref:`getting-started` tutorial in this manual).

In order to build Pattrn from source, you will need Node.js (the current
`Node.js LTS release <https://nodejs.org/en/download/>`_ is recommended as
Pattrn is mainly developed and tested with this version) and
`Yarn <https://yarnpkg.com/en/docs/install>`_.

Firstly, *clone* the current version of Pattrn from the master
GitHub repository::

    git clone https://github.com/pattrn-project/pattrn.git

Then enter the ``pattrn`` source folder::

    cd pattrn

To configure Pattrn to use and bundle a Pattrn data package, create
a file named ``source-data-packages.json`` within the
``pattrn`` folder. Its content should follow the example below
(just replace the URI of the sample data package with the URI of
your own data package in the ``source`` setting)::

    {
        "source_data_packages": [
            {
                "package": "pattrn-data-where-the-drones-strike",
                "source": "https://gitlab.com/pattrn-data/pattrn-data-where-the-drones-strike.git#pattrn-data"
            }
        ]
    }

The ``source_data_packages`` configuration object is an array, although
the current version of Pattrn only supports a single data package for
each Pattrn instance.

The ``package`` setting needs to match the NPM package name as defined
within the NPM package's own ``package.json`` file. The ``source``
setting matches the syntax of NPM's ``dependencies``
`configuration section <https://docs.npmjs.com/files/package.json#dependencies>`_.
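
As an illustration of the ``source`` syntax: in NPM's Git URL format, the
part after the ``#`` character is a Git *commit-ish*, that is, a branch
name, a tag or a commit hash. The sketch below assumes that the data package
repository has a tag called ``v1.0.0`` (a hypothetical tag, used here only
as an example), so that the Pattrn instance would be built against that
specific release of the dataset rather than against the latest commit of
a branch::

    {
        "source_data_packages": [
            {
                "package": "pattrn-data-where-the-drones-strike",
                "source": "https://gitlab.com/pattrn-data/pattrn-data-where-the-drones-strike.git#v1.0.0"
            }
        ]
    }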

Once the ``source-data-packages.json`` file has been created with suitable
content, building the Pattrn app will automatically retrieve the configured
Pattrn data package and bundle it with the Pattrn app::

    yarn install && yarn run gulp build

If you run into a build error related to the ``node-sass`` package, running
``yarn install --force && yarn run gulp build`` should fix the issue (see
https://github.com/sass/node-sass/issues/1579#issuecomment-227663782 for
details).

As part of the Pattrn build process, the Gulp build scripts will install
the Pattrn data package configured in the ``source-data-packages.json``
file **and copy all the content of its** ``pattrn-data`` **folder to the**
``dist`` **folder where the Pattrn app gets built**: effectively, the
``pattrn-data`` folder of the Pattrn data package gets *merged* with
the content of the Pattrn app.

If the app is built correctly, you will find your Pattrn app bundled with
your dataset inside of the ``dist`` folder. You can now publish this folder
(for example on Netlify, as illustrated in the :ref:`getting-started`
tutorial), or run the app locally in order to check that everything is
working as expected::

    yarn start

(The URI at which the app can be accessed is displayed as part of Yarn's
output for the command above.)

When developing a new Pattrn data package (see section below), it may
be useful to reference a *local* repository in the
``source-data-packages.json`` file, rather than a web URI, so that
any changes made to the data package during the development process
can be reflected almost immediately in the local Pattrn app (just
run ``yarn build && yarn start`` to merge the latest local copy of
the Pattrn data package into the development copy of Pattrn).
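
A minimal sketch of such a local reference follows. It assumes that the
``source`` value is handed over unchanged to the package manager, so that
NPM's ``file:`` prefix for local paths can be used; the relative path is
hypothetical and should point to wherever the project folder of your data
package lives on your computer::

    {
        "source_data_packages": [
            {
                "package": "pattrn-data-where-the-drones-strike",
                "source": "file:../pattrn-data-where-the-drones-strike"
            }
        ]
    }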

.. _developing-a-pattern-data-package:

Developing a Pattrn data package
--------------------------------

If you have a dataset in GeoJSON format, creating a Pattrn data package
for your own Pattrn instances or to share with other Pattrn users is easy.

The process involves:

* creating a project folder with a simple subfolder structure
* placing your GeoJSON data file alongside Pattrn's metadata,
  settings and core config files at specific locations of the
  project's folder structure
* creating a ``package.json`` file to turn the folder into an
  NPM package

Let's now go over the process in detail.

Before starting to package your data, you will want to make sure that the
GeoJSON dataset you wish to package is ready to be used with Pattrn (see
the section of this manual about managing :ref:`managing-data-geojson`
for all the relevant details).

To create your Pattrn data package, first create a project folder for
it. The name of the folder is not relevant, but you will normally want
it to mirror your NPM package name; to let users easily recognise
an NPM package as a Pattrn data package, we recommend using the
``pattrn-data-`` prefix. For example, for the sample dataset used in the
:ref:`getting-started` tutorial in this manual, we would create and enter
the project folder as follows::

    mkdir pattrn-data-where-the-drones-strike && cd "$_"

Initializing the NPM package
............................

Within the project folder, run the ``npm init`` command: this
will ask a few questions (providing sensible defaults) and then
create a ``package.json`` file that turns your project into an NPM
package.

You may customise any of the settings while running the ``npm init``
command or by editing the generated ``package.json`` file afterwards.
We recommend that you:

* set the ``name`` according to the suggestion in the previous section
* use `semantic versioning <http://semver.org/>`_ to manage package versions
* set a brief ``description`` of the dataset
* include ``pattrn-data`` amongst any ``keywords`` for the NPM package
* set the ``private`` field to ``true``, at least initially, to avoid
  accidentally publishing the package to NPM if it is not intended for
  public use or has not yet been fully checked.

You will need to choose a license for your package. If you are simply
packaging a dataset provided by a third party, make sure to check their
terms of use and licenses and to comply with these. Your own packaging
work and any scripts to clean up data should be distributed under
a license you choose (we recommend the GNU GPL v3 or later).
When doing so, you **must** indicate this clearly by creating a
text file (e.g. ``LICENSE.txt``) with details about the license of each
component of the Pattrn data package, and by setting the
``license`` field of the ``package.json`` file to
``SEE LICENSE IN LICENSE.txt`` (or the name of your file with license
details), as recommended in the
`NPM documentation <https://docs.npmjs.com/files/package.json#license>`_
for the ``license`` field.
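
Putting the recommendations above together, a ``package.json`` file for
a Pattrn data package might look like the following sketch. The version
number and the description are illustrative assumptions, not the content of
any published package::

    {
        "name": "pattrn-data-where-the-drones-strike",
        "version": "0.1.0",
        "description": "Pattrn data package for the sample dataset of the getting started tutorial",
        "keywords": [
            "pattrn-data"
        ],
        "private": true,
        "license": "SEE LICENSE IN LICENSE.txt"
    }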

Adding data and metadata to the package
.......................................

You can put any content inside the project folder (for example, the
raw source data and any R or Python scripts to clean it up and export
it to GeoJSON format): when installing a Pattrn data package (see
previous section), the build script will always look for a folder
named ``pattrn-data``, ignoring all the rest of the project folder's
contents.

*Advanced users will likely want to avoid putting content directly
into the* ``pattrn-data`` *folder, opting instead to create its
content via a build pipeline (e.g. a Makefile running some R scripts);
for this tutorial we will simply create the* ``pattrn-data`` *folder and its
contents directly. For an example of a Pattrn data package that uses
a build pipeline, see the* `full source code of the sample data package
<https://gitlab.com/pattrn-data/pattrn-data-where-the-drones-strike>`_
*used in the* :ref:`getting-started` *tutorial.*

Within the project folder, create the following subfolders and (empty,
for the moment) plain text files::

    /pattrn-data-project/
      /config.json
      /data/
         /metadata.json
         /settings.json
         /data.geojson

If you have been working through the :ref:`getting-started` tutorial,
you will recognise the folder structure and the files listed above:

* ``config.json`` is the core Pattrn config file, which instructs
  the Pattrn code on where to find the data, settings and metadata
  files (when using a GeoJSON data source)
* ``metadata.json`` (or ``metadata.yaml``, if you prefer to use the
  YAML syntax) is the file that describes the variables in the
  instance's dataset
* ``settings.json`` is the file with the instance's settings
* ``data.geojson`` is your data file

Once all the files are in place with their full content (see the
:ref:`getting-started` tutorial and the
:ref:`managing-data-geojson` part of the :ref:`managing-data`
section of this manual for details), you can test the Pattrn data
package by configuring it to be used in a Pattrn instance (see
the :ref:`previous section <installing-a-pattern-data-package>`).

Once you are happy with the content of your Pattrn data package,
you can publish it by committing it into Git and pushing it to
a code collaboration platform such as `GitLab <https://gitlab.com>`_.

To commit your work into Git, run the following from the root of your project folder::

  git init && git add . && git commit

You can then create a new project on GitLab (or any other code
collaboration platform you wish to use), and configure it as
a remote for your Git repository::

  git remote add origin <URI-of-the-remote-git-repository>
  git push origin master

You (and other users, if your repository is publicly accessible) will
now be able to configure the remote URI of the repository in the
``source-data-packages.json`` file of a Pattrn instance and let the
Pattrn build script retrieve and bundle it with a new Pattrn instance.

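You may also want to publish the package to NPM: refer to the `relevant
documentation on docs.npmjs.com <https://docs.npmjs.com/getting-started/publishing-npm-packages>`_
for this. Note that NPM refuses to publish a package whose ``private``
field is set to ``true``, so remove that setting first if you enabled it
earlier.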