COG-UK Docs


Updated 2021-04-23 by @viralverity

Geography cleaning

This document describes the steps in the script that cleans geographical metadata provided by COG members to produce computer- and human-readable outputs, along with some suggestions to make further analysis easier.

It takes the submitted sequence metadata as an input, and attempts to find the highest resolution geographical data available.

The adm2s are cleaned to match those found in the Database of Global Administrative Areas (GADM, gadm.org).

In all cases, pipes (“|”) are used to denote ambiguity. Correct adm2s separated by pipes (i.e. where the ambiguity is known on submission) will be accepted as inputs.
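As a rough illustration of the pipe convention, the sketch below splits an adm2 value into its candidate locations and joins candidates back together. The function names (`split_adm2`, `is_ambiguous`, `join_adm2`) are hypothetical and are not taken from the cleaning script. Sorting and de-duplicating on join is an assumption made here for tidiness.

```python
def split_adm2(value):
    """Split a possibly ambiguous adm2 string into its candidate locations."""
    parts = [part.strip() for part in value.split("|")]
    # Drop empty pieces left by stray or trailing pipes.
    return [part for part in parts if part]


def is_ambiguous(value):
    """An adm2 is treated as ambiguous if it names more than one location."""
    return len(split_adm2(value)) > 1


def join_adm2(candidates):
    """Join candidate locations with pipes (sorted and de-duplicated: an assumption)."""
    return "|".join(sorted(set(candidates)))


print(split_adm2("EAST_SUSSEX|WEST_SUSSEX"))
print(is_ambiguous("EAST_SUSSEX|WEST_SUSSEX"))
print(join_adm2(["WEST_SUSSEX", "EAST_SUSSEX"]))
```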

This script will also accept valid NUTS1 regions as inputs. These are defined at the bottom of the page, along with their constituent adm2s.

Columns in output:

Adm2 processing:

There are multiple ways that the final adm2 designation is achieved, and this is denoted in the “source” column:

Location:

This is designed to be as useful as possible for humans. The adm2 processing often results in long, ambiguous strings joined by pipe symbols, which, while important for any analysis or mapping, can be tricky for people to interpret quickly.

There are a few different ways the location field is used:

It will show the highest resolution (up to adm2) data available, going from adm1 → NUTS1 → adm2.

Or it may show a grouping of adm2s that is more commonly used than the adm2s themselves, avoiding long and ambiguous adm2 strings. These groupings can be found at the bottom of this page.
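The two behaviours above could be sketched roughly as follows. This is only an illustration: `choose_location` and its arguments are hypothetical names, and the rule that a grouping always takes precedence when one is supplied is an assumption, since the exact condition used by the script is not described here.

```python
def choose_location(adm1="", nuts1="", adm2="", grouping=""):
    """Return the most specific non-empty value for the location field.

    Resolution increases from adm1 to NUTS1 to adm2. `grouping` stands for a
    commonly used grouping of adm2s; preferring it whenever it is present is
    an assumption of this sketch.
    """
    for candidate in (grouping, adm2, nuts1, adm1):
        if candidate:
            return candidate
    return ""


# Illustrative placeholder values, not real script output.
print(choose_location(adm1="ADM1_VALUE"))
print(choose_location(adm1="ADM1_VALUE", nuts1="NUTS1_VALUE"))
print(choose_location(adm1="ADM1_VALUE", nuts1="NUTS1_VALUE", adm2="ADM2_VALUE"))
```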

Suggested adm2 groupings:

This is a field based on experience of running geographical analyses using the genome data in the UK. It is designed to try and keep all of the geographical genome data as accurate as possible. It is a combination of two things:

The file that defines this is “adm2_aggregation.csv” in the “geography_utils” folder.

Safe location:

While adm2-level data isn’t usually viewed as identifying, it may be if data is sparse. If there are fewer than five sequences in an epiweek in the sequence’s adm2, then the aggregated adm2 is checked. If there are still fewer than five sequences in this grouping, then the NUTS1 region is given. If the adm2 or grouping is ambiguous (i.e. it contains a “|”), then the counts are combined. E.g. if the adm2 is “EAST_SUSSEX|WEST_SUSSEX”, then there must be at least five sequences across the two locations combined, or the aggregated adm2 will be provided instead.
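The fallback described above can be sketched as below. This is a minimal illustration, not the script's implementation. The names `safe_location` and `combined_count`, the layout of the count tables (a `Counter` keyed by `(epiweek, location)`), and the example grouping `SUSSEX` and region `SOUTH_EAST` are all assumptions made for the example.

```python
from collections import Counter

THRESHOLD = 5  # minimum number of sequences per epiweek, as described above


def combined_count(location, epiweek, counts):
    """Sum the counts of every location named in a possibly ambiguous string."""
    return sum(counts[(epiweek, part)] for part in location.split("|"))


def safe_location(adm2, aggregated_adm2, nuts1, epiweek, adm2_counts, aggregated_counts):
    """Return the finest level with at least THRESHOLD sequences in the epiweek."""
    if combined_count(adm2, epiweek, adm2_counts) >= THRESHOLD:
        return adm2
    if combined_count(aggregated_adm2, epiweek, aggregated_counts) >= THRESHOLD:
        return aggregated_adm2
    return nuts1


# Hypothetical counts for a single epiweek (week 10).
adm2_counts = Counter({(10, "EAST_SUSSEX"): 2, (10, "WEST_SUSSEX"): 3, (10, "KENT"): 1})
aggregated_counts = Counter({(10, "SUSSEX"): 5, (10, "KENT"): 1})

print(safe_location("EAST_SUSSEX|WEST_SUSSEX", "SUSSEX", "SOUTH_EAST", 10,
                    adm2_counts, aggregated_counts))
print(safe_location("EAST_SUSSEX", "SUSSEX", "SOUTH_EAST", 10,
                    adm2_counts, aggregated_counts))
print(safe_location("KENT", "KENT", "SOUTH_EAST", 10,
                    adm2_counts, aggregated_counts))
```

A `Counter` is used so that a location with no sequences in a given epiweek counts as zero rather than raising an error.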

Files in geography_utilities folder:

Groupings:

NUTS1 regions:


Published 2021-04-23. Updated 2021-04-23. Page maintainer @viralverity.