Understanding CDR aggregates: Fundamentals
CDR aggregates can be used as the building blocks for developing a range of mobility indicators that can be leveraged for supporting low- and middle-income countries in understanding their population's mobility patterns.
This section is mainly targeted at those who will be using the aggregates for analysis or modelling. We provide the details of how the aggregates are defined and calculated, since this may be important for ensuring the correct interpretation of your own outputs.
Basics of CDR aggregates
Call Detail Records (CDR)
For billing purposes, mobile network operators (MNOs) keep a record of subscribers’ activities in a database. These records are generated each time a subscriber makes or receives a call, sends or receives an SMS, or uses mobile data on their phone. These are what we call Call Detail Records (CDR). A CDR contains information about the origin, destination and duration of a call/text/data session, as well as the ID of the cell tower routing the call. From that dataset, we can tell the approximate location of a subscriber at the time of each event, based on the location of the tower that routed it. In this case, we say that we ‘recorded’ the subscriber as being at that location at that time.
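As a minimal sketch of how a single record yields a 'recorded' location, the example below joins one event to a tower lookup table. The field names, the tower IDs and the coordinates are hypothetical; real CDR schemas differ between operators.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CdrEvent:
    # Hypothetical, simplified fields; real CDR schemas vary by operator.
    subscriber_id: str   # pseudonymised subscriber identifier
    timestamp: datetime  # time of the call, SMS or data session
    tower_id: str        # ID of the cell tower that routed the event


# Hypothetical lookup table: tower ID -> (longitude, latitude)
TOWER_LOCATIONS = {"T001": (36.82, -1.29), "T002": (36.90, -1.20)}


def sighting(event, tower_locations):
    """Return (subscriber, time, approximate location) for one CDR event."""
    return event.subscriber_id, event.timestamp, tower_locations[event.tower_id]


event = CdrEvent("sub_a", datetime(2020, 4, 1, 8, 30), "T001")
record = sighting(event, TOWER_LOCATIONS)
```

The location returned is that of the tower, not of the handset, so it is only an approximation of where the subscriber was.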
Objectives of aggregates
The aggregates have been selected to represent all dimensions of mobility and to meet the following three criteria:
they are fast and easy to compute for MNOs with limited resources;
they are fully anonymous and contain no information about individual subscribers; and
they are robust to infrequent phone usage.
The aggregates have been developed with the assumption that most subscribers will not have a record associated with them every day, or even once every few days. This is especially common in low- and middle-income countries.
Privacy considerations
CDR aggregates do not expose any information about individual subscribers and cannot be used to re-identify an individual. In line with international standards, aggregates are only produced for groups of at least 15 subscribers: for example, when the number of active subscribers in a location is fewer than 15, that count is not included in the aggregate. From the perspective of personal privacy, the data can therefore be shared with third parties. These parties include epidemiologists who may use the aggregates, in combination with case data, to assess whether mobility changes are having an effect on the spread of the disease, and to predict the evolution of the spread.
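The suppression rule described above can be sketched as a simple filter. This is an illustration only: the function name, the region labels and the counts are made up, and only the threshold of 15 comes from the text.

```python
MIN_SUBSCRIBERS = 15  # privacy threshold stated in the text


def suppress_small_counts(counts, threshold=MIN_SUBSCRIBERS):
    """Drop every count below the threshold before the aggregate is shared."""
    return {region: n for region, n in counts.items() if n >= threshold}


# Hypothetical counts of active subscribers per region
raw_counts = {"region_a": 1204, "region_b": 15, "region_c": 9}
published = suppress_small_counts(raw_counts)
```

Here `region_c` is removed entirely rather than being reported as zero, so a missing region in the output means "below threshold", not "no subscribers".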
Spatial and temporal resolutions of the CDR aggregates
How CDR aggregates are broken down in time (temporal resolution) and space (spatial resolution) is key to which types of questions they can inform.
Temporal resolution ranges from weeks down to 15-minute intervals, and the choice of interval depends on the indicator required. Spatial resolution is often based on administrative boundary datasets, from sub-county level (level 4) up to state level (level 1), although clusters of towers can be used for fine-scale analyses (especially relevant in urban areas).
The maximum spatial resolution that can reasonably be achieved is typically dependent on the density of cell towers within the studied region. In regions where there are many cell towers (e.g. highly urbanised regions), it is possible to divide up the region into multiple sub-regions by clustering towers, each cluster containing several cell towers. Each sub-region is therefore likely to have a statistically significant number of data records associated with it. However, in regions with a low number of cell towers (e.g. rural regions), it is usually not possible to divide up the region and obtain sufficient data in each sub-region.
Whilst it might seem preferable to produce CDR aggregates at the finest spatial and temporal resolution, this may not always be beneficial. As the resolution is increased, a larger proportion of counts will fall beneath the 15-subscriber threshold required for data privacy (and statistical significance), resulting in more data being removed from the outputs. We recommend selecting the coarsest spatial and temporal resolution that will meet the requirements of the use case. This will ensure that statistically significant and privacy-preserving outputs are available for the maximum number of spatial regions and time periods.
Computing CDR aggregates over multiple time periods and region sizes is essential to obtain the full range of mobility indicators.
In several of the aggregates for which we describe the methods, we recommend that, where possible, the aggregates should be calculated over different region sizes and time intervals.
Some aggregates are ‘additive’ in the sense that they can be calculated for the smallest region size, or time period, and then summed to compute the value for a larger region size or time period. However, any aggregate that counts the number of unique subscribers that have e.g. visited a certain region is not additive.
For example, Region A may be composed of smaller sub-regions a1, a2, and a3. We can count the number of unique subscribers that visited a1, a2, and a3. But because some subscribers may have visited both a1 and a2, we cannot simply sum the number of subscribers that visited each sub-region to obtain the number of unique subscribers that visited Region A (because subscribers will be counted multiple times if they visited multiple sub-regions). A similar reasoning applies to time periods.
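A small sketch with made-up subscriber IDs shows why unique counts are not additive. Note that the operator would compute this from subscriber-level data; the sets below are never part of a shared aggregate.

```python
# Hypothetical sets of subscriber IDs seen in each sub-region of Region A
visitors = {
    "a1": {"s1", "s2", "s3"},
    "a2": {"s2", "s3", "s4"},
    "a3": {"s5"},
}

# Summing per-sub-region counts counts s2 and s3 twice
naive_sum = sum(len(ids) for ids in visitors.values())

# The correct count takes the union of the sets before counting
unique_in_region_a = len(set().union(*visitors.values()))
```

In this example the naive sum is 7 while the number of unique visitors to Region A is 5, which is why unique-subscriber aggregates have to be computed separately at each resolution.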
Combining aggregates produced at different resolutions can also be used to build indicators of population mixing (e.g. comparing the number of unique visitors to a region for each day and for a week) and of intra-regional travel (e.g. comparing the number of unique visitors to each a3 in an a2 to the number of unique visitors in the a2, to get the average number of a3 regions visited per visitor of the a2 region).
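The intra-regional travel indicator can be sketched from aggregate counts alone. The function name, region labels and numbers below are hypothetical; the calculation is the ratio described in the text.

```python
def mean_subregions_visited(subregion_unique_visitors, parent_unique_visitors):
    """Average number of sub-regions visited per visitor of the parent region.

    Each visitor of the parent is counted once in the denominator, and once
    per sub-region they visited in the numerator.
    """
    return sum(subregion_unique_visitors.values()) / parent_unique_visitors


# Hypothetical unique-visitor counts for three a3 regions inside one a2 region
a3_counts = {"a3_x": 40, "a3_y": 25, "a3_z": 35}
a2_count = 80  # unique visitors to the a2 region over the same period

indicator = mean_subregions_visited(a3_counts, a2_count)
```

With these numbers the indicator is 1.25. A value of 1 would mean that no visitor was seen in more than one a3 region during the period. If any sub-region count has been suppressed for privacy, the ratio will be an underestimate.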
Defining "home locations"
Several of the aggregates are based on the concept of defining a "home location" for each subscriber. We define this to be a ‘reference location’: the location where the subscriber most frequently used their phone for the last time each day over a four-week period. It is updated every week. Once calculated, it can be used to produce mobility indicators and understand mobility patterns with respect to each subscriber’s reference location (e.g. whether they are in their home region or visiting another region).
We define the 'residents' of a cluster of towers or region as the subscribers who have been assigned that cluster or region as their home location.
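The home location definition above can be sketched for a single subscriber as follows. The function name and the example events are hypothetical, and the handling of ties (here left to `Counter.most_common`) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter
from datetime import datetime, timedelta


def home_location(events, window_end, window_days=28):
    """Most frequent last-location-of-day over the window, or None.

    events: iterable of (timestamp, location) pairs for ONE subscriber.
    """
    window_start = window_end - timedelta(days=window_days)
    last_of_day = {}  # date -> (timestamp, location) of that day's last event
    for ts, loc in events:
        if window_start <= ts < window_end:
            day = ts.date()
            if day not in last_of_day or ts > last_of_day[day][0]:
                last_of_day[day] = (ts, loc)
    if not last_of_day:
        return None  # subscriber had no events in the window
    counts = Counter(loc for _, loc in last_of_day.values())
    return counts.most_common(1)[0][0]  # tie-breaking is an assumption


events = [
    (datetime(2020, 4, 1, 9, 0), "region_work"),
    (datetime(2020, 4, 1, 21, 0), "region_home"),
    (datetime(2020, 4, 2, 20, 0), "region_home"),
    (datetime(2020, 4, 3, 23, 0), "region_friend"),
]
home = home_location(events, window_end=datetime(2020, 4, 8))
```

Only days with at least one event contribute, so the approach still works for subscribers who use their phone infrequently. Re-running it each week with a later `window_end` gives the weekly update.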
Geospatial / GIS methods
In this section, you will learn how to assign each cell tower to an administrative unit, and how to group nearby cell towers to create smaller customisable regions.
Assigning cell towers to administrative units
Cell towers are assigned to a geospatial polygon (region) by mapping the point location of the cell tower into a polygon. When a subscriber uses their phone to make or receive a call or SMS, or to use mobile data, that transaction is routed through a cell tower - usually the closest one to the subscriber. That cell tower is recorded in the CDR data, and the subscriber is assumed to have been located within the corresponding polygon at the time of the transaction.
The assignment can be performed using a number of tools. Some examples are given below, and more will be added.
QGIS: Use the coordinates of the cell tower locations and a polygon shapefile by applying the Join Attributes by Location tool. This will produce a new cell tower output, which will include the existing cell tower information and new information about which polygon each tower falls inside.
Database implementations: for example, the PostGIS extension for PostgreSQL databases.
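Whichever tool is used, the underlying operation is a point-in-polygon test. The sketch below is a plain-Python illustration using the ray-casting method on simple polygons; the tower IDs, unit names and coordinates are made up. The GIS tools listed above additionally handle holes, multi-part polygons and coordinate reference systems, which this sketch does not.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray going right from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_towers(towers, regions):
    """Map each tower ID to the first region containing it (None if none)."""
    assignment = {}
    for tower_id, (x, y) in towers.items():
        assignment[tower_id] = next(
            (rid for rid, poly in regions.items() if point_in_polygon(x, y, poly)),
            None,
        )
    return assignment


# Hypothetical administrative units (squares) and tower point locations
regions = {
    "unit_1": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "unit_2": [(10, 0), (20, 0), (20, 10), (10, 10)],
}
towers = {"T001": (2.5, 3.0), "T002": (15.0, 5.0), "T003": (30.0, 5.0)}
assignment = assign_towers(towers, regions)
```

Towers that fall outside every polygon (such as `T003` here) are mapped to `None` and would need to be checked by hand, for example for coordinate errors.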
We recommend assigning towers to level 4 administrative units if that is possible for the mobile operator (giving an assignment also to levels 3, 2, and 1). Shapefiles for administrative boundaries can be found from , Common Operational Datasets with administrative boundaries stored on HDX.
There are many more sophisticated ways of assigning subscribers into polygons (e.g. using directional coverage information about each cell, if available). However, for the sake of efficiency and simplicity, and because the required information may not be available from all mobile operators, we do not recommend prioritising this at the current time.
Grouping nearby towers to obtain smaller regions
Using individual cell towers as a proxy for location is not always the best approach. This is because:
Some towers are only a few metres apart, and so counting the number of unique users seen at each tower is less relevant than counting the number of unique users seen at a grouping of towers, and
There are usually too many towers in a country to compute trips between each possible pair (since the number of pairs scales as n², where n is the number of towers).
Instead, we suggest clustering towers using hierarchical clustering with Ward's Method and a threshold of 1 km. This can be implemented via e.g. the fclusterdata function in SciPy (Python), or clusterdata in MATLAB. Then compute the centroid of each tower cluster, and assign each cluster to an administrative unit based on the location of the cluster’s centroid. This will result in a grouping at the sub-admin 4 level (mostly in dense urban environments only), and vastly reduce the number of false ‘interregional trips’ which arise when nearby towers are in two different administrative units.
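A minimal sketch of this clustering step with SciPy is shown below. The tower coordinates are made up and are assumed to be already projected to metres (e.g. in a UTM zone), so that the threshold of 1000 corresponds to 1 km. Applying the threshold through `criterion="distance"` is our reading of the text rather than a confirmed setting.

```python
import numpy as np
from scipy.cluster.hierarchy import fclusterdata

# Hypothetical tower coordinates, assumed to be projected to metres
tower_xy = np.array([
    [0.0, 0.0],
    [50.0, 0.0],        # 50 m from the first tower
    [5000.0, 5000.0],
    [5100.0, 5000.0],   # 100 m from the third tower
    [20000.0, 0.0],     # isolated tower
])

# Ward linkage, cutting the tree at a linkage distance of 1000 (1 km)
labels = fclusterdata(
    tower_xy, t=1000.0, criterion="distance", metric="euclidean", method="ward"
)

# Centroid of each cluster, used to assign the cluster to an admin unit
centroids = {
    int(label): tower_xy[labels == label].mean(axis=0)
    for label in np.unique(labels)
}
```

Here the two pairs of nearby towers are merged and the isolated tower forms its own cluster, giving three clusters. Note that the Ward linkage distance equals the plain distance between towers only when two single towers are merged; for larger clusters it grows with cluster size, so clusters can end up somewhat smaller than 1 km across.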