Guide to understanding reclin
⇦ Back to workshops

Introduction

This document describes the key functions within the reclin package and assists the user with understanding how to use the reclin package to perform record linkage with large data sets.

Import Sample Data

First, we read in the sample data sets for linkage

x <- read.csv("test data/masterdata.csv", stringsAsFactors = F)
y <- read.csv("test data/errordata.csv", stringsAsFactors = F)

Here’s a look at the sample data set we will use in this guide:

head(x)
  first_name        last_name gender birth_date        ethnicity       SSN med_number          state
1     Nathan          Cordova female 11/09/1946            Asian 255383175 6358029309         Nevada
2       Luke            Quick female  5/28/1952            Asian 125087187 6427424655   Pennsylvania
3       Jodi          Baldino female  1/22/1971            Asian 516395786 5981030723        Georgia
4      Guled             Shen   male  4/15/1948 Pacific Islander 143355093 7285462698     New Jersey
5  Viarlenny Picazzo Banuelos   male 10/08/1941            Black 264896375 2283441837 South Carolina
6      Alyce        Dominguez female 07/07/2005 Pacific Islander 574179851 3540154617         Kansas
          city             address   zip  id
1  Falanolzace   899 Casjole Grove 68085 848
2  Gitopupriwa  499 Pihtowi Center 86488 157
3  Gupoowekuro     196 Pipafof Way 65928 980
4   Fawwelcoja 682 Purima Junction 14597 225
5    Hovruhmij  794 Ruwatibi Plaza 15605 598
6 Retuarekzani   1315 Dawoobot Way 46591 927

Reclin functions work with data sets in ldat format. ldat is a package that makes it easy to work with large datasets in R because it is able to process data by chunk in memory.

x <- as.ldat(x)
y <- as.ldat(y)

ldat objects resemble a list object, except that they are stored as external pointers.

str(x)
List of 12
 $ first_name:Class 'lvec' <externalptr>
  ..- attr(*, "type")= int 3
  ..- attr(*, "attributes")=<environment: 0x7fb64d8d9800>
 $ last_name :Class 'lvec' <externalptr>
  ..- attr(*, "type")= int 3
  ..- attr(*, "attributes")=<environment: 0x7fb64d8ec678>
 $ gender    :Class 'lvec' <externalptr>
  ..- attr(*, "type")= int 3
  ..- attr(*, "attributes")=<environment: 0x7fb64d8f8540>
 $ birth_date:Class 'lvec' <externalptr>
  ..- attr(*, "type")= int 3
  ..- attr(*, "attributes")=<environment: 0x7fb64d9029c0>
 $ ethnicity :Class 'lvec' <externalptr>
  ..- attr(*, "type")= int 3
  ..- attr(*, "attributes")=<environment: 0x7fb64d90daa0>
 $ SSN       :Class 'lvec' <externalptr>
  ..- attr(*, "type")= int 1
  ..- attr(*, "attributes")=<environment: 0x7fb64d9188e0>
 $ med_number:Class 'lvec' <externalptr>
  ..- attr(*, "type")= int 0
  ..- attr(*, "attributes")=<environment: 0x7fb64d923000>
 $ state     :Class 'lvec' <externalptr>
  ..- attr(*, "type")= int 3
  ..- attr(*, "attributes")=<environment: 0x7fb64d92e2a0>
 $ city      :Class 'lvec' <externalptr>
  ..- attr(*, "type")= int 3
  ..- attr(*, "attributes")=<environment: 0x7fb64d939380>
 $ address   :Class 'lvec' <externalptr>
  ..- attr(*, "type")= int 3
  ..- attr(*, "attributes")=<environment: 0x7fb64d944310>
 $ zip       :Class 'lvec' <externalptr>
  ..- attr(*, "type")= int 1
  ..- attr(*, "attributes")=<environment: 0x7fb64d95ff58>
 $ id        :Class 'lvec' <externalptr>
  ..- attr(*, "type")= int 1
  ..- attr(*, "attributes")=<environment: 0x7fb64d96b498>
 - attr(*, "class")= chr "ldat"

From here on, we will refer to the master data set as x and the batch data set as y.

Record Linkage

Generate pairs

The first step to record linkage is to generate the list of record pairs to be compared and tested for linkage. In reclin, this can be done using the simple_blocking function. This function uses the function gen_pairs internally to create a ldat object with two columns, which indicate the index of the record in x and the index of of the record in y that comprise of each pair.

If blocking is implemented, only pairs that adhere to the blocking criteria will be generated.

How gen_pairs work:

  1. if no blocking variable is provided, the function simply returns the indices of the record pairs based on all combinations. For example, if x contains 1000 records and y contains 500 records, gen_pairs will return 500,000 pairs to be compared. This is done through an expand.grid function.
  • Here’s an example of expand.grid; the output numbers in the example are equivalent to row indices for x and y:
expand.grid(c(1,2,3), c(5,6,7,8))
   Var1 Var2
1     1    5
2     2    5
3     3    5
4     1    6
5     2    6
6     3    6
7     1    7
8     2    7
9     3    7
10    1    8
11    2    8
12    3    8
  1. if blocking variables are provided, the data sets are first ordered by the blocking variables. This facilitates the rle.lvec function to create a data.frame with one column representing the values of the blocking variable and another representing the number of times that value is represented (the count).
  • For each value of the blocking variable, such as “Arizona” if blocking on state, rle.lvec might tell us that 6 records in x and 3 records in y have “Arizona” as their state. We would expect 18 record pairs to be generated to compare all combinations of the 6 x records and 3 y records.

  • Note: Records with NA values in the blocking variables are excluded.

  • Note: If the total number of pairs for one blocking variable exceed \(2147483647\), an error will occur because it’s higher than the machine maximum for integer value.

  • Note: If characters are lower-cased as well as upper-cased, the blocking may be wrong. This is because the order function in rle.lvec puts strings that start with a lower case letter after all strings that start with upper cased letters; however, when merging x value counts and y value counts together, the function sort is used which correctly orders strings regardless of case. The two orders are used in conjunction to get record pairs, which may be disasterous. This issue is addressed in the link function which calls gen_pairs internally by converting all strings to lower case. However, if weird things are happening with blocked record pairs, it may be worthwile to explore the ordering performed by these two functions.

  1. For each blocking value with a non-zero count, the function extracts the indices of x and y and create the pairs result object.

How simple_blocking works:

The function facilitates blocking and calls gen_pairs internally. block_x and block_y parameters take in character vectors of variable names: the names provided in block_x will be used to match variables in x to variables in y which is based on block_y. The corresponding variables from x and y must be positionally matched, regardless of the actual variable name.

The function calls gen_pairs for each blocking variable and appends the results together. If blocktype = "or", the blocking is based on agreement on one or more block variables. This means that providing more blocking variables relaxes the criteria and produces more pairs; any pair that match on at least one of the provided blocking variables will be included. If blocktype = "and", all blocking variables must agree for a record pair to be included.

The function returns an ordered object of all unique record pairs by indices.

We can see an example of the blocking here:

rpairs <- simple_blocking(x, y, block_x = c("state"), block_y = c("state"))
head(rpairs)
ldat with 3096 rows and 2 columns.
Printing first 20 rows:
   x   y
1  1  14
2  1  72
3  1 101
4  1 112
5  1 273
6  2  73
7  2 156
8  2 166
9  2 177
10 2 234
11 2 298
12 3   4
13 3 151
14 3 164
15 3 192
16 3 222
17 3 233
18 4  74
19 4  86
20 4  87

Here we can see the reduction in the universe of comparison pairs due to blocking:

total.possible.pairs <- nrow(x)*nrow(y); total.possible.pairs
[1] 149500
blocking.pairs <- nrow(rpairs); blocking.pairs
[1] 3096

Estimate m- and u-probabilities

Before probabilistic weights can be calculated, we must first estimate m (probability of true link on agreement), u (probability of true non-link on agreement), and p (percentage of true links). This is done through the simple_em function, which uses the EM algorithm.

Currently this algorithm does not directly support similarity scores, so we need to provide threshold levels to convert any such scores into binary scores (0 or 1). If the thresholds are not provided, it defaults to 0.999. The implication here is that if string distance algorithms are used but no corresponding thresholds are provided, we would not be able to get the desirable effect in the model.

argument to simple_em

comparator.threshold <- c(SSN = 0.77,
                          med_number = 0.799,
                          city = 0.799,
                          address = 0.86,
                          zip = 0.799)

Note: The comparator.threshold follows the same rules as the string distance functions mentioned above (for comparator.funcs)

simple_em calls the function count_patterns internally to turn all weight calculations into binary values based on provided thresholds.

count_patterns will also keep the unique comparison patterns and summarise by the number of times it has been represented in the data set. This will be used in the EM algorithm.

Here’s an example of the output from count_patterns. As we can see, all the decimal scores from the string distance algorithms have been converted to 1s and 0s.

head(count_patterns(rpairs, threshold = comparator.threshold))
  first_name last_name gender birth_date ethnicity SSN med_number state city address zip     n
1          0         0      0          0         0   0          0     0    0       0   0 64879
2          0         0      0          0         0   0          0     0    0       0   1    29
3          0         0      0          0         0   0          0     1    0       0   0  1243
4          0         0      0          0         0   1          0     0    0       0   0     7
5          0         0      0          0         1   0          0     0    0       0   0 17136
6          0         0      0          0         1   0          0     0    0       0   1     7

Note: count_patterns assign NA values a similarity score of 0 (complete disagreement) before summarizing the comparison patterns in the record pairs. NA values arise when one or both of the records in a record pair contain a missing value for a particular variable.

The resulting object is a list with the estimated \(m-\) and \(u-probability\) for each variable:

umprob <- simple_em(rpairs, threshold = comparator.threshold, progress=F)
umprob
M- and u-probabilities estimated by the EM-algorithm:
   Variable M-probability U-probability
 first_name     0.7458324  1.038456e-03
  last_name     0.7083325  9.580597e-04
     gender     0.8833334  4.387847e-01
 birth_date     0.7249991  3.349857e-05
  ethnicity     0.7208336  2.079325e-01
        SSN     0.9791655  1.004957e-04
 med_number     0.9791665  1.339777e-05
      state     0.7416658  1.954978e-02
       city     0.8916656  6.699688e-06
    address     0.8749989  4.015786e-19
        zip     0.8999989  4.153825e-04

Matching probability: 0.001605353.

Calculate weights

The weights function takes the \(u-\) and \(m-probabilities\) and the comparison table to calculate the weight of each record pair. It calls the calc_weights function internally to perform this, and attaches the weights calculations to the link object.

Any missing values in pairs is given a weight of 0, and Inf weights are converted to 999 while -Inf weights are converted to -999. The weights calculated for each variable is summed to the total weight of the record pair. The choice of the 999 and -999 is arbitrary, this is only to prevent the situation where one variable weight is Inf and another is -Inf for the same record pair, as that would produce a NaN.

lweights <- weights(umprob, rpairs)

We can see the calculated weights:

lweights$weights
lvec of type numeric with length 149500
Printing first 20 values:
 [1]  -9.293055  -9.728563  -7.569214  -4.734396  -6.306440  -5.358929  -8.530261  -3.737784  -7.751575
[10]  -7.392298  -7.147309  -6.292688 -10.934357  -8.660368  -8.621041  -6.015274  -5.807315  -5.316380
[19]  -8.433735  -3.516654

With this method of probabilistic linkage, we must be able to select an appropriate threshold that separates links from non-links. If a record pair has a weight above the threshold, it will be classified as a link. The reclin package offers tools for manual decision making for this threshold, which is done by reviewing the record pairs around certain weight thresholds to determine if they are true links.

In order to determine a starting point for manually reviewing pairs, we can use the weight_dist function. The parameter nbreak controls how many weight brackets the pairs are divided into.

Generally, the weight distribution follows a U-shape with the majority of the record pairs falling into the lowest weight brackets. This is expected as two data sets with 1000 records each will generate 1000,000 record pairs without blocking, but will only have at most 1000 true links. There will be a number of record pairs in the highest weight brackets that are the most probable links. We want to start by examining the lowest points in this U-shape since we want to find the tipping point in the weight distribution that separates the links from the non-links. Higher nbreak is especially useful where there is a large number of record pairs as it offers more precise estimates of weight brackets with the fewest observations.

weight.distribution <- weight_dist(lweights, nbreak = 30); weight.distribution

(-15.7,-8.52] (-8.52,-1.29]  (-1.29,5.93]   (5.93,13.2]   (13.2,20.4]   (20.4,27.6]   (27.6,34.8]
        32549        115134          1572             6             1             2             3
  (34.8,42.1]   (42.1,49.3]   (49.3,56.5]   (56.5,63.7]     (63.7,71]     (71,78.2]   (78.2,85.4]
           15            11            13             4             1             0             0
  (85.4,92.6]   (92.6,99.9]    (99.9,107]     (107,114]     (114,122]     (122,129]     (129,136]
            0             0             0             0             0             0             0
    (136,143]     (143,150]     (150,158]     (158,165]     (165,172]     (172,179]     (179,187]
            0             0             0             0             0             0             0
    (187,194]     (194,201]
            0           189 

Selecting thresholds

To facilitate the manual review of pairs around a threshold, we use the dump_pairs function. For efficiency purposes, this function returns a function that contains all the data in the link input object ordered by weight.

The function returned by dump_pairs has two arguments. The first argument (threshold) is the weight threshold around which the pairs are returned. With the optional second argument n, we can specify the number of pairs above and below this threshold the should be returned.

inspect_pairs <- dump_pairs(lweights)
inspect_pairs(threshold = 10, n = 3)
-------------------------------------------------------
Pair: 149258/149500; weight: 6.318674
  first_name last_name gender birth_date ethnicity       SSN med_number     state        city
1    michael      bird   male 12/20/1932     asian 760323714 2987729994 minnesota  dohretagij
2      jacob rdoriguez   male 12/20/1932     black 233370654 7595193136  virginia inatkovpuri
             address   zip
1 1961 unabagu ridge 13769
2    348 denuufe way 15663
-------------------------------------------------------
Pair: 149259/149500; weight: 6.57134
   first_name last_name gender birth_date ethnicity       SSN med_number       state        city
2      adrian  trujillo   male 05/02/1978     black 946791066 6421451503 mississippi geszijarifu
21    edvfxer  trujillo   <NA> 02/06/1988      <NA> 533670405  698572559 mississippi   baponipan
             address   zip
2  235 esulus avenue 54759
21    75 gihgoz path 15707
-------------------------------------------------------
Pair: 149260/149500; weight: 7.890392
   first_name last_name gender birth_date       ethnicity       SSN med_number          state        city
3       aaron    palmer female 12/07/1954 native american 191978094 6196267777     new mexico zohcuhicaca
31     andrew  lattimer female 12/07/1954 native american 523968440 3957205541 south carolina  amuziksiji
                address   zip
3     719 cozapena pike 29694
31 346 tivagir junction 85854
=======================================================
Weight =  10
239  ( 0.1599 %) pairs with larger weight.
=======================================================
Pair: 149261/149500; weight: 10.60413
   first_name last_name gender birth_date        ethnicity       SSN med_number      state       city
4       kiana  camagong   male 12/24/1937 pacific islander 718001229 7619935485 california lulookasid
41      kiana  cyaagdng   male 12/26/1936            asian 178001229 7169934585 new jersey       <NA>
               address   zip
4  1622 tuzeneh street 89988
41     1987 jtzeneh ct 89688
-------------------------------------------------------
Pair: 149262/149500; weight: 16.11421
   first_name last_name gender birth_date       ethnicity       SSN med_number    state          city
5      marcus vahlbusch   male  6/27/1994 native american 566783801 3622594204 michigan  ridafonodoje
51     macnsu vahlbusch   male   9/6/1994 native american 566788301 3698594204 virginia ierdafonodoje
              address   zip
5  1701 pehkuw circle 14226
51 1701 pehkuw avenue 14262
-------------------------------------------------------
Pair: 149263/149500; weight: 24.79618
   first_name last_name gender birth_date ethnicity       SSN med_number        state         city
6       ariel    abbott   male  1/15/1940     white 356115809 2188281368 pennsylvania  miminimidid
61      arjel    asbott   <NA>  1/21/1940     white 356115809 2188281368      indiana kipavagetupo
              address   zip
6  1760 hitejove path 25312
61 1169 hitejove path 21132

Classify pairs

Once we have decided on one or two thresholds, we need to classify the record pairs. This is done using the classify function.

When one threshold is given, record pairs above the threshold are classified as links and pairs below the threshold are classified as non-links. Links are indicated by a class of 2 and non-links are indicated by 0. When two thresholds are given, pairs with weights between the two thresholds are classified as possible links, indicated by a class of 1.

lclass <- classify(lweights, 10)
lclass$class
lvec of type numeric with length 149500
Printing first 20 values:
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

N to M matching

n_to_m function allows us to take a classified object and force n-to-m matching.

The function requires the package lpSolve:

install.packages("lpSolve")
library(lpSolve)

The function takes integer values for the parameters n and m. Each element of x can be linked to at most n elements of y, and each element of y can be linked to at most m elements of x. Matches are selected to maximize the overall weight. The default values are n = 1 and m = 1, which means 1-to-1 matching.

In this example, we set n = 2 and m = 1. This means that it is possible for a record from the master data set x to be linked to a maximum of 2 records from the batch data set y, but each record from y can only be linked to one record from x.

lclass <- n_to_m(lclass, n = 2, m = 1)

When we have unique identifiers for the records, we can append the identities to each pair in the link object in order to get the true match status.

lclass <- set_true_match_id(lclass, id_x = 'id', id_y = 'id')

To view the performance of the linkage, we can create a confusion matrix:

confusion_matrix(lclass)
      Non-Links Possible Links Links
True     149260              0   240
False         0              0     0

Displaying result

There are a few ways to display the linkage result:

  1. matches matches is a function that allows us to extract matches for class 0, 1, 2 or >0. The result is a data frame with the original row indices for x and y for record pairs within the specified class.
matches <- matches(lclass, class = 2)
head(matches)
    x y
1  64 2
2 242 3
3  35 4
4  25 5
5 430 6
6 169 7

2 true_matches the function true_matches can be used when set_true_match_id has been implemented. true_matches returns the indices of x and y where the true match status is TRUE. This can potentially help us examine the weights calculated on true matches to investigate ways to improve the algorithm.

  1. link_result link_result displays record pairs within a specified class in pairwise fashion and with original information. Each pair is separated by a blank row (a row of NAs). This makes it easy to visually adjudicate the record pairs to see if the classification is accurate.

The class parameter defaults to class > 0, and can be specified by values 0, 1, or 2.

linkresult <- link_result(lclass, class = 2)
head(linkresult, 6)
  first_name  last_name gender birth_date        ethnicity       SSN med_number    state        city
1     dillon    ramirez female 02/01/1965 pacific islander 174725823 5124618654   kansas pakepucawar
2     dillon    ramirez female 02/01/1965 pacific islander 174725823 5124618654   kansas pakepucawar
3       <NA>       <NA>   <NA>       <NA>             <NA>        NA         NA     <NA>        <NA>
4    rebecca manzanares   male 09/08/1900 pacific islander 637701044 1248122191 delaware  ticgazsile
5    rebecca manzanares   male 09/08/1900 pacific islander 637701044 1248122191 delaware  ticgazsile
6       <NA>       <NA>   <NA>       <NA>             <NA>        NA         NA     <NA>        <NA>
             address   zip  id match_status original_row_index
1 718 fohgut highway 14316 365            2                 64
2 718 fohgut highway 14316 365            2                  2
3               <NA>    NA  NA           NA                 NA
4  248 ogugtez river 30854 797            2                242
5  248 ogugtez river 30854 797            2                  3
6               <NA>    NA  NA           NA                 NA

There is also an experimental feature for link_result, which is set through the parameter consolidated = T. The consolidated parameter controls the output display format for one-to-many matches. consolidated = T assumes that x is the master data and displays all y records that link to the same x record in one group. Each group is separated by an empty row. If consolidated = F, data is displayed in pairwise groups as described above.

Additional Material
Links