Guide to understanding reclin
By Hattie Zhou
Introduction
This document describes the key functions within the reclin
package and assists the user with understanding how to use the reclin
package to perform record linkage with large data sets.
Import Sample Data
First, we read in the sample data sets for linkage
x <- read.csv("test data/masterdata.csv", stringsAsFactors = F)
y <- read.csv("test data/errordata.csv", stringsAsFactors = F)
Here’s a look at the sample data set we will use in this guide:
head(x)
first_name last_name gender birth_date ethnicity SSN med_number state
1 Nathan Cordova female 11/09/1946 Asian 255383175 6358029309 Nevada
2 Luke Quick female 5/28/1952 Asian 125087187 6427424655 Pennsylvania
3 Jodi Baldino female 1/22/1971 Asian 516395786 5981030723 Georgia
4 Guled Shen male 4/15/1948 Pacific Islander 143355093 7285462698 New Jersey
5 Viarlenny Picazzo Banuelos male 10/08/1941 Black 264896375 2283441837 South Carolina
6 Alyce Dominguez female 07/07/2005 Pacific Islander 574179851 3540154617 Kansas
city address zip id
1 Falanolzace 899 Casjole Grove 68085 848
2 Gitopupriwa 499 Pihtowi Center 86488 157
3 Gupoowekuro 196 Pipafof Way 65928 980
4 Fawwelcoja 682 Purima Junction 14597 225
5 Hovruhmij 794 Ruwatibi Plaza 15605 598
6 Retuarekzani 1315 Dawoobot Way 46591 927
Reclin functions work with data sets in ldat
format. ldat
is a package that makes it easy to work with large datasets in R because it is able to process data by chunk in memory.
x <- as.ldat(x)
y <- as.ldat(y)
ldat
objects resemble a list object, except that they are stored as external pointers.
str(x)
List of 12
$ first_name:Class 'lvec' <externalptr>
..- attr(*, "type")= int 3
..- attr(*, "attributes")=<environment: 0x7fb64d8d9800>
$ last_name :Class 'lvec' <externalptr>
..- attr(*, "type")= int 3
..- attr(*, "attributes")=<environment: 0x7fb64d8ec678>
$ gender :Class 'lvec' <externalptr>
..- attr(*, "type")= int 3
..- attr(*, "attributes")=<environment: 0x7fb64d8f8540>
$ birth_date:Class 'lvec' <externalptr>
..- attr(*, "type")= int 3
..- attr(*, "attributes")=<environment: 0x7fb64d9029c0>
$ ethnicity :Class 'lvec' <externalptr>
..- attr(*, "type")= int 3
..- attr(*, "attributes")=<environment: 0x7fb64d90daa0>
$ SSN :Class 'lvec' <externalptr>
..- attr(*, "type")= int 1
..- attr(*, "attributes")=<environment: 0x7fb64d9188e0>
$ med_number:Class 'lvec' <externalptr>
..- attr(*, "type")= int 0
..- attr(*, "attributes")=<environment: 0x7fb64d923000>
$ state :Class 'lvec' <externalptr>
..- attr(*, "type")= int 3
..- attr(*, "attributes")=<environment: 0x7fb64d92e2a0>
$ city :Class 'lvec' <externalptr>
..- attr(*, "type")= int 3
..- attr(*, "attributes")=<environment: 0x7fb64d939380>
$ address :Class 'lvec' <externalptr>
..- attr(*, "type")= int 3
..- attr(*, "attributes")=<environment: 0x7fb64d944310>
$ zip :Class 'lvec' <externalptr>
..- attr(*, "type")= int 1
..- attr(*, "attributes")=<environment: 0x7fb64d95ff58>
$ id :Class 'lvec' <externalptr>
..- attr(*, "type")= int 1
..- attr(*, "attributes")=<environment: 0x7fb64d96b498>
- attr(*, "class")= chr "ldat"
From here on, we will refer to the master data set as x
and the batch data set as y
.
Record Linkage
Generate pairs
The first step to record linkage is to generate the list of record pairs to be compared and tested for linkage. In reclin, this can be done using the simple_blocking
function. This function uses the function gen_pairs
internally to create a ldat
object with two columns, which indicate the index of the record in x
and the index of of the record in y
that comprise of each pair.
If blocking is implemented, only pairs that adhere to the blocking criteria will be generated.
How gen_pairs
work:
- if no blocking variable is provided, the function simply returns the indices of the record pairs based on all combinations. For example, if
x
contains 1000 records andy
contains 500 records,gen_pairs
will return 500,000 pairs to be compared. This is done through anexpand.grid
function.
- Here’s an example of
expand.grid
; the output numbers in the example are equivalent to row indices forx
andy
:
expand.grid(c(1,2,3), c(5,6,7,8))
Var1 Var2
1 1 5
2 2 5
3 3 5
4 1 6
5 2 6
6 3 6
7 1 7
8 2 7
9 3 7
10 1 8
11 2 8
12 3 8
- if blocking variables are provided, the data sets are first ordered by the blocking variables. This facilitates the
rle.lvec
function to create a data.frame with one column representing the values of the blocking variable and another representing the number of times that value is represented (the count).
For each value of the blocking variable, such as “Arizona” if blocking on state,
rle.lvec
might tell us that 6 records inx
and 3 records iny
have “Arizona” as their state. We would expect 18 record pairs to be generated to compare all combinations of the 6x
records and 3y
records.Note: Records with NA values in the blocking variables are excluded.
Note: If the total number of pairs for one blocking variable exceed \(2147483647\), an error will occur because it’s higher than the machine maximum for integer value.
Note: If characters are lower-cased as well as upper-cased, the blocking may be wrong. This is because the order function in
rle.lvec
puts strings that start with a lower case letter after all strings that start with upper cased letters; however, when mergingx
value counts andy
value counts together, the functionsort
is used which correctly orders strings regardless of case. The two orders are used in conjunction to get record pairs, which may be disasterous. This issue is addressed in thelink
function which callsgen_pairs
internally by converting all strings to lower case. However, if weird things are happening with blocked record pairs, it may be worthwile to explore the ordering performed by these two functions.
- For each blocking value with a non-zero count, the function extracts the indices of
x
andy
and create the pairs result object.
How simple_blocking
works:
The function facilitates blocking and calls gen_pairs
internally. block_x
and block_y
parameters take in character vectors of variable names: the names provided in block_x
will be used to match variables in x
to variables in y
which is based on block_y
. The corresponding variables from x
and y
must be positionally matched, regardless of the actual variable name.
The function calls gen_pairs
for each blocking variable and appends the results together. If blocktype = "or"
, the blocking is based on agreement on one or more block variables. This means that providing more blocking variables relaxes the criteria and produces more pairs; any pair that match on at least one of the provided blocking variables will be included. If blocktype = "and"
, all blocking variables must agree for a record pair to be included.
The function returns an ordered object of all unique record pairs by indices.
We can see an example of the blocking here:
rpairs <- simple_blocking(x, y, block_x = c("state"), block_y = c("state"))
head(rpairs)
ldat with 3096 rows and 2 columns.
Printing first 20 rows:
x y
1 1 14
2 1 72
3 1 101
4 1 112
5 1 273
6 2 73
7 2 156
8 2 166
9 2 177
10 2 234
11 2 298
12 3 4
13 3 151
14 3 164
15 3 192
16 3 222
17 3 233
18 4 74
19 4 86
20 4 87
Here we can see the reduction in the universe of comparison pairs due to blocking:
total.possible.pairs <- nrow(x)*nrow(y); total.possible.pairs
[1] 149500
blocking.pairs <- nrow(rpairs); blocking.pairs
[1] 3096
Create a link object
The link
function is the main function that we will actually use in practice. It calls the simple_blocking
function (default) and creates a link
object, which contains both the original data and the paired data. Note that simple_blocking
is an argument to the function, and alternative blocking functions/methods can be used as long as the output format stays the same.
link
takes the arguments by_x
and by_y
to specify the variables that should be used in performing linkage. These parameters work the same way as block_x
and block_y
mentioned in simple_blocking
where variable names must be positionally matched.
link
takes the row indices generated by simple_blocking
and creates a comparison table for each record pair. The default comparison is exact, where if a variable value is identical between the record pair, it is given a comparison score of 1, and otherwise 0. link
also supports string distance comparisons where the scores are continuous between 0 and 1.
For string distance comparisons, we can provide either a single function, in which case it will be applied to all variables, or a separate function for each variable. In order to apply different string distance algorithms to different variables, we can either provide a named vector of the variables of interest, with the names being the corresponding variable names to apply the function to, or an unamed vector of length by_x
that is positionally matched to by_x
.
For a list of string distance algorithms provided, see function similarity
.
Note: The names should be matched to by_x
for situations where by_x
differ from by_y
.
Here’s an example of the argument to the link
function for string comparison:
comparator.funcs <- c(first_name = similarity_fun(method="jw"),
last_name = similarity_fun(method="jw"),
SSN = similarity_fun(method="lcs"),
med_number = similarity_fun(method="lcs"),
city = similarity_fun(method="jw"),
address = similarity_fun(method="jw"),
zip = similarity_fun(method="osa"))
The method = ""
parameter specifies the string distance algorithm used. Also, additional arguments to similarity functions can be passed through similarity_fun
.
link
calls the bcomp
function to create comparison patterns. bcomp
will call the corresponding string distance algorithm for each variable for each record pair. The resulting comparison table is saved in link$pairs
.
Example using link
, with the result being an object of class link
. The link
object includes the original data sets as well as the new output generated from the function in a list format:
vars <- setdiff(names(x), "id")
rpairs <- link(x, y, by_x = vars, by_y = vars, pairs_generator = simple_blocking,
comparator = comparator.funcs, progress = F)
Here’s the comparison table produced using the corresponding comparator.funcs
:
rpairs$pairs
ldat with 149500 rows and 13 columns.
Printing first 20 rows:
x y first_name last_name gender birth_date ethnicity SSN med_number state city address zip
1 1 1 0.5000000 0.0000000 0 0 0 0.3333333 0.4 0 0.5757576 0.5751634 0.0
2 2 1 0.4444444 0.4555556 0 0 0 0.2222222 0.3 0 0.5757576 0.4542484 0.0
3 3 1 0.4444444 0.4365079 0 0 0 0.4444444 0.4 0 0.5151515 0.4091503 0.2
4 4 1 0.4277778 0.6111111 1 0 1 0.3333333 0.2 0 0.4606061 0.5715170 0.0
5 5 1 0.5965608 0.4097222 1 0 0 0.3333333 0.5 0 0.4680135 0.4097417 0.0
6 6 1 0.5222222 0.3518519 0 0 1 0.4444444 0.5 0 0.6262626 0.4901961 0.0
7 7 1 0.5277778 0.5555556 0 0 0 0.3333333 0.3 0 0.5151515 0.5565875 0.0
8 8 1 0.6436508 0.4074074 1 0 1 0.4444444 0.4 0 0.5151515 0.3795518 0.0
9 9 1 0.4484127 0.0000000 1 0 0 0.1111111 0.3 0 0.4924242 0.4259454 0.4
10 10 1 0.6277778 0.3333333 1 0 0 0.2222222 0.4 0 0.2939394 0.5653595 0.0
11 11 1 0.3333333 0.4305556 1 0 1 0.1111111 0.2 0 0.4494949 0.4827264 0.0
12 12 1 0.6111111 0.0000000 1 0 1 0.1111111 0.4 0 0.4606061 0.5130719 0.0
13 13 1 0.4722222 0.3518519 0 0 0 0.1111111 0.3 0 0.4090909 0.5457762 0.0
14 14 1 0.6666667 0.0000000 0 0 0 0.4444444 0.4 0 0.4924242 0.5548086 0.0
15 15 1 0.5000000 0.4555556 0 0 0 0.3333333 0.2 0 0.6262626 0.3235294 0.2
16 16 1 0.4484127 0.0000000 0 0 1 0.4444444 0.4 0 0.5848485 0.4259454 0.2
17 17 1 0.5222222 0.6388889 1 0 0 0.4444444 0.4 0 0.4212121 0.4575163 0.0
18 18 1 0.4484127 0.0000000 1 0 0 0.3333333 0.4 0 0.6969697 0.4093137 0.4
19 19 1 0.7500000 0.3611111 0 0 0 0.1111111 0.2 0 0.2939394 0.6650327 0.4
20 20 1 0.5515873 0.0000000 1 0 1 0.4444444 0.3 0 0.4131313 0.4825135 0.4
Estimate m- and u-probabilities
Before probabilistic weights can be calculated, we must first estimate m
(probability of true link on agreement), u
(probability of true non-link on agreement), and p
(percentage of true links). This is done through the simple_em
function, which uses the EM algorithm.
Currently this algorithm does not directly support similarity scores, so we need to provide threshold levels to convert any such scores into binary scores (0 or 1). If the thresholds are not provided, it defaults to 0.999. The implication here is that if string distance algorithms are used but no corresponding thresholds are provided, we would not be able to get the desirable effect in the model.
argument to simple_em
comparator.threshold <- c(SSN = 0.77,
med_number = 0.799,
city = 0.799,
address = 0.86,
zip = 0.799)
Note: The comparator.threshold
follows the same rules as the string distance functions mentioned above (for comparator.funcs
)
simple_em
calls the function count_patterns
internally to turn all weight calculations into binary values based on provided thresholds.
count_patterns
will also keep the unique comparison patterns and summarise by the number of times it has been represented in the data set. This will be used in the EM algorithm.
Here’s an example of the output from count_patterns
. As we can see, all the decimal scores from the string distance algorithms have been converted to 1s and 0s.
head(count_patterns(rpairs, threshold = comparator.threshold))
first_name last_name gender birth_date ethnicity SSN med_number state city address zip n
1 0 0 0 0 0 0 0 0 0 0 0 64879
2 0 0 0 0 0 0 0 0 0 0 1 29
3 0 0 0 0 0 0 0 1 0 0 0 1243
4 0 0 0 0 0 1 0 0 0 0 0 7
5 0 0 0 0 1 0 0 0 0 0 0 17136
6 0 0 0 0 1 0 0 0 0 0 1 7
Note: count_patterns
assign NA values a similarity score of 0 (complete disagreement) before summarizing the comparison patterns in the record pairs. NA values arise when one or both of the records in a record pair contain a missing value for a particular variable.
The resulting object is a list with the estimated \(m-\) and \(u-probability\) for each variable:
umprob <- simple_em(rpairs, threshold = comparator.threshold, progress=F)
umprob
M- and u-probabilities estimated by the EM-algorithm:
Variable M-probability U-probability
first_name 0.7458324 1.038456e-03
last_name 0.7083325 9.580597e-04
gender 0.8833334 4.387847e-01
birth_date 0.7249991 3.349857e-05
ethnicity 0.7208336 2.079325e-01
SSN 0.9791655 1.004957e-04
med_number 0.9791665 1.339777e-05
state 0.7416658 1.954978e-02
city 0.8916656 6.699688e-06
address 0.8749989 4.015786e-19
zip 0.8999989 4.153825e-04
Matching probability: 0.001605353.
Calculate weights
The weights
function takes the \(u-\) and \(m-probabilities\) and the comparison table to calculate the weight of each record pair. It calls the calc_weights
function internally to perform this, and attaches the weights calculations to the link
object.
Any missing values in pairs is given a weight of 0, and Inf weights are converted to 999 while -Inf weights are converted to -999. The weights calculated for each variable is summed to the total weight of the record pair. The choice of the 999 and -999 is arbitrary, this is only to prevent the situation where one variable weight is Inf and another is -Inf for the same record pair, as that would produce a NaN.
lweights <- weights(umprob, rpairs)
We can see the calculated weights:
lweights$weights
lvec of type numeric with length 149500
Printing first 20 values:
[1] -9.293055 -9.728563 -7.569214 -4.734396 -6.306440 -5.358929 -8.530261 -3.737784 -7.751575
[10] -7.392298 -7.147309 -6.292688 -10.934357 -8.660368 -8.621041 -6.015274 -5.807315 -5.316380
[19] -8.433735 -3.516654
With this method of probabilistic linkage, we must be able to select an appropriate threshold that separates links from non-links. If a record pair has a weight above the threshold, it will be classified as a link. The reclin package offers tools for manual decision making for this threshold, which is done by reviewing the record pairs around certain weight thresholds to determine if they are true links.
In order to determine a starting point for manually reviewing pairs, we can use the weight_dist
function. The parameter nbreak
controls how many weight brackets the pairs are divided into.
Generally, the weight distribution follows a U-shape with the majority of the record pairs falling into the lowest weight brackets. This is expected as two data sets with 1000 records each will generate 1000,000 record pairs without blocking, but will only have at most 1000 true links. There will be a number of record pairs in the highest weight brackets that are the most probable links. We want to start by examining the lowest points in this U-shape since we want to find the tipping point in the weight distribution that separates the links from the non-links. Higher nbreak
is especially useful where there is a large number of record pairs as it offers more precise estimates of weight brackets with the fewest observations.
weight.distribution <- weight_dist(lweights, nbreak = 30); weight.distribution
(-15.7,-8.52] (-8.52,-1.29] (-1.29,5.93] (5.93,13.2] (13.2,20.4] (20.4,27.6] (27.6,34.8]
32549 115134 1572 6 1 2 3
(34.8,42.1] (42.1,49.3] (49.3,56.5] (56.5,63.7] (63.7,71] (71,78.2] (78.2,85.4]
15 11 13 4 1 0 0
(85.4,92.6] (92.6,99.9] (99.9,107] (107,114] (114,122] (122,129] (129,136]
0 0 0 0 0 0 0
(136,143] (143,150] (150,158] (158,165] (165,172] (172,179] (179,187]
0 0 0 0 0 0 0
(187,194] (194,201]
0 189
Selecting thresholds
To facilitate the manual review of pairs around a threshold, we use the dump_pairs
function. For efficiency purposes, this function returns a function that contains all the data in the link input object ordered by weight.
The function returned by dump_pairs
has two arguments. The first argument (threshold
) is the weight threshold around which the pairs are returned. With the optional second argument n
, we can specify the number of pairs above and below this threshold the should be returned.
inspect_pairs <- dump_pairs(lweights)
inspect_pairs(threshold = 10, n = 3)
-------------------------------------------------------
Pair: 149258/149500; weight: 6.318674
first_name last_name gender birth_date ethnicity SSN med_number state city
1 michael bird male 12/20/1932 asian 760323714 2987729994 minnesota dohretagij
2 jacob rdoriguez male 12/20/1932 black 233370654 7595193136 virginia inatkovpuri
address zip
1 1961 unabagu ridge 13769
2 348 denuufe way 15663
-------------------------------------------------------
Pair: 149259/149500; weight: 6.57134
first_name last_name gender birth_date ethnicity SSN med_number state city
2 adrian trujillo male 05/02/1978 black 946791066 6421451503 mississippi geszijarifu
21 edvfxer trujillo <NA> 02/06/1988 <NA> 533670405 698572559 mississippi baponipan
address zip
2 235 esulus avenue 54759
21 75 gihgoz path 15707
-------------------------------------------------------
Pair: 149260/149500; weight: 7.890392
first_name last_name gender birth_date ethnicity SSN med_number state city
3 aaron palmer female 12/07/1954 native american 191978094 6196267777 new mexico zohcuhicaca
31 andrew lattimer female 12/07/1954 native american 523968440 3957205541 south carolina amuziksiji
address zip
3 719 cozapena pike 29694
31 346 tivagir junction 85854
=======================================================
Weight = 10
239 ( 0.1599 %) pairs with larger weight.
=======================================================
Pair: 149261/149500; weight: 10.60413
first_name last_name gender birth_date ethnicity SSN med_number state city
4 kiana camagong male 12/24/1937 pacific islander 718001229 7619935485 california lulookasid
41 kiana cyaagdng male 12/26/1936 asian 178001229 7169934585 new jersey <NA>
address zip
4 1622 tuzeneh street 89988
41 1987 jtzeneh ct 89688
-------------------------------------------------------
Pair: 149262/149500; weight: 16.11421
first_name last_name gender birth_date ethnicity SSN med_number state city
5 marcus vahlbusch male 6/27/1994 native american 566783801 3622594204 michigan ridafonodoje
51 macnsu vahlbusch male 9/6/1994 native american 566788301 3698594204 virginia ierdafonodoje
address zip
5 1701 pehkuw circle 14226
51 1701 pehkuw avenue 14262
-------------------------------------------------------
Pair: 149263/149500; weight: 24.79618
first_name last_name gender birth_date ethnicity SSN med_number state city
6 ariel abbott male 1/15/1940 white 356115809 2188281368 pennsylvania miminimidid
61 arjel asbott <NA> 1/21/1940 white 356115809 2188281368 indiana kipavagetupo
address zip
6 1760 hitejove path 25312
61 1169 hitejove path 21132
Classify pairs
Once we have decided on one or two thresholds, we need to classify the record pairs. This is done using the classify
function.
When one threshold is given, record pairs above the threshold are classified as links and pairs below the threshold are classified as non-links. Links are indicated by a class
of 2 and non-links are indicated by 0. When two thresholds are given, pairs with weights between the two thresholds are classified as possible links, indicated by a class
of 1.
lclass <- classify(lweights, 10)
lclass$class
lvec of type numeric with length 149500
Printing first 20 values:
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
N to M matching
n_to_m
function allows us to take a classified object and force n-to-m matching.
The function requires the package lpSolve:
install.packages("lpSolve")
library(lpSolve)
The function takes integer values for the parameters n
and m
. Each element of x
can be linked to at most n
elements of y
, and each element of y
can be linked to at most m
elements of x
. Matches are selected to maximize the overall weight. The default values are n = 1
and m = 1
, which means 1-to-1 matching.
In this example, we set n = 2
and m = 1
. This means that it is possible for a record from the master data set x
to be linked to a maximum of 2 records from the batch data set y
, but each record from y
can only be linked to one record from x
.
lclass <- n_to_m(lclass, n = 2, m = 1)
When we have unique identifiers for the records, we can append the identities to each pair in the link object in order to get the true match status.
lclass <- set_true_match_id(lclass, id_x = 'id', id_y = 'id')
To view the performance of the linkage, we can create a confusion matrix:
confusion_matrix(lclass)
Non-Links Possible Links Links
True 149260 0 240
False 0 0 0
Displaying result
There are a few ways to display the linkage result:
matches
matches
is a function that allows us to extract matches for class 0, 1, 2 or >0. The result is a data frame with the original row indices forx
andy
for record pairs within the specified class.
matches <- matches(lclass, class = 2)
head(matches)
x y
1 64 2
2 242 3
3 35 4
4 25 5
5 430 6
6 169 7
2 true_matches
the function true_matches
can be used when set_true_match_id
has been implemented. true_matches
returns the indices of x
and y
where the true match status is TRUE
. This can potentially help us examine the weights calculated on true matches to investigate ways to improve the algorithm.
link_result
link_result
displays record pairs within a specified class in pairwise fashion and with original information. Each pair is separated by a blank row (a row of NAs). This makes it easy to visually adjudicate the record pairs to see if the classification is accurate.
The class parameter defaults to class > 0
, and can be specified by values 0, 1, or 2.
linkresult <- link_result(lclass, class = 2)
head(linkresult, 6)
first_name last_name gender birth_date ethnicity SSN med_number state city
1 dillon ramirez female 02/01/1965 pacific islander 174725823 5124618654 kansas pakepucawar
2 dillon ramirez female 02/01/1965 pacific islander 174725823 5124618654 kansas pakepucawar
3 <NA> <NA> <NA> <NA> <NA> NA NA <NA> <NA>
4 rebecca manzanares male 09/08/1900 pacific islander 637701044 1248122191 delaware ticgazsile
5 rebecca manzanares male 09/08/1900 pacific islander 637701044 1248122191 delaware ticgazsile
6 <NA> <NA> <NA> <NA> <NA> NA NA <NA> <NA>
address zip id match_status original_row_index
1 718 fohgut highway 14316 365 2 64
2 718 fohgut highway 14316 365 2 2
3 <NA> NA NA NA NA
4 248 ogugtez river 30854 797 2 242
5 248 ogugtez river 30854 797 2 3
6 <NA> NA NA NA NA
There is also an experimental feature for link_result
, which is set through the parameter consolidated = T
. The consolidated
parameter controls the output display format for one-to-many matches. consolidated = T
assumes that x
is the master data and displays all y
records that link to the same x
record in one group. Each group is separated by an empty row. If consolidated = F
, data is displayed in pairwise groups as described above.