If you're reading this, you should probably also know: Bill Winkler (mentioned multiple times below) has recently written an extensive survey on record linkage work.
In the Fellegi-Sunter model one estimates Prob(features(a,b)|match) and Prob(features(a,b)|nonmatch), where features(a,b) is some vector of comparison features between records a and b. One implication of this is that every optimal rule for deciding on matches can be obtained by appropriately thresholding the ratio Prob(features|match)/Prob(features|nonmatch).
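To make the decision rule concrete, here's a minimal sketch in Python (my notation and numbers, not Winkler's). It assumes the usual conditional-independence simplification, so the per-feature likelihood ratios just add in log space; Fellegi-Sunter actually uses two thresholds, with the middle region sent to clerical review.

```python
import math

# Illustrative per-feature conditional probabilities (made up):
# P(feature agrees | match) and P(feature agrees | nonmatch).
M = {"surname": 0.95, "first_name": 0.85, "zip": 0.80}
U = {"surname": 0.01, "first_name": 0.05, "zip": 0.10}

def log_ratio(features):
    """features maps field name -> True/False (agree/disagree)."""
    score = 0.0
    for f, agree in features.items():
        if agree:
            score += math.log(M[f] / U[f])
        else:
            score += math.log((1 - M[f]) / (1 - U[f]))
    return score

# Two thresholds (also made up): above UPPER declare a match, below
# LOWER a nonmatch, anything in between goes to clerical review.
UPPER, LOWER = 5.0, -5.0

def decide(features):
    s = log_ratio(features)
    if s > UPPER:
        return "match"
    if s < LOWER:
        return "nonmatch"
    return "possible match (review)"

print(decide({"surname": True, "first_name": True, "zip": False}))
```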
Experimentally, estimating these probabilities with EM produces extreme values for some match variables, so Winkler adds some ad hoc smoothing tricks. I have another worry: in classifying pairs as match/nonmatch, individual pairs (a,b1) and (a,b2) are not independent. In fact, their "classes" are highly correlated, since in many cases only one of (a,b1) and (a,b2) can be a match. The EM formulation glosses over this, though. For some reason I find this more disturbing than the assumption of match-feature independence.
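For concreteness, here is a toy sketch of the EM fit (my code, not Winkler's): each pair's agreement vector is modelled as a draw from a two-component mixture, and a Laplace-style smoothing count stands in for his ad hoc smoothing tricks. All numbers are made up.

```python
import math

def em_fit(pairs, n_iter=50, alpha=1.0):
    """pairs: list of binary agreement vectors, one entry per field.
    Fits a two-component mixture (match vs. nonmatch) under the
    feature-independence assumption; alpha is a Laplace smoothing count."""
    k = len(pairs[0])
    p_match = 0.1                 # mixture weight, initial guess
    m = [0.9] * k                 # P(field j agrees | match)
    u = [0.1] * k                 # P(field j agrees | nonmatch)
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        # Each pair is treated as an independent draw -- exactly the
        # gloss complained about above.
        post = []
        for x in pairs:
            lm, lu = math.log(p_match), math.log(1 - p_match)
            for j, xj in enumerate(x):
                lm += math.log(m[j] if xj else 1 - m[j])
                lu += math.log(u[j] if xj else 1 - u[j])
            post.append(1.0 / (1.0 + math.exp(lu - lm)))
        # M-step: re-estimate, smoothing so no probability is driven
        # all the way to 0 or 1 (the "extreme values" problem).
        n_m = sum(post)
        n_u = len(pairs) - n_m
        p_match = min(max(n_m / len(pairs), 1e-6), 1 - 1e-6)
        for j in range(k):
            agree_m = sum(w for w, x in zip(post, pairs) if x[j])
            agree_u = sum(1 - w for w, x in zip(post, pairs) if x[j])
            m[j] = (agree_m + alpha) / (n_m + 2 * alpha)
            u[j] = (agree_u + alpha) / (n_u + 2 * alpha)
    return p_match, m, u

# Toy data: three fields; the first two pairs look like matches.
pairs = [(1, 1, 1), (1, 1, 0), (0, 0, 0), (0, 1, 0), (0, 0, 0)]
print(em_fit(pairs, n_iter=20))
```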
Winkler also outlines a string-matching scheme to measure agreement between partially matching names, like "Cohen" and "Cohn". It's not described very completely in this paper, and I won't review it at all, mostly because it seems very ad hoc and more principled approaches now exist.
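The scheme isn't pinned down here, but the string comparator usually credited to Winkler is the Jaro-Winkler measure; for the curious, here is a sketch of it from the standard description (my reconstruction, not this paper's).

```python
def jaro(s, t):
    """Jaro similarity: fraction of matched characters, penalized
    for transpositions among the matches."""
    if s == t:
        return 1.0
    window = max(len(s), len(t)) // 2 - 1
    s_flags = [False] * len(s)
    t_flags = [False] * len(t)
    matches = 0
    for i, c in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_flags[j] and t[j] == c:
                s_flags[i] = t_flags[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear out of order.
    s_matched = (s[i] for i in range(len(s)) if s_flags[i])
    t_matched = (t[j] for j in range(len(t)) if t_flags[j])
    transpositions = sum(1 for c, d in zip(s_matched, t_matched) if c != d) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s, t, p=0.1):
    """Winkler's modification: boost strings sharing a common prefix
    (up to 4 characters)."""
    j = jaro(s, t)
    prefix = 0
    for c, d in zip(s, t):
        if c != d or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(jaro_winkler("cohen", "cohn"))   # high, as it should be
```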
Experimentally, the results for these methods are a bit unclear. They suggest that partial string matching is important, that EM requires smoothing, and that frequency-based parameter setting is quite important. Winkler blames the assumption of match-feature independence, and talks about using more general modelling techniques to fix it.
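Since "frequency-based parameter setting" may be cryptic: the idea is that agreement on a rare value (two records both named "Fellegi") is much stronger evidence of a match than agreement on a common one ("Smith"), so agreement weights are set per value from observed frequencies. A rough sketch of that idea, with made-up numbers:

```python
import math
from collections import Counter

# Made-up surname counts; in practice these come from the files being linked.
counts = Counter({"smith": 50, "jones": 30, "winkler": 2, "fellegi": 1})
total = sum(counts.values())

def agreement_weight(name, p_agree_match=0.95):
    """Value-specific log-weight for agreeing on a surname. Two random
    nonmatched records agree on a value roughly in proportion to its
    frequency, so rare values earn much larger weights."""
    p_agree_nonmatch = counts[name] / total
    return math.log(p_agree_match / p_agree_nonmatch)

print(agreement_weight("smith"))    # common name: small weight
print(agreement_weight("fellegi"))  # rare name: large weight
```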
Winkler says that the standard practice for matching records is an iterative one, in which Fellegi-Sunter parameters are estimated, records are reviewed, and parameters are adjusted, and that experts are needed to make it all work. He also notes that estimated match/nonmatch probabilities are usually not accurate enough to be useful.
He also mentions ways of estimating Prob(features|match) that don't assume independence. Again, the details aren't presented.
This paper is very much inspired by Kamal Nigam's thesis work on using EM for semi-supervised learning with multinomial naive Bayes.
Extends the Ristad-Yianilos algorithm to affine edit-distance metrics. Affine edit distances allow the cost of a sequence of N insertions (or deletions) to differ from N times the cost of a single insertion (or deletion).
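To make "affine" concrete, here is a sketch of the standard (non-probabilistic) affine-gap DP in the style of Gotoh's algorithm, where a gap of length N costs gap_open + (N-1)*gap_extend; the Ristad-Yianilos-style extension learns probabilistic versions of these costs, which this sketch does not attempt.

```python
def affine_edit_distance(s, t, sub=1.0, gap_open=2.0, gap_extend=0.5):
    """Edit distance where a run of N insertions (or deletions) costs
    gap_open + (N - 1) * gap_extend rather than N * unit cost.
    Three DP layers: M ends in a match/substitution, X ends in a gap
    in t (deletion from s), Y ends in a gap in s (insertion)."""
    INF = float("inf")
    n, m = len(s), len(t)
    M = [[INF] * (m + 1) for _ in range(n + 1)]
    X = [[INF] * (m + 1) for _ in range(n + 1)]
    Y = [[INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if s[i - 1] == t[j - 1] else sub
            M[i][j] = cost + min(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1])
            X[i][j] = min(gap_open + min(M[i-1][j], Y[i-1][j]),
                          gap_extend + X[i-1][j])
            Y[i][j] = min(gap_open + min(M[i][j-1], X[i][j-1]),
                          gap_extend + Y[i][j-1])
    return min(M[n][m], X[n][m], Y[n][m])

print(affine_edit_distance("cohen", "cohn"))  # one deletion: gap_open = 2.0
```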
A nice description of various edit distance algorithms, together with some probabilistic versions of these algorithms based on HMMs, can be found in the book Biological Sequence Analysis : Probabilistic Models of Proteins and Nucleic Acids by Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison.
There are also two sessions (Session 5, Session 8) on record linkage and confidentiality. There is some interesting material on probabilistic linkage, policy issues, and public attitudes toward privacy and linkage, but little that seems technically relevant to the interaction between disclosure limitation and linkage. Comment - I couldn't read many of these papers on Win98, only on Linux using an old (v4.0) Acrobat reader. There are more disclosure-relevant papers in the page of notable papers since the '85 workshop: