Cost-aware load balancing for multilingual record linkage using MapReduce

Medhat D.
Yousef A.H.
Salama C.

The volume of data being gathered and processed grows every day. Record linkage is one of the most complex data-intensive tasks: it aims to accurately match records from different data sources that describe the same entity, such as a person, especially when the records do not share a common identifier. As more resources become available in more than one language, new methods are required that are capable of matching records expressed in different languages. In this paper, we present a scalable, cost-aware load balancing technique over MapReduce that links records from different multilingual data sources accurately and efficiently by redistributing the multilingual matching tasks across the available machines based on their cost. We evaluate our approach on a Hadoop cluster on cloud infrastructure against state-of-the-art blocking-based load balancing techniques; our approach outperforms them in terms of execution time and scalability. © 2019 Ain Shams University
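
The idea of redistributing matching tasks by cost can be illustrated with a minimal sketch. The abstract does not specify the cost model or the assignment rule, so everything here is an assumption made for illustration: the cost of a block is taken to be its number of pairwise comparisons scaled by a hypothetical per-language weight, and tasks are assigned greedily (largest cost first) to the currently least-loaded worker. The names `block_cost` and `assign_tasks` are hypothetical and are not taken from the paper.

```python
import heapq
from typing import Dict, List


def block_cost(size: int, weight: float = 1.0) -> float:
    """Assumed cost model: pairwise comparisons in a block of `size`
    records, scaled by a hypothetical per-language matching weight."""
    return weight * size * (size - 1) / 2


def assign_tasks(costs: Dict[str, float], n_workers: int) -> List[List[str]]:
    """Greedy longest-processing-time assignment (an illustrative
    stand-in, not the paper's algorithm): take tasks in decreasing
    cost order and give each to the least-loaded worker so far."""
    # Min-heap of (current load, worker index).
    heap = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    assignment: List[List[str]] = [[] for _ in range(n_workers)]
    # Sort by cost descending; break ties by task name for determinism.
    for task, cost in sorted(costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, worker = heapq.heappop(heap)
        assignment[worker].append(task)
        heapq.heappush(heap, (load + cost, worker))
    return assignment


# Hypothetical blocks: (number of records, per-language weight).
blocks = {"a": (100, 1.0), "b": (40, 2.5), "c": (60, 1.0), "d": (10, 2.5)}
costs = {name: block_cost(size, w) for name, (size, w) in blocks.items()}
plan = assign_tasks(costs, n_workers=2)
```

In a MapReduce setting, a plan like this would typically be computed from block statistics gathered in a preliminary job and then used to route comparison tasks to reducers; that wiring is omitted here.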