Why is document similarity more important than ever? In the big data era, companies increasingly need to detect similar items in their databases. Imagine platforms like Kijiji or Subito trying to detect people who constantly duplicate announcements on the platform to boost their visibility without paying for the sponsorship the platform offers.
Assume, for example, that the platform has 1,000,000 announcements and wants to detect duplicates. If we check all the pairs one by one, we need to compare 499,999,500,000 (about half a trillion) pairs. If it takes a microsecond (0.000001 s) to check a pair, it would take roughly 500,000 seconds, which is approximately six days of computation just to check document similarity.
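As a quick sanity check of those numbers in Python:
n = 1_000_000                           # number of announcements
pairs = n * (n - 1) // 2                # 499,999,500,000 distinct pairs
seconds = pairs * 1e-6                  # at one microsecond per comparison
print(pairs, seconds, seconds / 86400)  # ~499,999.5 s, i.e. ~5.8 days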
In this article, we are going to explore data mining techniques to speed up the whole process of finding and checking document similarity. The whole algorithm is inspired by chapter 3 of the book “Mining of Massive Datasets”.
Dataset
In this document similarity guide, we will use some apartment announcements scraped from Kijiji a few years ago. You can also download the complete example here.
The file is a TSV file with a header:
Title | Short Description | Location | Price (Euro) | Timestamp | Url Adv |
---|---|---|---|---|---|
Studio accessoria… | Affitto studio a … | Roma | 450 | 12 October, 11:32 | https://www.kijij… |
Negozio 169Mq per… | Privato affitta n… | Prenestino / Casi… | 1.700 | 12 October, 08:45 | https://www.kijij… |
Negozio in tiburt… | Negozio c/1 roma … | Tiburtino / Colla… | 6.000 | 17 October, 21:20 | https://www.kijij… |
Studio medico via… | Studio medico avv… | Trieste / Somalia… | 200 | 17 October, 20:22 | https://www.kijij… |
To read the dataset we use the pandas library:
import pandas as pd
dataset=pd.read_csv("dataset_rent_rome_kijiji.tsv", sep="\t")
Fast document similarity: step by step
In the following section we are going to see:
- how to transform a document into a set using k-shingling
- how to represent a set in a compressed way computing its signature in such a way that set similarity is preserved, using MinHashing.
- how to implement a hashing function that hashes similar items in the same bucket, using LSH (locality-sensitive hashing).
Shingling of Documents
In the literature, representing documents as sets is considered an effective way to identify lexically similar documents. The k-shingles method represents a document as the set of its substrings of length k.
For example, if your document is 'I love pizza Margherita', a 6-shingle representation based on characters, including spaces, is {'I love', ' love ', 'love p', 'ove pi', ...}. Depending on the use case, you can build shingles out of words instead of characters, strip whitespace, and so on.
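As a quick illustration of the word-based variant, here is a minimal sketch (word_shingles is just a hypothetical helper, not used in the rest of the article, which sticks to character shingles):
def word_shingles(document, k=2):
    # lowercase, split on whitespace and take every run of k consecutive words
    words = document.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

# word_shingles("I love pizza Margherita") -> {('i', 'love'), ('love', 'pizza'), ('pizza', 'margherita')}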
In Python, we can implement a class that, given the parameter k and a document read into a string, returns the k-shingle set of a document:
import re
import hashlib

class shingler:
    def __init__(self, k):
        if k > 0:
            self.k = int(k)
        else:
            self.k = 10

    #inner class utility: collapse runs of spaces and newlines and lowercase the text
    def process_doc(self, document):
        return re.sub("( )+|(\n)+", " ", document).lower()

    def get_shingles(self, document):
        shingles = set()
        document = self.process_doc(document)
        for i in range(0, len(document) - self.k + 1):
            shingles.add(document[i:i + self.k])
        return shingles

    #map each shingle to an integer hash to save space (used later in the examples)
    def get_hashed_shingles(self, shingles_set):
        return {int(hashlib.sha1(s.encode('utf-8')).hexdigest()[-8:], 16) for s in shingles_set}
And use it as follows:
>>> doc = "I love pizza Margherita! xd 1111@\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
>>> shingler_instance = shingler(10)
>>> shingles_set = shingler_instance.get_shingles(doc)
>>> shingles_set
{' xd 1111@ ', 'herita! xd', 'izza margh', 'e pizza ma', 'rgherita! ', 'ta! xd 111', 'ita! xd 11', ' pizza mar', 'argherita!', 'rita! xd 1', 'zza marghe', 'a! xd 1111', 'a margheri', ' love pizz', 'erita! xd ', 'love pizza', 'za margher', 'margherita', 'gherita! x', ' margherit', 'i love piz', 'pizza marg', 've pizza m', 'ove pizza ', '! xd 1111@'}
Now we can check whether two documents are similar using the Jaccard similarity, a popular set-similarity measure:
$$ J(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} $$
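A minimal sketch of the same formula in Python, where s1 and s2 are two shingle sets produced by the shingler above:
def jaccard_similarity(s1, s2):
    # |intersection| / |union| of the two shingle sets
    return len(s1 & s2) / len(s1 | s2)

# e.g. jaccard_similarity(shingler(6).get_shingles("I love pizza Margherita"),
#                         shingler(6).get_shingles("I love pizza Marinara"))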
In general, we may want to hash shingles to reduce their size, but even hashed, a shingle set typically takes around four times the space of the original document. So “what’s the point?” you might think. Let me introduce you to MinHashing.
Minhash – The signature of a set
The minhash of a set can be seen as its signature: a short, fixed-length representation of the set. The magic of MinHashing is that it (approximately) preserves Jaccard similarity.
We can represent sets with a characteristic matrix: a matrix whose columns are the sets and whose rows are the elements. A cell contains a 1 if the row’s element belongs to the column’s set.
For example:
S1 = {a, b}, S2 = {a, c}, S3 = {a, b, c, d}
| |S1|S2|S3|
|-| -| -| -|
|a| 1| 1| 1|
|b| 1| 0| 1|
|c| 0| 1| 1|
|d| 0| 0| 1|
Conceptually, the minhash function takes a random permutation of the rows and, for each set, returns the first element (in permuted order) whose cell in the characteristic matrix contains a 1. If the permutation is badc, the permuted characteristic matrix is
| |S1|S2|S3|
|-| -| -| -|
|b| 1| 0| 1|
|a| 1| 1| 1|
|d| 0| 0| 1|
|c| 0| 1| 1|
and the minhash values are
Minhash(S1) = b
Minhash(S2) = a
Minhash(S3) = b
Computing the minhash for n random permutations of the rows of the characteristic matrix gives a signature of length n for each set. The key property is that, under a random permutation, the probability that two sets get the same minhash value equals their Jaccard similarity, which is why signatures (approximately) preserve it.
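To make the idea concrete, here is a tiny sketch (purely illustrative; the real implementation below replaces permutations with hash functions) that reproduces the toy example above:
S1, S2, S3 = {'a', 'b'}, {'a', 'c'}, {'a', 'b', 'c', 'd'}
permutation = ['b', 'a', 'd', 'c']

def minhash_with_permutation(s, permutation):
    # the minhash is the first row, in permuted order, whose element belongs to the set
    for row in permutation:
        if row in s:
            return row

print([minhash_with_permutation(s, permutation) for s in (S1, S2, S3)])  # ['b', 'a', 'b']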
Minhash in practice
The permutation approach is not feasible in practice, so once again we approximate it with hash functions. We define a family of hash functions:
import hashlib

# the family of hash functions, in this case, is the same function (sha1) applied with a different salt.
class hashFamily:
    def __init__(self, i):
        self.resultSize = 8  # how many bytes we want back
        self.maxLen = 20  # how long our salt can be (in decimal digits)
        self.salt = str(i).zfill(self.maxLen)[-self.maxLen:]

    def get_hash_value(self, el_to_hash):
        return int(hashlib.sha1(str(el_to_hash).encode('utf-8') + self.salt.encode('utf-8')).hexdigest()[-self.resultSize:], 16)

# NOTE: we use sha1 to avoid installing and importing an external library, sacrificing performance. No cryptographic hash is required for this use case.
Then we compute a minhash signature for a set with the following algorithm:
- Take the first hash function, and apply it to all of the shingle values in a document. Find the minimum hash value produced and use it as the first component of the MinHash signature.
- Now take the second hash function, and again find the minimum resulting hash value, and use this as the second component.
- And so on…
So if we have 10 random hash functions, we’ll get a MinHash signature with 10 values for each set. We’ll use the same 10 hash functions for every document in the dataset and generate their signatures as well.
import math
from random import randint

class minhashSigner:
    def __init__(self, sig_size):
        self.sig_size = sig_size
        #init the family of hash functions, each with a random salt
        self.hash_functions = [hashFamily(randint(0, 10000000000)) for i in range(0, sig_size)]

    def compute_set_signature(self, set_):
        set_sig = []
        for h_funct in self.hash_functions:
            min_hash = math.inf
            for el in set_:
                h = h_funct.get_hash_value(el)
                if h < min_hash:
                    min_hash = h
            set_sig.append(min_hash)
        return set_sig

    #return a list of lists that can be seen as the signature matrix
    def compute_signature_matrix(self, set_list):
        signatures = []
        for s in set_list:
            signatures.append(self.compute_set_signature(s))
        return signatures
Example usage:
>>> doc = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
>>> doc2 = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident bla bla bla."
>>> shingler_inst = shingler(10)
>>> shinglings = shingler_inst.get_shingles(doc)
>>> shinglings2 = shingler_inst.get_shingles(doc2)
>>> minhash_sig = minhashSigner(20).compute_signature_matrix([shinglings,shinglings2])
>>> minhash_sig
[[1783860, 5303198, 4845898, 3935831, 198103, 5887007, 8061377, 4526224, 8467452, 7766250, 15505788, 825368, 12316643, 4502308, 6403903, 9259449, 8533874, 31889076, 940623, 278359], [1783860, 20203690, 4845898, 3935831, 198103, 12350289, 8061377, 4526224, 8467452, 7766250, 15505788, 825368, 12316643, 6122847, 6403903, 9259449, 8533874, 31889076, 940623, 278359]]
>>>
>>> set_sig_1 = set(minhash_sig[0])
>>> set_sig_2 = set(minhash_sig[1])
>>> jaccard_similarity_sig = len(set_sig_1.intersection(set_sig_2))/len(set_sig_1.union(set_sig_2))
>>> jaccard_similarity_sig
0.7391304347826086
>>>
>>> jaccard_similarity_shingle_set = len(set(shinglings).intersection(shinglings2))/len(set(shinglings).union(shinglings2))
>>> jaccard_similarity_shingle_set
0.8285077951002228
From the example, we can see that the Jaccard similarity between the signatures is close to the similarity between the shingle sets. We can reduce the approximation error by tuning the shingle size and the signature size.
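As a side note, the textbook estimator of the Jaccard similarity from two signatures is the fraction of positions on which they agree (the set-based Jaccard used above is a rougher proxy). A minimal sketch:
def signature_similarity(sig1, sig2):
    # fraction of components in which the two minhash signatures agree
    return sum(1 for a, b in zip(sig1, sig2) if a == b) / len(sig1)

# signature_similarity(minhash_sig[0], minhash_sig[1]) gives 17/20 = 0.85 for the signatures printed above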
Locality Sensitive Hashing
Quoting “Mining of Massive Datasets”
‘Locality-sensitive hashing (also known as near-neighbor search) is a general theory focused on how to approximately find similar pairs without investigating all of them. The principle is that we are going to hash items several times in such a way that similar items are more likely to be hashed to the same bucket than dissimilar items are.’
Once your sets are represented by minhash signatures, a popular choice for LSH is the “banding technique”, which works as follows:
- get the signature matrix and divide it into bands
- hash each column of the band; the columns that hash to the same bucket are likely to be similar
- repeat for all the bands. You can use the same hash function as in the previous band, but you must use separate buckets for each band (a note on choosing the number of bands follows the list).
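A quick side note on the parameters: with b bands of r rows each, a pair of documents whose signatures agree on a fraction s of positions becomes a candidate in at least one band with probability 1 - (1 - s^r)^b (chapter 3 of the book derives this). A small sketch to explore that curve, assuming the parameters used later in the benchmark (a signature of length 50 split into 10 bands of 5 rows):
def candidate_probability(s, bands_nr, rows_per_band):
    # probability that two documents with signature similarity s share a bucket in at least one band
    return 1 - (1 - s ** rows_per_band) ** bands_nr

for s in (0.2, 0.5, 0.8, 0.9):
    print(s, round(candidate_probability(s, bands_nr=10, rows_per_band=5), 3))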
In Python:
class lsh:
    def __init__(self, threshold=0.8):
        self.threshold = threshold

    def get_signature_matrix_bands(self, sig_matrix, bands_nr, sign_len):
        #bands_nr = b
        #sign_len = n
        r = int(sign_len / bands_nr)  #number of rows in each band
        bands = {}  # {band_nr: [col_1, col_2, ...]} where col_1 is all the values of Sig(S_i) for band b.
        for i in range(0, bands_nr):
            bands[i] = []
        # Put subsets of the columns of the signature matrix into the appropriate band and consider each column
        # as a single block so that we can hash the entire column.
        # Basically a band is a list of elements, where each element is a subset of a signature of a given set.
        for signature in sig_matrix:
            for i in range(0, bands_nr):
                idx = i * r
                bands[i].append(' '.join(str(x) for x in signature[idx:idx + r]))
        return bands

    #band is a list
    #construct a dictionary {hash(band_column): [doc_ids that produced this hash]}
    def get_band_buckets(self, band, hash_funct):
        buckets = {}
        for doc_id in range(0, len(band)):
            value = hash_funct.get_hash_value(band[doc_id])
            if value not in buckets:
                buckets[value] = [doc_id]
            else:
                buckets[value].append(doc_id)
        return buckets

    def get_candidates_list(self, buckets):
        candidates = set()
        # buckets is a dictionary containing key=bucket, value=list of doc_ids that hashed to that bucket
        for bucket, candidate_list in buckets.items():
            if len(candidate_list) > 1:
                for i in range(0, len(candidate_list) - 1):
                    for j in range(i + 1, len(candidate_list)):
                        pair = tuple(sorted((candidate_list[i], candidate_list[j])))
                        candidates.add(pair)
        return candidates  #i.e. a set of couples, each couple is a candidate pair

    def check_candidates(self, candidates_list, threshold, sigs):
        similar_docs = set()  #set of tuples
        # similar_pair is a couple containing doc_ids of documents that hashed to the same bucket
        for similar_pair in candidates_list:
            #for each candidate pair of documents check the similarity of their signatures
            doc_id_1 = similar_pair[0]
            doc_id_2 = similar_pair[1]
            signature_1 = set(sigs[doc_id_1])  #get the i-th column from the signature matrix where i is the doc_id in the collision list
            signature_2 = set(sigs[doc_id_2])
            js = len(signature_1.intersection(signature_2)) / len(signature_1.union(signature_2))
            if js >= threshold:
                similar_docs.add(tuple(sorted((doc_id_1, doc_id_2))))
        return similar_docs

    def get_similar_items(self, sig_matrix, bands_nr, sign_len):
        similar_docs = set()
        #divide the signature matrix into bands
        bands = self.get_signature_matrix_bands(sig_matrix, bands_nr, sign_len)
        #for all the bands
        for band_id, elements in bands.items():
            #produce the buckets for the given band (band_id) with a random hash function
            buckets = self.get_band_buckets(elements, hash_funct=hashFamily(randint(0, 10000000000)))
            #get all the candidate pairs
            candidates = self.get_candidates_list(buckets)
            #check all candidate pairs' signatures
            for sim_tuple in self.check_candidates(candidates, self.threshold, sig_matrix):
                similar_docs.add(sim_tuple)
        return similar_docs  #return all the pairs whose signatures respect the threshold
The class performs the following tasks:
- Divide the signature matrix into bands with get_signature_matrix_bands(sig_matrix, bands_nr, sign_len). For each band, the method returns a list of strings, where each string is the slice of a signature (a column of the signature matrix) that falls in that band; this is done so that the entire column slice can be hashed in the next step.
- Compute the buckets for each band with get_band_buckets(band, hash_funct), which takes a band b and a hash function h and returns a dict whose keys are bucket ids and whose values are the lists of document ids that hashed to that bucket for band b: $\{bucket_{id}: [doc_i, doc_k, \dots]\}$.
- With get_candidates_list(buckets) we take as input the buckets of a band and produce as output a set of tuples: each tuple represents a pair of documents that hashed to the same bucket, i.e. a candidate pair.
- With check_candidates(candidates_list, threshold, sigs) we take all the candidates from all the bands and check whether their signatures actually have an approximate Jaccard similarity greater than the threshold we give as a parameter.
- With get_similar_items(sig_matrix, bands_nr, sign_len) we put together all the methods listed above and return the ids of the similar documents that respect the threshold value.

The similarity threshold is passed to the class constructor; the default is 0.8.
Example usage:
>>> shingler_inst = shingler(10)
>>> doc = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
>>> doc2 = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident bla bla bla."
>>>
>>> shinglings = shingler_inst.get_shingles(doc)
>>> shinglings = shingler_inst.get_hashed_shingles(shinglings)
>>>
>>> shingle2 = shingler_inst.get_shingles(doc2)
>>> shingle2 = shingler_inst.get_hashed_shingles(shingle2)
>>>
>>> signer = minhashSigner(100)
>>> signature_matrix = signer.compute_signature_matrix([shinglings,shingle2])
>>> lsh_instance = lsh(threshold=0.7)
>>> lsh_instance.get_similar_items(signature_matrix,10,100)
{(0, 1)}
Benchmark Time
Now that we have implemented the whole code for fast approximate duplicate document detection, let’s test it out against brute force detection and see what happens.
As a first step, we load the dataset into a pandas DataFrame.
print("Loading dataset...")
dataset=pd.read_csv("dataset_rent_rome_kijiji.tsv", sep="\t")
dataset['doc_id']=dataset.index
doc_nr = dataset['doc_id'].max()
print("Dataset loaded correctly.")
Then we transform the documents into k-shingles.
print("Producing Shingles...")
start_time = time.time()
#an array where the index i represent the document_id and the element shingling_list[i] the hashed shingles for document document_id
shingling_list = [None] * (doc_nr +1)
shingling_size = 10
signature_size = 50
bands_nr = 10
shingler_inst = shingler(shingling_size)
#produce hashed shinglings for all documents
for index, row in dataset.iterrows():
doc = row['Title']+" "+row['Short Description']
i = row['doc_id']
shinglings = shingler_inst.get_hashed_shingles( shingler_inst.get_shingles(doc) )
shingling_list[i] = shinglings
end_time = time.time()
print("Shingles produced in:\t %.2f seconds."%(end_time - start_time))
Then we sign all the shingle sets and get the signature matrix.
signer = minhashSigner(signature_size)
start_time = time.time()
print("Computing signature matrix...")
#produce a signature for each shingle set
signature_matrix = signer.compute_signature_matrix( shingling_list )
end_time = time.time()
print("Signature Matrix computed in:\t %.2f seconds." %(end_time - start_time))
Finally, we compute the similar pairs with LSH.
lsh_instance = lsh(threshold=0.8)
start_time = time.time()
print("Computing LSH similarity...")
lsh_similar_itemset = lsh_instance.get_similar_items(signature_matrix, bands_nr, signature_size)
end_time = time.time()
lsh_computation_time = end_time - start_time
print("LSH Similarity computed in:\t %.2f seconds.\nSimilar Elements Found: %d" %(lsh_computation_time,len(lsh_similar_itemset)))
For comparison, we compute the similar pairs using brute force:
class bfsc():
    def compare_shingles_set_js(self, set1, set2):
        return len(set1.intersection(set2)) / len(set1.union(set2))

brute_force_comparer = bfsc()
brute_force_similar_items = set()
start_time = time.time()
#compare every pair of documents (doc ids go from 0 to doc_nr included)
for i in range(0, doc_nr):
    for j in range(i + 1, doc_nr + 1):
        similarity = brute_force_comparer.compare_shingles_set_js(set(shingling_list[i]), set(shingling_list[j]))
        if similarity >= 0.8:
            brute_force_similar_items.add((i, j))

end_time = time.time()
bf_computation_time = end_time - start_time
Check for similar pairs
- LSH
docs = next(iter(lsh_similar_itemset))  #peek at a pair without removing it from the set
print("Doc 1:")
print(dataset.iloc[docs[0]])
print("Is similar to\nDoc2:")
print(dataset.iloc[docs[1]])
- Brute Force
docs = next(iter(brute_force_similar_items))  #peek at a pair without removing it from the set
print("Doc 1:")
print(dataset.iloc[docs[0]])
print("Is similar to\nDoc2:")
print(dataset.iloc[docs[1]])
Print stats
report = ('LSH\n%.2f seconds to execute\n'
'%d similar documents found\n\n'
'Bruteforce search\n%.2f seconds to execute\n'
'%d similar documents found.\n\n'
'They find %d common similarities.')
print(report %(lsh_computation_time, len(lsh_similar_itemset),bf_computation_time,len(brute_force_similar_items),len(brute_force_similar_items.intersection(lsh_similar_itemset)) ))
Conclusion
In this article about checking document similarity we have seen:
- how to transform a document into a set using k-shingling
- how to represent a set in a compressed way computing its signature in such a way that set similarity is preserved, using MinHashing.
- how to implement a hashing function that hashes similar items in the same bucket, using LSH (locality-sensitive hashing).
- how to compare the approximate method with the brute force method.
All the theory behind the code here can be found in chapter 3 of the book “Mining of Massive Datasets”.
You can find the full example on GitHub. I also wrote a notebook with an Apache Spark implementation, available on GitHub.
Even though it is interesting to understand how the algorithm works, it is always a good idea to look for robust libraries when putting it into production. Many implementations exist, and which one fits best depends on the use case; here are a few: