Traditional approaches to string matching, such as the Jaro-Winkler or Levenshtein distance measures, are too slow for large datasets. Using TF-IDF with n-grams as terms to find similar strings transforms the problem into a matrix multiplication problem, which is computationally much cheaper. Using this approach made it possible to search for near duplicates in a set of 663,000 company names in 42 minutes using only a dual-core laptop.
Update: you can run all the code in the post below with one line using string_grouper:
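A minimal sketch, assuming string_grouper is installed and the Kaggle CSV used later in this post is available locally (the file name is an assumption; adjust the path as needed):

```python
import pandas as pd
from string_grouper import match_strings

# hypothetical path to the Kaggle SEC EDGAR company names CSV
companies = pd.read_csv('sec_edgar_company_info.csv')
matches = match_strings(companies['Company Name'])
```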
Name Matching
A problem that I, and I suspect many others, have encountered when working with databases is name matching. Databases often contain multiple entries that relate to the same entity, for example a person or company, where one entry has a slightly different spelling than the other. These entries need to be de-duplicated. A similar problem occurs when you want to merge or join databases using names as the identifier.
The following table gives an example:
Company Name |
---|
Burger King |
Mc Donalds |
KFC |
Mac Donald’s |
For the human reader it is obvious that both Mc Donalds and Mac Donald’s refer to the same company. For a computer, however, these are completely different strings, which makes spotting such nearly identical entries difficult.
One way to solve this would be to use a string similarity measure like Jaro-Winkler or the Levenshtein distance. The obvious problem here is that the number of calculations required grows quadratically: every entry has to be compared with every other entry in the dataset, which in our case means calculating one of these measures 663,000^2 times. In this post I will explain how this can be done faster using TF-IDF, n-grams, and sparse matrix multiplication.
The Dataset
I grabbed a random dataset with lots of company names from Kaggle. It contains all company names in the SEC EDGAR database. I don’t know anything about the data or the number of duplicates in this dataset (it should be zero), but most likely there will be some very similar names.
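Loading the data with pandas might look like this (the file name is an assumption; use whatever the Kaggle download is called):

```python
import pandas as pd

# hypothetical file name for the Kaggle download
names = pd.read_csv('sec_edgar_company_info.csv')
names.head()
```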
 | Line Number | Company Name | Company CIK Key |
---|---|---|---|
0 | 1 | !J INC | 1438823 |
1 | 2 | #1 A LIFESAFER HOLDINGS, INC. | 1509607 |
2 | 3 | #1 ARIZONA DISCOUNT PROPERTIES LLC | 1457512 |
3 | 4 | #1 PAINTBALL CORP | 1433777 |
4 | 5 | $ LLC | 1427189 |
TF-IDF
TF-IDF is a method to generate features from text by multiplying the frequency of a term (usually a word) in a document (the Term Frequency, or TF) by the importance (the Inverse Document Frequency, or IDF) of the same term in an entire corpus. The IDF factor weights down frequent but uninformative words (e.g. “the”, “it”, “and”) and weights up words that occur infrequently. IDF is calculated as:
IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
An example (from www.tfidf.com/):
Consider a document containing 100 words in which the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4. Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.
TF-IDF is very useful in text classification and text clustering. It is used to transform documents into numeric vectors that can easily be compared.
N-Grams
While the terms in TF-IDF are usually words, this is not a necessity. In our case, using words as terms wouldn’t help us much, as most company names contain only one or two words. This is why we will use n-grams: sequences of N contiguous items, in this case characters. The following function cleans a string and generates all n-grams it contains:
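A sketch of such a function, following the cleaning steps described below (removing some punctuation and the string “ BD”):

```python
import re

def ngrams(string, n=3):
    # remove some punctuation (',', '-', '.', '/') and the substring ' BD'
    string = re.sub(r'[,-./]|\sBD', r'', string)
    # build all runs of n consecutive characters, e.g.
    # ngrams('McDonalds') -> ['McD', 'cDo', 'Don', 'ona', 'nal', 'ald', 'lds']
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]
```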
As you can see, the code above does some cleaning as well. Besides removing some punctuation (dots, commas, etc.) it removes the string “ BD”. This is a nice example of one of the pitfalls of this approach: terms that appear very infrequently receive a high IDF weight and can dominate the similarity. In this case there were some company names ending in “ BD” that were being identified as similar, even though the rest of the strings were not similar at all.
The code to generate the matrix of TF-IDF values for each company name is shown below.
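A sketch, assuming the names DataFrame loaded above and the ngrams function from the previous section:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

company_names = names['Company Name']
# use our character n-gram function instead of the default word analyzer
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(company_names)
```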
The resulting matrix is very sparse as most terms in the corpus will not appear in most company names. Scikit-learn deals with this nicely by returning a sparse CSR matrix.
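Printing a single row of the sparse matrix shows its non-zero entries as (row, column) pairs with their TF-IDF values:

```python
print(tf_idf_matrix[0])
```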
You can see that the first row (“!J INC”) contains three non-zero terms, at columns 11, 16196, and 15541.
The last term (‘INC’) has a relatively low value, which makes sense as this term will appear often in the corpus, thus receiving a lower IDF weight.
Cosine Similarity
To calculate the similarity between two vectors of TF-IDF values, the cosine similarity is usually used. The cosine similarity can be seen as a normalized dot product; for a good explanation see this site. In theory we could calculate the cosine similarity of all items in our dataset with all other items using scikit-learn’s cosine_similarity function, but the data scientists at ING found that this has some disadvantages:
- The sklearn version does a lot of type checking and error handling.
- The sklearn version calculates and stores all similarities in one go, while we are only interested in the most similar ones. Therefore it uses a lot more memory than necessary.
To work around these disadvantages they created their own library, sparse_dot_topn, which stores only the top N highest matches in each row, and only the similarities above an (optional) threshold.
The following code runs the optimized cosine similarity function. It only stores the top 10 most similar items, and only items with a similarity above 0.8:
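A sketch using the awesome_cossim_topn helper exported by sparse_dot_topn (the exact entry point has varied between versions of the library, so check the version you have installed):

```python
import time
from sparse_dot_topn import awesome_cossim_topn

t1 = time.time()
# multiply the TF-IDF matrix by its own transpose, keeping per row only
# the top 10 matches with a cosine similarity above 0.8
matches = awesome_cossim_topn(tf_idf_matrix, tf_idf_matrix.transpose(), 10, 0.8)
print('Took {:.2f}s'.format(time.time() - t1))
```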
The following code unpacks the resulting sparse matrix. As it is a bit slow, an option to look at only the first n values is added.
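A sketch of such an unpacking helper; it walks the non-zero entries of the sparse matrix and looks up the corresponding company names:

```python
import numpy as np
import pandas as pd

def get_matches_df(sparse_matrix, name_vector, top=100):
    # row and column indices of all non-zero similarities
    non_zeros = sparse_matrix.nonzero()
    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]

    # optionally look at only the first `top` values, since this loop is slow
    nr_matches = top if top else sparsecols.size

    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similarity = np.zeros(nr_matches)

    for index in range(0, nr_matches):
        left_side[index] = name_vector[sparserows[index]]
        right_side[index] = name_vector[sparsecols[index]]
        similarity[index] = sparse_matrix.data[index]

    return pd.DataFrame({'left_side': left_side,
                         'right_side': right_side,
                         'similarity': similarity})
```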
Let’s look at our matches:
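Unpacking a chunk of the matches and sampling some rows, after dropping the (near-)perfect scores, which are mostly names matched with themselves:

```python
matches_df = get_matches_df(matches, company_names, top=100000)
# drop self-matches (every name matches itself with similarity ~1)
matches_df = matches_df[matches_df['similarity'] < 0.99999]
matches_df.sample(20)
```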
 | left_side | right_side | similarity |
---|---|---|---|
41024 | ADVISORY U S EQUITY MARKET NEUTRAL OVERSEAS FUND LTD | ADVISORY US EQUITY MARKET NEUTRAL FUND LP | 0.818439 |
48061 | AIM VARIABLE INSURANCE FUNDS | AIM VARIABLE INSURANCE FUNDS (INVESCO VARIABLE INSURANCE FUNDS) | 0.856922 |
14978 | ACP ACQUISITION CORP | CP ACQUISITION CORP | 0.913479 |
54837 | ALLFIRST TRUST CO NA | ALLFIRST TRUST CO NA /TA/ | 0.938206 |
89788 | ARMSTRONG MICHAEL L | ARMSTRONG MICHAEL | 0.981860 |
54124 | ALLEN MICHAEL D | ALLEN MICHAEL J | 0.928606 |
66765 | AMERICAN SCRAP PROCESSING INC | SCRAP PROCESSING INC | 0.858714 |
44886 | AGL LIFE ASSURANCE CO SEPARATE ACCOUNT VA 27 | AGL LIFE ASSURANCE CO SEPARATE ACCOUNT VA 24 | 0.880202 |
49119 | AJW PARTNERS II LLC | AJW PARTNERS LLC | 0.876761 |
16712 | ADAMS MICHAEL C. | ADAMS MICHAEL A | 0.891636 |
96207 | ASTRONOVA, INC. | PETRONOVA, INC. | 0.841667 |
26079 | ADVISORS DISCIPLINED TRUST 1329 | ADVISORS DISCIPLINED TRUST 1327 | 0.862806 |
16200 | ADAMANT TECHNOLOGIES | NT TECHNOLOGIES, INC. | 0.814618 |
77473 | ANGELLIST-SORY-FUND, A SERIES OF ANGELLIST-SDA-FUNDS, LLC | ANGELLIST-NABS-FUND, A SERIES OF ANGELLIST-SDA-FUNDS, LLC | 0.828394 |
70624 | AN STD ACQUISITION CORP | OT ACQUISITION CORP | 0.855598 |
16669 | ADAMS MARK B | ADAMS MARY C | 0.812897 |
48371 | AIR SEMICONDUCTOR INC | LION SEMICONDUCTOR INC. | 0.814091 |
53755 | ALLEN DANIEL M. | ALLEN DANIEL J | 0.829631 |
16005 | ADA EMERGING MARKETS FUND, LP | ORANDA EMERGING MARKETS FUND LP | 0.839016 |
97135 | ATHENE ASSET MANAGEMENT LLC | CRANE ASSET MANAGEMENT LLC | 0.807580 |
The matches look pretty similar! The cosine similarity gives a good indication of the similarity between the two company names. ATHENE ASSET MANAGEMENT LLC and CRANE ASSET MANAGEMENT LLC are probably not the same company, and the similarity measure of 0.81 reflects this. When we look at the company names with the highest similarity, we see that these are pretty long strings that differ by only one character:
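Sorting on the similarity column:

```python
matches_df.sort_values(['similarity'], ascending=False).head(10)
```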
 | left_side | right_side | similarity |
---|---|---|---|
77993 | ANGLE LIGHT CAPITAL, LP - ANGLE LIGHT CAPITAL - QUASAR SERIES I | ANGLE LIGHT CAPITAL, LP - ANGLE LIGHT CAPITAL - QUASAR SERIES II | 0.994860 |
77996 | ANGLE LIGHT CAPITAL, LP - ANGLE LIGHT CAPITAL - QUASAR SERIES II | ANGLE LIGHT CAPITAL, LP - ANGLE LIGHT CAPITAL - QUASAR SERIES I | 0.994860 |
81120 | APOLLO OVERSEAS PARTNERS (DELAWARE 892) VIII, L.P. | APOLLO OVERSEAS PARTNERS (DELAWARE 892) VII LP | 0.993736 |
81116 | APOLLO OVERSEAS PARTNERS (DELAWARE 892) VII LP | APOLLO OVERSEAS PARTNERS (DELAWARE 892) VIII, L.P. | 0.993736 |
66974 | AMERICAN SKANDIA LIFE ASSURANCE CORP VARIABLE ACCOUNT E | AMERICAN SKANDIA LIFE ASSURANCE CORP VARIABLE ACCOUNT B | 0.993527 |
66968 | AMERICAN SKANDIA LIFE ASSURANCE CORP VARIABLE ACCOUNT B | AMERICAN SKANDIA LIFE ASSURANCE CORP VARIABLE ACCOUNT E | 0.993527 |
80929 | APOLLO EUROPEAN PRINCIPAL FINANCE FUND III (EURO B), L.P. | APOLLO EUROPEAN PRINCIPAL FINANCE FUND II (EURO B), L.P. | 0.993375 |
80918 | APOLLO EUROPEAN PRINCIPAL FINANCE FUND II (EURO B), L.P. | APOLLO EUROPEAN PRINCIPAL FINANCE FUND III (EURO B), L.P. | 0.993375 |
80921 | APOLLO EUROPEAN PRINCIPAL FINANCE FUND III (DOLLAR A), L.P. | APOLLO EUROPEAN PRINCIPAL FINANCE FUND II (DOLLAR A), L.P. | 0.993116 |
80907 | APOLLO EUROPEAN PRINCIPAL FINANCE FUND II (DOLLAR A), L.P. | APOLLO EUROPEAN PRINCIPAL FINANCE FUND III (DOLLAR A), L.P. | 0.993116 |
Conclusion
As we saw from visual inspection, the matches created with this method are quite good, as the strings are very similar. The biggest advantage, however, is the speed. The method described above can be scaled to much larger datasets by using a distributed computing environment such as Apache Spark. This could be done by broadcasting one of the TF-IDF matrices to all workers and parallelizing the other (in our case a copy of the TF-IDF matrix) into multiple sub-matrices. Each worker then multiplies (using NumPy or the sparse_dot_topn library) its part of the second matrix with the entire first matrix. An example of this is described here.