Company Name Standardization using a Fuzzy NLP Approach


1. Background

Any online B2B platform with a company registration process faces the common challenge of data harmonization with respect to the names of the registered entities. A common example is Cloud Service Providers, who have several business organizations and their entities as customers. If a Cloud Service Provider wants to determine its biggest customers, it is often faced with the daunting task of harmonizing business entity names and mapping them to their parent or legal entities. Quite a few organizations today handle this through painstaking manual processes.

In this study, we showcase a two-tier automated methodology for Company Name Standardization using NLP and Fuzzy Logic-based techniques. This reduces the effort required to less than 15% of what a fully manual process takes. The potential of such a study extends to various other domains, such as e-Procurement, e-Auctions, and government electronic platforms, where there are substantial data harmonization needs.

2. Problem Statement

The task at hand is this: given a list of customer names in a non-standardized format, return a corresponding set of standardized and cleansed names.

The original raw data can present several challenges, such as:

  • Different/non-standard legal entity forms, or department names in addition to the organization name
  • Abbreviated organization names
  • Spelling errors
  • Country/region names present in addition to the organization name
  • Email IDs provided instead of the organization name
  • Non-English characters
  • Subsidiary names that may not map to the parent organization name

3.  Solution Methodology

We follow a two-step solution approach for this problem. The first step identifies common business entity descriptor terms as 'stop words', which are then removed as 'common' words. In the second step, we use a fuzzy string matching based approach to achieve our objective of standardizing entity names. Fig. 1 details the two-step approach.

Fig.1   Schematic of two-step solution methodology

3.1    Step 1: Deep Dive

We start with a text cleansing exercise. We identify commonly occurring terms that are present in company names and remove them from the text. This process serves to reduce noise in the data that could otherwise lead the ML model to group different companies together. These common words are treated as stop-words.

For example, 'Corporation', 'Private Limited', and 'Solutions' are terms commonly present in several company names and can therefore incorrectly produce high similarity scores for different companies.

Detailed steps are listed below.

Step 1 workflow:

  1. Basic pre-processing: remove special characters, extra whitespace, and strings containing non-English characters, and convert all text to lower case
  2. Tokenize strings to analyse every word separately
  3. Identify the words occurring most frequently in the corpus
  4. Of the frequently occurring terms identified in step 3, manually select those that are not relevant to specific company names and which add noise. These words are treated as stop-words and removed from the analysis.

A code snippet for Step 1 is shown in Fig. 2.

Fig. 2 Code snippet for relevant python functions for Step 1
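As a rough stand-in for the figure, here is a minimal sketch of what the Step 1 functions might look like (the function names and the stop-word list are illustrative, not the ones used in the study):

```python
import re
from collections import Counter

def preprocess(name):
    """Lower-case, strip special characters, collapse extra whitespace."""
    name = name.lower()
    name = re.sub(r"[^a-z0-9\s]", " ", name)   # remove special characters
    return re.sub(r"\s+", " ", name).strip()   # collapse extra whitespace

def frequent_terms(names, top_k=50):
    """Most frequent tokens in the corpus -- candidates for stop-words."""
    tokens = [tok for name in names for tok in preprocess(name).split()]
    return Counter(tokens).most_common(top_k)

# Manually curated from frequent_terms() output; illustrative only
STOP_WORDS = {"corporation", "corp", "private", "limited", "ltd", "inc", "solutions"}

def cleanse(name):
    """Remove stop-words from a pre-processed company name."""
    return " ".join(t for t in preprocess(name).split() if t not in STOP_WORDS)
```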

3.2    Step 2: Deep Dive

After Step 1, we execute Step 2, wherein we use a fuzzy string matching based approach to achieve our objective of standardizing entity names.

Detailed steps are listed below.

Step 2 Workflow:

1. Determining Similarity Score

Using the cleansed company names obtained from Step 1, we create a similarity matrix S of dimension n×n, where n is the number of company names in our dataset. The element S_ij of the similarity matrix is a score that quantifies the text similarity between the i-th and j-th names.

For computing the score, we use the FuzzyWuzzy library in Python, which relies on the underlying concept of Levenshtein Distance to calculate the differences between two strings.

Several methods are available in the FuzzyWuzzy library to compute string similarities. For this study, we take the harmonic mean of the partial_ratio and token_set_ratio metrics from the FuzzyWuzzy package as our pairwise text similarity metric. This takes care of partial string matches.
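As a sketch, assuming the cleansed names sit in a Python list, the pairwise score and the similarity matrix might be computed as follows (the harmonic-mean combination follows the text; the function names are ours):

```python
import numpy as np
from fuzzywuzzy import fuzz

def pair_score(a, b):
    """Harmonic mean of partial_ratio and token_set_ratio, on a 0-100 scale."""
    p, t = fuzz.partial_ratio(a, b), fuzz.token_set_ratio(a, b)
    return 0.0 if p + t == 0 else 2.0 * p * t / (p + t)

def similarity_matrix(names):
    """n x n symmetric matrix of pairwise similarity scores."""
    n = len(names)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):          # the matrix is symmetric
            S[i, j] = S[j, i] = pair_score(names[i], names[j])
    return S
```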

2. Clustering of Similar Names

We run a clustering algorithm on this matrix to create clusters of names which potentially belong to the same company. The clustering algorithm used here was Affinity Propagation, as it chooses the number of clusters based on the data provided, as opposed to, say, K-means clustering, where the number of clusters has to be specified in advance. This algorithm also has the option to run clustering on a pre-computed similarity matrix.
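A minimal sketch, assuming S and the list cleansed_names come from the earlier steps (the damping value is illustrative):

```python
from sklearn.cluster import AffinityPropagation

# affinity="precomputed" lets us pass our own similarity matrix S
ap = AffinityPropagation(affinity="precomputed", damping=0.9, random_state=0)
labels = ap.fit_predict(S)

# Group the cleansed names by cluster label
clusters = {}
for name, label in zip(cleansed_names, labels):
    clusters.setdefault(label, []).append(name)
```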

3. Assigning Standard Names

Once the clusters are assigned, we consider all pairs of names within a particular cluster. For each pair, we find the longest common substring. This is done using the SequenceMatcher class from the difflib library in Python.

From the list of substrings for a cluster, we take the one with the highest occurrence (the mode), which is considered the standard name to be assigned to that cluster. The exercise is then repeated for all clusters. It is possible for a list to have multiple modes, in which case all the modes are returned.
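A minimal sketch of this step (the function names are ours):

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import multimode

def longest_common_substring(a, b):
    """Longest contiguous substring shared by two strings."""
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

def standard_names(cluster_names):
    """Mode(s) of the pairwise longest common substrings in one cluster."""
    if len(cluster_names) < 2:
        return list(cluster_names)
    subs = [longest_common_substring(a, b).strip()
            for a, b in combinations(cluster_names, 2)]
    return multimode(subs)   # if there are several modes, all are returned
```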

4. Confidence score

After the standard names are assigned, we measure the confidence that each standard name is the actual representative name for its cluster. This is done by comparing the cleansed string to the standard name. For cases where multiple standard names were identified, string matching is done with each, and the mean of all values is taken. The token_set_ratio function of the FuzzyWuzzy library is used again for this purpose.

This gives us a Confidence score, which quantifies how confidently we can say that the identified standard name truly represents the company behind the raw string.
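A minimal sketch of this computation (the function name is ours):

```python
from fuzzywuzzy import fuzz

def confidence(cleansed_name, candidate_names):
    """Mean token_set_ratio of the cleansed name against each standard name."""
    scores = [fuzz.token_set_ratio(cleansed_name, c) for c in candidate_names]
    return sum(scores) / len(scores)
```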

Thus, the manual effort is reduced to reviewing only those records where the Confidence score is low.

5. White Space correction

Finally, as a last step, we check whether two different standard names differ only in whitespace. If so, the whitespace is removed to arrive at a single standard name.
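A minimal sketch of this correction (the names are ours):

```python
def merge_whitespace_variants(names):
    """Map each standard name to one representative, ignoring whitespace."""
    canonical = {}
    for name in names:
        key = "".join(name.split())        # whitespace-free signature
        canonical.setdefault(key, name)    # the first variant seen wins
    return {name: canonical["".join(name.split())] for name in names}
```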

A code snippet for Step 2 is shown in Fig. 3.

Fig. 3 Code snippet for relevant python functions for Step 2 processing

4. Results

Table 1 below showcases results from the activity described in Section 3.

The column 'Raw Text' illustrates what non-standard names may look like. The result of the name standardization process on the 'Raw Text' is presented in the column 'Standard Name', along with the corresponding Confidence scores.

Table 1:  Sample result from code

5.  Future Developments and Constraints

  • The current capability is limited to English characters only. It can be extended to other languages using newly developed packages such as the R-based 'udpipe' package and methods like diacritic restoration.
  • We have not identified the actual legal name of each organization. This can be done by using our standard names as a lookup against curated company name data sources such as Bloomberg, Thomson Reuters, etc.
  • The methodology's pairwise similarity computation is O(N²), so scaling the current approach to large datasets would be computationally expensive.
  • If a company is referred to by an abbreviation in some records and by its full name in others, the algorithm will not be able to group them together and will put them in two separate clusters.
  • If certain instances of company names are anomalous, i.e., the name is very different from the actual company name, they may not be identified correctly.
  • If different companies have very similar names, they run the risk of being grouped together. For example, 'ABC' and 'ABC Document Imaging' may be grouped together even though they are different companies.

APPENDIX A: Details of Python Libraries used

1. FuzzyWuzzy

FuzzyWuzzy is a Python package for string matching. The underlying metric used is Levenshtein Distance. Several approaches are available in this package for string matching, namely: Simple ratio, Partial ratio, Token sort ratio, and Token set ratio. In the current study we use Partial ratio and Token set ratio.

2. Levenshtein Distance

The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions, or substitutions) required to change one word into the other. It is named after Vladimir Levenshtein, who introduced this metric in 1965.
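For reference, the distance lev(i, j) between the first i characters of a string a and the first j characters of a string b satisfies the standard recurrence, where the three terms correspond to deletion, insertion, and substitution:

$$
\mathrm{lev}(i,j)=\begin{cases}
\max(i,j) & \text{if } \min(i,j)=0,\\[4pt]
\min\bigl(\mathrm{lev}(i-1,j)+1,\ \mathrm{lev}(i,j-1)+1,\ \mathrm{lev}(i-1,j-1)+[a_i\neq b_j]\bigr) & \text{otherwise.}
\end{cases}
$$

For example, the Levenshtein distance between "kitten" and "sitting" is 3: two substitutions and one insertion.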

3. Partial ratio

The FuzzyWuzzy partial ratio score measures a string pair's similarity as an integer in the range [0, 100]. Given two strings X and Y, let the shorter string (X) be of length m. Partial ratio finds the FuzzyWuzzy ratio similarity measure between the shorter string and every substring of length m of the longer string, and returns the maximum of those similarity measures.
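For instance (the exact lower score depends on the library version, but the pattern holds):

```python
from fuzzywuzzy import fuzz

# "yankees" appears verbatim inside the longer string, so the best
# length-7 window is an exact match and partial_ratio returns 100.
fuzz.partial_ratio("yankees", "new york yankees")   # -> 100
fuzz.ratio("yankees", "new york yankees")           # well below 100
```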

4. Token set ratio

Token_set_ratio(): in addition to tokenizing the strings, sorting, and then pasting the tokens back together, this also performs a set operation that takes out the common tokens (the intersection) and then makes pairwise comparisons between the following strings using fuzz.ratio(), returning the maximum of the resulting scores:

s1 = Sorted_tokens_in_intersection

s2 = Sorted_tokens_in_intersection + sorted_rest_of_str1_tokens

s3 = Sorted_tokens_in_intersection + sorted_rest_of_str2_tokens
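For example, a string with a duplicated token is not penalized by token_set_ratio, since the token sets coincide:

```python
from fuzzywuzzy import fuzz

a = "fuzzy was a bear"
b = "fuzzy fuzzy was a bear"
fuzz.token_sort_ratio(a, b)   # penalized by the duplicated token
fuzz.token_set_ratio(a, b)    # -> 100: the token *sets* are identical
```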

5. Affinity Propagation (scikit-learn library)

Affinity Propagation creates clusters by sending messages between pairs of samples until convergence. A dataset is then described using a small number of exemplars, which are identified as those most representative of other samples. The messages sent between pairs represent the suitability for one sample to be the exemplar of the other, which is updated in response to the values from other pairs. This updating happens iteratively until convergence, at which point the final exemplars are chosen, and hence the final clustering is given.

6. Sequence matcher (difflib library)

This is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980s by Ratcliff and Obershelp under the hyperbolic name "gestalt pattern matching". The idea is to find the longest contiguous matching subsequence that contains no "junk" elements.

For an in-depth explanation, refer to: https://docs.python.org/2/library/difflib.html

Authors

Shashank Gupta is currently a Data Scientist at Brillio, a leading digital services organization. He works on various Text Analytics and NLP solutions for solving business problems for clients. In his previous role, he was with Dunnhumby as a Senior Applied Data Scientist, working on digital media personalization solutions to improve customer engagement for retail clients. He has over two years of work experience in the field of Data Science and Analytics.

Shashank completed his Master's in Physics and Bachelor's in Electronics and Instrumentation Engineering from BITS Pilani, Goa Campus.

Paulami Das is a seasoned Analytics leader with 14 years' experience across industries. She is passionate about helping businesses tackle complex problems through Machine Learning. Over her career, Paulami has worked on several large and complex Machine Learning-centric projects around the globe.

In her current role as Head of Data Science at Brillio Technologies, she heads a team that solves some of the most challenging problems for companies across industries using AI tools and techniques. Her team is also instrumental in driving innovation through building state-of-the-art AI-based products in the areas of Natural Language Processing, Computer Vision, and Augmented Analytics.

Prior to Brillio, Paulami was the Director of Business Development at Cytel, where she helped scale the new Analytics business lines. Before that, she held Analytics leadership positions with JP Morgan Chase and Dell.

Paulami graduated from IIT Kanpur with a degree in Electrical Engineering. She also holds an MBA from IIM Ahmedabad.

Corresponding Author:

Dr. Anish Roy Chowdhury is currently an Industry Data Science leader at Brillio, a leading digital services organization. In previous roles he was with ABInBev as a Data Science Research Lead, working in areas such as Assortment Optimization and Reinforcement Learning. He also led several machine learning projects in areas such as Credit Risk, Logistics, and Sales Forecasting. During his stint with HP Supply Chain Analytics, he developed data quality solutions for logistics projects and built statistical models to predict spare-part demand for large-format printers. Prior to HP, he had six years of work experience in the IT sector as a database programmer, where he worked on Credit Card Fraud Detection among other analytics-related projects. He holds a PhD in Mechanical Engineering (IISc Bangalore) and an MS degree in Mechanical Engineering from Louisiana State University, USA. He did his undergraduate studies at NIT Durgapur, with published research in GA-Fuzzy Logic applications to medical diagnostics.

Dr. Anish is also a highly acclaimed public speaker, with numerous best presentation awards from national and international conferences, and has conducted several workshops in academic institutes on R programming and MATLAB. He has several academic publications to his credit and is a chapter co-author for a Springer publication and an Oxford University Press best-selling publication on MATLAB.
