Parse and Validate Unstructured Contact Data
Many Openprise customers receive contact data from third parties that is rather “untidy”. This data can contain person names, company names, and addresses, and it can be very difficult for your applications to ingest until it is put into a more structured format. Let’s take a look at some of the challenges we see when parsing unstructured data, then outline how Openprise can help.
Unstructured Contact Data Challenges
Unknown Data Combinations
Contact data can contain any combination of person, company, and address data. The address data can also contain any combination of common abbreviations. For example:
- John Doe, Acme Corporation, 123 Sunny View Ave, Springfield, CA 91234
- John Doe, 123 Sunny View Avenue, Springfield, California 91234-1000
- Acme Corp, Attn: John Doe, 123 Sunnyvale Av, Springfield, CA 91234
Lack of Structured Delimitation
Data can arrive either as one big block of text or as a number of generic, unstructured data fields, for example: Address Line 1, Address Line 2, … Address Line 5.
International Addresses
US and Canadian addresses are relatively easy to handle. They are highly structured, with easily identifiable parts based on US and Canadian postal service standards, namely: house number, street name, city, state/province, and ZIP or postal code. European addresses are more varied in format and can be more challenging. Addresses from developing countries can be very challenging: many addresses from South America and Asia lack structure and can be landmark based. For example: Acme Co, Empire Building, 6th Floor, across the street from Big Bank Tower, Downtown District.
Person Name Combinations
Data can contain one or more person names, and these names can come with the following complications:
- Different first, middle, and last name combinations, for example: John S. Doe and Jane Joe, John Stephen and Jane Doe
- Long multi-part international names, for example: Mario Eliecer Narvaez Torres
- Relationship words mixed in with the name, such as: attn, c/o, custodian of, trustee
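To make these complications concrete, here is a minimal Python sketch of the kind of pre-processing they call for. It is not how Openprise handles names internally; the helper names (`strip_relationship_words`, `split_names`, `parse_person`) and the word list are assumptions made up for illustration.

```python
import re

# Hypothetical marker list, taken from the examples in this article.
RELATIONSHIP_WORDS = ("custodian of", "attn", "c/o", "trustee")

_REL_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in RELATIONSHIP_WORDS) + r")\b[:.]?\s*",
    re.IGNORECASE,
)


def strip_relationship_words(text):
    """Remove markers such as 'Attn:' or 'c/o' from a name string."""
    return _REL_RE.sub("", text).strip()


def split_names(text):
    """Split 'A and B' or 'A & B' into separate name strings."""
    return [p.strip() for p in re.split(r"\s+(?:and|&)\s+", text, flags=re.IGNORECASE)]


def parse_person(name):
    """Naive split: first token, last token, everything between as middle."""
    tokens = name.split()
    if not tokens:
        return {"first": "", "middle": "", "last": ""}
    return {
        "first": tokens[0],
        "middle": " ".join(tokens[1:-1]),
        "last": tokens[-1] if len(tokens) > 1 else "",
    }
```

Note how quickly the naive split breaks down: "John Stephen and Jane Doe" yields "John Stephen" with no way to tell whether Stephen is a middle name or a surname, and "Mario Eliecer Narvaez Torres" gets a single-word last name even though both final words may be surnames. That ambiguity is exactly why these cases are hard.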
How To Use Openprise to Process Unstructured Contact Data
Openprise provides powerful data automation features to turn your unstructured contact data into standardized data that is parsed, structured, cleaned, validated, and normalized. Here is how:
Parse
Openprise’s Contact Information Parsing rule parses your unstructured contact data into component parts, namely:
- Person name, company name, and address
- Further parse person name into first, middle, last names, salutation, and suffix
- Further parse address into number, floor, street, locality, state, country, postal code, and so on
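To give a feel for what this first level of parsing involves, here is a rough Python sketch that splits a comma-delimited, US-style contact string into person, company, and address parts. It is a simplified heuristic for illustration only, not the Contact Information Parsing rule itself; `parse_contact`, `COMPANY_HINTS`, and the field names are assumptions.

```python
import re

# Matches a trailing "<state> <ZIP or ZIP+4>" segment, e.g. "CA 91234".
STATE_ZIP_RE = re.compile(r"^(?P<state>[A-Za-z ]+?)\s+(?P<zip>\d{5}(?:-\d{4})?)$")

# Hypothetical hints that a segment is a company name.
COMPANY_HINTS = {"corp", "corporation", "inc", "llc", "ltd", "co"}


def parse_contact(raw):
    """Heuristically split a comma-delimited, US-style contact string."""
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    result = {"person": None, "company": None, "street": None,
              "city": None, "state": None, "postal_code": None}

    # The last segment is expected to hold state and postal code.
    if parts:
        match = STATE_ZIP_RE.match(parts[-1])
        if match:
            result["state"] = match.group("state")
            result["postal_code"] = match.group("zip")
            parts.pop()

    # The street is the first segment starting with a house number;
    # the segment after it, if any, is taken as the city.
    for i, part in enumerate(parts):
        if part[0].isdigit():
            result["street"] = part
            if i + 1 < len(parts):
                result["city"] = parts[i + 1]
            parts = parts[:i]
            break

    # Whatever precedes the street is either a company or a person.
    for part in parts:
        words = {w.strip(".").lower() for w in part.split()}
        if words & COMPANY_HINTS:
            result["company"] = part
        else:
            result["person"] = re.sub(r"^attn:?\s*", "", part, flags=re.IGNORECASE)
    return result
```

Running it on the three sample strings from the start of this article recovers the same person, street, and city each time, regardless of field order. A landmark-based address with no house number or postal code would defeat it, which is why rule-based splitting alone is rarely enough.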
Validate and Auto-Complete Address
Openprise’s Contact Information Parsing rule leverages the powerful Google Places API to help you validate and auto-complete your addresses. It can:
- Validate address
- Auto-complete missing address parts
- Suggest best-match addresses
Normalize And Infer Missing Address Parts
Your address data may be missing parts that can be easily inferred from the parts you do have. Using the Infer Data rule and the reference data from Openprise’s large Open Data Catalog, you can:
- Infer missing city and state data from postal code data
- Infer missing country data from state and phone number data
- Complete and normalize telephone number using country data
- Infer missing contact data from IP-inferred data
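Conceptually, this kind of inference is a series of lookups against reference data. The Python sketch below shows the idea with tiny, hypothetical lookup tables standing in for a real reference data set; the table contents reuse this article's fictional Springfield example, and `infer_missing` is an illustrative name, not an Openprise function.

```python
import re

# Hypothetical stand-ins for reference data; the ZIP entry is the
# fictional example used earlier in this article.
POSTAL_TO_CITY_STATE = {"91234": ("Springfield", "CA")}
STATE_TO_COUNTRY = {"CA": "US"}
COUNTRY_CALLING_CODE = {"US": "1"}


def infer_missing(record):
    """Return a copy of record with inferable fields filled in."""
    out = dict(record)

    # City and state from the first five digits of the postal code.
    zip5 = (out.get("postal_code") or "")[:5]
    if zip5 in POSTAL_TO_CITY_STATE:
        city, state = POSTAL_TO_CITY_STATE[zip5]
        out["city"] = out.get("city") or city
        out["state"] = out.get("state") or state

    # Country from state.
    if not out.get("country") and out.get("state") in STATE_TO_COUNTRY:
        out["country"] = STATE_TO_COUNTRY[out["state"]]

    # Complete the phone number with the country calling code.
    phone = out.get("phone")
    code = COUNTRY_CALLING_CODE.get(out.get("country"))
    if phone and code and not phone.strip().startswith("+"):
        digits = re.sub(r"\D", "", phone)
        out["phone"] = f"+{code}{digits}"
    return out
```

Existing values are never overwritten; only empty fields are filled. Each step depends on the previous one, so the order matters: the postal code yields the state, the state yields the country, and the country yields the calling code.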
Address parts like state name (CA or California), country name (United States or USA), and even street name (Ave or Avenue) can also be easily normalized using the Simple Replacement / Normalization rule.
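Replacement-style normalization boils down to mapping every known variant to one canonical value. Here is a minimal Python sketch of that idea; the mappings, the canonical forms chosen, and the function names are assumptions for illustration and do not reflect how the Simple Replacement / Normalization rule is configured.

```python
# Hypothetical variant-to-canonical mappings (keys are lowercase).
STATE_NAMES = {"california": "CA", "ca": "CA"}
COUNTRY_NAMES = {"united states": "US", "usa": "US", "u.s.a.": "US", "us": "US"}
STREET_SUFFIXES = {"ave": "Avenue", "av": "Avenue", "avenue": "Avenue"}


def normalize_value(value, mapping):
    """Map a whole field value to its canonical form, if one is known."""
    cleaned = value.strip()
    return mapping.get(cleaned.lower(), cleaned)


def normalize_street(street):
    """Normalize only the trailing street-type word, e.g. 'Ave' -> 'Avenue'."""
    tokens = street.split()
    if tokens:
        key = tokens[-1].rstrip(".").lower()
        tokens[-1] = STREET_SUFFIXES.get(key, tokens[-1])
    return " ".join(tokens)
```

Unknown values pass through unchanged, so the mapping tables can grow over time without breaking records they do not yet cover.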
Clean And Normalize Company Name
Do you have company names like Toyota USA, Toyota Motors U.S.A., and Toyota Motor Sales Corp, USA in your database that all refer to the same company? Openprise’s Company Name Clean-up rule can scrub your company data to remove noise such as corporate entity words, abbreviations, special characters, extra spaces, and inconsistent upper/lower case. Openprise can also help you create a company master list from your raw company data, using statistical models and fuzzy search algorithms to find and correlate all the different variations of each company name. This company master list can then be used to fully normalize your company data and keep it clean.
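As a rough illustration of the two steps described here (scrub the noise, then correlate the variants), the Python sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in fuzzy matcher. It is not the algorithm Openprise uses; the noise word list, the 0.6 threshold, and the function names are assumptions, and the cleanup only handles ASCII characters.

```python
import re
from difflib import SequenceMatcher

# Hypothetical noise words: corporate entity words and abbreviations.
NOISE_WORDS = {"corp", "corporation", "inc", "llc", "ltd", "co", "usa"}


def clean_company_name(name):
    """Lowercase, drop punctuation and noise words, collapse spaces."""
    text = name.lower().replace(".", "")      # "U.S.A." -> "usa"
    text = re.sub(r"[^a-z0-9]+", " ", text)   # other special characters -> space
    tokens = [t for t in text.split() if t not in NOISE_WORDS]
    return " ".join(tokens)


def group_variants(names, threshold=0.6):
    """Greedily group raw names whose cleaned forms look alike."""
    groups = []  # each group is a list of (raw, cleaned) pairs
    for raw in names:
        cleaned = clean_company_name(raw)
        for group in groups:
            if any(SequenceMatcher(None, cleaned, other).ratio() >= threshold
                   for _, other in group):
                group.append((raw, cleaned))
                break
        else:
            groups.append([(raw, cleaned)])
    return [[raw for raw, _ in group] for group in groups]
```

With the three Toyota variants above plus an unrelated name such as Acme Corp, the Toyota names land in one group and Acme Corp in another. A greedy pass like this is sensitive to input order and threshold choice, which is one reason production-grade matching relies on more robust statistical models.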
Some More Helpful Resources
If you have to deal with unstructured contact data, give Openprise data automation a try. Here are 3 Cook Books that can help you with your project: