Database Cleansing - the What, Why and How
In this article I will attempt to set out simply what it means to clean a database, why you need to do it, and how you might go about it.
Cleaning a database is done to:
To explain why, I am going to use the example of a customer database, but the principles apply to other types of data also.
Have you ever received the same marketing message or catalogue in the mail two or more times? I regularly receive multiple copies of such communications, and I don't always get around to telling the sender about their mistake. This can:
In addition, cleaning your data will help you to analyse it more accurately. For instance, you will know the real number of contacts and perhaps how they are geographically distributed, rather than relying on the distorted figures that come from analysing a database full of duplicates and errors.
It's not a crime! In fact, it is very easy for your data to get into a state that requires cleaning. For example, when a client changes their address, your staff might update the suburb but forget to put in the new postcode. Or an existing client might return to your organisation several years later without telling new staff that they are already a client. If your database doesn't have appropriate keys to prevent duplicates, that client could be set up again as another customer with the same or similar details.
Having documented processes that your staff can use as a checklist, and appropriate unique keys on your database fields, will go some way to ensuring that your data is kept clean, but incorrect data can never be prevented entirely.
"How", then, do you efficiently clean your database?
Fixing incorrect information, such as a postcode that does not match the suburb, is usually done by comparing each record to the correct values in another table. For example, to correct all the postcodes in your data, assuming that the suburb entered is correct, you would write SQL code that compares the postcode of each record against a reference table of postcode + suburb + state, which you may have obtained from Australia Post. Such a process will usually also produce a list of records where the suburb was not found in the reference table, and you will need to investigate and correct those manually.
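To make the postcode check a little more concrete, here is a minimal sketch using Python's built-in sqlite3 module, so that the SQL can be run without any other software. The table names (customer, postcode_ref), the column names and the sample rows are all my own assumptions for illustration, and I have assumed the reference table holds exactly one postcode for each suburb + state. Your own database and reference data will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: "customer" is your data, "postcode_ref" is the reference list
    CREATE TABLE customer (id INTEGER PRIMARY KEY, suburb TEXT, postcode TEXT, state TEXT);
    CREATE TABLE postcode_ref (suburb TEXT, postcode TEXT, state TEXT);
    INSERT INTO customer VALUES
        (1, 'Frankston', '3199', 'VIC'),
        (2, 'Frankston', '3000', 'VIC'),   -- suburb updated, old postcode left behind
        (3, 'Franskton', '3199', 'VIC');   -- misspelt suburb, not in the reference list
    INSERT INTO postcode_ref VALUES
        ('FRANKSTON', '3199', 'VIC'),
        ('MELBOURNE', '3000', 'VIC');
""")

# Step 1: where the suburb + state is found but the postcode differs,
# overwrite the postcode with the reference value (trusting the suburb).
conn.execute("""
    UPDATE customer
    SET postcode = (SELECT r.postcode FROM postcode_ref r
                    WHERE r.suburb = UPPER(customer.suburb)
                      AND r.state = customer.state)
    WHERE EXISTS (SELECT 1 FROM postcode_ref r
                  WHERE r.suburb = UPPER(customer.suburb)
                    AND r.state = customer.state
                    AND r.postcode <> customer.postcode)
""")
conn.commit()

# Step 2: list the records whose suburb was not found, for manual follow-up.
not_found = conn.execute("""
    SELECT c.id, c.suburb
    FROM customer c
    LEFT JOIN postcode_ref r
           ON r.suburb = UPPER(c.suburb) AND r.state = c.state
    WHERE r.suburb IS NULL
    ORDER BY c.id
""").fetchall()

for customer_id, suburb in not_found:
    print(f"Suburb not found for customer {customer_id}: {suburb}")
```

In this sketch customer 2 has its postcode corrected, while customer 3 is only reported, because there is no safe way to guess what the misspelt suburb should have been.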
Correcting the formatting of your data is usually done using some fairly simple SQL, perhaps combined with a little programming logic. You need to decide on the format you wish to apply to your data, for example, whether you would like the suburb in title case or all capitals. While this is much less important than getting the data itself right, it can help to make your communications look more professional.
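As a rough illustration of combining SQL with a little programming logic, the sketch below applies title case to the suburb and capitals to the state. It uses Python's built-in sqlite3 module, and the table, columns and sample rows are assumptions of mine. SQLite has an UPPER function but no title-case function, so I have borrowed Python's string.capwords and registered it under the hypothetical name TITLE_CASE.

```python
import sqlite3
from string import capwords

conn = sqlite3.connect(":memory:")
# Hypothetical table and sample rows, for illustration only
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, suburb TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "FRANKSTON", "vic"), (2, "  mount   eliza ", "Vic")],
)

# Make the Python function callable from SQL under the name TITLE_CASE.
# capwords also trims the value and collapses repeated spaces.
conn.create_function("TITLE_CASE", 1, capwords)

conn.execute("""
    UPDATE customer
    SET suburb = TITLE_CASE(suburb),
        state  = UPPER(TRIM(state))
    WHERE suburb IS NOT NULL AND state IS NOT NULL
""")
conn.commit()

rows = conn.execute("SELECT id, suburb, state FROM customer ORDER BY id").fetchall()
for row in rows:
    print(row)
```

A simple rule like this will not suit every name (hyphenated suburbs, for example, only get their first letter capitalised), so it is worth checking a sample of the results before applying it to the whole database.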
Finding duplicates is a fairly easy task for someone who knows a little about the SQL database language. It is more difficult to find similar records that really are the same person, but are not listed in exactly the same way in your database. For instance, the following two records may actually be the same person:

ID    Firstname  Surname  Address1       Address2  Suburb     Postcode  State
3442  John       Citizen  PO Box 33                Frankston  3199      VIC
682   Jonathon   Citien   14 Beach Road            FRANKSTON  3199      VIC
Finding records such as the above calls for what is usually called "Fuzzy" Matching. Software is available to find such records, and much more experienced SQL programmers could write software to find such possible duplicates.
Because you can't confidently use logic to determine whether or not two records are the same in the case given above, usually fuzzy matching would leave the data as is, but produce an exception report, highlighting likely duplicate records.
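To give a feel for how such an exception report might be produced, here is a minimal sketch using the SequenceMatcher class from Python's standard difflib module. This is just one simple way of scoring similarity, not how any particular fuzzy matching product works. The third customer, the 0.7 threshold and the rule of only comparing records with the same postcode are all assumptions of mine, and would need tuning against your own data.

```python
from difflib import SequenceMatcher
from itertools import combinations

customers = [
    {"id": 3442, "firstname": "John", "surname": "Citizen", "postcode": "3199"},
    {"id": 682, "firstname": "Jonathon", "surname": "Citien", "postcode": "3199"},
    # Hypothetical extra record that should NOT be flagged
    {"id": 9001, "firstname": "Mary", "surname": "Smith", "postcode": "3199"},
]

THRESHOLD = 0.7  # assumed cut-off; lower values flag more (and weaker) matches


def name_key(record):
    """Full name in lower case, so that capitalisation does not affect the score."""
    return f"{record['firstname']} {record['surname']}".lower()


def likely_duplicates(records, threshold=THRESHOLD):
    """Return (id, id, score) for each pair of records that looks like one person."""
    report = []
    for a, b in combinations(records, 2):
        if a["postcode"] != b["postcode"]:
            continue  # only compare people in the same postcode
        score = SequenceMatcher(None, name_key(a), name_key(b)).ratio()
        if score >= threshold:
            report.append((a["id"], b["id"], round(score, 2)))
    return report


# The data itself is left untouched; we only report the suspects.
report = likely_duplicates(customers)
for id_a, id_b, score in report:
    print(f"Possible duplicate: {id_a} and {id_b} (similarity {score})")
```

Note that this compares every record with every other record, which is fine for a few thousand customers but becomes slow on a large database. Restricting comparisons to records that share a postcode, as above, is one way to cut the work down.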
Even when you can confidently determine that two records are the same, you may wish to process the cleanup manually to ensure that only the correct data is kept, and that all associated pieces of information, e.g. customer payment history, are transferred across to the valid record. It is possible, however, to set up your de-duplication process to remove all the duplicates and clean up all the records automatically.
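Whether it is triggered manually or automatically, merging two records usually comes down to the same two steps: move the associated information to the record you are keeping, then remove the duplicate. The sketch below shows this with Python's built-in sqlite3 module. The customer and payment tables, their columns and the merge_customers function are hypothetical, and a real database will probably have several related tables to update rather than one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables and sample rows, for illustration only
    CREATE TABLE customer (id INTEGER PRIMARY KEY, firstname TEXT, surname TEXT);
    CREATE TABLE payment (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id),
        amount_cents INTEGER
    );
    INSERT INTO customer VALUES (682, 'Jonathon', 'Citizen'), (3442, 'John', 'Citizen');
    INSERT INTO payment VALUES (1, 682, 5000), (2, 3442, 7500);
""")


def merge_customers(conn, keep_id, duplicate_id):
    """Move the duplicate's payment history to the kept record, then delete the duplicate."""
    # "with conn" commits if both statements succeed and rolls back if either fails,
    # so payments are never left pointing at a deleted customer.
    with conn:
        conn.execute(
            "UPDATE payment SET customer_id = ? WHERE customer_id = ?",
            (keep_id, duplicate_id),
        )
        conn.execute("DELETE FROM customer WHERE id = ?", (duplicate_id,))


merge_customers(conn, keep_id=682, duplicate_id=3442)

remaining = conn.execute("SELECT id FROM customer ORDER BY id").fetchall()
payments = conn.execute(
    "SELECT id, customer_id FROM payment ORDER BY id"
).fetchall()
print(remaining, payments)
```

Deciding which record to keep, and which of the two addresses is current, is the part that often still needs a person to look at it.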
Cleaning your database can take some time, and some manual effort on the part of your staff. If you are just starting out with a new database, it is very worthwhile to:
If you need help cleaning your database, Contact Point can help you. We provide a quick and efficient service to deal with all the database issues discussed above, and can tailor our service to meet your particular needs. Submit a request now for an obligation-free quote.
Source: Free Articles from ArticlesFactory.com
ABOUT THE AUTHOR
Heather Maloney is the Managing Director of Contact Point IT Services Pty Ltd, a business specialising in the provision of IT solutions that deliver measurable value to small and medium-sized businesses. In particular, Contact Point focuses on helping businesses to interact better with their clients, customers, suppliers and other third parties, from electronic marketing to online sales to back office systems.