High-quality data is a prerequisite for making sound business decisions. In practice, however, the quality of a dataset often turns out to be poor owing to inconsistencies, errors, and missing data, among other causes. Data inconsistency arises for many reasons, including incorrect manual entry, misspellings, missing information, and redundant data stored in different representations.
Failing to correct erroneous data can cause major problems during downstream data processing, leading to wrong business decisions that can be extremely costly for the organization. Data managers must therefore ensure that data cleansing procedures are in place; a data entry outsourcing specialist will typically have systematic data cleansing and scrubbing procedures already established.
Data cleansing, or scrubbing, is the process of detecting and removing inconsistencies and errors from data to improve its quality. The need for data cleansing increases significantly when multiple data sources are integrated. The process of making data accurate and consistent is riddled with problems, some of which are described below:
Applications such as data warehouses continuously load huge amounts of data from a variety of sources, and that data carries a significant amount of dirty data (data errors). In such cases the task of data cleansing becomes both significant and formidable.
Misspellings occur mostly due to typing errors. Wrong spellings can be detected and corrected for common words and grammatical mistakes; however, because databases contain huge amounts of unique data, spelling mistakes are hard to detect at input level. Spelling mistakes in data such as names and addresses are especially difficult to identify and correct.
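As a sketch of one common approach, a fuzzy lookup against a trusted reference list can flag likely misspellings. The list of valid cities below is a made-up example; a real system would load its reference values from a master table.

```python
import difflib

# Hypothetical reference list of valid values; in practice this would
# come from a trusted master table, not be hard-coded.
VALID_CITIES = ["Berlin", "Munich", "Hamburg", "Frankfurt"]

def suggest_correction(value, valid_values, cutoff=0.8):
    """Return the closest valid value, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(value, valid_values, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

A typo such as "Berln" would be matched back to "Berlin", while a value with no close counterpart returns None and can be routed for manual review.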
Lexical errors occur when the structure of the data items does not match the specified format. For example, suppose a database records attributes for name, age, sex and height. When an individual skips an intermediate value, say age, the data for the following attributes shifts by one field: the value for sex, say male, is read as age, and the value for height is read as sex.
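Such field shifts can often be caught with simple per-field type checks. The sketch below assumes a (name, age, sex, height) record with numeric age and height; the field names and rules are illustrative only.

```python
def detect_shift(record):
    """Check a (name, age, sex, height) tuple of strings for a field shift.
    Assumed rules: age and height must be numeric, sex must be male/female."""
    errors = []
    name, age, sex, height = record
    if not age.isdigit():
        errors.append("age")
    if sex not in ("male", "female"):
        errors.append("sex")
    if not height.isdigit():
        errors.append("height")
    return errors
```

A shifted record such as ("John", "male", "180", "") fails every check after the name, which is the telltale signature of a skipped intermediate value.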
A misfielded value occurs when the value entered is correct as far as format is concerned but does not belong in that field. For example, in a city field the recorded value is "Germany".
Domain format errors occur when the value for a particular attribute is correct but does not comply with the format of the domain. For example, a NAME field may require the surname and first name to be separated by a comma, but the input omits the comma. The input may be correct, yet it does not comply with the domain format.
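Domain formats are usually easy to enforce with a pattern check. The sketch below assumes the "Surname, Firstname" convention from the example above; the exact pattern is an assumption and would be adapted to the real domain rules.

```python
import re

# Assumed domain format: "Surname, Firstname" separated by a comma.
NAME_FORMAT = re.compile(r"^[A-Za-z'-]+,\s*[A-Za-z'-]+$")

def conforms_to_format(name):
    """Return True if the name string follows the assumed domain format."""
    return bool(NAME_FORMAT.match(name))
```

Values that fail the check, such as "John Smith" without the comma, can be queued for reformatting rather than loaded as-is.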
Irregularities are the non-uniform use of units or values. For example, when entering employee salaries, the amounts may be recorded in different currencies. Such data requires subjective interpretation and can easily produce wrong results.
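The usual remedy is to normalize everything to a single reporting unit. This is only a sketch: the exchange rates below are invented, and a real pipeline would pull current rates from a finance system.

```python
# Hypothetical exchange rates to a common reporting currency (USD);
# real rates would come from a finance system, not be hard-coded.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1, "GBP": 1.3}

def normalize_salary(amount, currency):
    """Convert a salary amount into USD so values are directly comparable."""
    return round(amount * RATES_TO_USD[currency], 2)
```

Once all salaries share one currency, comparisons and aggregations no longer depend on anyone's interpretation of the raw values.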
Missing values occur as a result of omissions while collecting the data; they signify that a value was unavailable during data entry. Both dummy values and null values count as missing values, for example 000-0000 and 999-9999 in a telephone number field.
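Dummy values are dangerous precisely because they look like real data, so they are worth flagging explicitly. The placeholder set below is an assumption based on the examples above; the real set should be derived from your own data profiling.

```python
# Assumed placeholder patterns used to fake a mandatory phone entry;
# profile your own data to build the real set.
DUMMY_PHONES = {"000-0000", "999-9999", "", None}

def is_missing(phone):
    """Treat known dummy values, empty strings and nulls as missing."""
    return phone in DUMMY_PHONES
```

Flagged records can then be handled deliberately, for example by re-requesting the value, rather than silently counted as valid entries.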
A contradiction error occurs when the same real-world entity is described by two different values in the data. For example, a personnel database holds two records for the same person with two different dates of birth, while all other values are identical.
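Contradictions of this kind can be surfaced by grouping records on the entity key and checking for more than one distinct value. A minimal sketch, assuming records of (person_id, date_of_birth):

```python
from collections import defaultdict

def find_contradictions(records):
    """records: iterable of (person_id, date_of_birth) pairs.
    Return the ids that carry more than one distinct date of birth."""
    seen = defaultdict(set)
    for pid, dob in records:
        seen[pid].add(dob)
    return sorted(pid for pid, dobs in seen.items() if len(dobs) > 1)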
Duplication occurs when the same data is represented multiple times on account of a data entry error. For example, there may be two records for the same person that are identical except for a minor difference in the name, such as the middle name being omitted in one entry. Neither record is wrong in itself, but the person is represented twice because duplicity was not checked.
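The middle-name case can be caught by normalizing names before comparing records. The sketch below keeps only the first and last name token; the (name, dob, city) record shape is an assumption for illustration.

```python
def normalize_name(name):
    """Keep only the first and last token so 'John A. Smith' and
    'John Smith' compare equal; lowercase for case-insensitive matching."""
    parts = name.split()
    return (parts[0] + " " + parts[-1]).lower() if len(parts) > 1 else name.lower()

def find_duplicates(records):
    """records: iterable of (name, dob, city). Flag pairs that agree on
    every field once the name has been normalized."""
    seen = {}
    dupes = []
    for name, dob, city in records:
        key = (normalize_name(name), dob, city)
        if key in seen:
            dupes.append((seen[key], name))
        else:
            seen[key] = name
    return dupes
```

Production deduplication normally goes further (phonetic keys, edit distance, blocking), but the principle is the same: compare on a canonical form, not the raw entry.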
Integrity constraint violations describe values that do not satisfy the integrity constraints defined for the data. They occur when an input value lies outside the range of values allowed for a particular attribute.
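A range check is the simplest form of such a constraint. The age bounds below are an assumed example; real constraints belong in the database schema as well as in the cleansing code.

```python
# Assumed constraint for illustration: age must be an integer in [0, 120].
def violates_age_constraint(age):
    """Return True if the value breaks the assumed age constraint."""
    return not (isinstance(age, int) and 0 <= age <= 120)
```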
Another class of problems is the use of cryptic values and abbreviations in fields, for example recording only a college's initials instead of its full name. Such entries increase the chance of duplication and reduce the ability to sort the data.
Referential errors occur when the value of a secondary attribute does not match the primary attribute, for example when the listed city does not lie inside the stated country, or the postal zip code does not coincide with the stated city.
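Cross-field consistency like this is typically verified against a lookup table. The city-to-country mapping below is a tiny made-up example; a real check would use a full geographic reference dataset.

```python
# Hypothetical lookup of which country each city belongs to;
# a real system would use a complete geographic reference table.
CITY_COUNTRY = {"Berlin": "Germany", "Paris": "France"}

def city_matches_country(city, country):
    """Return True if the city is known to lie in the stated country."""
    return CITY_COUNTRY.get(city) == country
```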
Wrong-reference errors inhibit data validation and result in data mismatches. For example, if an individual enters the wrong value in a reference department field, the subsequent data validation step produces a mismatch.
Embedded values occur when multiple values are entered in the same field, a practice that seriously restricts the ability to index and sort the data. An example is entering the values for name, age and sex all in the name field.
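When the embedded layout is predictable, the values can sometimes be recovered with a pattern match. The sketch below assumes a "name age sex" layout as in the example above; irregular entries still need manual repair.

```python
import re

def split_embedded(value):
    """Try to pull name, age and sex back out of a single field like
    'John Smith 42 male'; return None when the assumed pattern fails."""
    m = re.match(r"^(.+?)\s+(\d{1,3})\s+(male|female)$", value.strip())
    return m.groups() if m else None
```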
Data cleansing is an integral part of data management. It is necessary for making data accurate and consistent and for avoiding duplication of information. This article has highlighted the common problems encountered while cleansing data and aims to serve as a guideline for the data quality improvement and data cleansing process. Each of the above problems can be avoided if proper procedures are followed during the design and execution of the cleansing task.
Outsourcing the data scrubbing task to an expert provider of data cleansing services can considerably speed up the work and ensure that your data gets, and stays, clean.