Data is central to every business function. Major business processes and decisions depend on relevant data. In today's competitive, technology-driven world, however, the mere existence of data does not ensure that the corresponding business functions will run smoothly. The quality of the underlying data is of paramount importance in ensuring correct decisions.
The quality of a set of data is judged by many parameters, including accuracy, consistency, reliability, completeness, usefulness and timeliness. Poor-quality data refers to missing, invalid, irrelevant, outdated or incorrect data. Poor data quality does not necessarily mean the data was acquired incorrectly; for many other reasons, data that is absolutely valid at one point in time, or for one function, can become totally wrong at another time or for another business function.
Deterioration of data quality may occur at several stages of the data life cycle in a business enterprise, which can be broadly categorized as follows:

1. Entry of data into the system
2. Internal processes that modify existing data
3. Ageing and decay of the data itself

These categories are described below in greater detail, along with the actual processes occurring within each.
Data enters a company's systems in many ways, both manual and automated. Migrating data in large volumes can introduce serious errors, and many factors contribute to the poor quality of incoming data.
When data is migrated from an existing database to another, a host of quality issues can arise. The source data may itself be incorrect owing to its own limitations; the mapping of the old database to the new one may be inconsistent; or the conversion routines may map the data incorrectly. 'Legacy' systems are often found to have metadata that is out of sync with the schema actually implemented. The accuracy of the data dictionary serves as the basis for the conversion algorithms and mapping effort; if the dictionary and the actual data are out of sync, major data quality problems can follow.
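One way to catch this before a migration begins is to check a sample of live records against the data dictionary that will drive the conversion. The sketch below is a minimal, hypothetical illustration (the field names and the shape of the dictionary are assumptions, not a real system's layout):

```python
# Hypothetical sketch: verify that a data dictionary still matches the
# records actually stored, before using it to drive a migration.

def dictionary_drift(data_dictionary, sample_records):
    """Return field names where the dictionary and live data disagree.

    data_dictionary maps field name -> expected Python type.
    sample_records is a list of dicts drawn from the source system.
    """
    drifted = set()
    for record in sample_records:
        # Fields present in the data but missing from the dictionary.
        drifted.update(set(record) - set(data_dictionary))
        for field, expected_type in data_dictionary.items():
            value = record.get(field)
            # Fields missing from the data, or holding an unexpected type.
            if value is None or not isinstance(value, expected_type):
                drifted.add(field)
    return sorted(drifted)
```

Running such a check over a representative sample turns the out-of-sync dictionary problem into an explicit report rather than a silent conversion error.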
Business mergers and acquisitions lead to data consolidation. Since the focus is mostly on streamlining business processes, joining the data is usually given less importance. This can be catastrophic, especially if data experts from the previous company are not involved in the consolidation and the metadata is out of sync. Merging two databases with incompatible fields can force data into the wrong 'fit' and harm accuracy badly.
Data is often fed into the system manually and is hence prone to human error. User data entered through various user-friendly interfaces may not be directly compatible with the internal data representation. In addition, end-users tend to enter 'shortcut' values in fields they perceive to be unimportant but which may be crucial to internal data management. The data operator may lack the expertise to understand this data and might enter values in the wrong fields or mistype the information.
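A field-level format check at the point of entry catches many such 'shortcut' and mistyped values before they reach the internal representation. The sketch below is a simplified illustration; the field names and formats are hypothetical:

```python
# Hypothetical sketch: validate manually keyed fields before they reach
# the internal representation, rejecting 'shortcut' or mistyped values.
import re

def validate_entry(entry, rules):
    """Return a list of (field, message) problems for one keyed-in record.

    rules maps field name -> compiled regular expression the value
    must match in full.
    """
    problems = []
    for field, pattern in rules.items():
        value = entry.get(field, "")
        if not pattern.fullmatch(value):
            problems.append((field, f"value {value!r} fails format check"))
    return problems
```

An operator typing "N/A" into a postal-code field, for instance, would be stopped at entry time instead of polluting downstream processing.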
Automated processes are often used to load large volumes of similar data in batches, as this saves effort and time. But the systems pushing this bulk data can pump in equally huge amounts of wrong data. This can be disastrous, especially when the data travels through a series of downstream databases: wrong data can trigger incorrect processes, leading to incorrect decisions with huge adverse impacts for the firm. The data flow across the integrated systems is usually not fully tested, and any software upgrade in the data chain with inadequate regression testing can severely damage live data.
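One common defence is to screen each batch before it is loaded downstream, quarantining bad rows and refusing the whole batch when the error rate is implausibly high. This is a minimal sketch under assumed inputs (the validity predicate and threshold are placeholders a real feed would define):

```python
# Hypothetical sketch: screen a bulk feed before loading it downstream,
# so one bad batch cannot propagate through the chain of databases.

def screen_batch(rows, is_valid, reject_threshold=0.05):
    """Split rows into (clean, quarantined); raise if too much is bad.

    is_valid is a caller-supplied predicate; reject_threshold is the
    fraction of bad rows above which the whole batch is refused.
    """
    clean, quarantined = [], []
    for row in rows:
        (clean if is_valid(row) else quarantined).append(row)
    if rows and len(quarantined) / len(rows) > reject_threshold:
        raise ValueError(
            f"batch rejected: {len(quarantined)} of {len(rows)} rows invalid")
    return clean, quarantined
```

Rejecting a suspicious batch outright, rather than loading its valid remainder, is a deliberate choice: a high error rate usually signals a fault in the upstream system rather than isolated bad rows.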
Real-time feeds sit at the opposite extreme from batch feeds. With real-time interfaces and applications becoming the standard for interactive, responsive user experiences, data enters the database in real time and often propagates quickly through the chain of interconnected databases. This triggers actions and responses that may be visible to the user almost immediately, leaving little room for validation and verification. It opens a huge hole in data quality assurance, where a single wrong entry can cause havoc at the back end.
The company may be running processes that modify data residing within its systems, which can inadvertently introduce errors into the data. The following processes are responsible for internal changes in enterprise data.
Data in an enterprise needs to be processed regularly for summarization, calculation and clean-up. There may be a well-tested, proven cycle of such processing from the past, but the code of the collation programs, the processes themselves and the actual data all evolve with time; a repeated cycle of collation may therefore not yield similar results. The processed data may be completely off the mark, and if it forms the basis of further processing, the error can multiply as it travels downstream.
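A classic safeguard for repeated collation cycles is a control-total check: the summary's grand total must equal the grand total of the raw input. The sketch below is a hypothetical illustration of the idea, not any particular system's routine:

```python
# Hypothetical sketch: a control-total check run after each collation
# cycle, so drifting code or data is caught before errors cascade.

def collate_with_check(transactions):
    """Summarize amounts per account, verifying against a control total.

    transactions is a list of (account, amount) pairs.
    """
    summary = {}
    for account, amount in transactions:
        summary[account] = summary.get(account, 0) + amount
    # The grand total of the summary must equal the grand total of the
    # raw input; a mismatch signals a bug in the collation step.
    if sum(summary.values()) != sum(a for _, a in transactions):
        raise AssertionError("control totals disagree; collation output suspect")
    return summary
```

Because the check runs on every cycle, a regression introduced by a code or data change surfaces immediately instead of propagating into later processing.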
Every company needs to rectify its incorrect data periodically. Manual cleansing has largely been replaced by time- and effort-saving automation. While very helpful, this carries the risk of wrongly affecting thousands of records at once: the cleansing software may have bugs, or the data specifications underlying the cleansing algorithms may be incorrect. The result can be to turn absolutely valid data invalid, virtually reversing the very advantage of the cleansing exercise.
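One way to limit the blast radius of a buggy cleansing rule is to run it in a dry-run mode first and keep an audit trail of every change it would make. The following is a minimal sketch under assumed inputs (the cleansing rule itself is a placeholder):

```python
# Hypothetical sketch: automated cleansing with a dry-run mode and an
# audit trail, limiting the blast radius of a buggy cleansing rule.

def cleanse(records, fix, dry_run=True):
    """Apply fix() to each record; in dry-run mode only report changes.

    Returns (records_out, audit) where audit lists (index, before, after)
    for every record the rule would alter.
    """
    audit = []
    out = []
    for i, record in enumerate(records):
        fixed = fix(record)
        if fixed != record:
            audit.append((i, record, fixed))
        out.append(record if dry_run else fixed)
    return out, audit
```

Reviewing the audit of a dry run before committing the changes turns a mass update into a verifiable, reversible step.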
Old data must constantly be removed from the system to save valuable storage space and reduce the effort of maintaining mammoth volumes of obsolete information. Since purging destroys data, a wrong or accidental deletion can seriously harm data quality. Just as with cleansing, bugs and incorrect data specifications in the purging software can unleash unwarranted destruction of valuable data, and at times valid data may incorrectly match the purging criteria and get erased.
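Two simple safeguards follow from this: archive what the criteria match rather than deleting it outright, and abort when the criteria match an implausibly large share of the data. A minimal sketch, with a hypothetical obsolescence predicate and threshold:

```python
# Hypothetical sketch: a purge routine that archives rows before deleting
# them and refuses to run if the criteria match a suspicious share of data.

def purge(rows, is_obsolete, max_fraction=0.5):
    """Return (kept, archived); abort if too many rows match the criteria."""
    archived = [r for r in rows if is_obsolete(r)]
    if rows and len(archived) / len(rows) > max_fraction:
        # A criteria bug could otherwise destroy valid data wholesale.
        raise ValueError("purge aborted: criteria match an implausible share")
    kept = [r for r in rows if not is_obsolete(r)]
    return kept, archived
```

The archived rows remain recoverable until the purge has been verified, so an incorrect specification no longer means irreversible loss.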
Data inside the company's databases is subjected to manipulation through system upgrades, database redesigns and similar exercises. This creates data quality issues, since the personnel involved may not be data experts and the data specifications may be unreliable. Such deterioration of data is termed data decay, and it occurs for several reasons.
Data represents real-world objects that change on their own with time, and the data representation may not keep up with this change. Such data ages silently and degrades into a meaningless form. In interconnected systems, moreover, changes made to one branch may not be propagated to the interfacing systems. This can create huge inconsistencies that show up adversely at a later stage, often after the damage has been done.
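Silently aged data can be surfaced by tracking when each record was last verified and flagging anything outside a freshness window. The sketch below is a hypothetical illustration; the record shape and the one-year window are assumptions:

```python
# Hypothetical sketch: flag records whose last verification predates a
# freshness window, so silently aged data surfaces before it does harm.
from datetime import date, timedelta

def stale_records(records, today, max_age_days=365):
    """Return ids of records not verified within max_age_days of today.

    Each record is (id, last_verified) with last_verified a date.
    """
    cutoff = today - timedelta(days=max_age_days)
    return [rid for rid, verified in records if verified < cutoff]
```

Feeding the flagged ids into a periodic re-verification process keeps the representation from drifting too far from the real-world objects it describes.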
System upgrades are inevitable, and such exercises rely heavily on the data specifications for the expected data representation. In reality, the data is often far from the documented behavior, and the result is chaotic. A poorly tested system upgrade can hence cause irreversible damage to data quality.
Businesses need to find more revenue-generating uses for existing data, and this can open up a new set of issues. Data gathered for one purpose may not, in practice, suit another objective, and using it for new purposes may lead to incorrect interpretations and assumptions in the new area.
Data and its experts form a tight bond: a long-time expert has an 'eye' for wrong data, is well versed in the exceptions, and knows how to extract the relevant data usefully and discard the rest, thanks to long years of association with the 'legacy'. When such experts retire, move on or are let go after a merger, the new data handlers may be unaware of the anomalies the experts used to rectify, and wrong data may travel unchecked into a process.
As more applications with higher levels of automation share huge volumes of data, users gain more exposure to erroneous internal data that was previously hidden, and companies stand to lose credibility when it is exposed. Automation cannot replace the need to validate information; intentional and unintentional tweaking of data by users can also cause data decay that is beyond the company's control.

In conclusion, data quality can be lost through the processes that bring data into the system, through those that clean up and modify the data, and through ageing and decay, where the data itself may fail to change with time.
New business models in today's rapidly changing environment throw up innovative new functions and processes every day, and each cycle of automation or revamp of an existing process brings its own unforeseen challenges to data quality. The key to ensuring data quality is a dedicated study of the data flow within each process, together with a regular audit and monitoring mechanism to detect data decay.
A blend of automation and manual validation and cleansing by trained data operators is the need of the hour. Data quality must be a key focus for any enterprise that wants to ensure smooth operations and continued growth.