BERD tips: Dimensions of data quality

8/25/2021

By Aman Kaur, Research Biostatistician

Checking the quality of your data is one of the most crucial steps before beginning any analysis. Poor-quality data can lead to inaccurate analyses and erroneous conclusions, and it is costly because it wastes human effort and time. Hence, it is imperative that data quality be checked, re-checked, and maintained throughout data collection and data management. Here are six dimensions of data quality that you can examine before analyzing any data.

Figure: 6 dimensions of data quality

Completeness

Completeness is the proportion of stored data against the potential of "100% complete." Completeness indicates whether there is enough information to draw a conclusion from the data and whether enough individuals responded to ensure representativeness. Some questions you might ask yourself include:

  • Does the reported data contain enough information to represent performance measure activities?
  • Did the reported data come from all stations and/or a random sampling of volunteers/service recipients?

Consider this example:

Parents of new students at a school are asked to complete a data collection sheet that includes medical conditions and emergency contact details and confirms the name, address, and date of birth of the student. At the end of the first week of the fall term, data analysis was performed on the ‘First Emergency Contact Telephone Number’ data item in the contact table. There are 300 students in the school, and 294 of the potential 300 records were populated. Therefore, 294/300 x 100 = 98% completeness has been achieved for this data item in the contact table. The measure is the count of non-missing 'First Emergency Contact Telephone Number' values divided by the count of all current students in the contact table.
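The completeness measure in this example can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name completeness_pct and the sample phone numbers are made up, and it assumes that a blank or whitespace-only entry counts as missing.

```python
from typing import Optional, Sequence


def completeness_pct(values: Sequence[Optional[str]]) -> float:
    """Return the percentage of entries that are populated.

    Assumption: None, empty strings, and whitespace-only strings are missing.
    """
    if not values:
        raise ValueError("no records to assess")
    populated = sum(1 for v in values if v is not None and v.strip() != "")
    return 100.0 * populated / len(values)


# Hypothetical contact table column: 294 populated numbers, 6 missing.
phone_numbers = ["555-0100"] * 294 + [None] * 4 + ["", "   "]

print(f"{completeness_pct(phone_numbers):.1f}% complete")  # 98.0% complete
```

Deciding up front what counts as "missing" (blank strings, placeholder values such as "N/A") matters, because it changes the numerator.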

Timeliness

Timeliness is the degree to which data represents reality from the required point in time.

Consider this example:

John Smith provides an updated emergency contact number on 1st June 2021, which the admin team enters into the student database on 4th June 2021. This is a delay of 3 days, which breaches the timeliness constraint because the service level agreement for changes is 2 days. The measure is the date the emergency contact number was entered in the student database (4th June 2021) minus the date it was provided (1st June 2021).
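The timeliness check above is a date subtraction compared against the service level agreement. A minimal Python sketch, where the names SLA_DAYS and entry_delay_days are illustrative and the delay is assumed to be counted in calendar days:

```python
from datetime import date

SLA_DAYS = 2  # service level agreement for changes, taken from the example


def entry_delay_days(provided: date, entered: date) -> int:
    """Calendar days between when a value was provided and when it was entered."""
    return (entered - provided).days


delay = entry_delay_days(provided=date(2021, 6, 1), entered=date(2021, 6, 4))
breached = delay > SLA_DAYS

print(delay, breached)  # 3 True
```

If the agreement is defined in working days rather than calendar days, the subtraction would need to skip weekends and holidays.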

Consistency

Consistency is the absence of difference when comparing two or more representations of a thing against a definition. Consistency considers the extent to which data is collected using the same process and procedures by everyone doing the data collection and in all locations over time.

Consider this example:

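In school administration, a student’s date of birth should have the same value and format in the school register as in the student database. One way to check is to select a count of distinct ‘Date of Birth’ values for each student across both sources; a count greater than one signals an inconsistency.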
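As a rough sketch of this comparison in Python, the snippet below checks each student's date of birth in two sources and reports whether a difference lies in the format or in the value. The student IDs, the sample records, and the two accepted date formats are all assumptions made for illustration.

```python
from datetime import date, datetime
from typing import Optional

# Assumed formats: ISO (YYYY-MM-DD) and European (DD/MM/YYYY).
FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_dob(text: str) -> Optional[date]:
    """Try each assumed format; return None if none of them fits."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def compare_dob(register_value: str, database_value: str) -> str:
    """Classify a pair of stored dates of birth."""
    if register_value == database_value:
        return "consistent"
    reg, db = parse_dob(register_value), parse_dob(database_value)
    if reg is not None and reg == db:
        return "format differs"
    return "value differs"


# Hypothetical records keyed by a made-up student ID.
school_register = {"S001": "2010-03-14", "S002": "2011-07-02", "S003": "2010-11-30"}
student_database = {"S001": "2010-03-14", "S002": "02/07/2011", "S003": "2010-12-30"}

report = {
    sid: compare_dob(school_register[sid], student_database[sid])
    for sid in sorted(school_register)
}
print(report)
```

Here S002 holds the same date in two formats, while S003 holds two different dates; both would show up as more than one distinct stored value for the student.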

Uniqueness

Uniqueness means that nothing will be recorded more than once based upon how that thing is identified.

Consider this example:

A school has 120 current students and 380 former students (i.e., 500 in total); however, the student database shows 520 different student records. These could include Fred Smith and Freddy Smith as separate records, despite there being only one student at the school named Fred Smith. This indicates a uniqueness of 500/520 x 100 = 96.2%.
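The uniqueness figure, and a first pass at finding records such as Fred Smith and Freddy Smith, might look like the following Python sketch. The function names, the sample names, and the 0.85 similarity threshold are assumptions; the standard-library difflib comparison stands in for whatever record-matching approach a project actually uses, and flagged pairs still need human review.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Sequence, Tuple


def uniqueness_pct(real_world_count: int, record_count: int) -> float:
    """Real-world things divided by the records that describe them, as a percentage."""
    if record_count <= 0:
        raise ValueError("record_count must be positive")
    return 100.0 * real_world_count / record_count


def possible_duplicates(
    names: Sequence[str], threshold: float = 0.85
) -> List[Tuple[str, str]]:
    """Flag pairs of names whose similarity ratio meets the (assumed) threshold."""
    flagged = []
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            flagged.append((a, b))
    return flagged


print(round(uniqueness_pct(500, 520), 1))  # 96.2
print(possible_duplicates(["Fred Smith", "Freddy Smith", "Ann Jones"]))
```

Comparing every pair is fine for a few hundred records but grows quadratically, so larger tables usually need a grouping step first (for example, by date of birth).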

Validity

Data is valid if it conforms to the syntax (format, type, range) of its definition. Validity indicates whether the data collected and reported by grantees appears to measure the approved performance measure or program goal. To determine validity, you may ask:

  • Is the data relevant?
  • Are your reported items consistent with the approved goals of the current grant and/or program?
  • Are you measuring what you intended to measure?

Consider this example:

Each class in a school is allocated a class identifier; this consists of the 3 initials of the teacher plus a two-digit year group number of the class. It is declared as AAA99 (3 alphabetic characters and 2 numeric characters). A new Class 9 teacher, Sally Hearn (who has no middle name), is appointed, so there are only two initials. A decision must be made as to how to represent two initials; otherwise the rule will fail and the database will reject the class identifier “SH09.” It is decided that an additional character “Z” will be added to pad the letters to 3: “SZH09.” However, this could break the accuracy rule. A better solution would be to amend the database to accept 2 or 3 initials and 1 or 2 digits. The measure is to check that the class identifier is 2 or 3 letters (a-z) followed by a 1- or 2-digit year group number from 7 to 11.
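The amended rule can be expressed as a regular expression. The Python sketch below is one possible reading of it: it assumes letters may be upper or lower case and that a single-digit year group may be written with or without a leading zero ("9" or "09"). The pattern and function names are illustrative.

```python
import re

# 2 or 3 letters, then a year group from 7 to 11:
#   0?[7-9] matches 7, 8, 9, 07, 08, 09
#   1[01]   matches 10, 11
CLASS_ID_PATTERN = re.compile(r"[A-Za-z]{2,3}(?:0?[7-9]|1[01])")


def is_valid_class_id(class_id: str) -> bool:
    """True if the whole identifier matches the assumed syntax rule."""
    return CLASS_ID_PATTERN.fullmatch(class_id) is not None


for candidate in ["SH09", "SZH09", "SH9", "SH12", "S09"]:
    print(candidate, is_valid_class_id(candidate))
```

Note that "SZH09" passes this check: validity only tests the syntax, so the padded identifier is valid even though it no longer reflects the teacher's real initials.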

Accuracy

Accuracy is the degree to which data correctly describes the "real world" object or event being described. Accuracy indicates whether the data is free from significant errors and whether the numbers make sense. To determine accuracy, you may ask:

  • Does the data vary significantly in unexpected ways?

Consider this example:

A European school is receiving applications for its annual September intake and requires students to be aged 5 before the 31st of August of the intake year. In this scenario, the parent, a US citizen applying to a European school, completes the Date of Birth (D.O.B.) on the application form in the US date format, MM/DD/YYYY, rather than the European DD/MM/YYYY format, causing the days and months to be reversed. As a result, 09/08/2005 really meant 08/09/2005 (8 September 2005) but was read as 9 August 2005, causing the student to be accepted as being aged 5 on the 31st of August 2010 when the student was in fact still 4. The representation of the student’s D.O.B., though valid in its US context, means that in Europe the age was not derived correctly, and the value recorded was consequently not accurate.
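The effect of the two date formats on the derived age can be reproduced with a short Python sketch. It assumes an intake year of 2010, the year in which a child born in 2005 turns 5, and the helper name age_on is illustrative.

```python
from datetime import date, datetime


def age_on(dob: date, on: date) -> int:
    """Age in whole years on a given date."""
    had_birthday = (on.month, on.day) >= (dob.month, dob.day)
    return on.year - dob.year - (0 if had_birthday else 1)


raw = "09/08/2005"       # value written on the application form
cutoff = date(2010, 8, 31)  # assumed intake-year cutoff

as_us = datetime.strptime(raw, "%m/%d/%Y").date()  # what the parent meant
as_eu = datetime.strptime(raw, "%d/%m/%Y").date()  # what the school read

print(as_us, age_on(as_us, cutoff))  # 2005-09-08 4
print(as_eu, age_on(as_eu, cutoff))  # 2005-08-09 5
```

Both parses succeed, which is why a format check alone cannot catch this error; an unambiguous format on the form (for example, a spelled-out month) removes the ambiguity at the source.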

Reference: https://silo.tips/download/the-six-primary-dimensions-for-data-quality-assessment

The Biostatistics, Epidemiology, and Research Design (BERD) team helps researchers design studies and enhance data collection, management, and analysis for health-related research projects. Biostatistical support is available throughout the project lifespan with a cost structure that is subsidized for Illinois investigators. Visit our website or contact berd-ihsi@illinois.edu to learn more.