Data Validation & Data Cleaning 101
Introduction to Data Validation & Data Cleaning
Data quality refers to the usability of data: how fit the data is for use in decision making.
In our case, data quality can usually be measured by checking the data against quality dimensions, commonly accuracy, completeness, consistency, validity, and uniqueness.
Importance of Data Quality
Having high-quality data leads to better analysis, which means more accurate insights can be generated from the same data; such insights are more reliable when used for decision making.
Data Validation Techniques
To begin the data validation process, the data must be profiled. Data profiling involves examining the quality, completeness, and characteristics of the dataset; anomalies are detected and overall data quality is assessed at this stage.
This stage is important because it helps ensure that the data the user is about to work with is reliable.
If any inconsistencies or errors are detected, they can be dealt with via a process of data cleaning.
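Outside of Toric, the profiling step can be sketched with pandas; the sample dataset and column names below are hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical sample data with a missing value and a repeated id.
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 4],
    "amount": [10.5, np.nan, 7.2, 9999.0, 9999.0],
    "region": ["north", "south", None, "north", "north"],
})

# A basic profile: row count, per-column missing values, and uniqueness.
profile = {
    "rows": len(df),
    "missing": df.isna().sum().to_dict(),
    "unique": df.nunique().to_dict(),
}
```

Even this small profile already surfaces the gaps (one missing `amount`, one missing `region`) and the repeated `order_id` that later cleaning steps will need to address.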
Validating Data Against Business Rules and Constraints
Before working with any data, the user should validate that it meets their internal data requirements and expectations; these internal constraints dictate how the data should be formatted and structured.
Based on how the available data lines up against these rules and constraints, data cleaning can be incorporated into the workflow to standardize and normalize the data.
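As a sketch of what rule-based validation can look like in code, with rules and field names invented purely for illustration:

```python
import pandas as pd

# Hypothetical line items checked against two invented business rules.
df = pd.DataFrame({
    "quantity": [3, -1, 10],
    "unit_price": [5.0, 2.5, 0.0],
})

# Each rule is a named boolean mask over the rows.
rules = {
    "quantity must be positive": df["quantity"] > 0,
    "unit_price must be positive": df["unit_price"] > 0,
}

# Row indices that violate each rule, ready for review or cleaning.
violations = {name: df.index[~mask].tolist() for name, mask in rules.items()}
```

Keeping rules as named masks makes it easy to report which constraint failed and on which rows, rather than a single pass/fail verdict.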
Cross-Field and Cross-Dataset Validation
An additional step in the data validation process is cross-field validation, which detects inconsistencies between fields that are interrelated. Data does not need to be validated only against information in the same dataset; it can also be checked against fields present in a different table.
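A minimal pandas sketch of both checks, using hypothetical orders and customers tables:

```python
import pandas as pd

# Hypothetical orders table plus a separate customers table.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "start": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
    "end": pd.to_datetime(["2024-01-05", "2024-01-15", "2024-03-10"]),
    "customer_id": [10, 11, 99],
})
customers = pd.DataFrame({"customer_id": [10, 11, 12]})

# Cross-field check: an order should not end before it starts.
bad_dates = orders.index[orders["end"] < orders["start"]].tolist()

# Cross-dataset check: every customer_id should exist in the customers table.
orphans = orders.index[
    ~orders["customer_id"].isin(customers["customer_id"])
].tolist()
```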
Identifying Errors and Missing Data
To identify errors and missing data, Toric offers a few useful nodes and features. With the Count Unique node, users can identify unique elements and their corresponding counts, yielding a value distribution. The Summarize node provides a quick view of the range of values in a column.
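The same value-distribution idea can be sketched in plain pandas; values that appear only once in a distribution are often typos or entry errors:

```python
import pandas as pd

# Hypothetical status column containing a typo ("opne").
statuses = pd.Series(["open", "open", "closed", "opne", "closed"])

# A value distribution, analogous to a Count Unique step; rare values
# are flagged for inspection.
distribution = statuses.value_counts().to_dict()
suspect = [value for value, count in distribution.items() if count == 1]
```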
Toric's visualizations also make it possible to visually spot outliers, anomalies, and errors in the data, which supports the validation process.
Data Cleaning Techniques
In order to increase data quality, it is useful to begin a data workflow with a cleaning and validation step.
The first part of this process is validation: identifying the problems in the data and determining how far it is from the level of quality that is desirable and usable within workflows.
Once the data has been validated and its quality assessed, it can be cleaned and then integrated into workflows.
Handling Missing Data
Empty cells are rarely beneficial in a dataset: empty and missing values can skew an analysis and produce misleading insights.
Once identified, missing cells can be filled in or filtered out, depending on how the data will be used. The Filter node can remove rows with empty cells in certain columns, while the Find & Replace node can fill those empty cells with pre-defined values.
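Both strategies can be sketched in pandas, with hypothetical column names:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with gaps in two columns.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "score": [88.0, np.nan, 75.0, np.nan],
    "group": ["a", "b", None, "b"],
})

# Option 1: drop rows that are missing a required field
# (analogous to filtering out rows with empty cells).
filtered = df.dropna(subset=["group"])

# Option 2: fill empty cells with a pre-defined value
# (analogous to a find-and-replace on empty cells).
filled = df.fillna({"score": 0.0})
```

Which option is right depends on downstream use: filtering shrinks the dataset but keeps every remaining value genuine, while filling preserves row count at the cost of introducing synthetic values.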
Dealing With Outliers and Anomalies
Identifying and dealing with outliers and anomalies is an important part of increasing data quality. Outliers have no strict definition; they usually refer to data points that are unusual compared to the rest of the data, and they can skew and misrepresent an analysis.
The first step is identifying them. The kinds of outliers differ depending on whether the data is numerical or textual. For numerical data, basic statistical methods such as z-scores and interquartile ranges can be used.
Once identified, outliers can be resolved by replacing or dropping them, though in some cases it is worth keeping them, for example when an unusual value is genuine and meaningful.
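A sketch of both statistical methods on a small, made-up numeric series; note the z-score threshold here is set to 2, though 3 is also a common choice:

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 11, 95])

# Interquartile-range (IQR) method: flag points beyond 1.5 * IQR
# from the first and third quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# Z-score method: flag points more than 2 standard deviations
# from the mean.
z = (values - values.mean()) / values.std()
z_outliers = values[z.abs() > 2]
```

The two methods can disagree on small samples, because a single extreme value inflates the standard deviation that z-scores depend on; the IQR method is more robust to this.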
Standardizing and Normalizing Data
A key component of ensuring data quality is making sure that data within the same field/column is formatted consistently; inconsistent formatting leads to discrepancies at analysis time.
Examples include:
- Date formats must be identical across the column
- Units used must be consistent
- Whitespace must be consistent
Toric nodes such as Clean Up Columns help resolve such issues.
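As a sketch of whitespace, casing, and unit standardization outside Toric (the column names and the choice of kilograms as the canonical unit are assumptions):

```python
import pandas as pd

# Hypothetical material records with inconsistent casing, stray
# whitespace, and mixed weight units.
df = pd.DataFrame({
    "name": ["  Steel Beam", "steel beam ", "STEEL BEAM"],
    "weight": [1.2, 1200.0, 1.2],
    "unit": ["t", "kg", "t"],
})

# Normalize whitespace and casing so identical names match exactly.
df["name"] = df["name"].str.strip().str.title()

# Convert every weight to an assumed canonical unit (kilograms).
factors = {"kg": 1.0, "t": 1000.0}
df["weight_kg"] = df["weight"] * df["unit"].map(factors)
```

After standardization, all three rows describe the same item with the same weight, which a duplicate-removal step can then detect.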
Handling Duplicates and Redundant Data
Duplicates are costly: they bloat the size of your data, increasing storage costs, and they also distort analysis. Duplicates can be dealt with easily in Toric using the Remove Duplicates node, which also provides a breakdown of the duplicates removed, giving a degree of oversight into their removal.
Imputing and Tagging Data
Data imputing and tagging are important parts of the pre-processing we have been discussing: missing data is enriched as necessary, using pre-determined values or mathematical calculations.
Toric nodes such as Find & Replace and Data Tagging help complete this process and make your data more ready for a workflow.
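A sketch of imputing missing values with a calculated (group-mean) value while tagging the affected rows; the field names are hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical sensor readings with a gap in one sensor's series.
df = pd.DataFrame({
    "sensor": ["s1", "s1", "s2", "s2"],
    "reading": [10.0, np.nan, 20.0, 22.0],
})

# Tag the rows first so imputed values stay distinguishable later.
df["imputed"] = df["reading"].isna()

# Impute each gap with the mean of the same sensor's other readings.
df["reading"] = df.groupby("sensor")["reading"].transform(
    lambda s: s.fillna(s.mean())
)
```

Tagging before imputing means downstream consumers can always separate recorded values from calculated ones.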
Toric Data Cleaning Nodes
Toric offers a wide array of nodes designed to help clean your data and increase its quality. Some of the commonly used nodes that help identify and deal with discrepancies in the data include:
- Filter
- Find & Replace
- Count Unique
- Summarize
- Clean Up Columns
- Remove Duplicates
- Data Tagging
Building Dashboards & Reports
Data Visualization Principles
Data visualization, at a high level, is the visual representation of data using tools such as plots, graphs, and diagrams. It is key to gaining insights from data.
Designing Interactive Dashboards for Data Exploration
Visualizations tie in well with the data validation process discussed earlier. In the exploratory process of understanding a dataset's quality, it is useful to chart its different aspects. Visualizations at this stage help identify outliers, determine unique elements, and characterize the data before you begin working with it.
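Even a crude textual chart can surface an outlier during exploration; this sketch bins a made-up series and renders a text histogram, where an isolated bar far from the bulk of the data points at the outlier:

```python
import pandas as pd

values = pd.Series([10, 11, 11, 12, 12, 12, 13, 95])

# Bin the values and count how many fall in each bin.
counts = pd.cut(values, bins=3).value_counts().sort_index()

# One line per bin; the lone mark in the top bin is the outlier.
histogram = [f"{interval}: {'#' * n}" for interval, n in counts.items()]
```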
Types of Visualizations in Toric
Toric visualizations can be separated into three categories:
- Tables - a representation of the data available at any point as a set of rows and columns
- Charts - pictorial representations of the data through one of, or a combination of, bars, lines, and proportional areas; examples include the Bar Chart, Pie Chart, Funnel Chart, and Line Chart
- Textual - data represented as individual numbers, text, or a combination of the two
Visualizations in Toric can be set up and adjusted via the Configurations panel, from selecting the columns that serve as dimensions and measures to tuning legends and color schemes.
Toric visualizations have the unique feature of interactivity, which ties different charts, tables, and text together.
Interactivity, when turned on via the Configurations panel, allows the user to click on charts and tables and see corresponding selections made automatically in any linked visual.
Toric visualizations can be shared with users inside and outside your organization, while retaining the same set of features available to the user who set them up.