The Journey to Being Data-Driven
While it is easy to talk about transitioning to a data-driven audit selection process, it can be a difficult transformation. It is more than a technology change: it also requires process and organizational changes. We will address some of those challenges in future posts. In this post, we will focus on the technological aspects of this transformation.
One of the first aspects is data governance. While it is not an exciting topic, it is the foundation of a data-driven process. Analytics are most effective when performed on trusted data. We like to jokingly refer to this process as “land it, scrub it, and trust it.” Before turning your data scientists loose on the data, it is vital to get your data sources identified, organized, defined, profiled, and controlled. Here we briefly explain each aspect:
- Identify – Name specific business domains of critical data and where it is stored.
- Organize – Pinpoint the location of every needed data source and determine how it will integrate with other data sources.
- Define – Prepare a business description of every attribute to ensure your team understands what it is and how it will relate to their analytic objectives.
- Profile – Document the specific values, or ranges of values, of each attribute to ensure they are aligned with your definitions.
- Control – Based on definitions and profiles, create quality rules that will be assessed with every refresh of the data. This will ensure it remains aligned with your expectations and needs.
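As a rough illustration of the Control step, quality rules can be written as small named checks that run against every refresh. The sketch below is only an assumption about how that might look: the `QualityRule` structure, the attribute names (`tax_year`, `reported_income`, `filing_status`), and the allowed ranges are all hypothetical and would come from your own definitions and profiles.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class QualityRule:
    """A named check applied to one attribute of every record (hypothetical structure)."""
    attribute: str
    description: str
    check: Callable[[object], bool]


# Illustrative rules only; real rules come from your definitions and profiles.
RULES = [
    QualityRule("tax_year", "tax year within the profiled range",
                lambda v: isinstance(v, int) and 2000 <= v <= 2030),
    QualityRule("reported_income", "income is a non-negative number",
                lambda v: isinstance(v, (int, float)) and v >= 0),
    QualityRule("filing_status", "status is one of the defined codes",
                lambda v: v in {"single", "joint", "business"}),
]


def assess_refresh(records, rules=RULES):
    """Count how many records fail each rule in one data refresh."""
    failures = {rule.description: 0 for rule in rules}
    for record in records:
        for rule in rules:
            if not rule.check(record.get(rule.attribute)):
                failures[rule.description] += 1
    return failures


# Made-up sample refresh: the second and third records each break a rule.
sample_refresh = [
    {"tax_year": 2014, "reported_income": 52000.0, "filing_status": "single"},
    {"tax_year": 1900, "reported_income": -10, "filing_status": "joint"},
    {"tax_year": 2013, "reported_income": 71000, "filing_status": "unknown"},
]
report = assess_refresh(sample_refresh)
```

A report like this could be reviewed after each refresh, so a data source that drifts away from its profile is caught before it reaches the models.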
Data governance is not a one-time event or a barrier to initiating your transformation. You can start with a minimal number of data sources and qualify them incrementally. Starting with something is more valuable than starting with nothing while you wait for perfection.
We frequently receive questions on the amount of historical data necessary. While results can be produced with limited amounts of historical data, accuracy improves as more data is made available. There is no firm definition of the amount necessary; however, a few years of data will establish an analytic baseline. Ideally, this means having a few years of aligned history. “Aligned history” means the data sources must cover the same time period. For example, attempting to correlate 2008 audit outcomes with 2014 tax forms is not a useful combination.
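One way to picture aligned history is as a join on both the taxpayer and the tax year, so an audit outcome is only paired with the form from the same period. The following is a minimal sketch under that assumption; the field names (`taxpayer_id`, `tax_year`, `adjustment`) and the sample records are invented for illustration.

```python
def align_history(tax_forms, audit_outcomes):
    """Pair each audit outcome with the tax form for the same taxpayer and tax year."""
    forms_by_key = {(f["taxpayer_id"], f["tax_year"]): f for f in tax_forms}
    aligned = []
    for outcome in audit_outcomes:
        key = (outcome["taxpayer_id"], outcome["tax_year"])
        form = forms_by_key.get(key)
        if form is not None:  # drop outcomes with no form from the same period
            aligned.append({**form, "adjustment": outcome["adjustment"]})
    return aligned


# Made-up sample data.
tax_forms = [
    {"taxpayer_id": "A1", "tax_year": 2014, "reported_income": 50000},
    {"taxpayer_id": "B2", "tax_year": 2013, "reported_income": 82000},
]
audit_outcomes = [
    {"taxpayer_id": "A1", "tax_year": 2008, "adjustment": 1200},  # 2008 outcome, 2014 form
    {"taxpayer_id": "B2", "tax_year": 2013, "adjustment": 0},
]
aligned = align_history(tax_forms, audit_outcomes)
```

Here the 2008 outcome for taxpayer `A1` is left out because the only available form is from 2014, which mirrors the mismatch described above.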
The significance of external data cannot be overemphasized. If your team only uses data from your tax forms and IRS data sets, you are getting only a small portion of the complete picture. To have robust models, you should incorporate data from multiple state sources (e.g., business and vendor licenses), vehicle registrations, and external data sources (e.g., third-party published business listings). In addition to providing discovery leads, this additional data will provide a more complete view of your taxpayers.
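To make the idea of discovery leads concrete, the sketch below compares a hypothetical business license extract against the names of known filers and returns the licensed businesses with no matching filer. The name normalization is deliberately naive and the field names are assumptions; real entity matching usually needs far more care (addresses, identifiers, fuzzy matching).

```python
def normalize(name):
    """Lowercase a business name and strip commas, periods, and extra spaces."""
    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())


def discovery_leads(license_records, filer_names):
    """Return license records whose business name matches no known filer."""
    known = {normalize(n) for n in filer_names}
    return [r for r in license_records
            if normalize(r["business_name"]) not in known]


# Made-up sample data.
filer_names = ["Acme Supply, Inc.", "Birch Cafe"]
license_records = [
    {"business_name": "ACME SUPPLY INC", "license_type": "vendor"},
    {"business_name": "Cedar Towing LLC", "license_type": "business"},
]
leads = discovery_leads(license_records, filer_names)
```

In this toy data, only the towing company surfaces as a lead, since it holds a license but does not appear among the filers.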
The Art of Data Science
Data science is the application of statistical analysis to large volumes of data. While driven by science, there is also an “art” to the techniques and methods. From a technical perspective, your team and their tools must be prepared to incorporate these modeling methods:
- Discriminant, generalized linear, logistic and nonparametric regression models
- Decision trees
- Neural networks
- K-means clustering
With the wide variety of science behind just those few model types, you can start to see the need for the “art,” or the experiential side of data science. Selecting the correct model type is highly dependent on what data is available, and the choice may change with refreshed data sets. The best model for your situation may be a blend of different underlying models. This is very similar to a weather forecast model consisting of a mix of over 100 individual models rather than just one “golden” model. While finding the optimal model requires extensive mathematics and statistics, it also requires a significant amount of “art” to be successful.
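A blend can be as simple as a weighted average of the scores produced by several underlying models. The sketch below assumes two stand-in models, a logistic-style scorer and a small decision tree, whose coefficients, splits, and input fields are made up purely for illustration; in practice each would be fitted to your own data.

```python
import math


def logistic_score(record):
    """Stand-in for a fitted logistic regression (coefficients are made up)."""
    z = -2.0 + 0.00002 * record["reported_income"] + 1.5 * record["cash_business"]
    return 1.0 / (1.0 + math.exp(-z))


def tree_score(record):
    """Stand-in for a fitted decision tree (splits are made up)."""
    if record["cash_business"]:
        return 0.7 if record["reported_income"] < 40000 else 0.4
    return 0.1


def blend(record, models, weights):
    """Weighted average of the scores from several underlying models."""
    total = sum(weights)
    return sum(w * m(record) for m, w in zip(models, weights)) / total


record = {"reported_income": 30000, "cash_business": 1}
score = blend(record, [logistic_score, tree_score], weights=[1.0, 1.0])
```

With equal weights this is just the mean of the two scores. Choosing the weights, or deciding which models belong in the blend at all, is part of the experiential side described above.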
One of the final challenges with applying data science is incorporating automation. Because the libraries of modeling methods are so extensive, a state-of-the-art solution requires automation to support the data scientist. This will enable them to adapt to various data sets and quickly deliver results. Additionally, as data sets and fraudulent practices shift, models must be revised, and that revision also requires automation. The volume and complexity of the methods dictate automation to maintain the pace of change.
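One small piece of that automation might be a routine that scores every candidate model against held-out audit results and picks the best performer, rerun whenever the data is refreshed. The sketch below is a simplified assumption of how that could work: the candidate models, the 0.5 threshold, the use of plain accuracy, and the sample records are all hypothetical.

```python
def accuracy(model, holdout, threshold=0.5):
    """Share of held-out records where the model's flag matches the audit result."""
    hits = sum((model(x) >= threshold) == bool(y) for x, y in holdout)
    return hits / len(holdout)


def select_best(candidates, holdout):
    """Score every candidate on the holdout set and return the best name and all scores."""
    scores = {name: accuracy(model, holdout) for name, model in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical candidates; in practice these would be fitted models.
candidates = {
    "income_only": lambda r: 0.8 if r["reported_income"] < 40000 else 0.2,
    "cash_flag": lambda r: 0.9 if r["cash_business"] else 0.1,
}
# Made-up holdout set of (record, audit_found_issue) pairs.
holdout = [
    ({"reported_income": 30000, "cash_business": 1}, 1),
    ({"reported_income": 35000, "cash_business": 0}, 0),
    ({"reported_income": 90000, "cash_business": 1}, 1),
    ({"reported_income": 85000, "cash_business": 0}, 0),
]
best, scores = select_best(candidates, holdout)
```

Rerunning a selection step like this on each refresh is one way the preferred model can change as the data and filing behavior shift, without the data scientist repeating the comparison by hand.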
XDS doesn’t just talk about the challenges and opportunities. We exist to help you rapidly realize the future. Turning the corner toward a data-driven and productive approach can be a significant initiative. To discover some key steps to gaining executive support for your transformation, return for our next post in the series: Build Executive Support for Progress.