2.3 Analytics Process
People Analytics is all about exploring, inferring, and communicating data patterns, with the objective of supporting decisions about people in the organization. It is therefore essential to understand how analytics serves as a phase in the decision-making process. To do so, and excuse my nostalgia, I browsed my graduation project, written in 1995, to quote the DECIDE acronym.
The DECIDE model is a framework developed by marketing researchers, which I found in an early edition of Malhotra’s book “Marketing Research – An Applied Orientation” (Prentice Hall). Although I originally learned it in a different business-research context, it is undoubtedly applicable to the domain of people.
In this formulation, the acronym DECIDE stands for: Define the problem, Explore the factors, Choose the research design, Investigate and analyze, Draw conclusions and recommendations, and Evaluate action effectiveness. Research is a loop: the evaluation phase at the end leads to new problem definitions and, therefore, to a new or repeated study.
The DECIDE acronym is essential because it emphasizes that every study in the business realm must start with a question in mind and end with actionable insights, and the domain of people is no exception. Defining the research objectives and drawing conclusions and recommendations are essentially the role of the research sponsor in the HR department; in a sense, these responsibilities frame the data scientist’s role in the analysis phase.
Therefore, when communicating with a data scientist, whether to present the research objectives before the analysis phase or to receive the study results for conclusions and recommendations, it is helpful to understand the analytics process from the data scientist’s point of view. This process includes five action steps: Importing and Cleaning; Manipulating and Transforming; Exploring and Visualizing; Analyzing and Modeling; and Communicating and Reporting. So, let’s delve into these steps.
2.3.1 Importing and Cleaning
Data import and cleaning are essential in analytics, laying the foundation for all subsequent phases. Ensuring that the data is accurate and ready for analysis avoids errors later in the process.
This step involves bringing the data into the project environment and preparing it for analysis. It includes tasks such as: importing the data from its source, checking the data quality by identifying and filling in missing data, spotting invalid records and correcting errors, and removing duplicates.
This step also includes structuring the data to make it suitable for analysis and organizing it in a standard format called tidy data. This standard format has three key characteristics: each column represents a variable, each row represents a unique observation, and each value sits at the intersection of a variable and an observation. Tidy data allows data scientists to focus on the core analysis rather than getting bogged down in data preparation.
Here are some of the most useful R libraries for importing and cleaning data for People Analysts who take their first steps in R. I suggest reviewing the documentation and cheatsheets of these libraries:
readr: functions for reading and writing flat files, such as CSV
tidyr: functions for cleaning and reshaping data in a tidy format
dplyr: functions for manipulating and summarizing data
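To make these tasks concrete, here is a minimal sketch of an import-and-clean workflow using readr and dplyr. The file name and column names (employee_id, department, salary) are hypothetical and only illustrate the typical steps described above.

```r
# A minimal import-and-clean sketch; the file and column names are hypothetical.
library(readr)
library(dplyr)

employees <- read_csv("employees.csv")              # import from a flat file

employees_clean <- employees %>%
  distinct() %>%                                    # remove duplicate records
  filter(!is.na(employee_id)) %>%                   # drop records missing a key field
  mutate(
    department = trimws(department),                # correct minor formatting errors
    salary = if_else(salary < 0, NA_real_, salary)  # mark invalid values as missing
  )
```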
2.3.2 Manipulating and Transforming
Data manipulation and transformation are vital in every data science project because they prepare the data for analysis according to the research objectives and make it possible to reveal patterns and relationships in the data later in the project.
This step involves modifying and rearranging the data to make it more suitable for analysis and modeling. It includes tasks such as grouping and summarizing records, selecting and filtering subsets based on specific criteria, and transforming the data to scale or normalize it.
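As an illustration, here is a hedged sketch of these manipulation tasks with dplyr, reusing the hypothetical employees_clean data frame from the previous sketch.

```r
# Grouping, summarizing, filtering, and scaling with dplyr (hypothetical data).
library(dplyr)

dept_summary <- employees_clean %>%
  filter(!is.na(salary)) %>%                          # select a subset by criteria
  group_by(department) %>%                            # group records
  summarise(
    n = n(),                                          # group size
    avg_salary = mean(salary)                         # group summary
  ) %>%
  mutate(salary_z = as.numeric(scale(avg_salary)))    # scale / normalize
```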
One fundamental transformation relates to the data format, i.e., its structure: “long” or “wide.” For example, suppose a dataset records information about groups, with each group’s identity in the first column. In a wide format, each group appears in a single row, so the first column contains unique values. In a long format, a group’s information may be spread across several rows, so the values in the first column may repeat.
Both “long” and “wide” formats include the same information but in a different structure. Most datasets you’ll encounter in People Analytics are probably in a wide format. However, for certain functions, e.g., when visualizing data, you’ll have to convert the data to a long format.
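The following sketch shows such a conversion with tidyr; the headcount_wide data frame and its columns are made up for illustration.

```r
# Reshaping between wide and long formats with tidyr (hypothetical data).
library(tidyr)
library(tibble)

headcount_wide <- tibble(
  department = c("Sales", "R&D"),
  q1 = c(40, 25), q2 = c(42, 27), q3 = c(45, 30)
)

# Wide -> long: one row per department-quarter combination
headcount_long <- pivot_longer(headcount_wide, cols = q1:q3,
                               names_to = "quarter", values_to = "headcount")

# Long -> wide: back to one row per department
pivot_wider(headcount_long, names_from = quarter, values_from = headcount)
```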
An essential part of transforming the data is manipulating variables according to their particular classes, e.g., dates, text, or categorical variables. Therefore, in addition to the R libraries mentioned earlier, it is worth exploring the following libraries in the context of data transformations. Again, I recommend covering the documentation and cheatsheets of these libraries while remembering there are many other libraries to explore.
stringr: functions for working with strings (character variables), e.g., searching for patterns, extracting substrings, and other manipulations
lubridate: functions for working with dates and times, e.g., parsing from strings, extracting components, and calculations
forcats: functions for working with factors (categorical variables), e.g., reordering factor levels, collapsing levels, and more
These libraries simplify data scientists’ workflow when transforming non-numeric variables, which are very common in People Analytics.
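For example, the following sketch combines the three libraries on a small, made-up data frame; the column names and values are hypothetical.

```r
# Transforming text, date, and categorical variables (hypothetical data).
library(dplyr)
library(stringr)
library(lubridate)
library(forcats)

staff <- tibble::tibble(
  full_name = c("Ada Lovelace", "Alan Turing"),
  hire_date = c("2015-03-01", "2018-11-15"),
  level     = c("junior", "senior")
)

staff <- staff %>%
  mutate(
    last_name  = str_extract(full_name, "\\S+$"),            # extract a substring
    hire_date  = ymd(hire_date),                             # parse dates from strings
    tenure_yrs = as.numeric(today() - hire_date) / 365.25,   # date calculation
    level      = fct_relevel(level, "junior", "senior")      # reorder factor levels
  )
```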
2.3.3 Exploring and Visualizing
Exploring and visualizing the data first, before modeling or continuing with advanced analytics, is an essential step in any project. It enables the data scientist to gain fundamental insights about patterns, factors, relationships, or trends that may be essential for the project’s goals. Such understanding will later be helpful for informed decisions about further analysis, modeling methods, and potential challenges.
There are typically two primary techniques to explore the data: plotting it and examining its summary statistics. The outputs of these techniques enable us to understand the distributions in the data, including centers, dispersions, skewness, and outliers. Understanding distributions is crucial at this stage because the distributions of variables significantly impact any analysis, particularly in choosing the appropriate modeling techniques.
This stage is also an essential communication step early in the project. Research sponsors usually expect to receive some findings early on. By sharing informative visualizations and information about trends, the data scientist can effectively lower the pressure from research stakeholders, keep them involved, and sometimes get further direction based on the implications of the data exploration for subsequent analysis.
In the case of continuous variables, the fundamental charts commonly used in a project’s exploration phase are histograms, boxplots, and scatterplots. They are suitable for exploration because they enable one to grasp the distributions well. Bar charts are also typical for exploring categorical variables. I included these charts in the following use cases presented in the book, but as you will see, there are many other types of charts to choose from, depending on the nature of the data.
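As a hedged illustration, the following sketch produces summary statistics and the fundamental exploratory charts mentioned above, assuming the hypothetical employees_clean data frame with a numeric salary column and a categorical department column.

```r
# Quick exploration: summary statistics and fundamental charts (hypothetical data).
library(ggplot2)

summary(employees_clean$salary)                  # center, dispersion, and range at a glance

ggplot(employees_clean, aes(x = salary)) +
  geom_histogram(bins = 30)                      # distribution of a continuous variable

ggplot(employees_clean, aes(x = department, y = salary)) +
  geom_boxplot()                                 # compare distributions across groups

ggplot(employees_clean, aes(x = department)) +
  geom_bar()                                     # counts of a categorical variable
```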
There are many helpful R libraries for data visualization, but the most fundamental is ggplot2. This famous library provides a wide range of chart types and features, and it is particularly well suited for creating complex plots with multiple aesthetic options. Many other visualization libraries are based on it and provide additional customizations.
The underlying philosophy of ggplot2 is The Grammar of Graphics, a conceptual framework that became a standard approach to data visualization. The idea is that statistical graphics can be expressed as a language, with specific grammar rules for combining different plot elements to create a compelling and informative visualization.
The four components of statistical graphics are data, aesthetic mappings (axes and attributes), geoms (representations of the data), and additional statistical transformations (e.g., binning, smoothing). The Grammar of Graphics that ggplot2 uses is described in detail in the book ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham et al.
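To see how these components map to ggplot2 code, here is a minimal sketch; the data frame and column names are hypothetical.

```r
# The layered grammar in ggplot2: data, aesthetics, geom, statistical transformation.
library(ggplot2)

ggplot(data = employees_clean,                 # data
       aes(x = tenure_yrs, y = salary)) +      # aesthetic mappings (axes, attributes)
  geom_point() +                               # geom: representation of the data
  geom_smooth(method = "lm")                   # statistical transformation: smoothing
```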
2.3.4 Analyzing and Modeling
Analyzing and modeling refer to selecting, using, and validating mathematical or statistical tools and algorithms to infer or predict based on the data.
While analyzing and modeling data may be perceived as the essential phase in any analytics process, it is crucial to understand two conditions for its success: first, the data scientist needs to have a question in mind, and second, success depends on the earlier phases of the process.
At this point, I’d like to remind you that your proactivity as an HR professional in the conversation and collaboration with the data scientist is invaluable to the project’s success, because it enables both raising the relevant questions and accessing appropriate, well-maintained datasets.
Having a straightforward question in mind when analyzing and modeling data is vital for two main reasons: First, it helps to focus the analysis and modeling efforts and avoid getting lost or wasting time on irrelevant analyses or models. A well-defined question guides the selection of appropriate methods for the research objective, for example, comparing groups, exploring trends, or predicting a specific outcome.
Secondly, a question in mind allows you to frame the findings and results in the business context, which makes it easier to communicate the insights and recommendations to research sponsors and decision-makers. This business context is also relevant when discussing the research limitations, constraints, and potential biases.
When we have a question in mind and are enthusiastic to start analyzing and modeling, we should have already imported and cleaned the data, transformed and manipulated it to make it ready to work with, and gained a good orientation by exploring and visualizing it. Now we need to pick a few mathematical and statistical tools and algorithms from an enormous range of options. These tools will enable us to infer insights and make predictions, but only if we pick the right ones and apply them to the suitable datasets we prepared in advance.
An important distinction to make here is between inferential statistics and predictive analytics. Both approaches are frequently used, but they have a critical difference: Inferential statistics enables one to test hypotheses and infer about a population based on a sample. In contrast, predictive analytics allows modeling data patterns to make predictions about new and unseen data.
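A small sketch in base R illustrates the contrast, again assuming the hypothetical employees_clean data frame and its columns:

```r
# Inferential statistics: test whether mean salary differs between two departments,
# inferring about the population from a sample (hypothetical data).
t.test(salary ~ department,
       data = subset(employees_clean, department %in% c("Sales", "R&D")))

# Predictive analytics: fit a model on observed data, then predict the salary of a
# new, unseen employee.
model <- lm(salary ~ tenure_yrs + department, data = employees_clean)
predict(model, newdata = data.frame(tenure_yrs = 3, department = "Sales"))
```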
Each approach makes different assumptions about data types and distributions and has different ways to evaluate findings and results. In the use cases in the following sections, I’ll address these assumptions and evaluations based on the tool used. Both inferential statistics and predictive analytics rely on so many R libraries that this book is too short to mention them all; frankly, I know only a handful of the thousands of available packages. Still, beyond the Tidyverse libraries discussed earlier, which are relevant to all use cases, I’ll suggest libraries that may be a good starting point for each one.
However, note that the R programming language evolves constantly, and so do we. Therefore, your library selection may change over time. Nevertheless, your selected libraries will always depend on the research objectives and the relevant analysis or modeling tools.
2.3.5 Communicating and Reporting
The final phase of the analytics process is communicating the results. However, it is the final phase only from the data scientist’s point of view: the reporting tools and their contents will serve the research sponsor in drawing conclusions and recommendations, which in turn become the basis for discussions and decision-making.
The primary goal of this phase is to clearly and effectively communicate the main findings and insights and to provide guidance for using these findings and insights for informed business decisions. Therefore, the data scientist’s deliverables should be designed as a friendly tool that is easily operated and understood by non-technical audiences.
A friendly tool is insufficient on its own, even if it is visually appealing. Therefore, any communication and reporting format must include a clear explanation of the following: the analysis goal, the methods used to collect and analyze the data, the results and their accuracy, a summary highlighting the main findings and conclusions, and references to external sources.
There are many ways to communicate and report findings at the end of the analytics process. Although R scripts are embedded in many data visualization tools, such as interactive dashboards, I’ll mention two other formats that are congruent with this book:
R Markdown: A powerful tool that combines text, code, and output (including visualization) in one document and is used to create reports, presentations, and even websites. The book R Markdown Cookbook by Yihui Xie et al. is a comprehensive guide for this tool.
Shiny: A web application framework for R that enables the creation of interactive web apps to share results (a minimal sketch follows this list).
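To give a feel for the second format, here is a minimal Shiny sketch that shares one exploratory plot; the employees_clean data frame and its columns are hypothetical and assumed to be available in the app’s environment.

```r
# A minimal Shiny app: an interactive histogram filtered by department (hypothetical data).
library(shiny)
library(ggplot2)

ui <- fluidPage(
  selectInput("dept", "Department", choices = c("Sales", "R&D")),
  plotOutput("salary_plot")
)

server <- function(input, output) {
  output$salary_plot <- renderPlot({
    ggplot(subset(employees_clean, department == input$dept), aes(x = salary)) +
      geom_histogram(bins = 30)
  })
}

shinyApp(ui, server)
```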
When you cover the use cases section of the book, you’ll notice that all the steps in each use case are written in R code. Of course, this is not always the case in practice, since data scientists may use other programming languages or data analysis platforms. However, beyond the advantages of R that I shared in the introduction, one that deserves special attention is its inherent reproducibility, a term I will describe next.