This section describes the patterns of missing values in our dataset. Note that the data are not purely administrative: they were collected and structured by researchers from the University of Michigan, which may have introduced some flaws. Although we can neither eliminate this influence nor measure its impact on the data, we are interested in discussing the challenges of working with this type of data (no data are free from the influence of their creator) and in developing methodological approaches that can be applied to similar datasets.
The missing values in our dataset can be classified into two types:
We focus on the second type of missing values in order to describe the processing of criminal cases in certain U.S. states. All natural NAs have been marked with a special category distinct from NA.
For the regression analysis, our main hypothesis is that the missing data exhibit regional specificity: particular counties in specific states may differ in how they document decisions on criminal cases, owing to various organisational and bureaucratic causes. The corresponding null hypothesis is that there is no difference in data registration across counties.
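As a rough illustration of what the null hypothesis means in practice, the sketch below computes a Pearson chi-square statistic for a county-by-missingness contingency table. This is a simpler check than the regression models discussed in this section, not a replacement for them. The field names (`county`, `sentence_type`) and the use of `None` for a missing value are assumptions made for the example only.

```python
from collections import Counter

def chi_square_missingness(records, group_key, field):
    """Pearson chi-square statistic for a group-by-missingness table.

    records: list of dicts; a value of None in `field` counts as missing.
    Returns (statistic, degrees_of_freedom).
    """
    counts = Counter()
    for row in records:
        counts[(row[group_key], row[field] is None)] += 1

    groups = sorted({g for g, _ in counts})
    n = sum(counts.values())
    col_totals = {m: sum(counts[(g, m)] for g in groups) for m in (True, False)}

    stat = 0.0
    for g in groups:
        row_total = counts[(g, True)] + counts[(g, False)]
        for m in (True, False):
            expected = row_total * col_totals[m] / n
            if expected > 0:  # skip empty columns to avoid division by zero
                stat += (counts[(g, m)] - expected) ** 2 / expected

    # Two columns (missing / observed), so dof = number of groups - 1
    return stat, len(groups) - 1

# Hypothetical toy data: county A never records the field, county B always does
toy = ([{"county": "A", "sentence_type": None}] * 10
       + [{"county": "B", "sentence_type": "prison"}] * 10)
stat, dof = chi_square_missingness(toy, "county", "sentence_type")
```

A large statistic relative to a chi-square distribution with `dof` degrees of freedom would speak against the null hypothesis. The standard library does not provide the corresponding p-value, so the sketch stops at the statistic.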
In addition to the other factors available in the dataset, we take into account socio-demographic characteristics of the offenders, such as age, sex, and race. We believe that the recording of some cases may be influenced by these factors (strong statement about why socio-demographic factors might influence).
The number of counties per state in our study is too high (median = 65, sd = 36) for counties to be included individually as predictors in our models, so we grouped them using a clustering method. This reduces the number of predictors in our models.
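For reference, a summary of this kind can be obtained by counting the distinct counties in each state and then taking the median and the sample standard deviation of those counts. The sketch below assumes the data can be reduced to (state, county) pairs. The function name and the toy values are hypothetical and do not reproduce the figures above.

```python
import statistics

def county_count_summary(state_county_pairs):
    """Median and sample standard deviation of the number of distinct
    counties per state, from an iterable of (state, county) pairs."""
    counties_by_state = {}
    for state, county in state_county_pairs:
        counties_by_state.setdefault(state, set()).add(county)
    counts = [len(c) for c in counties_by_state.values()]
    return statistics.median(counts), statistics.stdev(counts)

# Hypothetical toy input: three states with 2, 4 and 6 counties
pairs = ([("S1", f"c{i}") for i in range(2)]
         + [("S2", f"c{i}") for i in range(4)]
         + [("S3", f"c{i}") for i in range(6)])
median_counties, sd_counties = county_count_summary(pairs)
```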
We performed the clustering within each state separately to make the results easier to interpret.
The selection of predictors as grouping parameters was driven by the requirements of the clustering method. First, we excluded outliers in the numeric predictors. Second, we considered only variables with minimal missing data. Third, among categorical predictors, we included only variables with a moderate number of factor levels.
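The three criteria can be sketched as simple screening functions. The thresholds (`max_missing_share`, `max_levels`) and the 1.5 × IQR rule for outliers are assumptions chosen for illustration, since the text does not state the exact cut-offs or the outlier rule that was applied.

```python
import statistics

def drop_iqr_outliers(values, k=1.5):
    """Keep numeric values within k * IQR of the quartiles (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if q1 - k * iqr <= v <= q3 + k * iqr]

def screen_categorical(columns, max_missing_share=0.05, max_levels=10):
    """Select categorical variables with little missing data and a
    moderate number of levels.

    columns: dict mapping variable name -> list of values (None = missing).
    """
    kept = []
    for name, values in columns.items():
        missing_share = sum(v is None for v in values) / len(values)
        levels = {v for v in values if v is not None}
        if missing_share <= max_missing_share and 2 <= len(levels) <= max_levels:
            kept.append(name)
    return kept

# Hypothetical toy columns
toy_columns = {
    "court_disposition": ["guilty", "dismissed", "guilty", "acquitted"],
    "judge_id": ["j1", "j2", "j3", "j4"],   # too many levels for the toy limit
    "bail": [None, None, "yes", "no"],      # too much missing data
}
selected = screen_categorical(toy_columns, max_missing_share=0.25, max_levels=3)
charges = drop_iqr_outliers([1, 2, 2, 3, 3, 3, 4, 100])
```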
Based on these criteria, the final set of predictors used as grouping parameters was as follows: the number of arrest charges (numeric), prosecution/grand jury disposition type (categorical), court disposition type (categorical), type of court at final disposition (categorical), and sentencing type (categorical).
As a result, the clustering groups counties within each state based on the number of arrest charges and the types of dispositions made at different stages of the process. (Add an explanation.)
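One way to picture the input to such a grouping is a per-county profile: the mean of a numeric variable followed by the share of each level of a categorical variable. The sketch below is only an assumption about how case-level records could be aggregated to the county level. The aggregation actually used is not specified in this section, and the field names are hypothetical.

```python
from collections import Counter, defaultdict

def county_profiles(cases, numeric_field, categorical_field):
    """Build one feature vector per county:
    [mean of numeric_field, share of each categorical level...].

    Returns (levels, profiles), where levels gives the order of the shares.
    """
    by_county = defaultdict(list)
    for case in cases:
        by_county[case["county"]].append(case)

    levels = sorted({c[categorical_field] for c in cases})
    profiles = {}
    for county, rows in by_county.items():
        mean_value = sum(r[numeric_field] for r in rows) / len(rows)
        level_counts = Counter(r[categorical_field] for r in rows)
        shares = [level_counts[level] / len(rows) for level in levels]
        profiles[county] = [mean_value] + shares
    return levels, profiles

# Hypothetical toy cases
toy_cases = [
    {"county": "A", "arrest_charges": 1, "court_disposition": "guilty"},
    {"county": "A", "arrest_charges": 3, "court_disposition": "dismissed"},
    {"county": "B", "arrest_charges": 2, "court_disposition": "guilty"},
]
levels, profiles = county_profiles(toy_cases, "arrest_charges", "court_disposition")
```

Vectors of this form could then be passed to a clustering routine run separately for each state.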
Two types of clustering methods were used: