Clusterization and missing values

Research Aim and Procedure of This Part

This section aims to explain the patterns of missing values in our dataset. Note that the data are not purely administrative: they were collected and structured by researchers from the University of Michigan, which may have introduced some flaws. Although we can neither eliminate this influence nor measure its impact on the data, we are interested in discussing the challenges of working with this type of data (no data are free from the influence of their creator) and in developing methodological approaches that can be applied to similar datasets.

The missing values in our dataset can be classified into two types:

Natural NAs: These occur when a case is closed at some stage of the legal proceedings and never reaches the later stages (for example, it is not sent to court).
Systematic NAs: These are produced by actors in the legal system through issues such as poor record-keeping or the "human factor."

Our focus is on the second type of missing values, which we use to describe how criminal cases are processed in certain US states. All natural NAs have been marked with a special category distinct from NA.
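The distinction between the two types can be illustrated with a minimal R sketch. The data frame, its column names ("sent_to_court", "court_disposition") and the label "not applicable" are hypothetical and chosen only for illustration; the actual recoding rules depend on the stage structure of the real dataset.

```r
# Toy case-level data; all column names and values are hypothetical.
cases <- data.frame(
  case_id           = 1:5,
  sent_to_court     = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  court_disposition = c("convicted", NA, NA, NA, "acquitted"),
  stringsAsFactors  = FALSE
)

# A missing court disposition is "natural" when the case never reached court.
natural <- is.na(cases$court_disposition) & !cases$sent_to_court
cases$court_disposition[natural] <- "not applicable"

# Whatever is still NA is treated as a systematic (recording) gap.
cases$systematic_na <- is.na(cases$court_disposition)

table(cases$court_disposition, useNA = "ifany")
```

After this step, only the systematic gaps remain coded as NA, so a missingness indicator such as "systematic_na" can serve as the outcome in later models.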

Hypothesis for Clusterization

For the regression analysis, our main hypothesis is that the missing data exhibit regional specificity. We assume that particular counties in specific states may differ in how they document decisions on criminal cases, owing to various organisational and bureaucratic causes. The corresponding null hypothesis is that there is no significant difference in data registration between counties.

In addition to the other factors available in the dataset, we take into account socio-demographic features of the offenders, such as age, sex, and race. We believe that the recording of some cases may be influenced by these factors (strong statement about why socio-demographic factors might influence).

Because the number of counties in the states included in our study is too high (median = 65, sd = 36) for counties to enter our models as individual predictors, we have decided to group them using a clustering method. This reduces the number of predictors in our models.

The decision to perform clusterization within each state separately was made to ensure better interpretability of the results.
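One possible way to test the hypothesis above is a logistic regression of a missingness indicator on the county cluster and the socio-demographic features. The sketch below uses simulated data; the variable names, the number of clusters and the simulated missingness rates are assumptions made for illustration and are not results from our dataset.

```r
set.seed(1)
n <- 600

# Simulated case-level data; every name and value here is hypothetical.
cases <- data.frame(
  county_cluster = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  age            = round(runif(n, 18, 70)),
  sex            = factor(sample(c("female", "male"), n, replace = TRUE)),
  race           = factor(sample(c("group1", "group2", "group3"), n, replace = TRUE))
)

# Simulated recording behaviour: cluster "C" loses dispositions more often.
p_missing <- ifelse(cases$county_cluster == "C", 0.30, 0.10)
cases$disposition_missing <- rbinom(n, 1, p_missing)

# Full model with the cluster term, and a reduced model without it.
full    <- glm(disposition_missing ~ county_cluster + age + sex + race,
               data = cases, family = binomial)
reduced <- update(full, . ~ . - county_cluster)

# Likelihood-ratio test of the null hypothesis of no regional difference.
anova(reduced, full, test = "Chisq")
```

Comparing the two nested models tests whether the cluster term as a whole improves the fit, which corresponds to the null hypothesis of no difference in data registration.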

Clusterization

The selection of predictors as grouping parameters was driven by the requirements of the clustering method. First, we excluded outliers for numeric predictors. Second, we considered only variables with minimal missing data. Third, among categorical predictors, we included only variables with a moderate number of factor levels.
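The three screening steps can be sketched in R as follows. The 1.5 x IQR outlier rule, the thresholds "max_na_share" and "max_levels", and all column names are assumptions introduced for this illustration; the text above does not fix specific cut-offs.

```r
# Flag numeric outliers with the 1.5 * IQR rule (an assumed rule).
is_outlier <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}

# Keep variables with few NAs and, if categorical, few distinct levels.
screen_predictors <- function(df, max_na_share = 0.05, max_levels = 10) {
  keep <- vapply(df, function(col) {
    low_na <- mean(is.na(col)) <= max_na_share
    if (is.numeric(col)) return(low_na)
    low_na && length(unique(na.omit(col))) <= max_levels
  }, logical(1))
  names(df)[keep]
}

# Toy data; all names and values are hypothetical.
toy <- data.frame(
  n_arrest_charges = c(1, 2, 1, 3, 2, 40),
  court_type       = c("district", "circuit", "district",
                       "district", "circuit", "district"),
  free_text_note   = c("a", "b", "c", "d", "e", "f"),
  mostly_missing   = c(NA, NA, NA, NA, "x", NA)
)

toy_clean <- toy[!is_outlier(toy$n_arrest_charges), ]
selected  <- screen_predictors(toy_clean, max_na_share = 0.2, max_levels = 4)
selected
```

In this toy example the extreme charge count is dropped first, and the screening then discards the variable that is mostly missing and the one with too many distinct levels.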

Based on these criteria, the final set of predictors used as grouping parameters was as follows: the number of arrest charges (numeric), prosecution/grand jury disposition type (categorical), court disposition type (categorical), type of court at final disposition (categorical), and sentencing type (categorical).

As a result, clusterization groups counties within each state based on the number of arrest charges and the types of dispositions made at different stages of the process. (Add an explanation.)
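To make the grouping unit concrete, the sketch below shows one possible way to turn case-level records into a county-level profile: the mean number of arrest charges and the share of each disposition type per county. This aggregation scheme, the county names and the column names are assumptions for illustration, not a description of the exact procedure used.

```r
# Toy case-level data for one state; names and values are hypothetical.
cases <- data.frame(
  county            = c("Adams", "Adams", "Adams", "Baker", "Baker",
                        "Clark", "Clark", "Clark"),
  n_arrest_charges  = c(1, 2, 3, 1, 1, 4, 2, 3),
  court_disposition = c("convicted", "dismissed", "convicted", "convicted",
                        "convicted", "dismissed", "dismissed", "convicted")
)

# Share of each disposition type within a county (rows sum to 1).
dispo_share <- prop.table(
  table(cases$county, cases$court_disposition), margin = 1
)

# Mean number of arrest charges per county.
mean_charges <- tapply(cases$n_arrest_charges, cases$county, mean)

# One row per county: disposition shares plus the mean charge count.
county_profile <- as.data.frame.matrix(dispo_share)
county_profile$mean_charges <- as.vector(mean_charges[rownames(county_profile)])

county_profile
```

Each row of "county_profile" then describes one county and can be passed to a clustering routine.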

Two types of clusterization methods were used:

K-medoids clustering (PAM algorithm, partitioning around medoids), implemented with the "pam" function from the "cluster" package in R.
Hierarchical clusterization, implemented with the "hclust" function in R (this function is part of the base "stats" package rather than "cluster").
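The two methods can be run side by side and compared as in the following minimal sketch. The county-level data are simulated, and the number of clusters (k = 2), the Euclidean distance on standardised features and the Ward linkage ("ward.D2") are assumptions made for illustration; the settings used in the actual analysis may differ.

```r
library(cluster)  # provides pam(); hclust() and cutree() come from "stats"

set.seed(42)

# Simulated county-level profiles for one state; all values are hypothetical.
counties <- data.frame(
  mean_charges    = c(rnorm(10, 1.5, 0.2), rnorm(10, 3.0, 0.2)),
  share_convicted = c(rnorm(10, 0.7, 0.03), rnorm(10, 0.4, 0.03)),
  row.names       = sprintf("county_%02d", 1:20)
)

# Euclidean distances between counties on standardised features.
d <- dist(scale(counties))

# Method 1: partitioning around medoids (k-medoids).
pam_fit <- pam(d, k = 2)

# Method 2: agglomerative hierarchical clustering, cut into two groups.
hc_fit <- hclust(d, method = "ward.D2")
hc_lab <- cutree(hc_fit, k = 2)

# Cross-tabulate the two partitions to see how far they agree.
agreement <- table(pam = pam_fit$clustering, hclust = hc_lab)
agreement
```

When the two partitions agree, each row of the cross-tabulation has a single non-zero cell; counties that fall outside that pattern are the ones on which the methods disagree.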

Triangulation Results