bca_survival.analyzer module
BCA Survival Analyzer Module.
This module provides a class for performing survival analysis on body composition assessment (BCA) data. It combines clinical/demographic data with body measurement data and provides methods for univariate and multivariate Cox regression analysis, as well as Kaplan-Meier survival curves.
Requires: pandas, numpy, and custom preprocessing and models modules
- class bca_survival.analyzer.BCASurvivalAnalyzer(df_main, df_measurements, main_id_col, measurement_id_col, start_date_col, event_date_col, event_col, standardize=False)
Bases:
object
A class for analyzing the relationship between body composition measurements and survival outcomes.
This class combines clinical/demographic data with body composition measurements, preprocesses the data for survival analysis, and provides methods for performing Cox regression and Kaplan-Meier survival analysis.
- df
The merged and preprocessed dataframe.
- Type:
pd.DataFrame
- df_negative_days
Dataframe containing records with negative or NaN ‘days’ values for evaluation.
- Type:
pd.DataFrame
- start_date_col
Name of the column containing start dates.
- Type:
str
- event_date_col
Name of the column containing event dates.
- Type:
str
- event_col
Name of the column containing event indicators.
- Type:
str
- standardize
Whether to standardize variables for analysis.
- Type:
bool
Initializes the BCASurvivalAnalyzer with clinical and measurement data.
- Parameters:
df_main (pd.DataFrame) – Dataframe containing clinical/demographic data.
df_measurements (pd.DataFrame) – Dataframe containing body composition measurements.
main_id_col (str) – Name of the PID column in df_main.
measurement_id_col (str) – Name of the PID column in df_measurements.
start_date_col (str) – Name of the column containing start dates.
event_date_col (str) – Name of the column containing event dates.
event_col (str) – Name of the column containing event indicators.
standardize (bool, optional) – Whether to standardize variables for analysis. Defaults to False.
Note
The constructor renames the ID columns to ‘PID’ for consistency, replaces ‘nd’ with NaN, and handles infinite values. It also checks for missing measurements and warns about them.
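The cleaning steps in the note above can be illustrated with a minimal, standard-library sketch. It is not the package's implementation: the helper names (clean_value, clean_rows), the sample column names, and the choice to map infinite values to NaN are assumptions made for illustration, and the real class operates on pandas dataframes.

```python
import math

def clean_value(value):
    """Map the 'nd' placeholder and infinite floats to NaN; pass other values through."""
    if value == "nd":
        return math.nan
    if isinstance(value, float) and math.isinf(value):
        return math.nan  # assumption: infinite values are treated as missing
    return value

def clean_rows(rows, id_col):
    """Rename id_col to 'PID' and clean every other value in each row."""
    cleaned = []
    for row in rows:
        new_row = {}
        for key, value in row.items():
            if key == id_col:
                new_row["PID"] = value
            else:
                new_row[key] = clean_value(value)
        cleaned.append(new_row)
    return cleaned

# Hypothetical measurement rows; 'patient_id' plays the role of measurement_id_col.
rows = [
    {"patient_id": "A1", "muscle_volume": 5120.0, "fat_ratio": "nd"},
    {"patient_id": "A2", "muscle_volume": float("inf"), "fat_ratio": 0.31},
]
cleaned = clean_rows(rows, id_col="patient_id")

# Patients present in the clinical data but absent from the measurements.
main_ids = {"A1", "A2", "A3"}
measurement_ids = {row["PID"] for row in cleaned}
missing = sorted(main_ids - measurement_ids)
print("PIDs without measurements:", missing)
```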
- preprocess_data()
Preprocesses the data for survival analysis.
This method calculates time-to-event days and removes records with negative days.
- Returns:
The preprocessed dataframe.
- Return type:
pd.DataFrame
Note
This method is called automatically during initialization but can be called again if the underlying data changes.
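As a rough illustration of the time-to-event step, the sketch below computes the number of days between a start date and an event date and sets aside records whose value is negative or missing, mirroring the roles of df and df_negative_days. The column names and the days_between helper are hypothetical, and plain dictionaries stand in for the dataframes the class actually uses.

```python
from datetime import date

def days_between(start, event):
    """Return the number of days from start to event, or None if either date is missing."""
    if start is None or event is None:
        return None
    return (event - start).days

# Hypothetical records; 'scan_date' and 'event_date' stand in for
# start_date_col and event_date_col.
records = [
    {"PID": "A1", "scan_date": date(2020, 1, 10), "event_date": date(2021, 1, 9)},
    {"PID": "A2", "scan_date": date(2020, 6, 1), "event_date": date(2020, 5, 1)},
    {"PID": "A3", "scan_date": date(2020, 3, 5), "event_date": None},
]

valid, flagged = [], []
for rec in records:
    days = days_between(rec["scan_date"], rec["event_date"])
    row = {**rec, "days": days}
    if days is None or days < 0:
        flagged.append(row)  # kept aside for inspection, like df_negative_days
    else:
        valid.append(row)    # usable for survival analysis

print([r["PID"] for r in valid], [r["PID"] for r in flagged])
```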
- univariate_cox_regression(columns, verbose=False, penalizer=0.0, correction_values=None, nan_threshold=0.7, significant_only=True)
Performs univariate Cox proportional hazards regression for each specified variable.
- Parameters:
columns (list) – List of predictor column names to test individually.
verbose (bool, optional) – Whether to print detailed progress information. Defaults to False.
penalizer (float, optional) – L2 penalizer value to apply to the regression. Defaults to 0.0.
correction_values (list, optional) – List of column names to include as correction terms in each univariate model. Defaults to None.
nan_threshold (float, optional) – Threshold for NaN values if standardizing. Defaults to 0.7.
significant_only (bool, optional) – Whether to return only statistically significant variables. Defaults to True.
- Returns:
DataFrame containing the tested variables (only the significant ones when significant_only is True) and their statistics.
- Return type:
pd.DataFrame
Note
This method tests each variable individually in a Cox regression model and, when significant_only is True, returns only statistically significant variables (p < 0.05). If correction_values is provided, those variables are included in each model to adjust for their effects.
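To make the idea of a single-variable Cox model concrete, here is a self-contained sketch that fits one covariate by Newton-Raphson on the Breslow partial likelihood and reports a two-sided Wald p-value. It is an illustration only: the function names and sample data are made up, no penalizer or correction terms are applied, and the package itself returns lifelines objects rather than using code like this.

```python
import math

def log_partial_likelihood(beta, times, events, x):
    """Return (log-likelihood, gradient, hessian) for a single covariate."""
    loglik = grad = hess = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            continue  # censored subjects contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        risk = [x[j] for j in range(len(times)) if times[j] >= t_i]
        w = [math.exp(beta * xj) for xj in risk]
        s0 = sum(w)
        s1 = sum(wj * xj for wj, xj in zip(w, risk))
        s2 = sum(wj * xj * xj for wj, xj in zip(w, risk))
        loglik += beta * x[i] - math.log(s0)
        grad += x[i] - s1 / s0
        hess -= s2 / s0 - (s1 / s0) ** 2
    return loglik, grad, hess

def fit_univariate_cox(times, events, x, max_iter=50, tol=1e-9):
    """Newton-Raphson with step halving; returns coefficient, hazard ratio, SE, p-value."""
    beta = 0.0
    loglik, grad, hess = log_partial_likelihood(beta, times, events, x)
    for _ in range(max_iter):
        if abs(grad) < tol:
            break
        step = -grad / hess
        while True:
            new = log_partial_likelihood(beta + step, times, events, x)
            if new[0] >= loglik or abs(step) < 1e-12:
                break
            step /= 2  # halve the step if the likelihood got worse
        beta += step
        loglik, grad, hess = new
    se = math.sqrt(-1.0 / hess)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided Wald test
    return {"coef": beta, "hazard_ratio": math.exp(beta), "se": se, "p": p_value}

# Made-up data: follow-up days, event indicator (1 = event), one predictor.
times = [5, 8, 12, 3, 9, 15, 7, 11]
events = [1, 1, 0, 1, 1, 0, 1, 1]
x = [1.2, 0.7, 0.1, 1.5, 0.9, 0.2, 0.3, 1.0]

result = fit_univariate_cox(times, events, x)
print(result)
```

A univariate screen in the spirit of this method would repeat such a fit for every column and keep the rows whose p-value falls below 0.05.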
- kaplan_meier_plot(column, split_strategy='median', fixed_value=None, output_path=None, percentage=None, custom_title=None, dpi=400, custom_high_low_names=('low', 'high'))
Generates a Kaplan-Meier survival plot for a specified variable.
- Parameters:
column (str) – Column name to use for grouping.
split_strategy (str, optional) – Strategy for splitting data into high/low groups. Options: ‘mean’, ‘median’, ‘percentage’, ‘fixed’. Defaults to ‘median’.
fixed_value (float, optional) – Fixed threshold value when split_strategy is ‘fixed’. Defaults to None.
output_path (str, optional) – Directory path to save the plot. If None, saves in current directory. Defaults to None.
percentage (float, optional) – Percentile threshold when split_strategy is ‘percentage’. Defaults to None.
custom_title (str, optional) – Custom title for the plot. If None, a default title will be generated based on the column and split strategy. Defaults to None.
dpi (int, optional) – Resolution of the output image in dots per inch. Higher values result in better quality but larger file sizes. Defaults to 400.
custom_high_low_names (Tuple[str, str], optional) – Custom names for the low and high groups, in that order. Defaults to (“low”, “high”).
- Returns:
Dictionary containing the log-rank test p-value, plot filename, and test statistic.
- Return type:
dict
Note
This method splits the data into “high” and “low” groups based on the specified variable and strategy, then generates a Kaplan-Meier survival plot comparing the two groups. It also performs a log-rank test to compare the survival curves.
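The split-and-compare logic described in the note can be sketched without any plotting. The example below assigns groups with a median split and runs a two-group log-rank test, returning a statistic and p-value similar in spirit to the dictionary this method returns. The function names, the rule that values strictly above the median count as “high”, and the sample data are assumptions, not the package's own code.

```python
import math
from statistics import median

def split_by_median(values):
    """Label each value 'high' if it is strictly above the median, else 'low'."""
    cutoff = median(values)
    return ["high" if v > cutoff else "low" for v in values]

def logrank_test(times, events, groups):
    """Two-group log-rank test; returns the chi-square statistic and its p-value."""
    observed = expected = variance = 0.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = [g for g, tj in zip(groups, times) if tj >= t]
        n = len(at_risk)
        n_high = at_risk.count("high")
        d = sum(1 for tj, e in zip(times, events) if e and tj == t)
        d_high = sum(
            1 for tj, e, g in zip(times, events, groups)
            if e and tj == t and g == "high"
        )
        observed += d_high
        expected += d * n_high / n
        if n > 1:
            variance += d * (n_high / n) * (1 - n_high / n) * (n - d) / (n - 1)
    statistic = (observed - expected) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return {"test_statistic": statistic, "p_value": p_value}

# Made-up data: one body composition value, follow-up days, event indicator.
values = [10, 12, 30, 35, 11, 33, 14, 31]
times = [5, 8, 20, 25, 6, 22, 9, 18]
events = [1, 1, 1, 0, 1, 0, 1, 1]

groups = split_by_median(values)
outcome = logrank_test(times, events, groups)
print(groups, outcome)
```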
- multivariate_cox_regression(columns, penalizer=0.1)
Performs multivariate Cox proportional hazards regression.
- Parameters:
columns (list) – List of predictor column names.
penalizer (float, optional) – L2 penalizer value to apply to the regression. Defaults to 0.1.
- Returns:
Fitted Cox proportional hazards model.
- Return type:
lifelines.CoxPHFitter
Note
This method fits a Cox regression model with all specified variables simultaneously. It handles multicollinearity by iteratively removing variables with high VIF values. The standardize parameter from the class initialization determines whether variables are standardized before analysis.
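The iterative VIF-based removal mentioned in the note can be sketched as follows. The variance inflation factor of each predictor is read off the diagonal of the inverse correlation matrix, and the predictor with the largest VIF is dropped until all remaining values fall below a cutoff. The cutoff of 5.0, the function names, and the sample columns are assumptions chosen for illustration; the source does not state which threshold or procedure the package uses.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix for a dict of equally long numeric lists."""
    names = list(columns)
    n = len(columns[names[0]])
    centered = {}
    for name in names:
        mean = sum(columns[name]) / n
        centered[name] = [v - mean for v in columns[name]]

    def corr(a, b):
        num = sum(p * q for p, q in zip(centered[a], centered[b]))
        den = math.sqrt(
            sum(p * p for p in centered[a]) * sum(q * q for q in centered[b])
        )
        return num / den

    return [[corr(a, b) for b in names] for a in names]

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def drop_high_vif(columns, threshold=5.0):
    """Repeatedly drop the predictor with the largest VIF above the threshold."""
    kept = dict(columns)
    while len(kept) > 1:
        inverse = invert(correlation_matrix(kept))
        vifs = {name: inverse[i][i] for i, name in enumerate(kept)}
        worst = max(vifs, key=vifs.get)
        if vifs[worst] <= threshold:  # threshold is an assumed cutoff
            break
        del kept[worst]
    return list(kept)

# Made-up predictors: 'muscle' and 'muscle_index' are nearly collinear.
predictors = {
    "muscle": [50, 55, 60, 65, 70, 75],
    "muscle_index": [25.1, 27.4, 30.2, 32.4, 35.1, 37.6],
    "fat": [30, 22, 35, 20, 33, 25],
}
kept = drop_high_vif(predictors)
print(kept)
```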