bca_survival.models module

Survival Analysis Utilities Module.

This module provides functions for performing survival analysis, including Cox proportional hazards regression models and Kaplan-Meier survival curves. It includes utilities for data preprocessing, multicollinearity checking, and visualization of results.

Requires: pandas, numpy, scikit-learn, lifelines, matplotlib, statsmodels, seaborn

bca_survival.models.standardize_columns(df, columns, nan_threshold=0.7)

Standardizes the numeric columns among those listed and drops columns with too many missing values (see nan_threshold).

Parameters:
  • df (pd.DataFrame) – The input dataframe.

  • columns (list) – List of column names to consider for standardization.

  • nan_threshold (float, optional) – Threshold for NaN values. Columns with more NaNs than this threshold will be dropped. Defaults to 0.7.

Returns:

DataFrame with standardized numeric columns.

Return type:

pd.DataFrame

Note

This function creates a copy of the dataframe and standardizes only the numeric columns using StandardScaler. Categorical columns are left unchanged.
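The behavior described in the note can be sketched with pandas alone. This is a minimal illustration, not the package's implementation: `standardize_numeric` is a hypothetical name, and the sketch assumes that `nan_threshold` is the fraction of missing values in a column and that the NaN check applies to every listed column.

```python
import numpy as np
import pandas as pd


def standardize_numeric(df, columns, nan_threshold=0.7):
    """Hypothetical stand-in for standardize_columns; not the package's code."""
    out = df.copy()  # leave the caller's dataframe untouched
    for col in columns:
        # Assumption: the threshold is the fraction of missing values in a column.
        if out[col].isna().mean() > nan_threshold:
            out = out.drop(columns=col)
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            # ddof=0 mirrors StandardScaler, which divides by the population std.
            out[col] = (out[col] - out[col].mean()) / out[col].std(ddof=0)
    return out


frame = pd.DataFrame({
    "muscle": [10.0, 12.0, 14.0, 16.0],
    "sex": ["f", "m", "f", "m"],
    "sparse": [np.nan, np.nan, np.nan, 1.0],
})
scaled = standardize_numeric(frame, ["muscle", "sex", "sparse"])
```

In this toy frame, "sparse" is 75% NaN and is dropped, "muscle" is rescaled to zero mean and unit variance, and the categorical "sex" column passes through unchanged.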

bca_survival.models.check_multicollinearity(df, columns)

Checks multicollinearity between variables using a correlation matrix.

Parameters:
  • df (pd.DataFrame) – The input dataframe.

  • columns (list) – List of column names to check for multicollinearity.

Returns:

Correlation matrix of the specified columns.

Return type:

pd.DataFrame

Note

This function also displays a heatmap of the correlation matrix.
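As a rough sketch of what a correlation-based check involves, the snippet below computes the Pearson correlation matrix with pandas and lists the strongly correlated pairs. The helper name `correlated_pairs` and the 0.8 cutoff are assumptions made for illustration; the documented function returns only the matrix.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, columns, cutoff=0.8):
    """Return the correlation matrix and the pairs with |r| >= cutoff."""
    corr = df[columns].corr()  # Pearson correlation by default
    pairs = []
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) >= cutoff:  # cutoff is an illustrative choice
                pairs.append((a, b, float(r)))
    return corr, pairs


rng = np.random.default_rng(0)
x = rng.normal(size=200)
frame = pd.DataFrame({
    "fat": x,
    "fat_copy": x + rng.normal(scale=0.05, size=200),  # near-duplicate of "fat"
    "bone": rng.normal(size=200),                      # unrelated column
})
corr, pairs = correlated_pairs(frame, ["fat", "fat_copy", "bone"])
```

A matrix like `corr` can be passed to `seaborn.heatmap` to produce the kind of heatmap the note mentions.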

bca_survival.models.perform_multivariate_cox_regression(df, columns, penalizer=0.1, standardize=True, vif_threshold=20)

Performs multivariate Cox proportional hazards regression.

Parameters:
  • df (pd.DataFrame) – The input dataframe. Must contain ‘days’ and ‘event’ columns.

  • columns (list) – List of predictor column names.

  • penalizer (float, optional) – L2 penalizer value to apply to the regression. Defaults to 0.1.

  • standardize (bool, optional) – Whether to standardize the columns. Defaults to True.

  • vif_threshold (float, optional) – Threshold for Variance Inflation Factor (VIF). Variables with VIF above this threshold will be removed. Defaults to 20.

Returns:

Fitted Cox proportional hazards model.

Return type:

lifelines.CoxPHFitter

Note

This function handles multicollinearity by iteratively removing variables with high VIF values until all variables have VIF below the threshold.
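The iterative removal described in the note might look roughly like the loop below. It is a sketch under stated assumptions: the helper names are hypothetical, one column (the one with the highest VIF) is removed per pass, and VIF is taken from the diagonal of the inverse correlation matrix, which is the textbook definition and may not match the package's numbers exactly.

```python
import numpy as np
import pandas as pd


def vif_from_correlation(df, columns):
    """VIF of each column, read off the diagonal of the inverse correlation matrix."""
    corr = df[columns].corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=columns)


def prune_by_vif(df, columns, vif_threshold=20):
    """Drop the highest-VIF column until every VIF is at or below the threshold."""
    kept = list(columns)
    while len(kept) > 1:
        vif = vif_from_correlation(df, kept)
        if vif.max() <= vif_threshold:
            break
        kept.remove(vif.idxmax())  # assumption: remove one column per pass
    return kept


rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = rng.normal(size=300)
frame = pd.DataFrame({
    "a": a,
    "b": b,
    "a_plus_b": a + b + rng.normal(scale=0.01, size=300),  # almost a linear combination
    "d": rng.normal(size=300),                             # independent column
})
kept = prune_by_vif(frame, ["a", "b", "a_plus_b", "d"])
```

Removing one column per pass and recomputing matters because a single removal often brings the VIF of the remaining correlated columns back under the threshold.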

bca_survival.models.perform_univariate_cox_regression(df, columns, standardize=False, penalizer=0, verbose=False, correction_values=None, nan_threshold=0.7, significant_only=True)

Performs univariate Cox proportional hazards regression for each variable.

Parameters:
  • df (pd.DataFrame) – The input dataframe. Must contain ‘days’ and ‘event’ columns.

  • columns (list) – List of predictor column names to test individually.

  • standardize (bool, optional) – Whether to standardize the columns. Defaults to False.

  • penalizer (float, optional) – L2 penalizer value to apply to the regression. Defaults to 0.

  • verbose (bool, optional) – Whether to print detailed progress information. Defaults to False.

  • correction_values (list, optional) – List of column names to include as adjustment covariates in each univariate model, typically to correct for age or gender effects. Defaults to None.

  • nan_threshold (float, optional) – Threshold for NaN values if standardizing. Defaults to 0.7.

  • significant_only (bool, optional) – Whether to return only statistically significant variables (p < 0.05). Defaults to True.

Returns:

DataFrame containing the tested variables and their statistics, restricted to significant variables when significant_only is True.

Return type:

pd.DataFrame

Note

This function tests each variable individually in a Cox regression model and, when significant_only is True (the default), returns only statistically significant variables (p < 0.05).

bca_survival.models.generate_kaplan_meier_plot(df, column, split_strategy='median', fixed_value=None, percentage=None, output_path=None, dpi=600, custom_title=None, display_plot=False, custom_high_low_names=('low', 'high'))

Generates a Kaplan-Meier survival plot for a specified variable.

Parameters:
  • df (pd.DataFrame) – The input dataframe. Must contain ‘days’ and ‘event’ columns.

  • column (str) – Column name to use for grouping.

  • split_strategy (str, optional) – Strategy for splitting data into high/low groups. Options: ‘mean’, ‘median’, ‘percentage’, ‘fixed’, ‘quantile’. Defaults to ‘median’.

  • fixed_value (float, optional) – Fixed threshold value when split_strategy is ‘fixed’. Useful when a cutoff value is known from the literature. Defaults to None.

  • percentage (float, optional) – Percentile threshold when split_strategy is ‘percentage’. Defaults to None.

  • output_path (str, optional) – Directory in which to save the plot. If None, the plot is saved in the current directory. Defaults to None.

  • dpi (int, optional) – Resolution of the output image in dots per inch. Higher values result in better quality but larger file sizes. Defaults to 600.

  • custom_title (str, optional) – Custom title for the plot. If None, a default title will be generated based on the column and split strategy. Defaults to None.

  • display_plot (bool, optional) – Whether to display the plot in the notebook. If False, the plot is only saved to file without rendering. Defaults to False.

  • custom_high_low_names (Tuple[str, str], optional) – Custom names for the low and high groups, in that order. Defaults to (“low”, “high”).

Returns:

Dictionary containing the log-rank test p-value, plot filename, and test statistic.

Return type:

dict

Raises:

ValueError – If an invalid split_strategy is provided or if required parameters for a particular strategy are missing.

Note

This function splits the data into “high” and “low” groups based on the specified variable and strategy, then generates a Kaplan-Meier survival plot comparing the two groups. It also performs a log-rank test to compare the survival curves.
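To make the splitting step concrete, here is a minimal sketch of how a threshold could turn a numeric column into two groups. `split_high_low` is a hypothetical helper that covers only the 'median', 'mean', and 'fixed' strategies. Whether a value equal to the threshold falls into the low or high group is an assumption here, and the 'percentage' and 'quantile' strategies are left out because their exact semantics are not documented above.

```python
import pandas as pd


def split_high_low(df, column, split_strategy="median", fixed_value=None,
                   names=("low", "high")):
    """Label each row as low or high; a hypothetical helper, not the package's code."""
    if split_strategy == "median":
        threshold = df[column].median()
    elif split_strategy == "mean":
        threshold = df[column].mean()
    elif split_strategy == "fixed":
        if fixed_value is None:
            raise ValueError("split_strategy='fixed' requires fixed_value")
        threshold = fixed_value
    else:
        raise ValueError(f"unsupported split_strategy: {split_strategy!r}")
    # Assumption: values strictly above the threshold count as high.
    is_high = df[column] > threshold
    return is_high.map({False: names[0], True: names[1]}), threshold


frame = pd.DataFrame({"muscle": [10.0, 20.0, 30.0, 40.0]})
median_labels, median_cut = split_high_low(frame, "muscle")
fixed_labels, fixed_cut = split_high_low(frame, "muscle", "fixed", fixed_value=15.0)
```

The resulting labels define the two groups whose survival curves are then compared with the log-rank test.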

bca_survival.models.calculate_vif(df, columns)

Calculates the Variance Inflation Factor (VIF) for each variable.

Parameters:
  • df (pd.DataFrame) – The input dataframe.

  • columns (list) – List of column names to calculate VIF for.

Returns:

DataFrame containing variables and their corresponding VIF values.

Return type:

pd.DataFrame

Note

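VIF is a measure of multicollinearity. A value of 1 means a variable is uncorrelated with the others; higher values indicate stronger correlation. VIF > 10 is often considered problematic, although perform_multivariate_cox_regression defaults to a more permissive threshold of 20.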
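For reference, the textbook definition is VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing variable j on all the other variables. The sketch below computes it with NumPy least squares. The function name `vif_table` and the output column names are assumptions, and the package may compute VIF differently (for example via statsmodels), so the values need not match exactly.

```python
import numpy as np
import pandas as pd


def vif_table(df, columns):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    rows = []
    for col in columns:
        others = [c for c in columns if c != col]
        y = df[col].to_numpy(dtype=float)
        # Design matrix: intercept plus the remaining predictors.
        X = np.column_stack(
            [np.ones(len(df))] + [df[c].to_numpy(dtype=float) for c in others]
        )
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        residual = y - X @ beta
        r_squared = 1.0 - residual.var() / y.var()
        rows.append({"variable": col, "VIF": 1.0 / (1.0 - r_squared)})
    return pd.DataFrame(rows)


rng = np.random.default_rng(2)
x1 = rng.normal(size=500)
frame = pd.DataFrame({
    "x1": x1,
    "x2": rng.normal(size=500),                      # independent of x1
    "x3": x1 + rng.normal(scale=0.1, size=500),      # nearly a copy of x1
})
vif = vif_table(frame, ["x1", "x2", "x3"])
```

In this synthetic frame, x1 and x3 receive large VIF values because each is almost fully explained by the other, while x2 stays close to 1.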