bca_survival.models module
Survival Analysis Utilities Module.
This module provides functions for performing survival analysis, including Cox proportional hazards regression models and Kaplan-Meier survival curves. It includes utilities for data preprocessing, multicollinearity checking, and visualization of results.
Requires: pandas, numpy, scikit-learn, lifelines, matplotlib, statsmodels, seaborn
- bca_survival.models.standardize_columns(df, columns, nan_threshold=0.7)
Standardizes only numeric columns and handles missing values.
- Parameters:
df (pd.DataFrame) – The input dataframe.
columns (list) – List of column names to consider for standardization.
nan_threshold (float, optional) – Maximum allowed fraction of NaN values per column. Columns whose NaN fraction exceeds this threshold are dropped. Defaults to 0.7.
- Returns:
DataFrame with standardized numeric columns.
- Return type:
pd.DataFrame
Note
This function creates a copy of the dataframe and standardizes only the numeric columns using StandardScaler. Categorical columns are left unchanged.
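The behaviour described in the note can be sketched in a few lines. This is a minimal illustration, not the module's implementation: `standardize_numeric` is a hypothetical name, plain pandas arithmetic stands in for StandardScaler, and `nan_threshold` is assumed to be a fraction of missing values per column.

```python
import numpy as np
import pandas as pd


def standardize_numeric(df, columns, nan_threshold=0.7):
    """Hypothetical stand-in for standardize_columns; not the module's code."""
    out = df.copy()  # work on a copy so the caller's dataframe is untouched
    for col in columns:
        # Assumption: nan_threshold is the allowed *fraction* of NaNs.
        if out[col].isna().mean() > nan_threshold:
            out = out.drop(columns=[col])
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            mean = out[col].mean()       # NaNs are skipped by default
            std = out[col].std(ddof=0)   # population std, as StandardScaler uses
            if std == 0:
                std = 1.0                # constant column: only centre it
            out[col] = (out[col] - mean) / std
    return out


patients = pd.DataFrame({
    "age": [30.0, 40.0, 50.0, 60.0],
    "sex": ["m", "f", "m", "f"],                  # categorical, left unchanged
    "rare_marker": [np.nan, np.nan, np.nan, 1.2]  # 75% missing, dropped
})
scaled = standardize_numeric(patients, ["age", "sex", "rare_marker"])
print(scaled)
```

After the call, `age` has zero mean and unit variance, `sex` is untouched, and `rare_marker` is gone because 75% of its values are missing.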
- bca_survival.models.check_multicollinearity(df, columns)
Checks multicollinearity between variables using a correlation matrix.
- Parameters:
df (pd.DataFrame) – The input dataframe.
columns (list) – List of column names to check for multicollinearity.
- Returns:
Correlation matrix of the specified columns.
- Return type:
pd.DataFrame
Note
This function also displays a heatmap of the correlation matrix.
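A correlation-based check of this kind might look like the sketch below. The helper name `correlated_pairs` and the `cutoff` argument are hypothetical additions for illustration; the documented function returns only the correlation matrix. The heatmap is left out so the example runs without a plotting backend.

```python
import pandas as pd


def correlated_pairs(df, columns, cutoff=0.9):
    """Return the Pearson correlation matrix plus the strongly correlated pairs.

    Hypothetical helper: `cutoff` is an illustrative value, not a parameter
    of check_multicollinearity.
    """
    cols = list(columns)
    corr = df[cols].corr()  # pairwise Pearson correlation
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, skip the diagonal
            r = float(corr.loc[a, b])
            if abs(r) > cutoff:
                pairs.append((a, b, r))
    return corr, pairs


measurements = pd.DataFrame({
    "muscle": [1.0, 2.0, 3.0, 4.0],
    "muscle_x2": [2.0, 4.0, 6.0, 8.0],  # exact multiple of "muscle"
    "fat": [4.0, 1.0, 3.0, 2.0],
})
corr, pairs = correlated_pairs(measurements, ["muscle", "muscle_x2", "fat"])
print(corr.round(2))
print(pairs)
# seaborn.heatmap(corr, annot=True) would draw the matrix as a heatmap.
```

Pairs flagged this way are candidates for removal before fitting a multivariate model.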
- bca_survival.models.perform_multivariate_cox_regression(df, columns, penalizer=0.1, standardize=True, vif_threshold=20)
Performs multivariate Cox proportional hazards regression.
- Parameters:
df (pd.DataFrame) – The input dataframe. Must contain ‘days’ and ‘event’ columns.
columns (list) – List of predictor column names.
penalizer (float, optional) – L2 penalizer value to apply to the regression. Defaults to 0.1.
standardize (bool, optional) – Whether to standardize the columns. Defaults to True.
vif_threshold (float, optional) – Threshold for Variance Inflation Factor (VIF). Variables with VIF above this threshold will be removed. Defaults to 20.
- Returns:
Fitted Cox proportional hazards model.
- Return type:
lifelines.CoxPHFitter
Note
This function handles multicollinearity by iteratively removing variables with high VIF values until all variables have VIF below the threshold.
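The iterative VIF filtering mentioned in the note could be sketched as follows. This is an illustration under stated assumptions, not the module's code: `drop_high_vif` and `vif_from_correlation` are hypothetical names, one variable (the one with the highest VIF) is removed per iteration, and VIFs are read off the diagonal of the inverse correlation matrix, which equals 1 / (1 - R²) for each predictor. The Cox fit itself is omitted.

```python
import numpy as np
import pandas as pd


def vif_from_correlation(df, columns):
    """VIF of each column, taken from the diagonal of the inverse correlation matrix."""
    corr = df[list(columns)].corr().to_numpy()
    # np.linalg.inv raises LinAlgError if columns are perfectly collinear.
    return pd.Series(np.diag(np.linalg.inv(corr)), index=list(columns))


def drop_high_vif(df, columns, vif_threshold=20):
    """Remove the worst offender until every remaining VIF is at or below the threshold."""
    kept = list(columns)
    while len(kept) > 1:
        vif = vif_from_correlation(df, kept)
        worst = vif.idxmax()
        if vif[worst] <= vif_threshold:
            break
        kept.remove(worst)  # assumption: drop one variable per iteration
    return kept


rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
data = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.01 * rng.normal(size=200),  # nearly a copy of x1
    "x3": rng.normal(size=200),              # independent predictor
})
kept = drop_high_vif(data, ["x1", "x2", "x3"], vif_threshold=20)
print(kept)
```

Only one of the two near-duplicate predictors survives, while the independent one is always kept.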
- bca_survival.models.perform_univariate_cox_regression(df, columns, standardize=False, penalizer=0, verbose=False, correction_values=None, nan_threshold=0.7, significant_only=True)
Performs univariate Cox proportional hazards regression for each variable.
- Parameters:
df (pd.DataFrame) – The input dataframe. Must contain ‘days’ and ‘event’ columns.
columns (list) – List of predictor column names to test individually.
standardize (bool, optional) – Whether to standardize the columns. Defaults to False.
penalizer (float, optional) – L2 penalizer value to apply to the regression. Defaults to 0.
verbose (bool, optional) – Whether to print detailed progress information. Defaults to False.
correction_values (list, optional) – List of column names to include as correction terms in each univariate model. Often you’ll use this to correct for age or gender effects. Defaults to None.
nan_threshold (float, optional) – Threshold for NaN values, applied only when standardize is True (see standardize_columns). Defaults to 0.7.
significant_only (bool, optional) – Whether to return only statistically significant variables (p < 0.05). Defaults to True.
- Returns:
DataFrame containing the tested variables and their statistics, restricted to significant variables when significant_only is True.
- Return type:
pd.DataFrame
Note
This function tests each variable individually in a Cox regression model. With significant_only=True (the default), only statistically significant variables (p < 0.05) are returned.
- bca_survival.models.generate_kaplan_meier_plot(df, column, split_strategy='median', fixed_value=None, percentage=None, output_path=None, dpi=600, custom_title=None, display_plot=False, custom_high_low_names=('low', 'high'))
Generates a Kaplan-Meier survival plot for a specified variable.
- Parameters:
df (pd.DataFrame) – The input dataframe. Must contain ‘days’ and ‘event’ columns.
column (str) – Column name to use for grouping.
split_strategy (str, optional) – Strategy for splitting data into high/low groups. Options: ‘mean’, ‘median’, ‘percentage’, ‘fixed’, ‘quantile’. Defaults to ‘median’.
fixed_value (float, optional) – Fixed threshold value when split_strategy is ‘fixed’. You can use this when you have found cutoff values from literature. Defaults to None.
percentage (float, optional) – Percentile threshold when split_strategy is ‘percentage’. Defaults to None.
output_path (str, optional) – Directory path to save the plot. If None, saves in current directory. Defaults to None.
dpi (int, optional) – Resolution of the output image in dots per inch. Higher values result in better quality but larger file sizes. Defaults to 600.
custom_title (str, optional) – Custom title for the plot. If None, a default title will be generated based on the column and split strategy. Defaults to None.
display_plot (bool, optional) – Whether to display the plot in the notebook. If False, the plot is only saved to file without rendering. Defaults to False.
custom_high_low_names (Tuple[str, str], optional) – Custom display names for the high and low groups. Defaults to ('low', 'high').
- Returns:
Dictionary containing the log-rank test p-value, plot filename, and test statistic.
- Return type:
dict
- Raises:
ValueError – If an invalid split_strategy is provided or if required parameters for a particular strategy are missing.
Note
This function splits the data into “high” and “low” groups based on the specified variable and strategy, then generates a Kaplan-Meier survival plot comparing the two groups. It also performs a log-rank test to compare the survival curves.
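The split and the survival estimate behind such a plot can be sketched without any plotting code. The block below is an illustration, not the module's implementation: `split_high_low` and `kaplan_meier` are hypothetical names, values equal to the threshold are assumed to fall in the low group, `percentage` is assumed to be on a 0 to 100 scale, and the 'quantile' strategy is left out because its semantics are not documented here. The log-rank test and the figure are omitted.

```python
import numpy as np
import pandas as pd


def split_high_low(df, column, split_strategy="median",
                   fixed_value=None, percentage=None):
    """Split df into (low, high, threshold) on `column`; illustrative only."""
    data = df.dropna(subset=[column])
    values = data[column]
    if split_strategy == "median":
        threshold = values.median()
    elif split_strategy == "mean":
        threshold = values.mean()
    elif split_strategy == "fixed":
        if fixed_value is None:
            raise ValueError("split_strategy='fixed' requires fixed_value")
        threshold = fixed_value
    elif split_strategy == "percentage":
        if percentage is None:
            raise ValueError("split_strategy='percentage' requires percentage")
        threshold = values.quantile(percentage / 100.0)  # assumed 0-100 scale
    else:
        raise ValueError(f"strategy not covered by this sketch: {split_strategy!r}")
    is_high = values > threshold  # assumption: ties go to the low group
    return data[~is_high], data[is_high], threshold


def kaplan_meier(days, event):
    """Product-limit estimate S(t) at each observed event time."""
    days = np.asarray(days, dtype=float)
    event = np.asarray(event, dtype=bool)
    survival, s = {}, 1.0
    for t in np.unique(days[event]):             # sorted event times
        at_risk = np.sum(days >= t)              # still under observation at t
        deaths = np.sum((days == t) & event)     # events exactly at t
        s *= 1.0 - deaths / at_risk
        survival[float(t)] = s
    return pd.Series(survival, dtype=float)


cohort = pd.DataFrame({
    "marker": [1, 2, 3, 4, 5, 6],
    "days": [100, 200, 150, 400, 300, 500],
    "event": [1, 1, 0, 1, 0, 1],
})
low, high, threshold = split_high_low(cohort, "marker", split_strategy="median")
print(threshold)
print(kaplan_meier(low["days"], low["event"]))
print(kaplan_meier(high["days"], high["event"]))
```

Each returned series gives the step heights of one survival curve; censored observations (event = 0) reduce the number at risk without producing a step.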
- bca_survival.models.calculate_vif(df, columns)
Calculates the Variance Inflation Factor (VIF) for each variable.
- Parameters:
df (pd.DataFrame) – The input dataframe.
columns (list) – List of column names to calculate VIF for.
- Returns:
DataFrame containing variables and their corresponding VIF values.
- Return type:
pd.DataFrame
Note
VIF is a measure of multicollinearity. Higher values indicate stronger correlation with other variables. VIF > 10 is often considered problematic.
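For reference, the VIF of a variable is 1 / (1 - R²), where R² comes from regressing that variable on all the others. The sketch below computes it directly from that definition with numpy. It is a hand-written illustration, not necessarily how the module computes VIF; the function name and the output column names `variable` and `VIF` are assumptions.

```python
import numpy as np
import pandas as pd


def vif_by_regression(df, columns):
    """Compute VIF_j = 1 / (1 - R_j^2) by regressing each column on the rest."""
    cols = list(columns)
    X = df[cols].to_numpy(dtype=float)
    rows = []
    for j, name in enumerate(cols):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(len(y)), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        r_squared = 1.0 - residuals.var() / y.var()
        rows.append({"variable": name, "VIF": 1.0 / (1.0 - r_squared)})
    return pd.DataFrame(rows)


toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "c": [4.0, 1.0, 3.0, 2.0],  # correlation with "a" is -0.4
})
vif = vif_by_regression(toy, ["a", "c"])
print(vif)
```

With only two predictors, both share the same VIF, 1 / (1 - r²); here r = -0.4, so each VIF is 1 / 0.84, roughly 1.19, well below the usual concern levels.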