bca_survival.preprocessing module

Survival Data Preprocessing Module.

This module provides utility functions for preprocessing survival analysis data, including calculating time-to-event durations, handling missing or invalid data, creating event indicators, and computing tissue ratios from the BCA values.

Requires: pandas

bca_survival.preprocessing.calculate_days(df, start_date_col, event_date_col, event_col)

Calculates the number of days between two date columns and sets an event indicator.

Parameters:
  • df (pd.DataFrame) – The input dataframe.

  • start_date_col (str) – Name of the column containing start dates.

  • event_date_col (str, optional) – Name of the column containing event dates. If None, only the event indicator will be created.

  • event_col (str) – Name of the column containing event indicators (1/0 or True/False).

Returns:

DataFrame with added ‘days’ and ‘event’ columns.

Return type:

pd.DataFrame

Note

The function expects dates in the format ‘%d.%m.%Y’ (e.g., ‘31.12.2020’). The ‘days’ column represents the time between start and event dates. The ‘event’ column is converted to integer type.
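The date handling described in this note can be sketched with plain pandas. This is a minimal illustration under stated assumptions, not the module's actual implementation: the toy data and the column names ‘start’, ‘end’, and ‘died’ are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "start": ["01.01.2020", "15.03.2020"],
    "end": ["31.12.2020", "20.03.2020"],
    "died": [True, False],
})

# Parse day-first dates using the documented '%d.%m.%Y' format.
start = pd.to_datetime(df["start"], format="%d.%m.%Y")
end = pd.to_datetime(df["end"], format="%d.%m.%Y")

# Duration in whole days, plus an integer event indicator.
df["days"] = (end - start).dt.days
df["event"] = df["died"].astype(int)

print(df[["days", "event"]])
```

Going by the documented signature, the corresponding call would be calculate_days(df, "start", "end", "died").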

bca_survival.preprocessing.check_and_remove_negative_days(df)

Checks for and removes rows with negative or NaN values in the ‘days’ column.

Parameters:

df (pd.DataFrame) – The input dataframe with a ‘days’ column.

Returns:

A tuple containing:
  • pd.DataFrame: DataFrame with negative and NaN ‘days’ values removed.

  • pd.DataFrame or None: DataFrame containing only the removed rows, or None if no rows were removed.

Return type:

tuple

Note

Negative ‘days’ values can occur due to data entry errors or when an event is recorded before the start date. This function identifies and removes such records, and prints a warning if any rows are removed.
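The filtering and the two-part return value can be sketched as follows. This is an illustration of the documented behaviour rather than the module's code; the toy data, the ‘id’ column, and the wording of the warning are assumptions. A value of zero is kept here, since only negative and NaN values are described as invalid.

```python
import pandas as pd

# Hypothetical toy data with one negative and one missing duration.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "days": [120.0, -3.0, float("nan"), 0.0],
})

# A row is invalid if 'days' is missing or negative.
invalid = df["days"].isna() | (df["days"] < 0)

kept = df[~invalid]
# Mirror the documented return value: None when nothing was removed.
removed = df[invalid] if invalid.any() else None

if removed is not None:
    print(f"Warning: removed {len(removed)} row(s) with negative or NaN 'days'.")
```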

bca_survival.preprocessing.create_event_date_column(df, date_death, date_disease_death, date_followup)

Creates an event date column and event indicator based on multiple date columns. This is used to prepare for Overall Survival analysis.

Parameters:
  • df (pd.DataFrame) – The input dataframe.

  • date_death (str) – Column name containing the date of death.

  • date_disease_death (str) – Column name containing the date of disease-specific death.

  • date_followup (str) – Column name containing the date of last follow-up.

Returns:

DataFrame with added ‘event_date’ and ‘event’ columns.

Return type:

pd.DataFrame

Note

This function prioritizes death dates over follow-up dates. It sets the event indicator to True if either death date is present, and False if only the follow-up date is available. If no dates are available, both columns are set to NaN.
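The prioritization rule can be sketched with pandas as shown here. This is a sketch under assumptions, not the module's implementation: the column names and data are hypothetical, and the preference for the general death date over the disease-specific one when both are present is a guess, since the note does not specify it.

```python
import pandas as pd

# Hypothetical toy data: one row per scenario described in the note.
df = pd.DataFrame({
    "death": ["05.06.2021", None, None, None],
    "disease_death": [None, "10.07.2021", None, None],
    "followup": ["01.01.2022", "01.01.2022", "01.01.2022", None],
})

# Assumption: 'death' takes precedence over 'disease_death' if both exist.
death_any = df["death"].fillna(df["disease_death"])

# Death dates take priority; fall back to the last follow-up date.
df["event_date"] = death_any.fillna(df["followup"])

# True if any death date exists, False for follow-up only,
# NaN when no date at all is available.
df["event"] = death_any.notna().astype(object).where(df["event_date"].notna())

print(df[["event_date", "event"]])
```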

bca_survival.preprocessing.compute_ratios(df)

Computes ratios between different tissue measurements across body parts and metrics.

Parameters:

df (pd.DataFrame) – The input dataframe containing tissue measurement columns.

Returns:

DataFrame with additional columns for computed ratios.

Return type:

pd.DataFrame

Note

This function calculates ratios such as intramuscular adipose tissue to total adipose tissue (imat/tat) and visceral adipose tissue to total adipose tissue (vat/tat) for various body parts and metrics.

The column naming convention is:
  • ‘{body_part}::WL::{tissue_type}::{metric}’ for measurements

  • ‘{body_part}::WL::{numerator}/{denominator}::{metric}’ for ratios

For example, ‘l5::WL::imat/tat::mean_ml’ is the ratio of the mean volume (in milliliters) of intramuscular adipose tissue to that of total adipose tissue at the L5 vertebra level.
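The naming convention can be illustrated by building one ratio column by hand. This is a minimal sketch with hypothetical values, not the module's implementation; it covers a single body part, metric, and tissue pair, whereas the function is described as handling many. How the module treats zero denominators is not documented and is not addressed here.

```python
import pandas as pd

# Hypothetical measurements following the documented naming convention.
df = pd.DataFrame({
    "l5::WL::imat::mean_ml": [10.0, 20.0],
    "l5::WL::tat::mean_ml": [100.0, 80.0],
})

body_part, metric = "l5", "mean_ml"
numerator, denominator = "imat", "tat"

# Build the source and target column names from the convention.
num_col = f"{body_part}::WL::{numerator}::{metric}"
den_col = f"{body_part}::WL::{denominator}::{metric}"
ratio_col = f"{body_part}::WL::{numerator}/{denominator}::{metric}"

# Element-wise ratio stored under the ratio column name.
df[ratio_col] = df[num_col] / df[den_col]

print(df[ratio_col])
```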