Core engineering library · Engineer

data_utilities.py — DataFrame Toolbox

BCHPR · 2023 – present

5,764-line DataFrame utility library: smart deduplication with null prioritisation, column manipulation, HTML cleaning, Cameroon-specific phone-number standardisation, and Polars bulk operations.

Highlights

  • Deduplication prioritising non-null values with custom sort columns (e.g. prefer most-recent date).
  • Bulk column operations: ensure_columns, move_column_after, rename_in_bulk, drop_duplicates_prioritize_non_null.
  • Cameroon phone cleaner with international-format standardisation and SMS-friendly output.
  • Polars ⇄ pandas bridge with automatic 5-10× speedup on large operations.
  • HTML-tag stripper for REDCap text exports with malformed markup.
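
The null-prioritising deduplication in the first highlight can be sketched in plain pandas. The function name is the one listed above, but the signature (`subset`, `sort_by`, `ascending`) and the body are assumptions made for illustration, not the library's actual code.

```python
import pandas as pd


def drop_duplicates_prioritize_non_null(df, subset, sort_by=None, ascending=False):
    """Keep one row per `subset` key, preferring the most complete row.

    Hypothetical signature. Ties on completeness are broken by `sort_by`,
    descending by default so that the most recent date wins.
    """
    out = df.copy()
    # Fewer nulls in a row means a more complete record.
    out["_null_count"] = out.isna().sum(axis=1)
    sort_cols = ["_null_count"]
    sort_order = [True]
    if sort_by is not None:
        sort_cols.append(sort_by)
        sort_order.append(ascending)
    out = out.sort_values(sort_cols, ascending=sort_order)
    # After sorting, the first row of each key group is the preferred one.
    out = out.drop_duplicates(subset=subset, keep="first")
    return out.drop(columns="_null_count").sort_index()


# Illustrative data only.
records = pd.DataFrame(
    {
        "patient_id": [1, 1, 2, 2],
        "phone": [None, "677123456", "699000111", "699000222"],
        "visit_date": pd.to_datetime(
            ["2024-03-01", "2024-01-15", "2024-01-10", "2024-02-20"]
        ),
    }
)
deduped = drop_duplicates_prioritize_non_null(
    records, subset=["patient_id"], sort_by="visit_date"
)
```

For patient 1 the row with a phone number is kept even though it is older, because completeness is ranked ahead of recency in this sketch. For patient 2 both rows are complete, so the more recent visit is kept.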
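
Two of the bulk column helpers named above, `ensure_columns` and `move_column_after`, might look roughly like this. Only the names come from the list; the parameters and behaviour are assumptions.

```python
import pandas as pd


def ensure_columns(df, columns, fill_value=pd.NA):
    """Return a copy of `df` with every name in `columns` present.

    Hypothetical signature: missing columns are appended and filled
    with `fill_value`; existing columns are left untouched.
    """
    out = df.copy()
    for col in columns:
        if col not in out.columns:
            out[col] = fill_value
    return out


def move_column_after(df, column, after):
    """Return `df` with `column` repositioned directly after `after`."""
    cols = [c for c in df.columns if c != column]
    cols.insert(cols.index(after) + 1, column)
    return df[cols]


# Illustrative data only.
frame = pd.DataFrame({"id": [1], "name": ["a"], "phone": ["677123456"]})
frame = ensure_columns(frame, ["email"])
frame = move_column_after(frame, "email", "id")
```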
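
A minimal sketch of the phone standardisation idea follows. The function name and validation rules are hypothetical: it assumes Cameroon's country code 237 and 9-digit national numbers beginning with 6 (mobile) or 2 (fixed line), and returns a `+237…` string with no spaces, which is the form SMS gateways generally expect.

```python
import re

CAMEROON_CODE = "237"


def standardise_cameroon_phone(raw):
    """Return '+237XXXXXXXXX', or None if the input does not look valid.

    Hypothetical helper, not the library's actual API.
    """
    if raw is None:
        return None
    # Drop spaces, dashes, dots, parentheses and the leading '+'.
    digits = re.sub(r"\D", "", str(raw))
    # Strip an international prefix written as 00237 or 237.
    if digits.startswith("00" + CAMEROON_CODE):
        digits = digits[5:]
    elif digits.startswith(CAMEROON_CODE) and len(digits) == 12:
        digits = digits[3:]
    # Assumed rule: 9 digits, first digit 6 (mobile) or 2 (fixed line).
    if len(digits) == 9 and digits[0] in "62":
        return "+" + CAMEROON_CODE + digits
    return None
```

Returning `None` for anything that fails validation keeps bad numbers out of an SMS batch instead of guessing at a correction.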
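
The HTML-tag stripper can be approximated with the standard library alone. This is a sketch under the assumption that a tolerant regex pass is acceptable for malformed export text; the name `strip_html` is hypothetical and the library may well use a different approach.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]*>")
_WS_RE = re.compile(r"\s+")


def strip_html(text):
    """Remove tags, decode entities, and collapse whitespace.

    Hypothetical helper. Unclosed or mismatched tags are removed like
    any other tag because no attempt is made to parse the structure.
    """
    if text is None:
        return None
    # Replace each tag with a space so adjacent words do not fuse.
    no_tags = _TAG_RE.sub(" ", str(text))
    unescaped = html.unescape(no_tags)
    return _WS_RE.sub(" ", unescaped).strip()


cleaned = strip_html("<p>Fever &amp; cough<br>for <b>3 days</p>")
```

One limitation of this regex approach: plain text that contains a literal `<` followed later by `>` (for example a numeric range) would be treated as a tag and removed.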