Evaluation of race/ethnicity-specific survival machine learning models for Hispanic and Black patients with breast cancer

Jung In Park; Selen Bozkurt; Jong Won Park; Sunmin Lee

doi:10.1136/bmjhci-2022-100666

Original research

Jung In Park¹ (ORCID: http://orcid.org/0000-0002-1771-7361), Selen Bozkurt², Jong Won Park³ and Sunmin Lee⁴

¹Sue & Bill Gross School of Nursing, University of California Irvine, Irvine, California, USA
²Stanford University, Stanford, California, USA
³Yonsei University College of Medicine, Seoul, Seodaemun-gu, Korea (the Republic of)
⁴School of Medicine, University of California Irvine, Irvine, California, USA

Correspondence to Dr Jung In Park; junginp{at}uci.edu

Abstract

Objectives Survival machine learning (ML) has been suggested as a useful approach for forecasting future events, but a growing concern exists that ML models have the potential to cause racial disparities through the data used to train them. This study aims to develop race/ethnicity-specific survival ML models for Hispanic and black women diagnosed with breast cancer to examine whether race/ethnicity-specific ML models outperform the general models trained with all races/ethnicity data.

Methods We used the data from the US National Cancer Institute’s Surveillance, Epidemiology and End Results programme registries. We developed the Hispanic-specific and black-specific models and compared them with the general model using the Cox proportional-hazards model, Gradient Boost Tree, survival tree and survival support vector machine.

Results A total of 322 348 female patients who had breast cancer diagnoses between 1 January 2000 and 31 December 2017 were identified. The race/ethnicity-specific models for Hispanic and black women consistently outperformed the general model when predicting the outcomes of specific race/ethnicity.

Discussion Accurately predicting the survival outcome of a patient is critical in determining treatment options and providing appropriate cancer care. The high-performing models developed in this study can contribute to providing individualised oncology care and improving the survival outcome of black and Hispanic women.

Conclusion Predicting the individualised survival outcome of breast cancer can provide the evidence necessary for determining treatment options and high-quality, patient-centred cancer care delivery for under-represented populations. Also, the race/ethnicity-specific ML models can mitigate representation bias and contribute to addressing health disparities.

machine learning
artificial intelligence
health equity
informatics

Data availability statement

Data may be obtained from a third party and are not publicly available. We used the National Cancer Institute's Surveillance, Epidemiology and End Results (SEER) Program data. It provides information on cancer statistics in an effort to reduce the cancer burden among the US population.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

WHAT IS ALREADY KNOWN ON THIS TOPIC

Survival machine learning allows healthcare professionals to identify patients at high risk, but models trained with data that poorly represent minority groups may exacerbate health disparities. To date, no study has developed race/ethnicity-specific survival machine learning models for Hispanic and black women diagnosed with breast cancer.

WHAT THIS STUDY ADDS

The race/ethnicity-specific survival machine learning models outperformed the general models trained with all races/ethnicity when predicting the outcomes of specific races/ethnicity.

HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY

Predicting the individualised survival outcome of breast cancer can provide the evidence necessary for determining treatment options and high-quality, patient-centred cancer care delivery for underrepresented populations. Also, the race/ethnicity-specific machine learning models can mitigate representation bias and contribute to addressing health disparities.

Introduction

Breast cancer is the second-leading cause of cancer-related deaths in women in the USA, and it affects every ethnic group of women in the USA.1 2 However, there are racial and ethnic divides in cancer survival. Breast cancer is the most common cause of cancer-related death in Hispanic women in the USA.3 Also, minority women, especially black women, have a higher mortality rate (26.8 per 100 000 women) than white women (18.8 per 100 000 women), even though white women have a higher cancer incidence.2 4 5 These facts indicate that cancer survival rates need to be improved among Hispanic and black women, and that the various features contributing to breast cancer mortality should be understood to provide tailored interventions for enhanced survival.

Unlike traditional survival models that use a standard statistical method, survival machine learning (ML) has been suggested as a useful approach for learning patterns from high-dimensional data and complex feature interactions for forecasting future events.6 This approach allows healthcare professionals to identify patients at high risk or predict those who need increased utilisation of healthcare services, so that they can proactively support these patients and provide the necessary interventions.7 However, a growing concern exists that ML models have the potential to cause racial disparities through the data used to train them.8 An ML model trained with data representing the general population may not contain a sufficient number of participants from a minority population and can therefore be biased, resulting in inaccurate predictions for the minority group even if the overall accuracy is high.9 If ML models trained with data poorly representative of minority groups are used in healthcare, they may exacerbate health disparities.10 To address such harmful effects, it is recommended to train an ML model with data that resemble the population for which the model is intended to be used.11 12 To the best of our knowledge, no study has developed race/ethnicity-specific survival ML models for Hispanic and black women diagnosed with breast cancer.

Therefore, there is a need for race/ethnicity-specific survival ML models trained with the underrepresented populations to examine the feasibility of race/ethnicity-specific ML models that may outperform the general model trained with all races/ethnicity. Accurate prediction of the individualised outcome will enable tailored healthcare delivery and a better outcome for the underrepresented populations. This study aims to develop race/ethnicity-specific survival ML models for Hispanic and black women diagnosed with breast cancer to examine whether race/ethnicity-specific ML models outperform the models trained with the general population data when predicting the survival of Hispanic and black women diagnosed with breast cancer.

Methods

Data source

We used the data from the US National Cancer Institute’s population-based Surveillance, Epidemiology and End Results (SEER) programme registries. The SEER programme currently collects and publishes cancer incidence and survival data in the USA from population-based cancer registries in 22 geographical areas, representing approximately 48% of the US population.13 The SEER data are considered the gold standard for data quality among cancer registries in the USA and globally.14 We selected adult female patients’ data (18 or older) from SEER who had breast cancer diagnoses between 1 January 2000 and 31 December 2017. Also, we selected California as the geographical location for the diverse characteristics of the patient population. The Hispanic population included all races, and the black population was non-Hispanic. Figure 1 shows the flow chart of data collection.

Figure 1

Flow chart of data collection.

Predictor and outcome variables

The predictor variables included age at cancer diagnosis, marital status at diagnosis, first malignant primary tumour indicator, the sequence number of tumours, primary site, histology, the total number of in situ/malignant tumours, SEER summary stage, derived stage, grade, regional lymph nodes examined, regional lymph nodes positive, oestrogen receptor status, progesterone receptor status, chemotherapy, radiation, sequence of radiation and surgery performed, reason no cancer-directed surgery and sequence of systemic therapy and surgical procedures. Vital status was recorded as alive/dead at the time of the cut-off date (31 December 2017). The sequence number of tumours describes the sequence of all reportable tumours that occurred over a patient’s lifetime.

The outcome variable was the survival months of a patient.

Data preprocessing and preparation

Before training the survival models, we preprocessed the predictor variables to enhance the ML modelling performance. Rows containing missing values were dropped. All the categorical features were re-encoded using a one-hot encoding scheme in which each new column represented a single category. We applied variance filtering (with a threshold of 0.01) to drop the features that were near-constant or had low variance. For a one-hot column, whose variance is p(1 − p) when a fraction p of rows belongs to the category, this threshold removes categories that occur in roughly 1% of rows or fewer (or, symmetrically, in roughly 99% of rows or more). Once the preprocessing was completed, the final dataset was exported into a new flat file for training. To train an ML model for survival analysis, the ‘survival months’ variable was used as the target, and ‘vital status’ was used as the event indicator.
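The one-hot encoding and variance filtering steps described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' code: the paper reports using pandas and scikit-learn's VarianceThreshold, whereas the helper names `one_hot` and `variance_filter` and the toy `grade` values below are hypothetical and chosen only for illustration.

```python
from statistics import pvariance


def one_hot(rows, column):
    """Return one 0/1 column per category found in `column`."""
    categories = sorted({row[column] for row in rows})
    return {
        f"{column}={cat}": [1 if row[column] == cat else 0 for row in rows]
        for cat in categories
    }


def variance_filter(features, threshold=0.01):
    """Keep only the columns whose population variance exceeds the threshold."""
    return {
        name: values
        for name, values in features.items()
        if pvariance(values) > threshold
    }


# Toy data (hypothetical, not SEER records): 200 rows with one rare category.
rows = [{"grade": "III"} for _ in range(80)]
rows += [{"grade": "II"} for _ in range(119)]
rows += [{"grade": "unknown"}]  # present in 0.5% of rows

encoded = one_hot(rows, "grade")
kept = variance_filter(encoded, threshold=0.01)
print(sorted(kept))  # the rare 'grade=unknown' column is filtered out
```

In this toy example the rare category has variance 0.005 × 0.995 ≈ 0.005, which falls below the 0.01 threshold, so its column is dropped while the two common categories are retained.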

We took several steps for data preparation to develop race/ethnicity-specific models for the Hispanic and black populations and compare them with the general model that included all races/ethnicity. Figure 2 shows the process of data preparation for model development.

Figure 2

Data preparation for model development. SEER, Surveillance, Epidemiology and End Results.

First, we randomly split the full dataset into a training set (T_all) for model development and a test set (E_all) for evaluation at a 7:3 ratio. Each set was then used to randomly sample the populations for model development, and the randomly sampled sets maintained the original ratio of each race/ethnicity in the full dataset. Second, we extracted the Hispanic population from T_all to form T_h, which was used to train the Hispanic-specific model (M_h), and extracted the Hispanic population from E_all to form E_h, which was used to test M_h. We then randomly sampled patients of all races/ethnicity from T_all to form T_all,h, matching the exact number of samples used for the Hispanic-specific model training, and likewise sampled from E_all to form E_all,h, matching the exact number of samples used for the Hispanic-specific model testing. T_all,h was used to develop the general model M_all,h. The performance of M_all,h (a) and M_h (b) was then compared on the same test set, E_h. Third, we repeated the process of Hispanic-specific model development for the black-specific model development.

We extracted the black population from T_all to form T_b, which was used to train the black-specific model (M_b), and extracted the black population from E_all to form E_b, which was used to test M_b. We then randomly sampled patients of all races/ethnicity from T_all to form T_all,b, matching the exact number of samples used for the black-specific model training, and likewise sampled from E_all to form E_all,b, matching the exact number of samples used for the black-specific model testing. T_all,b was used to develop the general model M_all,b. The performance of M_all,b (c) and M_b (d) was then compared on the same test set, E_b.
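The data preparation steps above can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the record layout, the `group` labels, the helper names and the toy cohort sizes are all hypothetical. One simplification is that the size-matched general sample is a simple random draw, which preserves the race/ethnicity ratios only approximately.

```python
import random


def stratified_split(records, train_frac=0.7, seed=0):
    """Split records 7:3 while keeping each group's share in both parts."""
    rng = random.Random(seed)
    by_group = {}
    for rec in records:
        by_group.setdefault(rec["group"], []).append(rec)
    train, test = [], []
    for members in by_group.values():
        members = members[:]  # copy so the caller's list is not reordered
        rng.shuffle(members)
        cut = round(len(members) * train_frac)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


def size_matched_sample(pool, n, seed=0):
    """Draw n records at random from the all-races pool (simplified)."""
    return random.Random(seed).sample(pool, n)


# Toy cohort (hypothetical sizes, not the SEER counts).
records = (
    [{"id": i, "group": "hispanic"} for i in range(200)]
    + [{"id": i, "group": "black"} for i in range(200, 260)]
    + [{"id": i, "group": "other"} for i in range(260, 1000)]
)

T_all, E_all = stratified_split(records)

# Hispanic-specific training and test sets.
T_h = [r for r in T_all if r["group"] == "hispanic"]
E_h = [r for r in E_all if r["group"] == "hispanic"]

# General training set with the same number of samples as T_h.
T_all_h = size_matched_sample(T_all, len(T_h))

print(len(T_all), len(E_all), len(T_h), len(E_h), len(T_all_h))
```

Both a model trained on `T_h` and a model trained on `T_all_h` would then be scored on the same `E_h`, so any difference in performance reflects the training population rather than the training set size. The black-specific sets follow the same pattern.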

Race/ethnicity-specific models

For the survival ML modelling, we developed and compared four models: Cox proportional-hazards (PH) model (CoxPH), Gradient Boost Tree (GBT), survival tree (ST) and survival support vector machine (SSVM). The description of each model is shown in table 1.

Table 1

Description of survival machine learning models

Each model’s performance was evaluated using the C-index. The C-index is a standard way of measuring the performance of survival models. It can be viewed as the fraction of all pairs of patients predicted to have correct orders over the total number of possible evaluation pairs.15
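To make the pairwise definition of the C-index concrete, the following is a minimal sketch of Harrell's concordance index in plain Python. It is an illustration, not the implementation used in the study (the paper reports using scikit-survival). The function name and the toy data are hypothetical, and tied survival times are simply treated as not comparable.

```python
def concordance_index(times, events, risk_scores):
    """Share of comparable patient pairs that the risk scores order correctly.

    A pair (i, j) is comparable when patient i has the shorter time and an
    observed event. It is concordant when patient i also has the higher
    predicted risk. Tied risk scores count as half a concordant pair.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example (hypothetical values): survival months, event flag, predicted risk.
times = [5, 10, 12, 20]
events = [True, False, True, True]  # False means censored
risk = [0.9, 0.4, 0.1, 0.7]

c = concordance_index(times, events, risk)
print(c)  # 0.75: three of the four comparable pairs are ordered correctly
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is why the differences between models in this study are read on that scale.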

For each race/ethnicity, we trained and compared two different models based on the two datasets mentioned above—one with a specific race/ethnicity and the other one with all races/ethnicity. Our hypothesis was that the model trained with specific race/ethnicity would outperform the general model trained with all races/ethnicity when predicting the breast cancer survival of a specific race/ethnicity.

Results

Sample characteristics

A total of 322 348 female patients who had breast cancer diagnoses between 1 January 2000 and 31 December 2017 were identified. Among them, the number of Hispanic patients was 59 204 (18.4%), and black was 20 073 (6.2%). Table 2 shows the detailed characteristics of the study sample, Hispanic, black and all races/ethnicity.

Table 2

Sample characteristics

A larger proportion of the black population had died (24.4%) compared with the all races/ethnicity (15.2%) and Hispanic (14.9%) populations. The Hispanic population’s survival months (mean: 80.6, median: 67.0) were lower than those of the all races/ethnicity (90.4, 79.0) and black (82.9, 69.0) populations. The Hispanic population was younger (mean: 55.3, median: 54.0) than the all races/ethnicity (59.1, 59.0) and black (57.8, 57.0) populations. The black (36.6%) and Hispanic (39.2%) populations had a higher percentage of poorly differentiated grade III cancer than the all races/ethnicity group (32.7%). The black population had a lower percentage of positive oestrogen receptor status (65.6%) than the all races/ethnicity (77.4%) and Hispanic (73.5%) populations. Also, the black population had a lower percentage of positive progesterone receptor status (52.1%) than the all races/ethnicity (65.6%) and Hispanic (62.3%) populations.

Lower percentages of the Hispanic (51.0%) and black (50.0%) populations had chemotherapy compared with all races/ethnicity (57.4%). Higher percentages of the black (57.3%) and Hispanic (55.6%) populations had no radiation and/or cancer-directed surgery compared with all races/ethnicity (52.3%). Higher percentages of the overall (46.4%) and Hispanic (43.4%) populations had radiation after surgery than the black population (41.5%).

Figure 3 shows the Kaplan-Meier curves of the Hispanic, black and all races/ethnicity groups. The all races/ethnicity group had better survival than the Hispanic and black groups.

Figure 3

Kaplan-Meier survival curves.
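For readers unfamiliar with how such curves are computed, the following is a minimal sketch of the Kaplan-Meier product-limit estimator in plain Python. It is illustrative only: the function name and toy data are hypothetical, and it is not the code used to produce figure 3.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        leaving = sum(1 for ti in times if ti == t)  # events plus censored
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Toy example (hypothetical values): survival months and event flags.
times = [2, 3, 3, 5, 8]
events = [True, True, False, True, False]  # False means censored

curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Censored patients contribute to the number at risk until their last follow-up time and then leave the risk set without lowering the curve.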

Data preprocessing and preparation

After data preprocessing and cleaning, the final dataset for analysis contained 260 variables. Values in the ‘Derived stages’ variable were grouped into ‘0’, ‘I’, ‘II’, ‘III’, ‘IV’ and ‘unknown’. ‘Regional nodes examined’ and ‘Regional nodes positive’ were integer variables that contained both numeric and encoded values (90+). Numeric values were categorised (ie, 0–9, 10–19, …, 40+), while encoded values were mapped to ‘other’.

Model development

We extracted 59 204 Hispanic patients for each training set for the Hispanic-specific model (T_h) and the comparison model with all races/ethnicity (T_all,h). Also, we extracted 20 073 black patients for each training set for the black-specific model (T_b) and the comparison model with all races/ethnicity (T_all,b). Once the data were prepared, we applied variance filtering and dropped the features that had low variance. After filtering, T_h retained 72 features and T_all,h retained 71 for the Hispanic-specific model training, while T_b and T_all,b each retained 72 for the black-specific model training.

During the training, both training sets (race-specific and all races/ethnicity) were further split into an actual training set and a validation set during a cross-validation phase when parameter tuning was necessary (GBT and ST models). We used a random search to find the best hyperparameters for each survival analysis model, with 20 iterations and fivefold cross-validation in every case. We used the scikit-survival package (V.0.17.1) for the modelling (CoxPHSurvivalAnalysis class for CoxPH, GradientBoostingSurvivalAnalysis class for GBT, SurvivalTree class for ST and FastKernelSurvivalSVM class for SSVM), scikit-learn (V.1.0.2) for the feature selection (VarianceThreshold), hyperopt (V.0.2.7) for the hyperparameter search, and pandas (V.1.4.1) for general data preprocessing and preparation.
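The tuning loop described above (random search with 20 iterations, each scored by fivefold cross-validation) can be sketched generically as follows. This is a standard-library illustration, not the authors' hyperopt configuration: the search space, the helper names and the stand-in scoring function are hypothetical. In practice the scorer would fit a survival model on the training fold and return the C-index on the validation fold.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) for k shuffled folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


def random_search(score_fn, space, n_samples, n_iter=20, k=5, seed=0):
    """Sample n_iter parameter sets and keep the best mean CV score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        scores = [
            score_fn(params, train_idx, val_idx)
            for train_idx, val_idx in k_fold_indices(n_samples, k, seed)
        ]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Hypothetical search space, for illustration only.
space = {"max_depth": [2, 3, 4, 6, 8], "learning_rate": [0.01, 0.05, 0.1, 0.5]}


def toy_score(params, train_idx, val_idx):
    # Stand-in scorer: a real one would train on train_idx rows and
    # return the validation C-index computed on val_idx rows.
    return 1.0 - 0.05 * abs(params["max_depth"] - 4) - abs(params["learning_rate"] - 0.1)


folds = list(k_fold_indices(100))
best_params, best_score = random_search(toy_score, space, n_samples=100)
print(best_params, round(best_score, 3))
```

Reusing the same fold assignment for every sampled parameter set keeps the comparison between candidates fair, since each is evaluated on identical validation data.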

Model evaluations

The model evaluation results are shown in table 3 and figure 4 where we compared different combinations of modelling methods and input training/test sets.

Table 3

Model performance comparison using C-index

Figure 4

Race/ethnicity-specific model performance comparison using C-index. GBT, Gradient Boost Tree; PH, proportional hazards; SSVM, survival support vector machine; ST, survival tree.

The Hispanic-specific model (M_h) and the all races/ethnicity model (M_all,h) were evaluated using the same test set (E_h). The Hispanic-specific model (M_h) outperformed the all races/ethnicity model (M_all,h) in three out of four approaches, which were Cox PH (0.832 vs 0.828), ST (0.772 vs 0.763) and SSVM (0.834 vs 0.790). The GBT model showed the same C-index score (0.813) for both models.

The black-specific model (M_b) and the all races/ethnicity model (M_all,b) were evaluated using the same test set (E_b). The black-specific model (M_b) outperformed the all races/ethnicity model (M_all,b) in all four approaches: Cox PH (0.823 vs 0.821), GBT (0.808 vs 0.803), ST (0.804 vs 0.801) and SSVM (0.824 vs 0.786). In both race/ethnicity-specific models, Cox PH showed the highest C-index score followed by GBT, SSVM and ST.

Discussion

Accurately predicting the survival outcome of a patient is critical in determining treatment options and providing appropriate cancer care. ML approaches provide a robust way of predicting health outcomes using large datasets with complex feature interactions. However, current ML models are often built with all races/ethnicity data, which can introduce representation bias, and are not tailored to each minority group. To date, race/ethnicity-specific survival ML models predicting the outcomes of black and Hispanic women diagnosed with breast cancer are lacking. This study developed and evaluated race/ethnicity-specific survival ML models for black and Hispanic women with breast cancer and compared them with the general population model. The high-performing ML models developed in this study can contribute to providing individualised oncology care and improving the survival outcome of these specific populations. Also, it is a strength of our study that we used patient data from 322 348 women in a large, population-based dataset from 2000 to 2017, including 59 204 (18.4%) Hispanic women and 20 073 (6.2%) black women.

The sample population in this study showed that the black population had the highest death rate followed by the Hispanic and all races/ethnicity, supporting the findings from other literature.4 5 Also, the survival months for the black and Hispanic groups were low and these groups were younger compared with all races/ethnicity. This is congruent with the literature showing that young black women have higher breast cancer mortality than young white women,16 17 and that Latinas have higher rates of more advanced cancer than non-Hispanic whites.18 Also, breast cancer is more aggressive in younger women than in older premenopausal women.19 Our study sample also showed that the Hispanic and black populations had a higher percentage of poorly differentiated grade III cancer than the overall population. Poorly differentiated tumours lack normal features, tend to grow and spread faster and have a worse prognosis20; these tumours also express lower levels of oestrogen receptor.21 Likewise, in our study sample the Hispanic and black populations had a lower percentage of oestrogen receptor positive status and progesterone receptor positive status than the overall population. Studies have shown that breast cancer at a young age has a more advanced stage at presentation, higher grades and higher oestrogen receptor negativity.22

The result also showed that lower percentages of Hispanic and black populations had chemotherapy. Existing literature has shown that African American and Hispanic patients tend to experience diagnostic and treatment delays, which were related to worse survival outcomes.23 24 Perhaps lower percentages of Hispanic and black patients receiving chemotherapy were associated with the fewer survival months of the Hispanic and black populations in this study.

After the race/ethnicity-specific model development and evaluation, we observed that the general models trained with all races/ethnicity did not perform as well when tested with specific races/ethnicity. That is, the race/ethnicity-specific survival ML models developed in this study consistently outperformed the general models when predicting the outcomes of specific race/ethnicity, addressing bias in ML. In particular, the black-specific and Hispanic-specific survival ML models using the Cox PH approach showed the best performance among the four ML models tested, showing that this model outperformed the other models in predicting the survival of specific race/ethnicity. Also, the ST model performance showed the highest difference between the race/ethnicity-specific model and the general model. This may indicate that the ST model tends to overfit to a specific race/ethnicity compared with the other models. Our study demonstrated that a tailored ML model for each race/ethnicity is needed to predict patient survival better than the general ML model using all races/ethnicity. By accurately forecasting a patient’s survival, healthcare professionals will be able to guide individualised treatment decisions and provide tailored interventions for the well-being of a cancer survivor.

It is worth noting that although the performance of the general model is not low, it was trained with the general population, in which underrepresented groups, including the Hispanic and black populations, make up an imbalanced portion. It was still meaningful to examine the feasibility of race/ethnicity-specific models since it is recommended to train an ML model with data resembling the people for whom the model is intended to be used, to mitigate representation bias. Although the performance difference between the models was sometimes marginal depending on the algorithms, our race/ethnicity-specific models consistently outperformed the general model. This shows the potential to accurately predict individualised patient outcomes for quality care delivery for underrepresented populations and to help alleviate health disparities.

There are several limitations to this study. The SEER database only includes the first course of treatment and does not have information on adjuvant therapy.25 This causes difficulties in comparing the outcomes of the treatment sequence. To overcome this limitation, future work could use a more comprehensive database with more information on cancer treatment to provide additional insights into the impact of treatment sequence. Also, the dataset did not include human epidermal growth factor receptor 2 (HER2) status, which is a critical tumour marker for breast cancer prognosis. The variable was missing because it has been collected only since 2010, whereas our data dated from 2000. Incorporating this variable in the modelling will be needed in future work to provide more accurate predictions for patient outcomes.

Conclusion

This study has developed and evaluated accurate race/ethnicity-specific survival ML models for black and Hispanic women diagnosed with breast cancer. Predicting the individualised survival outcome of breast cancer can provide the evidence necessary for determining treatment options and high-quality, patient-centred cancer care delivery for underrepresented populations. Also, the race/ethnicity-specific ML models can mitigate representation bias and contribute to addressing health disparities.

Ethics statements

Patient consent for publication

Ethics approval

Since the data were fully deidentified, this study was not considered human subject research by the Institutional Review Board at the University, and no informed consent was required.

References

1. Siegel RL, Miller KD, Fuchs HE, et al. Cancer statistics, 2021. CA Cancer J Clin 2021;71:7–33. doi:10.3322/caac.21654. PMID: 33433946
2. Yedjou CG, Sims JN, Miele L. Health and racial disparity in breast cancer. Breast cancer metastasis and drug resistance 2019;1152:31–49. doi:10.1007/978-3-030-20301-6_3
3. Power EJ, Chin ML, Haq MM. Breast cancer incidence and risk reduction in the Hispanic population. Cureus 2018;10:e2235. doi:10.7759/cureus.2235. PMID: 29713580
4. Jemal A, Ward EM, Johnson CJ, et al. Annual report to the nation on the status of cancer, 1975–2014, featuring survival. J Natl Cancer Inst 2017;109. doi:10.1093/jnci/djx030
5. Copeland G, Green D, Firth R. Cancer in North America: 2011–2015 volume one: combined cancer incidence for the United States, Canada and North America. Springfield: North American Association of Central Cancer Registries; 2018.
6. Rajkomar A, Dean J, Kohane I. Machine learning in medicine. N Engl J Med 2019;380:1347–58. doi:10.1056/NEJMra1814259. PMID: 30943338
7. Bates DW, Saria S, Ohno-Machado L, et al. Big data in health care: using analytics to identify and manage high-risk and high-cost patients. Health Aff 2014;33:1123–31. doi:10.1377/hlthaff.2014.0041. PMID: 25006137
8. Obermeyer Z, Powers B, Vogeli C, et al. Dissecting racial bias in an algorithm used to manage the health of populations. Science 2019;366:447–53. doi:10.1126/science.aax2342. PMID: 31649194
9. Saria S, Subbaswamy A. Tutorial: safe and reliable machine learning. arXiv preprint 2019. doi:10.48550/arXiv.1904.07204
10. Vayena E, Blasimme A, Cohen IG. Machine learning in medicine: addressing ethical challenges. PLoS Med 2018;15:e1002689. doi:10.1371/journal.pmed.1002689. PMID: 30399149
11. Rajkomar A, Hardt M, Howell MD, et al. Ensuring fairness in machine learning to advance health equity. Ann Intern Med 2018;169:866–72. doi:10.7326/M18-1990. PMID: 30508424
12. Wiens J, Saria S, Sendak M, et al. Do no harm: a roadmap for responsible machine learning for health care. Nat Med 2019;25:1337–40. doi:10.1038/s41591-019-0548-6. PMID: 31427808
13. SEER, National Cancer Institute. SEER research plus data description cases diagnosed in 1975-2017, 2020. Available: https://seer.cancer.gov/data-software/documentation/seerstat/nov2019/TextData.FileDescription.pdf
14. Duggan MA, Anderson WF, Altekruse S, et al. The surveillance, epidemiology, and end results (SEER) program and pathology: toward strengthening the critical relationship. Am J Surg Pathol 2016;40:e94. doi:10.1097/PAS.0000000000000749. PMID: 27740970
15. Chen Y, Jia Z, Mercola D, et al. A gradient boosting algorithm for survival analysis via direct optimization of concordance index. Comput Math Methods Med 2013;2013:1–8. doi:10.1155/2013/873595. PMID: 24348746
16. Shavers VL, Harlan LC, Stevens JL. Racial/Ethnic variation in clinical presentation, treatment, and survival among breast cancer patients under age 35. Cancer 2003;97:134–47. doi:10.1002/cncr.11051. PMID: 12491515
17. Althuis MD, Brogan DD, Coates RJ, et al. Breast cancers among very young premenopausal women (United States). Cancer Causes Control 2003;14:151–60. doi:10.1023/A:1023006000760. PMID: 12749720
18. Yanez B, Thompson EH, Stanton AL. Quality of life among Latina breast cancer patients: a systematic review of the literature. J Cancer Surviv 2011;5:191–207. doi:10.1007/s11764-011-0171-0. PMID: 21274649
19. Gonzalez-Angulo AM, Broglio K, Kau S-W, et al. Women age < or = 35 years with primary breast carcinoma: disease features at presentation. Cancer 2005;103:2466–72. doi:10.1002/cncr.21070. PMID: 15852360
20. American Cancer Society. Understanding your pathology report: breast cancer, 2020. Available: https://www.cancer.org/treatment/understanding-your-diagnosis/tests/understanding-your-pathology-report/breast-pathology/breast-cancer-pathology.html
21. Scimeca M, Antonacci C, Colombo D, et al. Emerging prognostic markers related to mesenchymal characteristics of poorly differentiated breast cancers. Tumour Biol 2016;37:5427–35. doi:10.1007/s13277-015-4361-7. PMID: 26563370
22. Zabicki K, Colbert JA, Dominguez FJ, et al. Breast cancer diagnosis in women < or = 40 versus 50 to 60 years: increasing size and stage disparity compared with older women over time. Ann Surg Oncol 2006;13:1072–7. doi:10.1245/ASO.2006.03.055. PMID: 16865599
23. Gwyn K, Bondy ML, Cohen DS, et al. Racial differences in diagnosis, treatment, and clinical delays in a population-based study of patients with newly diagnosed breast carcinoma. Cancer 2004;100:1595–604. doi:10.1002/cncr.20169
24. Fedewa SA, Ward EM, Stewart AK, et al. Delays in adjuvant chemotherapy treatment among patients with breast cancer are more likely in African American and Hispanic populations: a national cohort study 2004-2006. J Clin Oncol 2010;28:4135–41. doi:10.1200/JCO.2009.27.2427. PMID: 20697082
25. Yu JB, Gross CP, Wilson LD, et al. NCI SEER public-use data: applications and limitations in oncology research. Oncology 2009;23:288. PMID: 19418830
26. Katzman JL, Shaham U, Cloninger A, et al. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Med Res Methodol 2018;18:1–2. doi:10.1186/s12874-018-0482-1
27. Lin DY, Wei LJ. The robust inference for the Cox proportional hazards model. J Am Stat Assoc 1989;84:1074–8. doi:10.1080/01621459.1989.10478874
28. Friedman JH. Stochastic gradient boosting. Comput Stat Data Anal 2002;38:367–78. doi:10.1016/S0167-9473(01)00065-2
29. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat 2001;29:1189–232. doi:10.1214/aos/1013203451
30. Bertsimas D, Dunn J, Gibson E, et al. Optimal survival trees. Mach Learn 2022;111:2951–3023. doi:10.1007/s10994-021-06117-0
31. Leblanc M, Crowley J. Survival trees by Goodness of split. J Am Stat Assoc 1993;88:457–67. doi:10.1080/01621459.1993.10476296
32. Van Belle V, Pelckmans K, Van Huffel S, et al. Support vector methods for survival analysis: a comparison between ranking and regression approaches. Artif Intell Med 2011;53:107–18. doi:10.1016/j.artmed.2011.06.006. PMID: 21821401
33. Pölsterl S, Navab N, Katouzian A. An efficient training algorithm for kernel survival support vector machines. arXiv preprint 2016. doi:10.48550/arXiv.1611.07054

Footnotes

Contributors All the authors contributed to the design of the work and the final approval of the submission. JIP worked on the data acquisition, analysis and interpretation of the data, and acted as guarantor. SB contributed to the data analysis. JWP and SL contributed to the interpretation of data for the work.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.
