Objective To explore a deep learning-based automatic bone age estimation model for elbow joint X-ray images of Chinese Han adolescents and children and to evaluate its performance. Methods A total of 943 frontal-view elbow joint X-ray images (517 from males and 426 from females) of Chinese Han adolescents and children aged 6.00 to <16.00 years were collected from East, South, Central and Northwest China. Three experimental schemes were adopted for bone age estimation. Scheme 1: preprocessed images were input directly into the regression model; Scheme 2: a segmentation network was trained using “key elbow joint bone annotations” as labels, and the segmented images were then input into the regression model; Scheme 3: a segmentation network was trained using “full elbow joint bone annotations” as labels, and the segmented images were then input into the regression model. For segmentation, the optimal model was selected from U-Net, UNet++ and TransUNet. For regression, the VGG16, VGG19, InceptionV2, InceptionV3, ResNet34, ResNet50, ResNet101 and DenseNet121 models were used for bone age estimation. The dataset was randomly split into 80% (754 samples) for training and validation (model fitting and hyperparameter tuning) and 20% (189 samples) as an internal test set for assessing the trained models. An additional 104 elbow joint X-ray images from the same population and age range were collected as an external test set. Model performance was evaluated using the mean absolute error (MAE) and root mean square error (RMSE) between the estimated and actual ages, the accuracies within ±0.7 years (P±0.7 years) and ±1.0 years (P±1.0 years), and radar charts, scatter plots, and heatmaps. Results With Scheme 3, the UNet++ model achieved good segmentation performance, with a segmentation loss of 0.0004 and an accuracy of 93.8% at a learning rate of 0.0001. 
In the internal test set, the DenseNet121 model with Scheme 3 yielded the best results, with an MAE of 0.83 years, a P±0.7 years of 70.03%, and a P±1.0 years of 84.30%. In the external test set, the DenseNet121 model with Scheme 3 also performed best, with an average MAE of 0.89 years and an average RMSE of 1.00 years. Conclusion For automatic bone age estimation from elbow joint X-ray images of Chinese Han adolescents and children, the UNet++ model is recommended for segmentation, and the DenseNet121 model with Scheme 3 achieves the best performance. Using a segmentation network, especially one trained with annotations covering the full elbow joint (the distal humerus, proximal radius, and proximal ulna), can improve the accuracy of bone age estimation based on elbow joint X-ray images.
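
The four evaluation metrics named in the Methods (MAE, RMSE, P±0.7 years and P±1.0 years) can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `bone_age_metrics`, the dictionary keys, and the toy ages are hypothetical, and treating an error exactly equal to the tolerance as a hit is an assumption, since the abstract does not say whether the bounds are inclusive.

```python
import math


def bone_age_metrics(estimated, actual, tolerances=(0.7, 1.0)):
    """Return MAE, RMSE and the percentage of cases within each tolerance.

    Hypothetical helper for illustration; ages are in years.
    """
    if len(estimated) != len(actual) or not estimated:
        raise ValueError("estimated and actual must be non-empty and equal in length")

    # Absolute error between estimated and actual age for every case.
    errors = [abs(e - a) for e, a in zip(estimated, actual)]
    n = len(errors)

    metrics = {
        "MAE": sum(errors) / n,
        "RMSE": math.sqrt(sum(err ** 2 for err in errors) / n),
    }
    for tol in tolerances:
        # Assumption: an error exactly equal to the tolerance counts as within it.
        hits = sum(1 for err in errors if err <= tol)
        metrics[f"P_within_{tol}"] = 100.0 * hits / n
    return metrics


# Toy values, not data from the study.
estimated = [7.5, 10.0, 12.5, 15.0]
actual = [7.0, 10.0, 13.5, 13.0]
result = bone_age_metrics(estimated, actual)
print(result)
```

Because RMSE squares each error before averaging, one large miss (2.0 years in the toy data) raises RMSE more than MAE, which is why the two are usually read together.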