DOI: 10.32900/2312-8402-2025-135-66-75
Keywords: machine learning (ML), support vector regression (SVR), prediction, genetic markers, dairy cattle productivity
Traditional selection in cattle relies heavily on linear mixed models (BLUP), which are effective but limited in their ability to model non-linear genetic interactions (epistasis). Machine learning (ML) algorithms offer an alternative capable of detecting complex dependencies in genetic data. The aim of this work was to test the Support Vector Regression (SVR) methodology for predicting milk productivity and to develop a “reverse engineering” approach for identifying optimal allelic combinations from a limited and heterogeneous set of genetic markers.
The study was conducted on a sample of 81 Ukrainian Red-and-White dairy cows. Genotypes at 3 QTLs (PRL, LEP, TNF-α) were transformed into 12 binary features by one-hot encoding. Milk yield (305 days) and fat content (kg) served as target variables for building the SVR model; the milk yield target was standardized using StandardScaler. The model was trained using 5-fold cross-validation with hyperparameter tuning (GridSearchCV), and non-shuffled and shuffled data splits were compared. To identify “ideal” genotypes, a synthetic “solution space” of 54 genotype combinations was generated and then scored by the trained SVR model.
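The workflow described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' code. The genotype labels, the 3 + 3 + 6 split of classes across markers (the only split that gives both 12 one-hot features and 54 combinations), the hyperparameter grid, and the random yields are all assumptions made for the example. Only `StandardScaler`, `GridSearchCV`, SVR, 5-fold cross-validation, and the shuffled versus non-shuffled comparison come from the text.

```python
from itertools import product

import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR

# Hypothetical genotype classes per marker (labels are illustrative only):
# 3 + 3 + 6 = 12 one-hot features and 3 * 3 * 6 = 54 combinations.
GENOTYPES = {
    "PRL": ["CC", "CT", "TT"],
    "LEP": ["CC", "CT", "TT"],
    "TNF-a": ["AA", "AB", "AD", "BB", "BD", "DD"],
}
categories = list(GENOTYPES.values())

# Synthetic stand-in for the 81 cows: random genotypes, random yields (kg).
rng = np.random.default_rng(0)
X_raw = np.column_stack([rng.choice(c, size=81) for c in categories])
y = rng.normal(4838, 800, size=81)

# One-hot encode the genotypes, fit an SVR, and standardize the target;
# predictions are transformed back to kg automatically.
model = TransformedTargetRegressor(
    regressor=make_pipeline(OneHotEncoder(categories=categories), SVR()),
    transformer=StandardScaler(),
)

# Assumed search grid; the grid actually used is not given in the text.
param_grid = {
    "regressor__svr__kernel": ["linear", "rbf"],
    "regressor__svr__C": [0.1, 1.0, 10.0],
}

# Compare non-shuffled and shuffled 5-fold splits.
for shuffle in (False, True):
    cv = KFold(n_splits=5, shuffle=shuffle, random_state=0 if shuffle else None)
    search = GridSearchCV(model, param_grid, cv=cv, scoring="r2").fit(X_raw, y)
    print(f"shuffle={shuffle}: best mean R2 = {search.best_score_:.4f}")

# "Solution space": every genotype combination, scored by the last
# (shuffled-split) model, which GridSearchCV has refit on all data.
space = np.array(list(product(*categories)))
pred = search.predict(space)
top3 = np.argsort(pred)[::-1][:3]
for i in top3:
    print("-".join(space[i]), round(float(pred[i])), "kg")
```

Because the yields here are random noise, the printed R² values and the ranking carry no biological meaning. The sketch only shows how the pieces fit together.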
Three-way ANOVA did not reveal a statistically significant effect (at p < 0.05) of the main factors (PRL, LEP, TNF-α) or their interactions on milk yield, although PRL showed a borderline trend (p = 0.055). SVR models evaluated on non-shuffled splits failed, yielding negative R² values (down to -0.066), that is, predictions worse than a constant mean prediction, which indicates overfitting. The model using all 3 markers (12 features) with shuffled 5-fold cross-validation (shuffle=True) and a non-linear 'rbf' kernel achieved the best, albeit practically negligible, positive result (R² = 0.0064), with an estimated RMSE of ~790 kg. The “reverse engineering” approach identified hypothetical complex genotypes (Top 3: CC-CC-AD, CT-CC-AD, CC-CC-AB) with a predicted yield of up to 5173 kg, 335 kg above the herd average (4838 kg).
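To make the reported figures easier to read, the short sketch below computes R² by hand and relates the top predicted gain to the estimated RMSE. The held-out yields in `y_test` are invented for illustration. Only the values 5173 kg, 4838 kg, and ~790 kg are taken from the text.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical held-out yields (kg), invented for this example.
y_test = [4200.0, 4600.0, 5000.0, 5400.0]
fold_mean = sum(y_test) / len(y_test)

# Predicting the fold's own mean gives R2 = 0 by definition.
r2_mean = r2_score(y_test, [fold_mean] * len(y_test))

# A constant prediction that misses the fold mean by 100 kg (for example,
# a mean learned on differently composed folds) already gives R2 < 0.
r2_shifted = r2_score(y_test, [fold_mean + 100.0] * len(y_test))

# Reported figures: best predicted yield, herd average, estimated RMSE.
gain = 5173 - 4838
ratio = gain / 790

print(f"R2 (fold mean) = {r2_mean:.3f}")
print(f"R2 (shifted)   = {r2_shifted:.3f}")
print(f"gain = {gain} kg, gain / RMSE = {ratio:.2f}")
```

On these numbers the largest predicted gain is less than half of the model's estimated error, so the ranked genotypes are perhaps best read as hypotheses to test on larger datasets rather than as confirmed effects.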
The study confirmed the methodological suitability of SVR for analyzing heterogeneous genetic data and “reverse engineering” selection goals, even on a critically small sample (n=81). The low R² values highlight that the primary limitation is the small sample size relative to the number of features, which prevents the model from capturing reliable predictive signals. This approach serves as a powerful analytical complement to traditional BLUP methods, providing a framework for identifying desirable “genetic formulas” for targeted selection once larger datasets become available.