Inspired by the 2023 paper "Using personal writings to detect dementia: A text mining approach" [1], we trained an AI model to detect dementia in written texts quickly and efficiently. We first scraped blog posts written by people with and without dementia, which gave us about 1.4 million tokens of training data. We then generated sentence embeddings, which represent a text as a single vector, using the BGE-large-en-v1.5 transformer model. We chose this model because it runs efficiently on CPU alone, ranks fairly high on the Hugging Face MTEB leaderboard, and produces 1024-dimensional embeddings, which are large enough to capture the subtleties and nuances of our texts. We then used gradient boosting, via the XGBoost library, to classify each text as either "likely dementia" or "not likely dementia." Of our roughly 3,000 data points, we set aside 20% for testing and validation; these held-out points were not used to train the model. We performed 5-fold cross-validation on our model, resulting in an F1 score of 0.95 based on a total support of 606.
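The pipeline described above could be sketched roughly as follows. This is an illustrative sketch, not our actual project code: the function names, the Hugging Face model id `BAAI/bge-large-en-v1.5`, the XGBoost hyperparameters, the random seed, and the convention that label 1 means "likely dementia" are all assumptions made for the example.

```python
import random


def embed_texts(texts, model_name="BAAI/bge-large-en-v1.5"):
    """Return one embedding vector per text, computed on CPU."""
    # Imported lazily so the helpers below work without the model installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name, device="cpu")
    return model.encode(texts, normalize_embeddings=True)


def split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle range(n) and hold out test_fraction of it for evaluation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]  # (train, test)


def train_classifier(X_train, y_train):
    """Fit a gradient-boosted classifier on the embeddings."""
    from xgboost import XGBClassifier

    # Hyperparameters are placeholder assumptions, not tuned values.
    clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
    clf.fit(X_train, y_train)
    return clf


def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In use, one would call `embed_texts` on the scraped posts, pick train and test rows with `split_indices`, fit with `train_classifier` on the training rows only, and score the classifier's predictions on the held-out rows with `f1_score`.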
We ran out of time to upload our demo video before the 10 a.m. deadline, so here is the YouTube link to it: https://youtu.be/nj9fFstPfls
[1] Asllani B, Mullen DM. Using personal writings to detect dementia: A text mining approach. Health Informatics Journal. 2023;29(4). doi: 10.1177/14604582231204409