On Wednesday the 20th of March 2019, Shaheen Syed successfully defended his PhD dissertation entitled “Topic Discovery from Textual Data: Machine Learning and Natural Language Processing for Knowledge Discovery in the Fisheries Domain” in the Academiegebouw of Utrecht University (the Netherlands). We thank him for his contributions to the SAF21 project and wish him good luck in his future endeavors.

A short summary of his work

It is estimated that the world’s data will increase to roughly 160 billion terabytes by 2025, with most of that data occurring in an unstructured form. Today, we have already reached the point where more data is being produced than can be physically stored. To ingest all this data and to construct valuable knowledge from it, new computational tools and algorithms are needed, especially since manual probing of the data is slow, expensive, and subjective. For unstructured data, such as text in documents, probabilistic topic models are an active field of research. Topic models are techniques that automatically uncover the hidden, or latent, topics present within a collection of documents. They can infer the topical content of thousands or millions of documents without prior labeling or annotation. This unsupervised nature makes probabilistic topic models a useful tool for applied data scientists who need to interpret and examine large volumes of documents to extract new and valuable knowledge. This dissertation investigates how to optimally and efficiently apply topic models to large collections of documents and how to interpret their output. Specifically, it shows how different types of textual data, pre-processing steps, and hyper-parameter settings can affect the quality of the derived latent topics. The results presented in this dissertation provide a starting point for researchers who want to apply topic models with scientific rigor to scientific publications.

Published papers

  • Syed, S., ní Aodha, L., Scougal, C., Spruit, M. (2019). Mapping the global network of fisheries science collaboration. Fish and Fisheries, 1–27. https://doi.org/10.1111/faf.12379
  • Syed, S., Spruit, M. (2018). Exploring Symmetrical and Asymmetrical Dirichlet Priors for Latent Dirichlet Allocation. International Journal of Semantic Computing, 12:3, 399–423. https://doi.org/10.1142/S1793351X18400184
  • Syed, S., Borit, M., Spruit, M. (2018). Narrow lenses for capturing the complexity of fisheries: A topic analysis of fisheries science from 1990 to 2016. Fish and Fisheries, 19:4, 643–661. https://doi.org/10.1111/faf.12280
  • Syed, S., Spruit, M. (2018). Selecting Priors for Latent Dirichlet Allocation. In The 12th IEEE International Conference on Semantic Computing (pp. 194–202). Laguna Hills, CA, USA: IEEE. http://doi.org/10.1109/ICSC.2018.00035
  • Syed, S., Spruit, M. (2017). Full-Text or Abstract? Examining Topic Coherence Scores Using Latent Dirichlet Allocation. In The 4th IEEE International Conference on Data Science and Advanced Analytics (pp. 165–174). IEEE. http://doi.org/10.1109/DSAA.2017.61
  • Syed, S., Spruit, M., Borit, M. (2016). Bootstrapping a Semantic Lexicon on Verb Similarities. In Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management – Volume 1: KDIR (IC3K 2016) (pp. 189–196). http://doi.org/10.5220/0006036901890196

Full dissertation

  • Syed, S. (2019). Topic Discovery from Textual Data: Machine Learning and Natural Language Processing for Knowledge Discovery in the Fisheries Domain. Utrecht University (Doctoral dissertation). https://dspace.library.uu.nl/handle/1874/374917