top of page
regular_1708x683_0417-an-introduction-to-deep-learning-from-perceptrons-to-deep-networks-W

AI Projects

Recommendation Systems

For a master's project, I built a recommendation system called RelRec. I trained multiple models and compared them. The winning algorithm used the Latent Dirichlet Allocation algorithm. The recommendation system was full IaC. It included a microservice pipeline to source documents, parse, sanitize, remove stop words, run algorithm, and compare benchmarks.

Research: https://scholarsarchive.byu.edu/etd/6195/

Code: http://linguistics.byu.edu/thesisdata/relrec.html (this page will be removed when Professor Lonsdale retires)

recommender-system-provides-smart-recommendations-260nw-1978170500.webp

Data Indexing

Visuals are critical to understanding AI output. This project converted output from LDA (Latent Dirichlet Allocation) and presents it as an interactive, nested index. A key feature is the ability to load source documents to verify. This allows humans to determine if the index is robust or needs additional tweaking of parameters.

LDS Talk Topic Index

maxresdefault.jpg

Classification: Is it actually candy?

Machine learning classifiers can help answer: Is this what it claims to be?

For nutritionists, this is a critical question for every food item. In this model, we demonstrated that a back-propagation model was sufficient to answer this while surpassing baseline performance expectations. Data was sourced from the FDA nutritional information database.

https://bean5.github.io/ml-food-classifier/

 

data classification.png

Human Training (Education)

Machine learning could one day help select optimal datasets for educational systems, but back in 2012, I set out to build a simpler tool: a command-line program for learning SAT vocabulary words.

I wrote it in Perl, a language often praised for its NLP capabilities, as both a way to sharpen my skills and to create something useful. Despite its simplicity—just 175 lines of code and under 1 MB in size—the tool delivers a practical learning experience. This project predates frameworks like TensorFlow, making its lightweight, command-line design a reflection of the era.

http://bean5.github.io/learn_words/

Sat-Vocabulary-Words.png

Plagiarism Detection with GUI

Goal: cross-platform paraphrase detection with user interface that can manipulate manipulate parameters on the fly. The result is a Java paraphrase detector that leverages n-gram/gene sequence alignment.

​​

https://bean5.github.io/nlp-paraphrase-detector/

Plagiarism-Detection-System.png

Data Analysis: Topics over Time

When you have a lot of documents spanning a lot of years, key questions invariably arise such as:

  • which topics are trending

  • which topics were trending 10 years ago

Using Topics Over Time using Gibbs sampling algorithms, I did exactly this. To read up on a model and conclusions, see Theological topics through time: An application of Gibbs sampling and other metrics to analyze topic venues in religious discourses

Evolution-of-topics-over-time.png

Regex

Regular expressions are a powerful way to uncover linguistic patterns in text, but some patterns are better captured with customizable automatons. After exploring this area through research, I developed a tool that detects alliteration in documents using orthographic cues.

Alliteration Locator

 

Related to this is haiku detection.

Haiku detection in LDS Conference talks

alliteration.jpeg

Tooling

I upgraded a stemming tool to work with Java 7.​ It is over 11 years old. Being in Java, it probably still works. My code is still available, while the source code is no longer on GitHub.

I recommend tools like Python Natural Language ToolKit (NLTK) at this point. Not because they are better, but because they are more mainstream and therefore render code more re-usable and maintainable. Or use lemmatizers.

https://bean5.github.io/nlp-porter-stemmer-java/

 

stemming vs lemmatization.png

©2018-2025 by Can Compute

bottom of page