
AI Projects

Recommendation Systems

For a master's project, I built a recommendation system called RelRec. I trained multiple models and compared them; the best-performing model used Latent Dirichlet Allocation (LDA). The system was deployed entirely as infrastructure as code (IaC). It included a microservice pipeline that sourced documents, parsed and sanitized them, removed stop words, ran the algorithm, and compared benchmarks.
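The text-cleaning stages of such a pipeline can be sketched roughly as follows. This is a minimal illustration, not RelRec's actual code: the function names, the tiny stop-word list, and the tag-stripping rule are all assumptions made for the example.

```python
import re

# Hypothetical stop-word list; a real pipeline would use a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}


def sanitize(text):
    """Replace markup-like tags with spaces and collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that carry little topical meaning."""
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(raw):
    """Chain the stages: sanitize, then tokenize, then remove stop words."""
    return remove_stop_words(tokenize(sanitize(raw)))
```

The resulting token lists would then be handed to whichever model is being benchmarked.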


Research: https://scholarsarchive.byu.edu/etd/6195/

Code: http://linguistics.byu.edu/thesisdata/relrec.html (this page will be removed when Professor Lonsdale retires)


Data Indexing

Visuals are critical to understanding AI output. This project converts output from LDA (Latent Dirichlet Allocation) into an interactive, nested index. A key feature is the ability to load the source documents and verify the index against them, which lets humans determine whether the index is robust or the parameters need further tuning.
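One way to picture the nested structure is topic, then top words, then the documents that contain each word. The sketch below assumes the LDA output is already available as plain dictionaries; the input shapes, the name `build_index`, and the `top_n` parameter are hypothetical and not taken from the project.

```python
def build_index(topic_words, doc_tokens, top_n=2):
    """Build a nested index: topic -> top words -> documents containing the word.

    topic_words: {topic: {word: weight}} (assumed shape of the LDA output)
    doc_tokens:  {doc_id: [tokens]}      (assumed shape of the source corpus)
    """
    index = {}
    for topic, weights in topic_words.items():
        # Highest-weighted words first; ties are broken alphabetically.
        top = sorted(weights, key=lambda w: (-weights[w], w))[:top_n]
        index[topic] = {
            word: sorted(d for d, toks in doc_tokens.items() if word in toks)
            for word in top
        }
    return index
```

Because each leaf is a list of document IDs, a viewer can load those documents so that a reader can judge whether the topic label actually fits them.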


LDS Talk Topic Index


Classification: Is it actually candy?

Machine learning classifiers can help answer: Is this what it claims to be?


For nutritionists, this is a critical question for every food item. In this project, we demonstrated that a model trained with back-propagation was sufficient to answer it while surpassing baseline performance expectations. Data was sourced from the FDA nutritional information database.
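To make "surpassing baseline" concrete, a classifier's accuracy can be compared against a trivial baseline. The sketch below assumes a majority-class baseline, which may not be the baseline the project actually used; the labels are invented for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels: three candy items and one non-candy item.
actual = ["candy", "candy", "candy", "other"]
baseline = majority_baseline_accuracy(actual)
```

A trained model is only interesting if its `accuracy` on held-out data exceeds `baseline`.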


https://bean5.github.io/ml-food-classifier/

 


Human Training (Education)

Machine learning could one day help select optimal datasets for educational systems, but back in 2012, I set out to build a simpler tool: a command-line program for learning SAT vocabulary words.


I wrote it in Perl, a language often praised for its NLP capabilities, both to sharpen my skills and to create something useful. Despite its simplicity (just 175 lines of code and under 1 MB in size), the tool delivers a practical learning experience. This project predates frameworks like TensorFlow, so its lightweight, command-line design reflects the era.


http://bean5.github.io/learn_words/


Plagiarism Detection with GUI

Goal: cross-platform paraphrase detection with a user interface that can manipulate parameters on the fly. The result is a Java paraphrase detector that leverages n-gram and gene-sequence alignment.
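The n-gram side of this idea can be illustrated with a simple overlap score. The sketch below is in Python rather than the project's Java, covers only n-gram overlap (not the sequence-alignment step), and uses a Jaccard ratio as an assumed similarity measure; `n` stands in for the kind of parameter a user might adjust on the fly.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-token tuples in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Jaccard overlap of word n-grams between two texts (0.0 to 1.0)."""
    grams_a = ngrams(a.lower().split(), n)
    grams_b = ngrams(b.lower().split(), n)
    if not grams_a or not grams_b:
        return 0.0  # One text is shorter than n words.
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

Smaller values of `n` tolerate more rewording; larger values demand longer verbatim runs before two passages count as similar.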


https://bean5.github.io/nlp-paraphrase-detector/


Data Analysis: Topics over Time

When you have many documents spanning many years, key questions invariably arise, such as:

  • Which topics are trending?

  • Which topics were trending 10 years ago?


Using the Topics over Time model with Gibbs sampling, I answered exactly these questions. For the model and conclusions, see "Theological topics through time: An application of Gibbs sampling and other metrics to analyze topic venues in religious discourses".
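Once each document has a topic assignment, the trending questions reduce to tallying topics per year. The sketch below shows only that post-hoc tally; it is not the Topics over Time model or a Gibbs sampler, and the `(year, topic)` input format is an assumption.

```python
from collections import Counter, defaultdict


def topic_share_by_year(assignments):
    """Compute each topic's share of documents per year.

    assignments: iterable of (year, topic) pairs, one per document
    (an assumed, simplified stand-in for real model output).
    """
    counts = defaultdict(Counter)
    for year, topic in assignments:
        counts[year][topic] += 1
    return {
        year: {t: c / sum(topics.values()) for t, c in topics.items()}
        for year, topics in sorted(counts.items())
    }
```

Comparing a topic's share across years shows whether it is rising or fading.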


Regex

Regular expressions are a powerful way to uncover linguistic patterns in text, but some patterns are better captured with customizable automatons. After exploring this area through research, I developed a tool that detects alliteration in documents using orthographic cues.
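A simple orthographic approach treats a run of consecutive words that share a first letter as candidate alliteration. The sketch below is a minimal version of that idea, not the Alliteration Locator's actual logic; the `min_run` threshold and the first-letter rule are assumptions.

```python
import re


def find_alliteration(text, min_run=3):
    """Return runs of at least min_run consecutive words sharing a first letter."""
    words = re.findall(r"[A-Za-z']+", text)
    runs, current = [], []
    for word in words:
        if current and word[0].lower() == current[-1][0].lower():
            current.append(word)
        else:
            # The run ended; keep it only if it is long enough.
            if len(current) >= min_run:
                runs.append(current)
            current = [word]
    if len(current) >= min_run:
        runs.append(current)
    return runs
```

Because the check is purely orthographic, it will match "knight" with "kite" and miss "phone" with "fish", which is the kind of case where a customizable automaton could apply richer rules.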


Alliteration Locator

 

Related to this is haiku detection.

Haiku detection in LDS Conference talks


Tooling

I upgraded a stemming tool to work with Java 7. It is over 11 years old, but being written in Java, it probably still works. My code is still available, although the original source code is no longer on GitHub.


At this point, I recommend tools like Python's Natural Language Toolkit (NLTK), not because they are better, but because they are more mainstream, which makes code more reusable and maintainable. Alternatively, use a lemmatizer.


https://bean5.github.io/nlp-porter-stemmer-java/

 


©2018-2025 by Can Compute
