

Sanitize Big Data

Role

Data Engineer - NLP

Tools and Skills

VMs, Linux, Bash, jusText, GitHub, SSH

The GloWbE (Global Web-based English) corpus was the largest of its kind, comprising over 1 billion English words stratified by dialect and location (http://corpora.byu.edu).

In this role I performed the following:

- Streamlined the curation of millions of documents
- Wrote Linux shell scripts
- Integrated various NLP tools, including jusText, into the pipeline (a sketch follows this list)
- Stood up VMs
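
The per-document sanitization step can be illustrated with a short sketch. This is a hypothetical example rather than the project's actual script: it assumes the jusText Python API listed under Tools and Skills was used to strip navigation, ads, and other boilerplate from each HTML document, and the file paths and command-line interface are illustrative.

```python
# Hypothetical sketch of the per-document sanitization step using jusText.
# Paths and the command-line interface are illustrative, not from the source.
import sys
import justext

def sanitize(html_path, out_path):
    """Strip boilerplate from one HTML document, keeping only content text."""
    with open(html_path, "rb") as f:
        raw = f.read()
    # jusText classifies each paragraph as content or boilerplate.
    paragraphs = justext.justext(raw, justext.get_stoplist("English"))
    with open(out_path, "w", encoding="utf-8") as out:
        for p in paragraphs:
            if not p.is_boilerplate:
                out.write(p.text + "\n")

if __name__ == "__main__":
    sanitize(sys.argv[1], sys.argv[2])
```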

The goal was fast sanitization: the pipeline had to handle billions of documents without becoming a bottleneck, while remaining easy to maintain.
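
One way that throughput could be sustained is sketched below. This is an assumed approach (fanning documents out across worker processes), not the project's actual code; the directory layout, worker count, and helper names are illustrative.

```python
# Hypothetical sketch of keeping throughput up: run the per-document sanitize
# step across many worker processes. Layout and worker count are illustrative
# assumptions, not taken from the source.
from multiprocessing import Pool
from pathlib import Path

def sanitize_one(html_path: Path) -> str:
    out_path = html_path.with_suffix(".txt")
    # sanitize(str(html_path), str(out_path))  # per-document step sketched above
    return str(out_path)

if __name__ == "__main__":
    docs = list(Path("corpus/raw").glob("*.html"))  # hypothetical corpus layout
    # imap_unordered streams results back as workers finish, keeping all cores busy.
    with Pool(processes=8) as pool:
        for done in pool.imap_unordered(sanitize_one, docs, chunksize=64):
            print(done)
```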

See http://corpus.byu.edu/faq.asp#x2 for more information.

