Sanitize Big Data
The GloWbE corpus was the largest of its kind, comprising over 1 billion English words stratified by dialect and location (http://corpora.byu.edu).
In this role I:
- Streamlined the curation of millions of documents
- Wrote Linux shell scripts (a rough sketch of this kind of script appears below)
- Integrated various Linux NLP tools into the pipeline
- Stood up VMs
The goal was fast sanitization: handle billions of documents without slowing the pipeline down, and keep that pipeline easy to maintain.
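As a rough illustration of the kind of glue scripting involved, here is a minimal sketch of a sanitization pass. The directory names, file pattern, and specific cleanup steps are assumptions for illustration only, not the actual GloWbE pipeline.

```bash
#!/usr/bin/env bash
# Illustrative sketch only: directory layout, file pattern, and cleanup
# steps are placeholders, not the real GloWbE pipeline.
set -euo pipefail

RAW_DIR="./raw"      # placeholder: incoming scraped documents
CLEAN_DIR="./clean"  # placeholder: sanitized output
mkdir -p "$CLEAN_DIR"

sanitize_one() {
  local src="$1"
  local dst="$CLEAN_DIR/$(basename "$src")"
  # Force valid UTF-8 (dropping unconvertible bytes), strip leftover
  # HTML tags, and collapse runs of whitespace into single spaces.
  iconv -f utf-8 -t utf-8 -c "$src" \
    | sed -e 's/<[^>]*>//g' \
    | tr -s '[:space:]' ' ' > "$dst"
}
export -f sanitize_one
export CLEAN_DIR

# Fan the work out across all available cores, one file per invocation.
find "$RAW_DIR" -type f -name '*.txt' -print0 \
  | xargs -0 -P "$(nproc)" -I{} bash -c 'sanitize_one "$1"' _ {}
```

The same pattern scales horizontally: point the script at a shared file store and run it on additional VMs, or swap in heavier NLP tools at the middle of the pipe without touching the surrounding plumbing.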
See http://corpus.byu.edu/faq.asp#x2 for more information.