phonorm is an exploratory project that applies a machine translation approach to the problem of phonetic normalization. The need for such a model arose from the conversations we observed in ChitChat, a chatbot developed at the Leiden University Center for Innovation, where much of the text is written the way it is spoken. Current phonetic algorithms, such as Soundex, are too aggressive and do not work well for our use case.
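To illustrate why an algorithm like Soundex is too coarse for this task, here is a small, self-contained sketch of the American Soundex algorithm (our own illustration, not part of the phonorm codebase). Note how it collapses clearly distinct words to the same four-character code:

```python
def soundex(word):
    """American Soundex: first letter plus three digits."""
    codes = {}
    for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in group:
            codes[ch] = digit
    word = word.lower()
    first = word[0].upper()
    prev = codes.get(word[0], "")
    digits = []
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":        # h/w are "transparent"; vowels reset prev
            prev = code
    return (first + "".join(digits) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # both map to R163
print(soundex("Tymczak"))                    # T522
```

Because "Robert" and "Rupert" receive the same code, a Soundex-based normalizer cannot distinguish them, which is exactly the kind of information loss we want to avoid.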
You can find our writeup of the project here. Comments are welcome; leave them in the issues section or send them to jasperginn[at]gmail.com.
This repository contains the following files:
```
+-- data
|   +-- extra          - contains wikipedia dataset with commonly misspelled words
|   +-- preprocessed   - contains preprocessed datasets
|   +-- raw            - contains raw data (not preprocessed)
+-- docs               - contains presentation and writeup
+-- modeling           - contains Jupyter notebooks used for modeling
+-- models             - contains pre-trained models
+-- phonorm            - contains utilities and code for modeling
+-- preprocessing      - contains utilities and code for preprocessing data
+-- .gitignore
+-- README.md
+-- requirements.txt
```
If you want to retrain the model using the data in this repository, be aware that training will be slow on CPUs. You should consider using a GPU.
At a minimum, you need a Python 3 installation. However, we recommend using Anaconda; the steps below assume that you are using Anaconda for this project.
```shell
# Create and activate the environment
conda create -n phonorm python=3.6 anaconda
conda activate phonorm   # on conda < 4.4, use: source activate phonorm

# Install the dependencies
conda install --yes --file requirements.txt
pip install git+https://github.com/abuccts/wikt2pron.git
```
If you are using a GPU, also install tensorflow-gpu:

```shell
conda install tensorflow-gpu
```
At this point, your environment is ready to be used.
If you want to train your own models, you should check out the modeling folder for examples.
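As a rough illustration of the preprocessing a character-level seq2seq model needs (the function and variable names below are illustrative, not phonorm's actual API), input strings are typically mapped to padded sequences of integer character IDs:

```python
def build_vocab(pairs):
    """Build a char-to-id vocabulary from (source, target) string pairs."""
    chars = sorted({c for src, tgt in pairs for c in src + tgt})
    return {c: i + 2 for i, c in enumerate(chars)}  # 0 = padding, 1 = unknown

def encode(text, vocab, max_len):
    """Map a string to a fixed-length list of character IDs."""
    ids = [vocab.get(c, 1) for c in text[:max_len]]
    return ids + [0] * (max_len - len(ids))

# Hypothetical normalization pairs of the kind observed in chat text
pairs = [("thx", "thanks"), ("u", "you"), ("gr8", "great")]
vocab = build_vocab(pairs)
print(encode("thx", vocab, 8))
```

The modeling notebooks use this style of representation as input to the encoder and decoder networks; see them for the actual preprocessing code.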
If you want to use the pre-trained models, please see the examples folder.