Text processing notes

From Simson Garfinkel

Basic rules for people learning text processing:

  • Use Python3, not Python2. Python3 is the "modern" version of Python and has proper support for Unicode. There will be no more development on Python2, so if you learn Python2 first, you will then have to learn how Python3 differs.
    • In Python 3, print is a function rather than a statement, so it requires parentheses. You will see code that looks like this: print(3) and not like this: print 3
  • Use the Natural Language Toolkit (NLTK), because it gives you a lot of power and there are a lot of examples.
  • Remember to start with a small extract of the corpus you want to work with. There's no reason to wait 5 minutes to see how something worked if you could have found out in 5 seconds. Work with a few paragraphs of text, not hundreds of pages, at least while you are learning the language.
  • Although it's tempting to work in IPython or a Jupyter notebook, be sure to keep extensive notes of everything you try. The easiest way to do this is to save your code in Python files that run completely on their own from the command line. These files are typically called *scripts* or **programs**.
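
As an illustration of the last two rules, here is a minimal sketch of a self-contained script that works on a small extract. The file name wordcount.py, the SAMPLE text, and the count_words helper are made up for this example. The tokenizer is a deliberately crude regular expression from the standard library, not NLTK, so the script runs without installing anything or downloading any data.

```python
#!/usr/bin/env python3
"""wordcount.py (hypothetical name): count words in a small extract."""
import re
from collections import Counter

# A tiny stand-in for the corpus; swap in a few paragraphs of your own text.
SAMPLE = "The cat sat on the mat. The mat was flat."


def count_words(text):
    """Return a Counter mapping each lowercase word to its frequency."""
    # Crude tokenizer (an assumption for this sketch): runs of letters
    # and apostrophes. A real project might use an NLTK tokenizer instead.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def main():
    # Print the three most frequent words, one per line.
    for word, n in count_words(SAMPLE).most_common(3):
        print(word, n)


if __name__ == "__main__":
    main()
```

Saved to a file, this can be run from the command line with python3 wordcount.py, and the file itself serves as a record of what was tried.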

For Beginners

You will want a basic working knowledge of Python first. Check out Python Resources.

Some videos:

Regular Expressions

Here are two web-based regular expression testers. Be sure that you use the Python version:
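
Once a pattern behaves as expected in a tester, it can be tried directly in Python with the standard re module. The following is only a small sketch: the sample text and the pattern (three digits, a hyphen, four digits) are invented for illustration.

```python
import re

# Made-up sample text for this sketch.
text = "Call 555-1234 or 555-9876 before 5pm."

# Use a raw string (r"...") so backslashes reach the regex engine unchanged.
# \b is a word boundary; \d{3} means exactly three digits.
pattern = re.compile(r"\b\d{3}-\d{4}\b")

# findall returns every non-overlapping match as a list of strings.
matches = pattern.findall(text)
print(matches)
```

Regular expression dialects differ in small ways between languages, which is why a pattern should be checked against Python's own syntax before it goes into a script.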

More Advanced