Text processing notes

From Simson Garfinkel
Revision as of 09:15, 3 March 2018 by Simson

Basic rules for people learning text processing:

  • Use Python 3, not Python 2. Python 3 is the "modern" version of Python and has proper support for Unicode. There will be no more development on Python 2, so if you learn Python 2 first, you'll then need to learn how Python 3 differs.
    • In Python 3, print is a function rather than a statement, so its arguments go in parentheses. You will see code that looks like this: print(3) and not like this: print 3
  • Use the Natural Language Toolkit (NLTK), because it gives you a lot of power and there are a lot of examples.
  • Remember to start with a small extract of the corpus you want to work with. There's no reason to wait 5 minutes to see how something worked if you could have found out in 5 seconds. Work with a few paragraphs of text, not hundreds of pages, at least while you are learning the language.
  • Although it's tempting to work interactively in IPython or a Jupyter notebook, be sure to keep extensive notes of everything you try. The easiest way to do this is to save your code in Python files that run completely on their own from the command line. These files are typically called *scripts* or **programs**.
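
As a rough illustration of the rules above, here is a minimal sketch of a stand-alone Python 3 script that works on a small extract of a text and prints the most common words. The file name count_words.py, the helper names take_extract and word_counts, and the sample text are all made up for this example. It uses NLTK's wordpunct_tokenize when NLTK is installed and otherwise falls back to an equivalent regular expression, so it should run either way.

```python
#!/usr/bin/env python3
"""count_words.py: hypothetical example script; all names here are illustrative."""
import re
import sys
from collections import Counter

try:
    # NLTK's simple tokenizer; it needs no extra data downloads.
    from nltk.tokenize import wordpunct_tokenize
except ImportError:
    # Fallback (assumption: NLTK is not installed). Same pattern that
    # wordpunct_tokenize uses: runs of word characters or runs of punctuation.
    def wordpunct_tokenize(text):
        return re.findall(r"\w+|[^\w\s]+", text)

# A tiny built-in sample so the script runs with no arguments.
SAMPLE = (
    "The quick brown fox jumps over the lazy dog.\n\n"
    "The dog was not amused. The fox did not care.\n\n"
    "Later, the fox and the dog became friends.\n\n"
    "This fourth paragraph is left out of the extract."
)


def take_extract(text, max_paragraphs=3):
    """Return only the first few paragraphs (separated by blank lines)."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return "\n\n".join(paragraphs[:max_paragraphs])


def word_counts(text):
    """Count lowercase alphabetic tokens in the text."""
    tokens = wordpunct_tokenize(text.lower())
    return Counter(t for t in tokens if t.isalpha())


def main(argv):
    # Read the file named on the command line, or use the built-in sample.
    if len(argv) > 1:
        with open(argv[1], encoding="utf-8") as f:
            text = f.read()
    else:
        text = SAMPLE
    extract = take_extract(text)
    for word, count in word_counts(extract).most_common(5):
        print(word, count)  # print is a function in Python 3


if __name__ == "__main__":
    main(sys.argv)
```

Saved as a file, this can be run from the command line as python3 count_words.py (using the built-in sample) or python3 count_words.py mytext.txt, where mytext.txt stands for any UTF-8 text file of your own. Because the whole experiment lives in one file, the file itself serves as a record of what you tried.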


You will want a basic understanding of Python before you start. Check out Python Resources.

These sources look okay, but I haven't vetted them personally: