Difference between revisions of "Text processing notes"

From Simson Garfinkel
(Created page with "Basic rules for people learning text processing: * Use Python3, not Python2. Python3 is the "modern" version for python which has proper support for Unicode. There will be no...")
 
 
(2 intermediate revisions by the same user not shown)
* Although it's tempting to work with ipython or Jupyter notebook, be sure to keep extensive notes of everything you try. The easiest way to do this is to save your code in python files that run completely on their own from the command line. These files are typically called *scripts* or **programs**.


==For Beginners==


You really want a basic idea about Python. Check out [[Python Resources]].


These sources look okay, but I haven't vetted them personally:
* http://nealcaren.web.unc.edu/an-introduction-to-text-analysis-with-python-part-1/ Very basic, includes a beginner's introduction to python
* https://www.amazon.com/Text-Processing-Python-David-Mertz/dp/0321112547  This book is available [http://gnosis.cx/TPiP/ online] in text form.
* http://opentechschool.github.io/python-data-intro/ Assumes you already kind of know Python
* http://www.nltk.org/book/ The NLTK book


Some videos:
* https://blog.pusher.com/introduction-to-natural-language-processing-with-python/ (25min, includes basic intro to python)
* https://www.coursera.org/learn/python-text-mining/lecture/AZCCB/basic-natural-language-processing (University of Michigan, course 4 of 5 of the data science certification)  [https://www.coursera.org/learn/python-text-mining (course description)]


== Regular Expressions ==
Here are two web-based regular expression testers. Be sure that you use the Python version:
* https://regex101.com
* https://www.regexplanet.com/advanced/python/index.html
==More Advanced==
* https://medium.com/createdd-notes/introduction-to-natural-language-processing-with-python-294988dbae56
* http://www.akbarian.org/notes/text-mining-nlp-python/

Latest revision as of 17:21, 7 March 2018

Basic rules for people learning text processing:

  • Use Python3, not Python2. Python3 is the "modern" version of Python and has proper support for Unicode. There will be no more development on Python2, so if you learn Python2, you'll then need to learn how Python3 is different.
    • In Python 3, print is a function, so it requires parentheses. You will see code that looks like this: print(3) and not like this: print 3
  • Use the Natural Language Toolkit (NLTK), because it gives you a lot of power and there are a lot of examples.
  • Remember to start with a small extract of the corpus you want to work with. There's no reason to wait 5 minutes to see how something worked if you could have found out in 5 seconds. Work with a few paragraphs of text, not hundreds of pages, at least while you are learning the language.
  • Although it's tempting to work with ipython or Jupyter notebook, be sure to keep extensive notes of everything you try. The easiest way to do this is to save your code in python files that run completely on their own from the command line. These files are typically called *scripts* or **programs**.
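
As a rough illustration of the rules above, here is a minimal sketch of a standalone Python 3 script that counts words in a small extract of a file. The script name (wordcount.py), the function names, the 2000-character limit, and the built-in sample sentence are all assumptions made for this example. It uses only the standard library so that it runs anywhere; once NLTK is installed, a tokenizer such as nltk.word_tokenize could probably replace the regular expression.

```python
#!/usr/bin/env python3
"""Count the most common words in a small extract of a text file.

Hypothetical usage: python3 wordcount.py myfile.txt
"""
import re
import sys
from collections import Counter

# Made-up sample text, used only when no file name is given.
SAMPLE = "The café is open. The cafe is closed. THE END."


def read_extract(path, max_chars=2000):
    """Return only the first max_chars characters of the file."""
    with open(path, encoding="utf-8") as f:
        return f.read(max_chars)


def count_words(text):
    """Lowercase the text and count each run of word characters.

    In Python 3, \\w matches Unicode letters, so 'café' is one word.
    """
    return Counter(re.findall(r"\w+", text.lower()))


def main(argv):
    # Work on a small extract first; raise max_chars once the code works.
    text = read_extract(argv[1]) if len(argv) > 1 else SAMPLE
    for word, n in count_words(text).most_common(10):
        print(word, n)  # print() is a function in Python 3


if __name__ == "__main__":
    main(sys.argv)
```

Keeping the code in a file like this means every experiment can be rerun from the command line, and the file itself serves as a record of what was tried.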

For Beginners

You really want a basic idea about Python. Check out Python Resources.

These sources look okay, but I haven't vetted them personally:

  • http://nealcaren.web.unc.edu/an-introduction-to-text-analysis-with-python-part-1/ Very basic, includes a beginner's introduction to python
  • https://www.amazon.com/Text-Processing-Python-David-Mertz/dp/0321112547 This book is available online (http://gnosis.cx/TPiP/) in text form.
  • http://opentechschool.github.io/python-data-intro/ Assumes you already kind of know Python
  • http://www.nltk.org/book/ The NLTK book

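Some videos:

  • https://blog.pusher.com/introduction-to-natural-language-processing-with-python/ (25min, includes basic intro to python)
  • https://www.coursera.org/learn/python-text-mining/lecture/AZCCB/basic-natural-language-processing (University of Michigan, course 4 of 5 of the data science certification) Course description: https://www.coursera.org/learn/python-text-mining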

Regular Expressions

Here are two web-based regular expression testers. Be sure that you use the Python version:

  • https://regex101.com
  • https://www.regexplanet.com/advanced/python/index.html
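
Once a pattern works in a tester, it can be moved into a script using Python's built-in re module. The sketch below is only an illustration: the sample sentences and both patterns are made up for this example, not taken from any particular corpus.

```python
import re

# Made-up sample text for illustration.
text = "Call 555-1234 on 2018-03-07, or 555-9876 on 2018-03-08."

# Dates written as YYYY-MM-DD. With three groups in the pattern,
# findall() returns one (year, month, day) tuple per match.
date_pattern = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
dates = date_pattern.findall(text)
print(dates)

# Whole words ending in "ing". With no groups in the pattern,
# findall() returns the matched strings themselves.
ing_words = re.findall(r"\b\w+ing\b", "Testing and vetting a working pattern")
print(ing_words)
```

Writing patterns as raw strings (the r prefix) keeps backslashes such as \b and \d from being interpreted by Python before the re module sees them.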

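More Advanced

  • https://medium.com/createdd-notes/introduction-to-natural-language-processing-with-python-294988dbae56
  • http://www.akbarian.org/notes/text-mining-nlp-python/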