Difference between revisions of "Text processing notes"

From Simson Garfinkel
(One intermediate revision by the same user not shown)
Line 6: Line 6:
* Remember to start with a small extract of the corpus you want to work with. There's no reason to wait 5 minutes to see how something worked if you could have found out in 5 seconds. Work with a few paragraphs of text, not hundreds of pages, at least while you are learning the language.
* Although it's tempting to work with IPython or a Jupyter notebook, be sure to keep extensive notes of everything you try. The easiest way to do this is to save your code in Python files that run completely on their own from the command line. These files are typically called *scripts* or **programs**.


==For Beginners==
Line 14: Line 13:
* http://nealcaren.web.unc.edu/an-introduction-to-text-analysis-with-python-part-1/ Very basic, includes a beginner's introduction to Python
* http://opentechschool.github.io/python-data-intro/ Assumes you already kind of know Python
* https://www.amazon.com/Text-Processing-Python-David-Mertz/dp/0321112547  This book is available [http://gnosis.cx/TPiP/ online] in text form.
* http://www.nltk.org/book/ The NLTK book


Some videos:
Line 20: Line 19:
* https://www.coursera.org/learn/python-text-mining/lecture/AZCCB/basic-natural-language-processing (University of Michigan, course 4 of 5 of the data science certification)  [https://www.coursera.org/learn/python-text-mining (course description)]


 
== Regular Expressions ==
 
Here are two web-based regular expression testers. Be sure that you use the Python version:
* https://regex101.com
* https://www.regexplanet.com/advanced/python/index.html


==More Advanced==
* https://medium.com/createdd-notes/introduction-to-natural-language-processing-with-python-294988dbae56
* http://www.akbarian.org/notes/text-mining-nlp-python/

Latest revision as of 17:21, 7 March 2018

Basic rules for people learning text processing:

  • Use Python 3, not Python 2. Python 3 is the current version of the language and has proper support for Unicode. There will be no more development on Python 2, so if you learn Python 2 first, you'll then need to learn how Python 3 is different.
    • In Python 3, print is a function, so its arguments must be in parentheses. You will see code that looks like this: print(3) and not like this: print 3
  • Use the Natural Language Toolkit (NLTK), because it gives you a lot of power and there are a lot of examples.
  • Remember to start with a small extract of the corpus you want to work with. There's no reason to wait 5 minutes to see how something worked if you could have found out in 5 seconds. Work with a few paragraphs of text, not hundreds of pages, at least while you are learning the language.
  • Although it's tempting to work with IPython or a Jupyter notebook, be sure to keep extensive notes of everything you try. The easiest way to do this is to save your code in Python files that run completely on their own from the command line. These files are typically called *scripts* or **programs**.
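
As a rough illustration of these rules, here is a minimal sketch of a script that runs on its own from the command line and works on a tiny extract. The file name wordcount.py, the SAMPLE text, and the word_counts helper are all made up for this example, and it uses only the standard library rather than NLTK so that it runs without installing anything.

```python
#!/usr/bin/env python3
# wordcount.py (hypothetical file name): run with `python3 wordcount.py`
import re
from collections import Counter

# A couple of invented sentences stand in for a small extract of a real corpus.
SAMPLE = """The quick brown fox jumps over the lazy dog.
The dog barks, and the fox runs away."""


def word_counts(text):
    """Lowercase the text and count each run of the letters a-z."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


if __name__ == "__main__":
    # print() is a function in Python 3, so the parentheses are required.
    for word, count in word_counts(SAMPLE).most_common(3):
        print(word, count)
```

Because the script is a plain file, it doubles as a record of what you tried: keep each experiment as its own file, and swap SAMPLE for a larger extract only once the small one behaves as expected.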

For Beginners

You need a basic working knowledge of Python first. Check out Python Resources.

Some videos:

Regular Expressions

Here are two web-based regular expression testers. Be sure that you use the Python version:

  • https://regex101.com
  • https://www.regexplanet.com/advanced/python/index.html
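
Regular expression syntax differs a little between languages, which is one reason to set a tester to Python. Once a pattern works there, it can be pasted into Python's built-in re module. The sketch below is only an illustration: the sample sentence and the DATE pattern are invented for this example. It uses (?P<name>...), which is Python's syntax for a named group.

```python
import re

# Invented sample text for the example.
text = "Revised 7 March 2018; first drafted 21 February 2018."

# Day, capitalized month name, four-digit year, each captured as a named group.
DATE = re.compile(r"(?P<day>\d{1,2}) (?P<month>[A-Z][a-z]+) (?P<year>\d{4})")

# finditer() yields one match object per date; groupdict() maps group names to text.
dates = [m.groupdict() for m in DATE.finditer(text)]
for d in dates:
    print(d["day"], d["month"], d["year"])
```

Writing the pattern as a raw string (the r"..." prefix) keeps backslashes such as \d from being interpreted by Python before the re module sees them.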

More Advanced