Text processing notes
From Simson Garfinkel
Latest revision as of 17:21, 7 March 2018
Basic rules for people learning text processing:
- Use Python 3, not Python 2. Python 3 is the "modern" version of Python and has proper support for Unicode. There will be no more development on Python 2, so if you learn Python 2, you will then need to learn how Python 3 differs.
- In Python 3, print is a function, so it requires parentheses. You will see code that looks like this: print(3) and not like this: print 3
- Use the Natural Language Toolkit (NLTK), because it gives you a lot of power and there are a lot of examples.
- Remember to start with a small extract of the corpus you want to work with. There's no reason to wait 5 minutes to see how something worked if you could have found out in 5 seconds. Work with a few paragraphs of text, not hundreds of pages, at least while you are learning the language.
- Although it's tempting to work in IPython or a Jupyter notebook, be sure to keep extensive notes of everything you try. The easiest way to do this is to save your code in Python files that run completely on their own from the command line. These files are typically called *scripts* or **programs**.
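As a rough sketch of what these rules look like together, here is a small standalone script. The file name word_counts.py, the sample text, and the function names are all made up for illustration. It uses only the Python 3 standard library rather than NLTK, so that it runs without installing anything; treat it as a starting point, not as the recommended way to tokenize text.

```python
#!/usr/bin/env python3
"""word_counts.py: count the most common words in a small text extract.

Hypothetical example: the file name, sample text, and function names
are illustrative assumptions, not part of the original notes.
"""
import re
from collections import Counter

# A couple of sentences stand in for the full corpus while experimenting.
SAMPLE = "The café was quiet. The naïve visitor ordered a café au lait."


def tokenize(text):
    """Lowercase the text and split it into words.

    In Python 3, strings are Unicode, so \\w also matches accented letters.
    """
    return re.findall(r"\w+", text.lower())


def top_words(text, n=3):
    """Return the n most common words as (word, count) pairs."""
    return Counter(tokenize(text)).most_common(n)


def main():
    # print is a function in Python 3, so it needs parentheses.
    for word, count in top_words(SAMPLE):
        print(word, count)


if __name__ == "__main__":
    main()
```

Save it to a file and run it from the command line with python3 word_counts.py. Once it behaves the way you expect on the sample, swap in a few paragraphs from your real corpus.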
For Beginners
You really want a basic understanding of Python before you start. Check out Python Resources.
- http://nealcaren.web.unc.edu/an-introduction-to-text-analysis-with-python-part-1/ Very basic, includes a beginner's introduction to Python
- http://opentechschool.github.io/python-data-intro/ Assumes you already kind of know Python
- http://www.nltk.org/book/ The NLTK book
Some videos:
- https://blog.pusher.com/introduction-to-natural-language-processing-with-python/ (25 min, includes a basic intro to Python)
- https://www.coursera.org/learn/python-text-mining/lecture/AZCCB/basic-natural-language-processing (University of Michigan, course 4 of 5 of the data science certification) (course description: https://www.coursera.org/learn/python-text-mining)
Regular Expressions
Here are two web-based regular expression testers. Be sure that you use the Python version: