Naval Postgraduate School
Fall 2008
Wed Oct 15, 2008
SQL and Lucene: Indexing human documents
- SQL
- Lucene
- Google Search
- Prefix arrays
Lesson Plan
We will be doing a document exploitation task to learn about searching and indexing.
- You will be given a set of files (file00.txt through file99.txt). Each file contains words and email addresses,
including the email addresses of everyone in the class. We would like to answer the following
questions:
- In which file(s) does your email address appear?
- How many NPS email addresses are in all of the files?
- Which file has the most NPS email addresses?
- Which email address(es) appear in every file?
- Which two files have the most email addresses in common?
- Download the names and the make.py program.
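One possible way to approach the questions above, before reaching for SQL or Lucene, is a small in-memory inverted index that maps each email address to the set of files containing it. The sketch below is only an illustration under stated assumptions: the regular expression is a simplified email matcher, NPS addresses are assumed to end in "@nps.edu", "how many" is read as a count of distinct addresses, and every function name here (build_index, load_docs, and so on) is hypothetical rather than part of the provided make.py.

```python
import re
from collections import defaultdict
from itertools import combinations
from pathlib import Path

# Simplified email pattern (an assumption for illustration, not a full RFC matcher).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def load_docs(directory):
    """Read file*.txt from a directory into a {file name: text} dict."""
    return {p.name: p.read_text(errors="replace")
            for p in sorted(Path(directory).glob("file*.txt"))}

def build_index(docs):
    """Inverted index: lowercased email address -> set of file names."""
    index = defaultdict(set)
    for name, text in docs.items():
        for addr in EMAIL_RE.findall(text):
            index[addr.lower()].add(name)
    return index

def files_containing(index, addr):
    """Files in which a given address appears."""
    return sorted(index.get(addr.lower(), ()))

def nps_addresses(index, domain="nps.edu"):
    """Distinct addresses in the assumed NPS domain."""
    return {a for a in index if a.endswith("@" + domain)}

def file_with_most(index, addrs):
    """File containing the largest number of the given addresses."""
    counts = defaultdict(int)
    for a in addrs:
        for f in index[a]:
            counts[f] += 1
    return max(counts, key=counts.get) if counts else None

def in_every_file(index, n_files):
    """Addresses whose posting set covers all n_files files."""
    return sorted(a for a, files in index.items() if len(files) == n_files)

def most_shared_pair(index):
    """Pair of files sharing the most distinct addresses."""
    shared = defaultdict(int)
    for files in index.values():
        for pair in combinations(sorted(files), 2):
            shared[pair] += 1
    return max(shared, key=shared.get) if shared else None

# Tiny made-up corpus standing in for file00.txt through file99.txt.
docs = {
    "file00.txt": "hello alice@nps.edu and bob@example.com",
    "file01.txt": "alice@nps.edu carol@nps.edu bob@example.com "
                  "dave@example.org eve@example.net",
    "file02.txt": "alice@nps.edu dave@example.org eve@example.net",
}
index = build_index(docs)
print(files_containing(index, "bob@example.com"))
print(len(nps_addresses(index)))
print(file_with_most(index, nps_addresses(index)))
print(in_every_file(index, len(docs)))
print(most_shared_pair(index))
```

For the real data, `build_index(load_docs("."))` would build the same index from the files on disk. Ties in `file_with_most` and `most_shared_pair` are broken arbitrarily in this sketch; the same index could also be stored as an (address, file) table and queried with SQL.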
Slides for today's class.
References
Readings
Optional Readings
- The book Learning SQL is available through the ACM Online Book Program and Safari (NPS has licenses for both).