This book offers a highly accessible introduction to natural language processing, the field that supports a variety of language technologies, from predictive text and email filtering to automatic summarization and translation. With it, you'll learn how to write Python programs that work with large collections of unstructured text. You'll access richly annotated datasets using a comprehensive range of linguistic data structures, and you'll understand the main algorithms for analyzing the content and structure of written communication.
Packed with examples and exercises, Natural Language Processing with Python will help you:
Extract information from unstructured text, either to guess the topic or identify "named entities"
Analyze linguistic structure in text, including parsing and semantic analysis
Access popular linguistic databases, including WordNet and treebanks
Integrate techniques drawn from fields as diverse as linguistics and artificial intelligence
This book will help you gain practical skills in natural language processing using the Python programming language and the Natural Language Toolkit (NLTK) open source library. If you're interested in developing web applications, analyzing multilingual news sources, or documenting endangered languages -- or if you're simply curious to have a programmer's perspective on how human language works -- you'll find Natural Language Processing with Python both fascinating and immensely useful.
Chapter 1 Language Processing and Python
Computing with Language: Texts and Words
A Closer Look at Python: Texts as Lists of Words
Computing with Language: Simple Statistics
Back to Python: Making Decisions and Taking Control
Automatic Natural Language Understanding
Chapter 2 Accessing Text Corpora and Lexical Resources
Accessing Text Corpora
Conditional Frequency Distributions
More Python: Reusing Code
Chapter 3 Processing Raw Text
Accessing Text from the Web and from Disk
Strings: Text Processing at the Lowest Level
Text Processing with Unicode
Regular Expressions for Detecting Word Patterns
Useful Applications of Regular Expressions
Regular Expressions for Tokenizing Text
Formatting: From Lists to Strings
Chapter 4 Writing Structured Programs
Back to the Basics
Questions of Style
Functions: The Foundation of Structured Programming
Doing More with Functions
A Sample of Python Libraries
Chapter 5 Categorizing and Tagging Words
Using a Tagger
Mapping Words to Properties Using Python Dictionaries
Steven Bird is Associate Professor in the Department of Computer Science and Software Engineering at the University of Melbourne, and Senior Research Associate in the Linguistic Data Consortium at the University of Pennsylvania. He completed a PhD on computational phonology at the University of Edinburgh in 1990, supervised by Ewan Klein. He later moved to Cameroon to conduct linguistic fieldwork on the Grassfields Bantu languages under the auspices of the Summer Institute of Linguistics. More recently, he spent several years as Associate Director of the Linguistic Data Consortium where he led an R&D team to create models and tools for large databases of annotated text. At Melbourne University, he established a language technology research group and has taught at all levels of the undergraduate computer science curriculum. In 2009, Steven is President of the Association for Computational Linguistics.
Ewan Klein is Professor of Language Technology in the School of Informatics at the University of Edinburgh. He completed a PhD on formal semantics at the University of Cambridge in 1978. After some years working at the Universities of Sussex and Newcastle upon Tyne, Ewan took up a teaching position at Edinburgh. He was involved in the establishment of Edinburgh's Language Technology Group in 1993, and has been closely associated with it ever since. From 2000-2002, he took leave from the University to act as Research Manager for the Edinburgh-based Natural Language Research Group of Edify Corporation, Santa Clara, and was responsible for spoken dialogue processing. Ewan is a past President of the European Chapter of the Association for Computational Linguistics and was a founding member and Coordinator of the European Network of Excellence in Human Language Technologies (ELSNET).
Edward Loper has recently completed a PhD on machine learning for natural language processing at the University of Pennsylvania. Edward was a student in Steven's graduate course on computational linguistics in the fall of 2000, and went on to be a TA and share in the development of NLTK. In addition to NLTK, he has helped develop two packages for documenting and testing Python software, epydoc and doctest.
The animal on the cover of Natural Language Processing with Python is a right whale, the rarest of all large whales. It is identifiable by its enormous head, which can measure up to one-third of its total body length. It lives in temperate and cool seas in both hemispheres at the surface of the ocean. It's believed that the right whale may have gotten its name from whalers who thought that it was the "right" whale to kill for oil. Even though it has been protected since the 1930s, the right whale is still the most endangered of all the great whales.
The large and bulky right whale is easily distinguished from other whales by the calluses on its head. It has a broad back without a dorsal fin and a long arching mouth that begins above the eye. Its body is black, except for a white patch on its belly. Wounds and scars may appear bright orange, often becoming infested with whale lice or cyamids. The calluses, which are also found near the blowholes, above the eyes, and on the chin and upper lip, are black or gray. It has large flippers that are shaped like paddles, and a distinctive V-shaped blow, caused by the widely spaced blowholes on the top of its head, which rises to 16 feet above the ocean's surface.
The right whale feeds on planktonic organisms, including shrimp-like krill and copepods. As baleen whales, they have a series of 225-250 fringed overlapping plates hanging from each side of the upper jaw, where teeth would otherwise be located. The plates are black and can be as long as 7.2 feet. Right whales are "grazers of the sea," often swimming slowly with their mouths open. As water flows into the mouth and through the baleen, prey is trapped near the tongue.
Because females are not sexually mature until 10 years of age and they give birth to a single calf after a year-long pregnancy, populations grow slowly. The young right whale stays with its mother for one year.
Right whales are found worldwide but in very small numbers. A right whale is commonly found alone or in small groups of 1 to 3, but when courting, they may form groups of up to 30. Like most baleen whales, they are seasonally migratory. They inhabit colder waters for feeding and then migrate to warmer waters for breeding and calving. Although they may move far out to sea during feeding seasons, right whales give birth in coastal areas. Interestingly, many of the females do not return to these coastal breeding areas every year, but visit the area only in calving years. Where they go in other years remains a mystery.
The right whale's only predators are orcas and humans. When danger lurks, a group of right whales may come together in a circle, with their tails pointing outward, to deter a predator. This defense is not always successful and calves are occasionally separated from their mother and killed.
Right whales are among the slowest swimming whales, although they may reach speeds up to 10 mph in short spurts. They can dive to at least 1,000 feet and can stay submerged for up to 40 minutes.
The right whale is extremely endangered, even after years of protected status. Only in the past 15 years is there evidence of a population recovery in the Southern Hemisphere, and it is still not known if the right whale will survive at all in the Northern Hemisphere. Although not presently hunted, current conservation problems include collisions with ships, conflicts with fishing activities, habitat destruction, oil drilling, and possible competition from other whale species. Right whales have no teeth, so ear bones and, in some cases, eye lenses can be used to estimate the age of a right whale at death. It is believed that right whales live at least 50 years, but there is little data on their longevity.
Comments about oreilly Natural Language Processing with Python:
This book is a near-perfect blend of natural language processing and Python used to its fullest. Not only did the authors describe NLP extremely well and provide clear explanations for many different conditions, but they also showed an effective use of Python to substantiate the technical content. The book presents a very detailed explanation of the Python-based Natural Language Toolkit, NLTK, which is also the brainchild of the authors. NLTK is a great piece of software. I have used the software off and on for the past year and a half and really like how it was designed and developed by its creators. The book builds up by explaining the usage of Python as a programming language to manipulate words, phrases and sentences. Accessing text corpora and direct text processing are very well described in the first hundred and twenty pages or so. Chapter six is an excellent chapter for technologists who would like to learn different ways to classify text. Although it is not in-depth, which did not seem to be the aim of this book, it gives readers a simple understanding. The concept of chunking text and its use in classification is very well explained with examples in the book. The methods of developing context-free grammars and parsing with these CFGs probably needed a somewhat deeper explanation, and perhaps some more examples would have helped. Overall the book is excellent, and I must say that it has been a very long time since I have read a book that was so satisfying. I would very strongly recommend this book to Python lovers who would like to explore the world of natural language understanding, parsing and processing. It brings out a very strong side of the Python programming language. I give this book an "A+".
Bottom Line Yes, I would recommend this to a friend
A guide to the classic computer science analysis of natural language text
Natural Language Processing with Python is about scanning text samples of human languages like English, Persian or Chinese with computer routines and doing tasks like counting word frequencies, parsing sentences, and further analyses that begin the difficult task of finding limited kinds of meaning in pieces of text.
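To give a feel for the simplest of those tasks, here is a minimal sketch of counting word frequencies. It is not taken from the book and does not use the toolkit; it relies only on the Python standard library, and the function name `word_frequencies`, the crude regular-expression tokenizer, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase word tokens found by a simple regular expression.

    Illustrative helper (an assumption, not an NLTK function): the pattern
    keeps runs of ASCII letters and apostrophes and ignores everything else.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Hypothetical sample text, chosen only to show repeated words.
sample = "The whale is the right whale. The right whale is rare."
freqs = word_frequencies(sample)

# Print each distinct word with its count, most frequent first.
for word, count in freqs.most_common():
    print(word, count)
```

A real project would want a better tokenizer than this one-line pattern, which is exactly the kind of problem the book spends time on.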
The book has a matching website www.nltk.org.
This book is addressed to a broad academic community:
One audience is liberal arts students.
The second audience is the computer science based student.
The third audience is teachers and researchers worldwide.
This book tries hard to be a high quality introduction to natural language processing.
Natural Language Processing itself is one of the great problems of computing. One of the enjoyable things about this book is that the authors carefully outline some of the great problems in computer science that are central to natural language processing. These problems are described starting with the texts and programs provided in the toolkit. The liberal arts students are included right at the start. The discussions include further reading references to the classics of computer science, like Knuth.
Natural Language Processing is also a field of some interest and utility to linguists, critics, historians, students of language and rhetoric and students of 20th century philosophy. This dimension is also covered with a good sequence of examples and references.
I remember reading the philosopher Wittgenstein (his writings vintage 1943) where he did thought experiments of putting words in a tray. This is a provocative way of thinking about meaning that could lead to some interesting Toolkit projects.
The fourth audience for this book might be the programmer seeking an interesting opportunity:
Is this a book that might help me write a project-specific text analysis engine? I have been wishing for a way to clarify and reorganize the Ubuntu Forums website with a structured language query tree.
Would the NLTK be useful if I wanted to write a search engine?
Problem one with using the NLTK in a search engine project is the non-commercial clause in the Creative Commons license. Using the NLTK as part of a search engine processing framework would require inquiry and clarification of the license terms.
Problem two with using the NLTK in a search engine project is that the search engine design will still require assembly of many other components. I recently did a Google search on search engines. The first hour of reading didn't really turn up a good search engine design article.
Would the NLTK be useful if I wanted to figure out the vocabulary used by a specific group of people to talk about a specific subject? A really fascinating item in this book in chapter 6 is the "Maximum Entropy Classifier". Here is the first occurrence in print of a formula for entropy that I can understand and duplicate with a pocket calculator.
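For anyone who wants to try the pocket-calculator arithmetic on a computer, here is a minimal sketch of the standard Shannon entropy formula, H = -sum(p * log2(p)), applied to a list of labels. It is an illustration under my own assumptions, not code copied from the book or the toolkit: the function name `entropy` and the sample labels are made up for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of the empirical distribution of `labels`.

    Illustrative helper (an assumption, not an NLTK function): each
    probability p is a label's count divided by the number of labels.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical label sets: an even split and a three-to-one split.
even_split = ["male", "male", "female", "female"]
uneven_split = ["male", "male", "male", "female"]

print(entropy(even_split))    # two equally likely labels carry one bit
print(entropy(uneven_split))  # a lopsided split carries less than one bit
```

The even split works out to exactly 1 bit, and the three-to-one split to about 0.81 bits, which is easy to check by hand: 0.75 * log2(1/0.75) + 0.25 * log2(1/0.25).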
Entropy is a key concept discussed by Shannon in his classic information theory article. I sometimes feel very disappointed that computers are not doing much with information. That fascinating parallel between entropy in information theory and entropy in physics and thermodynamics doesn't seem to be a boundary leading to developments.
Rather, computers and the Internet are indexing words and moving data very well. But the computers are not doing much in the way of "information processing" as in changing the entropy of a block of text.
In any case, the Natural Language Toolkit book and program suite is a guide to the classic computer science based approach of analyzing natural language text.
This review is also posted on slashdot.org in my user Journal under the user name beachdog.