Create your own natural language training corpus for machine learning. Whether you’re working with English, Chinese, or any other natural language, this hands-on book guides you through a proven annotation development cycle—the process of adding metadata to your training corpus to help ML algorithms work more efficiently. You don’t need any programming or linguistics experience to get started.
Using detailed examples at every step, you’ll learn how the MATTER Annotation Development Process helps you Model, Annotate, Train, Test, Evaluate, and Revise your training corpus. You also get a complete walkthrough of a real-world annotation project.
Define a clear annotation goal before collecting your dataset (corpus)
Learn tools for analyzing the linguistic content of your corpus
Build a model and specification for your annotation project
Examine the different annotation formats, from basic XML to the Linguistic Annotation Framework
Create a gold standard corpus that can be used to train and test ML algorithms
Select the ML algorithms that will process your annotated data
Evaluate the test results and revise your annotation task
Learn how to use lightweight software for annotating texts and adjudicating the annotations
This book is a perfect companion to O’Reilly’s Natural Language Processing with Python.
Chapter 1 The Basics
The Importance of Language Annotation
A Brief History of Corpus Linguistics
Language Data and Machine Learning
The Annotation Development Cycle
Chapter 2 Defining Your Goal and Dataset
Defining Your Goal
Assembling Your Dataset
The Size of Your Corpus
Chapter 3 Corpus Analytics
Basic Probability for Corpus Analytics
Chapter 4 Building Your Model and Specification
Some Example Models and Specs
Adopting (or Not Adopting) Existing Models
Different Kinds of Standards
Chapter 5 Applying and Adopting Annotation Standards
James Pustejovsky teaches and does research in Artificial Intelligence and Computational Linguistics in the Computer Science Department at Brandeis University. His main areas of interest include lexical meaning, computational semantics, temporal and spatial reasoning, and corpus linguistics. He is active in the development of standards for interoperability between language processing applications, and led the creation of the recently adopted ISO standard for time annotation, ISO-TimeML. He is currently heading the development of a standard for annotating spatial information in language. More information on publications and research activities can be found at his webpage: pusto.com.
Amber Stubbs recently completed her Ph.D. in Computer Science at Brandeis University, and is currently a Postdoctoral Associate at SUNY Albany. Her dissertation focused on creating an annotation methodology to aid in extracting high-level information from natural language files, particularly biomedical texts. Her website can be found at http://pages.cs.brandeis.edu/~astubbs/
The animal on the cover of Natural Language Annotation for Machine Learning is the cockatiel (Nymphicus hollandicus). Their scientific name came about from European travelers who found the birds so beautiful, they named them for mythical nymphs. Hollandicus refers to “New Holland,” an older name for Australia, the continent to which these birds are native. In the wild, cockatiels can be found in arid habitats like brushland or the outback, yet they remain close to water. They are usually seen in pairs, though flocks will congregate around a single body of water.
Until six to nine months after hatching, female and male cockatiels are indistinguishable, as both have horizontal yellow stripes on the surface of their tail feathers and a dull orange patch on each cheek. When molting begins, males lose some white or yellow feathers and gain brighter yellow feathers. In addition, the orange patches on the face become much more prominent. The lifespan of a cockatiel in captivity is typically 15–20 years, but they generally live between 10 and 30 years in the wild.
The cockatiel was considered either a parrot or a cockatoo for some time, as scientists and biologists hotly debated which bird it actually was. It is now classified as part of the cockatoo family because they both have the same biological features—namely, upright crests, gallbladders, and powder down (a special type of feather where the tips of barbules disintegrate, forming a fine dust among the feathers).
The cover image is from Johnson’s Natural History.
Comments about O'Reilly Natural Language Annotation for Machine Learning:
Programming languages have a very strict syntax; human languages do not. When you see "I am a sentence I am another sentence," you know that you're really looking at two different sentences even though the period between "sentence" and "I" is missing. Try something similar with a computer (leave off a semicolon in C or miss an indent in Python, for example) and you'll get a nasty error message. This book aims to teach you how to program your computer to work with the looser languages used by humans (like English) instead of the stricter counterparts used by machines.
The content available so far gives you a brief background on the relevant parts of language: grammar, pragmatics, discourse analysis, etc. The authors go on to talk about setting up an annotation project: determining your goal, creating your model/specification, and creating and storing your annotations in a way that is flexible but easy for annotators to produce.
Though a bit dry, the writing is clear and simple. I had no previous experience in this area, but I had no trouble understanding the subject matter for the most part.
Disclosure: I received this book for free through the O'Reilly Blogger program.
Comments about O'Reilly Natural Language Annotation for Machine Learning:
The book's title is misleading. Its subtitle - A Guide to Corpus Building for Applications - is more descriptive. I believe that not only machine learners, but linguists (esp. corpus and computational linguists), practitioners of the digital humanities and others who are using and/or collecting linguistic data can deepen their knowledge with the help of this terrific book.
Although O'Reilly will publish the book in Sept. 2012, it is already available as an Early Release in electronic format. Keep in mind that this is a "work in progress" version when you come across sentences starting with lowercase letters and references to Chapter??? and Appendix???. You will also find references to chapters that are not included in the book yet. These come with an early release, and they are few enough that they don't distract from the reading experience.
It is hard to define the target group of this title. According to the preface, you can read it without any previous knowledge of linguistics or natural language processing, but I think when you read such things in a book from a publisher of technical books, you can assume that the authors' hands were guided by someone in the marketing department. You needn't be a linguist or an NLP guru to understand the content, but you do need some background in the field. Previous exposure to NLTK (and the NLTK book) and some basic knowledge of corpus linguistics (e.g. Corpus Linguistics by McEnery and Wilson, Corpus Linguistics by McEnery and Hardie, or Gries's brilliant Quantitative Corpus Linguistics with R) are essential to understand the role of corpora in applied and academic research.
The first chapter ("The Basics") gives a detailed review of what corpus linguistics is, what a corpus is, and how it relates to machine learning tasks. If you want a broader overview of the theory and history of corpus linguistics, I recommend the first chapter of McEnery and Wilson. Although Leech's name is mentioned in this chapter, I missed a mention of his seven maxims of annotation (again, McEnery and Wilson can help you out here). We also get a brief summary of the MATTER methodology, which is the main topic of the book. MATTER stands for Model, Annotate, Train, Test, Evaluate, Revise: the steps of the corpus development cycle. This high-level intro puts the method into context, which helps in understanding the following chapters, and I think it can serve as an "executive summary" too. I loved the brief section on relevance testing (precision, recall, F-measure), as these are vitally important in real-world applications.
The second chapter (Defining Your Goal and Dataset) is about the 'M' in the MATTER cycle. It gives practical advice for defining the statement of purpose and expanding it to see how you can reach your goals. I like the pragmatic tone of the chapter. Sure, you have a great idea, but you have to consider the task and the available resources, and you have to collect some data, so think it over and define why you are collecting data, what kind of data you want to collect, and how you will process it. This process involves a lot of thinking and weighing of possibilities, and the book helps you go through these steps.
Chapter three (Building Your Model and Specification) stays at the 'M', but it gets more realistic. It is about the formal definition of models and how to implement them (in XML). The topic (XML and various standards) may seem boring, but the authors did a great job, and it is very refreshing to see the fragmented pieces of information compiled into a compact yet enjoyable chapter (ok, maybe only linguists think this is not boring).
The fourth chapter (Applying and Adopting Annotation Standards to Your Model) gives hints about bending standards and resources to your needs. It weighs technological considerations along with human factors (aka annotators), and shows best practices that serve both sides well.
I do hope more chapters will be available soon. The practical focus and the vivid real-world examples (e.g. named entity recognition, semantic role labeling, etc.) make the book very accessible to a wider audience. It contains valuable information that used to be almost inaccessible; until now, it took a long time to collect the knowledge necessary to build corpora. I think this title will be a great success, just as Semantic Web for the Working Ontologist was in the semantic web and enterprise ontologist community.
Bottom Line: Yes, I would recommend this to a friend