Natural Language Annotation for Machine Learning
A Guide to Corpus-Building for Applications
Publisher: O'Reilly Media
Final Release Date: October 2012
Pages: 342

Create your own natural language training corpus for machine learning. Whether you’re working with English, Chinese, or any other natural language, this hands-on book guides you through a proven annotation development cycle—the process of adding metadata to your training corpus to help ML algorithms work more efficiently. You don’t need any programming or linguistics experience to get started.

Using detailed examples at every step, you’ll learn how the MATTER Annotation Development Process helps you Model, Annotate, Train, Test, Evaluate, and Revise your training corpus. You also get a complete walkthrough of a real-world annotation project.

  • Define a clear annotation goal before collecting your dataset (corpus)
  • Learn tools for analyzing the linguistic content of your corpus
  • Build a model and specification for your annotation project
  • Examine the different annotation formats, from basic XML to the Linguistic Annotation Framework
  • Create a gold standard corpus that can be used to train and test ML algorithms
  • Select the ML algorithms that will process your annotated data
  • Evaluate the test results and revise your annotation task
  • Learn how to use lightweight software for annotating texts and adjudicating the annotations

This book is a perfect companion to O’Reilly’s Natural Language Processing with Python.

Review Snapshot (by PowerReviews)

Average rating: 4.0 (based on 3 reviews)

Ratings Distribution

  • 5 Stars (1)
  • 4 Stars (1)
  • 3 Stars (1)
  • 2 Stars (0)
  • 1 Star (0)

Best Uses

  • Intermediate (3)

Reviewed by 3 customers

      (0 of 3 customers found this review helpful)

       
      5.0

      Terrific!

      By mmateva

      from Sofia, Bulgaria

      About Me Developer

      Pros

      • Accurate
      • Easy to understand
      • Helpful examples
      • Well-written

        Best Uses

        • Intermediate
        • Novice
        • Student

        Comments about oreilly Natural Language Annotation for Machine Learning:

        Deep, easy-to-read, thorough.

        (0 of 6 customers found this review helpful)

         
        3.0

        A Bit Dry but Interesting

        By Anna

        from US

        About Me Developer

        Verified Reviewer

            Best Uses

            • Intermediate

            Comments about oreilly Natural Language Annotation for Machine Learning:

            Programming languages have a very strict syntax. When you see "I am a sentence I am another sentence," you know that you're really looking at two different sentences even though the period between "sentence" and "I" is missing. If you try something similar with the computer (try leaving the semi-colon off in C or miss an indent in Python, for example), you'll get a nasty error message. This book aims to teach you how to program your computer to work with the looser languages used by humans (like English) instead of the stricter counterparts used by machines.

            The content available so far gives you a brief background on the relevant parts of language -- grammar, pragmatics, discourse analysis, etc. The authors go on to talk about setting up an annotation project: determining your goal, creating your model/specification, and creating and storing your annotations in a way that is flexible yet easy for annotators to work with.

            Though a bit dry, the writing is clear and simple. I had no previous experience in this area, but I had no trouble understanding the subject matter for the most part.

            Disclosure: I received this book for free through the O'Reilly Blogger program.

            (5 of 6 customers found this review helpful)

             
            4.0

            Book review: Natural Language Annotation

            By Zoltan Varju

            from Szikszo, Hungary

            About Me Computational linguist, Researcher

            Verified Reviewer

            Pros

            • Concise
            • Helpful examples

            Cons

            • Not comprehensive enough

            Best Uses

            • Intermediate
            • Student

            Comments about oreilly Natural Language Annotation for Machine Learning:

            The book's title is misleading. Its subtitle - A Guide to Corpus Building for Applications - is more descriptive. I believe that not only machine learners, but linguists (esp. corpus and computational linguists), practitioners of the digital humanities and others who are using and/or collecting linguistic data can deepen their knowledge with the help of this terrific book.

            Although O'Reilly will publish the book in Sept. 2012, it is already available as an Early Release in electronic format. Keep in mind that this is a "work in progress" version when you come across sentences starting with lowercase letters and references to Chapter??? and Appendix???. You will also find references to chapters that are not included in the book yet. These are part of an early release, and they are few enough that they don't distract from the reading experience.

            It is hard to define the target group of this title. According to the preface, you can read it without any previous knowledge of linguistics and/or natural language processing, but I think when you read such things in a book from a publisher of technical books, you can assume that the authors' hands were led by someone in the marketing department. You don't need to be a linguist or an NLP guru to understand the content, but you do need some background in the field. Previous exposure to NLTK (and the NLTK book) and some basic knowledge of corpus linguistics (e.g. Corpus Linguistics by McEnery and Wilson, Corpus Linguistics by McEnery and Hardie, or Gries's brilliant Quantitative Corpus Linguistics with R) are essential to understand the role of corpora in applied and academic research.

            The first chapter ("The Basics") gives a detailed review of what corpus linguistics is, what a corpus is, and how corpora relate to machine learning tasks. If you want a broader overview of the theory and history of corpus linguistics, I recommend the first chapter of McEnery and Wilson. Although Leech's name is mentioned in this chapter, his seven maxims of annotation are not (again, McEnery and Wilson can help you out here). The chapter also gives a brief summary of the MATTER methodology, which is the main topic of the book. MATTER stands for Model, Annotate, Train, Test, Evaluate, Revise - the steps of the corpus development cycle. This high-level intro puts the method into context, which helps in understanding the following chapters - and I think it can serve as an "executive summary" too. I loved the brief section on relevance testing (precision, recall, F-measure), as these are vitally important in real-world applications.
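
For readers unfamiliar with these measures, here is a minimal Python sketch of how precision, recall, and F-measure can be computed by comparing a system's output against a gold standard. It is not taken from the book: the function name `precision_recall_f1` and the sample entity spans are hypothetical, chosen only for illustration, and the F-measure shown is the balanced F1 variant.

```python
def precision_recall_f1(gold, predicted):
    """Compare two collections of annotated items; return (precision, recall, F1)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)  # items the system got exactly right
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical named-entity annotations as (start, end, label) spans.
gold = {(0, 5, "PERSON"), (10, 16, "ORG"), (20, 26, "LOC"), (30, 34, "ORG")}
predicted = {(0, 5, "PERSON"), (10, 16, "ORG"), (40, 44, "LOC")}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Here two of the three predicted spans are correct (precision 2/3) and two of the four gold spans are found (recall 1/2). Treating spans as exact-match set members is a simplifying assumption; real evaluations often also score partial overlaps.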

            The second chapter (Defining Your Goal and Dataset) is about the 'M' in the MATTER cycle. It gives practical advice on defining the statement of purpose and expanding it to see how you can reach your goals. I like the pragmatic tone of the chapter. Sure, you have a great idea, but you have to consider the task and the available resources, and you have to collect some data - so think it over and define why you are collecting data, what kind of data you want to collect, and how you will process it. This process involves a lot of thinking and weighing of possibilities, and the book helps you go through these steps.

            Chapter three (Building Your Model and Specification) stays at the 'M', but it gets more realistic. It is about the formal definition of models and how to implement them (in XML). The topic - XML and various standards - may seem boring, but the authors did a great job, and it is very refreshing to see the fragmented pieces of information compiled into a compact yet enjoyable chapter (OK, maybe only linguists think this is not boring).

            The fourth chapter (Applying and Adopting Annotation Standards to Your Model) gives hints about bending standards and resources to your needs. It weighs technological considerations along with human factors (aka annotators) and shows best practices that serve both sides well.

            I do hope more chapters will be available soon. The practical focus and the vivid real-world examples (e.g. named entity recognition, semantic role labeling, etc.) make the book very accessible to a wider audience. It contains valuable information that was previously almost inaccessible; until now it took a long time to collect the knowledge necessary to build corpora. I think this title will be a great success, just like Semantic Web for the Working Ontologist was in the semantic web and enterprise ontologist community.

            Buying Options
            Ebook: $33.99
            Formats:  DAISY, ePub, Mobi, PDF
            Print & Ebook: $43.99
            Print: $39.99