Much of the data available today is unstructured and text-heavy, making it challenging for analysts to apply their usual data wrangling and visualization tools. With this practical book, you’ll explore text-mining techniques with tidytext, a package that authors Julia Silge and David Robinson developed using the tidy principles behind R packages like ggraph and dplyr. You’ll learn how tidytext and other tidy tools in R can make text analysis easier and more effective.
The authors demonstrate how treating text as data frames enables you to manipulate, summarize, and visualize characteristics of text. You’ll also learn how to integrate natural language processing (NLP) into effective workflows. Practical code examples and data explorations will help you generate real insights from literature, news, and social media.
Learn how to apply the tidy text format to NLP
Use sentiment analysis to mine the emotional content of text
Identify a document’s most important terms with frequency measurements
Explore relationships and connections between words with the ggraph and widyr packages
Convert back and forth between R’s tidy and non-tidy text formats
Use topic modeling to classify document collections into natural groups
Examine case studies that compare Twitter archives, dig into NASA metadata, and analyze thousands of Usenet messages
Chapter 1: The Tidy Text Format
Contrasting Tidy Text with Other Data Structures
The unnest_tokens Function
Tidying the Works of Jane Austen
The gutenbergr Package
Chapter 2: Sentiment Analysis with Tidy Data
The sentiments Dataset
Sentiment Analysis with Inner Join
Comparing the Three Sentiment Dictionaries
Most Common Positive and Negative Words
Looking at Units Beyond Just Words
Chapter 3: Analyzing Word and Document Frequency: tf-idf
Term Frequency in Jane Austen’s Novels
The bind_tf_idf Function
A Corpus of Physics Texts
Chapter 4: Relationships Between Words: N-grams and Correlations
Tokenizing by N-gram
Counting and Correlating Pairs of Words with the widyr Package
Julia Silge is a data scientist at Stack Overflow; her work involves analyzing complex datasets and communicating about technical topics with diverse audiences. She has a PhD in astrophysics and loves Jane Austen and making beautiful charts. Julia worked in academia and ed tech before moving into data science and discovering the statistical programming language R.
David Robinson is a data scientist at Stack Overflow with a PhD in Quantitative and Computational Biology from Princeton University. He enjoys developing open source R packages, including broom, gganimate, fuzzyjoin, and widyr, as well as writing about statistics, R, and text mining on his blog, Variance Explained.
The animal on the cover of Text Mining with R is the European rabbit (Oryctolagus cuniculus), a small mammal native to Spain, Portugal, and North Africa. It is now found throughout the world, having been introduced by European settlers. Due to a lack of natural predators, it is classified as an invasive species in some regions.
European rabbits are generally grey-brown in color and range from 34 to 50 centimeters in length. They have powerful hind legs with heavily padded feet that allow them to quickly hop from place to place. As social animals, European rabbits live together in small groups, sharing burrow networks known as warrens. They eat grass, seeds, bark, roots, and vegetables.
European rabbits have been domesticated for several centuries, going back to the Roman Empire. Raising rabbits for their meat, wool, or fur is known as cuniculture. They are also commonly kept as pets. Over time, several different breeds have been developed, such as the Angora or the Holland Lop.