Bioinformatics Data Skills

Book description

Learn the data skills necessary for turning large sequencing datasets into reproducible and robust biological findings. With this practical guide, you’ll learn how to use freely available open source tools to extract meaning from large, complex biological datasets.

At no other point in human history has our ability to understand life’s complexities been so dependent on our skill in working with and analyzing data. This intermediate-level book teaches the general computational and data skills you need to analyze biological data. If you have experience with a scripting language like Python, you’re ready to get started.

  • Go from handling small problems with messy scripts to tackling large problems with clever methods and tools
  • Process bioinformatics data with powerful Unix pipelines and data tools
  • Learn how to use exploratory data analysis techniques in the R language
  • Use efficient methods to work with genomic range data and range operations
  • Work with common genomics data file formats like FASTA, FASTQ, SAM, and BAM
  • Manage your bioinformatics project with the Git version control system
  • Tackle tedious data processing tasks with Bash scripts and Makefiles

Table of contents

  1. Preface
    1. The Approach of This Book
    2. Why This Book Focuses on Sequencing Data
    3. Audience
    4. The Difficulty Level of Bioinformatics Data Skills
    5. Assumptions This Book Makes
    6. Supplementary Material on GitHub
    7. Computing Resources and Setup
    8. Organization of This Book
    9. Code Conventions
    10. Conventions Used in This Book
    11. Using Code Examples
    12. Safari® Books Online
    13. How to Contact Us
    14. Acknowledgments
  2. I. Ideology: Data Skills for Robust and Reproducible Bioinformatics
  3. 1. How to Learn Bioinformatics
    1. Why Bioinformatics? Biology’s Growing Data
    2. Learning Data Skills to Learn Bioinformatics
    3. New Challenges for Reproducible and Robust Research
    4. Reproducible Research
    5. Robust Research and the Golden Rule of Bioinformatics
    6. Adopting Robust and Reproducible Practices Will Make Your Life Easier, Too
    7. Recommendations for Robust Research
      1. Pay Attention to Experimental Design
      2. Write Code for Humans, Write Data for Computers
      3. Let Your Computer Do the Work For You
      4. Make Assertions and Be Loud, in Code and in Your Methods
      5. Test Code, or Better Yet, Let Code Test Code
      6. Use Existing Libraries Whenever Possible
      7. Treat Data as Read-Only
      8. Spend Time Developing Frequently Used Scripts into Tools
      9. Let Data Prove That It’s High Quality
    8. Recommendations for Reproducible Research
      1. Release Your Code and Data
      2. Document Everything
      3. Make Figures and Statistics the Results of Scripts
      4. Use Code as Documentation
    9. Continually Improving Your Bioinformatics Data Skills
  4. II. Prerequisites: Essential Skills for Getting Started with a Bioinformatics Project
  5. 2. Setting Up and Managing a Bioinformatics Project
    1. Project Directories and Directory Structures
    2. Project Documentation
    3. Use Directories to Divide Up Your Project into Subprojects
    4. Organizing Data to Automate File Processing Tasks
    5. Markdown for Project Notebooks
      1. Markdown Formatting Basics
      2. Using Pandoc to Render Markdown to HTML
  6. 3. Remedial Unix Shell
    1. Why Do We Use Unix in Bioinformatics? Modularity and the Unix Philosophy
    2. Working with Streams and Redirection
      1. Redirecting Standard Out to a File
      2. Redirecting Standard Error
      3. Using Standard Input Redirection
    3. The Almighty Unix Pipe: Speed and Beauty in One
      1. Pipes in Action: Creating Simple Programs with Grep and Pipes
      2. Combining Pipes and Redirection
      3. Even More Redirection: A tee in Your Pipe
    4. Managing and Interacting with Processes
      1. Background Processes
      2. Killing Processes
      3. Exit Status: How to Programmatically Tell Whether Your Command Worked
    5. Command Substitution
  7. 4. Working with Remote Machines
    1. Connecting to Remote Machines with SSH
    2. Quick Authentication with SSH Keys
    3. Maintaining Long-Running Jobs with nohup and tmux
      1. nohup
    4. Working with Remote Machines Through Tmux
      1. Installing and Configuring Tmux
      2. Creating, Detaching, and Attaching Tmux Sessions
      3. Working with Tmux Windows
  8. 5. Git for Scientists
    1. Why Git Is Necessary in Bioinformatics Projects
      1. Git Allows You to Keep Snapshots of Your Project
      2. Git Helps You Keep Track of Important Changes to Code
      3. Git Helps Keep Software Organized and Available After People Leave
    2. Installing Git
    3. Basic Git: Creating Repositories, Tracking Files, and Staging and Committing Changes
      1. Git Setup: Telling Git Who You Are
      2. git init and git clone: Creating Repositories
      3. Tracking Files in Git: git add and git status Part I
      4. Staging Files in Git: git add and git status Part II
      5. git commit: Taking a Snapshot of Your Project
      6. Seeing File Differences: git diff
      7. Seeing Your Commit History: git log
      8. Moving and Removing Files: git mv and git rm
      9. Telling Git What to Ignore: .gitignore
      10. Undoing a Stage: git reset
    4. Collaborating with Git: Git Remotes, git push, and git pull
      1. Creating a Shared Central Repository with GitHub
      2. Authenticating with Git Remotes
      3. Connecting with Git Remotes: git remote
      4. Pushing Commits to a Remote Repository with git push
      5. Pulling Commits from a Remote Repository with git pull
      6. Working with Your Collaborators: Pushing and Pulling
      7. Merge Conflicts
      8. More GitHub Workflows: Forking and Pull Requests
    5. Using Git to Make Life Easier: Working with Past Commits
      1. Getting Files from the Past: git checkout
      2. Stashing Your Changes: git stash
      3. More git diff: Comparing Commits and Files
      4. Undoing and Editing Commits: git commit --amend
    6. Working with Branches
      1. Creating and Working with Branches: git branch and git checkout
      2. Merging Branches: git merge
      3. Branches and Remotes
    7. Continuing Your Git Education
  9. 6. Bioinformatics Data
    1. Retrieving Bioinformatics Data
      1. Downloading Data with wget and curl
      2. Rsync and Secure Copy (scp)
    2. Data Integrity
      1. SHA and MD5 Checksums
    3. Looking at Differences Between Data
    4. Compressing Data and Working with Compressed Data
      1. gzip
      2. Working with Gzipped Compressed Files
    5. Case Study: Reproducibly Downloading Data
  10. III. Practice: Bioinformatics Data Skills
  11. 7. Unix Data Tools
    1. Unix Data Tools and the Unix One-Liner Approach: Lessons from Programming Pearls
    2. When to Use the Unix Pipeline Approach and How to Use It Safely
    3. Inspecting and Manipulating Text Data with Unix Tools
      1. Inspecting Data with Head and Tail
      2. less
      3. Plain-Text Data Summary Information with wc, ls, and awk
      4. Working with Column Data with cut and Columns
      5. Formatting Tabular Data with column
      6. The All-Powerful Grep
      7. Decoding Plain-Text Data: hexdump
      8. Sorting Plain-Text Data with Sort
      9. Finding Unique Values in Uniq
      10. Join
      11. Text Processing with Awk
      12. Bioawk: An Awk for Biological Formats
      13. Stream Editing with Sed
    4. Advanced Shell Tricks
      1. Subshells
      2. Named Pipes and Process Substitution
    5. The Unix Philosophy Revisited
  12. 8. A Rapid Introduction to the R Language
    1. Getting Started with R and RStudio
    2. R Language Basics
      1. Simple Calculations in R, Calling Functions, and Getting Help in R
      2. Variables and Assignment
      3. Vectors, Vectorization, and Indexing
    3. Working with and Visualizing Data in R
      1. Loading Data into R
      2. Exploring and Transforming Dataframes
      3. Exploring Data Through Slicing and Dicing: Subsetting Dataframes
      4. Exploring Data Visually with ggplot2 I: Scatterplots and Densities
      5. Exploring Data Visually with ggplot2 II: Smoothing
      6. Binning Data with cut() and Bar Plots with ggplot2
      7. Merging and Combining Data: Matching Vectors and Merging Dataframes
      8. Using ggplot2 Facets
      9. More R Data Structures: Lists
      10. Writing and Applying Functions to Lists with lapply() and sapply()
      11. Working with the Split-Apply-Combine Pattern
      12. Exploring Dataframes with dplyr
      13. Working with Strings
    4. Developing Workflows with R Scripts
      1. Control Flow: if, for, and while
      2. Working with R Scripts
      3. Workflows for Loading and Combining Multiple Files
      4. Exporting Data
    5. Further R Directions and Resources
  13. 9. Working with Range Data
    1. A Crash Course in Genomic Ranges and Coordinate Systems
    2. An Interactive Introduction to Range Data with GenomicRanges
      1. Installing and Working with Bioconductor Packages
      2. Storing Generic Ranges with IRanges
      3. Basic Range Operations: Arithmetic, Transformations, and Set Operations
      4. Finding Overlapping Ranges
      5. Finding Nearest Ranges and Calculating Distance
      6. Run Length Encoding and Views
      7. Storing Genomic Ranges with GenomicRanges
      8. Grouping Data with GRangesList
      9. Working with Annotation Data: GenomicFeatures and rtracklayer
      10. Retrieving Promoter Regions: Flank and Promoters
      11. Retrieving Promoter Sequence: Connection GenomicRanges with Sequence Data
      12. Getting Intergenic and Intronic Regions: Gaps, Reduce, and Setdiffs in Practice
      13. Finding and Working with Overlapping Ranges
      14. Calculating Coverage of GRanges Objects
    3. Working with Ranges Data on the Command Line with BEDTools
      1. Computing Overlaps with BEDTools Intersect
      2. BEDTools Slop and Flank
      3. Coverage with BEDTools
      4. Other BEDTools Subcommands and pybedtools
  14. 10. Working with Sequence Data
    1. The FASTA Format
    2. The FASTQ Format
    3. Nucleotide Codes
    4. Base Qualities
    5. Example: Inspecting and Trimming Low-Quality Bases
    6. A FASTA/FASTQ Parsing Example: Counting Nucleotides
    7. Indexed FASTA Files
  15. 11. Working with Alignment Data
    1. Getting to Know Alignment Formats: SAM and BAM
      1. The SAM Header
      2. The SAM Alignment Section
      3. Bitwise Flags
      4. CIGAR Strings
      5. Mapping Qualities
    2. Command-Line Tools for Working with Alignments in the SAM Format
      1. Using samtools view to Convert between SAM and BAM
      2. Samtools Sort and Index
      3. Extracting and Filtering Alignments with samtools view
    3. Visualizing Alignments with samtools tview and the Integrated Genomics Viewer
      1. Pileups with samtools pileup, Variant Calling, and Base Alignment Quality
    4. Creating Your Own SAM/BAM Processing Tools with Pysam
      1. Opening BAM Files, Fetching Alignments from a Region, and Iterating Across Reads
      2. Extracting SAM/BAM Header Information from an AlignmentFile Object
      3. Working with AlignedSegment Objects
      4. Writing a Program to Record Alignment Statistics
      5. Additional Pysam Features and Other SAM/BAM APIs
  16. 12. Bioinformatics Shell Scripting, Writing Pipelines, and Parallelizing Tasks
    1. Basic Bash Scripting
      1. Writing and Running Robust Bash Scripts
      2. Variables and Command Arguments
      3. Conditionals in a Bash Script: if Statements
      4. Processing Files with Bash Using for Loops and Globbing
    2. Automating File-Processing with find and xargs
      1. Using find and xargs
      2. Finding Files with find
      3. find’s Expressions
      4. find’s -exec: Running Commands on find’s Results
      5. xargs: A Unix Powertool
      6. Using xargs with Replacement Strings to Apply Commands to Files
      7. xargs and Parallelization
    3. Make and Makefiles: Another Option for Pipelines
  17. 13. Out-of-Memory Approaches: Tabix and SQLite
    1. Fast Access to Indexed Tab-Delimited Files with BGZF and Tabix
      1. Compressing Files for Tabix with Bgzip
      2. Indexing Files with Tabix
      3. Using Tabix
    2. Introducing Relational Databases Through SQLite
      1. When to Use Relational Databases in Bioinformatics
      2. Installing SQLite
      3. Exploring SQLite Databases with the Command-Line Interface
      4. Querying Out Data: The Almighty SELECT Command
      5. SQLite Functions
      6. SQLite Aggregate Functions
      7. Subqueries
      8. Organizing Relational Databases and Joins
      9. Writing to Databases
      10. Dropping Tables and Deleting Databases
      11. Interacting with SQLite from Python
      12. Dumping Databases
  18. 14. Conclusion
    1. Where to Go From Here?
  19. Glossary
  20. Bibliography
  21. Index

Product information

  • Title: Bioinformatics Data Skills
  • Author(s): Vince Buffalo
  • Release date: July 2015
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781449367503