A text mining system must go well beyond indexing and search to appear truly intelligent. First, it should understand language beyond keyword matching. For example, it should recognize the critical difference between “Jane has the flu” and “Jane had the flu when she was 9.” Second, it should be capable of making likely inferences even when they are not explicitly written, such as inferring that Jane may have the flu if she has had a fever, headache, fatigue, and runny nose for three days. And third, it should do its work as part of a robust, scalable, efficient, and easy-to-extend system. This course teaches software engineers and data scientists how to build intelligent text mining systems based on natural language understanding (NLU) at scale, using Java, Scala, and Spark for distributed processing.
Learn the meaning of natural language understanding (NLU) and its use in text mining
Discover how to build a natural language processing (NLP) pipeline within a big data framework
Recognize the differences between NLP pipelines and other approaches to semantic text mining
Learn about standard UIMA annotators, custom annotators, and machine learned annotators
Discover how different types of annotators are composed into a text processing pipeline
Use machine learning to generate annotators and apply them within a data pipeline
See pipeline architectures that incorporate Kafka, Spark, Spark SQL, Cassandra, and Elasticsearch
David Talby (PhD, Computer Science, Hebrew University) and Claudiu Branzan (Master's, Industrial Intelligent Systems, Polytechnic University of Timișoara) work for the big data analytics firm Atigeo. David is CTO and Claudiu runs the Modeling and Predictive Analytics team. David and Claudiu co-presented on text mining and natural language understanding at O'Reilly's Strata+Hadoop World London 2016 conference.
David Talby is Atigeo’s senior vice president of engineering, leading the development of its cloud big data analytics platform to solve real-world problems in healthcare, energy, and cybersecurity. Previously, he was with Microsoft’s Bing group, where he led business operations for Bing Shopping in the US and Europe. Earlier, he worked at Amazon, both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a Ph.D. in Computer Science along with two master’s degrees, in Computer Science and Business Administration.
Claudiu Branzan is a principal engineering lead at Atigeo, leading a team of data scientists and software engineers who tackle complex challenges in machine learning, data mining, information retrieval, and statistics. Claudiu has over 10 years of real-world data science experience across industries including finance, healthcare, legal, mobile, and retail. He has co-authored multiple patents and holds a master’s degree in industrial intelligent systems from the Polytechnic University of Timișoara.