Best practices, how-to's, and internals from Cloudera Engineering and the community
How-to: Index Scanned PDFs at Scale Using Fewer Than 50 Lines of Code

Learn how to use OCR tools, Apache Spark, and other Apache Hadoop components to process PDF images at scale. Optical character recognition (OCR) technologies have advanced significantly over the last 20 years. However, during that time, there has been little or no effort to marry OCR with distributed architectures such as Apache Hadoop to process large...

Mon Oct 19, 2015 19:25
Untangling Apache Hadoop YARN, Part 2

A new installment in the series about the tangled ball of thread that is YARN. In Part 1 of this series, we covered the fundamentals of YARN clusters. In Part 2, you’ll learn about other components that can run on a cluster and how they affect YARN cluster configuration. Ideal YARN Allocation: As shown in the previous post, a YARN cluster can be configured...

Sat Oct 17, 2015 19:29
How-to: Use Apache Solr to Query Indexed Data for Analytics

Bet you didn’t know this: In some cases, Solr offers lightning-fast response times for business-style queries. If you were to ask well-informed technical people about use cases for Solr, the most likely response would be that Solr (in combination with Apache Lucene) is an open source text search engine: one can use Solr to index documents, and after...

Wed Oct 14, 2015 19:21
How-to: Build a Machine-Learning App Using Sparkling Water and Apache Spark

Thanks to Michal Malohlava, Amy Wang, and Avni Wadhwa of H2O.ai for providing the following guest post about building ML apps using Sparkling Water and Apache Spark on CDH. The Sparkling Water project is nearing its one-year anniversary, which means Michal Malohlava, our main contributor, has been very busy for the better part of this past year. The...

Fri Oct 9, 2015 01:14
Continuous Distribution Goodness-of-Fit in MLlib: Kolmogorov-Smirnov Testing in Apache Spark

Thanks to former Cloudera intern Jose Cambronero for the post below about his summer project, which involved contributions to MLlib in Apache Spark. Data can come in many shapes and forms, and can be described in many ways. Statistics like the mean and standard deviation of a sample provide descriptions of some of its important qualities. Less commonly...

Wed Oct 7, 2015 09:13
Kudu: New Apache Hadoop Storage for Fast Analytics on Fast Data

This new open source complement to HDFS and Apache HBase is designed to fill gaps in Hadoop’s storage layer that have given rise to stitched-together, hybrid architectures. The set of data storage and processing technologies that define the Apache Hadoop ecosystem is expansive and ever-improving, covering a very diverse set of customer use cases used...

Mon Sep 28, 2015 16:37
