Here’s a really helpful video that shows you how to leverage Hadoop through Azure HDInsight:
Below are the notes I took throughout the video:
- Hadoop analytics can reduce petabytes of data to a manageable size that Excel can consume
- Hive is very similar to SQL
  - Good for structured data
  - Most customers use this
  - Schema on read
- Mahout is a machine learning library
- Pig is a data scripting language
  - Good for unstructured/semi-structured data
  - Can handle missing columns and project (select) specific fields
  - Can do data cleansing
- Pegasus & Giraph
  - Used for graph processing
- Cascading
  - Dataflow API (similar to Pig) but in Java
- Can use Visual Studio to write Hive queries (new with latest SDK)
- HBase cluster is used for NoSQL storage
  - A distributed, non-relational database
  - Large scale (billions of rows × millions of columns)
  - Low latency
  - Open source
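
To make the "schema on read" and "missing columns" notes above a bit more concrete, here is a minimal sketch in plain Python. It is not Hive or Pig code and is not taken from the video. The sample data, the `SCHEMA` list, and both helper functions are hypothetical names invented for illustration. The idea it tries to show: raw lines are stored untouched, a schema is applied only when the data is read, and a large set of rows is boiled down to a small summary.

```python
import csv
import io

# Raw data is stored untouched; no schema is enforced when it is written.
# (Hypothetical sample data; the second line is missing its last column.)
RAW_LINES = """\
2014-01-01,east,120
2014-01-01,west
2014-01-02,east,80
"""

# Hypothetical schema, applied only at read time.
SCHEMA = [("day", str), ("region", str), ("sales", int)]


def read_with_schema(raw_text, schema):
    """Yield one dict per raw line, casting fields as they are read."""
    for fields in csv.reader(io.StringIO(raw_text)):
        row = {}
        for index, (name, cast) in enumerate(schema):
            # A missing or empty column becomes None instead of failing.
            value = fields[index] if index < len(fields) else ""
            row[name] = cast(value) if value != "" else None
        yield row


def total_sales_by_region(rows):
    """Aggregate rows into a small summary, skipping missing values."""
    totals = {}
    for row in rows:
        if row["sales"] is not None:
            totals[row["region"]] = totals.get(row["region"], 0) + row["sales"]
    return totals


rows = list(read_with_schema(RAW_LINES, SCHEMA))
summary = total_sales_by_region(rows)
print(summary)
```

On a real HDInsight cluster the same steps would be expressed as a Hive table definition over existing files or as a Pig script, and the aggregation would run across the cluster rather than in a single process.
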
If you have time, make sure to check it out! I’ll be going through some demos of HDInsight and will post any helpful tips I may come across.
Thanks for sharing this useful information about integrating Hadoop with Azure. I was expecting a video in this blog post about “Hadoop through Azure HDInsight”. If possible, please share the video on this topic.