Open Source big data software as how I currently understand it.

Since I’m moving away from relational databases and data warehousing (so 2008).
I’m self learning Big Data Architecture.

This Open Source big data software list as how I currently understand it.

Hadoop : hdfs is the basic distributed file store; like a database but spread about lots of linux boxes. Many things run on top of it. Best of which is Spark which is gaining popularity.
There’s variants of hadoop similar to different linux os distros: Apache Hadoop (vanilla), MapR, Hbase, Cloudera, MongoDB. I don’t know the differences too well.

MapReduce: language/job to take run queries against hadoop. It’s syntax is hard; basically script how to jump between boxes and split data set and reduce the problem from a DAG. Lot of people prefer abstraction layer to do it. Hive is Sql based , which translates to map-reduce jobs. MR jobs are generally slow and does in big batches.

Apache Spark: is an memory Haddoop App that does apparently everything. Spark is fast, probably why people like it. Spark Sql Spark Stream, everything’s growing.
It also has spark ML (machine learning) library.

Apache Kafka: a popular open source pub/sub data brokering service also used for streaming. It does real-time processing and log aggregation and monitoring.

Apache Zookeeper: a cluster management software, open sourced. To manage all the different nodes’s and linux servers.

Apache Mesos: The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with resource management and scheduling across entire datacenter and cloud environments. so basically == Yarn 2.0.

Apache Storm: Big time streaming processing of incoming big data. It does lambda function or what used to be transforms on datasets fast.

Apache Druid : In memory slice and dice cube that sits on top big data stores.

Kibana : to have data visualization and reporting from the distributed systems.

Airflow : Open source data engineering pipeline tool. Used to parallel etl data from nodes to target stores. AirBnb made it.

Apache Nifi : Log collection service, used to collect IOT or similar volumes of data.

Apache Hue: web app gui for interfacing with Hadoop data stores.

Apache Lucene Search:
Apache Solr : open source search through all the stuff.
Elastic Search : distributed searching of nosql json same as Solr mostly

NoSql key-value stores:

Cassandra : Most reliable and super scalable data store of key-values. Very hard to analytics on.
Redis : in memory data store , good for caching hot data like websites, not a long term data store.

One or more familarity with these languages needed to be hand one with Big Data:
Java,Scala,Python,Go,Kafka,Storm, SQL

My next post will cover:
Drill, Flink, Yarn, Pig, Mahout, Tez, parquet, Impala, Sqoop, Avro, Oozie, Bigtop, HCatalog, Luigi

This entry was posted in big data and tagged , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s