Awesome Big Data List
Published: 2019-06-27

https://github.com/onurakpolat/awesome-bigdata

A curated list of awesome big data frameworks, resources and other awesomeness. Inspired by , , ,  & .

Your contributions are always welcome!

RDBMS

  •  The world's most popular open source database.
  •  The world's most advanced open source database.
  •  - object-relational database management system.

Frameworks

  •  - platform for distributed processing and real-time analytics. Integrates with many of the popular technologies in the Big Data ecosystem (Kafka, HDFS, Spark, etc.)
  •  - framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).
  •  - High Throughput Real-time Stream Processing Framework.
  •  - Pachyderm is a data storage platform built on Docker and Kubernetes to provide reproducible data processing and analysis.

Distributed Programming

  •  - distributed data processing and storage system originally developed at AddThis.
  •  - run Spark on Hadoop MapReduce v1.
  •  - a unified, enterprise platform for big data stream and batch processing.
  •  - a unified model and set of language-specific SDKs for defining and executing data processing workflows.
  •  - a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce.
  •  - collection of user-defined functions for Hadoop and Pig developed by LinkedIn.
  •  - high-performance runtime, and automatic program optimization.
  •  - real-time big data streaming engine based on Akka.
  •  - framework for in-memory data model and persistence.
  •  - BSP (Bulk Synchronous Parallel) computing framework.
  •  - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
  •  - high level language to express data analysis programs for Hadoop.
  •  - retainable evaluator execution framework to simplify and unify the lower layers of big data systems.
  •  - framework for stream processing, implementation of S4.
  •  - framework for in-memory cluster computing.
  •  - framework for stream processing, part of Spark.
  •  - framework for stream processing by Twitter also on YARN.
  •  - stream processing framework, based on Kafka and YARN.
  •  - application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.
  •  - abstraction over YARN that reduces the complexity of developing distributed applications.
  •  - data processing and querying library.
  •  - High Performance, Custom Data Warehouse on Top of MapReduce.
  •  - framework for data management/analytics on Hadoop.
  •  - MapReduce library for Clojure.
  •  - alternative MapReduce paradigm.
  •  - real-time engine is designed to enable distributed, asynchronous, real time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
  •  - Hadoop enhancement which removes single point of failure.
  •  - Map Reduce framework.
  •  - distributed in-memory datastore.
  •  - create data pipelines to help them ingest, transform and analyze data.
  •  - map reduce framework.
  •  - fault tolerant stream processing framework.
  •  - platform for distributed processing and real-time analytics. Provides toolkits for advanced analytics like geospatial, time series, etc. out of the box.
  •  - declarative programming language for working with structured, semi-structured and unstructured data.
  •  - is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
  •  - framework for real-time analysis of large datasets.
  •  - map-reduce for Clojure which compiles to Apache Pig.
  •  - MapReduce framework developed by Nokia.
  •  - Distributed computation for the cloud.
  •  - asynchronous job execution system.
  •  - Python MapReduce and HDFS API for Hadoop.
  •  - multi-tenant distributed metric processing system
  •  - High performance distributed data processing in NodeJS.
  •  - general purpose cluster computing framework.
  •  - useful for counting activities of event streams over different time windows and finding the most active one.
  •  - Libraries to enable building IBM Streams application in Java, Python or Scala.
  •  - Easy-to-use platform for batch and streaming computation, built using Scala, Akka and Play!
  •  - Heron is a realtime, distributed, fault-tolerant stream processing engine from Twitter replacing Storm.
  •  - Scala library for Map Reduce jobs, built on Cascading.
  •  - Streaming MapReduce with Scalding and Storm, by Twitter.
  •  - TimeSeries AggregatoR by Twitter.

Distributed Filesystem

  •  - a distributed object store that supports storage of trillions of small immutable objects as well as billions of large objects.
  •  - a way to store large files across multiple machines.
  •  - formerly FhGFS, parallel distributed file system.
  •  - software storage platform designed.
  •  - distributed filesystem.
  •  - object storage system.
  •  - distributed filesystem (GFS2).
  •  - distributed filesystem.
  •  - scalable, highly available storage.
  •  - GGFS, Hadoop compliant in-memory file system.
  •  - high-performance distributed filesystem.
  •  - HDFS-compatible storage in Azure cloud
  •  - open-source distributed file system.
  •  - scale-out network-attached storage file system.
  •  - simple and highly scalable distributed file system.
  •  - reliable file sharing at memory speed across cluster frameworks.
  •  - decentralized cloud storage system.
  •  - distributed filesystem.

Document Data Model

  •  - commercial object-oriented database management system.
  •  - is an open source massively scalable data store. It requires zero administration.
  •  - Facebook’s Paxos-like NoSQL database.
  •  - document oriented datastore over Hadoop.
  •  - horizontally scalable document-oriented NoSQL data store.
  •  - Schema-agnostic Enterprise NoSQL database technology.
  •  - NoSQL cloud database service with protocol support for MongoDB
  •  - Document-oriented database system.
  •  - A transactional, open-source Document Database.
  •  - document database that supports queries like table joins and group by.

Key Map Data Model

Note: There is some term confusion in the industry, and two different things are called "Columnar Databases". Some, listed here, are distributed, persistent databases built around the "key-map" data model: all data has a (possibly composite) key, with which a map of key-value pairs is associated. In some systems, multiple such value maps can be associated with a key, and these maps are referred to as "column families" (with value map keys being referred to as "columns").

Another group of technologies that can also be called "columnar databases" is distinguished by how it stores data, on disk or in memory -- rather than storing data the traditional way, where all column values for a given key are stored next to each other, "row by row", these systems store all column values next to each other. So more work is needed to get all columns for a given key, but less work is needed to get all values for a given column.

The former group is referred to as "key map data model" here. The line between these and the key-value data model stores is fairly blurry.

The latter, being more about the storage format than about the data model, is listed under Columnar Databases.

You can read more about this distinction on Prof. Daniel Abadi's blog: .
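The distinction drawn in the note above can be sketched in a few lines of Python. This is only an illustrative toy, not how any listed system is implemented: the `put` helper, the `key_map_store` structure, and all row keys, column names, and values are made up for the example. The first half models the "key-map" data model (a possibly composite key mapped to column families, each a map of columns to values); the second half contrasts row-oriented and column-oriented layouts of the same table.

```python
from collections import defaultdict

# --- Key-map data model (hypothetical toy) ---
# row key -> column family -> {column: value}
key_map_store = defaultdict(lambda: defaultdict(dict))


def put(row_key, family, column, value):
    """Store one value under a (possibly composite) row key."""
    key_map_store[row_key][family][column] = value


put(("user", 42), "profile", "name", "Ada")
put(("user", 42), "profile", "city", "London")
put(("user", 42), "stats", "logins", 7)
put(("user", 43), "profile", "name", "Linus")

# --- Columnar storage (hypothetical toy) ---
# The same logical table, laid out in two different ways.
rows = [
    {"id": 42, "name": "Ada", "logins": 7},
    {"id": 43, "name": "Linus", "logins": 3},
]

# Row-oriented: all column values for one key sit next to each other.
row_layout = [tuple(r.values()) for r in rows]

# Column-oriented: all values of one column sit next to each other.
column_layout = {col: [r[col] for r in rows] for col in rows[0]}


def total_logins_columnar():
    """Aggregating one column touches a single contiguous list."""
    return sum(column_layout["logins"])


def fetch_row_columnar(index):
    """Rebuilding one full row has to visit every column list."""
    return {col: values[index] for col, values in column_layout.items()}
```

In this sketch, `total_logins_columnar()` reads a single list, while `fetch_row_columnar(1)` has to gather one entry from every column, which mirrors the trade-off described in the note: less work per column, more work per key.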

  •  - distributed key/value store, built on Hadoop.
  •  - column-oriented distributed datastore, inspired by BigTable.
  •  - column-oriented distributed datastore, inspired by BigTable.
  •  - an Internet-scale database, inspired by BigTable.
  •  - evolution of HBase made by Facebook.
  •  - column-oriented distributed datastore.
  •  - is a fully managed, schemaless database for storing non-relational data over BigTable.
  •  - column-oriented distributed datastore, inspired by BigTable.
  •  - is accessed through a MySQL interface and uses massively parallel processing to parallelize queries.
  •  - Transactions for HBase.
  •  - real-time, multi-tenant distributed database for Twitter scale.
  •  - column-oriented distributed datastore written in C++, totally compatible with Apache Cassandra.

Key-value Data Model

  •  - NoSQL flash-optimized, in-memory. Open source and "Server code in 'C' (not Java or Erlang) precisely tuned to avoid context switching and memory copies."
  •  - distributed key/value store, implementation of Dynamo paper.
  •  - a fast, simple, efficient, and persistent key-value store written natively in Go.
  •  - an embedded key-value database for Go.
  •  - Key Value Database in .Net with Object DB Layer, RPC, dynamic IL and much more
  •  - a fast, embeddable, in-memory key/value database for Go with custom indexing and geospatial support.
  •  - is a protocol-compatible Server replacement for Redis.
  •  - Distributed database specialized in exporting data from Hadoop.
  •  - distributed time series database.
  •  - suitable for sensor data stored in a timeseries.
  •  - a scalable, next generation key-value and document store with a wide array of features, including consistency, fault tolerance and high performance.
  •  - is a simple persistent data store with very low latency and high throughput.
  •  - distributed key/value storage system.
  •  - distributed key-value database by Oracle Corporation.
  •  - in memory key value datastore.
  •  - a decentralized datastore.
  •  - library to work with asynchronous key value stores, by Twitter.
  •  - an in-memory, NoSQL key/value database, with disk persistence and using the Raft consensus algorithm.
  •  - an efficient NoSQL database and a Lua application server.
  •  - a distributed key-value database powered by Rust and inspired by Google Spanner and HBase.
  •  - a geolocation data store, spatial index, and realtime geofence, supporting a variety of object types including latitude/longitude points, bounding boxes, XYZ tiles, Geohashes, and GeoJSON
  •  - key-value store that's replicated and sharded and provides atomic multirow writes.

Graph Data Model

  •  - a new generation multi-model graph database for the modern complex data environment.
  •  - implementation of Pregel, based on Hadoop.
  •  - implementation of Pregel, part of Spark.
  •  - multi model distributed database.
  •  - A scalable, distributed, low latency, high throughput graph database aimed at providing Google production level scale and throughput, with low enough latency to be serving real time user queries, over terabytes of structured data.
  •  - a lightweight graph based database that does not require any third-party libraries.
  •  - TAO is the distributed data store that is widely used at Facebook to store and serve the social graph.
  •  - Gaffer by GCHQ is a framework that makes it easy to store large-scale graphs in which the nodes and edges have statistics.
  •  - open-source graph database.
  •  - graph processing framework.
  •  - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.
  •  - resilient Distributed Graph System on Spark.
  •  - graph traversal Language.
  •  - RDF-centric Map/Reduce framework.
  •  - tools to construct large-scale graphs on top of Hadoop.
  •  - Massively Parallel Graph processing on GPUs.
  •  - graph database written entirely in Java.
  •  - document and graph database.
  •  - framework for large scale graph processing.
  •  - distributed graph database, built over Cassandra.
  •  - distributed graph database.

Columnar Databases

Note: please read the note in the Key Map Data Model section.

  •  - an explanation of what columnar storage is and when you might want it.
  •  - column-oriented analytic database.
  •  - column oriented DBMS.
  •  - an open-source column-oriented database management system that allows generating analytical data reports in real time.
  •  - a distributed, column-oriented database built for large-scale event collection and analytics.
  •  - column store database.
  •  - columnar storage format for Hadoop.
  •  - purpose-built, dedicated analytic data warehouse that offers a columnar engine as well as a traditional row-based one.
  •  - is designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.
  •  - A GPU powered big data database, designed for analytics and data warehousing, with ANSI-92 compliant SQL, suitable for data sets from 10TB to 1PB.
  •  Google's cloud offering backed by their pioneering work on Dremel.
  •  Amazon's cloud offering, also based on a columnar datastore backend.
  •  an open-source columnar storage format for fast, real-time analytics with big data.

NewSQL Databases

  •  - commercially supported, open-source SQL relational database management system.
  •  - data warehouse service, based on PostgreSQL.
  •  - statistic oriented SQL database.
  •  - a simple, modular, networked and distributed transaction layer built atop SQLite.
  •  - scales out PostgreSQL through sharding and replication.
  •  - Scalable, Geo-Replicated, Transactional Datastore.
  •  - a clustered RDBMS built on optimistic concurrency control techniques.
  •  - distributed database designed to enable scalable, flexible and intelligent applications.
  •  - distributed database, inspired by F1.
  •  - distributed SQL database built on Spanner.
  •  - globally distributed semi-relational database.
  •  - is an experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications.
  •  - linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.
  •  - NoSQL plugin for MySQL/MariaDB.
  •  - infinitely scalable RDBMS.
  •  - in-memory SQL database with optimized columnar storage on flash.
  •  - SQL/ACID compliant distributed database.
  •  - in-memory, relational database management system with persistence and recoverability.
  •  - Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.
  •  - is an in-memory, column-oriented, relational database management system.
  •  - distributed, realtime, semi-structured database.
  •  - database used for flexible, high performance analysis of behavioral data.
  •  - open source software for both file and database synchronization.
  •  - GPU in-memory database, big data analysis and visualization platform
  •  - TiDB is a distributed SQL database. Inspired by the design of Google F1.
  •  - claims to be fastest in-memory database

Time-Series Databases

  •  - distributed time series database on top of HBase. Includes built-in Rule Engine, data forecasting and visualization.
  •  - a time series storage built to store time series highly compressed and for fast access times.
  •  - uses MongoDB to store time series data.
  •  - is a scalable time series database based on Cassandra and Elasticsearch.
  •  - distributed time series database.
  •  - similar to OpenTSDB but allows for Cassandra.
  •  - a time series database based on Apache Cassandra.
  •  - distributed time series database on top of HBase.
  •  - a time series database and service monitoring system.
  •  - Facebook's in-memory time-series database.
  •  - an efficient tool for storing and querying series of events.
  •  Column oriented distributed data store ideal for powering interactive applications
  •  Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data.
  •  Akumuli is a numeric time-series database. It can be used to capture, store and process time-series data in real-time. The word "akumuli" can be translated from Esperanto as "accumulate".
  •  A time-series object store for Cassandra that handles all the complexity of building wide row indexes.
  •  Fast distributed metrics database
  •  A distributed system designed to ingest and process time series data
  •  Timely is a time series database application that provides secure access to time series data based on Accumulo and Grafana.

SQL-like processing

  •  - high performance interactive SQL access to all Hadoop data.
  •  - framework for interactive analysis, inspired by Dremel.
  •  - table and storage management layer for Hadoop.
  •  - SQL-like data warehouse system for Hadoop.
  •  - framework that allows efficient translation of queries involving heterogeneous and federated data.
  •  - SQL skin over HBase.
  •  - framework for interactive analysis, Inspired by Dremel.
  •  - SQL-like query language for Cascading.
  •  - full SQL query engine for big datasets.
  •  - distributed SQL query engine.
  •  - framework for interactive analysis, implementation of Dremel.
  •  - an open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables.
  •  - SQL-like data warehouse system for Hadoop.
  •  - database for storing petabyte-scale volumes of structured and semi-structured data.
  •  - is a Query Optimization Framework for Spark and Shark.
  •  - Manipulating Structured Data Using Spark.
  •  - a full-featured SQL-on-Hadoop RDBMS with ACID transactions.
  •  - interactive query for Hive.
  •  - distributed data warehouse system on Hadoop.
  •  - enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads.

Data Ingestion

  •  - real-time processing of streaming data at massive scale.
  •  - data collection system.
  •  - service to manage large amount of log data.
  •  - distributed publish-subscribe messaging system.
  •  - tool to transfer data between Hadoop and a structured datastore.
  •  - framework that helps with ETL to Solr, HBase and HDFS.
  •  - open-source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.
  •  - streamed log data aggregator.
  •  - tool to collect events and logs.
  •  - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
  •  - open source stream processing software system.
  •  - framework for connecting disparate data sources with Hadoop.
  •  - distributed message queue system.
  •  - stream of change capture events for a database.
  •  - utility package for compressing sorted integer arrays.
  •  - log aggregator and dashboard.
  •  - a tool for managing events and logs.
  •  - log aggregator like Storm and Samza based on Chukwa.
  •  - is a service implementing Kafka log persistence.
  •  - LinkedIn's universal data ingestion framework.
  •  - sketch data store to deal with all problems around counting and sketching using probabilistic data-structures.
  •  - continuous big data ingest infrastructure with a simple to use IDE.
  •  - a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API.
  •  - data pipeline as a service enabling moving data sources such as MySQL into data warehouses.

Service Programming

  •  - runtime for distributed, and fault tolerant event-driven applications on the JVM.
  •  - data serialization system.
  •  - Java libraries for Apache ZooKeeper.
  •  - OSGi runtime that runs on top of any OSGi framework.
  •  - framework to build binary protocols.
  •  - centralized service for process management.
  •  - a lock service for loosely-coupled distributed systems.
  •  - a service for exposing Apache Spark analytics jobs and machine learning models as realtime, batch or reactive web services.
  •  - cluster manager.
  •  - message passing framework.
  •  - decentralized solution for service discovery and orchestration.
  •  - a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
  •  - distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
  •  - libraries for working with LZOP-compressed data.
  •  - asynchronous network stack for the JVM.

Scheduling

  •  - cloud-based pipeline orchestration for on-prem, cloud and HDInsight
  •  - a platform to programmatically author, schedule and monitor workflows.
  •  - is a service scheduler that runs on top of Apache Mesos.
  •  - data management framework.
  •  - workflow job scheduler.
  •  - distributed and fault-tolerant scheduler.
  •  - batch workflow job scheduler.
  •  - Scala DSL for agile scheduling of Hadoop jobs.
  •  - scheduling platform.

Machine Learning

  •  - Cloud-based AzureML, R, Python Machine Learning platform
  •  - Neural networks in JavaScript.
  •  - real-time large-scale machine learning.
  •  - machine learning library for Cascading.
  •  - Deep Learning in JavaScript. Train Convolutional Neural Networks (or ordinary ones) in your browser.
  •  - A vectorization and data preprocessing library for deep learning in Java and Scala. Part of the Deeplearning4j ecosystem.
  •  - Fast, open deep learning for the JVM (Java, Scala, Clojure). A neural network configuration layer powered by a C++ library. Uses Spark and Hadoop to train nets on multiple GPUs and CPUs.
  •  - Flexible and Extensible Machine Learning in Ruby.
  •  - machine learning framework that supports a variety of advanced algorithms, as well as support classes to normalize and process data.
  •  - text classification with machine learning.
  •  - scalable Machine Learning in Scalding.
  •  - A machine learning platform in Python with a broad collection of ML toolkits, data engineering, and deployment tools.
  •  - statistical, machine learning and math runtime with Hadoop. R and Python.
  •  - An intuitive neural net API inspired by Torch that runs atop Theano and TensorFlow.
  •  - An Apache-backed machine learning library for Hadoop.
  •  - distributed machine learning libraries for the BDAS stack.
  •  - Fast multilayer perceptron neural network library for iOS and Mac OS X.
  •  - MOA performs big data stream mining in real time, and large scale machine learning.
  •  - Text mining made easy. Extract and classify data from text.
  •  - A matrix library for the JVM. Numpy for Java.
  •  - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
  •  - machine learning server built on Hadoop, Mahout and Cascading.
  •  - Reinforcement learning for Java and Scala. Includes Deep-Q learning and A3C algorithms, and integrates with OpenAI's Gym. Runs in the Deeplearning4j ecosystem.
  •  - distributed streaming machine learning framework.
  •  - scikit-learn: machine learning in Python.
  •  - a Spark implementation of some common machine learning (ML) functionality.
  •  - System for Large Scale Machine Learning at Google.
  •  - Library from Google for machine learning using data flow graphs.
  •  - A Python-focused machine learning library supported by the University of Montreal.
  •  - A deep learning library with a Lua API, supported by NYU and Facebook.
  •  - System for serving machine learning predictions.
  •  - learning system sponsored by Microsoft and Yahoo!.
  •  - suite of machine learning software.
  •  - CPU and GPU-accelerated Machine Learning Library.

Benchmarking

  •  - micro-benchmarks for testing Hadoop performance.
  •  - real-world big data workload benchmark.
  •  - a Hadoop benchmark suite.
  •  - benchmark suite for MapReduce applications.
  •  - Hadoop cluster benchmarking from the Yahoo engineering team.

Security

  •  - Central security admin & fine-grained authorization for Hadoop
  •  - real time monitoring solution
  •  - single point of secure access for Hadoop clusters.
  •  - security module for data stored in Hadoop.

System Deployment

  •  - operational framework for Hadoop management.
  •  - system deployment framework for the Hadoop ecosystem.
  •  - cluster management framework.
  •  - cluster manager.
  •  - is a YARN application to deploy existing distributed applications on YARN.
  •  - set of libraries for running cloud services.
  •  - Cluster manager.
  •  - library that simplifies application deployment and management.
  •  - Similar to Apache BigTop based on Groovy language.
  •  - web application for interacting with Hadoop.
  •  - multi datacenters replication system.
  •  - job scheduling and monitoring system.
  •  - job scheduling and monitoring system.
  •  - application that can deploy HBase cluster on YARN.
  •  - Mesos framework for long-running services.

Applications

  •  - a web application for managing alerts generated by scheduled searches in Elasticsearch.
  •  - Next-generation web analytics processing with Scala, Spark, and Parquet.
  •  - framework to collect and analyze data in real-time, based on HBase.
  •  - a platform that integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis.
  •  - open source web crawler.
  •  - capturing, processing and sharing of data for NASA's scientific archives.
  •  - content analysis toolkit.
  •  - Time series monitoring and alerting platform.
  •  - a backend for managing dimensional time series data.
  •  - open source mobile and web analytics platform, based on Node.js & MongoDB.
  •  - Run, scale, share, and deploy models — without any infrastructure.
  •  - Eclipse-based reporting system.
  •  - ElastAlert is a simple framework for alerting on anomalies, spikes, or other patterns of interest from data in Elasticsearch.
  •  - open source event analytics platform.
  •  - asynchronous message broker built on top of Kafka.
  •  - API for performing image processing tasks on Hadoop's MapReduce.
  •  - Splunk analytics for Hadoop.
  •  - Large scale analytics platform by Indeed.
  •  - data-processing library of an RDBMS to analyze data.
  •  - an open source framework for processing, monitoring, and alerting on time series data.
  •  - open source Distributed Analytics Engine from eBay.
  •  - R on Pivotal HD / HAWQ and PostgreSQL.
  •  - open-source real-time custom analytics platform powered by PostgreSQL, Kinesis and PrestoDB.
  •  - auto-scaling Hadoop cluster, built-in data connectors.
  •  - Cloud Platform for Data Science and Big Data Analytics.
  •  - a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical processing) built on Spark in a single integrated cluster.
  •  - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
  •  - R frontend for Spark.
  •  - analyzer for machine-generated data.
  •  - cloud based analyzer for machine-generated data.
  •  - unified open source environment for YARN, Hadoop, HBase, Hive, HCatalog & Pig.
  •  - query by example tool for big data (OS X app)

Search engine and framework

  •  - Search engine library.
  •  - Search platform for Apache Lucene.
  •  - is a fork of Elasticsearch modified to run on top of Apache Cassandra in a scalable and resilient peer-to-peer architecture.
  •  - Search and analytics engine based on Apache Lucene.
  •  – Freemium robust web application for exploring, filtering, analyzing, searching and exporting massive datasets scraped from across the Web.
  •  - social graph search platform.
  •  - continuous indexing system.
  •  - continuous indexing system.
  •  - large search index.
  •  - implementation of Percolator, part of HBase.
  •  - quickly and easily search for any content stored in HBase.
  •  - is a Faceted Search implementation written purely in Java, an extension to Apache Lucene.
  •  - is a flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search.
  •  - search architecture at LinkedIn.
  •  - is a realtime search/indexing system written in Java.
  •  - MG4J (Managing Gigabytes for Java) is a full-text search engine for large document collections written in Java. It is highly customisable, high-performance and provides state-of-the-art features and new research algorithms.
  •  - fulltext search engine.

MySQL forks and evolutions

  •  - MySQL databases in Amazon's cloud.
  •  - evolution of MySQL 6.0.
  •  - MySQL databases in Google's cloud.
  •  - enhanced, drop-in replacement for MySQL.
  •  - MySQL implementation using NDB Cluster storage engine.
  •  - enhanced, drop-in replacement for MySQL.
  •  - High Performance Proxy for MySQL.
  •  - TokuDB is a storage engine for MySQL and MariaDB.
  •  - is a collaboration among engineers from several companies that face similar challenges in running MySQL at scale.

PostgreSQL forks and evolutions

  •  - hybrid of MapReduce and DBMS.
  •  - high-performance data warehouse appliances.
  •  - Scalable Open Source PostgreSQL-based Database Cluster.
  •  - Open Source Recommendation Engine Built Entirely Inside PostgreSQL.
  •  - open source MPP database system solely targeted at data warehousing and data mart applications.
  •  - multi-petabyte database / MPP derived from PostgreSQL.
  •  - An open-source time-series database optimized for fast ingest and complex queries
  •  - The Streaming SQL Database. An open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables

Memcached forks and evolutions

  •  - key/value cache for flash storage.
  •  - fork of Memcache.
  •  - A fast, light-weight proxy for memcached and redis.
  •  - key/value cache for flash storage.
  •  - fork of Memcache.

Embedded Databases

  •  - ACID-compliant DBMS developed by Pervasive Software, optimized for embedding in applications.
  •  - a software library that provides a high-performance embedded database for key/value data.
  •  - Erlang LSM BTree Storage.
  •  - a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
  •  - ultra-fast, ultra-compact key-value embedded data store developed by Symas.
  •  - embeddable persistent key-value store for fast storage based on LevelDB.

Business Intelligence

  •  - business intelligence platform in the cloud.
  •  - lean business intelligence platform to visualize and explore your data.
  •  - self-service business intelligence tool in the cloud.
  •  - platform for data products and embedded analytics.
  •  - powerful business intelligence suite.
  •  - customisable Business Intelligence platform.
  •  - Interactive Big Data Analytics.
  •  - business intelligence software and platform.
  •  - software platforms for business intelligence, mobile intelligence, and network applications.
  •  - business intelligence platform.
  •  - business intelligence and analytics platform.
  •  - Open source business intelligence platform, supporting multiple data sources and planned queries.
  •  - open source analytics platform.
  •  - open source business intelligence platform.
  •  - business intelligence platform.
  •  - Big Data Analytics.
  •  - The simplest, fastest way to get business intelligence and analytics to everyone in your company

Data Visualization

  •  - Web UI for PrestoDB.
  •  - fast, simple and flexible JavaScript (HTML5) charting library featuring pure JS API.
  •  - graph visualization library using web workers and jQuery.
  •  - visualize logs and time-stamped data stored in Solr. Port of Kibana.
  •  - Web UI for Impala.
  •  - A powerful Python interactive visualization library that targets modern web browsers for presentation, with the goal of providing elegant, concise construction of novel graphics in the style of D3.js, but also delivering this capability with high-performance interactivity over very large or streaming datasets.
  •  - D3-based reusable chart library
  •  - open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API.
  •  - responsive, retina-compatible charts with just an img tag.
  •  - open source HTML5 Charts visualizations.
  •  - another open source HTML5 Charts visualization.
  •  - JavaScript library for exploring large multivariate datasets in the browser. Works well with dc.js and d3.js.
  •  - JavaScript library for time series visualization.
  •  - JavaScript library for visualizing complex networks.
  •  - Dimensional charting built to work natively with crossfilter rendered using d3.js. Excellent for connecting charts/additional metadata to hover events in D3.
  •  - JavaScript library for manipulating documents.
  •  - Compose complex, data-driven visualizations from reusable charts and components.
  •  - A fairly robust set of reusable charts and styles for d3.js.
  •  - Baidu's enterprise charts.
  •  - dynamic HTML5 visualization.
  •  - write SQL queries that return SVG charts rather than tables
  •  - open source real-time dashboard builder for IoT and other web mashups.
  •  - An award-winning open-source platform for visualizing and manipulating large graphs and network connections. It's like Photoshop, but for graphs. Available for Windows and Mac OS X.
  •  - simple charting API.
  •  - graphite dashboard frontend, editor and graph composer.
  •  - scalable Realtime Graphing.
  •  - simple and flexible charting API.
  •  - provides a rich architecture for interactive computing.
  •  - visualize logs and time-stamped data
  •  - open source big data analysis and visualization platform
  •  - plotting with Python.
  •  - a library built on top of D3 that is optimized for time-series data
  •  - chart components for d3.js.
  •  - Progressive SVG bar, line and pie charts.
  •  - Easy-to-use web service that allows for rapid creation of complex charts, from heatmaps to histograms. Upload data to create and style charts with Plotly's online spreadsheet. Fork others' plots.
  •  The open source JavaScript graphing library that powers plotly.
  •  - simple but powerful library for building data applications in pure JavaScript and HTML.
  •  - open-source platform to query and visualize data.
  •  - A composable charting library built on React components
  •  - a web application framework for R.
  •  - JavaScript library dedicated to graph drawing.
  •  - a data exploration platform designed to be visual, intuitive and interactive, making it easy to slice, dice and visualize data and perform analytics at the speed of thought.
  •  - a visualization grammar.
  •  - a notebook-style collaborative data analysis tool.
  •  - JavaScript charting library for big data.

Internet of things and sensor data

  •  - a programming model and micro-kernel style runtime that can be embedded in gateways and small-footprint edge devices, enabling local, real-time analytics on the edge devices.
  •  - Cloud-based bi-directional monitoring and messaging hub
  •  - Cloud-based sensor analytics.
  •  - Platform for Internet of things.
  •  - Data stream network
  •  - Rapid development and connection of intelligent systems
  •  - If this then that
  •  - Making products smart

Interesting Readings

  •  - Benchmark of Redshift, Hive, Shark, Impala and Stinger/Tez.
  •  - Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison.
  •  - Guide to monitoring Apache Kafka, including native methods for metrics collection.
  •  - Guide to monitoring Hadoop, with an overview of Hadoop architecture, and native methods for metrics collection.

Interesting Papers

2015 - 2016

  •  - Facebook - One Trillion Edges: Graph Processing at Facebook-Scale.

2013 - 2014

  •  - Stanford - Mining of Massive Datasets.
  •  - AMPLab - Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.
  •  - AMPLab - MLbase: A Distributed Machine-learning System.
  •  - AMPLab - Shark: SQL and Rich Analytics at Scale.
  •  - AMPLab - GraphX: A Resilient Distributed Graph System on Spark.
  •  - Google - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.
  •  - Microsoft - Scalable Progressive Analytics on Big Data in the Cloud.
  •  - Metamarkets - Druid: A Real-time Analytical Data Store.
  •  - Google - Online, Asynchronous Schema Change in F1.
  •  - Google - F1: A Distributed SQL Database That Scales.
  •  - Google - MillWheel: Fault-Tolerant Stream Processing at Internet Scale.
  •  - Facebook - Scuba: Diving into Data at Facebook.
  •  - Facebook - Unicorn: A System for Searching the Social Graph.
  •  - Facebook - Scaling Memcache at Facebook.

2011 - 2012

  •  - Twitter - The Unified Logging Infrastructure for Data Analytics at Twitter.
  •  - AMPLab - Blink and It’s Done: Interactive Queries on Very Large Data.
  •  - AMPLab - Fast and Interactive Analytics over Hadoop Data with Spark.
  •  - AMPLab - Shark: Fast Data Analysis Using Coarse-grained Distributed Memory.
  •  - Microsoft - Paxos Replicated State Machines as the Basis of a High-Performance Data Store.
  •  - Microsoft - Paxos Made Parallel.
  •  - AMPLab - BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data.
  •  - Google - Processing a trillion cells per mouse click.
  •  - Google - Spanner: Google’s Globally-Distributed Database.
  •  - AMPLab - Scarlett: Coping with Skewed Content Popularity in MapReduce Clusters.
  •  - AMPLab - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center.
  •  - Google - Megastore: Providing Scalable, Highly Available Storage for Interactive Services.

2001 - 2010

  •  - Facebook - Finding a needle in Haystack: Facebook’s photo storage.
  •  - AMPLab - Spark: Cluster Computing with Working Sets.
  •  - Google - Pregel: A System for Large-Scale Graph Processing.
  •  - Google - Large-scale Incremental Processing Using Distributed Transactions and Notifications (the basis of Percolator and Caffeine).
  •  - Google - Dremel: Interactive Analysis of Web-Scale Datasets.
  •  - Yahoo - S4: Distributed Stream Computing Platform.
  •  - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.
  •  - AMPLab - Chukwa: A large-scale monitoring system.
  •  - Amazon - Dynamo: Amazon’s Highly Available Key-value Store.
  •  - Google - The Chubby lock service for loosely-coupled distributed systems.
  •  - Google - Bigtable: A Distributed Storage System for Structured Data.
  •  - Google - MapReduce: Simplified Data Processing on Large Clusters.
  •  - Google - The Google File System.

Videos

Books

Streaming

  •  - Streaming Data introduces the concepts and requirements of streaming and real-time data systems.
  •  - Storm Applied is a practical guide to using Apache Storm for the real-world tasks associated with processing and analyzing real-time data streams.
  •  - This comprehensive, hands-on guide combining the fundamental building blocks and emerging research in stream processing is ideal for application designers, system builders, analytic developers, as well as students and researchers in the field.
  •  - Presents a new paradigm suitable for stream and complex event processing.
  •  - Unified Log Processing is a practical guide to implementing a unified log of event streams (Kafka or Kinesis) in your business

Distributed systems

  •  - Theory of distributed systems. Includes sections on time and ordering, replication and impossibility results.

Data Visualization

Other Awesome Lists

  • Other awesome lists .
  • Even more lists .
  • Another list? .
  • WTF! .
  • Analytics .

Reposted from: http://gnldl.baihongyu.com/
