# Storage

### Choice of the storage backend

Hadoop has become the backbone of many organizations' big data operations because it is scalable, flexible, and cost-effective. It was born out of the need to process ever-growing volumes of data, the same need we now see in Web3.

Advantages of Hadoop:

* **Scalability**: Hadoop clusters scale out by adding more nodes, so providers can start with the data they have and grow the cluster as their data grows.
* **Cost-effective**: Hadoop uses commodity hardware to store large quantities of data, which dramatically reduces the cost per terabyte of storage.
* **Flexibility**: Hadoop is not limited to MapReduce. We use our own Kubernetes operator, which can schedule jobs both inside and outside the Hadoop cluster, letting us make full use of the available CPU and network resources.
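
To make the scalability point concrete, here is a rough capacity sketch in Python. It assumes the HDFS default replication factor of 3 and ignores space reserved for non-DFS use, so it is an upper bound rather than a sizing tool. The function names and the node sizes in the usage note are hypothetical and are not taken from any DH3 deployment.

```python
import math


def usable_capacity_tb(nodes: int, raw_tb_per_node: float, replication: int = 3) -> float:
    """Approximate usable HDFS capacity in TB.

    Every block is stored `replication` times, so usable space is roughly
    the total raw disk divided by the replication factor.
    """
    if nodes < 0 or raw_tb_per_node < 0 or replication < 1:
        raise ValueError("nodes and raw_tb_per_node must be >= 0, replication >= 1")
    return nodes * raw_tb_per_node / replication


def nodes_needed(target_usable_tb: float, raw_tb_per_node: float, replication: int = 3) -> int:
    """Smallest number of equally sized nodes that reaches the target usable capacity."""
    if target_usable_tb < 0 or raw_tb_per_node <= 0 or replication < 1:
        raise ValueError("invalid capacity parameters")
    return math.ceil(target_usable_tb * replication / raw_tb_per_node)
```

For example, with hypothetical 12 TB nodes, `usable_capacity_tb(10, 12.0)` gives 40.0 TB, and `nodes_needed(100, 12.0)` gives 25 nodes. Growing the data lake then amounts to adding nodes until the estimate covers the new target.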

### DH3 Hadoop operator

Deploying Hadoop to Kubernetes can enhance Hadoop's scalability and deployment flexibility further. Kubernetes, an open-source platform for managing containerized workloads and services, facilitates both declarative configuration and automation.

Implementing Hadoop on Kubernetes involves containerizing Hadoop services, configuring persistent storage solutions to manage HDFS, and setting up network configurations for seamless communication between the Hadoop components.
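
The three pieces named above (a containerized service, persistent storage for HDFS, and stable networking) can be sketched as plain Kubernetes manifests. The Python snippet below is a minimal illustration only, not the output of the DH3 operator: the resource name, the image `example.org/hadoop:3`, the mount path, and the storage size are hypothetical placeholders, and it assumes the image has the `hdfs` command on its path and that DataNodes use the Hadoop 3 default data-transfer port 9866.

```python
import json


def datanode_manifests(name="hdfs-datanode", replicas=3, storage="1Ti",
                       image="example.org/hadoop:3"):
    """Build a headless Service and a StatefulSet for HDFS DataNodes.

    All default values are illustrative assumptions, not DH3 settings.
    """
    labels = {"app": name}

    # Networking: a headless Service gives each pod a stable DNS name.
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "clusterIP": "None",
            "selector": labels,
            "ports": [{"name": "data", "port": 9866}],
        },
    }

    # Containerized service plus persistent storage: each replica gets its
    # own PersistentVolumeClaim from the volumeClaimTemplates entry.
    stateful_set = {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": name,
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "datanode",
                        "image": image,  # hypothetical image
                        "command": ["hdfs", "datanode"],
                        "ports": [{"containerPort": 9866}],
                        "volumeMounts": [
                            {"name": "data", "mountPath": "/hadoop/dfs/data"}
                        ],
                    }],
                },
            },
            "volumeClaimTemplates": [{
                "metadata": {"name": "data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "resources": {"requests": {"storage": storage}},
                },
            }],
        },
    }
    return [service, stateful_set]


if __name__ == "__main__":
    # Emit a single JSON List object that could be reviewed before applying.
    manifest_list = {"apiVersion": "v1", "kind": "List",
                     "items": datanode_manifests()}
    print(json.dumps(manifest_list, indent=2))
```

A StatefulSet is used in this sketch because DataNodes benefit from stable identities and per-pod volumes; an operator would typically generate and reconcile resources like these rather than have them written by hand.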

Data providers already run a custom Kubernetes operator designed specifically for Hadoop deployments, which ensures a smooth deployment process across different bare-metal environments. This lets [providers](/specs/providers.md) focus on expanding their data lakes and minimizes redundant infrastructure work.

In conclusion, Hadoop's design philosophy of high fault tolerance, cost-effectiveness, and scalability, combined with Kubernetes' dynamic and flexible nature, presents a formidable solution for managing big data.

