Tôi là Duyệt

2024-11-01Data

In Data Platform System with ClickHouse, rather than extracting data from ClickHouse for processing in external systems, we can perform transformations directly within ClickHouse itself. ClickHouse can call any external executable program or script to process data. My idea is using custom **User-Defined Functions (UDFs) written in Rust** to handle data transformations between tables.

2024-06-23Data

ReplicatedReplacingMergeTree

Now you have a large single node cluster with a ReplacingMergeTree table that can deduplicate itself. This time, you need more replicated nodes to serve more data users or improve the high availability.

2024-06-01Data

ReplacingMergeTree

My favorite ClickHouse table engine is `ReplacingMergeTree`. The main reason is that it is similar to `MergeTree` but can automatically deduplicate based on columns in the `ORDER BY` clause, which is very useful.

2024-05-31Data

MergeTree

After starting this series ClickHouse on Kubernetes, you can now configure your first single-node ClickHouse server. Let's dive into creating your first table and understanding the basic concepts behind the ClickHouse engine, its data storage, and some cool features

2024-03-27Data

Monitoring ClickHouse on Kubernetes

Now that you have your first ClickHouse instance on Kubernetes and are starting to use it, you need to monitoring and observing what happens on it is an important task to achieve stability.

2024-03-26Data

ClickHouse SELECT Advances

Dynamic column selection (also known as a `COLUMNS` expression) allows you to match some columns in a result with a re2 regular expression.

2024-03-13Data

ClickHouse on Kubernetes

ClickHouse has been both exciting and incredibly challenging based on my experience migrating and scaling from Iceberg to ClickHouse, zero to a large cluster of trillions of rows. I have had to deal with many of use cases and resolve issues. I have been trying to take notes every day for myself, although it takes time to publish them as a series of blog posts. I hope I can do so on this ClickHouse on Kubernetes series.

2023-09-03Data

DuckDB

In this post, I want to explore the features and capabilities of DuckDB, an open-source, in-process SQL OLAP database management system written in C++11 that has been gaining popularity recently. According to what people have said, DuckDB is designed to be easy to use and flexible, allowing you to run complex queries on relational datasets using either local, file-based DuckDB instances or the cloud service MotherDuck.

2023-05-07Data

Running Spark in GitHub Actions

This post provides a quick and easy guide on how to run Apache Spark in GitHub Actions for testing purposes

2023-04-15Data

Why does Helm Charts interpret 0777 as 511?

Why does Helm Charts interpret 0777 to 511? It took me quite some time to debug it.

2023-04-01Data

GPT vs Traditional NLP Models

The field of Natural Language Processing (NLP) has seen remarkable advancements in recent years, and the emergence of the Generative Pre-trained Transformer (GPT) has revolutionized the way NLP models operate. GPT is a cutting-edge language model that employs deep learning to generate human-like text. Unlike conventional NLP models, which required extensive training on specific tasks, GPT is pre-trained on vast amounts of data and can be fine-tuned for various NLP tasks

2023-02-26Data

Ask ChatGPT about 20 important concepts of Apache Spark

I asked ChatGPT to explain 20 important concepts of Apache Spark. Let's see what it has to say!

2023-01-22Data

Data Engineering Tools written in Rust

This blog post will provide an overview of the data engineering tools available in Rust, their advantages and benefits, as well as a discussion on why Rust is a great choice for data engineering.

2023-01-10Data

Why ClickHouse Should Be the Go-To Choice for Your Next Data Platform?

Recently, I was working on building a new Logs dashboard at Fossil to serve our internal team for log retrieval, and I found ClickHouse to be a very interesting and fast engine for this purpose. In this post, I'll share my experience with using ClickHouse as the foundation of a light-weight data platform and how it compares to another popular choice, Athena. We'll also explore how ClickHouse can be integrated with other tools such as Kafka to create a robust and efficient data pipeline.

2022-09-27Data

Airflow Dataset (Data-aware scheduling)

Airflow since 2.4, in addition to scheduling DAGs based upon time, they can also be scheduled based upon a task updating a dataset. This will change the way you schedule DAGs.

2022-03-09Data

Spark on Kubernetes tại Fossil 🤔

Apache Spark được chọn làm công nghệ cho Batch layer bởi khả năng xử lý một lượng lớn data cùng một lúc. Ở thiết kế ban đầu, team data chọn sử dụng Apache Spark trên AWS EMR do có sẵn và triển khai nhanh chóng. Dần dần, AWS EMR bộc lộ một số điểm hạn chế trên môi trường Production. Trong bài viết này, mình sẽ nói về tại sao và làm thế nào team Data chuyển từ Spark trên AWS EMR sang Kubernetes.

2022-02-24Data

Manage Redshift/Postgres Privileges GitOps Style

The grant project aims to manage Postgres and Redshift database roles and privileges in GitOps style. Grant is the culmination of my learning of Rust for data engineering tools.

2021-11-27Data

Rust và Data Engineering? 🤔

Đối với một Data Engineer như mình, ưu tiên chọn một ngôn ngữ dựa trên việc nó có giải quyết được hết hầu hết các nhu cầu và bài toán của mình hay không: Data Engineering, Distributed System và Web Development. Và cuối cùng mình dự định sẽ bắt đầu với Rust, bởi vì ...

2021-11-22Data

Spark on Kubernetes - better handling for node shutdown

Spark 3.1 on the Kubernetes project is now officially declared as production-ready and Generally Available. Spot instances in Kubernetes can cut your bill by up to 70-80% if you are willing to trade in reliability. The new feature - SPIP: Add better handling for node shutdown (SPARK-20624) was implemented to deal with the problem of losing an executor when working with spot nodes - the need to recompute the shuffle or cached data.

2021-08-29Data

Good reasons to use ClickHouse

More than 200+ companies are using ClickHouse today. With many features support, it's equally powerful for both Analytics and Big Data service backend.

2021-07-04Data

Postgres Full Text Search

Postgres has built-in functions to handle Full Text Search queries. This is like a "search engine" within Postgres.

2017-05-31Data

Cài Apache Spark standalone bản pre-built

Mình nhận được nhiều phản hồi từ bài viết BigData - Cài đặt Apache Spark trên Ubuntu 14.04 rằng sao cài khó và phức tạp thế. Thực ra bài viết đó mình hướng dẫn cách build và install từ source.

2016-09-20Data

Chạy Apache Spark với Jupyter Notebook

IPython Notebook là một công cụ tiện lợi cho Python. Ta có thể Debug chương trình PySpark Line-by-line trên IPython Notebook một cách dễ dàng, tiết kiệm được nhiều thời gian.

2016-09-08Data

PySpark - Thiếu thư viện Python trên Worker

Apache Spark chạy trên Cluster, với Java thì đơn giản. Với Python thì package python phải được cài trên từng Node của Worker. Nếu không bạn sẽ gặp phải lỗi thiếu thư viện.

2016-06-29Data

Tìm hiểu về dữ liệu trong thể thao hiện đại

Tìm hiểu về dữ liệu trong thể thao hiện đại. Một trong những câu trả lời cho câu hỏi: Dân hệ thống thông tin thì làm gì?

2016-02-03Data

Bigdata - Columnar Database và Graph Database

Như đã nói về big data, chúng ta có các loại dữ liệu khác nhau và chúng ta cần lưu trữ trong database. Bigdata có thể xử lý và lưu trữ trên nhiều loại CSDL khác nhau. Sau đây tôi sẽ nói 1 ít về columnar Database và Graph Database.

2016-02-03Data

Graph Database

Bài trước tôi có nói về Columnar Database và Graph Database. Mục đích là so sánh và đi sâu vào Graph Database. Tiếp đến là xử lý Graph Database với Big Data.

2015-12-12Data

Apache Spark on Docker

Docker and Spark are two technologies which are very hyped these days

2015-04-18Data

PySpark Getting Started

Hadoop is the standard tool for distributed computing across really large data sets and is the reason why you see "Big Data" on advertisements as you walk through the airport. It has become an operating system for Big Data, providing a rich ecosystem of tools and techniques that allow you to use a large cluster of relatively cheap commodity hardware to do computing at supercomputer scale. Two ideas from Google in 2003 and 2004 made Hadoop possible: a framework for distributed storage (The Google File System), which is implemented as HDFS in Hadoop, and a framework for distributed computing (MapReduce).

2015-04-12Data

Big Data Explained in Less Than 2 Minutes

There are some things that are so big that they have implications for everyone, whether we want them to or not. Big Data is one of those concepts, and is completely transforming the way we do business and is impacting most other parts of our lives.

2015-04-09Data

MongoDB - Cách thiết lập để App Server kết nối đến MongoDb Server

Thông thường, chúng ta thường thiết lập để Code và phần Database chung 1 server. Với những ứng dụng lớn để quản lý, chúng ta phải tách riêng biệt chúng trên nhiều server khác nhau. Bởi vì mặc định MongoDb không cho phép remote connections mà chỉ cho phép kết nối nội bộ. Mình sẽ hướng dẫn cách thiết lập sao cho từ App Server (server chứa code) kết nối được tới MongoDb Server (hoặc cụm MongoDb Server)

2015-04-06Data

Redis

Redis là 1 trong số các hệ quản trị cơ sở dữ liệu phát triển mang phong cách NoSQL. Redis là hệ thống lưu trữ key-value với rất nhiều tính năng và được sử dụng rộng rãi. Redis nổi bật bởi việc hỗ trợ nhiều cấu trúc dữ liệu cơ bản (hash, list, set, sorted set, string), đồng thời cho phép scripting bằng ngôn ngữ lua.

Resources

me@duyet.net