How GIT uses Key Value Store Concept

How git works

GIT is one of the best-known SCM systems. It is as flexible as it is powerful and adapts to many workflows. But it is complex, and there are many pitfalls for less experienced users. Behind the scenes, GIT is a simple database with a lot of front-end operations on top. Understanding GIT's core concepts is a prerequisite for understanding the higher-level operations. This article explains how GIT works behind the scenes and gives you a basic understanding of those core concepts.

GIT is a key value store

GIT uses a database to store file artifacts. This database is a key value store and is the heart of GIT SCM. Files, directories and commit messages are all archived in this store. But how does this work? The core concept of GIT is pretty simple. Adding a file to GIT means adding a key value pair to the store. Keys are hex strings, while values are file contents. Sounds simple and is simple!
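The idea can be sketched in a few lines of Python. This is only a toy model, not GIT's implementation: the class name `KeyValueStore` and its methods are made up for illustration, and hashing the raw bytes with SHA-1 is a simplification of what GIT really hashes.

```python
import hashlib


class KeyValueStore:
    """Toy content-addressed store: the key is derived from the value."""

    def __init__(self):
        self._entries = {}

    def put(self, value: bytes) -> str:
        # The key is a hex string computed from the content itself.
        key = hashlib.sha1(value).hexdigest()
        self._entries[key] = value
        return key

    def get(self, key: str) -> bytes:
        return self._entries[key]


store = KeyValueStore()
key = store.put(b"hello world\n")
print(key, store.get(key))
```

Because the key depends only on the content, storing the same content twice yields the same key and only one entry.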

Add single file to GIT key value store

A basic and fundamental operation is archiving a single file. GIT calculates a hash value from the file content; for practical purposes this hash is unique. In the next step GIT creates a key value pair and adds it to the key value store. The pair uses the hash as key and the file content as value. At this point the key value store holds one entry for the file. But so far the file name is missing. GIT adds it in the next step.
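To make the hashing step concrete, here is a minimal Python sketch of how the key for a file can be derived. It assumes the default SHA-1 object format, where GIT hashes a short header (the word `blob`, the content length and a zero byte) together with the content rather than the raw content alone. The function name `blob_key` is made up for this illustration.

```python
import hashlib


def blob_key(content: bytes) -> str:
    """Return the key GIT would use for file content (SHA-1 object format)."""
    # Header: object type, a space, the content length in bytes, a zero byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


print(blob_key(b"test content\n"))
```

For a file on disk, the result should match what `git hash-object <file>` prints, as long as the repository uses SHA-1 and no content filters are involved.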

Add file names and directories to GIT key value store

Storing file names is pretty simple because GIT treats a directory as a file. It is similar to listing a directory and piping the output into a text file. In addition, GIT creates tuples by associating each file name with its hash value. Think of a text file with one line per file. Each line contains a file name and the corresponding hash value. The whole file is a snapshot of the directory. It contains hash references to the versioned files.
In this way GIT separates file names from their contents. An advantage is automatic detection of file movements. Moving a file to a different directory has no effect on its hash value, because the hash depends only on the content and not on the name. But moving a file or directory affects the directory listings. The listings of both the source and the target directory change, so each of them requires a new file with a different hash value.
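The effect of moving a file can be sketched as follows. The listing format used here (one `<hash> <name>` line per entry) is a simplified stand-in for GIT's real directory format, and the helper names `key_of` and `listing` are hypothetical.

```python
import hashlib


def key_of(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def listing(entries: dict) -> bytes:
    """Build a text snapshot of a directory: one '<hash> <name>' line per entry."""
    lines = [f"{h} {name}" for name, h in sorted(entries.items())]
    return "\n".join(lines).encode("utf-8")


a_key = key_of(b"class A {}\n")
b_key = key_of(b"class B {}\n")

# Before the move: both files live in src.
src_before = listing({"A.java": a_key, "B.java": b_key})

# After moving A.java from src to lib: the content key is reused as-is,
# only the two directory listings are rewritten.
src_after = listing({"B.java": b_key})
lib_after = listing({"A.java": a_key})

print(key_of(src_before) != key_of(src_after))
```

The content entry for `A.java` is untouched by the move; only the listings of the source and target directory get new keys.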

Add directories to GIT key value store

GIT recursively scans the directory tree from the leaves to the root. It processes every directory in the same way as described in the last paragraph. Let's review the example step by step:

archive file A.java with hash value a1

archive file B.java with hash value b1

archive ./src with file names A.java and B.java and use hash value c1

archive ./readme.txt with hash value d1

archive ./ with ./src and ./readme.txt and use hash value e1
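The steps above can be sketched as a small recursive function. This is an illustration under simplifying assumptions: the directory tree is modeled as a nested Python dict instead of real files, the listing format is a plain-text stand-in for GIT's real one, and the names `put`, `archive` and `project` are made up.

```python
import hashlib

store = {}


def put(data: bytes) -> str:
    """Add a value to the store and return its key."""
    key = hashlib.sha1(data).hexdigest()
    store[key] = data
    return key


def archive(node) -> str:
    """Archive a file (bytes) or a directory (dict) and return its key."""
    if isinstance(node, bytes):
        return put(node)
    # A directory: archive all children first, then store the listing.
    lines = [f"{archive(node[name])} {name}" for name in sorted(node)]
    return put("\n".join(lines).encode("utf-8"))


project = {
    "src": {"A.java": b"class A {}\n", "B.java": b"class B {}\n"},
    "readme.txt": b"Example project\n",
}

root_key = archive(project)
print(len(store), root_key)
```

The store ends up with five entries, matching the five steps of the example: two source files, the `src` listing, the readme and the root listing.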

Archive Commit to GIT key value store

Up to here archiving seems to be complete. But stop, an important detail is missing. What about the commit operation and its comment? Not surprisingly, GIT creates a file for the commit operation as well. First, the file contains the commit message. Second, it contains entries for author and time stamp. Last but not least, it contains a reference to the root directory.
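A commit file of this kind could look like the following sketch. The layout is loosely modeled on GIT's commit text but simplified, and the author, time stamp, message and the root key `e1` are placeholder values chosen for illustration.

```python
import hashlib


def commit_text(tree_key: str, author: str, timestamp: int, message: str) -> str:
    """Build a simplified commit file: root reference, author, time, message."""
    return (
        f"tree {tree_key}\n"
        f"author {author} {timestamp}\n"
        f"\n"
        f"{message}\n"
    )


def key_of(text: str) -> str:
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


root_key = "e1"  # placeholder: key of the root directory from the example
commit = commit_text(root_key, "Jane Doe <jane@example.com>", 1660464000, "Initial commit")

print(commit)
print(key_of(commit))
```

Like files and directory listings, the commit text is just another value, so it can be stored under a key computed from its content.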
