How to store a large number of files in a file system?
by Markus Breuer | Jun 30, 2019 | Big Data
Storing a large number of files becomes a challenge. Modern file systems are powerful, but they reach their limits when filled with many files. The smaller the files, the worse the disk performance; the more files archived, the worse the disk performance. This is the basic constraint when using a file system as archive space. But where is the limit? When should I use a file system, and when should I think about alternatives?
Our instinct says file systems are fast. Browsing files and folders is pretty fast, and copying files from here to there takes seconds, depending on the operating system and the antivirus software in use. But it is hard to define a hard limit. In a test lab, performance looks great; after moving to production, it decreases. Available working memory has an impact on performance: the operating system uses free memory to cache disk access, so unused memory has added value.
How file systems work and store files
There are many different file systems. On Windows, FAT and NTFS are popular, and Linux commonly uses ext3 or ext4. Many more implementations exist, and each has its own specifics. What they have in common are files and directories. A directory is a container that holds a list of entries. An entry may be a file or a directory. A file is a reference to its contents. So a file system is a hierarchical structure of files and directories.
An operating system manages access to file system contents. Accessing the file system means looking up references. What happens when opening a file? First, the operating system searches for the containing directory. Next, it searches that directory for the file entry. The file entry provides the reference to the file contents. So a file system splits disk space into two areas: one for metadata, with references to files and directories, and one for file contents. Accessing a file means accessing both areas.
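The split between metadata and contents can be observed from an application. The following is a minimal sketch in Python, not a description of any particular file system's internals: `os.stat` returns the metadata of an entry without reading its contents, while `open(...).read()` fetches the contents themselves. The helper names `describe` and `read_contents` and the sample file name are made up for this illustration.

```python
import os
import stat
import tempfile


def describe(path):
    """Return metadata for path without reading the file's contents."""
    info = os.stat(path)  # metadata lookup only
    return {
        "inode": info.st_ino,
        "size": info.st_size,
        "is_dir": stat.S_ISDIR(info.st_mode),
    }


def read_contents(path):
    """Read the bytes that the file entry refers to."""
    with open(path, "rb") as handle:
        return handle.read()


# Create a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as root:
    sample = os.path.join(root, "example.txt")
    with open(sample, "wb") as handle:
        handle.write(b"hello archive")

    meta = describe(sample)        # touches the metadata area
    data = read_contents(sample)   # touches the contents area as well

print(meta["size"], len(data))
```

The size reported by the metadata matches the length of the contents, although `describe` never opened the file.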
File systems and block devices
File systems are based on block devices. A block device manages disk space as a sequence of blocks. Every block has the same size, and a block is the smallest unit of work. So reading from a file system means reading blocks from disk. The operating system serves block devices; it is the connection between the disk device and the application.
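Because a block is the smallest unit, even a tiny file occupies at least one whole block for its contents. The arithmetic can be sketched as follows. The block size of 4096 bytes is an assumption chosen for illustration, and the sketch ignores optimizations that some file systems apply to very small files.

```python
def blocks_needed(file_size, block_size=4096):
    """Number of content blocks for a file (block_size is an assumed value)."""
    # Ceiling division with integers only, so large sizes stay exact.
    return -(-file_size // block_size)


def slack_bytes(file_size, block_size=4096):
    """Bytes allocated in the last block but not used by the file."""
    return blocks_needed(file_size, block_size) * block_size - file_size


for size in (100, 4096, 4097):
    print(size, blocks_needed(size), slack_bytes(size))
```

Under these assumptions a 100-byte file leaves 3996 bytes of its block unused, which is one reason why many small files use disk space and cache less efficiently than a few large ones.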
Access to disk is much slower than access to memory, so the operating system uses a cache to hold blocks in memory. Memory is fast but limited in size, so the cache tries to keep frequently used blocks in memory. Applications are the main memory consumers. The operating system assigns unused memory to the disk cache, so the cache grows or shrinks depending on application memory usage. At this point, file system metadata becomes important. It holds the file and directory references, and its blocks are among the most frequently used.
A single block holds a lot of metadata: references to several entries are held in one block. So querying part of a file system touches only a few blocks. These blocks are hot spots in the cache, and the probability that such a block is in the cache is high. In normal operation, file system access benefits from the cache.
Putting together file system, block device and caches to store a large number of files
Let us summarize. File systems operate on block devices. Block devices manage access to disk space. The operating system is the glue between disk and applications. Caching is a performance booster for block devices, and the cache uses the operating system's free memory. So what goes wrong when a file system has to store a large number of files?
Worst-case scenarios when storing files in a file system
- Starting more and more applications consumes memory. Less free memory is available for the cache, so the probability of a cache miss rises and physical disk access is required.
- An increasing number of files and directories increases the metadata space. Very many files take up a lot of cache, so cache misses are likely to cause physical disk access.
- Putting many files into a single directory. Opening a file requires querying the directory's entries, and searching a directory accesses the blocks that belong to it. In the worst case, later blocks evict earlier blocks from the cache, so accessing the directory again requires physical disk access.
- Creating many directories is a very similar problem. Changing the working directory requires the operating system to reload metadata. If the metadata does not fit into the cache, then physical disk operations are required.
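The eviction effect in these scenarios can be illustrated with a small simulation. This is a toy model, assuming a strict least-recently-used (LRU) cache and a directory that is scanned block by block; real operating system caches use more elaborate policies. `BlockCache` and `scan_twice` are hypothetical names used only here.

```python
from collections import OrderedDict


class BlockCache:
    """Toy LRU cache that counts hits and misses per block read."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recently used
            self.hits += 1
            return
        self.misses += 1  # stands for a physical disk access
        self.blocks[block_id] = True
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used


def scan_twice(directory_blocks, cache_blocks):
    """Scan all blocks of one directory twice; return (hits, misses)."""
    cache = BlockCache(cache_blocks)
    for _ in range(2):
        for block_id in range(directory_blocks):
            cache.read(block_id)
    return cache.hits, cache.misses


small = scan_twice(directory_blocks=8, cache_blocks=10)
large = scan_twice(directory_blocks=12, cache_blocks=10)
print(small)  # (8, 8)
print(large)  # (0, 24)
```

When the directory fits into the cache, the second scan is served entirely from memory. When it is only slightly larger than the cache, each new block evicts one that is needed again shortly afterwards, so in this model the second scan gets no hits at all.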
File system experiences on Windows and Linux
On Windows workstations there seems to be a difference between the GUI and the shell. Reading directories with more than 10,000 files in the GUI may block for seconds, while accessing the same directory in a shell is often faster. Shell extensions are a probable explanation for the difference. Most Windows hosts also run antivirus software, which slows performance further.
On Linux machines shell access is faster. Directories with more than 100,000 files become slow when using commands like ls or rm, and with more than 1,000,000 files in a single directory, access becomes significantly slower. Putting so many files into one directory is not a good idea: querying the directory contents reads many blocks, which is time-consuming.
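When a large directory has to be read anyway, it can help to process entries one at a time instead of first building a complete, sorted listing. A minimal sketch in Python uses `os.scandir`, which returns an iterator over directory entries in arbitrary order. The function name `count_files` and the sample file names are made up for this example, and the sketch makes no claim about how much time this saves on a given system.

```python
import os
import tempfile


def count_files(directory):
    """Count regular files in a directory, one entry at a time."""
    count = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            # Symbolic links are not followed, so links are not counted.
            if entry.is_file(follow_symlinks=False):
                count += 1
    return count


# Build a small throwaway directory so the example is self-contained.
with tempfile.TemporaryDirectory() as root:
    for index in range(250):
        with open(os.path.join(root, f"file-{index:05d}.dat"), "wb"):
            pass
    os.mkdir(os.path.join(root, "subdir"))
    total = count_files(root)

print(total)  # 250
```

The blocks of the directory still have to be read, so this does not remove the underlying cost; it only avoids holding and sorting the whole listing in the application.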
Conclusion
Keep in mind that file systems have limits. Newer hardware or recent operating system versions may raise the limits, but limits still exist. Storing a growing number of files may reach these limits in the future. Random access to a large set of files or directories may cause cache misses. In many cases a host serves many applications, and a single application causing heavy I/O traffic may slow down the whole host. Estimate the number of files. Keep the number of files per directory below 1,000 to 5,000. Spread files across directories, but beware of flooding the file system with too many directories.
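One common way to spread files across directories is to derive the subdirectory from a hash of the file name. The following is a sketch of that idea, not a scheme prescribed by this article: the functions `sharded_path` and `expected_files_per_directory`, the parameters `levels` and `width`, and the sample names are assumptions made for illustration.

```python
import hashlib
import os


def sharded_path(root, filename, levels=2, width=2):
    """Build root/<h1>/<h2>/.../filename from a hash of the file name."""
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    # Each level uses `width` hex characters, giving 16**width directories.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, filename)


def expected_files_per_directory(total_files, levels=2, width=2):
    """Average files per leaf directory, assuming an even hash spread."""
    return total_files / (16 ** (levels * width))


path = sharded_path("archive", "invoice-0001.pdf")
print(path)
print(expected_files_per_directory(1_000_000, levels=1))  # 3906.25
print(expected_files_per_directory(1_000_000, levels=2))  # 15.2587890625
```

With one level of two hex characters, one million files land in 256 directories of roughly 3,900 files each, which is inside the range recommended above. Two levels reduce that to about 15 files per directory but create 65,536 directories, which illustrates the trade-off between files per directory and the total number of directories.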