Best practices for designing the document library structure

When creating your library structure for FileHold, you need to carefully plan the Cabinet, Drawer, and Folder structure. A large number of Cabinets and Folders can slow the system down, so scalability and future growth need to be taken into account. Many factors affect the performance of the system, so the general guidelines provided here are based on calculations, tests, and experience.

Below are the general guidelines for creating the library structure and the technical background information on the effect of library size on performance. Sizes are approximate.

Guidelines for the document library structure - summary

  1. It is important to predict the size of the library not only as it is when the structure is designed, but also as it will grow in the following years, in order to avoid having to change the design in the future.

  2. The best way to design the FileHold document library structure is to keep the structure small, even when the number of documents is very large. It is always better to rely on metadata values and search facilities to find relevant documents. Such an approach is more flexible than using a fixed library structure. Separate cabinets and folders should only be used to control permissions to various parts of the system (for example, to separate Accounting documents from Engineering documents) and to divide documents into large chunks (for example, a separate folder for each accounting year).

  3. If there is a need to use a large number of folders (for example, one folder per client), the number of drawers and folders needs to be properly balanced, so that the total number of drawers is less than 500 and the number of folders in each drawer is less than 200. This can be achieved, for example, by distributing folders into separate drawers based on the first letter (or several letters) of their name. Folder groups may also help, although they only make the structure clearer and do not improve performance.

  4. The cost of calculating permissions for cabinets is relatively high, so there should only be as many cabinets as necessary. It’s generally better to have 5 cabinets with 100 drawers each than 50 cabinets with 10 drawers each, even though the page size is similar (see the table below). Such a structure is also easier to manage if permissions need to be changed. If more granular control over permissions is necessary, they can be controlled at the folder level.

  5. It is also very important to keep only as many drawers expanded as necessary. Drawers that are no longer needed should be collapsed. This also makes it easier to navigate the library tree, as there is no need to scroll through a large list of folders. As a general rule, no more than 1,000 items should be visible at any given time. It is also important to remember to collapse drawers before logging out from FileHold; this will make logging back in much faster.
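The first-letter distribution described in guideline 3 can be sketched in a few lines of Python. This is only an illustration of the idea, not a FileHold feature: the function names, the sample client names, and the choice to try prefixes of up to three letters are all assumptions. The two limits come from guideline 3.

```python
from collections import defaultdict

MAX_DRAWERS = 500             # guideline 3: total drawers should stay below this
MAX_FOLDERS_PER_DRAWER = 200  # guideline 3: folders per drawer should stay below this


def assign_drawers(folder_names, prefix_len=1):
    """Group folder names into drawers keyed by their first letter(s)."""
    drawers = defaultdict(list)
    for name in folder_names:
        # Hypothetical convention: names with no usable prefix go into "#".
        key = name.strip()[:prefix_len].upper() or "#"
        drawers[key].append(name)
    return dict(drawers)


def is_balanced(drawers):
    """Return True when the layout stays within both guideline 3 limits."""
    return (len(drawers) < MAX_DRAWERS
            and all(len(folders) < MAX_FOLDERS_PER_DRAWER
                    for folders in drawers.values()))


def plan_drawers(folder_names, max_prefix_len=3):
    """Try longer prefixes until the layout is balanced; None if none fits."""
    for prefix_len in range(1, max_prefix_len + 1):
        drawers = assign_drawers(folder_names, prefix_len)
        if is_balanced(drawers):
            return drawers
    return None
```

For example, plan_drawers(["Acme Corp", "Apex Ltd", "Borealis Inc", "Contoso"]) groups the four sample clients into drawers "A", "B", and "C". If a single letter would put 200 or more folders into one drawer, the sketch falls back to two-letter and then three-letter prefixes.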

Effect of library size on performance - Technical information

The size of the library affects performance in many different ways, including, but not limited to:

  • The cost of SQL queries that retrieve data from the database and calculate permissions.

  • The cost of transferring data to the client (Web Client and FileHold Desktop Application).

  • The size of HTML markup and JavaScript code that needs to be processed by the browser (Web Client).

Each of these factors may affect performance to a certain degree, but the overall performance will be as good (or as bad) as the weakest link in this chain.

To some degree, the performance of the SQL queries can be improved by placing the database on a machine with a lot of RAM and processing power. The cost of transferring data to the client can be reduced by using HTTP compression (which FileHold uses) and broadband connections. However, the size of the HTML markup will always affect the amount of memory used by the browser and the time required to process and display the page. That cost is difficult to avoid.

In FileHold, the structure of the library is retrieved in two steps:

  • First, all cabinets and drawers are retrieved whether they are expanded in the tree structure or not.

  • Then, folder groups and folders from all expanded drawers are retrieved.

This means that having a lot of cabinets and/or drawers is not a good idea, as they all have to be loaded and sent to the client on every page load (in the case of the Web Client). Even though loading drawers from the database is relatively cheap, as they don’t have advanced permission settings, the amount of generated HTML markup may be very large. Each drawer adds about 2,500 bytes, so for 1,000 drawers the size of each page is at least 2.5 MB.

Having lots of folders in a single drawer can also seriously affect performance. Retrieving folders from the database is quite costly, because permissions must be calculated individually for each folder. Also, the amount of HTML markup is 2,000 bytes per folder, so each expanded drawer with 1,000 folders adds another 2 MB to the page size. This cost grows dramatically as more drawers are expanded at the same time.

Page size can be a good estimate for performance, because it affects more than the amount of data that needs to be transferred over the network (which is usually compressed). Generating HTML markup requires lots of memory and computing power on the server. Parsing and storing the data in the web browser is even more costly, because the browser needs many times more memory to store the data than the size of the raw HTML markup. Although page size is not directly relevant when using the FileHold Desktop Application (FDA), it is still a good measure of the amount of data that the FDA needs to keep in memory and retrieve from the FileHold server. The FDA doesn’t need to retrieve this data for each operation, but loading it at startup, when logging on to the server, may still take a significant amount of time.

Assuming that there are C cabinets, D drawers in each cabinet and F folders in each drawer, and that E drawers are expanded (opened, showing the folder list), the size of the library page in bytes (counting nothing other than the library tree) can be estimated using the following equation:

Page Size = C * 2,500 + C * D * 2,500 + E * F * 2,000

The total number of folders in the library equals:

            Total Folders = C * D * F
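The two equations above translate directly into code. The following is a minimal Python sketch; the function and constant names are chosen here for illustration, and the byte values are the approximate per-item markup sizes used in the page size equation.

```python
CABINET_BYTES = 2_500  # approximate markup per cabinet
DRAWER_BYTES = 2_500   # approximate markup per drawer
FOLDER_BYTES = 2_000   # approximate markup per folder in an expanded drawer


def page_size(c, d, e, f):
    """Estimate the library tree page size in bytes.

    c: cabinets, d: drawers per cabinet,
    e: expanded drawers, f: folders per drawer.
    """
    return c * CABINET_BYTES + c * d * DRAWER_BYTES + e * f * FOLDER_BYTES


def total_folders(c, d, f):
    """Total number of folders in the library."""
    return c * d * f
```

For instance, page_size(5, 100, 1, 50) returns 1362500 (about 1.36 MB) and total_folders(5, 100, 50) returns 25000.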

Let’s assume that there are 25,000 folders in the library, and we divide them into cabinets and drawers in three different ways:

  • Case 1: 5 cabinets, 10 drawers each, 500 folders per drawer
  • Case 2: 5 cabinets, 100 drawers each, 50 folders per drawer
  • Case 3: 5 cabinets, 1,000 drawers each, 5 folders per drawer

Depending on the distribution of folders into drawers, page size will change significantly:

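C | D     | E | F   | Page Size (bytes) | Total Folders
--|-------|---|-----|-------------------|--------------
5 | 10    | 1 | 500 | 1,137,500         | 25,000
5 | 10    | 2 | 500 | 2,137,500         | 25,000
5 | 100   | 1 | 50  | 1,362,500         | 25,000
5 | 100   | 2 | 50  | 1,462,500         | 25,000
5 | 1,000 | 1 | 5   | 12,522,500        | 25,000
5 | 1,000 | 2 | 5   | 12,532,500        | 25,000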

Each case is shown with one expanded drawer (E = 1) and two expanded drawers (E = 2).

In the case of 1,000 drawers per cabinet, the page size is always over 12 MB, no matter how many drawers are expanded.

In the case of 10 drawers per cabinet, the page size with one expanded drawer is slightly over 1 MB, but it grows very quickly when more drawers are expanded.

The case with 100 drawers per cabinet and 50 folders per drawer is the most balanced. The initial page size is not very large compared to the third case and it does not grow as rapidly as in the first case.
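To see how quickly the first case loses its advantage, the page size equation can be evaluated for a growing number of expanded drawers. The Python sketch below is a rough illustration only: the Layout type and the function names are invented for this example, and the result is an estimate derived from the equation, not a measurement.

```python
from typing import NamedTuple, Optional


class Layout(NamedTuple):
    cabinets: int
    drawers_per_cabinet: int
    folders_per_drawer: int


def page_size(layout: Layout, expanded: int) -> int:
    """Estimated page size in bytes with the given number of expanded drawers."""
    c, d, f = layout
    return c * 2_500 + c * d * 2_500 + expanded * f * 2_000


def first_expanded_count_where_larger(a: Layout, b: Layout) -> Optional[int]:
    """Smallest number of expanded drawers at which layout a's page is larger
    than layout b's, or None if that never happens."""
    # Neither layout can expand more drawers than it actually has.
    max_expanded = min(a.cabinets * a.drawers_per_cabinet,
                       b.cabinets * b.drawers_per_cabinet)
    for expanded in range(max_expanded + 1):
        if page_size(a, expanded) > page_size(b, expanded):
            return expanded
    return None


CASE_1 = Layout(5, 10, 500)
CASE_2 = Layout(5, 100, 50)
CASE_3 = Layout(5, 1_000, 5)
```

With these figures, first_expanded_count_where_larger(CASE_1, CASE_2) returns 2 and first_expanded_count_where_larger(CASE_1, CASE_3) returns 13. In other words, the first case already produces a larger page than the balanced case once two drawers are open, and it overtakes even the 12 MB third case once 13 drawers are open.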

Folder Groups do not make a significant difference in the math since all folders from folder groups are retrieved at the same time when a drawer is expanded, even when the folder groups are not expanded. For example, instead of a flat list of 500 folders per drawer, we could have 10 groups with 50 folders each, but that would not change the page size significantly.