The problem: listing folders and drives
Recently, while working on a project, a colleague asked whether one could list the contents of drives in Python. Of course you can. And since this isn’t at all complicated, I’d like to use this case to illustrate key best practices for working with paths on drives.
Step 1: How do I input the right path?
Assume you want an accurate listing of a particular path. As a reproducible example, we start with a user directory on a Windows 10 system:
path_dir: str = "C:\Users\sselt\Documents\blog_demo"
Executing this assignment immediately raises an error:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
The interpreter stumbles over the character sequence \U, which introduces a 32-bit Unicode escape and must be followed by eight hexadecimal digits. The problem arises because Windows uses the backslash “\” as its path separator, whereas Linux uses the forward slash “/”. Unfortunately, the backslash is also the character that introduces escape sequences in Python string literals, so the two uses collide. Just as we cannot expect countries to agree on a decimal separator any time soon, our only choice is to go for one of three solutions.
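To reproduce the error without breaking a whole script, the offending line can be compiled from a string. This is only a minimal sketch: the variable names `source` and `error_message` are made up for illustration, and the source text is written as a raw string (the “r” prefix is covered under Solution 3) so that it contains the single backslashes exactly as typed above.

```python
# Source text of the failing assignment, with single backslashes.
source = r'path_dir = "C:\Users\sselt\Documents\blog_demo"'

try:
    # compile() parses the text the same way the interpreter parses a file.
    compile(source, "<demo>", "exec")
    error_message = None
except SyntaxError as exc:
    # The parser rejects "\U" because eight hex digits do not follow it.
    error_message = exc.msg

print(error_message)
```

The printed message should match the error quoted above, since the parser fails on the same \U sequence.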
Solution 1 – the hideous variant
Simply avoid the Windows separator and instead write the path using Linux separators only:
path_dir: str = "C:/Users/sselt/Documents/blog_demo"
The interpreter then reads the path correctly, as if it were on a Linux system to start with, and Windows accepts the forward slashes as well.
Solution 2 – the even more hideous variant
Use escape sequences.
path_dir: str = "C:\\Users\sselt\Documents\\blog_demo"
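What bothers me, besides the poor legibility, is the inconsistency: the backslash is doubled only where it would otherwise start a valid escape sequence, here before the “U” and the “b”. The sequences \s and \D are not recognized escapes, so Python leaves them untouched, although Python 3.12 and later emit a SyntaxWarning for such invalid escapes.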
Solution 3 – the elegant one
Use raw strings with “r” as a prefix to indicate that special characters should not be evaluated.
path_dir: str = r"C:\Users\sselt\Documents\blog_demo"
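As a quick check, the three spellings can be compared directly. This is a small sketch, not part of the original example; the variable names `forward`, `escaped` and `raw` are chosen here for illustration, and the second spelling doubles every backslash rather than only the two that strictly need it.

```python
forward = "C:/Users/sselt/Documents/blog_demo"        # Solution 1
escaped = "C:\\Users\\sselt\\Documents\\blog_demo"    # Solution 2, every backslash doubled
raw = r"C:\Users\sselt\Documents\blog_demo"           # Solution 3

# Escaped and raw literals produce the very same string object content.
print(escaped == raw)                        # True

# The forward-slash variant differs only in the separator character.
print(forward.replace("/", "\\") == raw)     # True

# In a raw string the backslash is kept as a character of its own.
print(len(r"\n"), len("\n"))                 # 2 1
```

One limitation to keep in mind: a raw string literal cannot end in a single backslash, so a trailing separator still needs one of the other spellings.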
Step 2: Scanning the files
Back to our task of listing all elements in a folder. We already know the path.
The simple command os.listdir returns the names of all entries as strings, i.e., only the filenames without the directory path. Here and in all other examples, I use type hinting for additional code documentation. Type hints arrived with Python 3.5; the variable annotation syntax used here requires Python 3.6 or later.
import os
from typing import List
path_dir: str = r"C:\Users\sselt\Documents\blog_demo"
content_dir: List[str] = os.listdir(path_dir)
The list of names is fine, but I’m more interested in file statistics, for which we have os.stat.
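To see what os.listdir returns without depending on the Windows path above, the call can be tried on a throwaway directory. This is a minimal sketch under that assumption; the temporary directory and the names `tmp_dir`, `file_path` and `folder_path` are illustrative and not part of the original example.

```python
import os
import tempfile
from typing import List

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create one file and one subfolder inside the temporary directory.
    file_handle, file_path = tempfile.mkstemp(dir=tmp_dir, suffix=".txt")
    os.close(file_handle)
    folder_path = tempfile.mkdtemp(dir=tmp_dir)

    names: List[str] = os.listdir(tmp_dir)

# The entries are bare names: no directory part, and no hint
# whether a name belongs to a file or to a folder.
print(names)
print(any(os.sep in name for name in names))  # False
```

The generated names are random, so the first line of output differs from run to run.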
Step 3: Concatenating paths
To pass a file to os.stat, we must first combine the directory path and the filename. I have often seen the following constructs in the wild, and even used them myself when starting out. For example:
path_file: str = path_dir + "/" + filename  # A
path_file: str = path_dir + "\\" + filename  # B
path_file: str = "{}/{}".format(path_dir, filename)  # C
path_file: str = f"{path_dir}/{filename}"  # D
A and B are hideous because they concatenate strings with the “+” operator, which is unnecessary in Python.
B is especially hideous because the Windows separator has to be doubled; a single backslash would be evaluated as an escape sequence for the closing quotation mark.
C and D are somewhat better, since they use string formatting, but they still do not resolve the system-dependence problem. If I use the result under Windows, I get a functional but inconsistent path with a mixture of separators.
filename = "some_file"
print("{}/{}".format(path_dir, filename))
...: C:\Users\sselt\Documents\blog_demo/some_file
A solution independent of the OS
Python’s answer is os.sep, which is also available as os.path.sep. Both hold the path separator of the system the code runs on. They are identical, but the second, more explicit spelling makes it obvious which separator is meant.
This means one can write:
path_file = "{}{}{}".format(path_dir, os.sep, filename)
The result is better, but the code becomes complicated as soon as several path segments have to be combined.
Therefore, the convention is to join the path elements with the separator. This is even shorter and more generic:
path_file = os.sep.join([path_dir, filename])
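The join approach scales to any number of segments. The sketch below illustrates this and compares it with os.path.join, a standard library function that the text above does not use and that is shown here only as a point of comparison. The segment list is invented for the example.

```python
import os

# Illustrative path segments; the separator is filled in per platform.
segments = ["Users", "sselt", "Documents", "blog_demo", "some_file"]

joined_with_sep = os.sep.join(segments)
joined_with_os_path = os.path.join(*segments)

print(joined_with_sep)
print(joined_with_sep == joined_with_os_path)  # True

# The two differ when a segment already ends with a separator:
# os.sep.join doubles it, os.path.join does not.
doubled = os.sep.join(["folder" + os.sep, "file"])
single = os.path.join("folder" + os.sep, "file")
print(doubled)
print(single)
```

For clean segments the two calls agree, so the choice mostly matters when the input may already carry trailing separators.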
The first full run
Let’s walk through the directory:
for filename in os.listdir(path_dir):
    path_file = os.sep.join([path_dir, filename])
    print(os.stat(path_file))
Among the results (not shown) are st_atime, the time of the last access, st_mtime, the time of the last modification, and st_ctime, which is the creation time on Windows (on Unix systems it is the time of the last metadata change). In addition, st_size gives the file size in bytes. At the moment, all I want to know is the size and the last modification date, so I store them in a simple list of tuples.
import os
from typing import List, Tuple
filesurvey: List[Tuple] = []
content_dir: List[str] = os.listdir(path_dir)
for filename in content_dir:
    path_file = os.sep.join([path_dir, filename])
    stats = os.stat(path_file)
    filesurvey.append((path_dir, filename, stats.st_mtime, stats.st_size))
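The time stamps in the os.stat result are plain numbers, namely seconds since the epoch, so they usually need converting before a human reads them. The following sketch shows one way to do that with datetime.fromtimestamp; the temporary file and the names `size_in_bytes` and `modified` are assumptions made for the example.

```python
import os
import tempfile
from datetime import datetime

with tempfile.TemporaryDirectory() as tmp_dir:
    path_file = os.sep.join([tmp_dir, "example.txt"])
    with open(path_file, "w", encoding="utf-8") as handle:
        handle.write("hello")  # five ASCII characters, five bytes

    stats = os.stat(path_file)

size_in_bytes = stats.st_size
# Convert the epoch seconds into a local-time datetime object.
modified = datetime.fromtimestamp(stats.st_mtime)

print(size_in_bytes)  # 5
print(modified.isoformat(timespec="seconds"))
```

Keeping the raw float in the list, as the code above does, and converting only for display keeps sorting and arithmetic simple.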
The final function with recursion
The result looks satisfactory at first, but two problems remain. os.listdir does not differentiate between files and folders, and it covers only the top folder level without descending into subfolders. Hence, we need a recursive function that differentiates between files and folders. os.path.isdir tells us whether a given path is a folder.
def collect_fileinfos(path_directory: str, filesurvey: List[Tuple]):
    content_dir: List[str] = os.listdir(path_directory)
    for filename in content_dir:
        path_file = os.sep.join([path_directory, filename])
        if os.path.isdir(path_file):
            # Folder: descend and collect its files as well.
            collect_fileinfos(path_file, filesurvey)
        else:
            stats = os.stat(path_file)
            filesurvey.append((path_directory, filename, stats.st_mtime, stats.st_size))

filesurvey: List[Tuple] = []
collect_fileinfos(path_dir, filesurvey)
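To try the recursion without touching a real user directory, the function can be run against a small temporary tree. The sketch below repeats the function so that it is self-contained and cross-checks the filenames against os.walk, a standard library function that is not part of the approach above and serves only as a comparison. The folder and file names and the `FileInfo` alias are invented for the example.

```python
import os
import tempfile
from typing import List, Tuple

FileInfo = Tuple[str, str, float, int]  # illustrative alias for one result row


def collect_fileinfos(path_directory: str, filesurvey: List[FileInfo]) -> None:
    for filename in os.listdir(path_directory):
        path_file = os.sep.join([path_directory, filename])
        if os.path.isdir(path_file):
            collect_fileinfos(path_file, filesurvey)
        else:
            stats = os.stat(path_file)
            filesurvey.append((path_directory, filename, stats.st_mtime, stats.st_size))


with tempfile.TemporaryDirectory() as root:
    # Build root/top.txt and root/sub/deeper/deep.txt.
    nested = os.sep.join([root, "sub", "deeper"])
    os.makedirs(nested)
    for folder, name in [(root, "top.txt"), (nested, "deep.txt")]:
        with open(os.sep.join([folder, name]), "w", encoding="utf-8") as handle:
            handle.write(name)  # file content is simply its own name

    filesurvey: List[FileInfo] = []
    collect_fileinfos(root, filesurvey)

    # Independent listing of all filenames for comparison.
    walked = sorted(name for _, _, filenames in os.walk(root) for name in filenames)

found = sorted(entry[1] for entry in filesurvey)
print(found)            # ['deep.txt', 'top.txt']
print(found == walked)  # True
```

One caveat: os.path.isdir follows symbolic links, so a link that points back to a parent folder would make this recursion loop until Python’s recursion limit is hit.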
Making the results useful as a data frame
Done! We have resolved the problem in fewer than 10 lines. Since I planned filesurvey as a list of tuples, I can easily load the result into a pandas DataFrame and analyze it there, for example to compute the total size stored per folder.
import pandas as pd
df: pd.DataFrame = pd.DataFrame(filesurvey, columns=('path_directory', 'filename', 'st_mtime', 'st_size'))
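A per-folder total could then look roughly like the sketch below. The three rows are invented sample data standing in for a real filesurvey, and the column `modified` as well as the variable `size_per_folder` are names made up for this illustration.

```python
import pandas as pd

# Invented sample rows in the same shape as filesurvey:
# (path_directory, filename, st_mtime, st_size)
filesurvey = [
    ("C:/demo", "a.txt", 1_600_000_000.0, 100),
    ("C:/demo", "b.txt", 1_600_000_100.0, 250),
    ("C:/demo/sub", "c.txt", 1_600_000_200.0, 50),
]

df = pd.DataFrame(
    filesurvey, columns=["path_directory", "filename", "st_mtime", "st_size"]
)

# Turn the epoch seconds into proper timestamps for easier filtering.
df["modified"] = pd.to_datetime(df["st_mtime"], unit="s")

# Sum the file sizes (in bytes) per folder.
size_per_folder = df.groupby("path_directory")["st_size"].sum()
print(size_per_folder)
```

Note that each folder’s total covers only the files directly inside it; sizes from subfolders are not rolled up into the parent.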
...but unfortunately, it’s not the very best practice
I know, the blog promised to solve the problem using best practices.
A few years ago, this approach would have counted as good practice, but Python keeps evolving, and even such simple use cases can be improved.
In the next part, I’m going to address this use case again and solve it elegantly.
Read the second part of the blog post here.