The same problem: listing folders and drives
In the last blog post, we used a recursive function to build a solution in fewer than 10 lines that scans folders and lets us evaluate files by modification date and size.
Now I’m going to raise the bar somewhat for this example by showing even better alternatives.
Concatenate the path with pathlib
Old wine in new bottles?
In the earlier example, the paths were concatenated like this:
path_file = os.sep.join([path_dir, filename])
The advantage of this is that the solution is independent of the operating system, and one does not have to combine strings with a “+” sign or string formatting.
Yet this is error-prone: one could inadvertently define the directory path with a trailing path separator.
path_dir: str = r"C:/Users/sselt/Documents/blog_demo/" # trailing separator
filename: str = "some_file"
path_file = os.sep.join([path_dir, filename])
# C:/Users/sselt/Documents/blog_demo/\some_file
Although this code runs, the stray separator can lead to an error when the path is actually used. Such errors tend to occur when users maintain the path in config files, far from the code, without paying attention to the convention.
Since Python 3.4, a better solution has been available: the pathlib module. It covers the file and folder functions of Python’s os module with an object-oriented approach.
To repeat, here’s the old variant:
import os
path = "C:/Users/sselt/Documents/blog_demo/"
os.path.isdir(path)
os.path.isfile(path)
os.path.getsize(path)
And here is the new alternative:
from pathlib import Path
path: Path = Path("C:/Users/sselt/Documents/blog_demo/")
path.is_dir()
path.is_file()
path.stat().st_size
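To check that the two variants really agree, here is a minimal sketch that runs both against a temporary folder. The temporary folder is an assumption made only so the snippet runs on any machine; it stands in for the blog_demo path above.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # os.path variant: the path is a plain string
    old_results = (os.path.isdir(tmp), os.path.isfile(tmp), os.path.getsize(tmp))

    # pathlib variant: the path is an object with methods
    path = Path(tmp)
    new_results = (path.is_dir(), path.is_file(), path.stat().st_size)

print(old_results == new_results)  # True
```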
Both deliver exactly the same result. So, why is the second one much better?
Object oriented and more error tolerant
The calls are object-oriented. That may or may not be your preference, but I like it a lot more. We have an object here, the path definition, which has attributes and methods.
The more exciting part, however, is the use of operator overloading shown here:
filename: Path = Path("some_file.txt")
path: Path = Path("C:/Users/sselt/Documents/blog_demo")
print( path / filename )
# C:\Users\sselt\Documents\blog_demo\some_file.txt
At first glance, dividing one path by another looks like invalid code. However, the path object simply overloads the division operator so that it concatenates paths.
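To make the mechanism less mysterious, here is a toy sketch of how a class can overload the "/" operator through the __truediv__ method. SimplePath is a hypothetical class written for this illustration; it is not how pathlib is actually implemented.

```python
class SimplePath:
    """Toy stand-in for a path object; not how pathlib is implemented."""

    def __init__(self, *parts: str) -> None:
        # Drop empty fragments so trailing or doubled separators vanish
        self.parts = [p for part in parts for p in part.split("/") if p]

    def __truediv__(self, other: "SimplePath | str") -> "SimplePath":
        # Called for `self / other`: return a new, longer path
        other_parts = other.parts if isinstance(other, SimplePath) else [other]
        return SimplePath(*self.parts, *other_parts)

    def __str__(self) -> str:
        return "/".join(self.parts)


joined = SimplePath("Documents/blog_demo//") / "some_file.txt"
print(joined)  # Documents/blog_demo/some_file.txt
```

Python translates the expression a / b into a.__truediv__(b), so any class can decide what "division" means for its objects.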
In addition to this syntactic sugar, the path objects will intercept other typical errors:
filename: Path = Path("some_file.txt")
# here the path has a superfluous trailing separator
path: Path = Path("C:/Users/sselt/Documents/blog_demo/")
# here the path has a doubled separator
path: Path = Path("C:/Users/sselt/Documents/blog_demo//")
# here the path is a wild mix of separators
path: Path = Path("C:\\Users/sselt\\Documents/blog_demo")
# all variants lead to the same result
print(path/filename)
# C:\Users\sselt\Documents\blog_demo\some_file.txt
This variant is not only nicer, but also more robust against malformed input. On top of that, the code is independent of the operating system. You define only a generic Path object, which becomes a WindowsPath on a Windows system and a PosixPath on a Linux system.
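If you want to try this without a Windows machine, pathlib also offers the "pure" classes PureWindowsPath and PurePosixPath, which apply one system's path rules on any platform without touching the file system. The following is a small sketch under that assumption; the /home/sselt/blog_demo folder is a made-up Linux counterpart of the example path.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

filename = "some_file.txt"
variants = [
    "C:/Users/sselt/Documents/blog_demo/",   # trailing separator
    "C:/Users/sselt/Documents/blog_demo//",  # doubled separator
    "C:\\Users/sselt\\Documents/blog_demo",  # mixed separators
]

# Windows rules, applied on any operating system
windows_results = {str(PureWindowsPath(v) / filename) for v in variants}
for result in windows_results:
    print(result)  # C:\Users\sselt\Documents\blog_demo\some_file.txt (one line only)

# POSIX rules; the folder is a hypothetical Linux counterpart
posix_result = str(PurePosixPath("/home/sselt/blog_demo/") / filename)
print(posix_result)  # /home/sselt/blog_demo/some_file.txt

# The generic Path picks the concrete class for the current system
concrete_class = type(Path(".")).__name__
print(concrete_class)  # WindowsPath or PosixPath, depending on the OS
```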
Most functions that typically expect a path as a string can work directly with a Path object. In the rare cases where they cannot, you can convert the object with str(path).
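As a brief sketch of both cases: os.path.basename accepts a path object directly (since Python 3.6), while str() and os.fspath() give you a plain string for APIs that insist on one. The file path used here is hypothetical.

```python
import os
from pathlib import PurePosixPath

path = PurePosixPath("/tmp/blog_demo/some_file.txt")  # hypothetical path

# Many standard-library functions accept the path object directly
name = os.path.basename(path)
print(name)  # some_file.txt

# Explicit conversions for APIs that insist on a plain string
as_text = str(path)
as_fs = os.fspath(path)
print(as_text)  # /tmp/blog_demo/some_file.txt
```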
Processing the path with os.walk
In my last blog’s solution, I used os.listdir, os.path.isdir and a recursive function to iterate through the path tree and differentiate between folders and files.
But os.walk offers a better solution. This function does not build a list; it returns a generator that you consume one folder at a time. Each result is a tuple containing the folder path, a list of its subfolders, and a list of the files in that folder. The recursion happens automatically, so a single call gives you every file in the tree.
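To see what os.walk actually yields, here is a small sketch that builds a throwaway folder tree and walks it. The folder and file names (sub, top.txt, nested.txt) are invented for the demonstration.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "top.txt").write_text("a")
    (root / "sub" / "nested.txt").write_text("bb")

    walked = {}
    # Each step yields (folder path, subfolder names, file names)
    for dirpath, dirnames, filenames in os.walk(root):
        rel = Path(dirpath).relative_to(root).as_posix()
        walked[rel] = sorted(filenames)

print(walked)  # {'.': ['top.txt'], 'sub': ['nested.txt']}
```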
The better solution with os.walk and Pathlib
If you combine the two aforementioned techniques, you get a solution that is simpler, fully independent of the OS, more robust against inconsistent path formats, and free of explicit recursion:
import os
from pathlib import Path

filesurvey = []
for row in os.walk(path):  # each row holds the contents of one folder
    for filename in row[2]:  # row[2] is a list of file names
        full_path: Path = Path(row[0]) / filename  # row[0] is the folder path
        stats = full_path.stat()  # query the file system only once per file
        filesurvey.append([row[0], filename, stats.st_mtime, stats.st_size])
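For a version you can run as-is, here is one possible way to wrap the loop in a function and try it on a temporary folder. The function name survey_files and the demo files are assumptions made for this sketch; it also unpacks each os.walk result into named variables instead of indexing into row, and records the folder of each file rather than the root path.

```python
import os
import tempfile
from pathlib import Path


def survey_files(root):
    """Return [folder, filename, mtime, size] for every file below root."""
    survey = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            stats = (Path(dirpath) / filename).stat()
            survey.append([dirpath, filename, stats.st_mtime, stats.st_size])
    return survey


with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    (demo / "sub").mkdir()
    (demo / "a.txt").write_text("abc")
    (demo / "sub" / "b.txt").write_text("hello")
    rows = survey_files(demo)

sizes = {row[1]: row[3] for row in rows}
print(sizes)  # {'a.txt': 3, 'b.txt': 5}
```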
If you can top this with a best practice, don’t hesitate to get in touch. I’d love your feedback!
Read here the first part of the blog post.