pyspark.SparkContext.addFile

SparkContext.addFile(path, recursive=False)
Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.

To access the file in Spark jobs, use SparkFiles.get() with the filename to find its download location.

A directory can be given if the recursive option is set to True. Currently directories are only supported for Hadoop-supported filesystems.

New in version 0.7.0.

Parameters
path : str
    Can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use SparkFiles.get() to find its download location.
recursive : bool, default False
    Whether to recursively add files in the input directory.
 
Notes

A path can be added only once. Subsequent additions of the same path are ignored.

Examples

>>> import os
>>> import tempfile
>>> from pyspark import SparkFiles

>>> with tempfile.TemporaryDirectory(prefix="addFile") as d:
...     path1 = os.path.join(d, "test1.txt")
...     with open(path1, "w") as f:
...         _ = f.write("100")
...
...     path2 = os.path.join(d, "test2.txt")
...     with open(path2, "w") as f:
...         _ = f.write("200")
...
...     sc.addFile(path1)
...     file_list1 = sorted(sc.listFiles)
...
...     sc.addFile(path2)
...     file_list2 = sorted(sc.listFiles)
...
...     # add path2 twice, this addition will be ignored
...     sc.addFile(path2)
...     file_list3 = sorted(sc.listFiles)
...
...     def func(iterator):
...         with open(SparkFiles.get("test1.txt")) as f:
...             mul = int(f.readline())
...         return [x * mul for x in iterator]
...
...     collected = sc.parallelize([1, 2, 3, 4]).mapPartitions(func).collect()

>>> file_list1
['file:/.../test1.txt']
>>> file_list2
['file:/.../test1.txt', 'file:/.../test2.txt']
>>> file_list3
['file:/.../test1.txt', 'file:/.../test2.txt']
>>> collected
[100, 200, 300, 400]
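The example above adds individual files. The following is only an illustrative sketch of the recursive option, not part of the reference example: it reuses the imports and the active SparkContext sc from above, and the names "data" and "weights.txt" are made up. A local directory is reachable through the Hadoop-supported file: scheme, and a directory added with recursive=True is expected to be distributed under its base name, so files inside it are resolved relative to SparkFiles.get("data").

>>> with tempfile.TemporaryDirectory(prefix="addFileDir") as d:
...     sub = os.path.join(d, "data")
...     os.mkdir(sub)
...     with open(os.path.join(sub, "weights.txt"), "w") as f:
...         _ = f.write("3")
...
...     # Distribute the whole directory to every node.
...     sc.addFile(sub, recursive=True)
...
...     def scale(iterator):
...         # Files inside the added directory live under its base name
...         # in the SparkFiles root directory (assumed layout).
...         with open(os.path.join(SparkFiles.get("data"), "weights.txt")) as f:
...             k = int(f.readline())
...         return [x * k for x in iterator]
...
...     scaled = sc.parallelize([1, 2, 3]).mapPartitions(scale).collect()
>>> scaled
[3, 6, 9]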