pyspark.RDD.takeSample
- RDD.takeSample(withReplacement, num, seed=None)
- Return a fixed-size sampled subset of this RDD.
- New in version 1.3.0.
- Parameters
- withReplacement : bool
- whether sampling is done with replacement
- num : int
- size of the returned sample
- seed : int, optional
- random seed
 
- Returns
- list
- a fixed-size sampled subset of this RDD in an array
 
- Notes
- This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver’s memory.
- Examples
>>> import sys
>>> rdd = sc.parallelize(range(0, 10))
>>> len(rdd.takeSample(True, 20, 1))
20
>>> len(rdd.takeSample(False, 5, 2))
5
>>> len(rdd.takeSample(False, 15, 3))
10
>>> sc.range(0, 10).takeSample(False, sys.maxsize)
Traceback (most recent call last):
    ...
ValueError: Sample size cannot be greater than ...