faninsar.datasets.RasterDataset#
- class faninsar.datasets.RasterDataset(root_dir: str = 'data', paths: Sequence[str] | None = None, crs: CRS | None = None, res: float | tuple[float, float] | None = None, dtype: np.dtype | None = None, nodata: float | None = None, roi: BoundingBox | None = None, bands: Sequence[str] | None = None, cache: bool = True, resampling: Resampling = Resampling.nearest, fill_nodata: bool = False, verbose: bool = True, ds_name: str = '')[source]#
Bases: GeoDataset
A base class for raster datasets.
Examples
>>> from pathlib import Path
>>> from faninsar.datasets import RasterDataset
>>> from faninsar.query import BoundingBox, GeoQuery, Points
>>> home_dir = Path("./work/data")
>>> files = list(home_dir.rglob("*unw_phase.tif"))
Initialize a RasterDataset and a GeoQuery object:
>>> ds = RasterDataset(paths=files)
>>> points = Points(
...     [(490357, 4283413), (491048, 4283411), (490317, 4284829)]
... )
>>> query = GeoQuery(points=points, boxes=[ds.bounds, ds.bounds])
Use the GeoQuery object to index the RasterDataset:
>>> sample = ds[query]
Print the shapes of the samples:
>>> print("boxes result shape:", sample.boxes.data.shape)
boxes result shape: (2, 7, 68, 80)
>>> print("points result shape:", sample.points.data.shape)
points result shape: (7, 3)
You can also use a BoundingBox or Points object directly to index the RasterDataset. Both types are automatically converted to a GeoQuery object.
>>> sample = ds[points]
>>> sample
{'query': GeoQuery(
    boxes=None
    points=Points(count=3)
), 'boxes': None, 'points': array([...], dtype=float32)}
>>> sample = ds[ds.bounds]
>>> sample
{'query': GeoQuery(
    boxes=[1 BoundingBox]
    points=None
), 'boxes': array([...], dtype=float32), 'points': None}
- __init__(root_dir: str = 'data', paths: Sequence[str] | None = None, crs: CRS | None = None, res: float | tuple[float, float] | None = None, dtype: np.dtype | None = None, nodata: float | None = None, roi: BoundingBox | None = None, bands: Sequence[str] | None = None, cache: bool = True, resampling: Resampling = Resampling.nearest, fill_nodata: bool = False, verbose: bool = True, ds_name: str = '') None[source]#
Initialize a new raster dataset instance.
- Parameters:
root_dir (str or Path) – root directory where the dataset can be found.
paths (list of str, optional) – list of file paths to use instead of searching for files in root_dir. If None, files will be searched for in root_dir.
crs (CRS, optional) – the output coordinate reference system (CRS) of the dataset. If None, the CRS of the first file found will be used.
res (float, optional) – resolution of the output dataset in units of CRS. If None, the resolution of the first file found will be used.
dtype (numpy.dtype, optional) – data type of the output dataset. If None, the data type of the first file found will be used.
nodata (float or int, optional) – no data value of the dataset. If None, the no data value of the first file found will be used. This parameter is useful when the no data value is not stored in the file.
roi (BoundingBox, optional) – region of interest to load from the dataset. If None, the union of all files bounds in the dataset will be used.
bands (list of str, optional) – names of bands to return (defaults to all bands)
cache (bool, optional) – if True, cache file handle to speed up repeated sampling
resampling (Resampling, optional) – Resampling algorithm used when reading input files. Default: Resampling.nearest.
fill_nodata (bool, optional) – whether to fill holes in the queried data by interpolating them with the inverse distance weighting method provided by rasterio.fill.fillnodata(). Default: False.
Note
This parameter is only used when sampling data with bounding box or polygon queries, and has no effect on point queries.
verbose (bool, optional) – if True, print verbose output, default: True
ds_name (str, optional) – name of the dataset, used when printing verbose output. Default: “”
- Raises:
FileNotFoundError – if no files are found in root_dir.
Methods
__init__([root_dir, paths, crs, res, dtype, ...]) – Initialize a new raster dataset instance.
array2kml(arr, out_file[, bounds, ...]) – Write a numpy array into a kml file.
array2kmz(arr, out_file[, bounds, ...]) – Write a numpy array into a kmz file.
array2tiff(arr, filename[, bounds, bbox, ...]) – Save a numpy array to a tiff file using the geoinformation of dataset.
get_profile([bbox]) – Get profile information of dataset for the given bounding box type.
load_mask(mask_path[, bbox]) – Load a mask from a tiff mask file (.msk).
parse_mask(percent[, bbox, seed]) – Parse the mask of the dataset.
reproject(new_crs[, resampling, nodata]) – Reproject the dataset to a new CRS.
resample(new_res[, resampling, nodata]) – Resample the dataset to a new resolution.
row_col(xy[, crs, bbox]) – Convert x, y coordinates to row, col in the dataset.
show(arr, **kwargs) – Show the array using the dataset's geo information.
to_netcdf(filename[, roi]) – Save the dataset to a netCDF file for given region of interest.
to_tiffs(out_dir[, roi]) – Save the dataset to a directory of tiff files for given region of interest.
xy(row_col[, crs, bbox]) – Convert row, col in the dataset to x, y coordinates.
Attributes
Names of all available bands in the dataset
Bounds of the overall dataset.
Color map for the dataset, used for plotting
Number of valid files in the dataset.
Coordinate reference system (CRS) of the dataset.
Date format string used to parse date from filename.
Data type of the dataset.
When separate_files is True, the following additional groups are searched for to find other files:
Return a list of all files in the dataset.
No data value of the dataset.
Glob expression used to search for files.
Return the resolution of the dataset.
Names of RGB bands in the dataset, used for plotting
Return the region of interest of the dataset.
Whether all files in the dataset have the same CRS with the desired CRS.
Shape of the dataset.
Return a boolean array indicating which files are valid.
- array2kml(arr: ndarray, out_file: str | Path, bounds: BoundingBox | None = None, img_kwargs: dict | None = None, cbar_kwargs: dict | None = None, verbose: bool = True) None[source]#
Write a numpy array into a kml file.
- Parameters:
arr (numpy.ndarray) – the numpy array to be written into kml file.
out_file (str or Path) – the path of the kml file.
bounds (BoundingBox, optional) – the bounds of the arr. Default is None, which means the roi of the dataset will be used.
img_kwargs (dict) – the keyword arguments for the matplotlib.pyplot.imshow() function.
cbar_kwargs (dict) – the keyword arguments for the save_colorbar() function, except for the out_file and mappable arguments.
verbose (bool) – whether to print the information of the kml file. Default is True.
- array2kmz(arr: ndarray, out_file: str | Path, bounds: BoundingBox | None = None, img_kwargs: dict | None = None, cbar_kwargs: dict | None = None, keep_kml: bool = False, verbose: bool = True) None[source]#
Write a numpy array into a kmz file.
- Parameters:
arr (numpy.ndarray) – the numpy array to be written into kmz file.
out_file (str or Path) – the path of the kmz file.
bounds (BoundingBox, optional) – the bounds of the arr. Default is None, which means the roi of the dataset will be used.
img_kwargs (dict) – the keyword arguments for the matplotlib.pyplot.imshow() function.
cbar_kwargs (dict) – the keyword arguments for the save_colorbar() function, except for the out_file and mappable arguments.
keep_kml (bool) – whether to keep the kml file. Default is False.
verbose (bool) – whether to print the information of the kmz file. Default is True.
- array2tiff(arr: np.ndarray, filename: str | Path, bounds: BoundingBox | None = None, bbox: BoundingBox | None = None, band_names: Sequence[str] | None = None, arr_type: Literal['data', 'mask'] = 'data', nodata: float | None = None, overwrite: bool = False) None[source]#
Save a numpy array to a tiff file using the geoinformation of dataset.
- Parameters:
arr (numpy.ndarray) – numpy array to save. arr can be a 2D array or a 3D array. If arr is a 3D array, the first dimension should be the band dimension.
filename (str or Path) – path to the tiff file to save
bounds (BoundingBox, optional) – the bounds of the arr. Default is None, which means the roi of the dataset will be used.
bbox (BoundingBox, optional) – if specified, the input array will be saved to the given part/bbox of dataset. Default is None, which means the array will be saved to the entire dataset.
band_names (Sequence of str, optional) – names of bands to save. Default is None, which will use the band indexes.
arr_type (str, one of ['data', 'mask'], optional) – type of the array to save. Default is ‘data’.
nodata (float or int, optional) – no data value of the dataset. If None, a proper no data value will be parsed automatically for the array.
overwrite (bool, optional) – if True, overwrite the existing file. Default is False, which means the array will be saved in append mode (r+ mode).
- get_profile(bbox: BoundingBox | Literal['roi', 'bounds'] = 'roi') Profile | None[source]#
Get profile information of dataset for the given bounding box type.
- load_mask(mask_path: str | Path, bbox: BoundingBox | Literal['roi', 'bounds'] = 'roi') ndarray[source]#
Load a mask from a tiff mask file (.msk).
- parse_mask(percent: float, bbox: BoundingBox | Literal['roi', 'bounds'] = 'roi', seed: int = 0) ndarray[source]#
Parse the mask of the dataset.
The mask is a boolean array where True indicates valid data and False indicates invalid data, which keeps in line with the GDAL/rasterio strategy.
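The convention above can be illustrated with a minimal sketch. This is not the library's implementation: the helper name `valid_mask` and the sample values are hypothetical, and plain nested lists stand in for a raster array.

```python
import math


def valid_mask(rows, nodata):
    """Return a boolean mask where True marks valid pixels and False marks nodata."""

    def is_valid(value):
        # NaN never compares equal to itself, so it needs an explicit check
        if isinstance(nodata, float) and math.isnan(nodata):
            return not math.isnan(value)
        return value != nodata

    return [[is_valid(v) for v in row] for row in rows]


data = [[1.5, -9999.0], [-9999.0, 0.0]]
mask = valid_mask(data, nodata=-9999.0)
print(mask)  # [[True, False], [False, True]]
```

Note that a value of 0.0 is still valid here; only pixels equal to the no data value are masked out.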
- reproject(new_crs: CRS | str, resampling: Resampling = Resampling.nearest, nodata: float | None = None) Self[source]#
Reproject the dataset to a new CRS.
- Parameters:
new_crs (CRS or str) – new coordinate reference system (CRS) of the dataset. It can be a CRS object or a string, which will be parsed to a CRS object. The string can be in any format supported by pyproj.crs.CRS.from_user_input().
resampling (Resampling, optional) – resampling method to use when reprojecting the dataset. Default is Resampling.nearest.
nodata (float or int, optional) – no data value of the dataset. If None, the no data value of the dataset will be used.
- resample(new_res: float | tuple[float, float], resampling: Resampling = Resampling.nearest, nodata: float | None = None) Self[source]#
Resample the dataset to a new resolution.
- Parameters:
new_res (float or tuple of float) – new resolution of the dataset in units of CRS. If a single float is provided, it will be used for both x and y dimensions.
resampling (Resampling, optional) – resampling method to use when resampling the dataset. Default is Resampling.nearest.
nodata (float or int, optional) – no data value of the dataset. If None, the no data value of the dataset will be used.
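The scalar-or-tuple rule for new_res can be sketched as follows. The helper `normalize_res` is hypothetical and only illustrates the documented behaviour; it is not taken from the library.

```python
def normalize_res(new_res):
    """Expand a scalar resolution to an (x_res, y_res) tuple."""
    if isinstance(new_res, (int, float)):
        # A single number applies to both the x and y dimensions
        return (float(new_res), float(new_res))
    x_res, y_res = new_res
    return (float(x_res), float(y_res))


print(normalize_res(30))          # (30.0, 30.0)
print(normalize_res((30, 60.0)))  # (30.0, 60.0)
```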
- row_col(xy: Sequence, crs: CRS | str | None = None, bbox: BoundingBox | Literal['roi', 'bounds'] = 'roi') np.ndarray[source]#
Convert x, y coordinates to row, col in the dataset.
- Parameters:
xy (Sequence) – Pairs of x, y coordinates (floats)
crs (CRS or str, optional) – The CRS of the points. If None, the CRS of the dataset will be used. Allowed CRS formats are the same as those supported by rasterio.
bbox (str, one of ['bounds', 'roi'], optional) – the bounding box used to calculate the width, height and transform of the dataset for the profile. Default is ‘roi’.
- Returns:
row_col – row, col in the dataset for the given points(xy)
- Return type:
np.ndarray
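As a rough illustration of the conversion, the sketch below maps x, y pairs to row, col indices on a north-up grid. It assumes the grid origin is the top-left corner of the chosen bounding box; the function name and the `left`, `top` and `res` values are hypothetical, and the actual method works from the dataset's transform.

```python
import math


def xy_to_row_col(xy, left, top, res):
    """Map (x, y) pairs to (row, col) indices on a north-up grid."""
    x_res, y_res = res
    out = []
    for x, y in xy:
        col = math.floor((x - left) / x_res)
        # Rows grow downward from the top edge of the grid
        row = math.floor((top - y) / y_res)
        out.append((row, col))
    return out


points = [(490357, 4283413), (491048, 4283411)]
result = xy_to_row_col(points, left=490000, top=4285000, res=(30, 30))
print(result)  # [(52, 11), (52, 34)]
```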
- show(arr: ndarray, **kwargs) Self[source]#
Show the array using the dataset’s geo information.
- Parameters:
arr (np.ndarray) – The array with same shape as the dataset to show. The geo information of the dataset will be used to plot the array.
kwargs (key value pairs, optional) – Additional keyword arguments to pass to the rasterio.plot.show() function.
- to_netcdf(filename: str | Path, roi: BoundingBox | None = None) None[source]#
Save the dataset to a netCDF file for given region of interest.
- Parameters:
filename (str) – path to the netCDF file to save
roi (BoundingBox, optional) – region of interest to save. If None, the roi of the dataset will be used.
- to_tiffs(out_dir: str | Path, roi: BoundingBox | None = None) None[source]#
Save the dataset to a directory of tiff files for given region of interest.
- Parameters:
out_dir (str or Path) – path to the directory to save the tiff files
roi (BoundingBox, optional) – region of interest to save. If None, the roi of the dataset will be used.
- xy(row_col: Sequence, crs: CRS | str | None = None, bbox: BoundingBox | Literal['roi', 'bounds'] = 'roi') np.ndarray[source]#
Convert row, col in the dataset to x, y coordinates.
- Parameters:
row_col (Sequence) – Pairs of row, col in the dataset (floats)
crs (CRS or str, optional) – The CRS of output points. If None, the CRS of the dataset will be used. Can be any of the formats supported by pyproj.CRS.from_user_input().
bbox (str, one of ['bounds', 'roi'], optional) – the bounding box used to calculate the width, height and transform of the dataset for the profile. Default is ‘roi’.
- Returns:
xy – x, y coordinates in the given CRS (default is the CRS of the dataset)
- Return type:
np.ndarray
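The reverse mapping can be sketched in the same spirit. This example assumes a north-up grid and returns pixel-center coordinates; whether the library returns centers or corners is not stated here, so treat that choice, the function name and the sample values as assumptions.

```python
def row_col_to_xy(row_col, left, top, res):
    """Map (row, col) indices to the x, y coordinates of pixel centers."""
    x_res, y_res = res
    return [
        # Offset by half a pixel to land on the center of each cell
        (left + (col + 0.5) * x_res, top - (row + 0.5) * y_res)
        for row, col in row_col
    ]


coords = row_col_to_xy([(0, 0), (52, 11)], left=490000, top=4285000, res=(30, 30))
print(coords)  # [(490015.0, 4284985.0), (490345.0, 4283425.0)]
```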
- property bounds: BoundingBox#
Bounds of the overall dataset.
It is the union of all the files in the dataset.
- Returns:
bounds – (left, right, bottom, top) of the dataset
- Return type:
BoundingBox object
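The union of per-file bounds can be illustrated with a small sketch. `Box` is a hypothetical stand-in for BoundingBox, and its field order is an assumption made only for this example.

```python
from typing import NamedTuple


class Box(NamedTuple):
    # Hypothetical stand-in for BoundingBox; the field order is an assumption
    left: float
    bottom: float
    right: float
    top: float


def union(boxes):
    """Return the smallest box that contains every input box."""
    return Box(
        left=min(b.left for b in boxes),
        bottom=min(b.bottom for b in boxes),
        right=max(b.right for b in boxes),
        top=max(b.top for b in boxes),
    )


file_bounds = [Box(0, 0, 10, 10), Box(5, -5, 20, 8)]
overall = union(file_bounds)
print(overall)  # Box(left=0, bottom=-5, right=20, top=10)
```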
- cmap: ClassVar[dict[int, tuple[int, int, int, int]]] = {}#
Color map for the dataset, used for plotting
- property count: int#
Number of valid files in the dataset.
Note
This is different from the length of the dataset, len(GeoDataset), which is the total number of files in the dataset, including invalid files that cannot be read by rasterio.
- Returns:
count – number of valid files in the dataset
- Return type:
int
- property crs: CRS | None#
Coordinate reference system (CRS) of the dataset.
- Return type:
The coordinate reference system (CRS).
- date_format = '%Y%m%d'#
Date format string used to parse date from filename.
Not used if filename_regex does not contain a date group.
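How a date group and date_format can work together is sketched below using only the standard library. The regular expression and the file names are hypothetical, and the use of re.match is an assumption about how the pattern is applied.

```python
import re
from datetime import datetime

filename_regex = r".*(?P<date>\d{8}).*"  # hypothetical regex with a `date` group
date_format = "%Y%m%d"


def parse_date(filename):
    """Return the date embedded in a filename, or None if there is none."""
    match = re.match(filename_regex, filename)
    if match is None:
        return None
    return datetime.strptime(match.group("date"), date_format)


print(parse_date("20200115_unw_phase.tif"))  # 2020-01-15 00:00:00
print(parse_date("unw_phase.tif"))           # None
```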
- property dtype: dtype | None#
Data type of the dataset.
- Returns:
dtype – data type of the dataset
- Return type:
numpy.dtype object or None
- filename_regex = '.*'#
When separate_files is True, the following additional groups are searched for to find other files:
band: replaced with requested band name
- property files: DataFrame#
Return a list of all files in the dataset.
- Return type:
list of all files in the dataset
- pattern = '*'#
Glob expression used to search for files.
This expression should be specific enough that it will not pick up files from other datasets. It should not include a file extension, as the dataset may be in a different file format than what it was originally downloaded as.
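The advice about file extensions can be illustrated with the standard library's fnmatch module. The pattern and file names below are hypothetical; the point is that a pattern without an extension still matches the same product after it has been converted to another format.

```python
from fnmatch import fnmatch

pattern = "*unw_phase*"  # hypothetical pattern; no file extension included

names = ["20200115_unw_phase.tif", "20200115_unw_phase.nc", "20200115_coh.tif"]
matched = [n for n in names if fnmatch(n, pattern)]
print(matched)  # ['20200115_unw_phase.tif', '20200115_unw_phase.nc']
```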
- property res: tuple[float, float]#
Return the resolution of the dataset.
- Returns:
res – resolution of the dataset in x and y directions.
- Return type:
tuple of floats
- property roi: BoundingBox | None#
Return the region of interest of the dataset.
- Returns:
roi – region of interest of the dataset. If None, the bounds of entire dataset will be used.
- Return type:
BoundingBox object
- property shape: tuple[int, int]#
Shape of the dataset.
- Returns:
shape – shape of the dataset in (height, width) format
- Return type:
tuple of ints
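The relation between a region of interest, the resolution and the (height, width) shape can be sketched as follows. The function name, the sample extent and the rounding rule are assumptions for illustration; the library may compute the shape differently.

```python
def grid_shape(left, bottom, right, top, res):
    """Return (height, width) of a grid covering the box at the given resolution."""
    x_res, y_res = res
    width = round((right - left) / x_res)
    height = round((top - bottom) / y_res)
    return (height, width)


shape = grid_shape(left=0, bottom=0, right=2400, top=2040, res=(30, 30))
print(shape)  # (68, 80)
```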