Pymicra’s auto-generated docs

pymicra.constants

Defines some useful constants

pymicra.core

Defines classes that are the basis of Pymicra

class pymicra.core.Notation(*args, **kwargs)

Bases: object

Holds the notation used by every function of Pymicra unless told otherwise.

build(from_level=0)

Builds the full notation based on the base notation.

Given the notation for means, fluctuations, etc., along with the names of variables, this method builds the notation for mean h2o concentration, virtual temperature fluctuations and so on.

Parameters:
  • self (pymicra.Notation) – notation to be built
  • from_level (int) – level from which to build. If 0, build everything from scratch and higher notations will be overwritten. If 1, skip one step in building process. Still to be implemented!
Returns:

Notation object with built notation

Return type:

pymicra.Notation

class pymicra.core.fileConfig(*args, **kwargs)

Bases: object

This class defines a specific configuration of a data file

Parameters:
  • from_file (str) – path of .cfg file (configuration file) to read from. This will ignore all other keywords.
  • variables (list of strings or dict) – If a list: the names of the variables, in column order. If a variable is part of the date, its name should be given as a datetime directive: if a column holds only the year, its name must be %Y, and so forth; if it holds the date in YYYY/MM/DD format, it should be %Y/%m/%d. For more info see https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior If a dict: the keys should be the numbers of the columns and the values should follow the rules for a list.
  • date_cols (list of ints) – indexes of the subset of variables that correspond to the variables composing the timestamp. If not provided, the program will try to guess by taking all variable names that contain a percentage sign (%).
  • date_connector (string) – generally not really necessary. It is used to join and then parse the date_cols.
  • columns_separator (string) – character that separates the columns of the file. If the file is tab-separated, this should be “whitespace”.
  • header_lines (int or list) – up to which line of the file is a header. See pandas.read_csv header option.
  • filename_format (string) – format of the file names, using the standard notation for date and time and with variable parts as “?”. E.g., if the files are 56_20150101.csv, 57_20150102.csv, etc., filename_format should be ??_%Y%m%d.csv. This is useful primarily for the quality-control feature.
  • units (dictionary) – very important: a dictionary whose keys are the columns of the file and whose values are the units in which they appear.
  • description (string) – brief description of the datalogger configuration file.
  • varNames (DEPRECATED) – use variables instead.
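Independent of Pymicra, a quick stdlib sketch of the strptime directives that the variables keyword expects for date columns:

```python
from datetime import datetime

# A column holding only the year is named "%Y"; a full date column in
# YYYY/MM/DD format is named "%Y/%m/%d".  Joined together, such names
# form a directive that parses the assembled date string:
dt = datetime.strptime("2015/01/01 00:30", "%Y/%m/%d %H:%M")
```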
get_date_cols()

Guesses which columns contain the dates by searching them for percentage signs

class pymicra.core.siteConfig(*args, **kwargs)

Bases: object

Keeps the configurations and constants of an experiment (such as instrument heights, location, canopy height, etc.)

Check help(pm.siteConfig.__init__) for other parameters

Parameters:from_file (str) – path to .site file which contains other keywords

pymicra.data

pymicra.decorators

Defines useful decorators for Pymicra

pymicra.decorators.autoassign(*names, **kwargs)

Decorator that automatically assigns keywords as attributes

Allows a method to assign (some of) its arguments as attributes of ‘self’ automatically. E.g.:

To restrict autoassignment to ‘bar’ and ‘baz’, write:

@autoassign('bar', 'baz')
def method(self, foo, bar, baz): ...

To prevent ‘foo’ and ‘baz’ from being autoassigned, use:

@autoassign(exclude=('foo', 'baz'))
def method(self, foo, bar, baz): ...
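The behaviour described above can be illustrated with a minimal sketch of such a decorator (an illustration only, not necessarily Pymicra's actual implementation):

```python
import functools
import inspect

def autoassign(*names, **kwargs):
    """Sketch: bind selected arguments of a method as attributes of self."""
    exclude = set(kwargs.get('exclude', ()))

    def decorator(func):
        sig = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(self, *args, **kw):
            bound = sig.bind(self, *args, **kw)
            bound.apply_defaults()
            for name, value in list(bound.arguments.items())[1:]:  # skip self
                if names and name not in names:
                    continue  # restricted to the listed names only
                if name in exclude:
                    continue  # explicitly excluded from autoassignment
                setattr(self, name, value)
            return func(self, *args, **kw)

        return wrapper

    return decorator

class Point:
    @autoassign('x', 'y')
    def __init__(self, x, y, label=None):
        pass

p = Point(1, 2, label='a')  # p.x and p.y are set; p.label is not
```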

pymicra.decorators.pdgeneral(convert_out=True)

Defines a decorator to make functions work on both pandas.Series and DataFrames

Parameters:convert_out (bool) – if True, also converts output back to Series if input is Series
pymicra.decorators.pdgeneral_in(func)

Defines a decorator that transforms Series into DataFrame

pymicra.decorators.pdgeneral_io(func)

If the input is a Series, transforms it to a DataFrame, then transforms the output from DataFrame back into a Series. If the input is a Series and the output is a one-element Series, transforms it to a float.

Currently the output functionality works only when the output is one variable, not an array of elements.
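The behaviour described can be sketched with a simple decorator (an illustration of the idea, not Pymicra's exact code):

```python
import functools
import pandas as pd

def pdgeneral_io(func):
    """Sketch: promote Series input to a DataFrame, demote the output back."""
    @functools.wraps(func)
    def wrapper(obj, *args, **kwargs):
        was_series = isinstance(obj, pd.Series)
        if was_series:
            obj = obj.to_frame()
        out = func(obj, *args, **kwargs)
        if was_series and isinstance(out, pd.DataFrame):
            out = out.iloc[:, 0]           # back to a Series
            if len(out) == 1:
                out = float(out.iloc[0])   # one-element Series -> float
        return out
    return wrapper

@pdgeneral_io
def double(df):
    return df * 2

result = double(pd.Series([1.0, 2.0]))   # comes back as a Series
scalar = double(pd.Series([3.0]))        # one element -> plain float
```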

pymicra.io

Defines some useful functions to aid on the input/output of data

pymicra.io.readDataFile(fname, variables=None, only_named_cols=True, **kwargs)

Reads one datafile using pandas.read_csv()

Parameters:
  • variables (list or dict) – names of each variable in the file. If a dict, the keys must be ints giving the column numbers and the values the variable names.
  • only_named_cols (bool) – if True, don’t read columns that don’t appear among variables’ keys
  • kwargs (dict) – keyword arguments for pandas’ read_csv function; see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html for more detail
Returns:

pandas.DataFrame object

Return type:

pandas.DataFrame

pymicra.io.readDataFiles(flist, verbose=0, **kwargs)

Reads data from a list of files by calling readDataFile individually for each entry

Parameters:
  • flist (sequence of strings) – files to be parsed
  • verbose (int, bool) – verbosity level
  • **kwargs – readDataFile kwargs
Returns:

data

Return type:

pandas.DataFrame

pymicra.io.readUnitsCsv(filename, **kwargs)

Reads a csv file in which the first line is the name of the variables and the second line contains the units

Parameters:
  • filename (string) – path of the csv file to read
  • **kwargs – to be passed to pandas.read_csv
Returns:

  • df (pandas.DataFrame) – dataframe with the data
  • unitsdic (dictionary) – dictionary with the variable names as keys and the units as values
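Such a file and the resulting pair can be sketched with plain pandas (the file layout below is hypothetical, built to match the description above):

```python
import io
import pandas as pd

# First line: variable names; second line: units; then the data
text = "u,v,theta\nm/s,m/s,K\n1.0,0.1,300.0\n2.0,0.2,301.0\n"
raw = pd.read_csv(io.StringIO(text))
unitsdic = dict(zip(raw.columns, raw.iloc[0]))            # {'u': 'm/s', ...}
df = raw.iloc[1:].astype(float).reset_index(drop=True)    # the data itself
```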

pymicra.io.read_fileConfig(dlcfile)

Reads file (metadata) configuration file

WARNING! When defining the .config file, note that by default columns that are enclosed in doublequotes will appear without the doublequotes. So if your file is of the form:

“2013-04-05 00:00:00”, .345, .344, …

Then the .config should have: variables={0:'%Y-%m-%d %H:%M:%S', 1:'u', 2:'v'}. This is the default csv format of CampbellSci dataloggers. To disable this feature, parse the file with read_csv using the keyword quoting=3.
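The quoting behaviour can be checked directly with pandas (quoting=3 is csv.QUOTE_NONE):

```python
import csv
import io
import pandas as pd

line = '"2013-04-05 00:00:00",.345,.344\n'
# Default: the doublequotes around the first column are stripped
default = pd.read_csv(io.StringIO(line), header=None)
# quoting=3 (csv.QUOTE_NONE): the doublequotes are kept verbatim
kept = pd.read_csv(io.StringIO(line), header=None, quoting=csv.QUOTE_NONE)
```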

pymicra.io.read_site(sitefile)

Reads .site configuration file, which holds siteConfig definitions

The .site file should contain definitions in regular Python syntax (lengths in meters!):

measurement_height = 10
canopy_height = 5
displacement_height = 3
roughness_length = 1.0

Parameters:sitefile (str) – path to the site configuration file
Returns:pymicra site configuration object
Return type:pymicra.siteConfig
pymicra.io.timeSeries(flist, datalogger, parse_dates=True, verbose=False, read_data_kw={}, parse_dates_kw={}, clean_dates=True, return_units=True, only_named_cols=True)

Creates a micrometeorological time series from a file or list of files.

Parameters:
  • flist (list or string) – either a list of file names (in which case the result is one concatenated dataframe) or the name of a single file
  • datalogger (pymicra.fileConfig object) – configuration of the datalogger which is from where all the configurations of the file will be taken
  • parse_dates (bool) – whether or not to index the data by date. Note that if this is False many of Pymicra’s functionalities will be lost (e.g. if there are repeated timestamps)
  • verbose (int, bool) – verbose level
Returns:

  • pandas.DataFrame – data contained in the files in flist
  • dict (optional) – units of the data

pymicra.io.write_as_fconfig(data, fname, fileconfig)

Writes a pandas DataFrame in the format specified by a fileConfig object

pymicra.methods

Defines some methods. Some have functions defined here but most use functions defined elsewhere. This is done by monkey-patching Pandas.

pymicra.methods.binwrapper(self, clean_index=True, **kwargs)

Method to return binned data from a dataframe using the function classbin

pymicra.methods.bulk_corr(self)
pymicra.methods.polyfit(*args, **kwargs)

This method fits an n-degree polynomial to the dataset. The index may or may not be a DatetimeIndex

Parameters:
  • data (pd.DataFrame, pd.Series) – dataframe whose columns have to be fitted
  • degree (int) – degree of the polynomial. Default is 1.
  • rule (str) – pandas offset string, e.g. “10min”.
pymicra.methods.to_unitsCsv(self, units, filename, **kwargs)

Wrapper around toUnitsCsv to create a method to print the contents of a dataframe plus its units into a unitsCsv file.

Parameters:
  • self (dataframe) – dataframe to write
  • units (dict) – dictionary with the names of each column and their unit
  • filename (str) – path to which write the unitsCsv
  • kwargs – to be passed to pandas’ method .to_csv
pymicra.methods.xplot(self, xcol, reverse_x=False, return_ax=False, fixed_cols=[], fcols_styles=[], latexify=False, **kwargs)

A smarter way to plot things with the x axis being one of the columns. Very useful for comparison of models and results

Parameters:
  • self (pandas.DataFrame) – dataframe to be plotted
  • xcol (str) – the name of the column you want in the x axis
  • reverse_x (bool) – whether to plot -xcol instead of xcol in the x-axis
  • return_ax (bool) – whether to return pyplot’s axis object for the plot
  • fixed_cols (list of strings) – columns to plot in every subplot (only if you use subplot=True on keywords)
  • fcols_styles (list of string) – styles to use for fixed_cols
  • latexify (bool) – whether to attempt to transform names of columns into latex format (this feature is still buggy)
  • **kwargs – kwargs to pass to pandas.plot method

pymicra.physics

Module that contains physical functions. They are all general use, but most are specially frequent in micrometeorology.

TO DO LIST:
  • ADD GENERAL SOLAR ZENITH CALCULATION
  • ADD FOOTPRINT CALCULATION?
pymicra.physics.R_moistAir(q)

Calculates the gas constant for humid air from the specific humidity q

Parameters:q (float) – the specific humidity in g(water)/g(air)
Returns:R_air – the specific gas constant for humid air in J/(g*K)
Return type:float
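As a sketch of the underlying physics, assuming the standard mixture relation R = (1 − q) R_d + q R_v (the constants below are textbook values, not taken from Pymicra):

```python
R_d = 0.287   # specific gas constant of dry air, J/(g K)
R_v = 0.4615  # specific gas constant of water vapour, J/(g K)

def R_moist(q):
    """Specific gas constant of humid air; q in g(water)/g(air)."""
    return (1.0 - q) * R_d + q * R_v
```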
pymicra.physics.airDensity_from_theta(data, units, notation=None, inplace_units=True, use_means=False, theta=None, theta_unit=None)

Calculates moist air density using theta measurements

Parameters:
  • data (pandas.DataFrame) – dataset to add rho_air
  • units (dict) – units dictionary
  • notation (pymicra.notation) – notation to be used
  • inplace_units (bool) – whether or not to treat units inplace
  • use_means (bool) – use the mean of theta or not when calculating
  • theta (pandas.Series) – auxiliary theta measurement
  • theta_unit (pint.quantity) – auxiliary theta’s unit
pymicra.physics.airDensity_from_theta_v(data, units, notation=None, inplace_units=True, use_means=False, return_full_df=True)

Calculates moist air density using p = rho R_dry T_virtual

Parameters:
  • data (pandas.DataFrame) – data to use to calculate air density
  • units (dict) – dictionary of units
  • notation (pymicra.Notation) – notation to be used
  • inplace_units (bool) – whether or not to update the units inplace. If False, units are returned too
  • use_means (bool) – whether or not to use averages of pressure and virtual temperature, instead of the means plus fluctuations
pymicra.physics.dewPointTemp(theta, e)

Calculates the dew point temperature. theta has to be in Kelvin and e in kPa

pymicra.physics.dryAirDensity_from_p(data, units, notation=None, inplace_units=True)

Calculates dry air density. NEEDS IMPROVEMENT REGARDING HANDLING OF UNITS

pymicra.physics.latent_heat_water(T)

Calculates the latent heat of evaporation for water

Receives T in Kelvin and returns the latent heat in J/g

pymicra.physics.perfGas(p=None, rho=None, R=None, T=None, gas=None)

Returns the only value that is not provided in the ideal gas law

P.S.: I’m using type to identify None objects because this way it works against pandas objects

pymicra.physics.ppxv2density(data, units, notation=None, inplace_units=True, solutes=[])

Calculates density of solutes based on their molar concentration (ppmv, ppbv and etc), not to be confused with mass concentration (ppm, ppb and etc).

Uses the relation \(\rho_x = \frac{C p}{\theta R_x}\)

Parameters:
  • data (pandas.DataFrame) – dataset of micromet variables
  • units (dict) – dict of pint units
  • notation (pymicra.Notation) – notation to be used here
  • inplace_units (bool) – whether or not to treat the dict units in place
  • solutes (list or tuple) – solutes to consider when doing this conversion
Returns:

input data plus calculated density columns

Return type:

pandas.DataFrame

pymicra.physics.satWaterPressure(T, unit='kelvin')

Returns the saturated water vapor pressure according to eq. (3.97) of Wallace and Hobbs, page 99.

e0, b, T1 and T2 are constants specific for water vapor

Parameters:T (float) – thermodynamic temperature
Returns:saturated vapor pressure of water (in kPa)
Return type:float
pymicra.physics.specific_humidity_from_ppxv(data, units, notation=None, return_full_df=True, inplace_units=True)

Calculates the specific humidity q from values of molar concentration of water (ppmv, ppthv and etc).

The equation is

\(q = \frac{m_v x}{(m_v - m_d) x + m_d}\)

where x is the molar concentration in ppxv.
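A numerical check of the formula (the molar masses below are standard values for water vapour and dry air, assumed here for illustration):

```python
m_v = 18.015   # molar mass of water vapour, g/mol (assumed standard value)
m_d = 28.965   # molar mass of dry air, g/mol (assumed standard value)

x = 15000e-6   # 15000 ppmv of water vapour as a dimensionless mole fraction
q = m_v * x / ((m_v - m_d) * x + m_d)   # specific humidity, g/g
```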

Parameters:
  • data (pandas.dataframe) – dataset
  • units (dict) – units dictionary
  • notation (pymicra.Notation) – notation to be used
  • return_full_df (bool) – whether to return only the calculated series or the full df
  • inplace_units (bool) – whether to return only a dict with the units of the new variables or include them in “units”
Returns:

outdata – specific humidity

Return type:

pandas.Series, pandas.DataFrame

pymicra.physics.theta_fluc_from_theta_v_fluc(data, units, notation=None, return_full_df=True, inplace_units=True)

Derived from theta_v = theta(1 + 0.61 q)

Parameters:
  • data (pandas.dataframe) – dataframe with q, q’, theta, theta_v’
  • units (dict) – units dictionary
  • notation (pymicra.Notation) – Notation object or None
Returns:

standard deviation of the thermodynamic temperature

Return type:

float

pymicra.physics.theta_from_theta_s(data, units, notation=None, return_full_df=True, inplace_units=True)

Calculates thermodynamic temperature using sonic temperature measurements

From Schotanus, Nieuwstadt, de Bruin; DOI 10.1007/BF00164332

theta_s = theta (1 + 0.51 q) (1 - (vn/c)**2)^0.5

\(theta_s \approx theta (1 + 0.51 q)\)

Parameters:
  • data (pandas.dataframe) – dataset
  • units (dict) – units dictionary
  • notation (pymicra.Notation) –
Returns:

thermodynamic temperature

Return type:

pandas.Series

pymicra.physics.theta_from_theta_v(data, units, notation=None, return_full_df=True, inplace_units=True)

Calculates thermodynamic temperature from virtual temperature measurements

\(theta_v \approx theta (1 + 0.61 q)\)

Parameters:
  • data (pandas.dataframe) – dataset
  • units (dict) – units dictionary
  • notation (pymicra.Notation) –
Returns:

thermodynamic temperature

Return type:

pandas.DataFrame or Series
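Inverting the relation above for theta is a one-liner in pandas (the sample numbers are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'theta_v': [300.0, 301.0],   # virtual temperature, K
                   'q': [0.010, 0.012]})        # specific humidity, g/g
theta = df['theta_v'] / (1.0 + 0.61 * df['q'])  # thermodynamic temperature, K
```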

pymicra.physics.theta_s_from_c(data, units, notation=None, return_full_df=True, inplace_units=True)

Calculates sonic temperature using speed of sound

From Schotanus, Nieuwstadt, de Bruin; DOI 10.1007/BF00164332

theta_s = 1/403 * c**2

\(theta_s \approx 1/403 c^2\)

Parameters:
  • data (pandas.dataframe) – dataset
  • units (dict) – units dictionary
  • notation (pymicra.Notation) –
Returns:

sonic temperature

Return type:

pandas.Series
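The relation amounts to a single operation (the speed-of-sound value below is hypothetical):

```python
c = 347.0               # measured speed of sound, m/s (hypothetical value)
theta_s = c**2 / 403.0  # sonic temperature, K
```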

pymicra.physics.theta_std_from_theta_v_fluc(data, units, notation=None)

Derived from theta_v = theta(1 + 0.61 q)

Parameters:
  • data (pandas.dataframe) – dataframe with q, q’, theta, theta_v’
  • units (dict) – units dictionary
  • notation (pymicra.Notation) – Notation object or None
Returns:

standard deviation of the thermodynamic temperature

Return type:

float

pymicra.tests

This module contains functions that test certain conditions on pandas.dataframes to be used with the qcontrol().

They all return True for the columns that pass the test and False for the columns that fail the test.

pymicra.tests.check_RA(data, detrend=True, detrend_kw={'how': 'linear'}, RAT_vars=None, RAT_points=50, RAT_significance=0.05)

Performs the Reverse Arrangement Test in each column of data

Parameters:
  • data (pandas.DataFrame) – to apply RAT to each column
  • detrend_kw (dict) – keywords to pass to pymicra.detrend
  • RAT_vars – list of variables to which to apply the RAT
  • RAT_points (int) – if it’s an int N, then reduce each column to N points by averaging. If None, then the whole columns are used
  • RAT_significance (float) – significance with which to apply the RAT
Returns:

valid – True or False for each column. If True, column passed the test

Return type:

pd.Series

pymicra.tests.check_limits(data, tables, max_percent=1.0, replace_with='interpolation')

Checks dataframe for lower and upper limits. If found, they are substituted by the linear trend of the run. The number of faulty points is also checked for each column against the maximum percentage of accepted faults max_percent

Parameters:
  • data (pandas dataframe) – dataframe to be checked
  • tables (pandas.dataframe) – dataframe with the lower and upper limits for variables
  • max_percent (float) – number from 0 to 100 that represents the maximum percentage of faulty points accepted by this test.
Returns:

  • df (pandas.DataFrame) – input data but with the faulty points substituted by the linear trend of the run.
  • valid (pandas.Series) – True for the columns that passed this test, False for the columns that didn’t.

pymicra.tests.check_nans(data, replace_with='interpolation', max_percent=100)

Checks data for NaN values

max_percent is here only for compatibility reasons but is deprecated

pymicra.tests.check_numlines(fname, numlines=18000, failverbose=False)

Checks the length of a file against a correct value. Returns False if the length is wrong and True if it is right

Parameters:
  • fname (string) – path of the file to check
  • numlines (int) – correct number of lines that the file has to have
Returns:

True if the length is right, False otherwise

Return type:

pandas.Series
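A minimal stdlib sketch of such a line-count check (an illustration with a hypothetical helper name, not Pymicra's code):

```python
def numlines_ok(fname, numlines=18000):
    """Return True if the file has exactly `numlines` lines."""
    with open(fname) as f:
        return sum(1 for _ in f) == numlines
```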

pymicra.tests.check_replaced(replaced, max_count=180)

Sums and checks if the number of replaced points is larger than the maximum accepted

pymicra.tests.check_spikes(data, chunk_size='2min', detrend={'how': 'linear'}, visualize=False, vis_col=1, max_consec_spikes=10, cut_func=<function <lambda>>, replace_with='interpolation', max_percent=1.0)

Applies spikes-check according to Vickers and Mahrt (1997)

Parameters:
  • data (pandas.dataframe) – data to de-spike
  • chunk_size (str, int) – size of chunks to consider. If str should be pandas offset string. If int, number of lines.
  • detrend (bool) – whether to detrend the data and work with the fluctuations or to work with the absolute series.
  • detrend_kw (dict) – dict of keywords to pass to pymicra.trend in order to detrend data (if detrend==True).
  • visualize (bool) – whether or not to visualize the interpolation occurring
  • vis_col (str, int or list) – the column(s) to visualize when seeing the interpolation (only effective if visualize==True)
  • max_consec_spikes (int) – maximum number of consecutive spikes to actually be considered spikes and substituted
  • cut_func (function) – function used to define spikes
  • replace_with (str) – method to use when replacing spikes. Options are ‘interpolation’ or ‘trend’.
  • max_percent (float) – maximum percentage of spikes to allow.
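The default cut_func documented for the quality-control utilities (points more than four standard deviations from the mean) can be exercised on synthetic data:

```python
import numpy as np
import pandas as pd

# Documented default spike criterion
cut_func = lambda x: abs(x - x.mean()) > abs(x.std() * 4.0)

rng = np.random.default_rng(0)
s = pd.Series(rng.normal(0.0, 1.0, 1200))
s.iloc[600] = 50.0                    # inject an artificial spike
spikes = cut_func(s)                  # boolean mask of spike points
clean = s.mask(spikes).interpolate()  # the 'interpolation' replacement
```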
pymicra.tests.check_stationarity(data, tables, detrend={'how': 'movingmean', 'window': 900}, trend={'how': 'movingmedian', 'window': '1min'})

Check the difference between the maximum and minimum values of the run trend against an upper limit. This aims to flag nonstationary runs

First detrends. (Then maybe takes the std depending on moving_std_kw.) Then checks the trend.

pymicra.tests.check_std(data, tables, detrend={'how': 'linear'}, chunk_size='2min', failverbose=False)

Checks a dataframe for columns whose standard deviation is too small

Parameters:
  • data (pandas.DataFrame) – dataset whose standard deviation to check
  • tables (pandas.DataFrame) – dataset containing the standard deviation limits for each column
  • detrend (dict) – keywords to pass to pymicra.detrend with detrend==True. If empty, no detrending is done
  • chunk_size (str) – pandas datetime offset string
Returns:

valid – containing True or False for each column. True means the column passed the test.

Return type:

pandas.Series

pymicra.tests.check_std_stationarity(data, tables, detrend={'how': 'movingmean', 'window': 900}, moving_std_kw={})

Check the difference between the maximum and minimum values of the run trend against an upper limit. This aims to flag nonstationary runs

First detrends. Then takes the std.

pymicra.util

Module for general utilities

  • INCLUDE DROPOUT TEST
  • INCLUDE THIRD MOMENT TEST
  • CHANGE NOTATION IN QCONTROL’S SUMMARY
pymicra.util.correctDrift(drifted, correct_drifted_vars=None, correct=None, get_fit=True, write_fit=True, fit_file='correctDrift_linfit.params', apply_fit=True, show_plot=False, return_plot=False, units={}, return_index=False)
Parameters:
  • correct (pandas.DataFrame) – dataset with the correct averages
  • drifted (pandas.DataFrame) – dataset with the averages that need to be corrected
  • correct_drifted_vars (dict) – dictionary where every key is a var in the right dataset and its value is its correspondent in the drifted dataset
  • get_fit (bool) – whether or not to fit a linear relation between both datasets. Generally slow. Should only be done once
  • write_fit (bool) – if get_fit == True, whether or not to write the linear fit to a file (recommended)
  • fit_file (string) – where to write the linear fit (if one is written) or from where to read the linear fit (if no fit is written)
  • apply_fit (bool) – whether or not to apply the linear fit and correct the data (at least get_fit and fit_file must be true)
  • show_plot (bool) – whether or not to show drifted vs correct plot, to see if it’s a good fit
  • units (dict) – if given, creates a {fit_file}.units file recording the units the data must be in for the correction to be applied correctly
  • return_index (bool) – whether to return the indexes of the used points for the calculation. Serves to check the regression
Returns:

outdf – drifted dataset corrected with right dataset

Return type:

pandas.DataFrame

pymicra.util.qc_discard(files, fileconfig, read_files_kw={'clean_dates': False, 'only_named_cols': False, 'parse_dates': False, 'return_units': False}, std_limits={}, std_detrend={'how': 'linear'}, dif_limits={}, maxdif_detrend={}, maxdif_trend={'how': 'movingmedian', 'window': 600}, chunk_size=1200, passverbose=False, failverbose=True, failshow=False, passshow=False, passshow_vars=None, outdir='1_filtered', summary_file='filter_summary.csv', full_report=None)

Function that applies various quality-control tests to a set of datafiles and re-writes the successful files in another directory. A list of currently applied tests is found below in order of application. When some variable or set of points fails a test the whole file is discarded.

Tests are based on Vickers and Mahrt, Quality control and flux sampling problems for tower and aircraft data.

  • standard deviation (STD) check:
     runs with a standard deviation lower than a pre-determined value are left out. - keywords: std_limits, std_detrend
  • maximum difference (stationarity) test:
     runs whose trend have a maximum difference greater than a certain value are left out. This excludes non-stationary runs. Activate it by passing a dif_limits keyword. - keywords: dif_limits, maxdif_detrend, maxdif_trend
Parameters:
  • files (list) – list of filepaths
  • fileconfig (pymicra.fileConfig object or str) – datalogger configuration object used for all files in the list of files or path to a dlc file.
  • read_files_kw (dict) – keywords to pass to pymicra.timeSeries. Default is {‘parse_dates’:False} because parsing dates at every file is slow, so this makes the whole process faster. However, {‘parse_dates’:True, ‘clean_dates’:False} is recommended if time is not a problem because the window and chunk_size keywords may be used as, for example ‘2min’, instead of 1200, which is the equivalent number of points.
  • dif_limits (dict) – keys must be names of variables and values must be upper limits for the maximum difference of values that the linear trend of the run must have.
  • maxdif_detrend (dict) – keywords to pass to pymicra.detrend when detrending for max difference test. If it’s empty, no detrending is made.
  • maxdif_trend (dict) – Keywords to pass to pymicra.detrend when trending for max difference test (passed to pymicra.data.trend). If empty, no trending is used. This is used in the max difference test, since the difference is taken between the max and min values of the trend, not of the raw timeSeries.
  • chunk_size (str) – string representing time length of chunks used in the standard deviation check. Default is “2Min”. Putting None will not separate in chunks. It’s recommended to use rolling functions in this case (might be slow).
  • passverbose (bool) – whether or not to show details on the successful runs.
  • failverbose (bool) – whether or not to show details on the failed runs.
  • passshow (bool) – whether or not to plot the successful runs on screen.
  • passshow_vars (list) – list of columns to plot if the run is successful.
  • failshow (bool) – whether or not to plot the failed runs on screen.
  • outdir (str) – name of directory in which to write the successful runs. Directory must already exist.
  • summary_file (str) – path of file to be created with the summary of the runs. Will be overwritten if it already exists.
Returns:

ext_summary – dict with the extended summary, which has the path of the files that got “stuck” in each test along with the successful ones

Return type:

pandas.DataFrame

pymicra.util.qc_replace(files, fileconfig, read_files_kw={'clean_dates': False, 'only_named_cols': False, 'parse_dates': False, 'return_units': False}, begin_date=None, end_date=None, file_lines=None, nans_test=True, lower_limits={}, upper_limits={}, spikes_test=True, visualize_spikes=False, chunk_size=1200, spikes_vis_col='u', spikes_detrend={'how': 'linear'}, spikes_func=<function <lambda>>, max_consec_spikes=10, max_replacement_count=180, replace_with='interpolation', passverbose=False, failverbose=True, passshow=False, failshow=False, passshow_vars=None, outdir='0_replaced', summary_file='control_replacement.csv', replaced_report='rreport.csv', full_report=None)

This function applies various quality-control checks/tests to a set of datafiles and re-writes the successful files in another directory. This specific function focuses on point-analysis that can be fixed by replacements (removing spikes by interpolation, etc.). A run fails only if (1) it is beyond the accepted dates, (2) it has a different number of lines than usual, or (3) the number of points replaced is greater than max_replacement_count.

A list of applied tests is found below in order of application. The only test available by default is the spikes test. All others depend on their respective keywords.

Tests are based on Vickers and Mahrt, Quality control and flux sampling problems for tower and aircraft data.

  • date check:Files outside a date_range are left out. keywords: end_date, begin_date
  • lines test:Checks each file to see if it has a certain number of lines. Files with a different number of lines fail this test. keywords: file_lines
  • NaN’s filter:Checks for any NaN values. NaNs are replaced with interpolation or linear trend. Activate it by passing nans_test=True. - keywords: nans_test
  • boundaries filter:
     Checks for values in any column lower than a pre-determined lower limit or higher than an upper limit. If found, these points are replaced (interpolated or otherwise). - keywords: lower_limits, upper_limits
  • spikes filter:
     Searches for spikes according to the user definition. Spikes are replaced (interpolated or otherwise). - keywords: spikes_test, spikes_func, visualize_spikes, spikes_vis_col, max_consec_spikes and chunk_size

  • replacement count test:
     Checks the total amount of points that were replaced (including NaN, boundaries and spikes test) against the max_replacement_count keyword. Fails if any columns has more replacements than that. - keywords: max_replacement_count
Parameters:
  • files (list) – list of filepaths
  • fileconfig (pymicra.fileConfig object or str) – datalogger configuration object used for all files in the list of files or path to a dlc file.
  • read_files_kw (dict) – keywords to pass to pymicra.timeSeries. Default is {‘parse_dates’:False} because parsing dates at every file is slow, so this makes the whole process faster. However, {‘parse_dates’:True, ‘clean_dates’:False} is recommended if time is not a problem because the window and chunk_size keywords may be used as, for example ‘2min’, instead of 1200, which is the equivalent number of points.
  • file_lines (int) – number of line a “good” file must have. Fails if the run has any other number of lines.
  • begin_date (str) – dates before this automatically fail.
  • end_date (str) – dates after this automatically fail.
  • nans_test (bool) – whether or not to apply the nans test
  • lower_limits (dict) – keys must be names of variables and values must be lower absolute limits for the values of each var.
  • upper_limits (dict) – keys must be names of variables and values must be upper absolute limits for the values of each var.
  • spikes_test (bool) – whether or not to check for spikes.
  • spikes_detrend (dict) – keywords to pass to pymicra.detrend when detrending for spikes. If it’s empty, no detrending is done.
  • visualize_spikes (bool) – whether or not to plot the spikes identification and interpolation (useful for calibration of spikes_func). Only one column is visualized at each time. This is set with the spikes_vis_col keyword.
  • spikes_vis_col (str) – column to use to visualize spikes.
  • spikes_func (function) – function used to look for spikes. Can be defined used numpy/pandas notation for methods with lambda functions. Default is: lambda x: (abs(x - x.mean()) > abs(x.std()*4.))
  • replace_with (str) – method to use when replacing the spikes. Options are ‘interpolation’ and ‘trend’.
  • max_consec_spikes (int) – Limit of consecutive spike points to be interpolated. If the number of consecutive “spikes” is more than this, then we take all those points as not actually being spikes and no replacement is done. So if max_consec_spikes=0, no spike replacement is ever done.
  • chunk_size (str) – string representing time length of chunks used in the spikes check. Default is “2Min”. Putting None will not separate in chunks. It’s recommended to use rolling functions in this case (might be slow).
  • max_replacement_count (int) – Maximum number of replaced point a variable can have in a run. If the replaced number of points is larger than this then the run fails and is discarded. Generally this should be about 1% of the file_lines.
  • passverbose (bool) – whether or not to show details on the successful runs.
  • failverbose (bool) – whether or not to show details on the failed runs.
  • passshow (bool) – whether or not to plot the successful runs on screen.
  • passshow_vars (list) – list of columns to plot if the run is successful.
  • failshow (bool) – whether or not to plot the failed runs on screen.
  • outdir (str) – name of directory in which to write the successful runs. Directory must already exist.
  • summary_file (str) – path of file to be created with the summary of the runs. Will be overwritten if it already exists.
Returns:

ext_summary – extended summary, which holds the paths of the files that got “stuck” in each test along with the successful ones

Return type:

pandas.DataFrame
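To illustrate the default spikes_func documented above, here is a minimal sketch with synthetic data. The 4-standard-deviation rule is the documented default, but the data and the surrounding code are made up for illustration, not taken from pymicra:

```python
import numpy as np
import pandas as pd

# The documented default detector: flag points farther than 4 standard
# deviations from the mean of the chunk.
spikes_func = lambda x: (abs(x - x.mean()) > abs(x.std() * 4.0))

data = pd.Series(np.zeros(101))
data.iloc[50] = 50.0          # inject one artificial spike
mask = spikes_func(data)
print(int(mask.sum()))        # -> 1: only the injected point is flagged
```

Note that with very few points a single large outlier inflates the standard deviation enough to hide itself, which is one reason the check is applied per chunk (see chunk_size) rather than on whole runs.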

pymicra.util.qcontrol(files, fileconfig, read_files_kw={'clean_dates': False, 'only_named_cols': False, 'parse_dates': False, 'return_units': False}, begin_date=None, end_date=None, file_lines=None, nans_test=True, accepted_nans_percent=1.0, lower_limits={}, upper_limits={}, accepted_bound_percent=1.0, spikes_test=True, visualize_spikes=False, spikes_vis_col='u', spikes_detrend={'how': 'linear'}, spikes_func=<function <lambda>>, max_consec_spikes=3, accepted_spikes_percent=1.0, max_replacement_count=180, std_limits={}, std_detrend=True, std_detrend_kw={'how': 'movingmean', 'window': 900}, dif_limits={}, maxdif_detrend=True, maxdif_detrend_kw={'how': 'movingmean', 'window': 900}, maxdif_trend=True, maxdif_trend_kw={'how': 'movingmedian', 'window': 600}, RAT=False, RAT_vars=None, RAT_detrend=True, RAT_detrend_kw={'how': 'linear'}, RAT_points=50, RAT_significance=0.05, chunk_size=1200, replace_with='interpolation', trueverbose=False, falseverbose=True, falseshow=False, trueshow=False, trueshow_vars=None, outdir='quality_controlled', summary_file='qcontrol_summary.csv', replaced_report=None, full_report=None)

Function that applies various quality control tests to a set of datafiles and re-writes the successful files in another directory. A list of currently-applied tests is found below, in order of application. The only test applied by default is the spikes test. All others depend on their respective keywords.

  • date check: files outside the date range are left out. - keywords: begin_date, end_date
  • lines test: checks each file to see if it has a certain number of lines. Files with a different number of lines fail this test. - keywords: file_lines
  • NaN’s test: checks for any NaN values. NaNs are replaced with interpolation or linear trend. If the percentage of NaNs is greater than accepted_nans_percent, the run is discarded. Activate it by passing nans_test=True. - keywords: nans_test, accepted_nans_percent
  • boundaries test:
     runs with values in any column lower than a pre-determined lower limit or higher than an upper limit are left out. - keywords: lower_limits, upper_limits
  • spikes test:
     checks for spikes and replaces them according to some keywords; runs with more than a certain percentage of spikes are left out. - keywords: spikes_test, spikes_func, visualize_spikes, spikes_vis_col, max_consec_spikes, accepted_spikes_percent, chunk_size
  • replacement count test:
     checks the total amount of points that were replaced (including the NaN, boundaries and spikes tests) against the max_replacement_count keyword. Fails if any column has more replacements than that. - keywords: max_replacement_count
  • standard deviation (STD) check:
     runs with a standard deviation lower than a pre-determined value (generally close to the sensor precision) are left out. - keywords: std_limits, std_detrend, std_detrend_kw
  • maximum difference (stationarity) test:
     runs whose trend have a maximum difference greater than a certain value are left out. This excludes non-stationary runs. Activate it by passing a dif_limits keyword. - keywords: dif_limits, maxdif_detrend, maxdif_detrend_kw, maxdif_trend, maxdif_trend_kw
  • reverse arrangement test (RAT):
     runs that fail the reverse arrangement test for any variable are left out. - keywords: RAT, RAT_vars, RAT_detrend, RAT_detrend_kw, RAT_points, RAT_significance
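The NaN’s test described above can be sketched with plain pandas. This is a hedged illustration only; the function name nans_test and the exact replacement logic here are assumptions, not pymicra’s internals:

```python
import numpy as np
import pandas as pd

def nans_test(series, accepted_nans_percent=1.0):
    """Illustrative NaN check: interpolate over NaNs and fail the run
    if their percentage exceeds accepted_nans_percent."""
    percent = 100.0 * series.isna().sum() / len(series)
    filled = series.interpolate()
    return filled, percent <= accepted_nans_percent

# 1 NaN in 200 points = 0.5 %, below the default 1 % threshold
s = pd.Series([1.0, np.nan, 3.0] + [4.0] * 197)
filled, passed = nans_test(s)
print(passed)          # -> True
print(filled.iloc[1])  # -> 2.0 (linear interpolation between 1.0 and 3.0)
```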
Parameters:
  • files (list) – list of filepaths
  • fileconfig (pymicra.fileConfig object or str) – datalogger configuration object used for all files in the list of files or path to a dlc file.
  • read_files_kw (dict) – keywords to pass to pymicra.timeSeries. Default is {‘parse_dates’:False} because parsing dates at every file is slow, so this makes the whole process faster. However, {‘parse_dates’:True, ‘clean_dates’:False} is recommended if time is not a problem because the window and chunk_size keywords may be used as, for example ‘2min’, instead of 1200, which is the equivalent number of points.
  • file_lines (int) – number of lines a “good” file must have. Fails if the run has any other number of lines.
  • begin_date (str) – dates before this automatically fail.
  • end_date (str) – dates after this automatically fail.
  • nans_test (bool) – whether or not to apply the nans test
  • accepted_nans_percent (float) – maximum percentage of NaN points that is acceptable in a run.
  • std_limits (dict) – keys must be names of variables and values must be upper limits for the standard deviation.
  • std_detrend (bool) – whether or not to work with the fluctuations of the data on the spikes and standard deviation tests.
  • std_detrend_kw – keywords to be passed to pymicra.detrend specifically to be used on the STD test.
  • lower_limits (dict) – keys must be names of variables and values must be lower absolute limits for the values of each var.
  • upper_limits (dict) – keys must be names of variables and values must be upper absolute limits for the values of each var.
  • dif_limits (dict) – keys must be names of variables and values must be upper limits for the maximum difference of values that the linear trend of the run must have.
  • maxdif_detrend (bool) – whether to detrend data before checking for differences.
  • maxdif_detrend_kw (dict) – keywords to pass to pymicra.detrend when detrending for max difference test.
  • maxdif_trend (bool) – whether to check for differences using the trend, instead of raw points (which can be the fluctuations or the original absolute values of data, depending if maxdif_detrend==True or False).
  • maxdif_trend_kw (dict) – keywords to pass to pymicra.data.trend when computing the trend for the max difference test. This is used because the difference is taken between the max and min values of the trend, not of the series. Default = {‘how’:’linear’}.
  • spikes_test (bool) – whether or not to check for spikes.
  • spikes_detrend (dict) – keywords to pass to pymicra.detrend when detrending for spikes.
  • visualize_spikes (bool) – whether or not to plot the spikes identification and interpolation (useful for calibration of spikes_func). Only one column is visualized at each time. This is set with the spikes_vis_col keyword.
  • spikes_vis_col (str) – column to use to visualize spikes.
  • spikes_func (function) – function used to look for spikes. Can be defined using numpy/pandas method notation with lambda functions. Default is: lambda x: (abs(x - x.mean()) > abs(x.std()*4.))
  • replace_with (str) – method to use when replacing the spikes. Options are ‘interpolation’ and ‘trend’.
  • max_consec_spikes (int) – limit of consecutive spike points to be interpolated. Beyond this, spikes are left as they are in the output.
  • accepted_spikes_percent (float) – limit percentage of spike points in the data. If spike points represent a higher percentage than this, the run fails the spikes check.
  • chunk_size (str) – string representing time length of chunks used in the spikes and standard deviation check. Default is “2Min”. Putting None will not separate in chunks. It’s recommended to use rolling functions in this case (might be slow).
  • RAT (bool) – whether or not to perform the reverse arrangement test on data.
  • RAT_vars (list) – list containing the name of variables to go through the reverse arrangement test. If None, all variables are tested.
  • RAT_points (int) – number of final points to apply the RAT. If 50, the run will be averaged to a 50-points run.
  • RAT_significance – significance level to apply the RAT.
  • RAT_detrend_kw – keywords to be passed to pymicra.detrend specifically to be used on the RA test. {“how”:”linear”} is strongly recommended for this case.
  • trueverbose (bool) – whether or not to show details on the successful runs.
  • falseverbose (bool) – whether or not to show details on the failed runs.
  • trueshow (bool) – whether or not to plot the successful runs on screen.
  • trueshow_vars (list) – list of columns to plot if the run is successful.
  • falseshow (bool) – whether or not to plot the failed runs on screen.
  • outdir (str) – name of directory in which to write the successful runs. Directory must already exist.
  • summary_file (str) – path of the file to be created with the summary of the runs. Will be overwritten if it already exists.
Returns:

ext_summary – extended summary, which holds the paths of the files that got “stuck” in each test along with the successful ones

Return type:

pandas.DataFrame
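The maximum difference (stationarity) test above can be sketched as follows. This is an illustration only; max_difference, the synthetic data, and the threshold are hypothetical, not pymicra’s API:

```python
import numpy as np
import pandas as pd

def max_difference(series, window, limit):
    """Illustrative stationarity check: take a moving-mean trend and
    compare its max-min range against a dif_limits-style threshold."""
    trend = series.rolling(window, center=True, min_periods=1).mean()
    return (trend.max() - trend.min()) <= limit

rng = np.random.default_rng(0)
stationary = pd.Series(rng.normal(0.0, 0.1, 1200))
drifting = stationary + np.linspace(0.0, 5.0, 1200)  # strong linear drift

print(max_difference(stationary, 600, 2.0))  # passes: trend is nearly flat
print(max_difference(drifting, 600, 2.0))    # fails: non-stationary run
```

This mirrors the maxdif_trend logic: the difference is taken between the extremes of the trend, so short-lived noise does not fail the run, while a persistent drift does.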

pymicra.util.separateFiles(files, dlconfig, outformat='out_%Y-%m-%d_%H:%M.csv', outdir='', verbose=False, firstflag='.first', lastflag='.last', save_ram=False, frequency='30min', quoting=0, use_edges=False)

Separates files into (by default) 30-minute smaller files. Useful for output files such as those from Campbell Sci, which can have days of data in one single file.

Parameters:
  • files (list) – list of file paths to be separated
  • dlconfig (pymicra datalogger configuration file) – to tell how the dates are displayed inside the file
  • outformat (str) – the format of the file names to output
  • outdir (str) – the path to the directory in which to output the files
  • verbose (bool) – whether to print to the screen
  • firstflag (str) – flag to put after the name of the file for the first file to be created
  • lastflag (str) – flag to put after the name of the file for the last file to be created
  • save_ram (bool) – set this to True if the files are too big for pandas to load into your RAM at once
  • frequency – the frequency into which to separate the files
  • quoting (int) – for pandas (see read_csv documentation)
  • use_edges (bool) – use this carefully. This concatenates the last few lines of a file to the first few lines of the next file in case they don’t end on a nice round time with respect to the frequency
Returns:

Return type:

None
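The idea behind separateFiles can be sketched with plain pandas. This is a simplified illustration of time-based grouping under the default 30-minute frequency; it is an assumption about the approach, not pymicra’s implementation:

```python
import pandas as pd

# Group a timestamped dataset into 30-minute chunks; each chunk would then
# be written to its own file named after the outformat date pattern.
index = pd.date_range('2013-05-01 00:00', periods=120, freq='1min')
df = pd.DataFrame({'u': range(120)}, index=index)

chunks = {ts.strftime('out_%Y-%m-%d_%H:%M.csv'): group
          for ts, group in df.groupby(pd.Grouper(freq='30min'))}

print(sorted(chunks))                            # four 30-minute "files"
print(len(chunks['out_2013-05-01_00:00.csv']))   # -> 30 rows per chunk
```

The outformat-style strftime pattern shown here matches the function’s default ('out_%Y-%m-%d_%H:%M.csv'); note that filenames containing “:” are not valid on Windows.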