# Pymicra’s auto-generated docs

## pymicra.constants

Defines some useful constants

## pymicra.core

Defines classes that are the basis of Pymicra

class pymicra.core.Notation(*args, **kwargs)

Bases: object

Holds the notation used in every function of pymicra except when told otherwise.

build(from_level=0)

Builds the full notation based on the base notation.

Given the notation for means, fluctuations, etc., along with the names of the variables, this method builds the notation for mean h2o concentration, virtual temperature fluctuations and so on.

Parameters:
• self (pymicra.Notation) – notation to be built
• from_level (int) – level from which to build. If 0, build everything from scratch and higher notations will be overwritten. If 1, skip one step in the building process (still to be implemented!).

Returns: Notation object with built notation
Return type: pymicra.Notation
class pymicra.core.fileConfig(*args, **kwargs)

Bases: object

This class defines a specific configuration of a data file

Parameters:
• from_file (str) – path of the .cfg file (configuration file) to read from. This will ignore all other keywords.
• variables (list of strings or dict) – if a list, it should be a list of strings with the names of the variables. If a variable is part of the date, it should be provided as a datetime directive: if a column contains only the year, its name must be %Y, and so forth, while if it contains the full date in YYYY/MM/DD format, it should be %Y/%m/%d. For more info see https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior. If a dict, the keys should be the column numbers and the values should follow the rules for a list.
• date_cols (list of ints) – indexes of the subset of variables that compose the timestamp. If not provided, the program will try to guess by taking all variable names that contain a percent sign (%).
• date_connector (string) – generally not necessary. Used to join and then parse the date_cols.
• columns_separator (string) – separator between the columns of the file. If the file is whitespace- or tab-separated, this should be “whitespace”.
• header_lines (int or list) – up to which line of the file is a header. See the pandas.read_csv header option.
• filename_format (string) – format of the file names, using the standard datetime directives for the date and time and “?” for variable parts. E.g. if the files are 56_20150101.csv, 57_20150102.csv, etc., filename_format should be ??_%Y%m%d.csv. This is useful primarily for the quality control feature.
• units (dictionary) – very important: a dictionary whose keys are the columns of the file and whose values are the units in which they appear.
• description (string) – brief description of the datalogger configuration file.
• varNames (DEPRECATED) – use variables instead.
get_date_cols()

Guesses which columns contain the dates by searching for percent signs in their names
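This guessing step can be illustrated with a small self-contained sketch (not pymicra’s actual code): any variable whose name contains a % datetime directive is treated as part of the date.

```python
def guess_date_cols(var_names):
    """Return indexes of variable names that look like datetime
    directives (contain a percent sign), mimicking get_date_cols."""
    return [i for i, name in enumerate(var_names) if '%' in name]

cols = guess_date_cols(['%Y', '%m', '%d', 'u', 'v', 'w'])
# cols == [0, 1, 2]
```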

class pymicra.core.siteConfig(*args, **kwargs)

Bases: object

Keeps the configurations and constants of an experiment (such as instrument heights, location, canopy height, etc.)

Check help(pm.siteConfig.__init__) for other parameters

Parameters:
• from_file (str) – path to the .site file, which contains the other keywords

## pymicra.decorators

Defines useful decorators for Pymicra

pymicra.decorators.autoassign(*names, **kwargs)

Decorator that automatically assigns keywords as attributes

Allows a method to assign (some of) its arguments as attributes of 'self' automatically. E.g.

To restrict autoassignment to 'bar' and 'baz', write:

    @autoassign('bar', 'baz')
    def method(self, foo, bar, baz):
        ...

To prevent 'foo' and 'baz' from being autoassigned, use:

    @autoassign(exclude=('foo', 'baz'))
    def method(self, foo, bar, baz):
        ...
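A minimal sketch of how such a decorator can be built (illustrative only; pymicra’s actual implementation may differ):

```python
import functools
import inspect

def autoassign(*names, **kwargs):
    """Bind (a subset of) a method's arguments as attributes of self."""
    exclude = kwargs.get('exclude', ())

    def decorator(func):
        sig = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(self, *args, **kw):
            bound = sig.bind(self, *args, **kw)
            bound.apply_defaults()
            for name, value in bound.arguments.items():
                if name == 'self':
                    continue
                if names and name not in names:
                    continue
                if name in exclude:
                    continue
                setattr(self, name, value)
            return func(self, *args, **kw)
        return wrapper
    return decorator

class Point:
    @autoassign('x', 'y')
    def __init__(self, x, y, z=0):
        pass

p = Point(1, 2, 3)   # p.x and p.y are set; p.z is not
```

In this sketch the parenthesized form is always required: @autoassign() with no arguments assigns every argument.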

pymicra.decorators.pdgeneral(convert_out=True)

Defines a decorator to make functions work on both pandas.Series and DataFrames

Parameters: convert_out (bool) – if True, also converts output back to Series if input is Series
pymicra.decorators.pdgeneral_in(func)

Defines a decorator that transforms Series into DataFrame

pymicra.decorators.pdgeneral_io(func)

If the input is a Series, transforms it to a DataFrame and then transforms the output from DataFrame back into a Series. If the input is a Series and the output is a one-element Series, transforms it to a float.

Currently the output functionality works only when the output is one variable, not an array of elements.

## pymicra.io

Defines some useful functions to aid in the input/output of data

pymicra.io.readDataFile(fname, variables=None, only_named_cols=True, **kwargs)

Parameters:
• fname (str) – path of the file to read
• variables (list or dict) – list or dictionary containing the names of each variable in the file (if a dict, the keys must be ints giving the column numbers)
• only_named_cols (bool) – if True, don’t read columns that don’t appear in variables’ keys
• kwargs (dict) – dictionary with kwargs of pandas’ read_csv function. See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html for more detail.

Returns: pandas.DataFrame object
Return type: pandas.DataFrame
pymicra.io.readDataFiles(flist, verbose=0, **kwargs)

Reads data from a list of files by calling readDataFile individually for each entry

Parameters:
• flist (sequence of strings) – files to be parsed
• verbose (bool) – whether to print progress information
• **kwargs – readDataFile kwargs

Returns: data
Return type: pandas.DataFrame
pymicra.io.readUnitsCsv(filename, **kwargs)

Reads a csv file in which the first line is the name of the variables and the second line contains the units

Parameters:
• filename (string) – path of the csv file to read
• **kwargs – to be passed to pandas.read_csv

Returns:
• df (pandas.DataFrame) – dataframe with the data
• unitsdic (dictionary) – dictionary with the variable names as keys and the units as values
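The two-line header layout can be parsed with the standard csv module; a rough, self-contained sketch of what readUnitsCsv does (the real function returns a pandas.DataFrame, which is skipped here):

```python
import csv
import io

def read_units_csv(fileobj):
    """First row: variable names; second row: units; rest: data rows."""
    reader = csv.reader(fileobj)
    names = next(reader)
    units = next(reader)
    unitsdic = dict(zip(names, units))
    rows = [dict(zip(names, row)) for row in reader]
    return rows, unitsdic

sample = io.StringIO("u,v\nm/s,m/s\n1.0,2.0\n")
rows, unitsdic = read_units_csv(sample)
```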
pymicra.io.read_fileConfig(dlcfile)

WARNING! When defining the .config file, note that by default columns that are enclosed in double quotes will appear without the double quotes. So if your file is of the form:

“2013-04-05 00:00:00”, .345, .344, …

Then the .config should have variables={0:'%Y-%m-%d %H:%M:%S', 1:'u', 2:'v'}. This is the default csv format of CampbellSci dataloggers. To disable this feature, parse the file with read_csv using the keyword quoting=3.

pymicra.io.read_site(sitefile)

Reads .site configuration file, which holds siteConfig definitions

The .site file should contain definitions in regular Python syntax (lengths in meters!):

    measurement_height = 10
    canopy_height = 5
    displacement_height = 3
    roughness_length = 1.0

Parameters:
• sitefile (str) – path to the site configuration file

Returns: pymicra site configuration object
Return type: pymicra.siteConfig
pymicra.io.timeSeries(flist, datalogger, parse_dates=True, verbose=False, read_data_kw={}, parse_dates_kw={}, clean_dates=True, return_units=True, only_named_cols=True)

Creates a micrometeorological time series from a file or list of files.

Parameters:
• flist (list or string) – either a list of file names (the result will be one concatenated dataframe) or the name of one file
• datalogger (pymicra.fileConfig object) – configuration of the datalogger, from which all the configurations of the file will be taken
• parse_dates (bool) – whether or not to index the data by date. Note that if this is False many of the functionalities of pymicra will be lost (e.g. if there are repeated timestamps).
• verbose (int, bool) – verbose level

Returns:
• pandas.DataFrame – data contained in the files in flist
• dict (optional) – units of the data
pymicra.io.write_as_fconfig(data, fname, fileconfig)

Writes a pandas DataFrame in the format specified by a fileConfig object

## pymicra.methods

Defines some methods. Some have functions defined here but most use functions defined elsewhere. This is done by monkey-patching Pandas.

pymicra.methods.binwrapper(self, clean_index=True, **kwargs)

Method to return binned data from a dataframe using the function classbin

pymicra.methods.bulk_corr(self)
pymicra.methods.polyfit(*args, **kwargs)

This method fits an n-degree polynomial to the dataset. The index can be a DateTimeIndex or not.

Parameters:
• data (pd.DataFrame, pd.Series) – dataframe whose columns are to be fitted
• degree (int) – degree of the polynomial. Default is 1.
• rule (str) – pandas offset string, e.g. “10min”.
pymicra.methods.to_unitsCsv(self, units, filename, **kwargs)

Wrapper around toUnitsCsv to create a method to print the contents of a dataframe plus its units into a unitsCsv file.

Parameters:
• self (dataframe) – dataframe to write
• units (dict) – dictionary with the names of each column and their unit
• filename (str) – path to which to write the unitsCsv
• kwargs – to be passed to pandas’ .to_csv method
pymicra.methods.xplot(self, xcol, reverse_x=False, return_ax=False, fixed_cols=[], fcols_styles=[], latexify=False, **kwargs)

A smarter way to plot things, with the x axis being one of the columns. Very useful for comparing models and results.

Parameters:
• self (pandas.DataFrame) – dataframe to be plotted
• xcol (str) – the name of the column you want on the x axis
• reverse_x (bool) – whether to plot -xcol instead of xcol on the x axis
• return_ax (bool) – whether to return pyplot’s axis object for the plot
• fixed_cols (list of strings) – columns to plot in every subplot (only if you use subplots=True in the keywords)
• fcols_styles (list of strings) – styles to use for fixed_cols
• latexify (bool) – whether to attempt to transform names of columns into latex format (still buggy)
• **kwargs – kwargs to pass to pandas’ plot method

## pymicra.physics

Module that contains physical functions. They are all general use, but most are specially frequent in micrometeorology.

TO DO LIST:
• ADD GENERAL SOLAR ZENITH CALCULATION
pymicra.physics.R_moistAir(q)

Calculates the gas constant for humid air from the specific humidity q

Parameters:
• q (float) – the specific humidity in g(water)/g(air)

Returns: R_air – the specific gas constant for humid air in J/(g*K)
Return type: float
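The usual relation behind such a function is a mass-weighted average of the dry-air and water-vapor gas constants. A self-contained sketch (the constant values here are standard textbook numbers, assumed; pymicra takes its own from pymicra.constants):

```python
R_DRY = 0.28704    # specific gas constant of dry air, J/(g*K) (assumed value)
R_VAPOR = 0.46151  # specific gas constant of water vapor, J/(g*K) (assumed value)

def r_moist_air(q):
    """Specific gas constant of moist air, J/(g*K),
    for specific humidity q in g/g."""
    return (1.0 - q) * R_DRY + q * R_VAPOR
```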
pymicra.physics.airDensity_from_theta(data, units, notation=None, inplace_units=True, use_means=False, theta=None, theta_unit=None)

Calculates moist air density using theta measurements

Parameters:
• data (pandas.DataFrame) – dataset to which rho_air is added
• units (dict) – units dictionary
• notation (pymicra.notation) – notation to be used
• inplace_units (bool) – whether or not to treat units in place
• use_means (bool) – whether or not to use the mean of theta when calculating
• theta (pandas.Series) – auxiliary theta measurement
• theta_unit (pint.quantity) – auxiliary theta’s unit
pymicra.physics.airDensity_from_theta_v(data, units, notation=None, inplace_units=True, use_means=False, return_full_df=True)

Calculates moist air density using p = rho R_dry T_virtual

Parameters:
• data (pandas.DataFrame) – data used to calculate air density
• units (dict) – dictionary of units
• notation (pymicra.Notation) – notation to be used
• inplace_units (bool) – whether or not to update the units in place. If False, units are returned too.
• use_means (bool) – whether or not to use averages of pressure and virtual temperature, instead of the means plus fluctuations
pymicra.physics.dewPointTemp(theta, e)

Calculates the dew point temperature. theta has to be in Kelvin and e in kPa

pymicra.physics.dryAirDensity_from_p(data, units, notation=None, inplace_units=True)

Calculates dry air density NEEDS IMPROVEMENT REGARDING HANDLING OF UNITS

pymicra.physics.latent_heat_water(T)

Calculates the latent heat of evaporation for water

Receives T in Kelvin and returns the latent heat in J/g
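A common linear fit for this quantity, shown here purely for illustration (this exact fit is an assumption; it is not necessarily the one Pymicra uses):

```python
def latent_heat_water(T):
    """Latent heat of vaporization of water in J/g, for T in Kelvin.
    Linear fit L ~ 2500.8 - 2.36*(T - 273.15), an assumed common
    approximation valid near typical atmospheric temperatures."""
    return 2500.8 - 2.36 * (T - 273.15)
```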

pymicra.physics.perfGas(p=None, rho=None, R=None, T=None, gas=None)

Returns the only value that is not provided in the ideal gas law

P.S.: type is used to identify None objects because this way it also works with pandas objects

pymicra.physics.ppxv2density(data, units, notation=None, inplace_units=True, solutes=[])

Calculates density of solutes based on their molar concentration (ppmv, ppbv, etc.), not to be confused with mass concentration (ppm, ppb, etc.).

Uses the relation $$\rho_x = \frac{C p}{\theta R_x}$$

Parameters:
• data (pandas.DataFrame) – dataset of micromet variables
• units (dict) – dict of pint units
• notation (pymicra.Notation) – notation to be used here
• inplace_units (bool) – whether or not to treat the dict units in place
• solutes (list or tuple) – solutes to consider when doing this conversion

Returns: input data plus calculated density columns
Return type: pandas.DataFrame
pymicra.physics.satWaterPressure(T, unit='kelvin')

Returns the saturated water vapor pressure according to eq. (3.97) of Wallace and Hobbs, page 99.

e0, b, T1 and T2 are constants specific for water vapor

Parameters:
• T (float) – thermodynamic temperature

Returns: saturated vapor pressure of water (in kPa)
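The equation has the Magnus-type form e_s = e0 exp(b (T − T1)/(T − T2)). A self-contained sketch with Bolton-style constants (these constant values are assumptions for illustration; the ones in Pymicra come from Wallace and Hobbs and may differ slightly):

```python
import math

E0 = 0.6112   # kPa (assumed value)
B = 17.67     # dimensionless (assumed value)
T1 = 273.15   # K (assumed value)
T2 = 29.65    # K (assumed value)

def sat_water_pressure(T):
    """Saturation vapor pressure of water in kPa, for T in Kelvin,
    using the Magnus-type form e_s = e0*exp(b*(T - T1)/(T - T2))."""
    return E0 * math.exp(B * (T - T1) / (T - T2))
```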
pymicra.physics.specific_humidity_from_ppxv(data, units, notation=None, return_full_df=True, inplace_units=True)

Calculates the specific humidity q from values of molar concentration of water (ppmv, ppthv, etc.).

The equation is

$$q = \frac{m_v\, x}{(m_v - m_d)\, x + m_d}$$

where x is the molar concentration in ppxv.

Parameters:
• data (pandas.dataframe) – dataset
• units (dict) – units dictionary
• notation (pymicra.Notation) – notation to be used
• return_full_df (bool) – whether to return only the calculated series or the full df
• inplace_units (bool) – whether to return only a dict with the units of the new variables or include them in “units”

Returns: outdata – specific humidity
Return type: pandas.Series, pandas.DataFrame
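With m_v and m_d the molar masses of water vapor and dry air, the conversion above can be sketched in plain Python (the molar-mass values are standard assumed numbers, not taken from pymicra.constants):

```python
M_V = 18.015   # molar mass of water vapor, g/mol (assumed value)
M_D = 28.964   # molar mass of dry air, g/mol (assumed value)

def q_from_molar_fraction(x):
    """Specific humidity (g/g) from the molar concentration x
    (dimensionless, e.g. a ppmv value divided by 1e6)."""
    return (M_V * x) / ((M_V - M_D) * x + M_D)
```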
pymicra.physics.theta_fluc_from_theta_v_fluc(data, units, notation=None, return_full_df=True, inplace_units=True)

Derived from theta_v = theta(1 + 0.61 q)

Parameters:
• data (pandas.dataframe) – dataframe with q, q', theta, theta_v'
• units (dict) – units dictionary
• notation (pymicra.Notation) – Notation object or None

Returns: standard deviation of the thermodynamic temperature
Return type: float
pymicra.physics.theta_from_theta_s(data, units, notation=None, return_full_df=True, inplace_units=True)

Calculates thermodynamic temperature using sonic temperature measurements

From Schotanus, Nieuwstadt, de Bruin; DOI 10.1007/BF00164332

$$\theta_s = \theta\,(1 + 0.51 q)\left(1 - (v_n/c)^2\right)^{1/2} \approx \theta\,(1 + 0.51 q)$$

Parameters:
• data (pandas.dataframe) – dataset
• units (dict) – units dictionary
• notation (pymicra.Notation) – notation to be used

Returns: thermodynamic temperature
Return type: pandas.Series
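Inverting the approximate relation gives theta = theta_s / (1 + 0.51 q). A minimal sketch of that inversion (illustrative; the real function works column-wise on the dataframe using the notation object):

```python
def theta_from_theta_s(theta_s, q):
    """Thermodynamic temperature (K) from sonic temperature (K) and
    specific humidity q (g/g), using theta_s ~ theta*(1 + 0.51*q)."""
    return theta_s / (1.0 + 0.51 * q)
```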
pymicra.physics.theta_from_theta_v(data, units, notation=None, return_full_df=True, inplace_units=True)

Calculates thermodynamic temperature from virtual temperature measurements

$$\theta_v \approx \theta\,(1 + 0.61 q)$$

Parameters:
• data (pandas.dataframe) – dataset
• units (dict) – units dictionary
• notation (pymicra.Notation) – notation to be used

Returns: thermodynamic temperature
Return type: pandas.DataFrame or Series
pymicra.physics.theta_s_from_c(data, units, notation=None, return_full_df=True, inplace_units=True)

Calculates sonic temperature using speed of sound

From Schotanus, Nieuwstadt, de Bruin; DOI 10.1007/BF00164332

$$\theta_s \approx \frac{c^2}{403}$$

Parameters:
• data (pandas.dataframe) – dataset
• units (dict) – units dictionary
• notation (pymicra.Notation) – notation to be used

Returns: sonic temperature
Return type: pandas.Series
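The relation is simple enough to sketch directly (illustrative; the real function operates on the dataframe’s speed-of-sound column):

```python
def theta_s_from_c(c):
    """Sonic temperature (K) from the speed of sound c (m/s),
    using theta_s ~ c**2 / 403."""
    return c * c / 403.0
```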
pymicra.physics.theta_std_from_theta_v_fluc(data, units, notation=None)

Derived from theta_v = theta(1 + 0.61 q)

Parameters:
• data (pandas.dataframe) – dataframe with q, q', theta, theta_v'
• units (dict) – units dictionary
• notation (pymicra.Notation) – Notation object or None

Returns: standard deviation of the thermodynamic temperature
Return type: float

## pymicra.tests

This module contains functions that test certain conditions on pandas.DataFrames, to be used with qcontrol().

They all return True for the columns that pass the test and False for the columns that fail the test.

pymicra.tests.check_RA(data, detrend=True, detrend_kw={'how': 'linear'}, RAT_vars=None, RAT_points=50, RAT_significance=0.05)

Performs the Reverse Arrangement Test in each column of data

Parameters:
• data (pandas.DataFrame) – data to apply the RAT to, column by column
• detrend (bool) – whether to detrend the data first
• detrend_kw (dict) – keywords to pass to pymicra.detrend
• RAT_vars – list of variables to which to apply the RAT
• RAT_points (int) – if it’s an int N, reduce each column to N points by averaging. If None, the whole columns are used.
• RAT_significance (float) – significance with which to apply the RAT

Returns: valid – True or False for each column. If True, the column passed the test.
Return type: pd.Series
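The core of the Reverse Arrangement Test is the count of “reverse arrangements” (pairs i < j with x_i > x_j), compared with the bounds expected for a stationary random series. A minimal sketch (the normal-approximation bounds used here are an assumption; exact tables, e.g. Bendat and Piersol’s, are more precise for small N):

```python
import math

def reverse_arrangements(x):
    """Count pairs (i, j), i < j, with x[i] > x[j]."""
    n = len(x)
    return sum(1 for i in range(n) for j in range(i + 1, n) if x[i] > x[j])

def passes_rat(x, z_crit=1.96):
    """True if the reverse-arrangement count lies within the bounds
    expected for independent random data (normal approximation).
    Mean and variance of the count are N(N-1)/4 and N(2N+5)(N-1)/72."""
    n = len(x)
    mean = n * (n - 1) / 4.0
    var = n * (2 * n + 5) * (n - 1) / 72.0
    a = reverse_arrangements(x)
    return abs(a - mean) <= z_crit * math.sqrt(var)
```

A strongly trended (non-stationary) series produces a count far from the mean and fails the test.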
pymicra.tests.check_limits(data, tables, max_percent=1.0, replace_with='interpolation')

Checks dataframe for lower and upper limits. If found, the offending points are substituted by the linear trend of the run. The number of faulty points is also checked for each column against the maximum accepted percentage of faults, max_percent.

Parameters:
• data (pandas dataframe) – dataframe to be checked
• tables (pandas.dataframe) – dataframe with the lower and upper limits for variables
• max_percent (float) – number from 0 to 100 representing the maximum percentage of faulty points accepted by this test

Returns:
• df (pandas.DataFrame) – input data with the faulty points substituted by the linear trend of the run
• valid (pandas.Series) – True for the columns that passed this test, False for the columns that didn’t
pymicra.tests.check_nans(data, replace_with='interpolation', max_percent=100)

Checks data for NaN values

max_percent is here only for compatibility reasons but is deprecated

pymicra.tests.check_numlines(fname, numlines=18000, failverbose=False)

Checks the length of a file against a correct value. Returns False if the length is wrong and True if it is right.

Parameters:
• fname (string) – path of the file to check
• numlines (int) – correct number of lines that the file has to have

Returns: True or False
pymicra.tests.check_replaced(replaced, max_count=180)

Sums and checks if the number of replaced points is larger than the maximum accepted

pymicra.tests.check_spikes(data, chunk_size='2min', detrend={'how': 'linear'}, visualize=False, vis_col=1, max_consec_spikes=10, cut_func=<function <lambda>>, replace_with='interpolation', max_percent=1.0)

Applies spikes-check according to Vickers and Mahrt (1997)

Parameters:
• data (pandas.dataframe) – data to de-spike
• chunk_size (str, int) – size of chunks to consider. If str, should be a pandas offset string. If int, a number of lines.
• detrend (bool) – whether to detrend the data and work with the fluctuations, or to work with the absolute series
• detrend_kw (dict) – dict of keywords to pass to pymicra.detrend in order to detrend data (if detrend==True)
• visualize (bool) – whether or not to visualize the interpolation occurring
• vis_col (str, int or list) – the column(s) to visualize when seeing the interpolation (only effective if visualize==True)
• max_consec_spikes (int) – maximum number of consecutive spikes to actually be considered spikes and substituted
• cut_func (function) – function used to define spikes
• replace_with (str) – method to use when replacing spikes. Options are ‘interpolation’ and ‘trend’.
• max_percent (float) – maximum percentage of spikes to allow
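The cut_func keyword defines what counts as a spike; the default used elsewhere in this documentation is a 4-standard-deviation criterion, lambda x: (abs(x - x.mean()) > abs(x.std()*4.)). A plain-Python sketch of that criterion applied to one chunk (illustrative only; the real test works on pandas objects):

```python
import statistics

def spike_mask(chunk, n_std=4.0):
    """Flag points farther than n_std standard deviations from the
    chunk mean (a sketch of the default cut_func criterion)."""
    mean = statistics.fmean(chunk)
    std = statistics.pstdev(chunk)
    return [abs(v - mean) > n_std * std for v in chunk]
```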
pymicra.tests.check_stationarity(data, tables, detrend={'how': 'movingmean', 'window': 900}, trend={'how': 'movingmedian', 'window': '1min'})

Checks the difference between the maximum and minimum values of the run trend against an upper limit. This aims to flag nonstationary runs.

First detrends. (Then maybe takes the std, depending on moving_std_kw.) Then checks the trend.

pymicra.tests.check_std(data, tables, detrend={'how': 'linear'}, chunk_size='2min', failverbose=False)

Checks dataframe for columns with too small of a standard deviation

Parameters:
• data (pandas.DataFrame) – dataset whose standard deviation to check
• tables (pandas.DataFrame) – dataset containing the standard deviation limits for each column
• detrend (dict) – keywords to pass to pymicra.detrend. If empty, no detrending is done.
• chunk_size (str) – pandas datetime offset string

Returns: valid – containing True or False for each column. True means the column passed the test.
Return type: pandas.Series
pymicra.tests.check_std_stationarity(data, tables, detrend={'how': 'movingmean', 'window': 900}, moving_std_kw={})

Checks the difference between the maximum and minimum values of the run trend against an upper limit. This aims to flag nonstationary runs.

First detrends. Then takes the std.

## pymicra.util

Module for general utilities

TO DO LIST:
• INCLUDE DROPOUT TEST
• INCLUDE THIRD MOMENT TEST
• CHANGE NOTATION IN QCONTROL’S SUMMARY
pymicra.util.correctDrift(drifted, correct_drifted_vars=None, correct=None, get_fit=True, write_fit=True, fit_file='correctDrift_linfit.params', apply_fit=True, show_plot=False, return_plot=False, units={}, return_index=False)
Parameters:
• drifted (pandas.DataFrame) – dataset with the averages that need to be corrected
• correct (pandas.DataFrame) – dataset with the correct averages
• correct_drifted_vars (dict) – dictionary where every key is a var in the correct dataset and its value is its correspondent in the drifted dataset
• get_fit (bool) – whether or not to fit a linear relation between both datasets. Generally slow. Should only be done once.
• write_fit (bool) – if get_fit == True, whether or not to write the linear fit to a file (recommended)
• fit_file (string) – where to write the linear fit (if one is written), or from where to read the linear fit (if no fit is written)
• apply_fit (bool) – whether or not to apply the linear fit and correct the data (at least get_fit and fit_file must be true)
• show_plot (bool) – whether or not to show a drifted vs correct plot, to see if the fit is good
• units (dict) – if given, creates a {fit_file}.units file recording the units the data must be in to be correctly corrected
• return_index (bool) – whether to return the indexes of the points used in the calculation. Serves to check the regression.

Returns: outdf – drifted dataset corrected with the correct dataset
Return type: pandas.DataFrame
pymicra.util.qc_discard(files, fileconfig, read_files_kw={'clean_dates': False, 'only_named_cols': False, 'parse_dates': False, 'return_units': False}, std_limits={}, std_detrend={'how': 'linear'}, dif_limits={}, maxdif_detrend={}, maxdif_trend={'how': 'movingmedian', 'window': 600}, chunk_size=1200, passverbose=False, failverbose=True, failshow=False, passshow=False, passshow_vars=None, outdir='1_filtered', summary_file='filter_summary.csv', full_report=None)

Function that applies various quality control tests to a set of data files and re-writes the successful files in another directory. A list of currently applied tests is found below, in order of application. When some variable or set of points fails a test, the whole file is discarded.

Tests are based on Vickers and Mahrt, “Quality control and flux sampling problems for tower and aircraft data”.

• standard deviation (STD) check: runs with a standard deviation lower than a pre-determined value are left out. Keywords: std_limits, std_detrend
• maximum difference (stationarity) test: runs whose trend has a maximum difference greater than a certain value are left out. This excludes non-stationary runs. Activate it by passing a dif_limits keyword. Keywords: dif_limits, maxdif_detrend, maxdif_trend
Parameters:
• files (list) – list of filepaths
• fileconfig (pymicra.fileConfig object or str) – datalogger configuration object used for all files in the list, or path to a dlc file
• read_files_kw (dict) – keywords to pass to pymicra.timeSeries. Default is {'parse_dates':False} because parsing dates at every file is slow, so this makes the whole process faster. However, {'parse_dates':True, 'clean_dates':False} is recommended if time is not a problem, because the window and chunk_size keywords may then be given as, for example, '2min' instead of 1200 (the equivalent number of points).
• dif_limits (dict) – keys must be names of variables and values must be upper limits for the maximum difference of values that the linear trend of the run may have
• maxdif_detrend (dict) – keywords to pass to pymicra.detrend when detrending for the max difference test. If empty, no detrending is done.
• maxdif_trend (dict) – keywords to pass to pymicra.detrend when trending for the max difference test (passed to pymicra.data.trend). If empty, no trending is used. This is used in the max difference test, since the difference is taken between the max and min values of the trend, not of the raw timeSeries.
• chunk_size (str) – string representing the time length of the chunks used in the standard deviation check. Default is “2Min”. Passing None will not separate into chunks. It’s recommended to use rolling functions in this case (might be slow).
• passverbose (bool) – whether or not to show details on the successful runs
• failverbose (bool) – whether or not to show details on the failed runs
• passshow (bool) – whether or not to plot the successful runs on screen
• passshow_vars (list) – list of columns to plot if a run is successful
• failshow (bool) – whether or not to plot the failed runs on screen
• outdir (str) – name of the directory in which to write the successful runs. The directory must already exist.
• summary_file (str) – path of the file to be created with the summary of the runs. Will be overwritten if it already exists.

Returns: ext_summary – dict with the extended summary, which has the paths of the files that got “stuck” in each test along with the successful ones
Return type: pandas.DataFrame
pymicra.util.qc_replace(files, fileconfig, read_files_kw={'clean_dates': False, 'only_named_cols': False, 'parse_dates': False, 'return_units': False}, begin_date=None, end_date=None, file_lines=None, nans_test=True, lower_limits={}, upper_limits={}, spikes_test=True, visualize_spikes=False, chunk_size=1200, spikes_vis_col='u', spikes_detrend={'how': 'linear'}, spikes_func=<function <lambda>>, max_consec_spikes=10, max_replacement_count=180, replace_with='interpolation', passverbose=False, failverbose=True, passshow=False, failshow=False, passshow_vars=None, outdir='0_replaced', summary_file='control_replacement.csv', replaced_report='rreport.csv', full_report=None)

This function applies various quality control checks/tests to a set of data files and re-writes the successful files in another directory. This specific function focuses on point analysis that can be fixed by replacements (removing spikes by interpolation, etc). A run fails only if (1) it is beyond the accepted dates, (2) it has a different number of lines than usual, or (3) the number of points replaced is greater than max_replacement_count.

A list of applied tests is found below in order of application. The only test available by default is the spikes test. All others depend on their respective keywords.

Tests are based on Vickers and Mahrt, “Quality control and flux sampling problems for tower and aircraft data”.

• date check: files outside a date range are left out. Keywords: begin_date, end_date
• lines test: checks each file to see if it has a certain number of lines. Files with a different number of lines fail this test. Keywords: file_lines
• NaN’s filter: checks for any NaN values. NaNs are replaced with interpolation or linear trend. Activate it by passing nans_test=True. Keywords: nans_test
• boundaries filter: checks for values in any column lower than a pre-determined lower limit or higher than an upper limit. If found, these points are replaced (interpolated or otherwise). Keywords: lower_limits, upper_limits
• spikes filter: searches for spikes according to the user’s definition. Spikes are replaced (interpolated or otherwise). Keywords: spikes_test, spikes_func, visualize_spikes, spikes_vis_col, max_consec_spikes, chunk_size
• replacement count test: checks the total number of points that were replaced (by the NaN, boundaries and spikes tests) against the max_replacement_count keyword. Fails if any column has more replacements than that. Keywords: max_replacement_count
Parameters:
• files (list) – list of filepaths
• fileconfig (pymicra.fileConfig object or str) – datalogger configuration object used for all files in the list, or path to a dlc file
• read_files_kw (dict) – keywords to pass to pymicra.timeSeries. Default is {'parse_dates':False} because parsing dates at every file is slow, so this makes the whole process faster. However, {'parse_dates':True, 'clean_dates':False} is recommended if time is not a problem, because the window and chunk_size keywords may then be given as, for example, '2min' instead of 1200 (the equivalent number of points).
• file_lines (int) – number of lines a “good” file must have. The run fails if it has any other number of lines.
• begin_date (str) – dates before this automatically fail
• end_date (str) – dates after this automatically fail
• nans_test (bool) – whether or not to apply the NaNs test
• lower_limits (dict) – keys must be names of variables and values must be lower absolute limits for the values of each var
• upper_limits (dict) – keys must be names of variables and values must be upper absolute limits for the values of each var
• spikes_test (bool) – whether or not to check for spikes
• spikes_detrend (dict) – keywords to pass to pymicra.detrend when detrending for spikes. If it’s empty, no detrending is done.
• visualize_spikes (bool) – whether or not to plot the spikes identification and interpolation (useful for calibration of spikes_func). Only one column is visualized at a time. This is set with the spikes_vis_col keyword.
• spikes_vis_col (str) – column to use to visualize spikes
• spikes_func (function) – function used to look for spikes. Can be defined using numpy/pandas notation for methods with lambda functions. Default is: lambda x: (abs(x - x.mean()) > abs(x.std()*4.))
• replace_with (str) – method to use when replacing the spikes. Options are ‘interpolation’ and ‘trend’.
• max_consec_spikes (int) – limit of consecutive spike points to be interpolated. If the number of consecutive “spikes” is more than this, then all those points are taken as not actually being spikes and no replacement is done. So if max_consec_spikes=0, no spike replacement is ever done.
• chunk_size (str) – string representing the time length of the chunks used in the spikes check. Default is “2Min”. Passing None will not separate into chunks. It’s recommended to use rolling functions in this case (might be slow).
• max_replacement_count (int) – maximum number of replaced points a variable can have in a run. If the number of replaced points is larger than this, the run fails and is discarded. Generally this should be about 1% of file_lines.
• passverbose (bool) – whether or not to show details on the successful runs
• failverbose (bool) – whether or not to show details on the failed runs
• passshow (bool) – whether or not to plot the successful runs on screen
• passshow_vars (list) – list of columns to plot if a run is successful
• failshow (bool) – whether or not to plot the failed runs on screen
• outdir (str) – name of the directory in which to write the successful runs. The directory must already exist.
• summary_file (str) – path of the file to be created with the summary of the runs. Will be overwritten if it already exists.

Returns: ext_summary – dict with the extended summary, which has the paths of the files that got “stuck” in each test along with the successful ones
Return type: pandas.DataFrame
pymicra.util.qcontrol(files, fileconfig, read_files_kw={'clean_dates': False, 'only_named_cols': False, 'parse_dates': False, 'return_units': False}, begin_date=None, end_date=None, file_lines=None, nans_test=True, accepted_nans_percent=1.0, lower_limits={}, upper_limits={}, accepted_bound_percent=1.0, spikes_test=True, visualize_spikes=False, spikes_vis_col='u', spikes_detrend={'how': 'linear'}, spikes_func=<function <lambda>>, max_consec_spikes=3, accepted_spikes_percent=1.0, max_replacement_count=180, std_limits={}, std_detrend=True, std_detrend_kw={'how': 'movingmean', 'window': 900}, dif_limits={}, maxdif_detrend=True, maxdif_detrend_kw={'how': 'movingmean', 'window': 900}, maxdif_trend=True, maxdif_trend_kw={'how': 'movingmedian', 'window': 600}, RAT=False, RAT_vars=None, RAT_detrend=True, RAT_detrend_kw={'how': 'linear'}, RAT_points=50, RAT_significance=0.05, chunk_size=1200, replace_with='interpolation', trueverbose=False, falseverbose=True, falseshow=False, trueshow=False, trueshow_vars=None, outdir='quality_controlled', summary_file='qcontrol_summary.csv', replaced_report=None, full_report=None)

Function that applies various quality-control tests to a set of data files and re-writes the successful files to another directory. A list of the currently-applied tests is given below, in order of application. The only test applied by default is the spikes test; all others depend on their respective keywords.

• date check: files outside the given date range are left out. - keywords: begin_date, end_date
• lines test: checks each file to see if it has a certain number of lines. Files with a different number of lines fail this test. - keywords: file_lines
• NaN’s test: checks for NaN values. NaNs are replaced with interpolation or the linear trend. If the percentage of NaNs is greater than accepted_nans_percent, the run is discarded. Activate it by passing nans_test=True. - keywords: nans_test, accepted_nans_percent
• boundaries test: runs with values in any column lower than a pre-determined lower limit or higher than a pre-determined upper limit are left out. - keywords: lower_limits, upper_limits
• spikes test: checks for spikes and replaces them according to some keywords. Runs with more than a certain percentage of spikes are left out. - keywords: spikes_test, spikes_func, visualize_spikes, spikes_vis_col, max_consec_spikes, accepted_spikes_percent, chunk_size
• replacement count test: checks the total number of points that were replaced (by the NaN, boundaries and spikes tests) against the max_replacement_count keyword. Fails if any column has more replacements than that. - keywords: max_replacement_count
• standard deviation (STD) check: runs with a standard deviation lower than a pre-determined value (generally close to the sensor precision) are left out. - keywords: std_limits, std_detrend, std_detrend_kw
• maximum difference (stationarity) test: runs whose trend has a maximum difference greater than a certain value are left out. This excludes non-stationary runs. Activate it by passing the dif_limits keyword. - keywords: dif_limits, maxdif_detrend, maxdif_detrend_kw, maxdif_trend, maxdif_trend_kw
• reverse arrangement test (RAT): runs that fail the reverse arrangement test for any variable are left out. - keywords: RAT, RAT_vars, RAT_detrend, RAT_detrend_kw, RAT_points, RAT_significance
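For reference, the reverse arrangement test counts the pairs (i, j) with i < j and x_i > x_j; for a stationary random series this count has a known mean and variance, so a count outside the significance bounds indicates a trend. The sketch below is an illustration of the test using the usual normal approximation, not Pymicra’s implementation:

```python
from statistics import NormalDist

import numpy as np

def reverse_arrangement_test(x, significance=0.05):
    """Return True if x is consistent with stationarity (passes the RAT)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    # A: number of "reverse arrangements" (pairs i < j with x[i] > x[j])
    A = sum(np.sum(x[i] > x[i + 1:]) for i in range(N - 1))
    mean = N * (N - 1) / 4.0                # E[A] for a random series
    var = N * (2 * N + 5) * (N - 1) / 72.0  # Var[A] for a random series
    z = (A - mean) / np.sqrt(var)
    # Two-sided critical value from the standard normal distribution
    zcrit = NormalDist().inv_cdf(1.0 - significance / 2.0)
    return bool(abs(z) <= zcrit)
```

A strongly trending series yields a count far from the expected value and fails, which is why detrending first (RAT_detrend with {“how”:”linear”}) matters: the test should reject runs that are non-random beyond their removable trend.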
Parameters: files (list) – list of filepaths fileconfig (pymicra.fileConfig object or str) – datalogger configuration object used for all files in the list of files, or path to a dlc file. read_files_kw (dict) – keywords to pass to pymicra.timeSeries. Default is {‘parse_dates’:False} because parsing dates at every file is slow, so this makes the whole process faster. However, {‘parse_dates’:True, ‘clean_dates’:False} is recommended if time is not a problem, because the window and chunk_size keywords may then be given as, for example, ‘2min’, instead of 1200, which is the equivalent number of points. file_lines (int) – number of lines a “good” file must have. Fails if the run has any other number of lines. begin_date (str) – dates before this automatically fail. end_date (str) – dates after this automatically fail. nans_test (bool) – whether or not to apply the NaNs test accepted_nans_percent (float) – maximum percentage of NaNs that is acceptable in a run std_limits (dict) – keys must be names of variables and values must be upper limits for the standard deviation. std_detrend (bool) – whether or not to work with the fluctuations of the data on the spikes and standard deviation tests. std_detrend_kw – keywords to be passed to pymicra.detrend specifically to be used on the STD test. lower_limits (dict) – keys must be names of variables and values must be lower absolute limits for the values of each var. upper_limits (dict) – keys must be names of variables and values must be upper absolute limits for the values of each var. dif_limits (dict) – keys must be names of variables and values must be upper limits for the maximum difference of values that the linear trend of the run may have. maxdif_detrend (bool) – whether to detrend data before checking for differences. maxdif_detrend_kw (dict) – keywords to pass to pymicra.detrend when detrending for the max difference test. 
maxdif_trend (bool) – whether to check for differences using the trend, instead of raw points (which can be the fluctuations or the original absolute values of the data, depending on whether maxdif_detrend==True or False). maxdif_trend_kw (dict) – keywords to pass to pymicra.data.trend when obtaining the trend for the max difference test, since the difference is taken between the max and min values of the trend, not of the series. Default = {‘how’:’linear’}. spikes_test (bool) – whether or not to check for spikes. spikes_detrend (dict) – keywords to pass to pymicra.detrend when detrending for spikes. visualize_spikes (bool) – whether or not to plot the spikes identification and interpolation (useful for calibration of spikes_func). Only one column is visualized at a time. This is set with the spikes_vis_col keyword. spikes_vis_col (str) – column to use to visualize spikes. spikes_func (function) – function used to look for spikes. Can be defined using numpy/pandas notation for methods with lambda functions. Default is: lambda x: (abs(x - x.mean()) > abs(x.std()*4.)) replace_with (str) – method to use when replacing the spikes. Options are ‘interpolation’ and ‘trend’. max_consec_spikes (int) – limit of consecutive spike points to be interpolated. Beyond this, spikes are left as they are in the output. accepted_spikes_percent (float) – limit percentage of spike points in the data. If spike points represent a higher percentage than this, the run fails the spikes check. chunk_size (str) – string representing the time length of chunks used in the spikes and standard deviation checks. Default is “2Min”. Passing None will not separate into chunks. It’s recommended to use rolling functions in this case (might be slow). RAT (bool) – whether or not to perform the reverse arrangement test on the data. RAT_vars (list) – list containing the names of variables to go through the reverse arrangement test. If None, all variables are tested. 
RAT_points (int) – number of final points to apply the RAT on. If 50, the run will be averaged down to a 50-point run. RAT_significance – significance level at which to apply the RAT. RAT_detrend_kw – keywords to be passed to pymicra.detrend specifically to be used on the RA test. {“how”:”linear”} is strongly recommended for this case. trueverbose (bool) – whether or not to show details on the successful runs. falseverbose (bool) – whether or not to show details on the failed runs. trueshow (bool) – whether or not to plot the successful runs on screen. trueshow_vars (list) – list of columns to plot if the run is successful. falseshow (bool) – whether or not to plot the failed runs on screen. outdir (str) – name of the directory in which to write the successful runs. The directory must already exist. summary_file (str) – path of the file to be created with the summary of the runs. Will be overwritten if it already exists. ext_summary – dict with the extended summary, which has the path of the files that got “stuck” in each test along with the successful ones pandas.DataFrame
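A typical call might look like the following. The file paths, configuration file name and numeric limits are hypothetical; all keyword names are taken from the signature above:

```python
from glob import glob

import pymicra as pm

# Hypothetical configuration file and raw-data directory
fconfig = pm.fileConfig(from_file='lake.config')
files = sorted(glob('raw_data/*.csv'))

summary = pm.util.qcontrol(files, fconfig,
    file_lines=18000,                    # e.g. 30-min runs sampled at 10 Hz
    lower_limits={'theta_v': 240.},      # hypothetical bounds, in Kelvin
    upper_limits={'theta_v': 330.},
    spikes_test=True,
    max_consec_spikes=3,
    accepted_spikes_percent=1.,
    std_limits={'u': 0.03},              # close to the sensor precision
    dif_limits={'u': 5.0},               # stationarity threshold
    RAT=True,
    read_files_kw={'parse_dates': True, 'clean_dates': False},
    chunk_size='2min',                   # usable because dates are parsed
    outdir='quality_controlled',
    summary_file='qcontrol_summary.csv')
```

The returned DataFrame summarizes how many runs passed or failed each test, and the successful files are written to outdir.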
pymicra.util.separateFiles(files, dlconfig, outformat='out_%Y-%m-%d_%H:%M.csv', outdir='', verbose=False, firstflag='.first', lastflag='.last', save_ram=False, frequency='30min', quoting=0, use_edges=False)

Separates files into smaller files (30-minute files by default). Useful for output files such as those produced by Campbell Sci dataloggers, which can have days of data in one single file.

Parameters: files (list) – list of file paths to be separated dlconfig (pymicra datalogger configuration file) – tells how the dates are displayed inside the file outformat (str) – the format of the file names to output outdir (str) – the path of the directory in which to output the files verbose (bool) – whether to print to the screen firstflag (str) – flag to append to the name of the first file to be created lastflag (str) – flag to append to the name of the last file to be created save_ram (bool) – if the amount of files is too big for pandas to load into your RAM, this should be set to True frequency – the frequency at which to separate the files quoting (int) – for pandas (see the read_csv documentation) use_edges (bool) – use this carefully. This concatenates the last few lines of a file to the first few lines of the next file in case they don’t finish on a nice round time with respect to the frequency None
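A minimal usage sketch, with hypothetical file paths and configuration; the keyword names come from the signature above:

```python
from glob import glob

import pymicra as pm

# Hypothetical datalogger configuration describing the date columns
dlconfig = pm.fileConfig(from_file='campbell.config')

# Split multi-day Campbell Sci output files into 30-minute runs
pm.util.separateFiles(sorted(glob('raw_output/*.dat')), dlconfig,
    outformat='out_%Y-%m-%d_%H:%M.csv',
    outdir='separated',        # must already exist
    frequency='30min',
    use_edges=True,            # stitch chunks that straddle file boundaries
    verbose=True)
```

The first and last output files are suffixed with firstflag and lastflag (‘.first’ and ‘.last’ by default), since they are likely incomplete runs.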