Academic Alacrity

Unpacking The Data – An Analysis of USGS Workflow

Foreword

As an avid outdoor sports enthusiast (white water kayaking, search and rescue work, winter hiking, and rock climbing, among others), I can assure you that one of the most important aspects is preparation. Of course, this includes physical training and proper equipment, but much of the skill lies in awareness, and that means monitoring conditions like weather and streamflow.

In the latter case, there is also an ethical and conservation aspect, in that even a passing review of runoff and flow indicates the health of a stream, and by extension the environment around it, up to and including the climate itself.

With the above in mind, I have selected United States Geological Survey streamflow data (computed runoff for water years within the Commonwealth of Kentucky) as the target of this analysis. This data is updated in real time and is available on the USGS website (United States Geological Survey, 2024).

Selection

United States Geological Survey Streamflow – Computed Runoff for water years within the Commonwealth of Kentucky

The United States Geological Survey (USGS) is a scientific agency of the U.S. government, established in 1879 and operating with a current budget of 1.6 billion USD (H.R.2617 – 117th Congress, 2021-2022). Employing over 8,600 staff, it is tasked with studying natural resources, natural hazards, geology, geography, and the impacts of land use and climate change on these systems. It conducts research, monitors, and provides data on earthquakes, volcanoes, landslides, water resources, ecosystems, and mineral resources. The current USGS motto, “science for a changing world,” succinctly reflects this purpose.

The USGS plays a crucial role in informing policymakers, resource managers, and the public about the Earth’s dynamic processes and how they affect society and the environment.

Collection

For water measurements, the USGS primarily relies on a series of automated stream level gauges placed at strategic points along waterways. The core technology has remained unchanged since its inception, with only the transmission and recording devices evolving to take advantage of digital storage and networking (Nielsen & Norris, 2007). Each station utilizes an air pump connected to a small hose routed to the stream bed. At regular intervals (typically every 15 minutes) throughout each day, air is forced into the hose, equivalent to blowing bubbles through a drinking straw. As water depth increases, there is a corresponding increase in water pressure, and thus in resistance to the airflow. Measuring the air pressure required to overcome this resistance produces a depth value. In turn, comparing depth to a cross section of the stream at the point of sampling enables a flow rate calculation (a minimal sketch of this calculation follows the list below). This technology may seem primitive or overly complex (why not just use a float or sonar gauge?). However, it offers distinct advantages over other methods:

  • Robustness – The compressor, recording, and transmission equipment can be mounted safely out of the river’s reach in a single, sturdy housing. This eliminates the need for immersed mechanical devices that quickly clog or degrade and protects more sensitive components from damage during large flow events. It also enables easier maintenance access by moving the housing unit close to or directly adjacent to nearby roadways. Additionally, the only exposed component (the hose) is self-maintaining by nature of its operation. Further resiliency is added by including a battery backup, ensuring continued measurements during extreme weather events.
  • Accuracy – Unlike a fixed gauge that may overtop or “bottom out,” there are no practical limits to measurement. Heavy floods or severe droughts do not affect operation.
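The pressure-to-depth conversion itself is simple hydrostatics. Below is a minimal sketch of the idea in Python (my own illustration, not USGS code; the pressure value, channel width, and velocity are assumptions for the example):

```python
# Hypothetical illustration of a bubbler gauge's pressure-to-stage conversion.
# Hydrostatics: gauge pressure at the hose outlet P = rho * g * depth,
# so depth = P / (rho * g).

RHO_WATER = 1000.0  # density of fresh water, kg/m^3
G = 9.81            # gravitational acceleration, m/s^2

def stage_from_pressure(gauge_pressure_pa: float) -> float:
    """Return water depth (m) above the hose outlet for a given gauge pressure (Pa)."""
    return gauge_pressure_pa / (RHO_WATER * G)

def discharge_estimate(depth_m: float, width_m: float, velocity_ms: float) -> float:
    """Crude discharge estimate (m^3/s) for a rectangular cross section.

    Real stations use a surveyed stage-discharge rating curve rather than a
    simple rectangle; this only demonstrates the depth-to-flow step.
    """
    return depth_m * width_m * velocity_ms

# Example: about 19.6 kPa of back-pressure corresponds to roughly 2 m of water.
depth = stage_from_pressure(19_620)                       # ~2.0 m
print(f"depth: {depth:.2f} m")
print(f"flow:  {discharge_estimate(depth, 30.0, 1.5):.1f} m^3/s")
```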

Flow rates are typically expressed in cubic feet per second (cu ft/s) or cubic meters per second (m3/s), with one cubic foot equal to about 7.5 gallons, and one cubic meter equal to 1,000 liters (about the volume of a large refrigerator). These measurements can then be used to approximate the total runoff in a particular area.

As an example of scale, consider the Licking River. Notable as the primary inflow and outflow of Cave Run Lake (Cave Run, 2023), this tributary of the Ohio River is a medium-sized waterway, 488 kilometers (about 303 miles) in length, with a basin of about 9,600 km2 covering most of the Gateway and Cumberland Plateau regions of Eastern Kentucky. At Alexandria, Kentucky (just upstream from its confluence with the Ohio, across from Cincinnati), the Licking produces an average discharge of 162 m3/s (United States Geological Survey, 2024). That is enough to fill an Olympic-sized swimming pool roughly every 15 seconds, or, if we assume a modest head of 2 meters, a power output of approximately 3.2 million watts (~4,300 horsepower): about the demand of 2,600 households, or a large diesel locomotive at maximum tractive effort.
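Those equivalencies are easy to verify with the standard hydraulic power formula P = ρgQh, as in this quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the Licking River figures above.
RHO, G = 1000.0, 9.81    # water density (kg/m^3), gravity (m/s^2)
Q, HEAD = 162.0, 2.0     # mean discharge (m^3/s), assumed head (m)

power_w = RHO * G * Q * HEAD                     # hydraulic power P = rho*g*Q*h
print(f"{power_w / 1e6:.1f} MW")                 # ~3.2 MW
print(f"{power_w / 745.7:,.0f} hp")              # ~4,300 hp

POOL_M3 = 2500.0                                 # Olympic pool: 50 m x 25 m x 2 m
print(f"pool filled every {POOL_M3 / Q:.0f} s")  # ~15 s
```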

Kentucky presents a challenge by volume, in that it possesses the most running water of any contiguous state (it is surpassed only by Alaska). However, the methodology does not change. Kentucky’s measurements are taken from 258 stations placed throughout the commonwealth, providing a granular assessment of each of its major watersheds, all of which are in turn part of the Ohio River watershed, other than a small western area that drains directly to the Mississippi River. Note that the Ohio River is not only the largest tributary of the Mississippi River, it is over 25% larger in volume than the Mississippi itself at their confluence (average discharge 7,960 m3/s vs. 5,897 m3/s). In hydrological terms, then, the Mississippi is the tributary and the Ohio is the true main stem of the Mississippi River system (Van der Leeden, Troise, & Todd, 1990), making all of Kentucky’s major watersheds (e.g., the previously noted Licking River) second-tier oceanic drainages with far-reaching environmental ramifications.

Data

Data is presented in a simple table layout. Each data point (a row) represents a water year, running from 1901 to the present date. See the sample below:

Region  Year  Runoff (mm)  Runoff (in)  Rank  Percentile
KY      1901  578.24       22.77        31    75.00
KY      1902  536.85       21.14        44    64.52
KY      1903  694.83       27.36        11    91.13
KY      1904  331.53       13.05        104   16.13
KY      1905  346.66       13.65        98    20.97
KY      1906  471.94       18.58        70    43.55

For each water year, the total runoff is provided in millimeters and inches. The data is very straightforward, being a direct quantity measurement. I was briefly confused by the rank and percentile fields, in that no context is provided (rank vs. other states, projected totals, etc.). However, it quickly becomes apparent that the ranks and percentiles are measured against the other water years on record. This provides an at-a-glance view of which years were wetter, flood-prone, or drought-stricken.
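As a check on that interpretation, the sample rows above can be reproduced exactly if rank 1 is the wettest water year on record and the percentile is the share of all recorded water years with lower runoff. A minimal pandas sketch (my own reconstruction, not USGS code):

```python
import pandas as pd

# Reconstruction of the rank/percentile relationship in the sample table.
# Assumptions: rank 1 = wettest water year on record; percentile = percent
# of all recorded water years with lower runoff than the given year.
sample = pd.DataFrame({
    "year":      [1901, 1902, 1903, 1904, 1905, 1906],
    "runoff_mm": [578.24, 536.85, 694.83, 331.53, 346.66, 471.94],
    "rank":      [31, 44, 11, 104, 98, 70],   # ranked against all 124 years
})

N = 124  # water years on record
sample["percentile"] = ((N - sample["rank"]) / N * 100).round(2)
print(sample)  # reproduces 75.00, 64.52, 91.13, 16.13, 20.97, 43.55
```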

There is no historical author given, nor is any data available prior to 1901. This, along with the simple measurements, suggests that the format has remained unchanged since its creation, and that the data set is the product of several generations of staff.

Comparative

As the USGS is a national organization, it is necessary to look beyond US borders to locate a similar collection that is not outright identical. In this case, the United Kingdom provides a comparative source of data through the National River Flow Archive (NRFA, 2024). The NRFA does not enable downloading data by region as the USGS does, instead requiring selection of individual station data. As seen in the sample below, the data does not include a ranking or percentile calculation, though we may derive these manually (sketched further below).

Date        Runoff
1890-01-01  0.1
1890-01-02  0
1890-01-03  2.3
1890-01-04  2.8
1890-01-05  3.4

One notable difference is the granularity of the data points. While the USGS aggregates runoff by water year, the NRFA records a data point for each individual day.
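Deriving water-year totals, ranks, and percentiles from the NRFA's daily values is straightforward. Here is a sketch under stated assumptions (the file and column names follow the sample above, and a water year is taken as 1 October through 30 September, labeled by its ending year):

```python
import pandas as pd

# Aggregate hypothetical NRFA daily runoff into water years, then derive the
# rank and percentile columns that the USGS tables provide natively.
daily = pd.read_csv("nrfa_station.csv", parse_dates=["Date"])  # assumed file name

# Shifting dates forward three months places each Oct-Sep water year
# entirely within one calendar year (labeled by the ending year).
daily["water_year"] = (daily["Date"] + pd.DateOffset(months=3)).dt.year

annual = daily.groupby("water_year")["Runoff"].sum().reset_index()
annual["rank"] = annual["Runoff"].rank(ascending=False).astype(int)

n = len(annual)
annual["percentile"] = ((n - annual["rank"]) / n * 100).round(2)
print(annual.head())
```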

Uses

The USGS is itself a government agency and is often cited by other agencies for statistical data. One example of USGS data in use is the drought analysis performed by the Environmental Protection Agency (Drought, 2023). In that analysis, the EPA uses a visualization to highlight areas with dramatically reduced runoff:

From this visualization, it becomes apparent that an alarming portion of the United States is experiencing reduced streamflow and increasingly severe water shortages.

Ethical Issues

Other than the potentially misleading use of statistics by other entities (e.g., omitting nearby regions that experienced increased runoff from charts promoting drought awareness), I was unable to find any ethical issues within the data. Streamflow and runoff are firmly established as public interests and are not physically concealable. Collection efforts are largely automated, eliminating human involvement other than analysis. While the data may be affected by human activities, it is not feasible to attribute it to any single individual or group.

Contextual Interview

Peter J. Cinotto
Branch Chief for Operations
Associate Director – Kentucky
USGS Ohio-Kentucky-Indiana Water Science Center

As is likely apparent, no one person is responsible for the USGS’s 100-plus years of continuous, country-wide data collection. Also, given that the USGS is a large and fully bureaucratic agency, I was not confident I would be able to secure an interview within the available time frame (for this reason, I had prepared a standby data set and interview subject).

However, starting with a general missive to the USGS national information contact, and being passed along through several points of contact (all of whom, I note, were quite affable and helpful), I was eventually able to secure an on-site interview with Mr. Peter Cinotto, Branch Chief of Operations for the tri-state (Indiana / Kentucky / Ohio) regional office. Mr. Cinotto holds a Master of Science degree in Geology from the University of Colorado, has a 30-year background with the USGS, and previously worked as a well technician in various Texas oil fields. He continues to undergo professional development training at the USGS, much of it centered on in-house technologies and procedures.

Mr. Cinotto did not just provide his valuable time and insight for the interview. He also gave me an extensive tour of the lab and a look at the various instruments, fleet vehicles, submersibles, and other field equipment. All told, I spent over four hours on-site, and would have stayed longer were it not for another looming commitment. In truth, I learned far more from the tour and informal discussion after the interview than in the interview itself. This aspect is not particularly surprising, and that is why I would always recommend going on-site and developing a good rapport whenever possible.

One of the most intriguing aspects of the interview was discovering just how “open” the USGS is as an agency. I was already aware of the public data, but I was not aware at all (and quite shocked to learn) of the many coding and technical developments the agency has spearheaded and then made open source:

  • High fidelity acoustics (now found in home and concert audio).
  • Statistical models.
  • Various unmanned surface and submersible devices.
  • Wide distribution real-time satellite transmission.

For the “official” portion of the interview, I utilized the provided questionnaire and added a few questions of my own. Part of the skill of interviewing is “reading the room.” It was obvious Mr. Cinotto would be more amenable to an informal approach, and so, within the bounds of the provided questionnaire, that is how I elected to comport myself.

Interview Recording

Interview Questionnaire Summary 

Q1 – What is your role and/or relationship to the data? – 00:54

  • Ensure that the USGS Louisville tri-state office (Indiana / Kentucky / Ohio) collects defensible stream and runoff data. Engage with stakeholders (the civilian public, state agencies, the Army Corps of Engineers) to ensure the process and data meet their water resource management needs (flood protection, water supply, water quality).

“Whatever the case is, it is my job to ensure they have defensible data to do that with.”

Q2 – What training or experience helps you interpret the data? – 02:15

  • Master of Science (Geology).
  • 30-year background in well management from oil rigs.
  • USGS provides continual professional development training for all staff.
  • Environmental Statistics.
  • Electronics / Hardware construction.
  • Coding / Database.
  • Field procedures and safety.
  • Geology.
  • The USGS has created many of its measuring techniques and modeling procedures in house, so staff have immediate access to problem-solving expertise when questions arise.

“So if I have a question in statistics, I can call Bob Hurst who literally wrote the book on it, and he would, has spent an hour with me.”

Q3 – What would you like people to understand about the data and how to use it? – 04:53

  • All USGS data is publicly available, all the way back to the first measurements taken in 1888.
  • Long-term, highly granular (samples taken at intervals of 15 minutes at most), continuous data enables modeling of long-term trends that resist short-term or even generational analysis (e.g., climate change).
  • USGS scientists are available online to help convert data into useful information.

“If you’re going to, say, look at climate change, a lot of times looking at these decadal cycles you need fifty years of data just to get at that. We’re one of the few people in the world that has the ability to go back and have continuous, defensible data sets to allow you to do that.”

Q4 – What kinds of errors can people expect to find in the data? – 07:10

  • As the USGS is a primary data source, multiple redundancy procedures are in place to ensure the final certified data is error-free. However, real-time “provisional” data available online may contain errors from sensor malfunctions that are later corrected or omitted during the data certification process.
  • Sensors must meet extremely stringent accuracy requirements for time series data.
  • Less stringent requirements are allowed for high-tolerance binary data (is this road flooded: yes/no).

“Anything less than 4 millimeters (in accuracy tolerance) is not approved for collection.”

Q5 – How do you handle the uneven geometry of streambeds when measuring flow? – 09:03

  • Modern instrumentation utilizes ultrasound to accurately map the cross section of a stream and tracks suspended particulate to determine flow rate.
  • Prior to ultrasound, hand-operated “beeper” devices utilizing wire filaments and miniature turbines were drawn at intervals across the bed to create a cross section. These devices are still in use for verification and validation of the modern systems.

Q6 – Why do errors appear and how can we compensate for them? – 11:20

  • Most errors that occur are due to hardware faults in the field, which are corrected or, if necessary, omitted by the various procedures and redundancies during the certification process.
  • Estimates of values between sensor points are sometimes computed and carry a margin of error. These, however, are clearly marked in the metadata as computed estimates rather than certified collection.
  • The USGS has created in house many of the procedures used by other entities for data review and accuracy certification, including several published books and texts.

“Back in the day a filament might have stuck on a recorder or something like that, but then that’s part of why we check and review things.”

Q7 – What essential information does the data obscure or leave out? Who is most likely to be affected by those omissions? – 12:41

  • Network (sensor) density is a primary concern. Across the tri-state (KY/OH/IN) area, the USGS monitors 800+ sensors, but these are spread throughout an area of over 315,000 square kilometers. More sensors would be an obvious improvement, but funds and labor are always at a premium.
  • Anthropogenic influence. Sensors were primarily placed in rural areas, but the watersheds are increasingly encroached upon by human development. This can affect comparisons against historical data, as runoff trends are altered by human activity (e.g., pavement or gravel vs. forested land).

“Over the years, the number of gauges has gone up, but are in a lot more urban areas.”

Q8 – What is the pre‐history of the data set? What led to its collection? – 15:05

  • Prior to the USGS, water availability measurements came through a disparate network of samples, estimations, and observations.
  • The USGS came about to create a unified network of collection hardware and procedures to gauge available water resources and assist other agencies (e.g., the then-Weather Bureau). The first USGS gauge was established on the Rio Grande in New Mexico as a proving unit. The site became a testing ground for the technologies and procedures that underpin collection methods used to the present day.

“The way we do it has evolved, but the core of it is still that (NM testing Camp) at heart.”

Q9 – How is it used by the organization that created it? – 17:15

  • The data is used as a benchmark for underlying systems (primarily climate). Long-term, continuous data enables finding trends (if any) over periods beyond the typical human sphere of awareness.
  • The data is also valuable when looking for specific event probabilities. Cited examples include flood statistics, flood probabilities, and habitat assessments.

“High resolution data allows to, really, look at an accurate picture of what’s happening because we’ve got enough data to pick up those small trends, and it’s long enough to pick up the underlying signal too.”

Q10 – How was similar data collected in the past? – 19:22

  • Primarily utilizing a Stevens Recorder (a continuous paper feed device attached to a weighted float that moved the writing needle as the water rose and fell). Local staff (paid a then-generous 1 USD a day) maintained the devices (changing paper, fixing jams, etc.). Runners periodically collected the paper tapes for transport to the National Archives.
  • Offices remained regional until 1913, when they were dissolved and consolidated in Washington, DC. Complaints from stakeholders eventually resulted in the re-establishment of regional offices in 1938.
  • In 1972, satellite transmission at sensor emplacements enabled real-time collection and eliminated the need for paper runners.
  • As of 2024, an effort is underway to replace satellite transmission with cell networks where possible to further reduce latency.

Q11 – How are such data collected differently in other places? – 21:41

  • For the most part, data collection is standardized. The USGS works closely with its international counterparts, most of whom utilize USGS methods.
  • All USGS technology, methods, and code are made open source for use by agencies inside and outside of the US.
  • USGS technology also extends to ostensibly unrelated fields. Examples include high-fidelity audio equipment.
  • The USGS does not participate in regulatory activities, allowing it to work with other agencies and stakeholders who may have competing interests.

“There’s a big push now, where we have a metadata wizard, we put out to help people write you know, good, consistent, accurate metadata.”

“That’s the thing, we are not regulatory. We are scientists and technicians…”

Q12 – What are some of the logistical procedures with data collection (i.e., maintenance of gauges)? – 27:50

  • The regional offices operate smaller field offices and a staff of technicians to perform maintenance and field studies as needed. The Louisville office currently houses 70 technicians for the tri-state area, along with a fleet of 10 staffed and numerous unmanned watercraft.
  • Procedures are carefully employed to make maximum use of staff availability.

“We’ve just got a good process to do it. Process is everything.”

Q13 – What type of networking is used to collect data from stream gauges? – 32:51

  • The complexity of a site depends on its needs. Some are quite complex, with multiple sensors on a local network and a fiber optic link to central mainframes. Most sites are self-maintaining sensors with a satellite link.
  • Sites are currently being migrated to cell networks.
  • Some units in high-concentration urban areas use mesh networks where possible for cost savings.

“It really just depends on the needs of the site and how we can make it the most cost effective.”

Visualizing Data Life Cycle

As previously mentioned, a vitally important facet of understanding the USGS workflow is that, as a primary source, the USGS does not collect data from external entities. Rather, it generates data by gathering measurements from its own sensors and placements. As Mr. Cinotto repeatedly emphasized (00:54), the USGS mission is to provide defensible data to stakeholders, often with disparate and even competing interests. To successfully fulfill this mission, the USGS is not itself a stakeholder and does not consume its own data. In short, it offers no policy recommendations and no critical analysis beyond the strict conversion of raw data to readable units and charts.

Instead, the USGS process revolves around ensuring the accuracy of the data collected. Arrays of automated sensors take measurements and relay raw data to regional offices via satellite or cellular networks. These measurements are augmented and verified by manual collection efforts utilizing firmly established and reliable technologies dating back to the late 1800s. Captured data is made immediately available in real time for use by stakeholders, clearly marked as “provisional data.” Post collection, all data goes through a review process employing an array of heuristics: comparisons against historical trends, known limits, and manual samples. Any inaccuracy is immediately traced to the offending source. The most common issue is a temporary sensor failure, which might produce a “zero” result. If such data can be replaced with an accurate manual measurement, or an average from nearby sensors, it is marked with the type of correction applied and remitted back through the certification process. Otherwise, the data is discarded. In either case, corrective action is immediately applied at the point of collection to avoid repeat errors.

Once data is verified against all review procedures, it is certified and placed into permanent storage, making it publicly available for consumption by stakeholders.

The following flowchart provides a step-by-step walkthrough of the USGS process from sensor to public display.

  1. Sensors take measurements, passed on to regional systems.
  2. Raw data is converted to human readable metrics.
  3. Collected data is made available to the public in real time, clearly marked as “provisional data.”
  4. Data certification process begins. See previous summary for details.
  5. Known good data is marked certified, remitted to storage, and made available for public consumption.
  6. Errors are reviewed for correction or discard. Corrections are provisional and passed back through certification process.
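As a toy illustration of steps 4 through 6 (my own sketch, not USGS software), the certification step can be thought of as a status transition driven by review heuristics:

```python
from dataclasses import dataclass, field

# Toy model of the provisional -> certified/discarded life cycle above.
@dataclass
class Reading:
    station: str
    value_m3s: float
    status: str = "provisional"          # provisional -> certified | discarded
    notes: list = field(default_factory=list)

def certify(reading: Reading, plausible=(0.0, 10_000.0)) -> Reading:
    """Stand-in for the real review heuristics (known limits, historical
    trends, manual samples). Here only a simple range check is applied."""
    low, high = plausible
    if low <= reading.value_m3s <= high:
        reading.status = "certified"
    else:
        reading.status = "discarded"
        reading.notes.append("failed range check; sensor flagged for field review")
    return reading

print(certify(Reading("Licking River at Alexandria", 162.0)))  # certified
print(certify(Reading("Licking River at Alexandria", -1.0)))   # discarded
```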

Comparative Visualizations

Runoff data is an interesting combination of high sample volume (124 water years) and widely differing scales across its comparative data points (amount, rank, and percentile). For example, rankings run from 1 to the number of years recorded (124 thus far), runoff amounts range from roughly 13 to 27 inches, and percentiles run from 0 to 100. As a result, the data resists charts that include all points while still providing meaningful information to the user.

To help ameliorate this, I removed the rank and percentile columns, leaving only the amounts. Since the end goal is to convey the visible patterns (if any) in runoff totals, the single measure of yearly runoff is sufficient.

Charts 

After noting the surprising lack of any trend toward runoff extremes or reduced totals in recent years, I elected to create three charts, as seen below. All three run counter to the anecdotal hypothesis of increased weather extremes or reduced runoff over the past 100 years.

Scatter

The scatter chart is arguably the least effective at spotting trends (or the lack thereof), but it does clearly indicate a rhythmic pattern in the runoff spread.

Area

I consider the area chart the most effective for this analysis. I had planned to add a trend line, but a cursory glance shows there is no need. Over the last 124 years the state has maintained a consistent runoff pattern, with spikes and troughs almost rhythmic in nature.

Radar

I added a radar chart out of personal preference. It is a favorite of mine, and I simply wished to see how the runoff data would appear. While the lack of trend is again visible, the format is not as effective as a line or area chart. Moreover, the high sample count renders the chart overly busy and unprofessional looking.
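For anyone wishing to reproduce these charts, a matplotlib sketch along the following lines generates all three variants (the CSV file name and column headers are assumptions based on the sample table earlier in this post):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Sketch of the three chart types discussed above.
df = pd.read_csv("ky_runoff_water_years.csv")   # hypothetical export
years, runoff = df["Year"], df["Runoff (in)"]

fig = plt.figure(figsize=(12, 4))

ax1 = fig.add_subplot(1, 3, 1, title="Scatter")
ax1.scatter(years, runoff, s=8)

ax2 = fig.add_subplot(1, 3, 2, title="Area")
ax2.fill_between(years, runoff, alpha=0.5)

# Radar: wrap the water years around a circle and close the loop;
# with 124 points this gets busy, as noted above.
theta = np.linspace(0, 2 * np.pi, len(df), endpoint=False)
ax3 = fig.add_subplot(1, 3, 3, projection="polar", title="Radar")
ax3.plot(np.append(theta, theta[0]), np.append(runoff, runoff.iloc[0]))

plt.tight_layout()
plt.savefig("runoff_charts.png", dpi=150)
```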

References

Cave Run. (2023, December). Daniel Boone National Forest – Cave Run Lake. Forest Service National Website. https://www.fs.usda.gov/recarea/dbnf/recarea/?recid=39320

Drought. (2023, November 1). Climate Change Indicators: Drought. EPA. https://www.epa.gov/climate-indicators/climate-change-indicators-drought

H.R.2617 – 117th Congress (2021-2022): Consolidated Appropriations Act, 2023. (2022, December 29). https://www.congress.gov/bill/117th-congress/house-bill/2617

NRFA. (2024, January). National River Flow Archive. https://nrfa.ceh.ac.uk/

Nielsen, J. P., & Norris, J. M. (2007). From the river to you—USGS real-time streamflow information: U.S. Geological Survey Fact Sheet 2007–3043. Retrieved from http://pubs.usgs.gov/fs/2007/3043/

United States Geological Survey. (2024). USGS Current Conditions for USGS 03254520 Licking River at Hwy 536 near Alexandria, KY. Retrieved from https://waterdata.usgs.gov/monitoring-location/03254520/#parameterCode=00065&period=P7D&showMedian=false

Van der Leeden, F., Troise, F. L., & Todd, D. K. (1990). The Water Encyclopedia (Second ed.). Chelsea, Michigan: Lewis Publishers.

Author: Damon Caskey

