How to Automate Data Preprocessing for AnEn Computation on Cheyenne

Introduction

This tutorial shows how to use the data preprocessing tools (gribConverter, windFieldCalculator) in the AnEn package to reformat data into the form that can be directly used by the computation tools (similarityCalculator, analogGenerator, RAnEn) to generate analog ensembles. In addition, this tutorial shows how to automate and parallelize the process on the Cheyenne supercomputer.

This tutorial assumes basic knowledge of Bash scripting and that the AnEn package has already been successfully installed. More information on how to install AnEn on Cheyenne can be found here.

This tutorial also assumes that you have already built the AnEn tools. Instructions for building the tools can be found here.

Data Preparation and Goals

A large collection (~5.4 TB) of data from the North American Mesoscale (NAM) forecast model has been downloaded for the period from October 2008 to July 2018. The original files are .g2.tar archives, with names like nam_218_2008102900.g2.tar. The files have already been arranged into one folder per YearMonth, as follows:

> ls
200810  200907  201004  201101  201110  201207  201304  201401  201410  201507  201604  201701  201710  201807
200811  200908  201005  201102  201111  201208  201305  201402  201411  201508  201605  201702  201711  
200812  200909  201006  201103  201112  201209  201306  201403  201412  201509  201606  201703  201712  
200901  200910  201007  201104  201201  201210  201307  201404  201501  201510  201607  201704  201801  
200902  200911  201008  201105  201202  201211  201308  201405  201502  201511  201608  201705  201802
200903  200912  201009  201106  201203  201212  201309  201406  201503  201512  201609  201706  201803
200904  201001  201010  201107  201204  201301  201310  201407  201504  201601  201610  201707  201804
200905  201002  201011  201108  201205  201302  201311  201408  201505  201602  201611  201708  201805
200906  201003  201012  201109  201206  201303  201312  201409  201506  201603  201612  201709  201806

Original NAM forecast files are organized by day, cycle time, and lead time; each file is a compilation of parameters at all available locations/grid points. However, AnEn requires a different format, in which a single file carries the parameter, grid point, time, and lead time information together (see the ncdump output in the Results section). Our goal is to convert the model output to this format.

Since the total data size exceeds 5 TB, it is better practice to avoid one huge output file and instead break the data into chunks. Therefore, the files are grouped by month. Note that October 2008 through July 2018 spans 118 months (3 + 9 × 12 + 7), which is why the driver script later submits 118 jobs.

Scripts

I have prepared two scripts. The first is the PBS job script, which does the following tasks:

  • Identifies which folders are currently being processed by searching for a lock file, and which folders have already been processed by searching for the expected output data file;
  • Selects only one folder that has not yet been processed;
  • Untars the tarballs into a temporary folder;
  • Converts submessages to independent messages using grib_copy (reference: grib_copy example 4);
  • Converts grb2 files to NetCDF files;
  • Computes and adds wind direction and speed fields to the NetCDF file;
  • Exits normally.
#!/bin/bash

# The name of the task
#PBS -N process_each_month

# The project account
#PBS -A MY.PROJECT.ACCOUNT

# The time resources requested
#PBS -l walltime=10:00:00

# The queue type
#PBS -q regular

# Combine standard output and errors
#PBS -j oe

# The computing resources requested
#PBS -l select=1:ncpus=1:mem=109GB:ompthreads=1

# I would like to receive an email when tasks
# (a)bort, (b)egin, and (e)nd.
#
#PBS -m abe

# And this is the email
#PBS -M my.email@server.com

# These are the available folders. The folder names are also going to be the names of NetCDF files.
declare -a arr=("200810" "200811" "200812" "200901" "200902" "200903" "200904" "200905" "200906" "200907" "200908" "200909" "200910" "200911" "200912" "201001" "201002" "201003" "201004" "201005" "201006" "201007" "201008" "201009" "201010" "201011" "201012" "201101" "201102" "201103" "201104" "201105" "201106" "201107" "201108" "201109" "201110" "201111" "201112" "201201" "201202" "201203" "201204" "201205" "201206" "201207" "201208" "201209" "201210" "201211" "201212" "201301" "201302" "201303" "201304" "201305" "201306" "201307" "201308" "201309" "201310" "201311" "201312" "201401" "201402" "201403" "201404" "201405" "201406" "201407" "201408" "201409" "201410" "201411" "201412" "201501" "201502" "201503" "201504" "201505" "201506" "201507" "201508" "201509" "201510" "201511" "201512" "201601" "201602" "201603" "201604" "201605" "201606" "201607" "201608" "201609" "201610" "201611" "201612" "201701" "201702" "201703" "201704" "201705" "201706" "201707" "201708" "201709" "201710" "201711" "201712" "201801" "201802" "201803" "201804" "201805" "201806" "201807")

# Define the configuration file for gribConverter.
# The file can be found at 
# https://github.com/Weiming-Hu/AnalogsEnsemble/blob/master/apps/app_gribConverter/example/commonConfig.cfg
#
converterConfig=/glade/u/home/wuh20/scratch/data/forecasts/forecasts.cfg

# Define the output destination
destDir=/glade/u/home/wuh20/flash/forecasts_new/

# Define the lock file name
lockFile=.lock

for month in "${arr[@]}"; do
    # This is the data folder
    monthDir=/glade/u/home/wuh20/scratch/data/forecasts/$month
    
    # Whether this folder has already been processed
    if [ -f $destDir/$month.nc ]; then
        echo Month $month has been processed. Skip this month.
        continue
    fi
    
    # Check whether this directory exists
    if [ ! -d $monthDir  ]; then
        echo Directory not found: $monthDir
        exit 1
    fi
    
    cd $monthDir
    
    # Lock this directory
    if [ -f $lockFile  ]; then
        echo Directory $monthDir is in process. Skip this directory.
        continue
    else
        echo Lock directory $monthDir
        touch $lockFile
    fi
    
    # Create a folder to store the original files
    if [ ! -d original-extract-files ]; then
        echo Create folder for original extract files ...
        mkdir original-extract-files
    fi
    
    # Unpack tar files
    echo Extracting from tar files ...
    if [ -f log_extract  ]; then
        rm log_extract
    fi

    for tarFile in *.g2.tar; do
        # --skip-old-files keeps re-runs idempotent: files that have already
        # been extracted are silently left alone
        tar --skip-old-files -xvf $tarFile -C original-extract-files >> log_extract
    done
    
    echo Flattening messages with submessages ...
    for file in `ls original-extract-files`; do
        # Skip files that have already been flattened into the month folder
        if [ ! -f $file ]; then
            /glade/u/home/wuh20/github/AnalogsEnsemble/dependency/install/bin/grib_copy original-extract-files/$file $file
        fi
    done
    
    # Convert grb2 files
    echo Converting grb2 files ...
    if [ ! -f $month-original.nc ]; then
        /glade/u/home/wuh20/github/AnalogsEnsemble/output/bin/gribConverter -c $converterConfig --folder ./ -o $month-original.nc -v 3 > log_converter
    fi

    # Add wind fields
    echo Adding wind fields ...
    if [ ! -f $month.nc ]; then
        /glade/u/home/wuh20/github/AnalogsEnsemble/output/bin/windFieldCalculator --file-in $month-original.nc --file-type Forecasts --file-out $month.nc -U 1000IsobaricInhPaU -V 1000IsobaricInhPaV --dir-name 1000IsobaricInhPaDir --speed-name 1000IsobaricInhPaSpeed -v 3 > log_wind
    fi

    # Move the data file elsewhere
    echo Moving data to $destDir
    mv $month.nc $destDir
    
    # Cleaning
    echo Cleaning ...
    rm -rf original-extract-files
    rm $month-original.nc
    rm *.grb2
    
    echo Releasing the folder lock
    rm $lockFile
   
    # Each job only processes one month
    echo Finished processing month $month
    exit 0
done
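
Before automating the submissions, the workflow can be tested on a single month by submitting this script once by hand. Assuming it is saved as batch_process.pbs (the name the driver script below uses):

qsub batch_process.pbs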

The first PBS script defines essentially all the work. The tasks for each month are entirely independent of each other, so they can be fully parallelized. To avoid confusion between tasks, each job processes exactly one folder instead of continuing to the next available folder. The following script deals with batch-submitting the jobs to the Cheyenne scheduler.

There is one remaining problem: if two jobs start simultaneously, there is a slight possibility that they will pick the same folder, because the folder lock mechanism based on file creation is not atomic (a second job can pass the lock check before the first job creates the lock file). A simple workaround is to submit a new job only when no jobs are queued, which means all previously submitted jobs have already started.
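
A more robust alternative, sketched below, would base the lock on mkdir, which is atomic on POSIX file systems: when two jobs race, only one mkdir call succeeds and the loser skips the folder. This fragment is a possible drop-in replacement for the lock block in the PBS script above, not what this tutorial actually uses.

    # Atomic lock: mkdir either creates the lock directory or fails, never both
    if mkdir .lock.d 2>/dev/null; then
        echo Lock directory $monthDir
    else
        echo Directory $monthDir is in process. Skip this directory.
        continue
    fi

(The matching release at the end of the loop body would then be rmdir .lock.d instead of rm $lockFile.) For this tutorial, serializing the job starts with the following driver script is sufficient.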

#!/bin/bash

# Define the total number of jobs to create,
# one for each of the 118 monthly folders.
totalJobs=118

# Define the counter start.
submittedJobs=0

while true; do
    # Count the queued jobs by looking for entries in the
    # "Q" (queued) state in the regular queue
    number=`qstat | grep "Q regular" | wc -l`
    
    echo The number of queued jobs: $number
    echo The number of submitted jobs: $submittedJobs
    if (( number == 0  )); then
        echo There are no queued jobs. Submit a new one.
        qsub batch_process.pbs
        submittedJobs=$((submittedJobs + 1))
        if (( submittedJobs == totalJobs  )); then
            echo $submittedJobs jobs submitted. Done!
            exit 0
        fi
    fi
    sleep 10
done
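
Since the driver needs to stay alive while all 118 jobs cycle through the queue, it should be run in a way that survives logouts. A minimal sketch, assuming the driver is saved under the hypothetical name submit_jobs.sh and run from a login node:

# Run the driver in the background so it keeps going after logout;
# progress messages are collected in submit_jobs.log
chmod +x submit_jobs.sh
nohup ./submit_jobs.sh > submit_jobs.log 2>&1 &

A terminal multiplexer such as tmux or screen serves the same purpose.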

Results

After all the jobs have completed, we have the following files in the destination folder:

> ls
200810.nc  200910.nc  201010.nc  201110.nc  201210.nc  201310.nc  201410.nc  201510.nc	201610.nc  201710.nc
200811.nc  200911.nc  201011.nc  201111.nc  201211.nc  201311.nc  201411.nc  201511.nc	201611.nc  201711.nc
200812.nc  200912.nc  201012.nc  201112.nc  201212.nc  201312.nc  201412.nc  201512.nc	201612.nc  201712.nc
200901.nc  201001.nc  201101.nc  201201.nc  201301.nc  201401.nc  201501.nc  201601.nc	201701.nc  201801.nc
200902.nc  201002.nc  201102.nc  201202.nc  201302.nc  201402.nc  201502.nc  201602.nc	201702.nc  201802.nc
200903.nc  201003.nc  201103.nc  201203.nc  201303.nc  201403.nc  201503.nc  201603.nc	201703.nc  201803.nc
200904.nc  201004.nc  201104.nc  201204.nc  201304.nc  201404.nc  201504.nc  201604.nc	201704.nc  201804.nc
200905.nc  201005.nc  201105.nc  201205.nc  201305.nc  201405.nc  201505.nc  201605.nc	201705.nc  201805.nc
200906.nc  201006.nc  201106.nc  201206.nc  201306.nc  201406.nc  201506.nc  201606.nc	201706.nc  201806.nc
200907.nc  201007.nc  201107.nc  201207.nc  201307.nc  201407.nc  201507.nc  201607.nc	201707.nc  201807.nc
200908.nc  201008.nc  201108.nc  201208.nc  201308.nc  201408.nc  201508.nc  201608.nc	201708.nc
200909.nc  201009.nc  201109.nc  201209.nc  201309.nc  201409.nc  201509.nc  201609.nc	201709.nc

Each file now has the correct format for AnEn computation:

> ncdump -h 201801.nc 
netcdf \201801 {
dimensions:
	num_parameters = 17 ;
	num_chars = 50 ;
	num_stations = 262792 ;
	num_times = 31 ;
	num_flts = 53 ;
variables:
	char ParameterNames(num_parameters, num_chars) ;
	double ParameterWeights(num_parameters) ;
	char ParameterCirculars(num_parameters, num_chars) ;
	char StationNames(num_stations, num_chars) ;
	double Xs(num_stations) ;
	double Ys(num_stations) ;
	double Times(num_times) ;
	double FLTs(num_flts) ;
	double Data(num_flts, num_times, num_stations, num_parameters) ;
}
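
As a quick sanity check, every monthly file can be tested to confirm that it opens and carries the expected dimensions. A minimal sketch, assuming ncdump is on the PATH and using the destination folder from the PBS script:

#!/bin/bash
# Flag any monthly file that is unreadable or is missing the expected
# parameter dimension (17 parameters, as in the ncdump output above)
for f in /glade/u/home/wuh20/flash/forecasts_new/*.nc; do
    if ! ncdump -h "$f" | grep -q "num_parameters = 17"; then
        echo Possible problem with $f
    fi
done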
