
2020-10-17 by Flo Kra

* fixes/improvements:
    * detect the data format (float or integer) not only via Python's .is_integer() method, but also check whether the raw data contains a dot (see the sketch after this list).
	Don't treat float values whose decimals are all 0 as integer, as that is probably not intended when the CSV contains decimals in that column.
    * detect data formats only in the first data row, as they should not change within a CSV file, and importing different data types under the same field name into InfluxDB is problematic
    * do not import values that do not match the specified/detected data type,
	e.g. when a CSV contains empty cells in some rows, which would otherwise be treated as a string ("") or an int with value 0
    * do not exit the script/stop importing on insert errors (these were mostly caused by inconsistent/wrong data in the CSV and should no longer happen, as the data types are now checked before importing)
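
A condensed sketch of the new integer check (mirroring the updated `isinteger()` helper in `csv-to-influxdb.py`):

```python
def isinteger(value):
    # float(value).is_integer() alone also returns True for values such as
    # "20.0", so additionally require that the raw CSV string contains no dot
    try:
        return float(value).is_integer() and value.find('.') == -1
    except (ValueError, TypeError):
        return False
```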

* added features:
    * --dryrun switch:
	Do not change anything in the DB. Also enables --showdata.

    * --showdata switch:
    Print detailed information to the console about what will be done with the data (or would be done, when using --dryrun).

    * --tspass (or -tp) switch:
    Do not convert timestamps; instead pass them through as they appear in the CSV (for use e.g. with CSV exports made with Chronograf, where the timestamps are already in an InfluxDB-compatible format).

    * --datatypes parameter:
    Force data type for each column specified in the --fieldcolumns parameter.
    The following data types can be specified: int, float, str, bool
    usage example: --fieldcolumns temperature,humidity,barometer --datatypes temperature=float,humidity=int,barometer=float

    * allow specifying to which **retention policy** the data should be imported.
    Specify it as --dbname database.retentionpolicy, as you would in InfluxQL (see the sketch below).
    I missed this possibility when importing old, already aggregated data that I didn't want to end up in the default RP.
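
A minimal sketch of how the retention-policy suffix is handled internally (host, database and data here are hypothetical; the real script builds the points from the CSV):

```python
from influxdb import InfluxDBClient

# an optional ".rp" suffix on --dbname selects the target retention policy
dbname, rpname = "mydb.daily", None
if '.' in dbname:
    dbname, rpname = dbname.split('.', 1)

client = InfluxDBClient('localhost', 8086, database=dbname)
points = [{"measurement": "value", "time": "2020-06-01T00:00:00Z",
           "fields": {"temp_avg": 17.2}, "tags": {"sensor": "garden"}}]
if rpname:
    client.write_points(points, retention_policy=rpname)
else:
    client.write_points(points)
```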
FloKra, 3 years ago · commit 46bdad14f6
3 changed files with 284 additions and 46 deletions
  1. CHANGELOG.md (+35 -0)
  2. README.md (+95 -5)
  3. csv-to-influxdb.py (+154 -41)

CHANGELOG.md (+35 -0)

@@ -0,0 +1,35 @@
+# csv-to-influxdb
+
+forked from the original version: https://github.com/fabio-miranda/csv-to-influxdb
+
+## 2020-10-17 by Flo Kra
+
+* fixes/improvements:
+    * detect the data format (float or integer) not only via Python's .is_integer() method, but also check whether the raw data contains a dot. 
+	Don't treat float values whose decimals are all 0 as integer, as that is probably not intended when the CSV contains decimals in that column. 
+    * detect data formats only in the first data row, as they should not change within a CSV file, and importing different data types under the same field name into InfluxDB is problematic
+    * do not import values that do not match the specified/detected data type, 
+	e.g. when a CSV contains empty cells in some rows, which would otherwise be treated as a string ("") or an int with value 0
+    * do not exit the script/stop importing on insert errors (these were mostly caused by inconsistent/wrong data in the CSV and should no longer happen, as the data types are now checked before importing)
+
+* added features:
+    * --dryrun switch: 
+	Do not change anything in the DB. Also enables --showdata.
+
+    * --showdata switch: 
+    Print detailed information to the console about what will be done with the data (or would be done, when using --dryrun).
+	
+    * --tspass (or -tp) switch: 
+    Do not convert timestamps; instead pass them through as they appear in the CSV (for use e.g. with CSV exports made with Chronograf, where the timestamps are already in an InfluxDB-compatible format).
+
+    * --datatypes parameter: 
+    Force data type for each column specified in the --fieldcolumns parameter. 
+    The following data types can be specified: int, float, str, bool
+    usage example: --fieldcolumns temperature,humidity,barometer --datatypes temperature=float,humidity=int,barometer=float
+    
+    * allow specifying to which **retention policy** the data should be imported. 
+    Specify it as --dbname database.retentionpolicy, as you would in InfluxQL. 
+    I missed this possibility when importing old, already aggregated data that I didn't want to end up in the default RP. 
+
+  
+

README.md (+95 -5)

@@ -1,8 +1,24 @@
-# csv-to-influxdb
+# csv-to-influxdb-ext
 Simple python script that inserts data points read from a csv file into a influxdb database.
 
 To create a new database, specify the parameter ```--create```. This will drop any database with a name equal to the one supplied with ```--dbname```.
 
+#### Changes/Improvements compared to the original version: 
+
+* Improved guessing of data types from the data found in the first row:
+  * number without a . --> integer
+  * number that contains a . --> float
+  * string that is "true" or "false" --> bool
+
+  Unlike in the original version, these data types are then used for that column throughout the entire file. If a value in another row does not fit (e.g. is empty, or is a string like "NaN" or ""), that value is skipped; unlike with the original code, this does not stop the rest of the import. If a row does not contain any valid field data at all, it is skipped entirely and not written to the database. 
+
+* Possibility to specify the data types on the command line (*--datatypes*)
+
+* Possibility to have a look at what's going on *before* actually writing anything to the database, by simply specifying *--dryrun* (see the example below this list).
+
+* Show the data being processed on the console (*--showdata*)
+
+* Possibility to specify a target retention policy rather than only the database name
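+
+For example, a preview run against a hypothetical `data.csv` that writes nothing to the database:
+
+```
+python csv-to-influxdb.py --dbname test --input data.csv --fieldcolumns value --dryrun
+```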
+
 ## Usage
 
 ```
@@ -28,7 +44,7 @@ optional arguments:
                         User name.
   -p [PASSWORD], --password [PASSWORD]
                         Password.
-  --dbname [DBNAME]     Database name.
+  --dbname [DBNAME]     Database name. Specify target Retention Policy: [DBNAME].[RPNAME]
   --create              Drop database and create a new one.
   -m [METRICNAME], --metricname [METRICNAME]
                         Metric column name. Default: value
@@ -39,22 +55,32 @@ optional arguments:
                         1970-01-01 00:00:00
   -tz TIMEZONE, --timezone TIMEZONE
                         Timezone of supplied data. Default: UTC
+  -tp, --tspass         Pass the timestamp from the CSV directly to InfluxDB (no
+                        conversion). Use only if the format is already InfluxDB-compatible.
   --fieldcolumns [FIELDCOLUMNS]
                         List of csv columns to use as fields, separated by
                         comma, e.g.: value1,value2. Default: value
+  --datatypes           Force the data types for the fields specified in --fieldcolumns,
+                        e.g.: value1=int,value2=float,value3=bool,name=str
+                        Valid types: int, float, str, bool
   --tagcolumns [TAGCOLUMNS]
                         List of csv columns to use as tags, separated by
                         comma, e.g.: host,data_center. Default: host
   -g, --gzip            Compress before sending to influxdb.
   -b BATCHSIZE, --batchsize BATCHSIZE
                         Batch size. Default: 5000.
+  --showdata            Print detailed information to the console about what will
+                        be done with the data (or would be done, when using --dryrun).
+  --dryrun              Do not change anything in the DB. Also enables --showdata.
 
 ```
 
-## Example
+## Examples
+
+#### 1. Considering the csv file:
+
 
-Considering the csv file:
 ```
 timestamp,value,computer
 1970-01-01 00:00:00,51.374894,0
 1970-01-01 00:00:01,74.562764,1
@@ -65,8 +91,72 @@ timestamp,value,computer
 1970-01-01 00:00:06,98.670792,3
 1970-01-01 00:00:07,69.532011,0
 1970-01-01 00:00:08,39.198964,0
 ```
 
+
 The following command will insert the file into a influxdb database:
 
-```python csv-to-influxdb.py --dbname test --input data.csv --tagcolumns computer --fieldcolumns value```
+```
+python csv-to-influxdb.py --dbname test --input data.csv --tagcolumns computer --fieldcolumns value
+```
+
+#### 2. Another example:
+
+
+```
+timestamp,temperature,humidity,sensor
+1970-01-01 00:00:00,17.2,55,garden
+1970-01-01 00:00:01,17.3,56,garden
+1970-01-01 00:00:02,17.1,57,garden
+1970-01-01 00:00:03,16.9,55,garden
+1970-01-01 00:00:04,16.7,53,garden
+1970-01-01 00:00:05,16.8,52,garden
+1970-01-01 00:00:06,17.0,55,garden
+1970-01-01 00:00:07,17.1,57,garden
+1970-01-01 00:00:08,17.2,60,garden
+```
+
+
+Command:
+
+
+```
+python csv-to-influxdb.py --dbname test --input data.csv --tagcolumns sensor --fieldcolumns temperature,humidity --datatypes temperature=float,humidity=int
+```
+
+
+Here --datatypes can be omitted if the types can be clearly identified from the first data row.
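+
+For instance, since the first data row contains `17.2` (with a decimal point) and `55` (without one), the same types would be detected automatically by:
+
+```
+python csv-to-influxdb.py --dbname test --input data.csv --tagcolumns sensor --fieldcolumns temperature,humidity
+```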
+
+
+#### 3. Importing historic aggregated data to a different Retention Policy named "daily":
+
+
+```
+timestamp,temp_avg,hum_avg,sensor
+2020-06-01 00:00:00,17.2,55,garden
+2020-06-02 00:00:00,17.3,56,garden
+2020-06-03 00:00:00,17.1,57,garden
+2020-06-04 00:00:00,16.9,55,garden
+2020-06-05 00:00:00,16.7,53,garden
+2020-06-06 00:00:00,16.8,52,garden
+2020-06-07 00:00:00,17.0,55,garden
+2020-06-08 00:00:00,17.1,57,garden
+2020-06-09 00:00:00,17.2,60,garden
+```
+
+
+Command:
+
+
+```
+python csv-to-influxdb.py --dbname test.daily --input data.csv --tagcolumns sensor --fieldcolumns temp_avg,hum_avg
+```
+
+

csv-to-influxdb.py (+154 -41)

@@ -3,6 +3,7 @@ import gzip
 import argparse
 import csv
 import datetime
+import json
 from pytz import timezone
 
 from influxdb import InfluxDBClient
@@ -38,7 +39,12 @@ def str2bool(value):
 def isinteger(value):
         try:
             if(float(value).is_integer()):
-                return True
+                # .is_integer() alone is not a reliable test, as it also returns True
+                # for a float with 0 decimals (e.g. 20.0); additionally require that
+                # the raw value contains no dot before treating it as an integer
+                return value.find('.') == -1
             else:
                 return False
         except:
@@ -47,13 +53,26 @@ def isinteger(value):
 
 def loadCsv(inputfilename, servername, user, password, dbname, metric, 
     timecolumn, timeformat, tagcolumns, fieldcolumns, usegzip, 
-    delimiter, batchsize, create, datatimezone, usessl):
+    delimiter, batchsize, create, datatimezone, usessl, showdata, dryrun, datatypes, tspass):
 
     host = servername[0:servername.rfind(':')]
     port = int(servername[servername.rfind(':')+1:])
+    
+    if dryrun:
+        showdata = True
+    
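+    # an optional ".rp" suffix on --dbname selects the target retention policy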
+    rpname = False
+    if dbname.find('.') != -1:
+        print("dbname contains a retention policy.")
+        tmpdbname = dbname.split('.')
+        dbname = tmpdbname[0]
+        rpname = tmpdbname[1]
+        print("dbname: " + dbname)
+        print("rpname: " + rpname)
+    
     client = InfluxDBClient(host, port, user, password, dbname, ssl=usessl)
 
-    if(create == True):
+    if(create == True and dryrun == False):
         print('Deleting database %s'%dbname)
         client.drop_database(dbname)
         print('Creating database %s'%dbname)
@@ -66,58 +85,131 @@ def loadCsv(inputfilename, servername, user, password, dbname, metric,
         tagcolumns = tagcolumns.split(',')
     if fieldcolumns:
         fieldcolumns = fieldcolumns.split(',')
-
+    
+    print()
+    
+    fields_datatypes = dict()
+    
+    if datatypes:
+        tmpdatatypes = datatypes.split(',')
+        print("specified data types:")
+        for tmpdatatype in tmpdatatypes:
+            dt = tmpdatatype.split('=')
+            fields_datatypes[dt[0]] = dt[1]
+            print("column '" + dt[0] + "' => " + dt[1])
+    else:
+        print("guessing data types from data in CSV row 2...")
+        
     # open csv
     datapoints = []
     inputfile = open(inputfilename, 'r')
     count = 0
+    
     with inputfile as csvfile:
         reader = csv.DictReader(csvfile, delimiter=delimiter)
+        
         for row in reader:
-            datetime_naive = datetime.datetime.strptime(row[timecolumn],timeformat)
-
-            if datetime_naive.tzinfo is None:
-                datetime_local = timezone(datatimezone).localize(datetime_naive)
+            
+            if showdata:
+                print("Input: ", row)
+            
+            if not tspass:
+                datetime_naive = datetime.datetime.strptime(row[timecolumn],timeformat)
+                
+                if datetime_naive.tzinfo is None:
+                    datetime_local = timezone(datatimezone).localize(datetime_naive)
+                else:
+                    datetime_local = datetime_naive
+                    
+                timestamp = unix_time_millis(datetime_local) * 1000000 # in nanoseconds
             else:
-                datetime_local = datetime_naive
-
-            timestamp = unix_time_millis(datetime_local) * 1000000 # in nanoseconds
-
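+                # --tspass: hand the CSV timestamp to InfluxDB unmodified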
+                timestamp = row[timecolumn]
+            
             tags = {}
             for t in tagcolumns:
                 v = 0
                 if t in row:
                     v = row[t]
                 tags[t] = v
-
+            
             fields = {}
             for f in fieldcolumns:
                 v = 0
+                # initialize the skip flag up front so it is defined even when
+                # the column is missing from this row
+                skipfield = False
                 if f in row:
-                    if (isfloat(row[f])):
-                        v = float(row[f])
-                    elif (isbool(row[f])):
-                        v = str2bool(row[f])
+                    if count == 0 and not datatypes:
+                        # first row, guess data types ONLY from there and remember them for the following rows
+                        if (isinteger(row[f])):
+                            print("column '" + f + "' = '" + str(row[f]) + "' => int")
+                            fields_datatypes[f] = "int"
+                            v = int(float(row[f]))
+                        elif (isfloat(row[f])):
+                            print("column '" + f + "' = '" + str(row[f]) + "' => float")
+                            fields_datatypes[f] = "float"
+                            v = float(row[f])
+                        elif (isbool(row[f])):
+                            print("column '" + f + "' = '" + str(row[f]) + "' => bool")
+                            fields_datatypes[f] = "bool"
+                            v = str2bool(row[f])
+                        else:
+                            print("column '" + f + "' = '" + str(row[f]) + "' => str")
+                            fields_datatypes[f] = "str"
+                            v = row[f]
                     else:
-                        v = row[f]
-                fields[f] = v
-
-
-            point = {"measurement": metric, "time": timestamp, "fields": fields, "tags": tags}
-
-            datapoints.append(point)
-            count+=1
+                        # from the 2nd data row on, only use the data types guessed from row 1:
+                        # check that each value matches its column's data type and skip it if not
+                        # (useful when the CSV has a few missing values)
+                        if (fields_datatypes[f] == "int"):
+                            if (isinteger(row[f])):
+                                v = int(float(row[f]))
+                            else:
+                                skipfield = True
+                                print("CSV row " + str(count+2) + ": skipped field '" + f + "' as it has a different data type.")
+                        elif (fields_datatypes[f] == "float"):
+                            if (isfloat(row[f])):
+                                v = float(row[f])
+                            else:
+                                skipfield = True
+                                print("CSV row " + str(count+2) + ": skipped field '" + f + "' as it has a different data type.")
+                        elif (fields_datatypes[f] == "bool"):
+                            if (isbool(row[f])):
+                                v = str2bool(row[f])
+                            else:
+                                skipfield = True
+                                print("CSV row " + str(count+2) + ": skipped field '", f, "' as it has a different data type.")
+                        elif (fields_datatypes[f] == "str"):
+                            v = row[f]
+                        
+                if not skipfield:
+                    fields[f] = v
+                
+            if len(fields) > 0:
+                point = {"measurement": metric, "time": timestamp, "fields": fields, "tags": tags}
+                if showdata:
+                    print("Output: ", json.dumps(point, indent=3))
+    
+                datapoints.append(point)
+                count+=1
+            else:
+                print("CSV row " + str(count+2) + ": skipped as it contains no field values.")
+                count+=1
             
             if len(datapoints) % batchsize == 0:
                 print('Read %d lines'%count)
                 print('Inserting %d datapoints...'%(len(datapoints)))
-                response = client.write_points(datapoints)
-
-                if not response:
-                    print('Problem inserting points, exiting...')
-                    exit(1)
-
-                print("Wrote %d points, up to %s, response: %s" % (len(datapoints), datetime_local, response))
+                
+                #if showdata:
+                #    print(json.dumps(datapoints, indent=3))
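+                # in --dryrun mode nothing is sent to the server; otherwise write
+                # the batch, honouring the optional retention policy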
+                if not dryrun:
+                    if rpname:
+                        response = client.write_points(datapoints, retention_policy=rpname)
+                    else:
+                        response = client.write_points(datapoints)
+                
+                    if not response:
+                        print('Problem inserting points, exiting...')
+                        exit(1)
+    
+                    print("Wrote %d points, up to %s, response: %s" % (len(datapoints), datetime_local, response))
 
                 datapoints = []
             
@@ -126,15 +218,24 @@ def loadCsv(inputfilename, servername, user, password, dbname, metric,
     if len(datapoints) > 0:
         print('Read %d lines'%count)
         print('Inserting %d datapoints...'%(len(datapoints)))
-        response = client.write_points(datapoints)
-
-        if response == False:
-            print('Problem inserting points, exiting...')
-            exit(1)
+        
+        #if showdata:
+        #    print(json.dumps(datapoints, indent=3))
+        if not dryrun:
+            if rpname:
+                response = client.write_points(datapoints, retention_policy=rpname)
+            else:
+                response = client.write_points(datapoints)
+            
+            if response == False:
+                print('Problem inserting points, exiting...')
+                exit(1)
+            print("Wrote %d, response: %s" % (len(datapoints), response))
 
-        print("Wrote %d, response: %s" % (len(datapoints), response))
 
     print('Done')
+    if dryrun:
+        print('(nothing was actually changed in the database, as --dryrun was given.)')
     
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description='Csv to influxdb.')
@@ -158,8 +259,8 @@ if __name__ == "__main__":
                         help='Password.')
 
     parser.add_argument('--dbname', nargs='?', required=True,
-                        help='Database name.')
-
+                        help='Database name. Specify target Retention Policy: [DBNAME].[RPNAME]')
+    
     parser.add_argument('--create', action='store_true', default=False,
                         help='Drop database and create a new one.')
 
@@ -186,9 +287,21 @@ if __name__ == "__main__":
 
     parser.add_argument('-b', '--batchsize', type=int, default=5000,
                         help='Batch size. Default: 5000.')
+    
+    parser.add_argument('--showdata', action='store_true', default=False,
+                        help='Print detailed information to the console about what will be done with the data (or would be done, when using --dryrun).')
+    
+    parser.add_argument('--dryrun', action='store_true', default=False,
+                        help='Do not change anything in the DB. Also enables --showdata.')
+    
+    parser.add_argument('--datatypes', default=False,
+                        help='Force the data types for the fields specified in --fieldcolumns, e.g. value1=int,value2=float,value3=bool,name=str ... Valid types: int, float, str, bool')
+
+    parser.add_argument('-tp', '--tspass', action='store_true', default=False,
+                        help='Pass the timestamp from the CSV directly to InfluxDB (no conversion) - use only if the format is already InfluxDB-compatible.')
 
     args = parser.parse_args()
     loadCsv(args.input, args.server, args.user, args.password, args.dbname, 
         args.metricname, args.timecolumn, args.timeformat, args.tagcolumns, 
         args.fieldcolumns, args.gzip, args.delimiter, args.batchsize, args.create, 
-        args.timezone, args.ssl)
+        args.timezone, args.ssl, args.showdata, args.dryrun, args.datatypes, args.tspass)