How to store strings in CSV with new line characters?

My question is: what are ways I can store strings in a CSV that contain newline characters (i.e. \n), where each data point is in one line?

Sample data

This is a sample of the data I have:

data = [
    ['some text in one line', 1],
    ['text with\nnew line character', 0],
    ['another new\nline character', 1]
]

Target CSV

I want a CSV file, where the first line is "text,category" and every subsequent line is an entry from data.

What I tried

Using the csv package from Python.

import csv
field_names = ['text', 'category']

# Writing
with open('file.csv', 'w', encoding='utf-8', newline='') as file:
    csvwriter = csv.DictWriter(file, field_names)
    csvwriter.writeheader()
    for d in data:
        csvwriter.writerow({'text': d[0], 'category':d[1]})

# Reading
with open('file.csv', 'r', encoding='utf-8', newline='') as file:
    csvreader = csv.DictReader(file)  # field names are taken from the header row
    data = []
    for line in csvreader:
        data.append([line['text'], line['category']])

I can read and write, but the output file.csv is the following:

text,category
some text in one line,1
"text with
new line character",0
"another new
line character",1

So not one line per data point.



I assume that you want to keep the newlines in the strings after you have loaded the CSV file back from disk, and that the loading is also done in Python. My solution requires Python 3, although the principle could be applied to Python 2.

The main trick

The trick is to replace the \n characters before writing with a weird character that would not otherwise appear in the text, then to swap that weird character back for \n after reading the file from disk.

For my weird character, I will use the Icelandic thorn: þ, but you can choose anything that should not otherwise appear in your text variables. Its name, as defined in the Unicode standard, is LATIN SMALL LETTER THORN. You can write it in Python 3 in a few ways:

    weird_literal = 'þ'
    weird_name = '\N{LATIN SMALL LETTER THORN}'
    weird_char = '\xfe'  # hex representation
    weird_literal == weird_name == weird_char  # True

That \N escape is pretty cool (and it works inside Python 3.6 formatted strings too): it lets you refer to a character by its name, as defined in the Unicode specification.

An alternative character that may serve as a good standard is '\u2063' (INVISIBLE SEPARATOR).
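
Whichever character you pick, it is worth checking that it does not already occur in your text before relying on it. Here is a minimal sketch of such a check; the helper name pick_placeholder and its candidate list are my own assumptions for illustration, not part of any library:

```python
def pick_placeholder(texts, candidates=('\u2063', '\N{LATIN SMALL LETTER THORN}')):
    """Return the first candidate character that occurs in none of the texts.

    Hypothetical helper: the name and the candidate list are illustrative only.
    """
    texts = list(texts)  # materialise, so generators can be scanned repeatedly
    for char in candidates:
        if not any(char in text for text in texts):
            return char
    raise ValueError('every candidate placeholder already occurs in the data')


data = [
    ['some text in one line', 1],
    ['text with\nnew line character', 0],
    ['another new\nline character', 1],
]

# Pick a placeholder that is safe for the sample data from the question.
weird_char = pick_placeholder(text for text, _ in data)
```

If no candidate is safe, the function raises rather than silently corrupting the data, which seems the more cautious choice here.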

Replacing \n

Now we use this weird character to replace '\n'. Here are the two ways that pop into my mind for achieving this:

  1. using a list comprehension on your list of lists: data:

     new_data = [[sample[0].replace('\n', weird_char), sample[1]]
                 for sample in data]
    
  2. putting the data into a dataframe, and using replace on the whole text column in one go

     import pandas as pd
     df = pd.DataFrame(data, columns=['text', 'category'])
     df['text'] = df['text'].str.replace('\n', weird_char)
    

The resulting dataframe looks like this, with newlines replaced:

               text              category
0         some text in one line      1   
1  text withþnew line character      0   
2    another newþline character      1   

Writing the results to disk

Now we write the dataframe to disk (if you used the list comprehension, first build a dataframe from new_data with the same two column names). I set index=False so that the row numbers are not written to the CSV:

FILE = '~/path/to/test_file.csv'
df.to_csv(FILE, index=False)

What does it look like on disk?

text,category
some text in one line,1
text withþnew line character,0
another newþline character,1

Getting the original data back from disk

Read the data back from file:

new_df = pd.read_csv(FILE)

And we can replace the þ characters with \n again:

new_df.text = new_df.text.str.replace(weird_char, '\n')

And the final DataFrame:

new_df
               text               category
0          some text in one line      1   
1  text with\nnew line character      0   
2    another new\nline character      1   

If you want things back into your list of lists, then you can do this:

original_lists = [[text, category] for index, text, category in new_df.itertuples()]

Which looks like this:

[['some text in one line', 1],
 ['text with\nnew line character', 0],
 ['another new\nline character', 1]]
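
If you would rather not depend on pandas, the same round trip can be sketched with the csv module alone. This is only an illustration under a few assumptions: io.StringIO stands in for the file on disk, the names PLACEHOLDER, write_rows and read_rows are made up for this example, and the category column is assumed to hold integers.

```python
import csv
import io

PLACEHOLDER = '\N{LATIN SMALL LETTER THORN}'  # assumed to be absent from the text

data = [
    ['some text in one line', 1],
    ['text with\nnew line character', 0],
    ['another new\nline character', 1],
]


def write_rows(rows, file):
    """Write a header plus one physical line per row, hiding newlines."""
    writer = csv.writer(file, lineterminator='\n')
    writer.writerow(['text', 'category'])
    for text, category in rows:
        writer.writerow([text.replace('\n', PLACEHOLDER), category])


def read_rows(file):
    """Read the rows back and restore the original newlines."""
    reader = csv.DictReader(file)  # field names come from the header row
    return [[row['text'].replace(PLACEHOLDER, '\n'), int(row['category'])]
            for row in reader]


buffer = io.StringIO()  # in-memory stand-in for a real file
write_rows(data, buffer)
csv_text = buffer.getvalue()

restored = read_rows(io.StringIO(csv_text))
```

With a real file you would open it with newline='' for both writing and reading, as the csv documentation recommends.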

For anyone who's still facing the issue:

None of the other suggestions worked for me, or they were too much work. Simply replace all \n with \\n before saving to CSV, and do the reverse replacement after loading, and the newline characters are preserved.

df.loc[:, "Column_Name"] = df["Column_Name"].apply(lambda x : x.replace('\n', '\\n'))
df.to_csv("df.csv", index=False)

Note, though, that if your string already contains a literal backslash followed by n (e.g. this answer's text), it too will turn into a newline when unpacking.
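
One way around that caveat is to escape existing backslashes first, so that a literal backslash followed by n stays distinguishable from an escaped newline. The following is a rough sketch, not a drop-in replacement for the one-liner above; escape_newlines and unescape_newlines are hypothetical helper names:

```python
def escape_newlines(text):
    """Double every backslash, then turn each newline into backslash + 'n'."""
    return text.replace('\\', '\\\\').replace('\n', '\\n')


def unescape_newlines(text):
    """Reverse escape_newlines by reading escape pairs from left to right."""
    out = []
    chars = iter(text)
    for char in chars:
        if char == '\\':
            following = next(chars, '')  # the character after the backslash
            out.append('\n' if following == 'n' else following)
        else:
            out.append(char)
    return ''.join(out)


samples = [
    'plain text',
    'two\nlines',
    'a literal \\n that must survive',
    'backslash before a newline \\\n end',
]
round_tripped = [unescape_newlines(escape_newlines(s)) for s in samples]
```

With pandas, these could be applied to the column through .apply before to_csv and after read_csv, in place of the plain replace calls.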
