I assume that you want to keep the newlines in the strings for some reason after you have loaded the csv files from disk. Also that this is done again in Python. My solution will require Python 3, although the principle could be applied to Python 2.
The main trick
This is to replace the \n
characters before writing with a weird character that otherwise wouldn't be included, then to swap that weird character back for \n
after reading the file back from disk.
For my weird character, I will use the Icelandic thorn: Þ, but you can choose anything that should otherwise not appear in your text variables.
Its name, as defined in the standardised Unicode specification is: LATIN SMALL LETTER THORN. You can use it in Python 3 a couple of ways:
weird_literal = 'þ'
weird_name = '\N{LATIN SMALL LETTER THORN}'
weird_char = '\xfe' # hex representation
weird_literal == weird_name == weird_char # True
That \N
is pretty cool (and works in python 3.6 inside formatted strings too)... it basically allows you to pass the Name of a character, as per Unicode's specification.
An alternative character that may serve as a good standard is '\u2063'
(INVISIBLE SEPARATOR).
Replacing \n
Now we use this weird character to replace '\n'
. Here are the two ways that pop into my mind for achieving this:
using a list comprehension on your list of lists: data
:
new_data = [[sample[0].replace('\n', weird_char) + weird_char, sample[1]]
for sample in data]
putting the data into a dataframe, and using replace on the whole text
column in one go
df1 = pd.DataFrame(data, columns=['text', 'category'])
df1.text = df.text.str.replace('\n', weird_char)
The resulting dataframe looks like this, with newlines replaced:
text category
0 some text in one line 1
1 text withþnew line character 0
2 another newþline character 1
Writing the results to disk
Now we write either of those identical dataframes to disk. I set index=False
as you said you don't want row numbers to be in the CSV:
FILE = '~/path/to/test_file.csv'
df.to_csv(FILE, index=False)
What does it look like on disk?
text,category
some text in one line,1
text withþnew line character,0
another newþline character,1
Getting the original data back from disk
Read the data back from file:
new_df = pd.read_csv(FILE)
And we can replace the Þ
characters back to \n
:
new_df.text = new_df.text.str.replace(weird_char, '\n')
And the final DataFrame:
new_df
text category
0 some text in one line 1
1 text with\nnew line character 0
2 another new\nline character 1
If you want things back into your list of lists, then you can do this:
original_lists = [[text, category] for index, text, category in old_df_again.itertuples()]
Which looks like this:
[['some text in one line', 1],
['text with\nnew line character', 0],
['another new\nline character', 1]]