8

I have a pandas dataframe like this:

    c1      c2      c3      c4
0   1       2       3       0
1   10      20      30      1
2   100     200     300     2
3   1       2       3       0
4   10      20      30      1
5   100     200     300     2

I would like to transform it into this:

    c1  c2  c3  c4  c5  c6  c7  c8  c9
0   1   2   3   10  20  30  100 200 300
1   1   2   3   10  20  30  100 200 300

The idea is to "flatten" 3 rows at a time into one, based on the value of "c4".

I have been trying to write a function to use with the .apply method, but without much luck.

6 Answers

7

Another possible solution (groupby + hstack):

import numpy as np
import pandas as pd
g = df.drop('c4', axis=1).groupby(df['c4'])
cols = [f'c{i+1}' for i in range(9)]
pd.DataFrame(np.hstack([x for _, x in g]), columns=cols)

This first applies drop to remove column c4. Then, it uses groupby on column c4 to split the dataframe into subgroups, each containing rows with the same value of c4. The resulting dataframes are collected in a list comprehension, which is passed to numpy.hstack to horizontally stack them into a single 2D array. Finally, this array is wrapped back into a DataFrame, with column names generated by a list comprehension.
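To make the intermediate steps visible, here is a small self-contained sketch of the same groupby + hstack idea on the sample data from the question. The names `pieces`, `stacked` and `out` are only illustrative, and the approach assumes every c4 group has the same number of rows (otherwise hstack cannot line them up).

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'c1': [1, 10, 100, 1, 10, 100],
    'c2': [2, 20, 200, 2, 20, 200],
    'c3': [3, 30, 300, 3, 30, 300],
    'c4': [0, 1, 2, 0, 1, 2],
})

g = df.drop('c4', axis=1).groupby(df['c4'])

# Each group is a small frame holding the rows that share one c4 value.
for key, x in g:
    print(key, x.shape)

# Convert the groups to arrays and place them side by side.
pieces = [x.to_numpy() for _, x in g]
stacked = np.hstack(pieces)

cols = [f'c{i+1}' for i in range(stacked.shape[1])]
out = pd.DataFrame(stacked, columns=cols)
print(out)
```

Deriving the column names from `stacked.shape[1]` instead of a hard-coded 9 keeps the sketch usable if the number of groups or columns changes.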


Yet another possible solution, maybe more efficient than the previous one (sort_values + reshape):

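pd.DataFrame(
    df.sort_values('c4', kind='stable').drop('c4', axis=1).values
      .reshape(3, -1, 3).swapaxes(0, 1).reshape(-1, 3*3),
    columns=cols)

This first sorts the rows by c4 (sort_values), drops that column (drop) and extracts the underlying numpy array (values). The array is then viewed as 3 groups of rows with 3 columns each (reshape), the group axis is swapped with the row axis so that one row from every group sits side by side (swapaxes), and the result is flattened to 9 columns (reshape) and wrapped into a new dataframe. Note that reshape(-1, 3*3, order='F') on its own would interleave the columns (1, 10, 100, 2, 20, 200, ...), which is why the axis swap is used here.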
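Reshapes like this are easy to get subtly wrong, so it can help to try them on a tiny array first. The sketch below is one possible way to regroup already-sorted rows with plain NumPy through an intermediate 3-D view. The names `n_groups`, `n_cols`, `blocks` and `wide` are my own, and the array is typed in by hand to mirror the sorted sample data.

```python
import numpy as np

# Rows as they look after sorting by c4 and dropping that column:
# two rows per c4 group, three groups.
a = np.array([
    [1, 2, 3], [1, 2, 3],
    [10, 20, 30], [10, 20, 30],
    [100, 200, 300], [100, 200, 300],
])
n_groups, n_cols = 3, 3

# View the data as (group, row-within-group, column).
blocks = a.reshape(n_groups, -1, n_cols)

# Move row-within-group to the front, so each output row takes
# one row from every group, then flatten the last two axes.
wide = blocks.swapaxes(0, 1).reshape(-1, n_groups * n_cols)
print(wide)
```

This assumes all groups have the same number of rows; with ragged groups the first reshape would fail or mix groups.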

Output:

   c1  c2  c3  c4  c5  c6   c7   c8   c9
0   1   2   3  10  20  30  100  200  300
1   1   2   3  10  20  30  100  200  300
3

A solution that uses 'c4' is of course possible, but this is how I see it.

We can get the form you need with the help of an auxiliary column 'group'. It gives every block of three consecutive rows the same number (index // 3), so the rows can be grouped for the transformation.

[Image: result after adding the 'group' column]
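Since the image is not available as text, here is a minimal sketch of this step on the sample data, assuming the default 0..n-1 index and blocks of exactly three rows.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'c1': [1, 10, 100, 1, 10, 100],
    'c2': [2, 20, 200, 2, 20, 200],
    'c3': [3, 30, 300, 3, 30, 300],
    'c4': [0, 1, 2, 0, 1, 2],
})

# Integer division of the row position by the block size (3)
# labels rows 0-2 as group 0 and rows 3-5 as group 1.
df['group'] = df.index // 3
print(df)
```

If the index is not a plain 0..n-1 range, calling `df.reset_index(drop=True)` first would be needed for this labelling to work.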

Now we will write a function that will create a pd.Series. We take the values and place them into one array using flatten().

def grouping(g):
    return pd.Series(g[['c1', 'c2', 'c3']].values.flatten(), 
                     index=[f'c{i+1}' for i in range(9)])

We apply the function to the grouped DataFrame by the auxiliary column 'group'.

[Image: result]

Full code:


import pandas as pd

data = {
    'c1': [1, 10, 100, 1, 10, 100],
    'c2': [2, 20, 200, 2, 20, 200],
    'c3': [3, 30, 300, 3, 30, 300],
    'c4': [0, 1, 2, 0, 1, 2]
}

df = pd.DataFrame(data)

df['group'] = df.index // 3

def grouping(g):
    return pd.Series(g[['c1', 'c2', 'c3']].values.flatten(), 
                     index=[f'c{i+1}' for i in range(9)])

result_df = df.groupby('group').apply(grouping).reset_index(drop=True)
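If the grouping should depend on 'c4' rather than on row position, one possible variation (my own sketch, not part of the code above) is to number the occurrences of each c4 value with cumcount and use that as the 'group' column. It assumes every c4 value occurs the same number of times.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'c1': [1, 10, 100, 1, 10, 100],
    'c2': [2, 20, 200, 2, 20, 200],
    'c3': [3, 30, 300, 3, 30, 300],
    'c4': [0, 1, 2, 0, 1, 2],
})

# The k-th occurrence of every c4 value goes into group k.
df['group'] = df.groupby('c4').cumcount()

# Within each group, order the rows by c4 and flatten them into one row.
rows = [
    g.sort_values('c4')[['c1', 'c2', 'c3']].to_numpy().ravel()
    for _, g in df.groupby('group')
]
result_df = pd.DataFrame(rows, columns=[f'c{i+1}' for i in range(9)])
print(result_df)
```

Unlike `index // 3`, this does not require the rows to arrive in 0, 1, 2 order.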
4
  • 2
    Your answer does not involve c4 at all. And why did you post the outputs as images?
    – strawdog
    Commented Aug 20 at 20:32
  • 1
    @strawdog I came up with an idea for a solution. If you don't like the images, you can run my code in Google Colab. (I don't think there's a difference; if you insist, I can change it to a text output format.)
    – Sindik
    Commented Aug 20 at 20:36
  • 2
    The images are not really the problem. The problem is that the question clearly asks to build the new df based on the c4 column.
    – strawdog
    Commented Aug 20 at 20:42
  • 1
    @strawdog You are right. Maybe this idea will help the author. And it's not nice to format it in the comments.
    – Sindik
    Commented Aug 20 at 20:54
3

You can group your dataframe by the c4 column and concat the groups into the resulting dataframe like this:

res = pd.DataFrame()
for i, g in df.groupby("c4"):
    g = g.reset_index(drop=True) # "forget" the original index
                                 # to put all groups in corresponding rows
    res = pd.concat([res, g], axis=1) # concatenate "horizontally"

res = res.drop("c4", axis=1) # drop all the old, unnecessary "c4" columns
                             # and rename the remaining ones accordingly:
res.columns = [f"c{x}" for x in range(1, len(res.columns)+1)]

This is what you'll get:

   c1  c2  c3  c4  c5  c6   c7   c8   c9
0   1   2   3  10  20  30  100  200  300
1   1   2   3  10  20  30  100  200  300
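A possible variation on the loop above is to collect the pieces in a list and call pd.concat once, which avoids rebuilding `res` on every iteration. This is only a sketch; `parts` is an illustrative name.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'c1': [1, 10, 100, 1, 10, 100],
    'c2': [2, 20, 200, 2, 20, 200],
    'c3': [3, 30, 300, 3, 30, 300],
    'c4': [0, 1, 2, 0, 1, 2],
})

# One frame per c4 value, without the c4 column and with a fresh
# 0..n-1 index so the groups line up row by row.
parts = [
    g.drop(columns='c4').reset_index(drop=True)
    for _, g in df.groupby('c4')
]

# Concatenate all groups horizontally in a single call.
res = pd.concat(parts, axis=1)
res.columns = [f'c{x}' for x in range(1, res.shape[1] + 1)]
print(res)
```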
2

Count the number of rows that meet the condition c4 == 0. After that drop column c4 and export the dataframe to numpy to reshape the array based on the previous count. Create a new dataframe with the reshaped array.

For the new dataframe, increment the name of each column by 1 and add the prefix 'c'.

n = df['c4'].eq(0).sum()

result = (pd.DataFrame(df.drop(columns='c4').to_numpy().reshape(n, -1))
          .pipe(lambda x: x.set_axis(x.columns+1, axis=1))
          .add_prefix('c'))

End result:

   c1  c2  c3  c4  c5  c6   c7   c8   c9
0   1   2   3  10  20  30  100  200  300
1   1   2   3  10  20  30  100  200  300
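The reshape relies on the rows already being in repeating 0, 1, 2 order. As a hedged sketch, a check like the following could guard against data where that does not hold. The check itself and the names `block` and `expected` are my additions, and it assumes c4 takes the values 0, 1, ..., block-1.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'c1': [1, 10, 100, 1, 10, 100],
    'c2': [2, 20, 200, 2, 20, 200],
    'c3': [3, 30, 300, 3, 30, 300],
    'c4': [0, 1, 2, 0, 1, 2],
})

n = int(df['c4'].eq(0).sum())   # number of output rows
block = len(df) // n            # input rows per output row

# c4 is expected to cycle 0, 1, ..., block-1 for every output row.
expected = np.tile(np.arange(block), n)
if not np.array_equal(df['c4'].to_numpy(), expected):
    raise ValueError("rows are not in repeating c4 order; sort them first")

result = pd.DataFrame(df.drop(columns='c4').to_numpy().reshape(n, -1))
result.columns = [f'c{i + 1}' for i in result.columns]
print(result)
```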
0

You could use groupby + cumcount to generate "row numbers" which can be used to pivot with.

df.groupby("c4").cumcount()
# 0    0
# 1    0
# 2    0
# 3    1
# 4    1
# 5    1
# dtype: int64
df.pivot(index=[df.groupby("c4").cumcount()], columns="c4")
#    c1          c2          c3         
# c4  0   1    2  0   1    2  0   1    2
# 0   1  10  100  2  20  200  3  30  300
# 1   1  10  100  2  20  200  3  30  300

Then it's a matter of reordering and renaming the columns.

result = (
    df.pivot(index=[df.groupby("c4").cumcount()], columns="c4")
      .sort_index(axis=1, level="c4")  # order columns by c4, not by cell values
)

result.columns = [ f"c{n + 1}" for n in range(len(result.columns)) ]
#    c1  c2  c3  c4  c5  c6   c7   c8   c9
# 0   1   2   3  10  20  30  100  200  300
# 1   1   2   3  10  20  30  100  200  300
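For reference, here is a self-contained sketch of the same cumcount + pivot idea, with the row number stored in a named helper column (`row`, a name I made up). The columns are ordered by the c4 level of the column MultiIndex, which I assume is the intended ordering and which does not depend on the cell values.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'c1': [1, 10, 100, 1, 10, 100],
    'c2': [2, 20, 200, 2, 20, 200],
    'c3': [3, 30, 300, 3, 30, 300],
    'c4': [0, 1, 2, 0, 1, 2],
})

# Number the occurrences of each c4 value: 0 for the first, 1 for the second.
row = df.groupby('c4').cumcount()

wide = (
    df.assign(row=row)
      .pivot(index='row', columns='c4')   # columns become (name, c4) pairs
      .sort_index(axis=1, level='c4')     # c4 = 0 first, then 1, then 2
)

# Flatten the two-level column index into c1..c9.
wide.columns = [f'c{n + 1}' for n in range(wide.shape[1])]
wide = wide.reset_index(drop=True)
print(wide)
```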
0

It might be easiest to do this in NumPy.

I'm assuming the values play no role in the operation. Instead, it is defined by two size parameters: the number of rows per block and the number of columns to keep. Also, column 'c4' and any other columns beyond the first 3 are to be completely ignored.

n_rows, n_cols = (3, 3)
data = df.iloc[:, :n_cols].to_numpy()
data_transformed = data.reshape(-1, n_rows * n_cols)
col_names = [f"c{i}" for i in range(1, data_transformed.shape[1] + 1)]
df_transformed = pd.DataFrame(data_transformed, columns=col_names)
print(df_transformed)

Output:

   c1  c2  c3  c4  c5  c6   c7   c8   c9
0   1   2   3  10  20  30  100  200  300
1   1   2   3  10  20  30  100  200  300
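If this is needed in more than one place, the steps could be wrapped in a small helper. The function name `flatten_blocks` and the length check are my own additions, so treat this as a sketch under the same assumptions as above (row order defines the blocks, only the first `n_cols` columns are used).

```python
import pandas as pd

def flatten_blocks(df, n_rows, n_cols):
    """Flatten every n_rows consecutive rows of the first n_cols columns into one row."""
    data = df.iloc[:, :n_cols].to_numpy()
    if len(data) % n_rows != 0:
        raise ValueError("number of rows must be a multiple of n_rows")
    wide = data.reshape(-1, n_rows * n_cols)
    names = [f"c{i}" for i in range(1, wide.shape[1] + 1)]
    return pd.DataFrame(wide, columns=names)

# Sample data copied from the question.
df = pd.DataFrame({
    'c1': [1, 10, 100, 1, 10, 100],
    'c2': [2, 20, 200, 2, 20, 200],
    'c3': [3, 30, 300, 3, 30, 300],
    'c4': [0, 1, 2, 0, 1, 2],
})

df_transformed = flatten_blocks(df, n_rows=3, n_cols=3)
print(df_transformed)
```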
