reshape — Convert data from wide to long form and vice versa

2019-04-27  松柏林stata

Description

reshape converts data from wide to long form and vice versa.

Quick start

Create v from 2 time periods stored in v1 and v2 for observations identified by idvar and add tvar identifying time period

reshape long v, i(idvar) j(tvar)

Create v from 2 subobservations stored in v1 and v2 for observations identified by idvar and add subobs identifying each subobservation

reshape long v, i(idvar) j(subobs)

As above, but allow subobs to contain strings

reshape long v, i(idvar) j(subobs) string

Undo results from above

reshape wide

Create v1 and v2 from v with observations identified by idvar and time period identified by tvar

reshape wide v, i(idvar) j(tvar)

Undo results from above

reshape long

Create var and time identifier tvar from v1ar and v2ar with observation identifier idvar

reshape long v@ar, i(idvar) j(tvar)

Syntax

Overview

       long
    +------------+                  wide
    | i  j  stub |                 +----------------+
    |------------|                 | i  stub1 stub2 |
    | 1  1   4.1 |     reshape     |----------------|
    | 1  2   4.5 |   <--------->   | 1    4.1   4.5 |
    | 2  1   3.3 |                 | 2    3.3   3.0 |
    | 2  2   3.0 |                 +----------------+
    +------------+

Long to wide:

                                 j existing variable
                               /
reshape wide stub, i(i) j(j)

Wide to long:

reshape long stub, i(i) j(j)
                            \
                              j new variable

To go back to long after using reshape wide:

reshape long

To go back to wide after using reshape long:

reshape wide

Basic syntax

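Convert data from wide form to long form

reshape long stubnames, i(varlist) [options]

Convert data from long form to wide form

reshape wide stubnames, i(varlist) [options]

Convert data back to long form after using reshape wide

reshape long

Convert data back to wide form after using reshape long

reshape wide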

List problem observations when reshape fails

reshape error

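options

i(varlist): use varlist as the ID variables
j(varname [values]): long->wide: varname, existing variable
    wide->long: varname, new variable
    optionally specify values to subset varname
string: varname is a string variable (default is numeric)

i(varlist) is required.
where values is #[-#] [...] if varname is numeric (default)
    "string" ["string" ...] if varname is string
and where stubnames are variable names (long->wide), or stubs of variable names (wide->long), and either way, may contain @, denoting where j appears or is to appear in the name. In the example above, when we wrote "reshape wide stub", we could have written "reshape wide stub@" because j by default ends up as a suffix. Had we written stu@b, then the wide variables would have been named stu1b and stu2b.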

Advanced syntax

reshape i varlist
reshape j varname [values] [, string]
reshape xij fvarnames [, atwl(chars)]
reshape xi [varlist]
reshape [query]
reshape clear

Options

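i(varlist) specifies the variables whose unique values denote a logical observation. i() is required.
j(varname [values]) specifies the variable whose unique values denote a subobservation. values lists the unique values to be used from varname, which typically are not explicitly stated because reshape will determine them automatically from the data.
string specifies that j() may contain string values.
atwl(chars), available only with the advanced syntax and not shown in the dialog box, specifies that plain ASCII chars be substituted for the @ character when converting the data from wide to long form.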

Description of basic syntax

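Before using reshape, you need to determine whether the data are in long or wide form. You also must determine the logical observation (i) and the subobservation (j) by which to organize the data. Suppose that you had the following data, which could be organized in wide or long form as follows:

[Figure: the same example data shown in long form and in wide form]

Given these data, you could use reshape to convert from one form to the other:

reshape long inc, i(id) j(year) /* goes from wide form to long form */
reshape wide inc, i(id) j(year) /* goes from long form to wide form */

Because we did not specify sex in the command, Stata assumes that it is constant within the logical observation, here id.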
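As a rough illustration of the round trip described above, here is a minimal sketch. The values typed with input are made up for illustration and are not taken from the lost figure; only the variable names id, sex, inc, and year come from the text.

```stata
* Hypothetical wide-form data: one row per id
clear
input id sex inc80 inc81 inc82
1 0 5000 5500 6000
2 1 2000 2200 3300
3 0 3000 2000 1000
end

* Wide to long: creates the new variable year, one row per id-year pair.
* sex is not named, so reshape treats it as constant within id.
reshape long inc, i(id) j(year)
list, sepby(id)

* Long to wide: drops year and restores inc80, inc81, and inc82
reshape wide inc, i(id) j(year)
list
```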

Wide and long data forms

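Think of the data as a collection of observations Xij, where i is the logical observation, or group identifier, and j is the subobservation, or within-group identifier. Wide-form data are organized by logical observation, storing all the data on a particular observation in one row. Long-form data are organized by subobservation, storing the data in multiple rows.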

Example 1

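For example, we might have data on a person's ID, gender, and annual income over the years 1980-1982. We have two Xij variables with the data in wide form:

use http://www.stata-press.com/data/r15/reshape1
list
[Figure: list output of the wide-form data]

To convert these data to the long form, we type

reshape long inc ue, i(id) j(year)
[Figure: output of the reshape long command]
There is no variable named year in our original, wide-form dataset. year will be a new variable in our long dataset. After this conversion, we have
[Figure: list output of the long-form data]
We can return to our original, wide-form dataset.
[Figure: output of the reshape wide command]
Converting from wide to long creates the j (year) variable. Converting back from long to wide drops the j (year) variable.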

Technical note

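If your data are in wide form and you do not have a group identifier variable (the i(varlist) required option), you can create one easily by using generate; see [D] generate. For instance, in the last example, if we did not have the id variable in our dataset, we could have created it by typing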

generate id = _n

Avoiding and correcting mistakes

reshape often detects when the data are not suitable for reshaping; an error is issued, and the data remain unchanged.

Example 2

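The following wide data contain a mistake:

[Figure: list output of the wide data, followed by the reshape long command and its error message]

The i variable must be unique when the data are in the wide form; we typed i(id), yet we have 2 observations for which id is 2. (Is person 2 a male or female?)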
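The kind of mistake described in example 2 can be reproduced with a small sketch. The data values below are hypothetical; the point is only that id 2 appears twice in wide form. capture noisily is used so that a do-file keeps running after the expected error.

```stata
* Hypothetical wide data in which id is NOT unique (id 2 appears twice)
clear
input id sex inc80 inc81 inc82
1 0 5000 5500 6000
2 1 2000 2200 3300
2 0 3000 2000 1000
end

* reshape is expected to refuse the conversion and leave the data unchanged
capture noisily reshape long inc, i(id) j(year)

* Two ways to locate the offending rows
capture noisily reshape error
duplicates list id
```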

Example 3

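It is not a mistake when the i variable is repeated when the data are in long form, but the following data have a similar mistake:

[Figure: list output of the long data, followed by the reshape wide command and its error message]
In the long form, i(id) does not have to be unique, but j(year) must be unique within i; otherwise, what is the value of inc in 1981 for which id==1?
reshape told us to type reshape error to view the problem observations.
[Figure: output of reshape error]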
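A sketch of the situation in example 3, again with made-up values: the pair id==1, year==81 occurs twice, so j is not unique within i.

```stata
* Hypothetical long data in which year is NOT unique within id
clear
input id year sex inc
1 80 0 5000
1 81 0 5500
1 81 0 5400
1 82 0 6000
2 80 1 2000
2 81 1 2200
2 82 1 3300
end

* reshape is expected to refuse the conversion
capture noisily reshape wide inc, i(id) j(year)

* List the problem observations, as the error message suggests
capture noisily reshape error

* An independent check that does not rely on reshape
duplicates list id year
```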

Example 4

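Consider some long-form data that have no mistakes. We list the first 4 observations.

[Figure: list output of the first 4 observations]
When we convert the data to wide form, however, we forget to mention the ue variable (which varies within person).
[Figure: the reshape wide command that omits ue, and its error message]
Here reshape observed that ue was not constant within id and so could not restructure the data so that there were single observations on id. We should have typed
reshape wide inc ue, i(id) j(year)

In summary, there are three cases in which reshape will refuse to convert the data:

  1. The data are in wide form and i is not unique.
  2. The data are in long form and j is not unique within i.
  3. The data are in long form and an unmentioned variable is not constant within i.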
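These three conditions can also be checked before calling reshape. The following is a minimal sketch on hypothetical long data; isid and the bysort/assert idiom are standard Stata commands, but using them as a pre-check is a suggestion here, not something the original text prescribes.

```stata
* Hypothetical long data with no mistakes
clear
input id year sex inc ue
1 80 0 5000 0
1 81 0 5500 1
2 80 1 2000 1
2 81 1 2200 0
end

* Case 2: in long form, j (year) must be unique within i (id)
capture isid id year
if _rc {
    display as error "id and year do not uniquely identify the observations"
}

* Case 3: a variable left out of the reshape command (here sex)
* must be constant within id; assert stops with an error if it is not
bysort id (sex): assert sex[1] == sex[_N]

* Case 1 applies to wide data, where the corresponding check is: isid id
```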

Example 5

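With some mistakes, reshape will probably convert the data and produce a surprising result. Suppose that we forget to mention that the ue variable varies within id in the following wide data:

[Figure: list output of the wide data, followed by a reshape long command that omits ue and a list of the result]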
We did not state that ue varied within i, so the variables ue80, ue81, and ue82 were left as is. reshape did not complain. There is no real problem here because no information has been lost. In fact, this may actually be the result we wanted. Probably, however, we simply forgot to include ue among the Xij variables. If you obtain an unexpected result, here is how to undo it:
  1. If you typed reshape long . . . to produce the result, type reshape wide (without arguments) to undo it.
  2. If you typed reshape wide . . . to produce the result, type reshape long (without arguments) to undo it.

reshape long and reshape wide without arguments

Whenever you type a reshape long or reshape wide command with arguments, reshape remembers it. Thus you might type

reshape long inc ue, i(id) j(year)

and work with the data like that. You could then type

reshape wide

to convert the data back to the wide form. Then later you could type

reshape long

to convert them back to the long form. If you save the data, you can even continue using reshape wide and reshape long without arguments during a future Stata session.

Be careful. If you create new Xij variables, you must tell reshape about them by typing the full reshape command, although no real damage will be done if you forget. If you are converting from long to wide form, reshape will catch your error and refuse to make the conversion. If you are converting from wide to long, reshape will convert the data, but the result will be surprising: remember what happened when we forgot to mention the ue variable and ended up with ue80, ue81, and ue82 in our long data; see example 5. You can type reshape wide to undo the unwanted change and then try again.
So, we can type

reshape wide

to get back to our original, wide-form data and then type the reshape long command that we intended:

reshape long inc ue, i(id) j(year)

Missing variables

When converting data from wide form to long form, reshape does not demand that all the variables exist. Missing variables are treated as variables with missing observations.

Example 6

Let’s drop ue81 from the wide form of the data:


[Figure: output of dropping ue81, the reshape long command, and a list of the resulting long data]

reshape placed missing values where ue81 values were unavailable. If we reshaped these data back to wide form by typing

reshape wide inc ue, i(id) j(year)

the ue81 variable would be created and would contain all missing values.
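The behavior in example 6 can be tried with the dataset already used in example 1. This is a sketch that assumes the URL is reachable from your Stata session.

```stata
* Load the wide-form example data and remove one of the Xij variables
use http://www.stata-press.com/data/r15/reshape1, clear
drop ue81

* reshape does not require ue81 to exist; ue is missing where year == 81
reshape long inc ue, i(id) j(year)
list, sepby(id)

* Going back to wide form re-creates ue81, filled with missing values
reshape wide inc ue, i(id) j(year)
describe
```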

Advanced issues with basic syntax: i()

The i() option can indicate one i variable (as our past examples have illustrated) or multiple variables. An example of multiple i variables would be hospital ID and patient ID within each hospital.

reshape . . . , i(hid pid)

Unique pairs of values for hid and pid in the data define the grouping variable for reshape.
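A minimal sketch of multiple i variables, using hypothetical data in which patient IDs restart at 1 within each hospital, so pid alone would not identify a logical observation:

```stata
* Hypothetical wide data: pid repeats across hospitals, (hid, pid) is unique
clear
input hid pid inc80 inc81
1 1 5000 5500
1 2 2000 2200
2 1 3000 2000
end

* The pair (hid, pid) defines the logical observation
reshape long inc, i(hid pid) j(year)
list, sepby(hid pid)
```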

Advanced issues with basic syntax: j()

The j() option takes a variable name (as our past examples have illustrated) or a variable name and a list of values. When the values are not provided, reshape deduces them from the data. Specifying the values with the j() option is rarely needed.

reshape never makes a mistake when the data are in long form and you type reshape wide. The values are easily obtained by tabulating the j variable.

reshape can make a mistake when the data are in wide form and you type reshape long if your variables are poorly named. Say that you have the inc80, inc81, and inc82 variables, recording income in each of the indicated years, and you have a variable named inc2, which is not income but indicates when the area was reincorporated. You type

reshape long inc, i(id) j(year)

reshape sees the inc2, inc80, inc81, and inc82 variables and decides that there are four groups in which j = 2, 80, 81, and 82.
The easiest way to solve the problem is to rename the inc2 variable to something other than “inc” followed by a number; see [D] rename.
You can also keep the name and specify the j values. To perform the reshape, you can type

reshape long inc, i(id) j(year 80-82)

or

reshape long inc, i(id) j(year 80 81 82)

You can mix the dash notation for value ranges with individual numbers. reshape would understand 80 82-87 89 91-95 as a valid values specification.
At the other extreme, you can omit the j() option altogether with reshape long. If you do, the j variable will be named _j.
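The inc2 situation can be sketched as follows. The data values are hypothetical; the variable names follow the text above.

```stata
* Hypothetical wide data: inc2 is NOT income, but its name matches the stub
clear
input id inc2 inc80 inc81 inc82
1 1975 5000 5500 6000
2 1980 2000 2200 3300
end

* Listing the j values keeps reshape from treating inc2 as "income in year 2";
* inc2 is then carried along as a variable that is constant within id
reshape long inc, i(id) j(year 80-82)
list, sepby(id)
```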

Advanced issues with basic syntax: xij

When specifying variable names, you may include @ characters to indicate where the numbers go.

Example 7

Let’s reshape the following data from wide to long form:


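[Figure: list output of the wide data, in which the income variables are named inc#r]

reshape long inc@r ue, i(id) j(year)

[Figure: output of the reshape long command]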

At most one @ character may appear in each name. If no @ character appears, results are as if the @ character appeared at the end of the name. So, the equivalent reshape command to the one above is

reshape long inc@r ue@, i(id) j(year)

inc@r specifies variables named inc#r in the wide form and incr in the long form. The @ notation may similarly be used for converting data from long to wide format:

reshape wide inc@r ue, i(id) j(year)
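A small sketch of the @ notation in both directions. The data are hypothetical; the variable names follow the inc#r pattern described above.

```stata
* Hypothetical wide data: the year sits in the middle of the income names
clear
input id sex inc80r inc81r inc82r ue80 ue81 ue82
1 0 5000 5500 6000 0 1 0
2 1 2000 2200 3300 1 0 0
end

* inc@r matches inc80r, inc81r, inc82r; ue is treated as ue@
reshape long inc@r ue, i(id) j(year)
describe

* The same specification converts the data back to wide form
reshape wide inc@r ue, i(id) j(year)
describe
```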

Advanced issues with basic syntax: String identifiers for j()

The string option allows j to take on string values.

Example 8

Consider the following wide data on husbands and wives. In these data, incm is the income of the man and incf is the income of the woman.


[Figure: list output of the wide data on husbands and wives]

These data can be reshaped into separate observations for males and females by typing


reshape long inc, i(id) j(sex) string

The string option specifies that j take on nonnumeric values. The result is
[Figure: list output of the long data]

sex will be a string variable. Similarly, these data can be converted from long to wide form by typing

reshape wide inc, i(id) j(sex) string

Strings are not limited to being single characters or even having the same length. You can specify the location of the string identifier in the variable name by using the @ notation.
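A sketch of example 8 with hypothetical values. The variable names incm and incf come from the text; kids is an assumed extra variable that is constant within id.

```stata
* Hypothetical wide data: incm and incf differ only in the string suffix
clear
input id kids incm incf
1 0 5000 5500
2 1 2000 2200
3 2 3000 2000
end

* With the string option, j (sex) takes the values "f" and "m"
reshape long inc, i(id) j(sex) string
list, sepby(id)

* The same options convert the data back to wide form
reshape wide inc, i(id) j(sex) string
list
```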

Example 9

Suppose that our variables are named id, kids, incmale, and incfem.


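[Figure: list output of the wide data, followed by the reshape long command with the string option and a list of the result]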

If the wide data had variables named minc and finc, the appropriate reshape command would have been

reshape long @inc, i(id) j(sex) string

The resulting variable in the long form would be named inc.
We can also place strings in the middle of the variable names. If the variables were named incMome and incFome, the reshape command would be

reshape long inc@ome, i(id) j(sex) string
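For the variable names given at the start of example 9 (id, kids, incmale, and incfem), a sketch with hypothetical values might look like this. The command shown is inferred from the surrounding text, since the original output was lost.

```stata
* Hypothetical wide data with string identifiers of different lengths
clear
input id kids incmale incfem
1 0 5000 5500
2 1 2000 2200
3 2 3000 2000
end

* sex takes the values "fem" and "male"; the long variable is named inc
reshape long inc, i(id) j(sex) string
list, sepby(id)
```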

Be careful with string identifiers because it is easy to be surprised by the result. Say that we have wide data having variables named incm, incf, uem, uef, agem, and agef. To make the data long, we might type

reshape long inc ue age, i(id) j(sex) string

Along with these variables, we also have the variable agenda. reshape will decide that the sexes are m, f, and nda. This would not happen without the string option if the variables were named inc0, inc1, ue0, ue1, age0, and age1, even with the agenda variable present in the data.

Advanced issues with basic syntax: Second-level nesting

Sometimes the data may have more than one possible j variable for reshaping. Suppose that your data have both a year variable and a sex variable. One logical observation in the data might be represented in any of the following four forms:


[Figure: one logical observation shown in each of the four forms]

reshape can convert any of these forms to any other. Converting data from the long–long form to the wide–wide form (or any of the other forms) takes two reshape commands. Here is how we would do it:


[Figure: the two reshape commands]
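A sketch of the two-step conversion from the long-long form to the wide-wide form. The variable names (hid, sex, year, inc) and the values are assumptions made for illustration, since the original figures were lost.

```stata
* Hypothetical long-long data: one row per household, sex, and year
clear
input hid str1 sex year inc
1 "f" 90 3200
1 "f" 91 4700
1 "m" 90 4500
1 "m" 91 4600
end

* Step 1: move sex into the variable names; @inc places the string first,
* giving finc and minc, with one row per household and year
reshape wide @inc, i(hid year) j(sex) string

* Step 2: move year into the variable names, giving one row per household
reshape wide minc finc, i(hid) j(year)
list
```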

Description of advanced syntax

The advanced syntax is simply a different way of specifying the reshape command, and it has one seldom-used feature that provides extra control. Rather than typing one reshape command to describe the data and perform the conversion, such as

reshape long inc, i(id) j(year)

you type a sequence of reshape commands. The initial commands describe the data, and the last command performs the conversion:

reshape i id
reshape j year
reshape xij inc
reshape long

reshape i corresponds to i() in the basic syntax.
reshape j corresponds to j() in the basic syntax.
reshape xij corresponds to the variables specified in the basic syntax. reshape xij also accepts the atwl() option for use when @ characters are specified in the fvarnames. atwl stands for at-when-long. When you specify names such as inc@r or ue@, in the long form the names become incr and ue, and the @ character is ignored. atwl() allows you to change @ into whatever you specify. For example, if you specify atwl(X), the long-form names become incXr and ueX.

There is also one more specification, which has no counterpart in the basic syntax:

reshape xi varlist

In the basic syntax, Stata assumes that all unspecified variables are constant within i. The advanced syntax works the same way, unless you specify the reshape xi command, which names the constant-within-i variables. If you specify reshape xi, any variables that you do not explicitly specify are dropped from the data during the conversion. As a practical matter, you should explicitly drop the unwanted variables before conversion. For instance, suppose that the data have variables inc80, inc81, inc82, sex, age, and age2 and that you no longer want the age2 variable. You could specify

reshape xi sex age

or

drop age2

and leave reshape xi unspecified. reshape xi does have one minor advantage. It saves reshape the work of determining which variables are unspecified. This saves a relatively small amount of computer time.

Another advanced-syntax feature is reshape query, which is equivalent to typing reshape by itself. reshape query reports which reshape parameters have been defined. reshape i, reshape j, reshape xij, and reshape xi specifications may be given in any order and may be repeated to change or correct what has been specified.

Finally, reshape clear clears the definitions. reshape definitions are stored with the dataset when you save it. reshape clear allows you to erase these definitions. The basic syntax of reshape is implemented in terms of the advanced syntax, so you can mix basic and advanced syntaxes.
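Putting the advanced-syntax commands together, a session might look like the sketch below. It reuses the example 1 dataset and assumes the URL is reachable; the exact text printed by reshape query is not shown because it is not reproduced in the source.

```stata
* Describe the data step by step, then convert
use http://www.stata-press.com/data/r15/reshape1, clear
reshape i id
reshape j year
reshape xij inc ue

* Report what has been defined so far
reshape query

* Perform the conversion using the stored definitions
reshape long
list, sepby(id)

* Erase the stored definitions and confirm that nothing remains defined
reshape clear
reshape query
```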
