Migration is the third process (with fertility and mortality) that governs population change. For most national populations, its contribution to population change is small relative to those of births and deaths, but as the civil division of interest becomes smaller, the salience of migration typically becomes larger. Migration differs from fertility and mortality not only in magnitude, but more fundamentally in the nature of the process. Migration involves moving across some geographically-defined boundary, with the intent or result of changing place of normal residence. Thus whereas a birth and a death are largely unambiguous, a migration depends upon geographically-defined spatial units (civil divisions) and on intent or subsequent behaviour. A person can be a migrant to the analyst looking at change in provincial population but not a migrant to another analyst focusing on national population change. The first task, therefore, in any analysis of migration is to establish the geographic focus of the study. A second task is to define what counts as a migration, as opposed to broader mobility. The issue is further confused by the existence of several different types of migration. In addition to “ordinary” change of usual residence, there are circular migration flows, daily or weekly commuter flows, seasonal flows and refugee flows, all with specific characteristics. Given these definitional issues, and the fact that migrations can effectively be reversed in terms of population stocks (unlike births and deaths), it is no surprise that measurement is also complicated.
Apart from this, collecting data on migration is also more problematic. Although developing countries often lack complete systems of birth and death registration, completeness is improving and methods have been devised to make use of incomplete data. However, in most countries registration data on migrants/migrations cannot be relied on to produce sound estimates of immigrants, let alone of internal migrants/migrations. In addition, for various reasons (illegal status, temporary residence of recent migrants, fear of xenophobia, etc.) migrants (especially immigrants) are usually underrepresented in censuses and surveys.
Methods for measuring migration are broadly similar for both internal migration (in- or out-migration) and international migration (immigration or emigration), except in one very important respect. A census or survey can measure international immigration by identifying persons born abroad, but it is much harder to identify emigrants because it is not possible to carry out a census/survey in all recipient countries. Approaches to estimating emigration include: (i) systematic identification of nationals in censuses of other countries (UN Population Division 2011); (ii) including census/survey questions about usual household members living abroad (e.g. in the Swaziland Censuses of 1986 and 1996); (iii) asking about the residence abroad of close relatives, especially a woman’s children or a respondent’s siblings (Zaba 1985); and (iv) using intercensal residual methods to estimate numbers of missing residents at the time of a second census. The first approach is dependent on receiving countries having, and being willing to share, relevant data and only captures migration of the native-born population; the second approach depends on the, perhaps vague, concept of household membership, and will also fail to cover entire households that have moved away; the third also fails to capture entire missing families, does not provide estimates of recent emigration, and in small experimental surveys has not proven convincing. Only the fourth can be expected to give plausible estimates of recent outflows, provided both censuses count the population reasonably accurately, but gives no potentially useful information about destination.
Given these limitations and the problems of accurate data collection, the field of migration analysis has developed largely independently of mainstream demography. It has concentrated primarily on developed countries, where the quality of data available to measure migration is typically much better than it is in developing countries, and possibly also because migration in these countries is often a matter of greater political and public policy concern. A further consequence is that the field has developed its own terminology and techniques, which are often quite far removed from the demography discussed elsewhere in this manual.
As noted above, a migration is defined as a move across a geographically-defined (usually administrative) boundary of interest to the analyst with the effect of changing a person’s place of usual residence. Assuming that the boundary can be clearly defined, this immediately raises two questions: how does one define usual place of residence, and how does one determine whether it has changed? Unfortunately, no very precise answers can be given to these two questions, giving rise to inevitable uncertainty in measurement. The preferred definition of usual residence is in terms of length of residence: that if one intends to live, or after one has lived, in a place for a period of time (e.g. one year) one becomes a usual resident. Note that usual residence is not the same thing as legal residence. The Principles and Recommendations for Population and Housing Censuses (UN Statistics Division 2008: 102, para. 1.463) defines usual residence as follows:
“It is recommended that countries apply a threshold of 12 months when considering place of usual residence according to one of the following two criteria:
(a) The place at which the person has lived continuously for most of the last 12 months (that is, for at least six months and one day), not including temporary absences for holidays or work assignments, or intends to live for at least six months;
(b) The place at which the person has lived continuously for at least the last 12 months, not including temporary absences for holidays or work assignments, or intends to live for at least 12 months.”
However, this definition does not deal with the situation of a person with two homes who regularly spends about six months in each. In general, we have to rely on people to self-define as residents or not, although some tests could be implemented (such as asking where their car is registered, where taxes are paid, where they voted, where the person sleeps at night on a regular basis, etc.). For most purposes, a person can distinguish between whether he or she is a usual resident and visitor, and this simple distinction suffices.
Migration has been the Cinderella of demography, kept in the background as far as possible, and dedicated migration surveys are few, far between, and specialized (an excellent example is the description of the Mexican Migration Project by Massey, Alarcon, Durand et al. (1987)). Dedicated migration surveys typically include full migration histories, which, though raising complex analytical issues, tend not to be focussed on the estimation of numbers of migrants/migrations. In this section we do not cover the analysis of such full histories (very few general principles apply to a useful number of them), but rather deal with the sorts of data collected by population censuses and general household surveys and sometimes, in developed countries, by some form of registration.
The most widely collected data relevant to migration is place of birth. Compared with place of residence at the time of a survey, this information describes lifetime migration. It provides limited information about the timing of migration, and is ‘net’ migration in the sense that it misses, entirely, migrations that have been reversed (back to the place of birth) and all intermediate migrations. At the time of data collection, decisions have to be taken about the granularity of the data: i.e., for those born abroad, how many countries should be explicitly recorded and, for those born in the country, what level of geography should be recorded. For the analyst, of course, these decisions were made at the questionnaire design stage, but some degree of greater aggregation may be required. The analysis of data on birthplace is described below, but it is useful to make two points here. First, if data on birthplace by age and sex are available for two points in time, it is possible to estimate net migration (by age and sex) during the interval. Second, although birthplace reflects lifetime migration, the length of “lifetime” varies by age, and (provided the census data on children are reasonably accurate, which is often not the case in developing countries) the migration of 0-4 year olds may be used as an indicator of the recent migration of their parents (Raymer and Rogers 2007).
Information on place of residence at a specified time in the past is very often collected in addition to that on birthplace, with the express objective of providing data on recent migration. The time point specified is generally five years earlier, but sometimes a one-year period is used. However, the question tends to work better if the time point is associated with a memorable event, such as the previous census, on the assumption that the coverage of that previous census was largely complete (so that people remember being counted). The longer time period identifies more migrants, but misses intermediate moves, whereas the shorter time period is more susceptible to reference period error (I moved “about a year ago”).
Information on place of previous residence is almost always collected as an alternative to residence at some specified time in the past, and is generally combined with an additional question about duration of current residence (or date of last move). The objective again is to provide data on recent migration.
The question on duration of residence refers to duration of residence in the civil division (such as a town or province), not in an individual dwelling unit. This question is of limited use on its own and tends to be paired with the one above to provide a time frame for estimates.
Though not involving a direct question about migration, intercensal population change by age and sex can, provided both censuses are reasonably accurate counts of the population, provide residual estimates of net migration between the two censuses (Hill 1987; Hill and Wong 2005; UN Population Division 1967). Intercensal population change (for cohorts or age groups) by age and sex is adjusted for the effects of intercensal fertility and mortality to provide a residual estimate of intercensal net migration (i.e., treating migration as the balancing item in the fundamental demographic balance equation). Migration is generally concentrated in the age range 20 to 40, ages at which mortality rates are, at least in the absence of HIV/AIDS, relatively low and fertility irrelevant, so residual migration estimates are insensitive to assumptions about fertility and mortality (except in populations severely affected by HIV/AIDS where using these data to estimate migration is not recommended). Such estimates are extremely sensitive, however, to even small changes in census coverage; such errors may be manifest in high age-specific migration rates over age 50, where migration is usually low.
It is not the purpose of this introduction to provide a comprehensive summary of all the measures and definitions – the interested reader is referred to the UN manual on internal migration (UN Population Division 1970) – but two are of particular importance for the chapters that follow.
Stocks of migrants are typically thought of as numbers of persons (by age group and sex) not born in the civil division of enumeration. The proportions born elsewhere (in the country or in other countries) give a good general sense of the magnitude of in-migration and immigration, but no sense of any dynamic changes that may have occurred recently. However, changes in stocks can be used to estimate immigration (net of any onward or return migration of the foreign-born).
Assuming that migration events can be fully and accurately identified, occurrence/exposure rates can be calculated for out-migration or emigration in exactly the same way as for mortality, dividing events in a period by exposure time; such rates can be crude (both sexes, all ages) or age-sex specific. The same is not the case (or at least not usefully) for in-migration or immigration, since the population exposed to the risk of migrating into a civil division is the entire population of the world living elsewhere. In-migration and immigration rates are always calculated by dividing events by the exposure time of the one population group not exposed to risk, the current residents; such rates can be crude (both sexes, all ages) or age-sex specific. Defining rates in this way has the advantage of satisfying the needs of the demographic balancing equation, since rates of gain and loss are measured relative to the same population. This confers a further advantage in that net migration rates can be estimated from the demographic balancing equation as population change between two time points (e.g. censuses) minus gains due to births in the interval plus losses due to deaths in the interval. However, this approach does have the disadvantage of removing the scale limits on “normal” occurrence/exposure rates; for example, at the extreme, a person moving into a previously unoccupied civil division creates an in-migration rate of infinity.
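The balancing-equation logic just described can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and all the numbers are hypothetical, and the exposure of current residents is approximated by the interval length times the mean of the two population counts, which is an assumption that the text itself does not make.

```python
def residual_net_migration(pop_start, pop_end, births, deaths):
    """Net migration as the balancing item: change minus births plus deaths."""
    return (pop_end - pop_start) - births + deaths


def person_years(pop_start, pop_end, years):
    """Approximate exposure of current residents (assumes linear change)."""
    return years * (pop_start + pop_end) / 2.0


def crude_rate(events, exposure):
    """Occurrence/exposure rate measured relative to the resident population."""
    return events / exposure


# Hypothetical civil division observed at two censuses five years apart.
pop_start, pop_end = 100_000, 108_000
births, deaths = 9_000, 4_000

net_migrants = residual_net_migration(pop_start, pop_end, births, deaths)
exposure = person_years(pop_start, pop_end, years=5)
net_migration_rate = crude_rate(net_migrants, exposure)

print(net_migrants, round(1000 * net_migration_rate, 2))
```

Because gains and losses are both divided by the exposure of current residents, the same `crude_rate` helper serves for in-migration, out-migration and net migration.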
The chapters in this section focus on the estimation and quantitative description of immigration and internal in- and out-migration. They are not meant to provide comprehensive coverage of all measures of migration, and specifically they do not cover the important, but problematic, issue of measuring emigration (other than by mentioning that the method of estimating immigration of foreigners (net of return/onward migration) can be applied to the data of the main countries of destination of emigrants to get some sense of the age profile and magnitude of emigration).
Chapter 35 concentrates on the basic methods of using data from censuses to estimate the numbers (net of return/onward migration) of immigrants from the change in stock of foreigners, and of internal in- and out-migration from the change in stock by place of birth and from the place of residence at some date prior to the census.
Chapter 36 describes the selection and fitting of one of the Rogers-Castro multi-exponential models to estimates of migration probabilities (or rates) derived from estimates of the number of migrants/migrations using non-linear optimisation procedures.
Chapter 37 describes the multiplicative and log-linear models for capturing, comparing and analysing the mass of inter-regional migration flows from places of origin to places of destination. The chapter also provides an introduction to the method of offsets for extending the use of these models to estimate inter-regional flows from marginal flows (i.e. total flows out of, or into, regions). The intention is to expand the material on the method of offsets into an additional chapter at a later date, which will be placed on the Tools for Demographic Estimation website.
As mentioned above, UN Manual VI (UN Population Division 1970) provides a comprehensive, if dated, introduction to the description and measurement of internal migration. Those looking for an overview of indirect methods of estimating migration are referred to the useful, if also somewhat dated, review by Zaba (1987). More specifically, Hill (1987) attempted to apply the logic underlying the Generalized Growth Balance method of adult mortality estimation (described in Chapter 24) to estimate undocumented migration, while Hill and Queiroz (2010) sought to estimate net migration in parallel with the estimation of mortality. Unfortunately neither method has proved to be particularly successful.
Those interested in reading more about the models of migration (multi-exponential, multiplicative and log-linear) or the method of offsets are referred to work by Rogers, Willekens and colleagues (e.g. Little and Rogers (2007), Raymer and Rogers (2007), Rogers (1980, 1986) and Willekens (1999)).
Hill K. 1987. "New approaches to the estimation of migration flows from census and administrative data sources", International Migration Review 21(4):1279-1303. doi: http://dx.doi.org/10.2307/2546515
Hill K and B Queiroz. 2010. "Adjusting the general growth balance method for migration", Revista Brasileira de Estudos de População 27(1):7-20. doi: http://dx.doi.org/10.1590/S0102-30982010000100002
Hill K and R Wong. 2005. "Mexico–US migration: Views from both sides of the border", Population and Development Review 31(1):1-18. doi: http://dx.doi.org/10.1111/j.1728-4457.2005.00050.x
Little JS and A Rogers. 2007. "What can the age composition of a population tell us about the age composition of its out-migrants?", Population, Space and Place 13(1):23-39. doi: http://dx.doi.org/10.1002/psp.440
Massey DS, R Alarcon, J Durand and H Gonzalez. 1987. Return to Aztlan: The Social Process of International Migration from Western Mexico. Berkeley and Los Angeles: University of California Press.
Raymer J and A Rogers. 2007. "Using age and spatial flow structures in the indirect estimation of migration streams", Demography 44(2):199-223. doi: http://dx.doi.org/10.1353/dem.2007.0016
Rogers A. 1980. "Introduction to multistate mathematical demography", Environment and Planning A 12:489-498. doi: http://dx.doi.org/10.1068/a120489
Rogers A. 1986. "Parameterized multistate population dynamics and projections", Journal of the American Statistical Association 81(393):48-61. doi: http://dx.doi.org/10.1080/01621459.1986.10478237
UN Population Division. 1967. Manual IV: Methods for Estimating Basic Demographic Measures from Incomplete Data. New York: United Nations, Department of Economic and Social Affairs, ST/SOA/Series A/42. http://www.un.org/esa/population/techcoop/DemEst/manual4/manual4.html
UN Population Division. 1970. Manual VI: Methods of Measuring Internal Migration. New York: United Nations, Department of Economic and Social Affairs, ST/SOA/Series A/47. http://www.un.org/esa/population/techcoop/IntMig/manual6/manual6.html
UN Population Division. 2011. International Migration Report 2009: A Global Assessment. New York: United Nations, Department of Economic and Social Affairs, ST/ESA/Series A/316. http://www.un.org/esa/population/publications/migration/WorldMigrationReport2009.pdf
UN Statistics Division. 2008. Principles and Recommendations for Population and Housing Censuses v.2. New York: United Nations, Department of Economic and Social Affairs, ST/ESA/STAT/SER.M/67/Rev2. http://unstats.un.org/unsd/publication/SeriesM/Seriesm_67rev2e.pdf
Willekens FJ. 1999. "Modeling approaches to the indirect estimation of migration flows: From entropy to EM", Mathematical Population Studies 7:239-278. doi: http://dx.doi.org/10.1080/08898489909525459
Zaba B. 1985. Measurement of Emigration Using Indirect Techniques: Manual for the Collection and Analysis of Data on Residence of Relatives. Liège, Belgium: Ordina Editions.
Zaba B. 1987. "The indirect estimation of migration: A critical review", International Migration Review 21(4):1395-1445. doi: http://dx.doi.org/10.2307/2546519
Estimating migration from census data is not technically complicated. Provided that the census(es) gather the appropriate information and are reasonably accurate it is possible to produce estimates of net immigration (i.e. immigration less emigration) of the foreign-born population (people born outside a particular country) and internal migration between (to and from) sub-national regions of a country, over the period between two censuses.
To estimate net immigration of foreigners one essentially subtracts from the number of foreign-born people enumerated in a census, the number of foreigners expected to have survived since being enumerated in the previous census.
In a similar way, if the censuses record the sub-national region of birth one can estimate net in-migration (i.e. net in-migration of those born outside the region less net out-migration of those born in the region) between sub-national regions of a country. However, if the census asks of people where they were living at some prior point in time, say at the time of the previous census, one is able to estimate directly the number of surviving migrants (i.e. migrants still alive at the time of the latest census) into and out of each sub-national region of the country since that prior point in time.
In order to estimate the number of migrants from the number of surviving migrants at the time of the second census one needs to add to these figures an estimate of the number of migrants who are expected to have died between moving and the time of the latest census.
If the latest census records other information such as year in which the migrant moved to the place at which the person was counted in the census, it is possible also to establish a trend of migration over time.
Migration differs from fertility and mortality in two respects. First, migrating is not final in the way that a birth or a death is. Second, we are concerned not only with the population of origin, from which the migrant moved (and which corresponds to a population exposed to risk, from which rates of migration akin to those of fertility and mortality can be calculated), but also with the destination population, to which the migrant moves. In addition, in order to understand migration one is often interested in distinguishing between different types of migration (whether temporary or more permanent, whether circulatory or unidirectional, etc.). For these reasons there is a much wider range of measures and terminology associated with migration than with either fertility or mortality. It is not the purpose of this chapter to cover these issues, and the interested reader is referred to the standard texts on the subject, such as UN Manual VI (UN Population Division 1970), Shryock and Siegel (1976) and Siegel and Swanson (2004).
- Censuses identify all foreign-born people accurately
- One is able to estimate the mortality of the foreign-born population accurately (either that the life table used is appropriate, or that the mortality is the same as that implied by the censuses for the native-born (locally-born) national population)
- No return migration of locally born emigrants
- Censuses count the population by sub-national region accurately and identify the region of birth accurately
- One is able to estimate the mortality of people moving between two regions accurately (either that the life table used is appropriate, or that the mortality is the same as that implied by the censuses for the native-born national population).
- Latest census identifies correctly all people who have moved from one region to another since the prior date (e.g. previous census)
- One is able to estimate the mortality of people moving between two regions accurately (either that the life table used is appropriate, or that the mortality is the same as that implied by the censuses for the native-born national population). Since one is estimating in- and out-migration separately (as opposed to net migration) this assumption is of less importance.
Before applying this method, you should investigate the quality of the data in at least the following dimensions:
Estimating migration using place of birth data from two censuses not only requires that the censuses count the population reasonably completely, but that the place of birth be accurately recorded. Often this is not the case, particularly when estimating immigration, where immigrants wish to hide the fact that they are foreign, but also in the case of internal migration where there may have been boundary changes or the respondent is ignorant about the place of birth of the person.
Estimating migration by asking questions of migrants depends heavily on the census identifying all those who have migrated, and on it correctly identifying the place from which they moved. To the extent that recent migrants are not yet established as residents of the region to which they have moved at the time of the census, they could be missed in the count.
Net migration, by definition, underestimates the flows of migrants into and out of a region or country. Thus, for example, people who moved into a region and then returned within the period being considered contribute nothing to net in-migration, even though each of them moved twice.
This method produces estimates of the net immigration of foreigners using place of birth data. It is important to stress that this method does not take into account or measure the immigration of returning native-born people who left the country prior to the previous census and returned before the second census. Thus this method is not recommended for the measurement of immigration where significant return migration of native-born people (for example, after exile or forced migration of refugees) is in progress.
If data on the number of foreign-born people in the population are available by age group for each census, then one needs to estimate the survival factors to be applied to the numbers of foreign-born in the first census to estimate the numbers surviving to the time of the second census. The user can choose between years of life lived in five-year age groups (${}_5L_x$) based on the standard from the General family of United Nations model life tables, any one of the four families of Princeton model life tables, or a model life table of a population experiencing an AIDS epidemic (Timæus 2004), all of which appear in the Models spreadsheet of the associated workbook. This spreadsheet also allows the user to input years of life lived in five-year age groups from an alternative life table if there is reason to assume that that life table has a pattern of mortality similar to that of the population in question. Failing this, the survival factors can be derived from the proportion of each five-year age group of the native-born population surviving from the first to the second census (assumed to be n years apart, where n is a multiple of 5). Thus ${}_5S_x$, ${}_\infty S_{A-n}$ and $S_B$, the n-year survival factors for a group of people aged x to x + 5 at the previous census, aged A-n and older at the previous census, and born between censuses, respectively, are estimated as follows:

$$ {}_5S_x = \frac{{}_5N^{nb}_{x+n}(t+n)}{{}_5N^{nb}_x(t)}, \qquad {}_\infty S_{A-n} = \frac{{}_\infty N^{nb}_A(t+n)}{{}_\infty N^{nb}_{A-n}(t)}, \qquad S_B = \frac{{}_nN^{nb}_0(t+n)}{B^{nb}} $$

where the superscript nb represents ‘native-born’, ${}_5N^{nb}_x(t)$ represents the native-born population aged x to x + 5 in the census at time t (with ${}_\infty N^{nb}_x$ and ${}_nN^{nb}_0$ the corresponding counts aged x and older, and aged under n) and $B^{nb}$ represents the number of native-born births between time t and t + n.
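As a rough illustration of how survival factors might be derived from two native-born census counts, the sketch below takes the ratio of each cohort's size at the second census to its size at the first. The function name, argument layout and the tiny three-group data set are all hypothetical, and the sketch assumes both censuses use the same five-year age groups ending in the same open-ended group.

```python
def census_survival_factors(nb_first, nb_second, births_nb, n=5):
    """
    Survival factors from native-born counts at two censuses n years apart.

    nb_first, nb_second: counts in five-year age groups (0-4, 5-9, ...),
        the last element being the open-ended group aged A and over.
    births_nb: native-born births between the two censuses.
    n: interval between censuses in years (a positive multiple of 5).
    Returns (closed, open_ended, born_between).
    """
    if n <= 0 or n % 5 != 0:
        raise ValueError("n must be a positive multiple of 5")
    if len(nb_first) != len(nb_second):
        raise ValueError("both censuses must use the same age groups")
    k = n // 5                      # number of age groups a cohort moves up
    last = len(nb_first) - 1        # index of the open-ended group
    if k > last:
        raise ValueError("interval too long for the age groups supplied")

    # Cohorts aged x to x+5 at the first census, still in a closed group later.
    closed = [nb_second[i + k] / nb_first[i] for i in range(last - k)]
    # Everyone aged A-n and over at the first census is aged A and over later.
    open_ended = nb_second[last] / sum(nb_first[last - k:])
    # Those born between the censuses are aged under n at the second census.
    born_between = sum(nb_second[:k]) / births_nb
    return closed, open_ended, born_between


# Hypothetical counts for ages 0-4, 5-9 and 10+.
closed, open_ended, born_between = census_survival_factors(
    nb_first=[100, 90, 80], nb_second=[95, 98, 120], births_nb=100, n=5
)
print(closed, open_ended, born_between)
```

Ratios derived in this way absorb any difference in coverage between the two censuses and any migration of the native-born, so they are only as good as the census counts themselves.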
If the data are not available in five-year age groups, the net number of immigrants can still be estimated in total provided we have an estimate of the crude death rate for the population (which might, in the absence of any evidence to the contrary, be assumed to be that of the native-born population).
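If data on the number of foreign-born people in the population are available by age group for two censuses (n years apart) then one needs to estimate the number of deaths of foreign-born people (denoted by the superscript F) aged between x and x+5 at the first census (at time t), ${}_5D^F_x$, aged A-n and older at the first census, ${}_\infty D^F_{A-n}$, and those born between the censuses, $D^F_B$, as follows:

$$ {}_5D^F_x = \frac{1}{2}\left({}_5N^F_x(t) + \frac{{}_5N^F_{x+n}(t+n)}{{}_5S_x}\right)\left(1 - {}_5S_x\right) $$

$$ {}_\infty D^F_{A-n} = \frac{1}{2}\left({}_\infty N^F_{A-n}(t) + \frac{{}_\infty N^F_A(t+n)}{{}_\infty S_{A-n}}\right)\left(1 - {}_\infty S_{A-n}\right) $$

$$ D^F_B = \frac{1}{2}\cdot\frac{{}_nN^F_0(t+n)}{S_B}\left(1 - S_B\right) $$

where ${}_5N^F_x(t)$ represents the number of foreign-born people according to the census at time t who were aged between x and x+5. Each expression is the average of the deaths implied by surviving the first census count forward and by reverse-surviving the second census count.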
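A minimal sketch of the deaths calculation for a single cohort follows. It assumes that deaths are taken as the average of a forward-survival estimate (from the first census count) and a reverse-survival estimate (from the second census count), which is the form that reproduces the figures in Table 1 of the worked example later in this chapter; the function name is hypothetical.

```python
def intercensal_deaths(n_first, n_second, survival):
    """
    Deaths between two censuses for one cohort of foreign-born people.

    n_first: cohort size at the first census (0 for those born in between).
    n_second: size of the same cohort at the second census.
    survival: survival factor for the cohort over the intercensal period.
    """
    forward = n_first * (1.0 - survival)            # deaths among those present at the start
    reverse = n_second * (1.0 / survival - 1.0)     # deaths implied by those present at the end
    return (forward + reverse) / 2.0


# Figures from Table 1: males aged 20-24 in 2001 and 25-29 five years later.
print(round(intercensal_deaths(69_787, 95_763, 0.96458)))
# Those born between the censuses have no count at the first census.
print(round(intercensal_deaths(0, 12_577, 0.94151)))
```

Averaging the two estimates amounts to assuming that net migrants are exposed to the risk of dying for about half of the interval.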
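If data and/or survival factors are not available by age group then one can estimate the total number of deaths of the foreign-born people as follows, approximating the person-years lived by the foreign-born between the censuses by n times the mean of the two census totals:

$$ D^F = n \cdot {}_\infty m_0 \cdot \frac{N^F(t) + N^F(t+n)}{2} $$

where ${}_\infty m_0$ is an estimate of the crude mortality rate of the population in the country of the census and $N^F(t)$ is the total number of foreign-born people according to the census at time t.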
However, if the age distribution of the foreign-born population is markedly different from that of the population in the country of the census, then this can produce a poor approximation to the true number of deaths.
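The crude-rate shortcut can be sketched as below. The sketch assumes that person-years lived by the foreign-born are approximated by the interval length times the mean of the two census totals; that approximation, and the function name, are assumptions made for illustration rather than something the text spells out.

```python
def total_deaths_from_crude_rate(total_first, total_second, crude_death_rate, n):
    """
    Total intercensal deaths of the foreign-born from a crude death rate.

    total_first, total_second: total foreign-born counts at the two censuses.
    crude_death_rate: deaths per person-year (e.g. 0.014 for 14 per 1,000).
    n: years between the censuses.
    """
    person_years = n * (total_first + total_second) / 2.0  # assumes linear change
    return crude_death_rate * person_years


# Totals from Table 1 of the worked example, with a rate of 14 per 1,000.
print(round(total_deaths_from_crude_rate(611_423, 754_608, 0.014, 5)))
```

Under these assumptions the South African totals give roughly 47,800 deaths, well below the 79,509 obtained by age group in Table 1, which illustrates how poor the approximation can be when the foreign-born age distribution differs from that of the national population.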
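If data are available by age group for each census then age-specific net immigration can be estimated as follows:

$$ \text{Net}\,{}_5M^F_x = {}_5N^F_{x+n}(t+n) - {}_5N^F_x(t) + {}_5D^F_x $$

for x = 0, 5, … , A-5-n, where $\text{Net}\,{}_5M^F_x$ represents the net number of immigrants between times t and t+n who were aged between x and x + 5 at time t. For x > A - 5 - n (that is, for those aged A-n and older at time t):

$$ \text{Net}\,{}_\infty M^F_{A-n} = {}_\infty N^F_A(t+n) - {}_\infty N^F_{A-n}(t) + {}_\infty D^F_{A-n} $$

The net number of immigrants of those born between times t and t+n is estimated as follows:

$$ \text{Net}\,M^F_B = {}_nN^F_0(t+n) + D^F_B $$

If data and/or survival factors are not available by age group then one would estimate the total net number of immigrants as follows:

$$ \text{Net}\,M^F = N^F(t+n) - N^F(t) + D^F $$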
Net in-migration into a particular sub-national region from other regions in the country can be estimated in exactly the same way as the international immigration, described above, by replacing the foreign-born population with the population born outside the region.
In addition, applying the same method to data on the change in the numbers of population born in (rather than outside) and living outside the region of interest allows us to estimate the net out-migration of those born in the region to other regions in the country. Subtracting this from the net in-migration of those born outside the region gives an estimate of the overall net in-migration into the region of interest.
If there is reason to suspect that there is a material difference in the mortality experienced by those born outside who moved into the region and those born in the region who moved out, and one has appropriate survival factors, then one could apply different survival factors to each when estimating the net number of migrants. However, in practice inaccuracies in the census data on place of residence at previous census are likely to outweigh any increase in accuracy achieved by using differential mortality.
Net sub-national inter-regional migration is estimated directly by tabulating the people in each region at the time of the census who have moved since the previous census by the place (e.g. region) in which they were living at a given prior date (e.g. at the time of the previous census). If the estimates are confined to inter-regional flows, the sum of the numbers of inter-regional in-migrants should equal the sum of inter-regional out-migrants; however, if the data include immigration to the sub-national regions from outside the country, one can extend the estimates of in-migration to include international immigration into each region.
Since one of the major areas of interest is the magnitude of inter-regional flows of the population, one is as interested in the total numbers of migrants between regions as one is in the age distributions of particular flows.
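The number of migrants is derived from the number of surviving in- and out-migrants as follows:

$$ {}_5I_x = \frac{{}_5I'_x}{2}\left(1 + \frac{1}{{}_5S_{x-n}}\right), \qquad {}_5O_x = \frac{{}_5O'_x}{2}\left(1 + \frac{1}{{}_5S_{x-n}}\right) $$

where the superscript (′) represents numbers surviving, ${}_5I'_x$ and ${}_5O'_x$ respectively represent the number of surviving in-migrants into, and the number of surviving out-migrants from, a particular region at the time of the second census who were aged between x and x+5 at the second census, and ${}_5S_{x-n}$ is the survival factor of the corresponding cohort over the period since the prior date.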
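The adjustment from surviving migrants to migrants can be sketched as follows. The sketch assumes that the deaths added back are half of those implied by reverse-surviving the migrants over the whole interval (that is, that migrants moved on average midway through the period), which is consistent with the treatment of deaths in the place-of-birth method; the function name and the numbers are hypothetical.

```python
def migrants_from_survivors(surviving, survival):
    """
    Number of migrants implied by the number still alive at the second census.

    surviving: surviving in- or out-migrants in one age group.
    survival: survival factor of that cohort over the full interval.
    Adds back half of the deaths implied by reverse survival.
    """
    return surviving * (1.0 + 1.0 / survival) / 2.0


# Hypothetical surviving in-migrants by age group with matching survival factors.
surviving_in = [1_000, 2_500, 400]
survival = [0.80, 0.95, 0.50]
in_migrants = [migrants_from_survivors(m, s) for m, s in zip(surviving_in, survival)]
print([round(m, 1) for m in in_migrants])
```

The adjustment matters most at older ages, where survival factors are low; at the ages where migration is concentrated it changes the estimates very little.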
This example uses data on the numbers of males in the population from the South African Census in 2001 and a ‘census replacement survey’, the Community Survey in 2007. (Although the survey was conducted approximately 5.35 years after the night of the census in 2001, it is assumed for the purposes of presentation here to have been exactly five years after the census in 2001.) The examples appear in the Migration_South Africa_males.xlsx workbook.
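The survival factors are shown in the fifth column of Table 1. The values are derived from (the years of life lived in each age group of) the alternative life table entered in the Models spreadsheet, for those aged 20 to 24 last birthday and those aged 80 and over at the time of the first census, and those born between the two censuses, as follows:

$$ {}_5S_{20} = \frac{{}_5L_{25}}{{}_5L_{20}} = 0.96458, \qquad {}_\infty S_{80} = \frac{T_{85}}{T_{80}} = 0.40912, \qquad S_B = \frac{{}_5L_0}{5\,l_0} = 0.94151 $$

where $T_x$ is the number of years of life lived above age x and $l_0$ is the radix of the life table.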
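For readers who prefer code to spreadsheet formulas, the sketch below shows one conventional way of turning life-table years lived into five-year survival factors. The function name and the three-group life table are hypothetical (the values in the Models spreadsheet are not reproduced here), and the sketch assumes censuses exactly five years apart.

```python
def life_table_survival_factors(L, T_open, radix=1.0):
    """
    Five-year survival factors from a life table.

    L: years lived in each closed five-year age group (0-4, 5-9, ...).
    T_open: years lived above the upper bound of the last closed group.
    radix: number of births on which the life table is based (l0).
    Returns (closed, open_ended, born_between).
    """
    # Survival from one closed age group into the next.
    closed = [L[i + 1] / L[i] for i in range(len(L) - 1)]
    # Survival of the last closed group and everyone older into the open group.
    open_ended = T_open / (L[-1] + T_open)
    # Survival of five years of births to ages 0-4.
    born_between = L[0] / (5.0 * radix)
    return closed, open_ended, born_between


# Hypothetical life table with closed groups 0-4, 5-9, 10-14 and an open group 15+.
closed, open_ended, born_between = life_table_survival_factors(
    L=[4.8, 4.7, 4.5], T_open=9.0, radix=1.0
)
print(closed, open_ended, born_between)
```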
Table 1 Estimation of deaths of foreign-born and the net number of immigrants by age group, South Africa, 2001-2006
| Age | 2001 | 2006 | x | 5Sx | Age at 2nd census | DF | Net M |
|---|---|---|---|---|---|---|---|
| | | | B | 0.94151 | | | |
| 0-4 | 8,963 | 12,577 | 0 | 0.97896 | 0-4 | 391 | 12,968 |
| 5-9 | 10,390 | 13,724 | 5 | 0.99547 | 5-9 | 242 | 5,003 |
| 10-14 | 13,508 | 13,998 | 10 | 0.99427 | 10-14 | 55 | 3,664 |
| 15-19 | 27,835 | 27,943 | 15 | 0.98602 | 15-19 | 119 | 14,555 |
| 20-24 | 69,787 | 59,493 | 20 | 0.96458 | 20-24 | 616 | 32,275 |
| 25-29 | 87,381 | 95,763 | 25 | 0.93161 | 25-29 | 2,994 | 28,970 |
| 30-34 | 73,338 | 100,450 | 30 | 0.90960 | 30-34 | 6,675 | 19,743 |
| 35-39 | 66,663 | 85,490 | 35 | 0.89780 | 35-39 | 7,563 | 19,715 |
| 40-44 | 59,152 | 75,684 | 40 | 0.89092 | 40-44 | 7,701 | 16,721 |
| 45-49 | 45,184 | 66,113 | 45 | 0.88633 | 45-49 | 7,274 | 14,234 |
| 50-54 | 40,398 | 55,913 | 50 | 0.87224 | 50-54 | 6,154 | 16,883 |
| 55-59 | 30,640 | 42,833 | 55 | 0.84731 | 55-59 | 5,717 | 8,153 |
| 60-64 | 24,376 | 34,433 | 60 | 0.80885 | 60-64 | 5,442 | 9,234 |
| 65-69 | 17,895 | 25,588 | 65 | 0.75468 | 65-69 | 5,353 | 6,564 |
| 70-74 | 13,561 | 18,989 | 70 | 0.66991 | 70-74 | 5,281 | 6,375 |
| 75-79 | 10,238 | 12,850 | 75 | 0.56388 | 75-79 | 5,404 | 4,693 |
| 80-84 | 7,658 | 7,461 | 80+ | 0.40912 | 80-84 | 5,118 | 2,341 |
| 85+ | 4,455 | 5,305 | | | 85+ | 7,410 | 602 |
| Total | 611,423 | 754,608 | | | Total | 79,509 | 222,693 |
Since we have data on the number of foreign-born people in the population by age group for each census we can estimate the number of deaths of foreign-born people which occurred in the period between the two censuses by age group using the numbers of foreigners in each census given in the second and third columns of Table 1. For those aged 20 to 24 last birthday and those aged 80 and over at the time of the first census, and those born between the two censuses, the calculations are as follows:
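The three calculations can be sketched as follows. This is a minimal illustration rather than the workbook's own formula: it assumes deaths are obtained by applying the probability of dying, 1 - S, to the average of the opening count and the reverse-survived closing count, and that net migration then follows from the balance of the two counts. The function name `deaths_and_net_migrants` is illustrative. With the Table 1 inputs this assumption reproduces the published DF and Net M values to within rounding.

```python
def deaths_and_net_migrants(n_first, n_second, survival):
    """Return (deaths, net migrants) for one cohort of the foreign-born.

    n_first  -- cohort size at the first census
    n_second -- size of the same cohort at the second census
    survival -- proportion of the cohort surviving the intercensal period
    """
    # Assumed form: apply (1 - S) to the mean of the opening count and the
    # closing count survived backwards to the start of the period.
    deaths = 0.5 * (n_first + n_second / survival) * (1.0 - survival)
    # Balance equation: closing = opening - deaths + net migrants.
    net_migrants = n_second - n_first + deaths
    return deaths, net_migrants


# Aged 20-24 at the first census (25-29 at the second), Table 1.
d_20, m_20 = deaths_and_net_migrants(69_787, 95_763, 0.96458)

# Aged 80 and over at the first census (85+ at the second), Table 1.
d_80, m_80 = deaths_and_net_migrants(7_658 + 4_455, 5_305, 0.40912)

# Born between the censuses: no foreign-born members at the start.
d_b, m_b = deaths_and_net_migrants(0, 12_577, 0.94151)

for label, d, m in [("20-24", d_20, m_20), ("80+", d_80, m_80), ("B", d_b, m_b)]:
    print(label, round(d), round(m))
```

The rounded results can be compared with the DF and Net M entries for ages 25-29, 85+ and 0-4 at the second census in Table 1.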
If data and/or survival factors were not available by age group then one could estimate the total number of deaths of the foreign born people as follows, given an estimate of the crude mortality rate in the population of 14 per 1,000:
Since data are available by age group for each census, age-specific net immigration of those born outside the country can be estimated as follows:
If data and/or survival factors were not available by age group then one could estimate the total net number of immigrants as follows:
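A rough sketch of the all-ages calculation is given below. It rests on assumptions that are not confirmed by the text: person-years lived are approximated by the average of the two counts multiplied by the length of the interval, deaths are the crude rate (14 per 1,000) applied to those person-years, and net immigration is the change in the count plus those deaths. The function name `aggregate_net_immigrants` is illustrative.

```python
def aggregate_net_immigrants(n_first, n_second, crude_death_rate, years):
    """All-ages estimate of deaths and net immigration of the foreign-born."""
    # Assumption: the population changes linearly between the two counts.
    person_years = years * (n_first + n_second) / 2.0
    deaths = crude_death_rate * person_years
    # Balance equation applied to the totals.
    net_immigrants = n_second - n_first + deaths
    return deaths, net_immigrants


# Totals from Table 1, a crude rate of 14 per 1,000 and a five-year interval.
agg_deaths, agg_net = aggregate_net_immigrants(611_423, 754_608, 0.014, 5)
print(round(agg_deaths), round(agg_net))
```

Under these assumptions the totals come out well below the age-specific results in Table 1 (79,509 deaths and 222,693 net immigrants), which illustrates how sensitive the all-ages approach is to applying a national crude rate to a migrant population with a different age structure.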
The second and third columns of Table 2 show the numbers of people living in the Western Cape province of South Africa who were born outside the province, as counted by the 2001 Census and the 2007 Community Survey, respectively. Although the same survival factors (column 5) have been used as in the example of Method A, different factors should be used if the mortality experience of the native-born and of immigrants is thought to be very different. The final column of Table 2 gives the net numbers of migrants into the Western Cape who were born outside the Western Cape, for the different age groups. Thus in total a net 213,911 people born outside the Western Cape moved to the Western Cape (i.e. after subtracting those who moved out).
Table 2 Estimation of the net number of in-migrants of those born outside by age group, Western Cape, South Africa, 2001-2006
| Age | 2001 | 2006 | x | 5Sx | Age at 2nd census | DO | Net M (born out) |
|---|---|---|---|---|---|---|---|
| | | | B | 0.94151 | | | |
| 0-4 | 16,443 | 19,012 | 0 | 0.97896 | 0-4 | 591 | 19,602 |
| 5-9 | 24,406 | 28,743 | 5 | 0.99547 | 5-9 | 482 | 12,782 |
| 10-14 | 31,134 | 30,792 | 10 | 0.99427 | 10-14 | 125 | 6,511 |
| 15-19 | 44,478 | 53,933 | 15 | 0.98602 | 15-19 | 245 | 23,043 |
| 20-24 | 74,011 | 82,526 | 20 | 0.96458 | 20-24 | 896 | 38,944 |
| 25-29 | 80,187 | 89,522 | 25 | 0.93161 | 25-29 | 2,954 | 18,466 |
| 30-34 | 65,833 | 90,783 | 30 | 0.90960 | 30-34 | 6,074 | 16,670 |
| 35-39 | 56,393 | 76,475 | 35 | 0.89780 | 35-39 | 6,776 | 17,417 |
| 40-44 | 44,420 | 59,692 | 40 | 0.89092 | 40-44 | 6,268 | 9,567 |
| 45-49 | 32,862 | 47,612 | 45 | 0.88633 | 45-49 | 5,338 | 8,529 |
| 50-54 | 28,178 | 37,969 | 50 | 0.87224 | 50-54 | 4,303 | 9,409 |
| 55-59 | 19,983 | 30,205 | 55 | 0.84731 | 55-59 | 4,012 | 6,039 |
| 60-64 | 17,569 | 25,593 | 60 | 0.80885 | 60-64 | 3,832 | 9,442 |
| 65-69 | 11,216 | 20,802 | 65 | 0.75468 | 65-69 | 4,137 | 7,371 |
| 70-74 | 8,365 | 12,612 | 70 | 0.66991 | 70-74 | 3,426 | 4,822 |
| 75-79 | 5,919 | 8,434 | 75 | 0.56388 | 75-79 | 3,458 | 3,528 |
| 80-84 | 4,063 | 5,061 | 80+ | 0.40912 | 80-84 | 3,248 | 2,390 |
| 85+ | 2,152 | 2,183 | | | 85+ | 3,413 | -620 |
| Total | 567,613 | 721,949 | | | Total | 59,576 | 213,911 |
The second and third columns of Table 3 present the numbers of people living in provinces other than the Western Cape who were born in the Western Cape, as counted by the 2001 Census and the 2007 Community Survey, respectively. The net number of out-migrants of those born in the Western Cape (i.e. the number of people born in the Western Cape who moved out, less those who returned) is given in column 8. A negative number means that net out-migration was negative (i.e. the number of those born in the Western Cape who moved to other provinces in the period was smaller than the number born in the Western Cape and living elsewhere who returned during the period). Thus the total of -19,017 means that the number of people born in the Western Cape who returned to the province during the period, having lived in another province until 2001, exceeded the number born in the Western Cape who moved to another province in the period by 19,017.
These estimates were derived using the same survival factors as were used for those born outside the Western Cape who moved into the province, but if there was reason to suppose that the mortality differed for those born in the Western Cape who moved out, then a different set of survival factors would be used to estimate the Net M (born in) numbers.
The overall net in-migration for the province is thus given in the final column of Table 3. Thus in total 232,928 more people moved into the Western Cape than left the Western Cape to live in another province.
In this example those born outside the province include those born outside the country and thus the overall net migration includes immigrants who settle in the province. Excluding the foreign-born from Table 2 would produce numbers of internal in-migrants net of internal out-migrants, and the sum of these numbers for all the provinces together would be zero.
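The combination of the two place-of-birth estimates can be illustrated with a short sketch. It assumes only what the text states, namely that overall net in-migration is the net in-migration of those born outside the province less the net out-migration of those born inside it; the function name `overall_net_in_migration` is illustrative, and the inputs are taken from Tables 2 and 3.

```python
def overall_net_in_migration(net_in_born_outside, net_out_born_inside):
    """Overall net in-migration for a region from the two birthplace groups.

    A negative net_out_born_inside (net return of the locally born)
    therefore adds to the overall net inflow.
    """
    return net_in_born_outside - net_out_born_inside


# All ages: Table 2 total and Table 3 total.
total = overall_net_in_migration(213_911, -19_017)

# Aged 5-9 at the second census: Table 2 and Table 3 entries.
aged_5_9 = overall_net_in_migration(12_782, -9_180)

print(total, aged_5_9)  # 232928 21962
```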
Table 3 Estimation of the net number of out-migrants of those born inside by age group, Western Cape, South Africa, 2001-2006
| Age | 2001 | 2006 | x | 5Sx | Age at 2nd census | DI | Net M (born in) | Overall Net M |
|---|---|---|---|---|---|---|---|---|
| | | | B | 0.94151 | | | | |
| 0-4 | 22,055 | 11,747 | 0 | 0.97896 | 0-4 | 365 | 12,112 | 7,490 |
| 5-9 | 21,895 | 12,509 | 5 | 0.99547 | 5-9 | 367 | -9,180 | 21,962 |
| 10-14 | 21,382 | 11,593 | 10 | 0.99427 | 10-14 | 76 | -10,226 | 16,737 |
| 15-19 | 18,265 | 13,455 | 15 | 0.98602 | 15-19 | 100 | -7,827 | 30,870 |
| 20-24 | 14,645 | 10,477 | 20 | 0.96458 | 20-24 | 202 | -7,587 | 46,531 |
| 25-29 | 13,501 | 9,534 | 25 | 0.93161 | 25-29 | 434 | -4,676 | 23,142 |
| 30-34 | 13,118 | 11,047 | 30 | 0.90960 | 30-34 | 867 | -1,587 | 18,257 |
| 35-39 | 12,121 | 14,614 | 35 | 0.89780 | 35-39 | 1,319 | 2,815 | 14,602 |
| 40-44 | 11,725 | 12,195 | 40 | 0.89092 | 40-44 | 1,311 | 1,384 | 8,183 |
| 45-49 | 10,335 | 10,538 | 45 | 0.88633 | 45-49 | 1,285 | 98 | 8,431 |
| 50-54 | 9,211 | 9,881 | 50 | 0.87224 | 50-54 | 1,221 | 768 | 8,642 |
| 55-59 | 7,264 | 10,568 | 55 | 0.84731 | 55-59 | 1,362 | 2,720 | 3,319 |
| 60-64 | 6,691 | 7,723 | 60 | 0.80885 | 60-64 | 1,250 | 1,710 | 7,732 |
| 65-69 | 4,643 | 5,297 | 65 | 0.75468 | 65-69 | 1,265 | -128 | 7,499 |
| 70-74 | 3,954 | 3,766 | 70 | 0.66991 | 70-74 | 1,182 | 304 | 4,517 |
| 75-79 | 2,331 | 2,384 | 75 | 0.56388 | 75-79 | 1,240 | -330 | 3,858 |
| 80-84 | 1,402 | 2,140 | 80+ | 0.40912 | 80-84 | 1,336 | 1,145 | 1,244 |
| 85+ | 707 | 555 | | | 85+ | 1,024 | -531 | -89 |
| Total | 195,246 | 160,023 | | | Total | 16,206 | -19,017 | 232,928 |
Table 4 presents the answers to the question about place (province in this example) of residence at the time of the 2001 Census given by those counted in each of the provinces in the 2007 Community Survey. (In actual fact the question asked whether the person was staying at the same place at the time of the prior census and, if not, where they were staying before they moved to the place at which they were counted in the Community Survey. However, work by Dorrington and Moultrie (2009), which used these data and the year of movement to back-project the population to its province of residence at the time of the previous census, suggests that the assumption of only one move in the five years since the previous census is reasonably accurate.)
By far the largest numbers of migrants are those who moved within each of the provinces; however, these have been excluded from Table 4 because one is usually more interested in interprovincial migration than in migration within a province.
Table 4 Interprovincial migration, South Africa, 2001-2006
Rows give the previous residence (origin); columns give the province where counted (destination).
| Previous residence (origin) | WC | EC | NC | FS | KZ | NW | GT | MP | LM | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| WC | | 12,173 | 4,060 | 1,745 | 3,221 | 2,113 | 16,400 | 1,405 | 874 | 41,992 |
| EC | 52,239 | | 1,120 | 7,187 | 25,209 | 14,430 | 28,633 | 4,693 | 2,116 | 135,626 |
| NC | 4,813 | 1,942 | | 3,480 | 908 | 3,728 | 4,956 | 1,062 | 357 | 21,246 |
| FS | 2,943 | 3,145 | 2,546 | | 2,352 | 12,733 | 19,920 | 4,293 | 1,963 | 49,896 |
| KZ | 6,762 | 7,015 | 631 | 2,358 | | 3,573 | 50,980 | 8,886 | 1,194 | 81,399 |
| NW | 1,478 | 907 | 9,811 | 5,555 | 2,329 | | 47,633 | 3,090 | 4,337 | 75,140 |
| GT | 24,891 | 12,948 | 3,962 | 11,437 | 18,145 | 32,433 | | 18,598 | 15,133 | 137,547 |
| MP | 2,134 | 1,317 | 280 | 1,724 | 4,546 | 5,767 | 42,941 | | 8,628 | 67,338 |
| LM | 2,754 | 1,583 | 255 | 1,709 | 2,209 | 9,773 | 81,394 | 24,211 | | 123,889 |
| OSA | 21,221 | 5,467 | 1,209 | 9,584 | 10,933 | 11,437 | 51,873 | 8,335 | 9,286 | 129,346 |
| DNK | 500 | 3 | 15 | 124 | 132 | 78 | 228 | 89 | 0 | 1,170 |
| UNS | 1,058 | 1,029 | 107 | 208 | 875 | 508 | 3,558 | 408 | 633 | 8,384 |
| Total | 120,794 | 47,528 | 23,996 | 45,111 | 70,860 | 96,573 | 348,516 | 75,070 | 44,524 | 872,973 |

WC = Western Cape, EC = Eastern Cape, NC = Northern Cape, FS = Free State, KZ = KwaZulu-Natal, NW = North West, GT = Gauteng, MP = Mpumalanga, LM = Limpopo, OSA = Outside SA, DNK = Do not know, UNS = Unspecified
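How net migration is read off an origin-destination table of this kind can be sketched as follows. The example is a minimal illustration using only the flows between three provinces (WC, EC and GT) taken from Table 4; the dictionary layout and the function name `net_migration` are assumptions made for the sketch. In-migrants to a region are the sum of its column and out-migrants are the sum of its row.

```python
# Flows between three provinces from Table 4, keyed by (origin, destination).
flows = {
    ("WC", "EC"): 12_173, ("WC", "GT"): 16_400,
    ("EC", "WC"): 52_239, ("EC", "GT"): 28_633,
    ("GT", "WC"): 24_891, ("GT", "EC"): 12_948,
}


def net_migration(flows):
    """Net in-migration per region: in-migrants minus out-migrants."""
    regions = sorted({region for pair in flows for region in pair})
    net = {}
    for region in regions:
        in_migrants = sum(n for (orig, dest), n in flows.items() if dest == region)
        out_migrants = sum(n for (orig, dest), n in flows.items() if orig == region)
        net[region] = in_migrants - out_migrants
    return net


net = net_migration(flows)
print(net)                # {'EC': -55751, 'GT': 7194, 'WC': 48557}
print(sum(net.values()))  # 0: internal flows cancel across regions
```

Because every move out of one region is a move into another, the net figures for a closed set of regions sum to zero; this no longer holds once flows from outside the country (the OSA row) are included.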
In addition to the all-age numbers in Table 4 (in actual fact these numbers exclude, as is often the case, migration of those born between the census and survey) one can also produce numbers of in- and out-migration by age groups as shown in Table 5. For completeness these numbers include estimates of the number of migrants who were born since the previous census. However, relative to the other migrants these numbers look implausibly high, and the reason for this is discussed below.
The net number of migrants is estimated for those aged 25-29 at the time of the Community Survey (i.e. were aged 20-24 at the time of the 2001 census), for example, as follows:
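A minimal sketch of this calculation is given below. It assumes that the surviving net number (in-migrants less out-migrants) is scaled up by the average of 1 and 1/S to allow for migrants who died before the second count, an adjustment that reproduces the Table 5 figures to within rounding of the published inputs. The function name `net_migrants` is illustrative.

```python
def net_migrants(surviving_in, surviving_out, survival):
    """Net in-migrants for one cohort, allowing for deaths of migrants.

    surviving_in  -- in-migrants alive at the second count (I')
    surviving_out -- out-migrants alive at the second count (O')
    survival      -- survival factor for the cohort over the interval
    """
    surviving_net = surviving_in - surviving_out
    # Assumed adjustment: average of the forward (x1) and reverse (x1/S) factors.
    return surviving_net * 0.5 * (1.0 + 1.0 / survival)


# Aged 25-29 at the Community Survey (20-24 at the 2001 Census), Table 5.
net_25_29 = net_migrants(20_675, 5_649, 0.96458)

# Aged 0-4 at the Community Survey (born since the census), using SB.
net_0_4 = net_migrants(20_846, 11_747, 0.94151)

print(round(net_25_29), round(net_0_4))
```

The results can be compared with the entries of 15,301 and 9,381 in the final column of Table 5.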
Table 5 Estimation of the net number of in-migrants by age group, Western Cape, South Africa, 2001-2006
| Age | Surviving in-migrants (I’) | Surviving out-migrants (O’) | x | 5Sx | Net in-migrants |
|---|---|---|---|---|---|
| 0-4 | 20,846 | 11,747 | B | 0.94151 | 9,381 |
| 5-9 | 6,586 | 3,554 | 0 | 0.97896 | 3,065 |
| 10-14 | 6,685 | 2,882 | 5 | 0.99547 | 3,812 |
| 15-19 | 10,402 | 3,967 | 10 | 0.99427 | 6,454 |
| 20-24 | 21,266 | 4,488 | 15 | 0.98602 | 16,897 |
| 25-29 | 20,675 | 5,649 | 20 | 0.96458 | 15,301 |
| 30-34 | 15,584 | 6,008 | 25 | 0.93161 | 9,928 |
| 35-39 | 10,584 | 5,098 | 30 | 0.90960 | 5,758 |
| 40-44 | 7,264 | 3,045 | 35 | 0.89780 | 4,458 |
| 45-49 | 4,648 | 2,714 | 40 | 0.89092 | 2,053 |
| 50-54 | 3,095 | 1,500 | 45 | 0.88633 | 1,698 |
| 55-59 | 3,940 | 935 | 50 | 0.87224 | 3,225 |
| 60-64 | 3,776 | 527 | 55 | 0.84731 | 3,541 |
| 65-69 | 3,127 | 818 | 60 | 0.80885 | 2,582 |
| 70-74 | 1,540 | 437 | 65 | 0.75468 | 1,282 |
| 75-79 | 561 | 206 | 70 | 0.66991 | 442 |
| 80-84 | 797 | 116 | 75 | 0.56388 | 944 |
| 85+ | 264 | 47 | 80+ | 0.40912 | 374 |
| Total | 141,640 | 53,739 | | | 91,194 |
Perhaps the simplest check of the reasonableness of the ‘shape’ of the estimates (i.e. the distribution of the numbers by age), though not of their level, is to see whether it conforms to the standard shape (or a variation thereof). Rogers and Castro (1981a; 1981b) point out that the distribution of the number (or rate) of in- and out-migrants tends to conform to standard patterns: a peak in the young adult ages (usually associated with seeking employment) and a second, usually less pronounced, peak amongst very young children, falling to a trough amongst young teenagers (the size of the childhood peak depending on the extent to which it is families rather than individuals that are moving among young to middle-aged adults). Sometimes there is also a ‘hump’ (or trough) around retirement age if there is a strong flow of migrants moving to (or away from) the place to retire.
These patterns (not necessarily the same pattern) apply to in- and out-migration flows separately, but not necessarily to net migration (which is the difference between the two flows) unless one flow (either the in-migration or the out-migration) is much greater than the other.
Figure 1 illustrates this using some of the estimates calculated above, expressed as proportions of the total number in each case (to allow them to be presented on a single figure). From this we can see that in broad terms (with the exception in some cases, where the proportion of migrants at the very young ages looks implausibly high) each conforms to the expected shape.
The net out-migration of those born in the Western Cape (excluded from the figure for ease of illustration) does not conform to a standard model of migration, which could indicate that these numbers are not very reliable; however, they are small relative to the in-migration of those born outside the province, and thus such a deviation may be tolerated. In addition, there are two other features to be noted from Figure 1. The first is that the out-migration from the Western Cape, as estimated from data on place of residence at the previous census, suggests that adult out-migrants peak at a somewhat older age (and possibly represent family rather than individual migration). The second is that the net immigration into the country follows the standard shape, which indicates that the flow into the country is much stronger than the return flow of those migrants.
If the census asked both place of birth and place of residence at the previous census, one can compare the two estimates of net in-migration into a specific sub-national region. If they are similar, this gives one some confidence in the results. In the case of South Africa, the place of birth data give a net number of in-migrants into the Western Cape of 232,928 (Table 3), while the data on place of residence at the time of the previous census give an estimate of 91,194 (Table 5), which suggests that one or both of these sets of data are suspect.
The most basic check of the estimates of migration is to project the population (of the country or the province) at the first census to the time of the second census making use of the estimates of the number of migrants and compare that with the census estimates from the second, more recent, census to see how well the two match, especially in the age range in which migration is concentrated. In the case of the net in-migration into the Western Cape, projecting the population forward from 2001 using the estimates derived from the change in the numbers by place of birth produced a much closer fit to the population in the 20-29 year age range, suggesting that the data on place of birth are probably more complete than those on the place of residence at the date of the previous census. To some extent this is supported by a comparison of the change in the number of foreign-born in the country between the two censuses, 222,693 (Table 1) with the sum of the numbers who reported that they had moved from outside South Africa to one of the provinces since the previous census, 129,346 (Table 4).
Ideally, if one had independent estimates of the number of migrants, one might compare those numbers against estimates using the above methods. Unfortunately, reliable independent estimates are rare. Although most countries try to record people entering and leaving the country, these data are often not reliable, particularly in developing countries with relatively porous borders. And unless the country is extremely well regulated and maintains a complete and accurate register of the population, the only other way to measure internal migration is through migration-specific surveys. These tend to be much more useful for understanding the type of migration (whether permanent, temporary, cyclical, etc.) than for producing reliable estimates of the number of migrants, given the often less structured situation in which (particularly recent) migrants find themselves living and an understandable reluctance to identify themselves as migrants.
Considering the numbers of migrants estimated from the data on place of residence at the previous census given in Table 4 (and taking into account the suspicion that these probably underestimate the true migration), some 2-4% of the population changed province of residence in the 5 years between the 2001 Census and the Community Survey. Had we included the number who moved within, but did not change, province then between 7 and 15 per cent of the population moved in the 5‑year period.
The main provinces of destination are Gauteng (by a big margin) and Western Cape, which are predominantly urban and the wealthiest provinces. The main provinces of origin are Gauteng (inspection of the age distribution would show that this is mainly return migration of ‘retiring’ workers), and Eastern Cape and Limpopo, which are poor, mainly rural provinces from which people seeking work migrate to the urban areas.
It appears that migration is predominantly of individuals (seeking work) rather than of families.
A particular feature of the data relying on province of birth is the apparently relatively high number of children born since the first census who have moved to another province. In all likelihood this is an artefact of the data capturing process. Scanning was used to capture the data from the questionnaires, on which Western Cape was coded as a “1”, written in the appropriate space by hand. It appears that in a small percentage of cases the scanner might have had trouble distinguishing a handwritten “1” from a handwritten “7” (the code for Gauteng). The result is, for example, that some children were coded as having been born outside the province in which they were counted, and thus appear to be migrants when they probably were not. Even though the percentage error in scanning is very small, the number of births can be large relative to the number of migrants, and thus the error can produce noticeable distortions. Since an increasing number of developing countries are using scanning to capture data, this sort of problem may be quite common.
Where scanning errors or other situations make it impossible to produce reliable estimates of the number of migrants among those born since the previous census, one can use the child-woman ratio (CWR) from the second census as follows:
for those born in the most recent five years, and
for those born in the five years before that if the censuses are 10 years apart, where CWRx represents ratio of the number of children aged between x and x+5 to the number of women in the population aged between 15+x and 45+x in the population (regional or national) at the time of the second census, and
represents the number of women migrants aged between x and x+30.
Applying this to the data for the Western Cape suggests that the number of migrants born since the previous census should be less than half the numbers being estimated from the data on place of birth.
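For orientation, the child-woman ratio adjustment is conventionally written in the following form for censuses ten years apart. This is a hedged sketch rather than a restatement of the expressions used here: the symbol ${}_{30}M^f_x$ (net women migrants aged x to x+30) and the factors ¼ and ¾ are assumptions, the latter following from supposing that births and mothers' moves are spread evenly over a ten-year interval.

$$ {}_5M_0 = \tfrac{1}{4}\,CWR_0 \cdot {}_{30}M^f_{15}, \qquad {}_5M_5 = \tfrac{3}{4}\,CWR_5 \cdot {}_{30}M^f_{20} $$

Under the same even-spread reasoning, the share of children aged 0-4 who moved with their mothers would be closer to one half when the interval is only five years, so the factor appropriate to a given application should be checked before use.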
The indirect estimation of migration derives from the balance equation for two censuses n years apart, namely:

$$ {}_5N_{x+n}(t+n) = {}_5N_x(t) - {}_5D_x + {}_5I'_x - {}_5O'_x = {}_5N_x(t) - {}_5D_x + {}_5M'_x $$

where 5M’x is the net (i.e. in less out) number of in-migrants, aged x to x+5 at the time of the first census, surviving to the second census, and 5Dx, 5I’x and 5O’x represent the number of deaths, surviving in-migrants and surviving out-migrants, aged x to x+5 at the time of the first census, who died or moved in the period between the censuses.

For those born after the first census the equation becomes:

$$ {}_nN_0(t+n) = B - D_B + M'_B $$

and for those in the open age interval:

$$ {}_\infty N_A(t+n) = {}_\infty N_{A-n}(t) - {}_\infty D_{A-n} + {}_\infty M'_{A-n} $$

where B represents the number of births in the population between the two censuses, DB the number of deaths of those births in the period between the censuses and M’B the net number of surviving migrants born outside the country in the period between the two censuses, ∞DA-n the number of deaths in the intercensal period of those aged A-n and older at the time of the first census, and ∞M’A-n the net number of surviving migrants aged A-n and older at the time of the first census.

Thus

$$ {}_5M'_x = {}_5N_{x+n}(t+n) - {}_5N_x(t) + {}_5D_x $$

or alternatively

$$ {}_5M'_x = {}_5N_{x+n}(t+n) - {}_5S_x\cdot{}_5N_x(t), \qquad M'_B = {}_nN_0(t+n) - S_B\cdot B, \qquad {}_\infty M'_{A-n} = {}_\infty N_A(t+n) - {}_\infty S_{A-n}\cdot{}_\infty N_{A-n}(t) $$

where 5Sx, SB and ∞SA-n represent the proportions of the populations aged x to x+5 at the time of the first census, born between the censuses, and aged A-n and older at the time of the first census, respectively, surviving to the second census.

The net number of migrants can thus be estimated from the net number surviving to the second census as follows:

$$ {}_5M_x = {}_5M'_x\cdot\frac{1}{2}\left(1+\frac{1}{{}_5S_x}\right) $$
Unfortunately, since the net number of migrants is usually small relative to the size of the population, age misstatement or errors in either or both census counts can lead to very poor estimates being produced. Better estimates of the net number of immigrants into a country can be produced by confining one’s attention to the population of foreigners (defined as those born outside the country) and assuming that return migration of emigrants from the country of interest is insignificant. Thus one replaces each of the symbols above by equivalents specific to the foreign-born population in the country. Since it is unlikely that one has an accurate record of the number of the foreign-born deaths these need to be estimated in one of the following ways:
where the superscript “nb” designates native-born.
where the births and deaths are from the vital registration.
However, for most developing countries, particularly those in Africa, vital registration systems are too incomplete to be used in this way.
When it comes to internal migration one can estimate net in-migration (i.e. in-migration of those born outside the region less out-migration of those born outside the region who had previously moved into the region) into each sub-national region of those born outside the region by making use of place of birth information to identify the change in numbers of those born outside the region, in the same way as described above. However, since one also has the place of residence of those born in the region who have moved out of the region since birth (but not emigrated) one can also estimate the net out-migration of those born in the region (i.e. out-migration of those born in the region less those born in the region who have returned after having previously moved out of the region) by applying the method described above to the population born in the region (as opposed to those born outside the region).
When estimating the survival of those born in the various regions the census survival ratios could have an advantage over the life table survival ratios in that any under or over count of the population by region, may well be matched by a similar distortion in the national population and hence in the survival ratios, thus resulting in a more accurate estimate of the number of migrants than would be produced by using life table survival ratios.
Apart from place of birth a census can ask of those who moved since the previous census (or some other suitable date) where they were at that census (or some other suitable date) which allows one to measure out-migration and hence (gross) in-migration separately for each sub-national region.
If the census asks for the year when the migrant moved (or how long the person has been living in the place where counted in the second census) one can get a sense of the timing of migration, and estimate yearly migration rates. This is a complicated process and is not covered here, but the interested reader is referred to the paper by Dorrington and Moultrie (2009).
If age-specific numbers are not available, or the allocation to age is considered to be unreliable, one can still produce estimates by age by estimating the total number of migrants as described below and then apportioning this total to the age groups using either an age distribution for the same population at a different time (since the age distribution of migration flows tends to be consistent over time) or (more likely) an appropriate standard model (Rogers and Castro 1981a; 1981b).
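The apportioning step can be sketched as follows. This is a minimal illustration under stated assumptions: the reference distribution is formed by grouping the surviving in-migrants of Table 5 into three broad age groups, the total of 100,000 migrants is hypothetical, and the function name `apportion_by_age` is illustrative. In practice the reference would be an earlier age distribution for the same population or a fitted model schedule.

```python
def apportion_by_age(total, reference_counts):
    """Split a total number of migrants across age groups.

    reference_counts maps each age group to a count (or rate-weighted
    number) that describes the assumed shape of the age distribution.
    """
    reference_total = sum(reference_counts.values())
    return {
        age: total * count / reference_total
        for age, count in reference_counts.items()
    }


# Reference shape: surviving in-migrants from Table 5, grouped broadly.
reference = {"0-14": 34_117, "15-39": 78_511, "40+": 29_012}

# Hypothetical all-ages total to be distributed.
by_age = apportion_by_age(100_000, reference)

for age, number in by_age.items():
    print(age, round(number))
```

The apportioned numbers always sum to the supplied total, so any error in the total carries through to every age group in proportion.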
where
and ∞m0 is an estimate of the crude mortality rate of the population in the country of the census.
The primary limitation of using censuses to estimate immigration and net in-migration is the quality of the censuses, in particular the extent of undercount, both in general and, more significantly, in one census relative to the other. However, even if the census undercount is low, the census might not identify all the migrants. In general, recent migrants are often difficult to include in a census because they have yet to settle. More specifically, immigrants may not be keen to identify themselves as immigrants and either avoid being counted or do not admit to being foreign-born.
Apart from this, place of birth and/or place of residence at previous census, in the case of internal migrants, might be misreported due to boundary changes or ignorance (or even bias) on the part of the respondent.
The third drawback of census data is that it cannot be used to measure emigration from the country of the census. Emigration is particularly difficult to estimate for most countries, but one option is to apply the method for identifying net immigration of the foreigners described above to the censuses of the main countries of destination to which the emigrants move to estimate the change in the numbers of emigrants to those countries. Of course, this is only useful if the censuses of these countries identify the numbers of foreign-born by their countries of birth reasonably accurately.
Generally, statistics on immigrants and particularly emigrants that are collected at border posts provide quite poor estimates of the true numbers, unless the borders of the country are quite impenetrable and there are a few well-controlled ports of entry. Even then there may still be many ‘visitors’ who end up living in the country.
A final drawback occurs when working with data aggregated over all ages. In these cases one usually has to make use of the crude death rate for the population of the country of the census in order to estimate the number of deaths of the migrant population. However, since the distribution of the migrant population by age can differ from that of the population of the country of the census quite markedly, the estimated number of deaths can be quite inaccurate.
Some censuses ask additional questions which can be of use in interpreting the patterns of migration, if not in improving the estimate of the level of migration. The most common of these is probably a question about when the migrant moved. These data allow one to estimate annual rates of migration; however, it is possible that respondents tend to report moves as occurring more recently than was actually the case (Dorrington and Moultrie 2009).
Where a census asks those who moved since the previous census where they moved from most recently and when they moved, rather than where they were at the time of the previous census (as the recent censuses in South Africa have done), it is possible to back-project the numbers of migrants by applying annual rates of migration between sub-national regions to estimate the number by place at the time of the previous census (Dorrington and Moultrie 2009). However, in the case of South Africa, at least, it appears that the assumption that most migrants moved only once in the past five years, and thus that the place of residence before the most recent move is the same as the place at the time of the previous census, is quite reasonable (Dorrington and Moultrie 2009).
Where one has data on both the sub-national region of birth and the place at the time of the previous census, one can cross-tabulate the place of residence data by the place of birth and thus be able to classify recent migrants into primary, secondary and return migrants.
For general background to the topic of migration, definition of terms and detail on the analysis and interpretation of the data on internal migration, the interested reader is referred to the excellent UN manual on the topic, Manual VI (UN Population Division 1970). The textbook by Shryock and Siegel (1976) and its modern replacement by Siegel and Swanson (2004) also provide an introduction to the topic of migration and cover, in particular, the estimation of international migration.
Those interested in the estimation of annual migration rates and the back-projection of migration to estimate the numbers by place of residence at the time of the previous census from data on place of residence before the most recent move and year of move are referred to the paper by Dorrington and Moultrie (2009).
Dorrington RE and TA Moultrie. 2009. "Making use of the consistency of patterns to estimate age-specific rates of interprovincial migration in South Africa," Paper presented at Annual conference of the Population Association of America. Detroit, US, 30 April - 2 May.
Rogers A and LJ Castro. 1981a. "Age patterns of migration: Cause-specific profiles," in Rogers, A (ed). Advances in Multiregional Demography (RR-81-006). Laxenburg, Austria: International Institute for Applied Systems Analysis, pp. 125-159. http://webarchive.iiasa.ac.at/Admin/PUB/Documents/RR-81-006.pdf
Rogers A and LJ Castro. 1981b. Model Migration Schedules (RR-81-030). Laxenburg, Austria: International Institute for Applied Systems Analysis. http://webarchive.iiasa.ac.at/Admin/PUB/Documents/RR-81-030.pdf
Shryock HS and JS Siegel. 1976. The Methods and Materials of Demography (Condensed Edition). San Diego: Academic Press.
Siegel JS and D Swanson. 2004. The Methods and Materials of Demography. Amsterdam: Elsevier.
Timæus IM. 2004. "Impact of HIV on mortality in Southern Africa: Evidence from demographic surveillance," Paper presented at Seminar of the IUSSP Committee "Emerging Health Threats" HIV, Resurgent Infections and Population Change in Africa. Ouagadougou, 12-14 February.
UN Population Division. 1970. Manual VI: Methods of Measuring Internal Migration. New York: United Nations, Department of Economic and Social Affairs, ST/SOA/Series A/47. http://www.un.org/esa/population/techcoop/IntMig/manual6/manual6.html
This section describes how to fit a multi-exponential model migration schedule to observed migration data.
Over the last thirty years, these schedules, devised by Rogers and Castro (1981), have been remarkably successful in representing typical age patterns of migration. Essentially the same age patterns of migration have been observed whether national and interregional migrations are considered simultaneously, or migration from a specific region is considered in isolation. The multi-exponential function was designed to reflect the dependency between migration and age, and captures the relationship through an additive sequence of exponential curves, based on 7, 9, 11 or 13 parameters, depending on the complexity of the migration patterns and the ability and robustness of the data to sustain increased parameterization.
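The simplest (7-parameter) form of the schedule can be sketched as below: a declining exponential for the childhood ages, a double-exponential labour-force component, and a constant. The parameter names (`a1`, `alpha1`, `a2`, `mu2`, `alpha2`, `lambda2`, `c`) follow common usage for the Rogers-Castro model, and the parameter values are purely illustrative assumptions rather than fitted results; the fuller variants add a retirement component and, where needed, an upward slope at the oldest ages.

```python
import math


def rogers_castro_7(x, a1, alpha1, a2, mu2, alpha2, lambda2, c):
    """Seven-parameter model migration rate at exact age x."""
    # Pre-labour-force (childhood) component: declines with age.
    childhood = a1 * math.exp(-alpha1 * x)
    # Labour-force component: rises steeply to a peak, then declines.
    labour = a2 * math.exp(
        -alpha2 * (x - mu2) - math.exp(-lambda2 * (x - mu2))
    )
    # Constant level applying at all ages.
    return childhood + labour + c


# Illustrative (not fitted) parameter values.
params = dict(a1=0.02, alpha1=0.10, a2=0.06, mu2=20.0,
              alpha2=0.10, lambda2=0.40, c=0.003)

rates = [rogers_castro_7(age, **params) for age in range(0, 90)]

peak_age = max(range(10, 60), key=lambda age: rates[age])
trough_age = min(range(5, 20), key=lambda age: rates[age])
print(peak_age, trough_age)
```

With these values the schedule shows the expected features: relatively high rates in infancy, a trough in the mid-teens and a peak in the early twenties. Fitting the parameters to observed rates requires a non-linear optimisation routine, which is not shown here.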
When fitted to a schedule of single-year-of-age migration rates, the Rogers-Castro model provides a best-fit, graduated expression of the migration schedule that finds application in smoothing an observed series of migration rates, and which can be used directly to enhance understanding of migration dynamics. The results can also find application in a number of alternative uses, for example, in setting migration schedules to be used in multi-regional population projections. Ideally, the analyst will have estimates of migration by single year and single ages to which the Rogers-Castro model can be fitted. However, if – as is often the case in developing countries where the quality of the underlying data may not permit such finely grained calculations – the data are only available in five-year age groups, then single-year age rates need to be interpolated from the data using one of the methods described in this chapter before attempting to fit a Rogers-Castro model.
Ideally the data should be in the form of rates by single ages. Where they are in five-year age groups then single year observations must be interpolated from these five-year estimates before attempting to fit a multi-exponential curve. The choice of the upper age is somewhat arbitrary, but the upper bound of the data used in fitting a model schedule should – at the minimum – be greater than the modal age of retirement.
The latest census accurately counts the population by sub-national region and place of birth, and identifies those who have moved from one region to another since a prior date (e.g. the previous census).
Before applying this method, you should investigate the quality of the data in at least the following dimensions:
Caution should be exercised in applying the method to net migration data, as the multi-exponential distribution of migration rates by age models gross migration flows (i.e. in- or out-migration) but not necessarily net migration, unless the flow in one direction significantly dominates the flow in the other at all ages.
The multi-exponential function was designed by Rogers and Castro (1981) to reflect the dependency between migration and age. Migration is usually high in the first year of life and drops to a low point during the early teenage years. It then increases sharply to its highest point during the young adult years, after which it declines, except for a possible increase and subsequent decrease around the ages of retirement. In some circumstances there may be an upward slope at the oldest ages (Rogers and Castro 1981; Rogers and Watkins 1987).
Over the last thirty years, the schedule (also known as the Rogers-Castro model migration schedule) has proven to be remarkably successful in representing age patterns of migration (Little and Rogers 2007; Raymer and Rogers 2008; Rogers and Castro 1981, 1986; Rogers and Little 1994; Rogers, Little and Raymer 2010; Rogers and Raymer 1999; Rogers and Watkins 1987). These same age patterns of migration have been documented for regions of different sizes and for ethnic and gender sub-populations (Rogers and Castro 1981). They appear whether national interregional migrations are considered simultaneously, or migration from a specific region is considered separately. Directional migration (i.e. from region i to region j) exhibits the same patterns as well. For example, the Rogers-Castro model migration schedule has been fitted successfully to migration flows between local authorities in England (Bates and Bracken 1982, 1987), Canada’s metropolitan and non-metropolitan areas (Liaw and Nagnur 1985), the regions of Japan, Korea, and Thailand (Kawabe 1990), and the national patterns of South Africa and Poland (Hofmeyr 1988; Potrykowska 1988).
When fitted to a schedule of single-year-of-age migration rates, the Rogers-Castro model provides a best-fit, graduated expression of the migration schedule that can be summarized by 7, 9, 11 or 13 parameters depending on the complexity of the schedule and strength of the data. In addition, the erratic fluctuations, often associated with unreliability in observed age-specific rates, are smoothed.
Rogers-Castro model migration schedules have been used in population projections in Canada (George 1994), and they have been imposed on time periods, regions, and subpopulations (Rogers, Little and Raymer 2010) when migration data were inadequate or unavailable.
The full model schedule has 13 parameters and is the most complex multi-exponential form of the model. If M(x) is defined as the migration rate for a single year of age x, the full model is defined as
$$M(x) = a_1 e^{-\alpha_1 x} + a_2 \exp\left\{-\alpha_2 (x-\mu_2) - e^{-\lambda_2 (x-\mu_2)}\right\} + a_3 \exp\left\{-\alpha_3 (x-\mu_3) - e^{-\lambda_3 (x-\mu_3)}\right\} + a_4 e^{\lambda_4 x} + c$$
It comprises five additive components. The first component, $a_1 e^{-\alpha_1 x}$, is a single negative exponential curve representing the migration pattern of the pre-labour force ages. The second component, $a_2 \exp\{-\alpha_2 (x-\mu_2) - e^{-\lambda_2 (x-\mu_2)}\}$, is a left-skewed unimodal curve describing the age pattern of migration of people of working age. The third component, $a_3 \exp\{-\alpha_3 (x-\mu_3) - e^{-\lambda_3 (x-\mu_3)}\}$, is an almost bell-shaped curve representing the age pattern of migration post-retirement, where migration increases sharply following retirement before falling off again. Associated with this component, the fourth component is a single positive exponential curve of the post-retirement ages, $a_4 e^{\lambda_4 x}$, reflecting the (sometimes) observed generalised increase in migration post-retirement. This can be seen, for example, in the migration of the elderly in the US from the North-East to the “sunbelt” states in the South East and South West. The final component is a constant term, c, that represents ‘background’ migration.
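The five additive components can be written as a short function. The sketch below is illustrative only: it assumes the usual Rogers-Castro functional form, the function name `rogers_castro` is hypothetical, and the parameter values are invented to give a plausible standard (7-parameter) shape rather than taken from any fitted schedule.

```python
import math

def rogers_castro(x, a1, alpha1, a2, alpha2, mu2, lambda2,
                  a3=0.0, alpha3=0.0, mu3=0.0, lambda3=0.0,
                  a4=0.0, lambda4=0.0, c=0.0):
    """Migration rate M(x) at age x under the multi-exponential model."""
    # Component 1: negative exponential for the pre-labour force ages.
    pre_labour = a1 * math.exp(-alpha1 * x)
    # Component 2: unimodal curve for the working ages.
    labour = a2 * math.exp(-alpha2 * (x - mu2)
                           - math.exp(-lambda2 * (x - mu2)))
    # Component 3: retirement peak (drops out when a3 = 0).
    retirement = a3 * math.exp(-alpha3 * (x - mu3)
                               - math.exp(-lambda3 * (x - mu3)))
    # Component 4: post-retirement upslope (drops out when a4 = 0).
    upslope = a4 * math.exp(lambda4 * x)
    # Component 5: constant background migration.
    return pre_labour + labour + retirement + upslope + c

# Hypothetical parameter values for a standard 7-parameter schedule.
standard = dict(a1=0.02, alpha1=0.1, a2=0.06, alpha2=0.1,
                mu2=20.0, lambda2=0.4, c=0.003)
schedule = [rogers_castro(age, **standard) for age in range(0, 81)]
peak_age = max(range(len(schedule)), key=schedule.__getitem__)
```

Leaving a3 and a4 at zero gives the reduced forms of the model, so the same function can serve for the 7-, 9-, 11- and 13-parameter versions.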
Four families of multi-exponential schedules have been identified in past studies (Rogers, Little and Raymer 2010), and only one, exhibiting both a retirement peak and a post-retirement upslope, requires all 13 parameters and all five components. This family is documented in studies of elderly migration (Rogers and Watkins 1987), and is demonstrated in the bottom right panel of Figure 1.
Source: Based on Raymer and Rogers (2008)
Note: The legend indexes, in order, (1) the pre-labour force migration schedule; (2) the working age migration schedule; (3) the post-retirement migratory increase and decrease; and (4) the generalised increase in post-retirement migration.
The other families are reduced forms of the full model, which means that at least one component is omitted. For example, the most common schedule identified by Rogers, Little and Raymer (2010) requires seven parameters and consists of the first two components and the constant term. This is also called the standard schedule, and its shape is set out in the top left panel of Figure 1.
A number of schedules have exhibited a standard profile plus a retirement peak (Rogers and Castro 1981, 1986), resulting in the 11-parameter model, including components 1, 2, 3 and 5, shown in the bottom left panel of Figure 1. In populations with significant migrant labour, particularly in the developing world, it is possible that the third component is a trough rather than a peak, as migrants return home to retire.
The 9-parameter model is used when the standard pattern is visible for the labour and pre-labour force ages, and there is an upslope to represent migration in the post-retirement years as displayed in the top right panel of Figure 1. This was found in several regions of the Netherlands in 1974 by Rogers and Castro (1981).
As should be evident from the discussion above, all parameters are interpretable and can be used to characterize the model schedule.
In their original 11-parameter specification of the multi-exponential migration model, Rogers and Castro (1981) illustrated the model using data on male out-migration rates from Stockholm in 1974. Figure 2 shows the original data (the jagged lines) and the smoothed 11-parameter schedule fitted to the original data.
Five of the 11 parameters (α1, α2, α3, λ2 and λ3) give rates of change for different pieces of the model schedule while the level parameters (a1, a2, a3 and c) correspond to the heights of the model schedule. a1 gives the peak in the first year of life, a2 is the peak of labour force migration, a3 is the peak of retirement migration, and c gives the background migration rate. μ2 and μ3 give the ages at the labour force peak and at the retirement peak, respectively.
Source: Rogers and Castro (1981). Permission to reproduce this figure granted by the International Institute for Applied Systems Analysis (IIASA).
Some measures can be used to describe either the observed or the model migration schedule. For example, xl is the pre-labour force age when migration is at its low point, xh is the age when labour force migration peaks, and xr is the age of peak retirement migration. The difference between xl and xh is called the ‘labour force shift’, X, and the increase in migration rate between xl and xh is called the ‘jump’, B. A, the ‘parental shift’, is used to describe the average age difference between parent migration and the corresponding migration of children. The gross ‘migraproduction’ rate (GMR) is the sum of all rates over all ages (i.e. the area under the curve), and it is used to gauge the total level of migration out of a region or the total directional migration, i.e., from region i to region j (Rogers and Castro 1981).
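Several of these summary measures can be read directly off a single-year schedule. The sketch below is a minimal illustration: the function name, the age windows used to search for the low point and the labour force peak, and the piecewise-linear schedule are all assumptions made for the example. The parental shift A and the retirement peak age xr are not computed here.

```python
def schedule_summary(rates, max_low_age=20, max_high_age=50):
    """Summary measures for a schedule where rates[x] is the rate at age x."""
    # x_l: age of the pre-labour force low point (search window is assumed).
    x_l = min(range(0, max_low_age + 1), key=lambda x: rates[x])
    # x_h: age of the labour force peak, searched from x_l to max_high_age.
    x_h = max(range(x_l, max_high_age + 1), key=lambda x: rates[x])
    return {
        "x_l": x_l,
        "x_h": x_h,
        "X": x_h - x_l,                # labour force shift
        "B": rates[x_h] - rates[x_l],  # jump
        "GMR": sum(rates),             # gross migraproduction rate
    }

# Hypothetical schedule for ages 0 to 80: decline to age 14,
# rise to a peak at age 23, then decline to a floor of 0.005.
rates = ([0.030 - 0.0015 * x for x in range(0, 15)]
         + [0.009 + 0.004 * (x - 14) for x in range(15, 24)]
         + [max(0.005, 0.045 - 0.002 * (x - 23)) for x in range(24, 81)])
summary = schedule_summary(rates)
```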
The method is applied in the following steps.
The initial step in estimating a model schedule is to prepare the data. Decisions about which measure of migration to use depend upon the data sources available (registry, census, or survey) and the purpose of the research. For example, in a comparative study of migration patterns, any of the measures would be appropriate as long as they are constructed similarly across contexts. If, on the other hand, the model schedules are to be used in single-year population projections, the fitted schedule should represent single-age, single-year migration rates. However, where one does not have single-year single-age observations that progress relatively smoothly by age, one must first convert the available data into single-year single-age estimates. A number of commonly-encountered situations are described below.
When the numbers of migrants who survived a five-year migration interval are available from census data which also give the year of most recent move, single-year, single-age migration rates can be derived through a conceptually simple, yet algebraically complex, back-projecting procedure outlined by Dorrington and Moultrie (2009). Their method compensates for the effect of mortality by applying the mortality regime of the general population to the migrants and for the effect of interregional migration by applying the annual rates of migration for the most recent year to estimate the population by region one year prior to the census and using that to estimate the migration rates two years before the census, and using that to estimate the population two years before the census, etc. It requires additional region-of-birth information for those aged 0-4 at time of census, as well as single-age, yearly estimates of regional populations. Schedules derived in this manner can then be fitted and smoothed with a Rogers-Castro model schedule, and used in single-year population projections.
Regardless of the migration time interval, whether using census data or population register data, five-year age groupings generally give more reliable estimates of migration propensities than one-year age categories (Rogers, Little and Raymer 2010). In addition, counts of migrants in one-year age categories are typically only available from sample data, since national population bureaus tend to publish counts of interregional migrants in five-year age categories.
To apply the multi-exponential model when the initial migration proportions are in five-year age categories requires some method of converting the five-year rates to one-year rates. Cubic-spline interpolation (McNeil, Trussell and Turner 1977) is one such method that produces a smooth schedule for all integer values of ages. Rogers and Castro (1981) used data from Sweden, which were available as both one-year and five-year age rates, to test the accuracy of the cubic-spline method, and found generally satisfactory results.
To arrive at smooth one-year age migration profiles, the initial migration proportions for the five-year age categories are assigned to ages close to the middle of each five-year interval, i.e., ages 2, 7, 12, 17, …, 72, 77 (or 2.5, 7.5, 12.5, …, etc., if estimating rates rather than probabilities). From this set of points, a continuous age profile of out-migration propensities is generated with cubic-spline interpolation, which constructs third-order polynomials that pass through the set of pre-defined control points (called nodes). Commercial or freeware add-ins for Microsoft Excel, such as XlXtrFun, can be used to implement cubic-spline interpolation.
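The node assignment and interpolation just described can be sketched without any add-in. The example below uses a natural cubic spline (second derivative set to zero at the two end nodes), which is an assumption: the end conditions used by McNeil, Trussell and Turner (1977) or by a particular Excel add-in may differ. The five-year propensities are hypothetical, and ages outside the range of the nodes are deliberately not extrapolated.

```python
import bisect

def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys)."""
    n = len(xs)
    h = [xs[i + 1] - xs[i] for i in range(n - 1)]
    m = [0.0] * n          # second derivatives at the nodes; ends stay 0
    k = n - 2              # number of interior nodes
    if k > 0:
        diag = [2.0 * (h[j] + h[j + 1]) for j in range(k)]
        rhs = [6.0 * ((ys[j + 2] - ys[j + 1]) / h[j + 1]
                      - (ys[j + 1] - ys[j]) / h[j]) for j in range(k)]
        # Forward elimination of the tridiagonal system.
        for j in range(1, k):
            w = h[j] / diag[j - 1]
            diag[j] -= w * h[j]
            rhs[j] -= w * rhs[j - 1]
        # Back substitution.
        m[k] = rhs[k - 1] / diag[k - 1]
        for j in range(k - 2, -1, -1):
            m[j + 1] = (rhs[j] - h[j + 1] * m[j + 2]) / diag[j]

    def evaluate(x):
        if not xs[0] <= x <= xs[-1]:
            raise ValueError("x lies outside the range of the nodes")
        i = min(bisect.bisect_right(xs, x) - 1, n - 2)
        left, right = xs[i + 1] - x, x - xs[i]
        return ((m[i] * left ** 3 + m[i + 1] * right ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * left
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * right)

    return evaluate

# Hypothetical five-year propensities for age groups 0-4, 5-9, ..., 75-79.
five_year = [0.110, 0.075, 0.060, 0.105, 0.180, 0.165, 0.130, 0.100,
             0.080, 0.065, 0.055, 0.050, 0.048, 0.045, 0.040, 0.038]
nodes = [2 + 5 * g for g in range(len(five_year))]   # 2, 7, 12, ..., 77
spline = natural_cubic_spline(nodes, five_year)
single_year = {age: spline(age) for age in range(nodes[0], nodes[-1] + 1)}
```

The interpolated values run only from age 2 to age 77; the youngest and oldest ages need separate treatment.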
An alternative approach is to adapt Beers’ 6-parameter interpolation procedure (Beers 1945) to interpolate rates between the rates for the youngest and oldest age groups, which also extrapolates the rates to ages 0 and 1 (or 0.5 and 1.5 if working with rates). The extrapolation to the youngest ages is achieved by assuming that the difference between propensities for age 1 and 2 is the same as that between ages 2 and 3, and that between ages 0 and 1 is the same as that between ages 3 and 4.
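The extrapolation rule for the youngest ages amounts to two lines of arithmetic. The sketch below illustrates only that rule, not Beers' interpolation coefficients themselves; the function name and the input values are hypothetical, and the inputs are assumed to be already-interpolated propensities at ages 2, 3 and 4.

```python
def extend_to_youngest_ages(p2, p3, p4):
    """Extrapolate propensities at ages 1 and 0 from those at ages 2, 3 and 4."""
    # Difference between ages 1 and 2 equals that between ages 2 and 3.
    p1 = p2 + (p2 - p3)
    # Difference between ages 0 and 1 equals that between ages 3 and 4.
    p0 = p1 + (p3 - p4)
    return p0, p1

# Hypothetical interpolated propensities at ages 2, 3 and 4.
p0, p1 = extend_to_youngest_ages(0.020, 0.018, 0.017)
```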
Thus, to apply either approach one needs a set of migration rates in five-year age intervals from 0-4, to at least 65-69.
Once the observed schedule is prepared, a decision must be made about the form of the multi-exponential model to be adopted. The overview of the multi-exponential model migration schedule presented above described the characteristics of the 7-, 9-, 11-, and 13-parameter models. This decision should be informed by a visual inspection of the schedule, keeping in mind that the model is assumed to represent the true form of the population migration schedule. Sometimes, even after plotting the schedule, it is not apparent how best to model the retirement years and the oldest ages. For example, it may appear that either a standard 7-parameter model or a 9-parameter model (increasing migration in the oldest ages) would be appropriate. In this situation, the decision in favour of the 9-parameter model could be based on a theoretical expectation for increasing migration in the later years. On the other hand, the 9-parameter model form might be rejected, based on the goodness-of-fit measures, as being insufficiently parsimonious if it produces no better fit than the 7-parameter model. In deciding which form of the model to use, it is recommended that the goodness-of-fit of the simpler model be compared with that of the more complex model (e.g. comparing the fit of a 7-parameter model with that of an 11-parameter model). As a general rule, and always bearing in mind the likely robustness of the underlying data, substantial improvement in fit is required to justify a more complex specification.
For most developing countries, particularly where ‘retirement’ isn’t concentrated between the ages of 60 and 65 and there is age exaggeration at the older ages, the data are probably not strong enough to fit anything more than the 7-parameter version of the model.
Given the number of parameters (between 7 and 13) in the multi-exponential model migration schedule, determining a best fit ab initio using trial-and-error is not recommended. Instead, analytical algorithms have to be employed. The one described below uses an algorithm that is provided in Microsoft Excel, Solver. Solver may not be routinely loaded by standard installations of Microsoft Excel. To enable its use, proceed by selecting “File → Options → Add-ins → Manage Excel Add-ins → Go …” and then ensuring that the “Solver Add-in” is ticked.
The specifications of the Solver function, and the conditions and constraints that should be adhered to, have been set up in the workbook associated with the methods presented in this chapter. To run the routine on a given worksheet, select “Data → Solver → Solve”.
The model is fitted in the associated workbook and is set up to allow the user to set the “objective” to be minimized to be either the sum of squared differences between the observed rates and the fitted rates, or the chi-squared statistic.
The default Solver is set up to fit using all parameters. If one wants to fit a curve using only some of the parameters then one must specify only these parameters in the “By Changing Variable Cells” window, and set the other parameters to appropriate constant values (which may, or may not, be zero depending on the requirements of the fitting procedure). An instance where such constrained optimisation may be required is mentioned below.
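Outside Excel, the same idea of changing only some parameters can be expressed by building the objective over the free parameters and holding the rest constant. The sketch below is an illustration under stated assumptions: it uses the usual Rogers-Castro functional form and the sum of squared differences as the objective, the names `model` and `make_objective` are hypothetical, and the "observed" rates are generated from invented parameter values rather than real data.

```python
import math

def model(x, p):
    """Full 13-parameter schedule; p is a dict of parameter values."""
    return (p["a1"] * math.exp(-p["alpha1"] * x)
            + p["a2"] * math.exp(-p["alpha2"] * (x - p["mu2"])
                                 - math.exp(-p["lambda2"] * (x - p["mu2"])))
            + p["a3"] * math.exp(-p["alpha3"] * (x - p["mu3"])
                                 - math.exp(-p["lambda3"] * (x - p["mu3"])))
            + p["a4"] * math.exp(p["lambda4"] * x)
            + p["c"])

def make_objective(ages, observed, fixed, free_names):
    """Return a function of the free parameter values only."""
    def objective(free_values):
        p = dict(fixed)                          # constants stay untouched
        p.update(zip(free_names, free_values))   # free parameters vary
        return sum((o - model(x, p)) ** 2 for x, o in zip(ages, observed))
    return objective

# Hold the retirement peak and the upslope at zero: a 7-parameter fit.
fixed = dict(a3=0.0, alpha3=0.0, mu3=0.0, lambda3=0.0, a4=0.0, lambda4=0.0)
free_names = ["a1", "alpha1", "a2", "alpha2", "mu2", "lambda2", "c"]
true_values = [0.02, 0.1, 0.06, 0.1, 20.0, 0.4, 0.003]   # hypothetical
ages = list(range(0, 81))
observed = [model(x, dict(fixed, **dict(zip(free_names, true_values))))
            for x in ages]
objective = make_objective(ages, observed, fixed, free_names)
```

The resulting `objective` can be handed to any minimiser; the staged approach of fixing seven fitted parameters and then estimating the rest only requires moving names between `fixed` and `free_names`.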
The sum of squared differences is calculated as follows:
$$SS = \sum_{i=1}^{n} (O_i - F_i)^2$$
where $O_i$ represents the observed rate at age i, $F_i$ represents the fitted value at age i and n the number of age groups.
The chi-squared statistic is calculated as follows:
$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - F_i)^2}{F_i}$$
The chi-squared statistic is more sensitive to misfitting to age ranges where rates are lower (resulting in a proportionately larger error) and thus is a better metric to assess goodness-of-fit when trying to fit the ‘retirement hump’ (the third component).
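The different sensitivity of the two objectives can be seen in a small numerical sketch. It assumes the chi-squared statistic takes its usual form, with the squared difference divided by the fitted value; the rates are hypothetical and the function names are illustrative.

```python
def sum_squared_differences(observed, fitted):
    """Sum over ages of (O_i - F_i)^2."""
    return sum((o - f) ** 2 for o, f in zip(observed, fitted))

def chi_squared(observed, fitted):
    """Sum over ages of (O_i - F_i)^2 / F_i (assumed form of the statistic)."""
    return sum((o - f) ** 2 / f for o, f in zip(observed, fitted))

# Hypothetical rates: one high (labour force peak), one low (older ages).
observed = [0.050, 0.002]
miss_at_peak = [0.051, 0.002]   # error of 0.001 where the rate is high
miss_at_low = [0.050, 0.003]    # same error of 0.001 where the rate is low

ss_peak = sum_squared_differences(observed, miss_at_peak)
ss_low = sum_squared_differences(observed, miss_at_low)
chi_peak = chi_squared(observed, miss_at_peak)
chi_low = chi_squared(observed, miss_at_low)
```

The two misfits give essentially the same sum of squared differences, but the chi-squared statistic penalises the misfit at the low rate far more heavily.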
The choice of initial parameter values is the principal difficulty in non-linear parameter estimation. Ideally, given a set of starting values, the algorithm proceeds through an iterative process, producing a revised set of “optimum” values. However, the optimum may be merely a local optimum, and not the global optimum. A better guess of the initial parameter values may produce an improved goodness-of-fit and produce a different set of final values. A poorer choice of initial parameters may prevent convergence to even a local optimum.
Bearing this in mind, the most effective method of ensuring that the results from a fitting procedure are indeed globally “optimal” is to choose parameter values previously reported for a “similar” curve. To this end one might start with the values already in the workbook which were used to fit the curves in the examples below.
Convergence may be more difficult to achieve with the 11- and 13-parameter models. Where such heavily parameterised models are justified, one approach that can be adopted is to first fit a standard 7-parameter model to the data (thereby securing the fit at the peak of the migration schedule, and at ages up to mid-adulthood). Then, one could proceed by fixing those 7 parameters to their estimates that resulted from the initial step (i.e. treat those parameters as constant from there on), and then estimate the remaining parameters. Another effective procedure is to carry out a linear estimation method first, which does not rely on an iterative algorithm. That method was first described in Rogers and Castro (1981) and later included as one of the several alternatives set out in Rogers, Castro and Lea (2005).
Another challenge in finding the optimum solution lies in choosing an appropriate stopping criterion for the iterative algorithm. As the iteration process converges on a solution, the chi-squared statistic, which measures the differences between the observed and the estimated values, decreases. An indication that an acceptable solution has been found is when the chi-squared value decreases by only a negligible amount from one iteration to the next. The level of this small difference is called the “tolerance” and is set by the user. The temptation is to set it to be a very small value, i.e. very close to zero, so that a true minimum chi-squared value is achieved. However, the risk in this approach is that such a low tolerance may not be achievable, even when a solution has been found. Press, Flannery, Teukolsky et al. (1986) suggest that a tolerance of 0.001 is a reasonable setting. If the estimation software fails to converge, the convergence criterion can be made less stringent (i.e. the tolerance increased), or new initial estimates can be tried.
One trial-and-error method of choosing initial estimates makes use of the graphs in the accompanying Excel workbook. By substituting your schedule of observed data in one of the sheets, initial “guesses” of each parameter can be chosen and placed in the cells where the final estimates of each parameter are located. Then, by visual examination of the fit, and identification of the parameter values that are most out of line, try new initial values for those parameters and then re-evaluate the fit visually. Continue this way until the fitted schedule is reasonably close to the observed schedule. At this point, you will know you have reasonably good initial estimates and may proceed to the nonlinear least squares estimation procedure.
We evaluate the model fit by calculating the mean absolute percent error (MAPE) statistic:
$$\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \frac{|F_i - O_i|}{O_i}$$
The MAPE is prone to overstate inaccuracy, particularly when the observed schedule has many values that are very close to zero (Morrison, Bryan and Swanson 2004).
In addition to MAPE, we also calculate R2, the square of the correlation between the Oi and the Fi values. A heuristic that is often employed is that a reasonable fit is achieved with a MAPE of 15 per cent or less together with an R2 well above 90 per cent.
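Both fit statistics are simple to compute once observed and fitted values are available. The sketch below assumes the MAPE is the mean of the absolute differences relative to the observed values, expressed as a percentage; the function names and the four pairs of rates are hypothetical.

```python
def mape(observed, fitted):
    """Mean absolute percent error, relative to the observed values."""
    n = len(observed)
    return 100.0 * sum(abs(f - o) / o for o, f in zip(observed, fitted)) / n

def r_squared(observed, fitted):
    """Square of the correlation between observed and fitted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_f = sum(fitted) / n
    cov = sum((o - mean_o) * (f - mean_f) for o, f in zip(observed, fitted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_f = sum((f - mean_f) ** 2 for f in fitted)
    return cov ** 2 / (var_o * var_f)

def reasonable_fit(observed, fitted, max_mape=15.0, min_r2=0.90):
    """Apply the heuristic: MAPE of 15 per cent or less and R2 above 90 per cent."""
    return (mape(observed, fitted) <= max_mape
            and r_squared(observed, fitted) > min_r2)

# Hypothetical observed and fitted rates at four ages.
observed = [0.020, 0.010, 0.040, 0.030]
fitted = [0.022, 0.009, 0.038, 0.033]
fit_mape = mape(observed, fitted)
fit_r2 = r_squared(observed, fitted)
```

Because each term is divided by the observed value, a single age with an observed rate near zero can dominate the MAPE.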
In addition, since the method assumes the estimated Rogers-Castro model schedule represents the true form of the migration schedule, the estimated model schedule should appear to represent the underlying pattern of the observed data.
If the goal is to describe the pattern of migration and a multi-exponential model has been successfully fitted to the data, any of the summary measures (e.g. GMR, X, B, and A) as well as the parameter estimates can be used to describe the schedule. The summary measures and the parameter interpretations are given in the Overview.
In the examples below, multi-exponential model migration schedules are applied to a variety of data, of varying quality and complexity and from a number of different sources. All worked examples are provided in the associated workbook on the Tools for Demographic Estimation website.
Because iterative methods are required to fit a multi-exponential model migration schedule to observed migration data, detailed worked examples are not provided in the text. The reader is directed to the description provided in the previous section on how to use Solver in Microsoft Excel to determine optimal fits. The workbook is set up to use Solver to derive the results presented.
An example of a schedule based on one-year age migration propensities measured over a one-year migration interval from census data is shown in Figure 3. The data are derived from the 2005 American Community Survey (ACS), a national survey conducted annually by the U.S. Census Bureau. Even for California, a highly populated state, the one-year age propensities over a one-year interval are quite unstable. The MAPE is 17 per cent and the R2 is 0.92.
Caution must be exercised when using one-year age propensities over one-year migration intervals. For each single age, the numbers at risk of migrating, as well as the numbers of migrants, may be small, resulting in propensities that are erratic and unstable. A better option may be to derive five-year age propensities, which have proven to be more reliable than one-year age propensities (Rogers, Little and Raymer 2010). These can be interpolated to yield one-year age propensities using cubic splines or Beers’ formula as discussed in the section describing the application of the method.
Figure 4 shows an example using census data for the state of New Hampshire. The US Census Bureau’s 1 per cent Public Use Microdata Sample (PUMS) is a relatively small sample taken from the census and New Hampshire is one of the least populated states. The one-year age propensities appear to be quite unstable with dramatic fluctuations, while the model schedule provides a smooth estimate of the true schedule form. The MAPE is 52 per cent and the R2 is 0.68.
Figure 5 shows the cubic-spline interpolation method applied to the five-year age migration propensities for New Hampshire, derived from the 2000 Census 1 per cent public use microsample data. The schedule interpolated from the five-year age rates is much smoother and provides more reliable estimates than the observed one-year age rates displayed in Figure 4, and thus is a better set of estimates against which to compare the fitted multi-exponential curve. The MAPE was reduced from 52 per cent for the one-year age propensities to 15 per cent for the rates interpolated from the five-year age proportions, and the R2 increased from 0.68 to 0.94.
There are several reasons why the levels of the New Hampshire schedule, in Figure 5, are substantially higher than those of the California schedule in Figure 3. The California example gives migration over a one-year interval, whereas the New Hampshire schedule covers a five-year interval. In addition, New Hampshire is a much smaller region than California, and the force of migration is expected to be more powerful in a geographically smaller region.
It is important to check visually if the age-specific migration rates have a ‘shape’ that is compatible with the Rogers-Castro models. If this is not the case then it is unlikely that these models will provide a satisfactory fit. Likewise, it is worthwhile checking whether there are any extreme values, particularly at older ages, which might distort the choice of parameters or even the choice of the number of parameters to be fitted. If the observed estimates are particularly noisy, it would be better to group the data into five-year age intervals and then estimate a smoothed distribution using either the Beers 6-parameter interpolation provided or cubic-spline interpolation.
The formulation of the multi-exponential model was presented in the Overview and is not repeated here. In this section, we discuss in greater detail aspects that should be considered carefully before applying the method in practice.
The multi-exponential model is applied to schedules of one-year age migration rates beginning at age 0 and, typically, continuing to age 65 or higher to capture the full pattern of elderly migration. The schedules of age-specific migration might measure directional migration (i.e. from region i to region j) or total out-migration (i.e. from region i to all other regions), or all inter-regional migration with no specific origin or destination. Usually, migration data are obtained from national censuses (or, in developed countries, population registers). The multi-exponential model can be applied to a variety of measures of single-age migration propensities derived from either of these sources.
When obtained from national registration systems, the migration rate, for persons aged x at the beginning of the interval, is the ratio of the number of migrations during a given time interval divided by the average number of person-years exposed to the risk of moving. Persons can contribute more than one migration during the interval. These are occurrence-exposure rates, although migrations by non-survivors may not be included in the numerator (Rogers and Castro 1981).
The observed migration schedule in Figure 2 was derived from Sweden’s national registry for male migration out of Stockholm over a one-year interval. In contrast, Figure 6 shows the observed and estimated model schedule for all male inter-communal migration in Sweden over a five-year interval. As expected, the levels are much higher in Figure 6 due to more migration activity when all regions are combined as compared to the Stockholm region alone. Similarly, more migrations are expected over a five-year interval than over a one-year interval. Rees (1977) found migration rates over a five-year interval tend to be less than five times (between 3 and 5 times) those over a one-year interval. The observed schedule is also smoother and more similar to the model schedule in Figure 6, indicating single-age migration rates are more reliable when based on a longer interval.
Censuses, on the other hand, count surviving migrants (not migrations). Migrants are persons who reported living in one region, at the beginning of the time interval, and resided in a different region at the time of the census. A person registering multiple migrations in a national register may be a non-migrant in the census if he returned to his initial location during the time interval. In general, counts of migrants from censuses understate the number of migrations, especially for longer time intervals when there are bound to be larger numbers of return movers and non-survivors. For these reasons, a migration schedule derived from population register data is not directly comparable to one based on census data (Rogers and Castro 1986).
Censuses typically record the location of a person’s current residence and ask where the person was living either one year ago or five years ago. Given this information and the person’s age at the time of census, the numbers of surviving migrants and the numbers of survivors who were at risk of migrating are counted. The ratio of the number of surviving migrants to the number of survivors at risk of migrating is sometimes called a ‘conditional survivorship proportion’ because migrants and persons counted as being at risk of migrating must have survived the migration time interval to be counted by the census (Rogers, Little and Raymer 2010). Since these are not occurrence-exposure rates, they are called migration propensities here.
To derive single-age migration propensities when the census question asks where a person was living one year ago, all persons are “back-cast” to the region where they lived one year earlier when they were one year younger, which gives the number of persons at risk of migrating from that region. For example, a person aged 1 last birthday in a census conducted in 2010 would have been aged 0 last birthday in 2009. If the 2010 age values ranged from 1 to 85, they would range from 0 to 84 in 2009. (Note, only persons aged 1 and older would have reported place of residence 1 year ago.) Back-casting yields the number of people who survived to be counted by the census in 2010 and who were at risk for migrating from region i, in 2009. The number of migrants would be the count of persons who reported living in region i in 2009, but were counted as residing in a different region in 2010. For each 1-year age group, the ratio of the number of migrants to the number at risk for migrating gives the age-specific out-migration propensity for the 1-year interval. When the numerator contains directional migrants, i.e. from region i to region j, the ratio gives the age-specific propensity to migrate from region i to region j.
Caution must be exercised when using one-year age propensities over one-year migration intervals. For each single age, the numbers at risk of migrating, as well as the numbers of migrants, may be small, resulting in propensities that are erratic and unstable. A better option may be to derive five-year age propensities, which have proven to be more reliable than one-year age propensities (Rogers, Little and Raymer 2010). These can be interpolated to yield one-year age propensities.
When the census question asks where a person was living five years ago, it is possible to derive one-year age propensities for migrating over a five-year interval as long as single ages are reported. It is done by back-casting all persons to the region where they lived five years earlier when they were five years younger. Persons aged 5 last birthday in a census conducted in 2000, for example, would have been aged 0 last birthday in 1995. If the age values ranged from 5 to 85 in 2000, they would range from 0 to 80 in 1995. The number of migrants is simply the count of persons who reported living in region i in 1995, but were counted as residing in a different region in 2000. For each one-year age group, the ratio of the number of migrants to the number at risk for migrating gives the age-specific out-migration propensity over the five-year interval.
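The back-casting calculation is the same for a one-year and a five-year question, differing only in the length of the interval. The sketch below is a minimal illustration: the record layout (age at census, region at the start of the interval, region at census), the function name and the seven person records are all hypothetical.

```python
from collections import Counter

def out_migration_propensities(persons, region, interval=1):
    """Age-specific out-migration propensities for one region.

    persons: iterable of (age_at_census, region_at_start, region_at_census).
    Ages in the result are ages at the start of the interval.
    """
    at_risk = Counter()
    migrants = Counter()
    for age, origin, current in persons:
        # Persons younger than the interval could not report a prior
        # residence; persons who started elsewhere were not at risk here.
        if age < interval or origin != region:
            continue
        start_age = age - interval      # back-cast age
        at_risk[start_age] += 1
        if current != region:
            migrants[start_age] += 1
    return {a: migrants[a] / at_risk[a] for a in sorted(at_risk)}

# Hypothetical census records.
persons = [(1, "A", "A"), (1, "A", "B"), (1, "B", "A"),
           (6, "A", "B"), (6, "A", "A"), (6, "A", "A"), (6, "A", "A")]
one_year = out_migration_propensities(persons, "A", interval=1)
five_year = out_migration_propensities(persons, "A", interval=5)
```

Directional propensities from region i to region j would follow by counting in the numerator only those whose current region is j.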
When the numbers of migrants who survived a five-year migration interval are available from census data, single-year, single-age migration rates can be derived through a back-projecting procedure outlined by Dorrington and Moultrie (2009). Their method compensates for the effect of mortality by applying the mortality regime of the general population to the migrants. It compensates for the effect of onward migration iteratively: the annual rates of migration for the most recent year are used to estimate the population by region one year before the census; that population is used to estimate the migration rates two years before the census, which in turn are used to estimate the population two years before the census, and so on. The method requires additional region-of-birth information for those aged 0-4 at the time of the census, as well as single-age, yearly estimates of regional populations. Schedules derived in this manner can then be fitted and smoothed with a Rogers-Castro model schedule, and used in single-year population projections.
Unless the data are accurate and well behaved, the multi-exponential model will not produce a very close fit and may be over-parameterised, i.e., many different sets of parameters can produce virtually equally good fits to the observed values. In such a situation it might help to fix one or two parameter values and fit the rest. In general, the number of parameters should be kept as small as possible.
Application of the multi-exponential model is not limited to schedules of migration rates or propensities. Several studies have established that age distributions of migrants (and migrations if using registration data) often have a multi-exponential form and can be accurately represented by a Rogers-Castro model schedule (Little and Rogers 2007; Rogers, Little and Raymer 2010).
The single-age numbers of migrants/migrations can be derived using any of the data sources and methods described above, because these are simply the numerators in the migration propensity and rate calculations. The observed data fitted by the model schedules are the single-age proportions of the total migrants/migrations. Note, if the numbers of migrants are reported in five-year age categories, some form of interpolation would be necessary. If cubic spline interpolation is used, the numbers associated with each node should be the migrants/migrations for each five-year age grouping divided by five.
For example, the observed age composition of Swedish migrations, expressed as proportions, is illustrated in Figure 7. The observed schedule appears to be very smooth and reliable except at the oldest ages. A 7-parameter model schedule fits closely, with an R² of 99 per cent and a MAPE of 29 per cent. However, this is an example of how the MAPE can exaggerate a model's lack of fit: it becomes inflated when there is a sequence of ages with very small observed values, where even tiny absolute deviations are large in percentage terms.
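To see how the two fit statistics can diverge, the sketch below computes both for a short, hypothetical set of observed and fitted proportions. It assumes the usual definitions, with R² as one minus the ratio of the residual to the total sum of squares, and MAPE as the mean absolute deviation expressed as a percentage of the observed value. The numbers are invented for illustration and are not the Swedish data.

```python
def r_squared(observed, fitted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - sse / sst


def mape(observed, fitted):
    """Mean absolute percentage error, in per cent."""
    return 100 * sum(abs(o - f) / o for o, f in zip(observed, fitted)) / len(observed)


# Hypothetical proportions: the last two mimic the tiny values at the oldest ages.
observed = [0.030, 0.050, 0.040, 0.020, 0.0002, 0.0001]
fitted = [0.031, 0.049, 0.040, 0.020, 0.0004, 0.0003]

r2 = r_squared(observed, fitted)
mape_value = mape(observed, fitted)
print(round(r2, 3), round(mape_value, 1))
```

In this invented example the two smallest observed values account for almost all of the MAPE, although their absolute deviations are smaller than those at the younger ages.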
Two alternatives to the Excel workbook for fitting the multi-exponential curve are 1) Data Master 2003, a free curve-fitting program that applies the Levenberg–Marquardt algorithm; and 2) R (R Development Core Team 2012), which is also free but, being a software environment for general-purpose statistical computing and graphics, requires a significant investment of time before it can be used with confidence. The Appendix to this chapter on the Tools for Demographic Estimation website gives very basic commands for defining R functions that produce estimates for the 7-parameter and the 11-parameter models using the Gauss-Newton algorithm.
Bates J and I Bracken. 1982. "Estimation of migration profiles in England and Wales", Environment and Planning A 14(7):889-900. doi: http://dx.doi.org/10.1068/a140889
Bates J and I Bracken. 1987. "Migration age profiles for local-authority areas in England, 1971-1981", Environment and Planning A 19(4):521-535. doi: http://dx.doi.org/10.1068/a190521
Beers H. 1945. "Six-term formulas for routine actuarial interpolation", The Record of the American Institute of Actuaries 33(2):245-260.
Dorrington R and TA Moultrie. 2009. "Making use of the consistency of patterns to estimate age-specific rates of interprovincial migration in South Africa," Paper presented at Annual Meeting of the Population Association of America. Detroit, Michigan, 29 April - 2 May 2009.
George MV. 1994. Population projections for Canada, provinces and territories, 1993-2016. Ottawa: Statistics Canada, Demography Division, Population Projections Section.
Hofmeyr BE. 1988. "Application of a mathematical model to South African migration data, 1975–1980", Southern African Journal of Demography 2(1):24–28.
Kawabe H. 1990. Migration rates by age group and migration patterns: Application of Rogers' migration schedule model to Japan, The Republic of Korea, and Thailand. Tokyo: Institute of Developing Economies.
Liaw K-L and DN Nagnur. 1985. "Characterization of metropolitan and nonmetropolitan outmigration schedules of the Canadian population system, 1971-1976", Canadian Studies in Population 12(1):81-102.
Little JS and A Rogers. 2007. "What can the age composition of a population tell us about the age composition of its out-migrants?", Population, Space and Place 13(1):23-39. doi: http://dx.doi.org/10.1002/psp.440
McNeil DR, TJ Trussell and JC Turner. 1977. "Spline interpolation of demographic data", Demography 14(2):245-252. doi: http://dx.doi.org/10.2307/2060581
Morrison PA, TM Bryan and DA Swanson. 2004. "Internal migration and short-distance mobility," in Siegel, JS and DA Swanson (eds). The Methods and Materials of Demography. San Diego: Elsevier pp. 493-521.
Potrykowska A. 1988. "Age patterns and model migration schedules in Poland", Geographia Polonica 54:63-80.
Press WH, BP Flannery, SA Teukolsky and WT Vetterling. 1986. Numerical Recipes: The Art of Scientific Computing. Cambridge: Cambridge University Press.
R Development Core Team. 2012. R: A language and environment for statistical computing: Reference Index. Vienna, Austria: R Foundation for Statistical Computing. http://www.mendeley.com/research/r-language-environment-statistical-computing-13/
Raymer J and A Rogers. 2008. "Applying model migration schedules to represent age-specific migration flows," in Raymer, J and F Willekens (eds). International Migration in Europe: Data, Models and Estimates. Chichester: Wiley, pp. 175-192.
Rees PH. 1977. "The measurement of migration, from census data and other sources", Environment and Planning A 9(3):247-272. doi: http://dx.doi.org/10.1068/a090247
Rogers A and LJ Castro. 1981. Model Migration Schedules. Laxenburg, Austria: International Institute for Applied Systems Analysis. http://webarchive.iiasa.ac.at/Admin/PUB/Documents/RR-81-030.pdf
Rogers A and LJ Castro. 1986. "Migration," in Rogers, A and F Willekens (eds). Migration and Settlement: A Multiregional Comparative Study. Dordrecht: D. Reidel, pp. 157-208.
Rogers A, LJ Castro and M Lea. 2005. "Model migration schedules: Three alternative linear parameter estimation methods", Mathematical Population Studies 12(1):17-38. doi: http://dx.doi.org/10.1080/08898480590902145
Rogers A and JS Little. 1994. "Parameterizing age patterns of demographic rates with the multiexponential model schedule", Mathematical Population Studies 4(3):175-195. doi: http://dx.doi.org/10.1080/08898489409525372
Rogers A, JS Little and J Raymer. 2010. The Indirect Estimation of Migration: Methods for Dealing with Irregular, Inadequate, and Missing Data. Dordrecht: Springer.
Rogers A and J Raymer. 1999. "Estimating the regional migration patterns of the foreign-born population in the United States: 1950-1990", Mathematical Population Studies 7(3):181-216. doi: http://dx.doi.org/10.1080/08898489909525457
Rogers A and J Watkins. 1987. "General versus elderly interstate migration and population redistribution in the United States", Research on Aging 9(4):483-529. doi: http://dx.doi.org/10.1177/0164027587094002
The log-linear modelling framework provides several valuable techniques for studying and estimating migration flows within a network of regions. To date, these methods have been applied most often to internal migration systems where regions are defined as sub-national administrative units. However, they need not be restricted to domestic migration and may be applied to international systems of migration as well (Raymer 2007).
A migration flow is defined as the number of migrations from one region to another over the course of a specified time frame. There are several different ways to count migrations and each one could yield a different result. For example, Rees and Willekens (1986) make the distinction between registration systems that count the number of inter-regional residential moves over a reference period and censuses that count persons who reside in a place at the time of the census that is different from the place of residence at the beginning of the reference period.
Regardless of the method used to count migration flows, it is conventional to present them in contingency tables. These are square tables that report the flow counts between origin and destination regions. The flows in the migration table can be perfectly reproduced by the multiplicative component model, which is a saturated log-linear model (i.e., one with as many estimated parameters as there are data points). It has been used by Willekens (1983), Rogers, Willekens, Little et al. (2002) and Rogers, Little and Raymer (2010) to represent the matrix of flows between regions, and by Raymer and Rogers (2007), Raymer, Bonaguidi and Valentini (2006) and Rogers, Little and Raymer (2010) to capture the structure of inter-regional flows within age categories. The multiplicative components are interpretable and conveniently used to define the structure of migration between the regions of interest (Rogers, Willekens, Little et al. 2002). If calculated for more than one set of inter-regional flows, defined for different time periods, for example, or for different age, sex or race categories, multiplicative components are useful for comparing migration regimes across these populations.
Log-linear methods may be used to justify simplified representations of migration structure that are more parsimonious than the saturated model. The appropriateness of a reduced model is determined by fitting the predicted flows to the observed flows and by using statistical methods to evaluate the goodness of fit. If the reduced form has merit, i.e., fits the data well, the model may be used to estimate indirectly the flows. The independence model, for example, assumes inter-regional flows are distributed according to the pattern that could have been predicted based on the marginal distributions of flows across origin and destination regions. If the independence model is confirmed, inter-regional flows are predictable and can be estimated indirectly, but accurately, if the total sending and receiving flows of each region are given.
Sometimes the structure of migration is hypothesized to be invariant with respect to factors such as time, age, sex, and race. These hypotheses can be represented and tested with log-linear models. Allowing for changes in the level of migration, studies have documented remarkable stability in migration structures, in particular the rates of migration by age, over time (Mueser 1989; Nair 1985; Snickars and Weibull 1977). Other studies have shown consistency in the age patterns of inter-regional migration over time (Raymer and Rogers 2007). Moreover, the migration structure of the youngest ages, which can be inferred from birthplace-specific population stocks, has, in certain contexts, proven to be a “proxy” for the level of migration and allowed the estimation of migration of the older age groups (Raymer and Rogers 2007; Rogers, Little and Raymer 2010).
These studies have set the stage for establishing the method of offsets as a successful tool for indirectly estimating migration flows. It is a special application of log-linear modelling that forces a known migration structure on to a system that may have missing or unreliable inter-regional flow data. Using this method, the known migration structure of one time period can be borrowed for another period. In addition, when flows are disaggregated by age, the structure of age-specific inter-regional flows of one time period can be applied to another period. Furthermore, Raymer and Rogers (2007) showed that the level of infant lifetime migration can be applied, using the method of offsets, to estimate indirectly the migration flows of the older ages.
Applications of log-linear models, and the related assumptions, are detailed in the sections that follow, beginning with the two-variable case, i.e., origin and destination. In this section, the log-linear model is defined in the context of two-dimensional flow tables, and multiplicative forms as well as additive forms of the saturated model are derived and interpreted. The log-linear model of independence and the “migrants only” quasi-independence model are set out, including illustrations and a brief description of the methods for evaluating goodness-of-fit.
The section concludes with an illustration of the method of offsets for indirectly estimating the inter-regional flow data of one period based on the migration flow patterns of another. When flow data are available for two periods, the period-invariance assumption can be tested with a log-linear model and the method of offsets. Models that disaggregate the origin and destination of flows into age categories are considered. This is followed by an illustration of how the multiplicative model with age can be applied, using the method of offsets, to estimate indirectly the age-specific inter-regional flows for another period.
To illustrate the two-variable log-linear model, consider the 1973 and 1976 migrations in the Netherlands between types of municipalities categorized into six different groups based on degree of urbanization. These were published by Willekens (1983) and are presented in Table 1. In this context, there are two variables, region of origin (O) and region of destination (D). Neither is identified as the dependent variable. The outcome variable may be either the inter-regional migration flow, denoted nij, in the multiplicative form of the model, or the natural logarithm of the flow, denoted ln(nij), in the additive form of the model.
Decompositions of the saturated model, each of which perfectly regenerates the observed data, are described in the subsections on the multiplicative component model and the additive linear model. Three indirect estimation techniques are then illustrated in the subsections on the independence model, the quasi-independence model and the method of offsets.
Table 1 Migration between municipalities by degree of urbanization,* the Netherlands, 1973 and 1976

A. 1973 Migration table (rows: origin; columns: destination)

| Origin | 1 | 2 | 3 | 4 | 5 | 6 | Total |
|---|---|---|---|---|---|---|---|
| 1 | 50,498 | 23,829 | 8,566 | 21,846 | 16,264 | 18,856 | 139,859 |
| 2 | 25,005 | 27,536 | 6,953 | 14,326 | 16,212 | 18,282 | 108,314 |
| 3 | 15,675 | 10,710 | 13,874 | 6,266 | 9,819 | 19,701 | 76,045 |
| 4 | 23,457 | 14,169 | 4,431 | 10,209 | 9,386 | 10,973 | 72,625 |
| 5 | 29,548 | 25,267 | 11,802 | 13,160 | 15,979 | 20,406 | 116,162 |
| 6 | 46,815 | 39,123 | 42,399 | 25,012 | 26,830 | 23,304 | 203,483 |
| Total | 190,998 | 140,634 | 88,025 | 90,819 | 94,490 | 111,522 | 716,488 |

B. 1976 Migration table (rows: origin; columns: destination)

| Origin | 1 | 2 | 3 | 4 | 5 | 6 | Total |
|---|---|---|---|---|---|---|---|
| 1 | 14,473 | 14,327 | 6,077 | 11,689 | 10,618 | 9,897 | 67,081 |
| 2 | 14,833 | 36,258 | 13,289 | 17,391 | 20,899 | 21,869 | 124,539 |
| 3 | 8,330 | 17,764 | 25,113 | 10,489 | 18,171 | 29,220 | 109,087 |
| 4 | 11,315 | 16,498 | 8,935 | 10,537 | 10,762 | 12,519 | 70,566 |
| 5 | 11,875 | 24,370 | 19,151 | 12,312 | 16,724 | 22,591 | 107,023 |
| 6 | 16,582 | 32,336 | 52,415 | 22,264 | 28,182 | 27,810 | 179,589 |
| Total | 77,408 | 141,553 | 124,980 | 84,682 | 105,356 | 123,906 | 657,885 |

*1: rural municipalities; 2: industrial rural municipalities; 3: specific resident municipalities of commuters; 4: rural towns and small towns; 5: medium-sized towns; 6: large towns of more than 100,000 inhabitants.
Source: Central Bureau of Statistics, The Hague
The multiplicative expression of the saturated log-linear model, called the multiplicative component model, reproduces the elements of the flow table as follows:

$$n_{ij} = T \times O_i \times D_j \times OD_{ij} \qquad (1)$$
Like all saturated models, it is, strictly speaking, not a model but a way of representing the data. nij is the observed flow of migration from region i to region j, and the effect parameters are T, Oi, Dj, ODij. Therefore, any i to j flow found in the interior 6 by 6 sub-matrices of Table 1 can be expressed by an equation of the same form as Equation 1 with the corresponding set of parameters. T gives the overall effect, Oi gives the effect of origin i, Dj gives the effect of destination j, and ODij gives the effect of the association between Oi and Dj. Taken together, the parameters of the saturated model represent the spatial structure of migration (Rogers, Willekens, Little et al. 2002).
Two different sets of parameters that satisfy the multiplicative component model have been used in migration studies and both are presented here. Each one offers a different way of representing and interpreting the migration structure. The first is called geometric mean effect coding (Knoke and Burke 1980; Willekens 1983) and the second is called total sum reference coding (Raymer and Rogers 2007; Rogers, Little and Raymer 2010). A third multiplicative component model is derived in the subsection presenting the log-linear additive model.
Geometric mean effect coding was the first decomposition of Equation 1 used for migration analysis. It was proposed by Birch (1963) and is formally equivalent to the gravity model of migration (Willekens 1983). Table 2 shows the multiplicative components resulting from geometric mean effect coding of the Netherlands data from Table 1. Note that the overall component (T) is set out in the grand total locations of the table, the origin components (Oi) are set out in the row-total locations, the destination components (Dj) are set out in the column-total locations, and the origin-destination interaction components (ODij) are set out in the cells of the interior sub-matrices.
Table 2 Multiplicative components using geometric mean effect coding

A. 1973 Migration table (rows: origin; columns: destination)

| Origin | 1 | 2 | 3 | 4 | 5 | 6 | Total |
|---|---|---|---|---|---|---|---|
| 1 | 1.457 | 0.940 | 0.656 | 1.352 | 0.933 | 0.882 | 1.180 |
| 2 | 0.885 | 1.332 | 0.653 | 1.087 | 1.140 | 1.048 | 0.962 |
| 3 | 0.771 | 0.720 | 1.811 | 0.661 | 0.959 | 1.570 | 0.692 |
| 4 | 1.275 | 1.052 | 0.639 | 1.190 | 1.014 | 0.966 | 0.627 |
| 5 | 0.943 | 1.102 | 1.000 | 0.901 | 1.013 | 1.055 | 1.067 |
| 6 | 0.838 | 0.957 | 2.015 | 0.960 | 0.954 | 0.676 | 1.903 |
| Total | 1.711 | 1.252 | 0.644 | 0.798 | 0.861 | 1.056 | 17,168.003 |

B. 1976 Migration table (rows: origin; columns: destination)

| Origin | 1 | 2 | 3 | 4 | 5 | 6 | Total |
|---|---|---|---|---|---|---|---|
| 1 | 1.753 | 0.984 | 0.571 | 1.317 | 0.979 | 0.787 | 0.656 |
| 2 | 0.986 | 1.366 | 0.686 | 1.075 | 1.057 | 0.954 | 1.195 |
| 3 | 0.655 | 0.792 | 1.533 | 0.767 | 1.088 | 1.508 | 1.010 |
| 4 | 1.277 | 1.055 | 0.783 | 1.106 | 0.925 | 0.927 | 0.704 |
| 5 | 0.900 | 1.047 | 1.127 | 0.868 | 0.965 | 1.124 | 1.048 |
| 6 | 0.769 | 0.850 | 1.888 | 0.960 | 0.995 | 0.847 | 1.712 |
| Total | 0.768 | 1.354 | 0.989 | 0.825 | 1.008 | 1.169 | 16,401.919 |
The overall effect, T, is described as the constant of proportionality or the size main effect (Willekens 1983). It is the geometric mean of all inter-regional flow values:

$$T = \left( \prod_{i=1}^{m} \prod_{j=1}^{m} n_{ij} \right)^{1/m^2}$$
where m is the number of origin regions (rows) = the number of destination regions (columns). T equals 17,168.003 for 1973 and 16,401.919 for 1976.
For a particular region i, the main effect of that region of origin is the ratio of the geometric mean of flows originating from i to the overall geometric mean:

$$O_i = \frac{1}{T} \left( \prod_{j=1}^{m} n_{ij} \right)^{1/m}$$
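The main effect, Oi, shows the relative importance of region i as a source of migrations (Alonso 1986). For example, based on the 1973 data, the effect of originating in Category 4 is equal to:

$$O_4 = \frac{(23{,}457 \times 14{,}169 \times 4{,}431 \times 10{,}209 \times 9{,}386 \times 10{,}973)^{1/6}}{17{,}168.003} = 0.627$$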
This is the smallest of the origin (row) effects, which suggests that Category 4 was the least important source of migrations in 1973.
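Similarly, the destination main effect, Dj, gives the relative importance of region j as an attractor of migrants. It is the ratio of the geometric mean of column j to the overall geometric mean and the formula is:

$$D_j = \frac{1}{T} \left( \prod_{i=1}^{m} n_{ij} \right)^{1/m}$$

For example, for municipalities in Category 4, the destination effect in 1973 is equal to:

$$D_4 = \frac{(21{,}846 \times 14{,}326 \times 6{,}266 \times 10{,}209 \times 13{,}160 \times 25{,}012)^{1/6}}{17{,}168.003} = 0.798$$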
All other row and column effects can be derived in the same way. Each is the geometric mean of the row (or column) elements divided by the overall geometric mean, and they are equivalent to the balancing factors in the gravity model (Willekens 1983).
They can be compared across regions and across time periods. For example, Category 6 was the most important source of migrations in 1973 (1.903 is greater than the other origin effects), and in 1976 (1.712 is greater than the other origin effects). Category 1 was less important as a destination in 1976 than in 1973 (0.768 is less than 1.711), and, in 1973, it was less important as a source of migrations than as a destination for migrations (1.180 is less than 1.711).
The interior sub-matrices of Panels A and B in Table 2 are sometimes called the spatial interaction matrices. The elements are the ODij interaction effects in Equation 1 and each one is equal to the observed flow between i and j divided by the expected flow, which is the product of the other three parameters. The formula is:

$$OD_{ij} = \frac{n_{ij}}{T \times O_i \times D_j}$$
Each ODij expresses the departure of the observed flow, nij, from the expected flow based on the assumption of no association between the destination region j and the origin region i, i.e., (T)(Oi)(Dj). They have been interpreted as indicators of the accessibility, the ease of interaction, or the attractiveness between two regions (Rogers, Willekens, Little et al. 2002).
Values equal to 1.0 indicate independence, i.e., no association between the origin and the destination. As implied by Equation 1, if an ODij parameter is equal to 1.0, nij is determined by the values of T, Oi and Dj alone. A departure from 1.0 in either direction is an indication of an association between the destination and the origin. Values greater than 1.0 indicate higher than expected levels of accessibility/attractiveness and values less than 1.0 indicate less than expected accessibility/attractiveness.
Since the 1973 diagonal effects are generally greater than 1.0, it appears migrants were unexpectedly attracted to destinations in the same category of municipality. Category 6 was an exception. Migrants from large towns of more than 100,000 inhabitants (i.e., Category 6) were more attracted to commuter municipalities (i.e., Category 3) than to other large towns (2.015 is greater than 0.676).
Table 2 shows all the parameters necessary for reproducing the 1973 and 1976 flows. To verify that any flow in Table 1 can be reproduced by the multiplicative components, take, for example, the 1973 flow from Category 2 to Category 3:
$$n_{23} = 6{,}953 \approx 17{,}168.003 \times 0.962 \times 0.644 \times 0.653$$
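The geometric mean effect coding calculations can be sketched in Python as follows. This is an illustration written for this section, not the accompanying workbook; the function and variable names are hypothetical, and the flows are the 1973 values from Table 1. The results should agree with Panel A of Table 2 up to rounding.

```python
import math

# 1973 flows from Table 1: rows are origins, columns are destinations.
FLOWS_1973 = [
    [50498, 23829, 8566, 21846, 16264, 18856],
    [25005, 27536, 6953, 14326, 16212, 18282],
    [15675, 10710, 13874, 6266, 9819, 19701],
    [23457, 14169, 4431, 10209, 9386, 10973],
    [29548, 25267, 11802, 13160, 15979, 20406],
    [46815, 39123, 42399, 25012, 26830, 23304],
]


def geometric_mean(values):
    """Geometric mean, computed on the log scale to avoid overflow."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


def geometric_mean_coding(flows):
    """Return T, O, D and OD under geometric mean effect coding."""
    m = len(flows)
    T = geometric_mean(n for row in flows for n in row)
    O = [geometric_mean(row) / T for row in flows]
    D = [geometric_mean(row[j] for row in flows) / T for j in range(m)]
    OD = [[flows[i][j] / (T * O[i] * D[j]) for j in range(m)] for i in range(m)]
    return T, O, D, OD


T, O, D, OD = geometric_mean_coding(FLOWS_1973)

# Rebuild the Category 2 -> Category 3 flow (zero-based indices 1 and 2).
rebuilt = T * O[1] * D[2] * OD[1][2]
print(round(T, 3), round(O[3], 3), round(D[3], 3), round(rebuilt))
```

With unrounded components the flow is reproduced exactly; the small discrepancy in the hand calculation above arises only because the published components are rounded to three decimals.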
The parameter values, however, are not all independent of each other: some can be derived from the others. For one year of data, Table 2 reports 36 interaction effects, 6 origin main effects, 6 destination main effects and one overall effect. These 49 parameters were derived from only 36 observed flows, so 13 of them must be redundant, i.e., 13 of the 49 parameters can be calculated from the other 36. The relationships between the parameters are determined by the following constraints associated with geometric mean effect coding. The first set of constraints forces the product of the origin main effects, and likewise the product of the destination main effects, to be equal to 1. This is expressed as

$$\prod_{i=1}^{m} O_i = \prod_{j=1}^{m} D_j = 1$$

The second set of constraints is imposed on the interaction elements of each row and column, making the product of the interior elements in each row (and in each column) equal to 1. In other words, if five of the interaction effects associated with a particular origin (or destination) are given, the sixth interaction effect is implied. This is expressed as

$$\prod_{j=1}^{m} OD_{ij} = 1 \ \text{for all } i, \qquad \prod_{i=1}^{m} OD_{ij} = 1 \ \text{for all } j$$

In general, if there are m regions there are m² linearly independent parameters and 1+m+m+(m×m) multiplicative components. For all of the geometric mean effect coding computations, see Table 2 in the Multiplicative Components sheet of the accompanying workbook.
Geometric mean effect coding, which uses the geometric mean as the reference value, was the earliest log-linear decomposition used to describe migration (Rogers, Willekens, Little et al. 2002; Willekens 1983). Recently, however, total sum reference coding has become more standard (Raymer and Rogers 2007; Rogers, Little and Raymer 2010). While both decompositions satisfy Equation 1, the effects under total sum reference coding are more transparent. For example, the main effect, T, is now the total number of migrants, denoted n++. Oi is now the proportion of all migrants leaving from region i (i.e., ni+/n++), and Dj is the proportion of all migrants moving to region j (i.e., n+j/n++). The interaction component ODij is now defined as nij/[(T)(Oi)(Dj)] or the ratio of the observed number of migrants, nij, to the expected number, (T)(Oi)(Dj). All effects taken together provide another way to represent the spatial structure of migration.
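The multiplicative components derived from total sum reference coding are set out in Table 3. Consider, for example, the 8,566 migrations from Category 1 to Category 3 in 1973 disaggregated into the four multiplicative components:

$$n_{13} = T \times O_1 \times D_3 \times OD_{13}: \qquad 8{,}566 \approx 716{,}488 \times 0.195 \times 0.123 \times 0.499$$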
The interpretations of these components are relatively straightforward. The overall component is the reported total number of migrations in 1973, i.e., 716,488. The origin component represents the share of all migrations from each region, i.e., about 20 per cent (0.195) of all migrations originated in Category 1. The destination component represents the share of all migrations to each region, i.e., about 12 per cent (0.123) of all migrations had Category 3 as the destination. Finally, the interaction component represents the ratio of observed migrations to expected migrations, and there were roughly 50 observed migrations from Category 1 to Category 3 for every 100 expected. The expected flow is based on the marginal total information, i.e., (T)(O1)(D3).
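The same decomposition can be sketched in Python using the 1973 flows from Table 1. The function and variable names are hypothetical, and the code is an illustration rather than the workbook's implementation.

```python
# 1973 flows from Table 1: rows are origins, columns are destinations.
FLOWS_1973 = [
    [50498, 23829, 8566, 21846, 16264, 18856],
    [25005, 27536, 6953, 14326, 16212, 18282],
    [15675, 10710, 13874, 6266, 9819, 19701],
    [23457, 14169, 4431, 10209, 9386, 10973],
    [29548, 25267, 11802, 13160, 15979, 20406],
    [46815, 39123, 42399, 25012, 26830, 23304],
]


def total_sum_coding(flows):
    """Return T, O, D and OD under total sum reference coding."""
    m = len(flows)
    T = sum(sum(row) for row in flows)                        # n++
    O = [sum(row) / T for row in flows]                       # ni+ / n++
    D = [sum(row[j] for row in flows) / T for j in range(m)]  # n+j / n++
    OD = [[flows[i][j] / (T * O[i] * D[j]) for j in range(m)] for i in range(m)]
    return T, O, D, OD


T, O, D, OD = total_sum_coding(FLOWS_1973)

# Components of the Category 1 -> Category 3 flow (zero-based indices 0 and 2).
expected_13 = T * O[0] * D[2]
print(T, round(O[0], 3), round(D[2], 3), round(OD[0][2], 3))  # 716488 0.195 0.123 0.499
```

Here `expected_13` is the flow implied by the marginal totals alone, so `OD[0][2]` is the observed flow of 8,566 divided by that expected flow.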
Table 3 Multiplicative components using total sum reference coding

A. 1973 Migration table (rows: origin; columns: destination)

| Origin | 1 | 2 | 3 | 4 | 5 | 6 | Total |
|---|---|---|---|---|---|---|---|
| 1 | 1.354 | 0.868 | 0.499 | 1.232 | 0.882 | 0.866 | 0.195 |
| 2 | 0.866 | 1.295 | 0.523 | 1.043 | 1.135 | 1.084 | 0.151 |
| 3 | 0.773 | 0.718 | 1.485 | 0.650 | 0.979 | 1.664 | 0.106 |
| 4 | 1.212 | 0.994 | 0.497 | 1.109 | 0.980 | 0.971 | 0.101 |
| 5 | 0.954 | 1.108 | 0.827 | 0.894 | 1.043 | 1.129 | 0.162 |
| 6 | 0.863 | 0.980 | 1.696 | 0.970 | 1.000 | 0.736 | 0.284 |
| Total | 0.267 | 0.196 | 0.123 | 0.127 | 0.132 | 0.156 | 716,488 |

B. 1976 Migration table (rows: origin; columns: destination)

| Origin | 1 | 2 | 3 | 4 | 5 | 6 | Total |
|---|---|---|---|---|---|---|---|
| 1 | 1.834 | 0.993 | 0.477 | 1.354 | 0.988 | 0.783 | 0.102 |
| 2 | 1.012 | 1.353 | 0.562 | 1.085 | 1.048 | 0.932 | 0.189 |
| 3 | 0.649 | 0.757 | 1.212 | 0.747 | 1.040 | 1.422 | 0.166 |
| 4 | 1.363 | 1.087 | 0.667 | 1.160 | 0.952 | 0.942 | 0.107 |
| 5 | 0.943 | 1.058 | 0.942 | 0.894 | 0.976 | 1.121 | 0.163 |
| 6 | 0.785 | 0.837 | 1.536 | 0.963 | 0.980 | 0.822 | 0.273 |
| Total | 0.118 | 0.215 | 0.190 | 0.129 | 0.160 | 0.188 | 657,885 |
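Like geometric mean effect coding, the decomposition based on total sum reference coding gives more parameters than original data points. The constraints that define the relationships between parameters, and thus allow the redundant parameters to be derived, are as follows:

$$\sum_{i=1}^{m} O_i = 1, \qquad \sum_{j=1}^{m} D_j = 1, \qquad \sum_{j=1}^{m} D_j \, OD_{ij} = 1 \ \text{for each } i, \qquad \sum_{i=1}^{m} O_i \, OD_{ij} = 1 \ \text{for each } j$$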
where m is the number of regions (Raymer, Bonaguidi and Valentini 2006).
For all of the total sum reference coding computations, see Table 3 in the Multiplicative components sheet of the accompanying workbook.
If the same decomposition scheme is applied to two sets of flow data from a given system of regions, all but the T parameter are scale free. This means that taking the ratios of two sets of components provides a simple method for examining stability in migration structure without confounding the effects of growth or decline in overall levels of migration (Rogers, Willekens, Little et al. 2002). In Table 4, ratios of the 1976 to 1973 components are displayed. Several depart substantially from 1 indicating the migration structure changed in the three years between 1973 and 1976. For example, the ratio of the components for OD11 is equal to 1.354, implying that migration within Category 1 was more attractive in 1976 than in 1973. In contrast, the ratio of the components for OD33 is equal to 0.816, suggesting migration within Category 3 was less attractive in 1976 than in 1973.
Table 4 Ratios of 1976 to 1973 multiplicative components (rows: origin; columns: destination)

| Origin | 1 | 2 | 3 | 4 | 5 | 6 | Total |
|---|---|---|---|---|---|---|---|
| 1 | 1.354 | 1.144 | 0.957 | 1.099 | 1.121 | 0.904 | 0.522 |
| 2 | 1.169 | 1.045 | 1.075 | 1.040 | 0.923 | 0.860 | 1.252 |
| 3 | 0.839 | 1.055 | 0.816 | 1.149 | 1.062 | 0.854 | 1.562 |
| 4 | 1.125 | 1.093 | 1.342 | 1.046 | 0.972 | 0.970 | 1.058 |
| 5 | 0.988 | 0.955 | 1.139 | 1.000 | 0.936 | 0.993 | 1.003 |
| 6 | 0.909 | 0.854 | 0.906 | 0.993 | 0.980 | 1.117 | 0.961 |
| Total | 0.441 | 1.096 | 1.546 | 1.015 | 1.214 | 1.210 | 0.918 |
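The ratios in Table 4 appear to be based on the total sum reference coding components in Table 3 (for instance, 1.834/1.354 for OD11). Under that assumption, the comparison can be sketched in Python as follows; the function and variable names are hypothetical and the flows are those of Table 1.

```python
FLOWS_1973 = [
    [50498, 23829, 8566, 21846, 16264, 18856],
    [25005, 27536, 6953, 14326, 16212, 18282],
    [15675, 10710, 13874, 6266, 9819, 19701],
    [23457, 14169, 4431, 10209, 9386, 10973],
    [29548, 25267, 11802, 13160, 15979, 20406],
    [46815, 39123, 42399, 25012, 26830, 23304],
]
FLOWS_1976 = [
    [14473, 14327, 6077, 11689, 10618, 9897],
    [14833, 36258, 13289, 17391, 20899, 21869],
    [8330, 17764, 25113, 10489, 18171, 29220],
    [11315, 16498, 8935, 10537, 10762, 12519],
    [11875, 24370, 19151, 12312, 16724, 22591],
    [16582, 32336, 52415, 22264, 28182, 27810],
]


def total_sum_coding(flows):
    """Return T, O, D and OD under total sum reference coding."""
    m = len(flows)
    T = sum(sum(row) for row in flows)
    O = [sum(row) / T for row in flows]
    D = [sum(row[j] for row in flows) / T for j in range(m)]
    OD = [[flows[i][j] / (T * O[i] * D[j]) for j in range(m)] for i in range(m)]
    return T, O, D, OD


T73, O73, D73, OD73 = total_sum_coding(FLOWS_1973)
T76, O76, D76, OD76 = total_sum_coding(FLOWS_1976)

# Ratios of 1976 to 1973 components; only the T ratio reflects the change in level.
ratio_T = T76 / T73
ratio_O = [o76 / o73 for o76, o73 in zip(O76, O73)]
ratio_D = [d76 / d73 for d76, d73 in zip(D76, D73)]
ratio_OD = [[OD76[i][j] / OD73[i][j] for j in range(6)] for i in range(6)]
print(round(ratio_T, 3), round(ratio_OD[0][0], 3), round(ratio_OD[2][2], 3))
```

Because the O, D and OD components are shares and ratios, the fall in the total number of migrations between 1973 and 1976 shows up only in `ratio_T`.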
Another form of the saturated log-linear model, which is an alternative to the multiplicative component model, is the linear additive model. Whether using the linear additive or the multiplicative form of the saturated log-linear model, the parameters represent the spatial structure of migration (Rogers, Willekens, Little et al. 2002) and each flow value can be fully reproduced by the parameters.
Because the multiplicative formulation is formally equivalent to the gravity model (Willekens 1983), it is considered to be more appropriate than the linear additive model for representing spatial migration structures. On the other hand, the linear additive form is often found in statistics and when a standard statistical package (e.g., SPSS, Stata, R) is used to estimate a log-linear model, the parameters are always reported in the linear additive form. For that reason, the conventional calculations and interpretations of the parameters in the linear additive model are described in this sub-section.
The additive formulation is a linear function of logarithms and it makes evident why the model came to be called the log-linear model (Knoke and Burke 1980). It is mathematically equivalent to the multiplicative component model and it results from taking logarithms of both sides of Equation 1 as follows:

$$\ln(n_{ij}) = \ln(T) + \ln(O_i) + \ln(D_j) + \ln(OD_{ij})$$

which can be expressed more concisely as:

$$\ln(n_{ij}) = \lambda + \lambda_i^{O} + \lambda_j^{D} + \lambda_{ij}^{OD} \qquad (2)$$
The λ values are simply the natural logarithms of the parameters appearing in Equation 1. The O, D, and OD superscripts are parameter descriptors (not exponents) and the subscripts i and j refer to the categories of the origin and destination variables, respectively.
Applying natural logarithmic transformations to the parameters in Table 2 and Table 3 would result in sets of corresponding linear additive parameters. However, just as there are at least two decompositions of the multiplicative component model, i.e., geometric mean effect coding and total sum reference coding, there are multiple strategies for arriving at sets of parameters that satisfy the linear additive model (Powers and Xie 2008), and the approaches taken by the standard statistical packages are not simply logarithmic transformations of the multiplicative components derived earlier.
Recall that a migration system with m regions has m×m linearly independent parameters. The multiplicative component models described above give an interpretable value for 1+m+m+(m×m) parameters, though they are not linearly independent of each other. On the other hand, statistical routines in SPSS, Stata, and R calculate and report only linearly independent parameters, resulting in 1 value for $\lambda$, m-1 values for $\lambda_i^{O}$, m-1 values for $\lambda_j^{D}$, and (m-1)×(m-1) values for $\lambda_{ij}^{OD}$.
The particular set of parameter values that is calculated and reported depends on the contrast coding scheme used by the software. Contrast coding blocks out one region by fixing all linear additive parameters for that region equal to 0. SPSS, for example, fixes the parameters for the last region, i.e., the region assigned the highest numeric value, m, in this case:

$$\lambda_m^{O} = \lambda_m^{D} = \lambda_{im}^{OD} = \lambda_{mj}^{OD} = 0 \quad \text{for all } i \text{ and } j$$

The parameters of the Netherlands data reported by SPSS are displayed in Table 5. The SPSS commands that generate these results for the 1973 migration table, along with the SPSS output, are presented in Appendix 1. Table 5, with the Excel formulae for calculating the parameters, is available in the Contrast coding sheet of the accompanying workbook.
Table 5 Additive linear parameters using "last region" contrast coding

A. 1973 Migration table

| Origin \ Destination | 1 | 2 | 3 | 4 | 5 | 6 | Total |
|---|---|---|---|---|---|---|---|
| 1 | 0.288 | -0.284 | -1.388 | 0.076 | -0.289 | 0.000 | -0.212 |
| 2 | -0.384 | -0.109 | -1.565 | -0.315 | -0.261 | 0.000 | -0.243 |
| 3 | -0.926 | -1.128 | -0.949 | -1.216 | -0.837 | 0.000 | -0.168 |
| 4 | 0.062 | -0.262 | -1.505 | -0.143 | -0.297 | 0.000 | -0.753 |
| 5 | -0.327 | -0.304 | -1.146 | -0.509 | -0.385 | 0.000 | -0.133 |
| 6 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Total | 0.698 | 0.518 | 0.598 | 0.071 | 0.141 | 0.000 | 10.056 |

B. 1976 Migration table

| Origin \ Destination | 1 | 2 | 3 | 4 | 5 | 6 | Total |
|---|---|---|---|---|---|---|---|
| 1 | 0.897 | 0.219 | -1.122 | 0.389 | 0.057 | 0.000 | -1.033 |
| 2 | 0.129 | 0.355 | -1.132 | -0.007 | -0.059 | 0.000 | -0.240 |
| 3 | -0.738 | -0.648 | -0.785 | -0.802 | -0.488 | 0.000 | 0.049 |
| 4 | 0.416 | 0.125 | -0.971 | 0.050 | -0.165 | 0.000 | -0.798 |
| 5 | -0.126 | -0.075 | -0.799 | -0.385 | -0.314 | 0.000 | -0.208 |
| 6 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Total | -0.517 | 0.151 | 0.634 | -0.222 | 0.013 | 0.000 | 10.233 |
Notice the parameters for the last region are equal to 0, and, therefore, make no contribution to Equation 2. Interpretation of the parameters in Table 5 is somewhat complicated since they are in logarithmic units. Conversion back to the multiplicative components by exponentiation gives yet another set of multiplicative components that satisfy Equation 1. These are presented in Table 6, and they are the multiplicative components associated with “last region” contrast coding. Generally, these are not used to describe the spatial structure of migration, but they are useful in describing migration systems because the interaction parameters, ODij, are equivalent to odds ratios.
Table 6 Multiplicative components using "last region" contrast coding

A. 1973 Migration table

| Origin \ Destination | 1 | 2 | 3 | 4 | 5 | 6 | Total |
|---|---|---|---|---|---|---|---|
| 1 | 1.333 | 0.753 | 0.250 | 1.079 | 0.749 | 1.000 | 0.809 |
| 2 | 0.681 | 0.897 | 0.209 | 0.730 | 0.770 | 1.000 | 0.785 |
| 3 | 0.396 | 0.324 | 0.387 | 0.296 | 0.433 | 1.000 | 0.845 |
| 4 | 1.064 | 0.769 | 0.222 | 0.867 | 0.743 | 1.000 | 0.471 |
| 5 | 0.721 | 0.738 | 0.318 | 0.601 | 0.680 | 1.000 | 0.876 |
| 6 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Total | 2.009 | 1.679 | 1.819 | 1.073 | 1.151 | 1.000 | 23,304 |

B. 1976 Migration table

| Origin \ Destination | 1 | 2 | 3 | 4 | 5 | 6 | Total |
|---|---|---|---|---|---|---|---|
| 1 | 2.453 | 1.245 | 0.326 | 1.475 | 1.059 | 1.000 | 0.356 |
| 2 | 1.138 | 1.426 | 0.322 | 0.993 | 0.943 | 1.000 | 0.786 |
| 3 | 0.478 | 0.523 | 0.456 | 0.448 | 0.614 | 1.000 | 1.051 |
| 4 | 1.516 | 1.133 | 0.379 | 1.051 | 0.848 | 1.000 | 0.450 |
| 5 | 0.882 | 0.928 | 0.450 | 0.681 | 0.731 | 1.000 | 0.812 |
| 6 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Total | 0.596 | 1.163 | 1.885 | 0.801 | 1.013 | 1.000 | 27,810 |
For example, the overall parameter from the 1973-migration data reported in Table 5, λT, gives the natural logarithm of the observed migrations for the reference region:
$$\ln(n_{66}) = 10.056, \quad \text{so that} \quad n_{66} = e^{10.056},$$
which corresponds to the value of 23,304 reported in the corner cell of Table 6.
Another illustration from the 1973-migration table in Table 5 shows how the origin main effects, $\lambda_i^{O}$, are added to the overall parameter to reproduce the migrations from Category 1 to the reference destination, Category 6, reported in Table 1. For example:
$$\ln(n_{16}) = 10.056 + (-0.212) = 9.844$$
Using the same approach, all the observed flows can be reproduced by applying Equation 1 with the appropriate parameters from Table 6, or the logarithms of the flows can be reproduced by applying Equation 2 using the parameters in Table 5.
The association parameters in the linear form, $\lambda_{ij}^{OD}$, are logged odds ratios (LORs), each of which is the logarithm of the ratio of two odds: 1) the odds of migration to region j rather than the reference region, conditional on originating in region i; and 2) the odds of migration to region j rather than the reference region, conditional on originating in the reference region. For example, from the 1973 sub-matrix in Table 5, $\lambda_{23}^{OD}$ = -1.565, which is calculated as:
$$\lambda_{23}^{OD} = \ln\left(\frac{n_{23}/n_{26}}{n_{63}/n_{66}}\right) = -1.565$$
In words, the parameter is described as the logged ratio of the odds of migration to Category 3, rather than to Category 6, between a migrant originating in Category 2 and one originating in Category 6.
Odds ratios measure the relative likelihood of one outcome to another, and because they are more standard than LORs, it may be easier to exponentiate the LORs and interpret the association parameters, presented in Table 6, as odds ratios. For example, the model parameter OD23, for the 1973 data, is calculated as:
$$OD_{23} = e^{\lambda_{23}^{OD}} = e^{-1.565} = 0.209$$
In words, the odds that a migrant from Category 2 will choose Category 3 over Category 6 are approximately one-fifth of the odds that a migrant from Category 6 will choose Category 3 over Category 6. Odds ratios are always positive and always depend on the choice of reference category. An odds ratio equal to 1 means a null relationship, i.e., statistical independence. Values higher than 1 mean a positive association and values less than 1 indicate a negative association.
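The "last region" decomposition described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a saturated model in which the overall effect equals the flow within the reference region and the interaction effects are the odds ratios defined in the text. The function name `last_region_components` and the 3-region flow table are hypothetical and are not taken from the Netherlands data.

```python
import math

def last_region_components(flows):
    """Multiplicative components under "last region" contrast coding.

    flows is a square list of lists of positive counts; the last row and
    column play the role of the reference region.
    """
    m = len(flows)
    ref = m - 1
    n_ref = flows[ref][ref]
    T = n_ref                                        # overall effect
    O = [flows[i][ref] / n_ref for i in range(m)]    # origin main effects
    D = [flows[ref][j] / n_ref for j in range(m)]    # destination main effects
    # Interaction effects: odds ratios relative to the reference region.
    OD = [[(flows[i][j] * n_ref) / (flows[i][ref] * flows[ref][j])
           for j in range(m)] for i in range(m)]
    return T, O, D, OD

# Hypothetical 3-region flow table (illustrative values only).
flows = [[50, 10, 20],
         [12, 80, 16],
         [25, 30, 100]]

T, O, D, OD = last_region_components(flows)

# Multiplying the components back together recovers every flow.
rebuilt = [[T * O[i] * D[j] * OD[i][j] for j in range(3)] for i in range(3)]

# Taking logarithms gives the linear additive interaction parameters (LORs).
log_od = [[math.log(x) for x in row] for row in OD]
```

All components for the reference region equal 1 (0 in logarithmic units), which is the pattern visible in the last row and column of Tables 5 and 6.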
Stata and R use a different contrast coding scheme from SPSS. Both packages use “first region” contrast coding rather than the “last region” contrast coding used by SPSS. In these two programs, the parameters for the first region, i.e., the region assigned the lowest numeric value, are fixed to be equal to 0, i.e.,
$$\lambda_1^{O} = \lambda_1^{D} = \lambda_{i1}^{OD} = \lambda_{1j}^{OD} = 0$$
The Stata and R commands for generating the linear additive parameters, as well as the corresponding output, for the 1973 migration data can be downloaded from Appendix 1.
All forms of the saturated model and all statistical methods for estimating the interaction parameters are in agreement and provide substantively similar results. The formulae for the calculations of the parameters are available in the Linear Additive Parameters sheet of the accompanying workbook. Furthermore, tests that each linear additive interaction parameter is equal to 0 are done automatically by SPSS and Stata. These results are available from Appendix 1 and they show that each non-redundant interaction parameter is statistically significant. See Agresti and Finlay (2009) and Powers and Xie (2008) for descriptions of the standard errors of the estimates.
All the models presented to this point have been saturated, and, therefore, perfectly represent the observed flows. Generally, the substantively interesting parameters are the interaction parameters because they indicate associations between pairs of regions. The independence model, however, hypothesizes that the interaction parameters are uninteresting and unnecessary because all multiplicative interaction parameters, ODij, are equal to 1, or, equivalently, all linear additive interaction parameters, $\lambda_{ij}^{OD}$, are equal to 0. The independence model implies that the interaction terms should fall out of the model, reducing it to the most parsimonious form of a two-variable model, i.e., $n_{ij} = T \times O_i \times D_j$, or, equivalently, $\ln(n_{ij}) = \lambda + \lambda_i^{O} + \lambda_j^{D}$.
Visual inspection of the interaction parameters in the saturated log-linear model is one strategy for investigating the independence hypothesis. Another method is to calculate row or column conditional distributions. If the conditional distributions within rows (origins) are identical, there is independence between origins and destinations. In addition, since independence is a symmetric property, if the conditional distributions within rows (origins) are identical, the distributions within columns (destinations) also will be identical (Agresti and Finlay 2009; Powers and Xie 2008). In the Independence sheet of the accompanying workbook, the percentages of the Netherlands migrations within columns (destinations) are calculated. The column percentages are quite varied, suggesting, like the interaction parameters, that statistical independence is unfounded in this example.
The independence hypothesis implies that each particular inter-regional flow can be determined by the sizes of the marginal flows. Let Nij be the expected flow between regions i and j if the independence hypothesis is true. Nij is then equal to the total number of flows in the migration system, n++, multiplied by the proportion of all migrants leaving from region i, ni+/n++, times the proportion of all migrants moving to region j, n+j/n++, i.e., Nij = n++(ni+/n++)(n+j/n++). If independence can be assumed, a good estimate of an inter-regional flow is Nij, and the problem of estimating inter-regional migration flows is greatly simplified.
The differences between the observed flows, nij, and the expected flows, Nij, form the basis of the goodness-of-fit evaluation and the Pearson Chi-Squared Statistic, denoted Χ2, which is widely used to summarize these discrepancies. It is calculated as:
$$X^2 = \sum_{i}\sum_{j} \frac{(n_{ij} - N_{ij})^2}{N_{ij}}$$
where the summation is taken over all internal cells in the migration matrix. When there is perfect agreement between the observed and the expected flows, over all cells, the Χ2 equals 0 indicating the independence model fits the data perfectly. Larger differences between nij and Nij produce larger Χ2 values and increasingly stronger evidence that the independence model is inadequate. In general, smaller values indicate a good fit and larger values a poor fit.
If the independence hypothesis is true, the Χ2 statistic is governed by the Χ2 probability distribution with (m-1)×(m-1) degrees of freedom. This distribution provides the basis for testing the significance of the Χ2 statistic (Agresti 2007; Agresti and Finlay 2009). If the Χ2 statistic falls in the right-sided extremes of its distribution, it signifies a low probability, e.g., p<0.05, that the independence hypothesis is true, and the model is rejected. The Χ2 values associated with the independence model applied to the Netherlands data in Table 1 are calculated and reported in the Independence sheet of the accompanying workbook. See Appendix 2 for the SPSS, Stata and R commands for testing the independence model with the 1973 example data.
The Χ2 value associated with the 1973 example data is 47,623, and the degrees of freedom (df) are 25. The associated p-value is less than 0.001, and the hypothesis of independence is rejected. (However, see the comments below about the limitations of this test when the sample size is large.) This is not surprising given the three multiplicative decompositions of the Netherlands data, presented in Table 2, Table 3 and Table 6. The evidence consistently shows strong associations between regions, and many of the multiplicative association parameters are not close to 1. Furthermore, the standard errors reported in Appendix 1 by SPSS and Stata indicate the linear additive interaction parameters are significantly different from 0.
One alternative to the Χ2 statistic is called either the likelihood ratio statistic, the deviance, or the G2 statistic. All are different names for the same test statistic, and which name is used is determined by the preferences of authors of textbooks and software packages. For simplicity, G2 will be adopted here. It is similar to the Χ2 in that values close to 0 indicate a well-fitting model and large values indicate a poor fit. If the hypothesized independence model holds, the G2 statistic also has a Χ2 distribution.
The G2 statistic has general utility that goes well beyond the independence model in log-linear analysis. It is widely used for comparing a simpler model to a more complex model. The G2 statistic is derived from the ratio of two likelihoods: 1) the likelihood that the constrained model (here the model of independence) fits the data; and 2) the likelihood that the unconstrained model (here the saturated model) fits the data. If the ratio is close to 1, the simpler, constrained, and more parsimonious model is preferred because it represents the data as well as the more complex model does.
The ratio of the two likelihoods does not have a Χ2 distribution. However, when the ratio is transformed into natural logarithm units and multiplied by -2, it becomes G2, which is a Χ2 distributed variable with (m-1)×(m-1) degrees of freedom. If Lc is the likelihood associated with the constrained (i.e., independence) model, and Lu is the likelihood under the unconstrained (i.e., saturated) model, then G2 is calculated as:
$$G^2 = -2\ln\left(\frac{L_c}{L_u}\right) = -2\,(\ln L_c - \ln L_u)$$
Because the saturated model fits the data perfectly, i.e., Lu = 1, G2 = –2ln Lc. The values, based on the example and the statistical software, are reported in Appendix 2. The value is reported to be 46,477.63 and it is called “Deviance” by SPSS and Stata. It is rounded and reported to be equal to 46,480 by R, where it is called “Residual Deviance.” With 25 degrees of freedom the probability that the independence model holds is effectively 0.
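The two fit statistics can be sketched as follows. This is a minimal illustration under stated assumptions: the expected flows come from the independence formula Nij = n++(ni+/n++)(n+j/n++), and G2 is computed with the standard closed form for Poisson counts, 2Σ nij ln(nij/Nij), which is an equivalent way of writing the likelihood ratio statistic rather than the expression used by any particular package. The function name `fit_statistics` and the 2×2 table are hypothetical.

```python
import math

def fit_statistics(flows):
    """Return (Pearson X2, likelihood ratio G2, df) for the independence model."""
    m = len(flows)
    row_tot = [sum(row) for row in flows]
    col_tot = [sum(flows[i][j] for i in range(m)) for j in range(m)]
    total = sum(row_tot)
    x2 = 0.0
    g2 = 0.0
    for i in range(m):
        for j in range(m):
            observed = flows[i][j]
            expected = row_tot[i] * col_tot[j] / total   # flow under independence
            x2 += (observed - expected) ** 2 / expected
            if observed > 0:      # a zero count contributes nothing to G2
                g2 += 2.0 * observed * math.log(observed / expected)
    df = (m - 1) * (m - 1)
    return x2, g2, df

# Hypothetical 2-region table (illustrative values only).
flows = [[10, 20],
         [30, 40]]
x2, g2, df = fit_statistics(flows)
```

For this small table the two statistics are close to each other (about 0.79 and 0.80 on 1 degree of freedom), in line with their asymptotic equivalence; with flow counts in the thousands or millions, both become very large even for modest departures from independence.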
The Χ2 and the G2 statistics are asymptotically equivalent (Powers and Xie 2008) and they form the bases of the Pearson Chi-square and the likelihood ratio tests, respectively. As with all inferential tests, effective use requires attention to underlying assumptions as well as limitations. Both tests rely on the assumption that each inter-regional flow count in the migration table follows an independent Poisson distribution (Powers and Xie 2008), and both tests have important limitations related to sample size. The Χ2 statistic is inflated by large samples. Therefore, the Pearson Chi-square test is not appropriate when the sample size is large. The G2 statistic and the likelihood ratio test are preferred in this situation (Powers and Xie 2008). The Pearson Chi-square test is preferred when the expected frequencies average between 1 and 10, but neither statistic works well if most of the expected frequencies are less than 5 (Agresti and Finlay 2009; Powers and Xie 2008).
Criticism has been made of the G2 statistic as well when samples are large (Raftery 1986, 1995), and there is growing consensus that information measures should be considered along with traditional significance tests in assessing model fit. The Bayesian Information Criterion (BIC) is closely related to G2, and it is calculated by Stata as: BIC = G2 − df ln(m×m), and by SPSS as:
$$BIC = -2\ln L + p\,\ln(m \times m)$$
where L is the maximized likelihood of the independence model as reported by the software and p is the number of parameters estimated in the independence model, i.e., 2m-1. A low value suggests choosing the independence model over the saturated model (Powers and Xie 2008).
Akaike’s Information Criterion (AIC) is another alternative that takes on smaller values for better fitting models, since it judges how close the fitted values are to the expected values (Agresti 2007). In SPSS and R, it is calculated as:
$$AIC = -2\ln L + 2p$$
where p is the number of parameters estimated in the independence model, i.e., 2m-1. In Stata, it is calculated as:
$$AIC = \frac{-2\ln L + 2p}{m \times m}$$
As shown in Appendix 2, SPSS and Stata report both the BIC and the AIC, whereas R reports only a rounded AIC. As previously stated, the formulae used differ between packages. The BIC reported by SPSS equals 46,934.237, and the BIC reported by Stata equals 46,388.04. R reports an AIC of 46,920, which is the rounded form of the value reported by SPSS, 46,916.818. Stata’s AIC value is substantially smaller, at 1,303.245. All reported BIC and AIC values are large and add to the evidence against the independence model for this example.
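The reported information measures can be cross-checked against one another with a short calculation. The sketch below assumes that the SPSS measures take the forms −2 ln L + 2p (AIC) and −2 ln L + p ln(m×m) (BIC), and that Stata divides the AIC by the number of cells, m×m. These forms are inferred from the reported values and the Stata BIC expression quoted in the text; they are not quoted from the software documentation, and the variable names are illustrative.

```python
import math

m = 6                        # regions in the Netherlands example
cells = m * m                # 36 cells in the flow table
p = 2 * m - 1                # parameters in the independence model (11)
df = (m - 1) * (m - 1)       # residual degrees of freedom (25)

g2 = 46477.63                # deviance reported by SPSS and Stata
aic_spss = 46916.818         # AIC reported by SPSS

# Assumed forms, inferred from the reported values (see lead-in).
neg2_loglik = aic_spss - 2 * p                  # back out -2 ln L from the SPSS AIC
bic_spss = neg2_loglik + p * math.log(cells)    # -2 ln L + p ln(m x m)
bic_stata = g2 - df * math.log(cells)           # G2 - df ln(m x m)
aic_stata = aic_spss / cells                    # (-2 ln L + 2p) / (m x m)

print(round(bic_spss, 3), round(bic_stata, 2), round(aic_stata, 3))
```

Under these assumptions the three derived quantities agree with the values reported in the text for the SPSS BIC, the Stata BIC and the Stata AIC, which suggests that the packages differ in scaling and reference point rather than in the underlying fit.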
The independence model rarely provides an adequate fit to migration data. This is due, in part, to the overwhelming tendency to continue to reside in the same region. The quasi-independence model allows these “immobility” effects (Powers and Xie 2008) to be removed from the model, and this often results in improved predictions of inter-regional migration flows. The quasi-independence model has been applied effectively to migration data obtained from national censuses (Agresti 1990; Rogers, Little and Raymer 2010; Rogers, Willekens, Little et al. 2002), where persons who reported living in the same region at the time of the census as at the beginning of the reference period are represented in the diagonal elements of a migration table.
To illustrate, United States native-born migration data between 1985 and 1990 are reported in Panel A of Table 7. Clearly, the flows reported in the four diagonal elements of the interior sub-matrix are substantially larger than the off-diagonal elements, indicating that the propensity to maintain residence in the same region is much more typical than migration between regions.
The clustering along the diagonal cells contributes significantly to the poor fit of the independence model, and the dominating influence of the persons remaining in the region of origin has caused researchers to favour omitting them from the model. If migrants are defined as people changing their region of residence, this type of flow matrix is sometimes called a “migrants only” matrix. It is particularly useful for studying migration structure since it eliminates people who made no move or moved within the same region. Panel B of Table 7 displays the flow table with the diagonal elements set to 0, and the marginal totals adjusted accordingly.
Table 7 United States native-born migration flows, 1985-1990

A. Full migration table

| Origin \ Destination | Northeast | Midwest | South | West | Total |
|---|---|---|---|---|---|
| Northeast | 40,262,319 | 336,091 | 1,645,843 | 479,819 | 42,724,072 |
| Midwest | 351,029 | 50,677,007 | 1,692,687 | 958,696 | 53,679,419 |
| South | 778,868 | 1,197,134 | 69,563,871 | 1,150,649 | 72,690,522 |
| West | 348,892 | 668,979 | 1,082,104 | 37,872,893 | 39,972,868 |
| Total | 41,741,108 | 52,879,211 | 73,984,505 | 40,462,057 | 209,066,881 |

B. Migrants-only table

| Origin \ Destination | Northeast | Midwest | South | West | Total |
|---|---|---|---|---|---|
| Northeast | 0 | 336,091 | 1,645,843 | 479,819 | 2,461,753 |
| Midwest | 351,029 | 0 | 1,692,687 | 958,696 | 3,002,412 |
| South | 778,868 | 1,197,134 | 0 | 1,150,649 | 3,126,651 |
| West | 348,892 | 668,979 | 1,082,104 | 0 | 2,099,975 |
| Total | 1,478,789 | 2,202,204 | 4,420,634 | 2,589,164 | 10,690,791 |
The multiplicative components, using total sum reference coding, for the full migration table and the migrants-only table are reported in Table 8. The magnitudes of the multiplicative components for the full data clearly depart from what is expected under the hypothesis of independence: the interaction components on the diagonal are substantially above 1.0 and the off-diagonal components are far below 1.0. In comparison, the diagonal components for the migrants-only table are constrained to be equal to 0 in order to reproduce the structural zeros on the diagonal, and, as a result, the off-diagonal components are closer to 1.0.
Table 8 Multiplicative components* of United States native-born migration flows, 1985-1990

A. Full migration table

| Origin \ Destination | Northeast | Midwest | South | West | Total |
|---|---|---|---|---|---|
| Northeast | 4.720 | 0.031 | 0.109 | 0.058 | 0.204 |
| Midwest | 0.033 | 3.733 | 0.089 | 0.092 | 0.257 |
| South | 0.054 | 0.065 | 2.704 | 0.082 | 0.348 |
| West | 0.044 | 0.066 | 0.076 | 4.896 | 0.191 |
| Total | 0.200 | 0.253 | 0.354 | 0.194 | 209,066,881 |

B. Migrants-only table

| Origin \ Destination | Northeast | Midwest | South | West | Total |
|---|---|---|---|---|---|
| Northeast | 0.000 | 0.663 | 1.617 | 0.805 | 0.230 |
| Midwest | 0.845 | 0.000 | 1.363 | 1.318 | 0.281 |
| South | 1.801 | 1.859 | 0.000 | 1.520 | 0.292 |
| West | 1.201 | 1.547 | 1.246 | 0.000 | 0.196 |
| Total | 0.138 | 0.206 | 0.413 | 0.242 | 10,690,791 |

*Total sum reference coding
The quasi-independence model requires that only migrations between different regions satisfy the independence assumption. It can be estimated in two different but equivalent ways. The first method takes the full migration table data, as in Panel A of Table 7, and fixes the weights on the interaction effects, ODij, to be zero when the regions of origin and destination are the same, i.e., i=j, ensuring that nij=0. These are called structural zeros. When the origin and destination regions are different, i.e., i≠j, the interaction effects are fixed at 1.0, which is the familiar independence model and gives the predicted off-diagonal flows under the quasi-independence hypothesis. Implementation of this method in SPSS, Stata and R is illustrated in Appendix 3 (available on the Tools for Demographic Estimation website).
The second method does not use the full migration data, but uses the migrants-only data as in Panel B of Table 7. It is best presented with the additive form: $\ln(n_{ij}) = \lambda + \lambda_i^{O} + \lambda_j^{D} + \delta_i I(i=j)$, where I is an indicator variable taking on values of 1 for the diagonal flows, i.e., when i=j, and values of 0 for the off-diagonal flows, i.e., when i≠j (Agresti 2002). Therefore, an extra parameter, $\delta_i$, is necessary to estimate each diagonal flow, and for the other inter-regional flows the $\delta_i I(i=j)$ term falls out and the quasi-independence model reduces to the independence model. Consequently, just like the independence model, the off-diagonal interaction terms are constrained to be equal to 0 in the additive form of the model (and equal to 1 in the multiplicative form). Application of this method in Stata is illustrated in Appendix 3.
In the first method, the quasi-independence model fixes m parameters, ODii, for i = 1 to m, to be equal to 0. In the second method, m additional parameters, $\delta_i$ (one for each diagonal cell), are estimated, and when exponentiated they will be very close to 0. Using either method, the quasi-independence model has m more parameters than the full independence model and the degrees of freedom are reduced by m.
Appendix 3 (available on the Tools for Demographic Estimation website) illustrates how the quasi-independence model is estimated with the statistical software packages SPSS, Stata and R, using the United States native-born migration flow data, 1985-1990. When the independence model is estimated with the full data, all goodness-of-fit indicators are, as expected, extremely large: Χ2 = 544,479,395 (df = 9); G2 = 461,411,576 (df = 9); Stata values for BIC and AIC are 461,000,000 and 28,800,000, respectively. When the quasi-independence model is estimated, all values are reduced substantially: Χ2 = 327,233 (df = 5); G2 = 330,220 (df = 5); Stata values for BIC and AIC equal 330,207 and 27,535, respectively.
The inferential tests remain significant, and the quasi-independence model must be rejected as the true migration model. The independence and the quasi-independence models should not be compared, inferentially, with the likelihood ratio test because they are not nested models. However, the information measures may be compared directly. Both the BIC and AIC are reduced substantially, favouring the quasi-independence model over the independence model.
In addition, the predicted flows from the independence model can be contrasted with those from the quasi-independence model in Table 9. Visually comparing the predicted flows in Table 9 with the observed data in Table 7 reveals how much closer the quasi-independence model comes to representing the data. Two additional summary statistics are reported: R2 and Mean Absolute Percent Error (MAPE). A comparison of the R2 values shows the independence model explains 10% of the variation in the observed data and the quasi-independence model explains 95%. Furthermore, the average percent error for the quasi-independence model (MAPE=28) is dramatically reduced in comparison to the independence model (MAPE=2,492).
Since the fit of the quasi-independence model is not close enough to the observed data, it must be rejected as the “true” model. However, without observed migration data, the quasi-independence model may still offer a reasonable, but coarse, method for estimating inter-regional flows.
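One way to see where the quasi-independence predictions in Panel B of Table 9 come from is iterative proportional fitting (IPF), applied to a seed table with zeros on the diagonal and ones elsewhere. This is a swapped-in technique used here for illustration, not the log-linear routines of Appendix 3; for this model the two approaches should give the same fitted flows. The function name `ipf` and its defaults are hypothetical; the observed flows are those of Panel B of Table 7.

```python
def ipf(seed, row_targets, col_targets, tol=1e-6, max_iter=1000):
    """Scale rows and columns of seed in turn until both margins match."""
    m = len(seed)
    fit = [row[:] for row in seed]
    for _ in range(max_iter):
        for i in range(m):                       # match the row totals
            s = sum(fit[i])
            if s > 0:
                fit[i] = [v * row_targets[i] / s for v in fit[i]]
        for j in range(m):                       # match the column totals
            s = sum(fit[i][j] for i in range(m))
            if s > 0:
                for i in range(m):
                    fit[i][j] *= col_targets[j] / s
        gap = max(abs(sum(fit[i]) - row_targets[i]) for i in range(m))
        if gap < tol:
            break
    return fit

# Migrants-only flows, United States 1985-1990 (Table 7, Panel B).
observed = [[0, 336091, 1645843, 479819],
            [351029, 0, 1692687, 958696],
            [778868, 1197134, 0, 1150649],
            [348892, 668979, 1082104, 0]]
m = len(observed)
row_targets = [sum(row) for row in observed]
col_targets = [sum(observed[i][j] for i in range(m)) for j in range(m)]

# Structural zeros on the diagonal, no association off the diagonal.
seed = [[0.0 if i == j else 1.0 for j in range(m)] for i in range(m)]
quasi = ipf(seed, row_targets, col_targets)
```

The fitted table keeps the diagonal at zero and reproduces the migrants-only row and column totals, so any remaining gap between `quasi` and `observed` reflects association between particular origin and destination pairs.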
Table 9 Predicted United States native-born migration flows, 1985-1990, under independence and quasi-independence

A. Independence

| Origin \ Destination | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1 | 8,530,046 | 10,806,184 | 15,119,178 | 8,268,664 |
| 2 | 10,717,328 | 13,577,116 | 18,996,052 | 10,388,923 |
| 3 | 14,512,977 | 18,385,588 | 25,723,693 | 14,068,264 |
| 4 | 7,980,756 | 10,110,323 | 14,145,583 | 7,736,206 |

R2 = 0.104; MAPE = 2492.322

B. Quasi-independence

| Origin \ Destination | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1 | 0 | 535,839 | 1,349,561 | 576,353 |
| 2 | 442,768 | 0 | 1,793,640 | 766,005 |
| 3 | 720,681 | 1,159,163 | 0 | 1,246,806 |
| 4 | 315,340 | 507,201 | 1,277,434 | 0 |

R2 = 0.945; MAPE = 27.575
The validity of the independence and quasi-independence models can be evaluated with the inferential test statistics that accompany the log-linear model output, and, even when the models are not supported with significance tests, these models may be applied, in some contexts, to produce meaningful estimates of migration flows. The method of offsets assumes the auxiliary data have an implied structure of inter-regional associations that resembles the unknown migration structure. The method of offsets borrows the structure of the auxiliary data to derive the estimates of the missing migration flow data.
In past research, the auxiliary information has typically been a table of migration flows from another period in history (Rogers, Little and Raymer 2010; Rogers, Willekens, Little et al. 2002; Rogers, Willekens and Raymer 2003; Willekens 1983), but it could be from another age group (Raymer and Rogers 2007), or from another sex or race group. It could also be from another data source altogether, such as tax return data or motor vehicle registration data.
Given the auxiliary flow data, $n_{ij}^{*}$, the log-linear-with-offsets model is specified as:
$$\ln(\hat{n}_{ij}) = \lambda + \lambda_i^{O} + \lambda_j^{D} + \ln(n_{ij}^{*})$$
This model will estimate flows, $\hat{n}_{ij}$, that have a migration structure that comes as close as possible to that of the auxiliary flow data, and, at the same time, the estimated flows are adjusted to sum to the marginal totals pre-specified by the researcher. In this way, the method of offsets is similar to the independence and quasi-independence models in that it provides an expected distribution of the flows such that the marginal row and column totals are equal to the a priori estimates.
To illustrate the workings of the method of offsets, consider the Netherlands 1976 migration flow matrix in Table 1. Suppose we wish to keep the numerical values of the row and column marginal totals, but, at the same time, wish to replace the migration interaction effects observed during that year by those observed during 1973, using the method of offsets. What would be the corresponding set of log-linear parameters? Table 10 sets out the predicted flow matrix obtained by the method of offsets in Panel A, and Panel B presents the associated multiplicative components derived using the total sum reference coding. Note that the T, Oi and Dj values of the predicted matrix, i.e., Panel B of Table 10, are identical to those reported for the observed 1976 flow matrix in Panel B of Table 3. However, the other terms (i.e., the interaction effects, ODij) reflect the influence of the migration structure of the observed 1973 data, Panel A of Table 3, as well as the row and column totals taken from the 1976 data. Therefore, the method of offsets applies the structure of the auxiliary data, the 1973 data in this case, to the interior flows, and at the same time, preserves the total number of flows observed in the 1976 data.
Table 10 Inter-regional migration flows in the Netherlands (1976), predicted with the method of offsets from the marginal totals (1976) and the migration flow table (1973)

Panel A. Predicted using method of offsets

| Origin \ Destination | 1 | 2 | 3 | 4 | 5 | 6 | Total |
|---|---|---|---|---|---|---|---|
| 1 | 12,344 | 13,769 | 6,890 | 12,199 | 10,361 | 11,518 | 67,081 |
| 2 | 13,329 | 34,695 | 12,195 | 17,445 | 22,522 | 24,353 | 124,539 |
| 3 | 9,728 | 15,711 | 28,330 | 8,883 | 15,881 | 30,553 | 109,087 |
| 4 | 11,281 | 16,107 | 7,011 | 11,216 | 11,764 | 13,187 | 70,566 |
| 5 | 12,609 | 25,486 | 16,570 | 12,828 | 17,770 | 21,760 | 107,023 |
| 6 | 18,116 | 35,786 | 53,984 | 22,110 | 27,058 | 22,535 | 179,589 |
| Total | 77,408 | 141,553 | 124,980 | 84,682 | 105,356 | 123,906 | 657,885 |

R2 = 0.966; MAPE = 8.364

Panel B. Multiplicative components using total sum reference coding

| Origin \ Destination | 1 | 2 | 3 | 4 | 5 | 6 | Total |
|---|---|---|---|---|---|---|---|
| 1 | 1.564 | 0.954 | 0.541 | 1.413 | 0.964 | 0.912 | 0.102 |
| 2 | 0.910 | 1.295 | 0.515 | 1.088 | 1.129 | 1.038 | 0.189 |
| 3 | 0.758 | 0.669 | 1.367 | 0.633 | 0.909 | 1.487 | 0.166 |
| 4 | 1.359 | 1.061 | 0.523 | 1.235 | 1.041 | 0.992 | 0.107 |
| 5 | 1.001 | 1.107 | 0.815 | 0.931 | 1.037 | 1.080 | 0.163 |
| 6 | 0.857 | 0.926 | 1.582 | 0.956 | 0.941 | 0.666 | 0.273 |
| Total | 0.118 | 0.215 | 0.190 | 0.129 | 0.160 | 0.188 | 657,885 |
The predicted results in Panel A of Table 10 were taken from the output of the SPSS, Stata, and R commands for implementing the method of offsets found in Appendix 4. See the Method of offsets sheet in the accompanying Excel spreadsheet for other calculations.
Since the flows were observed directly in 1976, there are several ways to evaluate the suitability of the method of offsets for predicting the data. One simple method is to inspect visually the ratios of the association multiplicative components, as demonstrated in Table 4. Another method is to use the inferential tests and information measures reported by the log-linear procedures. These would be testing the hypothesis that the structure of the migration flows, i.e., the interaction parameters, did not change from 1973 to 1976. In the example reported in Table 10, the corresponding G2 statistic is equal to 5,914 (df=25), and the hypothesis that the auxiliary data represent the same migration process as the observed data must be rejected. The final method, of those suggested here, relies on the standard R2 and MAPE statistics to assess the fit between the observed and the predicted flows. These are reported in Panel A of Table 10 and are equal to 0.97 and 8.36, respectively. These statistics, as well as the ratios in Table 4, suggest this application of the method of offsets offers a set of estimates for the migration flows in 1976 that may be quite satisfactory.
The importance placed on the goodness-of-fit statistics depends on the quality of the observed flows used as inputs to the method of offsets. If the method is to be useful in a practical situation, it must be applicable when the inter-regional flows are not directly observed. In the absence of flow data, the method would still require pre-estimates of the marginal totals. Furthermore, if the method is implemented as illustrated in Appendix 4 (available on the Tools for Demographic Estimation website), initial estimates of the inter-regional flows are required. Therefore, the pre-estimates of the row and column totals would need to be distributed across the internal cells of the flow matrix so they add up to the respective marginal totals. Table 11, Panel A, presents a typical scenario, albeit continuing to use the marginal totals from the Netherlands 1976 data, which were observed. A simple solution is to distribute the flows according to the independence model, i.e., Nij = n++(ni+/n++)(n+j/n++), which results in the initial estimates of the flows displayed in Panel B of Table 11.
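The independence distribution scheme can be reproduced directly from the marginal totals. The short sketch below uses the Netherlands 1976 row and column totals shown in Panel A of Table 11; the variable names are illustrative.

```python
# Marginal totals from Table 11, Panel A (Netherlands, 1976).
row_totals = [67081, 124539, 109087, 70566, 107023, 179589]   # origins 1 to 6
col_totals = [77408, 141553, 124980, 84682, 105356, 123906]   # destinations 1 to 6
grand_total = sum(row_totals)                                  # 657,885

# Initial flow estimates under independence: N_ij = n_i+ * n_+j / n_++.
initial = [[r * c / grand_total for c in col_totals] for r in row_totals]

for row in initial:
    print([round(value) for value in row])
```

Because each cell is the product of its row and column totals divided by the grand total, the rows and columns of `initial` add up to the supplied margins by construction.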
As long as the initial inter-regional flows add up to the marginal totals, the predicted flows are not affected by the method used to distribute the flows within the cells. This is true because the flows will be predicted, ultimately, from the auxiliary data through the method of offsets, using the iterative proportional fitting algorithm (Agresti 1990; Deming and Stephan 1940). In other words, the initial estimates of the 1976 Netherlands flows, used as input to the offsets log-linear model, could be the internal cells of Table 1, Panel B, or those in Table 11, Panel B. Either set of initial estimates would yield the predicted flows that are reported in Table 10, Panel A.
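The role of iterative proportional fitting in the method of offsets can be sketched as follows: the auxiliary table is rescaled, row by row and column by column, until it matches the target margins, and its odds ratios are carried over unchanged. This is a minimal illustration, not the commands of Appendix 4. The function names `ipf` and `odds_ratio`, the auxiliary flows and the target margins are all hypothetical.

```python
def ipf(seed, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Rescale seed so its row and column sums match the target margins."""
    m = len(seed)
    fit = [row[:] for row in seed]
    for _ in range(max_iter):
        for i in range(m):                       # row adjustment
            s = sum(fit[i])
            fit[i] = [v * row_targets[i] / s for v in fit[i]]
        for j in range(m):                       # column adjustment
            s = sum(fit[i][j] for i in range(m))
            for i in range(m):
                fit[i][j] *= col_targets[j] / s
        gap = max(abs(sum(fit[i]) - row_targets[i]) for i in range(m))
        if gap < tol:
            break
    return fit

def odds_ratio(table, i, j, k, l):
    """Cross-product ratio for origins i, k and destinations j, l."""
    return (table[i][j] * table[k][l]) / (table[i][l] * table[k][j])

# Hypothetical auxiliary flows (e.g. an earlier period) and target margins.
aux = [[40.0, 10.0, 5.0],
       [8.0, 60.0, 12.0],
       [6.0, 9.0, 50.0]]
row_targets = [70.0, 90.0, 80.0]    # pre-specified origin totals
col_targets = [75.0, 95.0, 70.0]    # pre-specified destination totals

predicted = ipf(aux, row_targets, col_targets)
```

In this sketch `odds_ratio(predicted, 0, 0, 1, 1)` equals `odds_ratio(aux, 0, 0, 1, 1)`, which is 30, up to floating-point error, while the margins of `predicted` equal the targets. This is the sense in which the method borrows the interaction structure of the auxiliary data and takes the totals from the pre-estimates.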
On the other hand, it is important to note that the associated inferential test statistics and the information measures that accompany the method of offsets must be interpreted with respect to the initial flow estimates. For example, if the initial flows were taken from Panel B of Table 11, the associated X2 and G2 test statistics would be testing the hypothesis that the predicted data are distributed in a manner that is consistent with the independence model.
Table 11 The inputs to the method of offsets in the absence of observed flows

Panel A. Pre-estimation marginal totals from the Netherlands, 1976

| Origin \ Destination | 1 | 2 | 3 | 4 | 5 | 6 | Total |
|---|---|---|---|---|---|---|---|
| 1 | | | | | | | 67,081 |
| 2 | | | | | | | 124,539 |
| 3 | | | | | | | 109,087 |
| 4 | | | | | | | 70,566 |
| 5 | | | | | | | 107,023 |
| 6 | | | | | | | 179,589 |
| Total | 77,408 | 141,553 | 124,980 | 84,682 | 105,356 | 123,906 | 657,885 |

Panel B. Independence model distribution scheme for initial flow estimates

| Origin \ Destination | 1 | 2 | 3 | 4 | 5 | 6 | Total |
|---|---|---|---|---|---|---|---|
| 1 | 7,893 | 14,433 | 12,744 | 8,635 | 10,743 | 12,634 | 67,081 |
| 2 | 14,654 | 26,796 | 23,659 | 16,030 | 19,944 | 23,456 | 124,539 |
| 3 | 12,835 | 23,472 | 20,724 | 14,042 | 17,470 | 20,545 | 109,087 |
| 4 | 8,303 | 15,183 | 13,406 | 9,083 | 11,301 | 13,290 | 70,566 |
| 5 | 12,593 | 23,027 | 20,331 | 13,776 | 17,139 | 20,157 | 107,023 |
| 6 | 21,131 | 38,641 | 34,117 | 23,116 | 28,760 | 33,824 | 179,589 |
| Total | 77,408 | 141,553 | 124,980 | 84,682 | 105,356 | 123,906 | 657,885 |

It is a simple matter to modify the method of offsets to apply it to the problem of predicting a table of “migrants only.” The SPSS, Stata and R commands require minor modifications that are specified in comments in Appendix 4 (available on the Tools for Demographic Estimation website). A worked example is included in the Method of offsets, migrants only sheet of the accompanying workbook. It uses the observed U.S. flows, 1985-1990, to retrospectively estimate the 1975-1980 migrant flows reported by Rogers, Willekens, Little et al. (2002).
Agresti A. 1990. Categorical Data Analysis. New York: Wiley.
Agresti A. 2002. Categorical Data Analysis. New York: Wiley-Interscience.
Agresti A. 2007. An Introduction to Categorical Data Analysis. Hoboken, NJ: Wiley-Interscience.
Agresti A and B Finlay. 2009. Statistical Methods for the Social Sciences. Upper Saddle River, NJ: Pearson Prentice Hall.
Alonso W. 1986. Systemic and log-linear models: From here to there, then to now, and this to that. Discussion paper 86-10. Cambridge, MA: Harvard University, Center for Population Studies.
Birch MW. 1963. "Maximum likelihood in three-way contingency tables", Journal of the Royal Statistical Society Series B-Statistical Methodology 25(1):220-233.
Deming WE and FF Stephan. 1940. "On a least squares adjustment of a sampled frequency table when the expected marginal totals are known", Annals of Mathematical Statistics 11(4):427-444. doi: http://dx.doi.org/10.1214/aoms/1177731829 [37]
Knoke D and PJ Burke. 1980. Log-linear Models. Beverly Hills, CA: Sage Publications.
Mueser P. 1989. "The spatial structure of migration: An analysis of flows between states in the USA over three decades", Regional Studies 23(3):185-200. doi: http://dx.doi.org/10.1080/00343408912331345412 [38]
Nair PS. 1985. "Estimation of period-specific gross migration flows from limited data: Bi-proportional adjustment approach", Demography 22(1):133-142. doi: http://dx.doi.org/10.2307/2060992 [39]
Powers DA and Y Xie. 2008. Statistical Methods for Categorical Data Analysis. Bingley, UK: Emerald.
Raftery AE. 1986. "Choosing models for cross-classifications", American Sociological Review 51(1):145-146. doi: http://dx.doi.org/10.2307/2095483 [40]
Raftery AE. 1995. "Bayesian model selection in social research", Sociological Methodology 25(1):111-163. doi: http://dx.doi.org/10.2307/271063 [41]
Raymer J. 2007. "The estimation of international migration flows: A general technique focused on the origin-destination association structure", Environment and Planning A 39(4):985-995. doi: http://dx.doi.org/10.1068/a38264 [42]
Raymer J, A Bonaguidi and A Valentini. 2006. "Describing and projecting the age and spatial structures of interregional migration in Italy", Population, Space and Place 12(5):371-388. doi: http://dx.doi.org/10.1002/psp.414 [43]
Raymer J and A Rogers. 2007. "Using age and spatial flow structures in the indirect estimation of migration streams", Demography 44(2):199–223. doi: http://dx.doi.org/10.1353/dem.2007.0016 [44]
Rees P and FJ Willekens. 1986. "Data and accounts," in Rogers, A and FJ Willekens (eds). Migration and Settlement: A Multiregional Comparative Study. Dordrecht: D. Reidel, pp. 19-58.
Rogers A, JS Little and J Raymer. 2010. The Indirect Estimation of Migration: Methods for Dealing with Irregular, Inadequate, and Missing Data. Dordrecht: Springer.
Rogers A, F Willekens, JS Little and J Raymer. 2002. "Describing migration spatial structure", Papers in Regional Science 81(1):29-48.
Rogers A, FJ Willekens and J Raymer. 2003. "Imposing age and spatial structures on inadequate migration-flow datasets", The Professional Geographer 55(1):56-69.
Snickars F and JW Weibull. 1977. "A minimum information principle: Theory and practice", Regional Science and Urban Economics 7(1-2):137-168. doi: http://dx.doi.org/10.1016/0166-0462(77)90021-7 [45]
Willekens F. 1983. "Log-linear modeling of spatial interaction", Papers of the Regional Science Association 52:187-205. doi: http://dx.doi.org/10.1007/BF01944102 [46]
Links:
[1] http://dx.doi.org/10.2307/2546515
[2] http://dx.doi.org/10.1590/S0102-30982010000100002%20
[3] http://dx.doi.org/10.1111/j.1728-4457.2005.00050.x
[4] http://dx.doi.org/10.1002/psp.440%20
[5] http://dx.doi.org/10.1353/dem.2007.0016%20
[6] http://dx.doi.org/10.1068/a120489
[7] http://dx.doi.org/10.1080/01621459.1986.10478237
[8] http://www.un.org/esa/population/techcoop/DemEst/manual4/manual4.html
[9] http://www.un.org/esa/population/techcoop/IntMig/manual6/manual6.html
[10] http://www.un.org/esa/population/publications/migration/WorldMigrationReport2009.pdf
[11] http://unstats.un.org/unsd/publication/SeriesM/Seriesm_67rev2e.pdf
[12] http://dx.doi.org/10.1080/08898489909525459%20
[13] http://dx.doi.org/10.2307/2546519%20
[14] http://demographicestimation.iussp.org/sites/demographicestimation.iussp.org/files/imagecache/wysiwyg_imageupload_lightbox_preset/wysiwyg_imageupload/3/MI_EstMig_01.png
[15] http://webarchive.iiasa.ac.at/Admin/PUB/Documents/RR-81-006.pdf
[16] http://webarchive.iiasa.ac.at/Admin/PUB/Documents/RR-81-030.pdf
[17] http://demographicestimation.iussp.org/sites/demographicestimation.iussp.org/files/imagecache/wysiwyg_imageupload_lightbox_preset/wysiwyg_imageupload/3/MI_MEMM_01a.png
[18] http://demographicestimation.iussp.org/sites/demographicestimation.iussp.org/files/imagecache/wysiwyg_imageupload_lightbox_preset/wysiwyg_imageupload/3/MI_RogersCastro.png
[19] http://www.xlxtrfun.com/XlXtrFun/XlXtrFun.htm
[20] http://demographicestimation.iussp.org/sites/demographicestimation.iussp.org/files/imagecache/wysiwyg_imageupload_lightbox_preset/wysiwyg_imageupload/3/MI_MEMM_03.png
[21] http://demographicestimation.iussp.org/sites/demographicestimation.iussp.org/files/imagecache/wysiwyg_imageupload_lightbox_preset/wysiwyg_imageupload/3/MI_MEMM_04.png
[22] http://demographicestimation.iussp.org/sites/demographicestimation.iussp.org/files/imagecache/wysiwyg_imageupload_lightbox_preset/wysiwyg_imageupload/3/MI_MEMM_05.png
[23] http://demographicestimation.iussp.org/sites/demographicestimation.iussp.org/files/imagecache/wysiwyg_imageupload_lightbox_preset/wysiwyg_imageupload/3/MI_MEMM_06.png
[24] http://demographicestimation.iussp.org/sites/demographicestimation.iussp.org/files/imagecache/wysiwyg_imageupload_lightbox_preset/wysiwyg_imageupload/3/MI_MEMM_07.png
[25] http://www.datamaster2003.com/index.html
[26] http://www.r-project.org/
[27] http://dx.doi.org/10.1068/a140889%20
[28] http://dx.doi.org/10.1068/a190521%20
[29] http://dx.doi.org/10.2307/2060581%20
[30] http://www.mendeley.com/research/r-language-environment-statistical-computing-13/
[31] http://dx.doi.org/10.1068/a090247%20
[32] http://dx.doi.org/10.1080/08898480590902145%20
[33] http://dx.doi.org/10.1080/08898489409525372%20
[34] http://dx.doi.org/10.1080/08898489909525457%20
[35] http://dx.doi.org/10.1177/0164027587094002
[36] http://demographicestimation.iussp.org/sites/demographicestimation.iussp.org/files/MI_LLM_appendices.pdf
[37] http://dx.doi.org/10.1214/aoms/1177731829%20
[38] http://dx.doi.org/10.1080/00343408912331345412
[39] http://dx.doi.org/10.2307/2060992
[40] http://dx.doi.org/10.2307/2095483%20
[41] http://dx.doi.org/10.2307/271063%20
[42] http://dx.doi.org/10.1068/a38264%20
[43] http://dx.doi.org/10.1002/psp.414%20
[44] http://dx.doi.org/10.1353/dem.2007.0016
[45] http://dx.doi.org/10.1016/0166-0462%2877%2990021-7
[46] http://dx.doi.org/10.1007/BF01944102