
3 Data Description

Census Tracts and Racial Segregation Data. To build a geographic scaffolding for our experiments, we use census tract polygons defined by the U.S. Census Bureau[1] in three large metropolitan areas: New York City, Los Angeles and Chicago. Census tracts are small, relatively permanent statistical subdivisions of a county or equivalent entity, typically having a population between 1,200 and 8,000 people. Census information is also available for city blocks; however, with such small geographic entities, the user movement data within each would be too sparse. Census tract shapefiles can be downloaded from the National Historical Geographic Information System (NHGIS) website [2].

Fig. 1. Racial segregation maps for (a) New York City, (b) Los Angeles, and (c) Chicago. Colors: blue = white, green = black, red = Asian, orange = Hispanic, brown = others. Note: The maps show only a portion of each city.

To find the predominant race in each census tract, we use table P5 (Hispanic or Latino Origin by Race) in the 2010 Summary File 1 (SF1), also available on the NHGIS website. Table P5 enumerates the number of people in each census tract belonging to the following racial groups: non-Hispanic White, non-Hispanic Black, non-Hispanic Asian, Hispanic or Latino, and other. The populations of these five categories sum to the total population of the tract. The predominant race in a tract is the one having a majority (50% or more) of the population. Figure 1 shows census tract boundaries color coded by the majority race. Tracts in a lighter shade of color are those where the race with the largest population did not reach a 50% majority.
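The labeling rule above can be illustrated with a minimal sketch. The function name, the group keys, and the population counts are hypothetical and chosen only for illustration; they are not taken from table P5 or from the authors' code.

```python
from typing import Dict, Tuple


def predominant_race(counts: Dict[str, int]) -> Tuple[str, bool]:
    """Return (largest group, has_majority) for a single census tract.

    has_majority is True when the largest group holds 50% or more of the
    tract population, and False when it is only the largest group
    (drawn in a lighter shade in Fig. 1).
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("tract has no population")
    group = max(counts, key=lambda g: counts[g])
    # Integer comparison avoids floating-point issues at exactly 50%.
    has_majority = counts[group] * 2 >= total
    return group, has_majority


# Made-up counts for two tracts (illustrative only).
tract_a = {"white": 2100, "black": 400, "asian": 300, "hispanic": 900, "other": 100}
tract_b = {"white": 1200, "black": 1000, "asian": 500, "hispanic": 900, "other": 200}

label_a = predominant_race(tract_a)  # largest group holds a majority
label_b = predominant_race(tract_b)  # largest group falls short of 50%
```

In this sketch a tract without a majority is still labeled with its largest group, and the boolean flag carries the distinction that Fig. 1 draws with a lighter shade.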

Users and Twitter Data. Once the geographic canvas is ready, we need human entities to model mobility patterns. An ideal dataset would consist of a large set of people, the locations of their homes, and their daily movement traces on a geographic coordinate system. Since such a corpus is difficult to build and acquire, we consider an alternative source: Twitter. A geo-tagged tweet is up to 140 characters of text and is associated with a user id, timestamp, latitude and longitude. The tweeting activity of frequent Twitter users who use its location sharing service would closely represent their daily movement. We collected strictly geo-tagged tweets using Twitter's Streaming API[3] and limited them to the polygons bounding the three cities: New York, Los Angeles and Chicago. This way, we received all tweets with location information and not just a subset. We disregard all other fields obtained from Twitter, including the user's Twitter handle, and in no way use the information we retain to identify personal information about a user. Over a period of one year, beginning October 2012, we accumulated over 75 million geo-tagged tweets.

Next, we try to identify the home locations of users by following a very straightforward method based on the assumption that users generally tweet from home at night. For each unique user, we start by collecting all tweets between 7:00pm and 4:00am and apply a single pass of the DBSCAN clustering algorithm [14]. The largest cluster produced by the cluster analysis is chosen as the one corresponding to the user's home, and its centroid is used as the exact coordinates. Skipping users with very few tweets, and ones for whom a cluster could not be formed, we were able to identify home locations for over 30,000 unique users. A user is assigned to a particular census tract if the coordinates of his/her home lie within its geographic bounds, and is assumed to belong to the race of the majority population in that tract.
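The home-detection and tract-assignment steps can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the text does not give the DBSCAN parameters, so `eps_m=100.0` and `min_pts=5` are placeholders, and the haversine distance metric, the averaged-coordinate centroid, the ray-casting containment test, and all sample coordinates are likewise assumptions made for this example.

```python
import math
from collections import Counter
from datetime import datetime
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees
UNVISITED, NOISE = -2, -1


def is_night(ts: datetime) -> bool:
    """True inside the 7:00pm-4:00am window, which wraps past midnight."""
    return ts.hour >= 19 or ts.hour < 4


def haversine_m(a: Point, b: Point) -> float:
    """Great-circle distance between two points, in metres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(h))


def dbscan(points: Sequence[Point], eps_m: float, min_pts: int) -> List[int]:
    """Plain O(n^2) DBSCAN; returns one label per point (NOISE = -1)."""
    n = len(points)
    neighbours = [[j for j in range(n)
                   if haversine_m(points[i], points[j]) <= eps_m]
                  for i in range(n)]
    labels = [UNVISITED] * n
    cluster = -1
    for i in range(n):
        if labels[i] != UNVISITED:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:       # noise reachable from a core point
                labels[j] = cluster      # becomes a border point
                continue
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:  # expand only from core points
                queue.extend(neighbours[j])
    return labels


def home_location(tweets: Sequence[Tuple[datetime, Point]],
                  eps_m: float = 100.0, min_pts: int = 5) -> Optional[Point]:
    """Centroid of the largest night-time cluster, or None if none forms."""
    night = [p for ts, p in tweets if is_night(ts)]
    labels = dbscan(night, eps_m, min_pts)
    sizes = Counter(label for label in labels if label != NOISE)
    if not sizes:
        return None
    biggest = sizes.most_common(1)[0][0]
    members = [p for p, label in zip(night, labels) if label == biggest]
    return (sum(p[0] for p in members) / len(members),
            sum(p[1] for p in members) / len(members))


def in_polygon(p: Point, ring: Sequence[Point]) -> bool:
    """Ray-casting point-in-polygon test on a ring of (lat, lon) vertices."""
    lat, lon = p
    inside = False
    for (lat1, lon1), (lat2, lon2) in zip(ring, list(ring[1:]) + [ring[0]]):
        if (lat1 > lat) != (lat2 > lat):
            crossing = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < crossing:
                inside = not inside
    return inside


# Made-up tweets for one user: six night tweets near one spot, five daytime
# tweets elsewhere (ignored), and one isolated night tweet (noise).
base = datetime(2012, 10, 1)
tweets = [(base.replace(hour=h), (40.7500 + d, -73.9900 + d))
          for h, d in [(19, 0.0), (22, 0.0001), (23, 0.0002),
                       (0, 0.0001), (2, 0.0), (3, 0.0002)]]
tweets += [(base.replace(hour=12), (40.7000, -74.0100))] * 5
tweets += [(base.replace(hour=21), (40.8000, -73.9500))]

home = home_location(tweets)
# Hypothetical square "tract" around the night-time cluster.
square = [(40.74, -74.00), (40.74, -73.98), (40.76, -73.98), (40.76, -74.00)]
```

A user for whom `home_location` returns `None` corresponds to the skipped cases described above (too few tweets, or no cluster formed). In practice, tract polygons would be read from the NHGIS shapefiles rather than written by hand, and a spatial index would replace the linear scan over tracts.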

Once these preprocessing steps have been carried out, we are left with a rich set of data comprising users, their race, their home location, and their movement activity in a geographic space. This data will act as the seed for all our following experiments.

  • [1]
  • [2]
  • [3]