The IMLS study ["Understanding the Social Wellbeing Impacts of the Nation's Libraries and Museums"][1] (https://www.imls.gov/publications/understanding-social-wellbeing-impacts-nations-libraries-and-museums) presents a Library Presence and Usage Index based on 12 data indicators using 2016 data on US county boundaries. This replication and extension study recreates the index with the library system boundary as the unit of analysis. It expands the index analysis by applying regional variation, and studying the difference in variable relevance by level of rurality/urbanity and library type. A further expansion leverages the expertise of library workers to ask: What library data are useful to planning and improving local service? Responses to these questions are compared to variable relevance analysis (factor loadings) to construct a worker-informed Library Performance Index which might be variably constructed based on level of rurality/urbanity of the community and library structure type.
This wiki will detail the decisions behind each step in creating this replication and designing the extension, as well as the steps followed in construction and analysis. Code level details are provided in the files.
In "Understanding..." one purpose of the Library Presence and Usage Index was to have a single measure for correlation analysis with social wellbeing dimension indexes. In our study we want to further ask, "do measures ever align with what library workers think is important?", "what is useful about comparing libraries serving different communities?", "could data align with both user and worker interests?".
**Replication at the Library System Level**
The researchers who designed "Understanding..." chose county boundaries as the unit of study because counties are the smallest geographic units at which all desired data were available and which also match a political unit. Further, by choosing counties instead of libraries as the index unit of study, they reinforced that the index isn't for judging library performance. Instead, the index helps us understand how library presence and usage fit with other wellbeing indicators.
Of the 9,171 public library systems in the US (FY 2021), only 1,372 report county boundary aligned service areas in their annual report to their respective states (IMLS Public Library Survey). Even though county serving libraries are only 15% of systems, they account for over 30% of library buildings (5,380). In the original study, all other library data had to be aggregated, making for measurement challenges.
In order to effectively measure relationships between indexes, all indexes needed to be constructed on the same boundary (unit of analysis). But what they found is that at the county level, the impacts of individual library facilities are difficult to see, especially in densely populated counties and those with lots of different community types within.
So when designing our study, we chose to construct the Library Index at the library system level. This is the smallest unit of analysis that is captured nationally by all states in the annual report that is compiled into the IMLS's Public Library Survey data publication. Some states require library systems to report all the data at the branch level. Some states have mostly "single outlet systems", or just independent libraries without a main branch in some other town or neighborhood. Some states have mostly "multi-outlet systems" where the whole state is organized with central or main branches and neighborhood or far-flung town branches. We use a national map of library types to help visualize the way public library structures and governance vary across the US, with some types clustering in different regions.
Fig. Distribution of Public Libraries by Legal Basis and Structure in Contiguous US
!["Distribution of Public Libraries by Legal Basis and Structure in Contiguous US"][2]
*Three layers of symbology are displayed in this visualization taken from the 2019 IMLS Outlet File. Squares are single outlet systems. Circles are multi-outlet system central libraries. Grey triangles are multi-outlet system branches. The color of the central libraries of single and multi-outlet systems symbolizes the administrative structure type: light blue: city/county, blue: municipality, pale green: county/parish, green: independent district, yellow: other, orange: non-profit, pink: multi-jurisdictional, red: tribal nation, magenta: school district. Also note that bookmobiles are not displayed because it is assumed that they are not fixed in one community as infrastructure, although that might not actually be the case.*
This replication was able to be completed because of the well-written and clear [Technical Appendix][3] included in "Understanding...". The following steps include both data file preparation as well as those interpreted from the Technical Appendix.
1. **Clean each year of Public Library Survey data to be indexed**: change all negative values to missing, ensure there are no duplicate records, delete records that have missing values for central library count & population served, and confirm that all variables that will be used in index calculation are numeric. "Understanding..." used 2016 as their data year.
2. **Transform and Generate Necessary Variables**: create binary variables for each level of rurality/urbanity from the locale code and create total outlet variable (bookmobiles excluded).
3. **Derive and Merge in Sum of System Square Footage**: each published data year has an administrative entity file inclusive of all library system financial, staffing, service provision and use data, and an outlet file for each library facility within each administrative entity inclusive of hours open and square footage; collapse sum of all library system square footage on FSCS Key and State for each year to be indexed; merge this sum into the administrative entity file 1 to 1 on FSCS Key, State, and reporting year.
4. **Generate Z scores at 2 scales**: a z-score is a number that tells us how far a raw measure is from the average, measured in standard deviations, and it's a common way of standardizing data in the social sciences. To calculate hypothetical *Anylibrary*, US's circulation z-score, I need *Anylibrary*'s annual circ (1,200), the national average annual circ (168,788), and the standard deviation in that national annual circ data (720,285). I can tell right away that *Anylibrary* has well below average circulation so I can expect their z-score to be negative. I calculate it as (1,200-168,788)/720,285 giving a z-score of -0.23. From this description you might be able to tell that it matters a lot what data is included in the "nation". What if *Anylibrary* serves a rural community? Should its score be based on suburban and urban circulation figures? In "Understanding..." z-scores are calculated WITHIN each of the following geographic scales: micropolitan, rural, suburban, urban, but nationwide. Our study replicates that, and also calculates z-scores within each geographic scale WITHIN each of the following census regions: New England, Mid East, Great Lakes, Plains, Southeast, Southwest, Rocky Mountains, and Far West. We conduct this calculation, generating new variables and then collapsing them into single z-score variables for each scale, using the z-score package in Stata for library service population (`popu_lsa`), and the 12 variables included in the "Understanding..." Library Index.
5. **Population Adjusted Z-score**: This study interprets the Technical Appendix statement "A resulting set of population-adjusted measures represents the difference between a county’s overall population z-score and that county’s z-score for each library presence and usage measure" (p. 35) literally by subtracting the population z-score from the indicator z-score to generate a population-adjusted measure. To continue the circulation example, in order to create a population adjusted circulation z-score for a rural library compared to all rural libraries in the nation I type the following code: `gen totcir_stnat= totcir_znat- popu_znat`.
6. **Index of Z-scores**: There are a few ways to corral these different "scored" data into a single index and then use this single index to rank the nation's libraries. Following [Rönkkö et al (2015)][4] we simply add up the 12 population-adjusted z-scores into a single index measure of library presence and usage (or service provision and utilization). When creating the index post professional input, weights derived from factor analysis based on library administrative type and census region will be applied to each index component.
7. **Decile Ranking the Index**: In "Understanding..." the purpose of the ranking is to create a sample for selecting case study libraries for further study. The present study also used the rankings as a part of partner sample selection. Where "Understanding..." researchers invited libraries from the top 2 deciles (following a learn from the best approach), the Libraries in Community Systems project chose maximum variation sampling to select libraries to invite for further study (those from the top two and bottom two deciles). More on the invited libraries (Model Testers) here: [https://osf.io/m7gqb][5]. The code used to group library index results into decile rankings is: `xtile [new variable]= [index variable] if [rural/suburban/town/urban]==1, nq(10)`. The resulting ranking file is exported from our statistical analysis software (Stata) to Microsoft Excel spreadsheets and placed in the dataset folder. Rankings in this project are for the project team to have a comparison snapshot, and are made available here as part of our transparent data process, but aren't intended as a judgement of library performance or worth. The file contains only FSCSkey and State for library identification. Those interested in comparing a specific library to peers would be better served by the IMLS Library Search and Compare tool.
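Steps 4 through 6 can be sketched in a few lines of Python. This is an illustration only, not the project's Stata code: the function names, the `locale` and `visits` fields, and the toy records are all hypothetical, and the sketch assumes the sample standard deviation and complete data. It reproduces the *Anylibrary* arithmetic from step 4, standardizes each measure within its comparison group, subtracts the population z-score, and sums the results.

```python
from collections import defaultdict
from statistics import mean, stdev


def zscore(value, average, sd):
    """Distance of a raw value from the average, in standard deviations."""
    return (value - average) / sd


def zscores_within(rows, var, group):
    """Standardize `var` separately inside each comparison group."""
    values_by_group = defaultdict(list)
    for row in rows:
        values_by_group[row[group]].append(row[var])
    # Sample standard deviation: each group needs 2+ libraries with differing values.
    stats = {g: (mean(v), stdev(v)) for g, v in values_by_group.items()}
    return [zscore(row[var], *stats[row[group]]) for row in rows]


def library_index(rows, indicators, pop_var="popu_lsa", group="locale"):
    """Sum of population-adjusted z-scores (indicator z minus population z)."""
    pop_z = zscores_within(rows, pop_var, group)
    index = [0.0] * len(rows)
    for var in indicators:
        var_z = zscores_within(rows, var, group)
        for i, (z, pz) in enumerate(zip(var_z, pop_z)):
            index[i] += z - pz
    return index


# Worked example from step 4: Anylibrary's nationwide circulation z-score.
anylibrary_z = zscore(1_200, 168_788, 720_285)
print(round(anylibrary_z, 2))  # -0.23

# Hypothetical toy records (not real Public Library Survey data).
toy = [
    {"fscskey": "R1", "locale": "rural", "popu_lsa": 1000, "totcir": 1200, "visits": 900},
    {"fscskey": "R2", "locale": "rural", "popu_lsa": 3000, "totcir": 2400, "visits": 3100},
    {"fscskey": "R3", "locale": "rural", "popu_lsa": 5000, "totcir": 9000, "visits": 5000},
    {"fscskey": "U1", "locale": "urban", "popu_lsa": 90000, "totcir": 400000, "visits": 150000},
    {"fscskey": "U2", "locale": "urban", "popu_lsa": 250000, "totcir": 900000, "visits": 700000},
    {"fscskey": "U3", "locale": "urban", "popu_lsa": 600000, "totcir": 1500000, "visits": 900000},
]
toy_index = library_index(toy, ["totcir", "visits"])
for row, score in zip(toy, toy_index):
    print(row["fscskey"], row["locale"], round(score, 2))
```

Because every z-score is centered within its own group, the index values within each group sum to roughly zero. A positive index means a library's measures sit higher relative to its peers than its service population does.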
**Interpreting the available FY Library Index File**
The ranking should be within the same comparison groupings that the z-scores were generated from. The `xtile` code in step 7 ranks within geographic scales nationwide; an additional condition would be given to limit the grouping to a specific region. The results (1-10) can be interpreted as the first through tenth deciles of index values. A 1 in the rank means that the sum of all the library's z-scores for the 12 population adjusted service measures fell within the lowest 10% of the sums. It DOES NOT mean that the library is worse off than 90% of libraries in all things, necessarily. If a library has really high population adjusted attendance at its programs but everything else is well below average, it could still have a very low index score. And if the library's strategic plan stipulates that attendance at programs is the most important service, then they are still doing things right. Similarly, a 10 in the index rank means that the library's summed z-scores fall in the top 10% of the comparison group.
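To make the interpretation concrete, here is a small Python sketch of decile ranking. It is a rank-based approximation, not Stata's `xtile` (which works from percentile cutpoints and treats ties differently), and every value in it is made up for illustration. The example library has one very high population-adjusted measure and eleven below-average ones, and still lands in decile 1.

```python
def decile_ranks(scores):
    """Assign each score a decile from 1 (lowest tenth) to 10 (highest tenth)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0] * n
    for position, i in enumerate(order, start=1):
        # Integer ceiling of 10 * position / n.
        ranks[i] = (10 * position + n - 1) // n
    return ranks


# Hypothetical library: strong program attendance, everything else below average.
adjusted = {"program_attendance": 2.5}
adjusted.update({f"measure_{k}": -0.6 for k in range(1, 12)})
our_index = sum(adjusted.values())  # roughly -4.1 across the 12 measures

# Hypothetical peer index values within the same comparison group.
peer_indexes = [-3.5 + 0.5 * k for k in range(19)]

ranks = decile_ranks([our_index] + peer_indexes)
print(ranks[0])  # 1
```

The decile says nothing about which individual measures are strong, which is why the rank should be read alongside the underlying measures and the library's own priorities.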
**Deriving & Interpreting Factor Loadings**
This study takes for granted that the 12 variables presented in "Understanding..." are the right measures for a nationwide index of library presence and usage. And for good reason! These twelve are strongly correlated both with each other and with the resulting index. Using 2016 data at the county level, those researchers find that each measure accounts for at least 70% of the variation in the overall measure of library presence and usage.
> **Quick side note about variation**: if you aren't a regular consumer of social sciences research, the importance of "variation" might not
> make sense. But much of this research is trying to mathematically
> establish the reason something is happening, or the factors when taken
> together which could predict something. The method used here is called
> factor analysis. The "factor" is library presence and usage: something
> that can't be measured directly because it's really a combination of
> things. The measures that we think combine to create the factor are
> often called "inputs". If in every place in the country libraries
> looked exactly the same, so that there was no variation, well then we
> wouldn't really need to do any fancy analysis - we could just measure
> them directly! It's the differences between them that we try to
> attribute to specific inputs. "Factor loadings" are a way of figuring
> out what share of the differences in the factor, the overall
> combination of inputs, is driven by any specific input. Using
> variation in different empirical (evidence-based) models (mathematical
> approaches to the empirical question) will come up over and over in
> the technical descriptions of how we do what we do in this project.
Ok, back to our approach to analyzing our resulting indexes for reliability using factor analysis. The "Understanding..." folks did everything right at the geographic boundary (county) and scale (nationwide) that they were working with. In this study we asked: "how does the value and importance of each of these inputs change when the unit of analysis, comparison groupings, and factor groupings change?" The answer is that the changes vary with each of these types of change! Here are the steps we took to investigate this question:
1. **Explore inter-input correlation**: formally, we conduct a Pearson's correlation analysis. What that really means is that I ask the statistical software to tell me the level of correlation between each pairing of the 12 variables in the index. For example, what is the correlation between circulation and program attendance? And then circulation and total print materials in the collection? And so on. In order to create a single value that represents a collection of inputs, we want high correlation between inputs. Levels of correlation above 60% are considered alright enough, but the rigorous standard is 70% or above for every input's relationship. In our software, we use the command "pwcorr". These tables are in the summary statistics file in the Analysis Methods and Results component.
2. **Mess about with matrices**: if you want to replicate our findings, you 100% don't have to do this. Because the index could be conceptualized as distinct factors (library presence and library usage), we experimented with factor analysis with 2 factors and various matrix rotations of the resulting factor loadings to see if anything changes. Two take-aways are (1) that presence and usage are so interrelated that a 1 factor analysis is more appropriate, and (2) that the loadings for a single factor form a vector, so matrix rotation isn't meaningful.
3. **Factor Analysis & Loadings**: the code used here is: `factor [variable list] [condition statements], ipf factors(1)`. The variable list is the set of 12 inputs, and the condition statements are the groupings we examine during this phase of analysis, including geographic scale, region, and library system administrative structure. What we didn't do in this round is divide these by single outlet and multiple outlet systems. That division surfaced during our analysis as something to incorporate when a new index is created that reflects library worker priorities and insights. The results of this step highlight how the index performs differently for different comparison groupings and administrative structures or library types. Exploiting this variation will help us make an index that would be useful for library decision-makers interested in cross-library comparison measures.
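The pairwise correlation check in step 1 can be illustrated with a short Python sketch. It is not a substitute for Stata's `pwcorr`: it assumes complete data with no missing values, and the three inputs and their values are hypothetical (`totcir` borrows the circulation variable name used earlier, while `visits` and `databases` are placeholders). It computes Pearson's r for every pairing and lists the pairs that fall below the 70% standard.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical values for three inputs across five libraries.
inputs = {
    "totcir": [10, 20, 30, 40, 50],
    "visits": [12, 18, 33, 41, 48],
    "databases": [5, 1, 4, 2, 3],
}

# One correlation per pairing of inputs, like one cell of a pwcorr table.
correlations = {
    (a, b): pearson(inputs[a], inputs[b]) for a, b in combinations(inputs, 2)
}

# Pairs that miss the rigorous 70% standard.
weak_pairs = [pair for pair, r in correlations.items() if r < 0.70]

for pair, r in correlations.items():
    print(pair, round(r, 2))
print("Below 0.70:", weak_pairs)
```

In this toy data, circulation and visits move together, while the database count is weakly related to both. That is the kind of input that would be flagged for a closer look before it goes into a single-factor index.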
**Further Extension: Surveying Library Workers & Incorporating their Insights**
Libraries in Community Systems partners directly with 17 libraries across Alaska, Georgia, Kansas, New Mexico, New York, and Texas, representing a variation sample of community make-up, library administrative structures, and policy contexts. We refer to these partners as Model Testers because they work with us to iterate models of library service and library role within broader place-based networks. Although the library index replication presented here is given as straightforward, it represents the culmination of methods discussions and interviews with our partners. And, as of June 2023, there is another massive re-creation to conduct. Here are the steps that we are working through, with in-process or future work noted where applicable.
1. **Create a Draft Index for Model Testers**: Following a similar process to that outlined above, but covering the years 2010-2020, we replicated the index for our Model Testers, presenting the Library Presence and Usage index as a Library Performance Index in a report. The report also gave trend lines for each of the 12 index inputs from per capita transformed raw measures. To see report examples for libraries that gave us permission to share their information, see the Draft Reports folder ([https://osf.io/rfq6m/][6]). Note! These are drafts provided to begin the research conversation on measurement, not local-level decision-making documents.
2. **Distribute the Library Index Survey (https://osf.io/k8zuy)**: The survey we asked Model Testers to complete is ridiculously cumbersome for branch libraries outside of Georgia, and just plain long for all the others. We ask branch libraries in Texas and New York to give branch level service and usage data so that we can do outlet level comparisons with neighborhood level social wellbeing indexes (Georgia collected branch level data as part of its annual reporting process). All Model Testers are asked questions about internal library operations like who has agency within the organization. Finally, we ask verification, correction, and perception questions about the Library Index Report each library received. Responses were collected January 2023 - April 2023. Model Testing library workers generally ranked the usefulness of digital collections and services statistics lower than that of in-facility usage statistics, specifically engaged usage like program attendance and visits (rather than circulation or number of databases, for instance).
3. **Create and Distribute "What Statistics Matter" Survey**: Model Testers also listed out measures not included in the Index that they think would be useful to their ability to evaluate local service. These were added to the list of Index inputs and published as a survey distributed nationally through the Libraries in Community Systems mailing list. The survey asks: What state are you in? How do you define your service community? (answer options micropolitan/small town, rural, suburban, tribal/indigenous/native/first nation/Indian, urban), Which of these measures matter to your local service? (matrix with list of possible inputs and a scale ranking: not important, pretty, very, absolutely essential), and finally there is an open text space for the respondent to tell us whatever they want about this topic. A pdf version is here: [https://osf.io/k5fu7][7].
4. **Collect Survey Responses**: The national survey collection was initiated on June 20 and closed June 30. We use Google Forms. During the initial collection period there were states and library types from which we had gathered no responses. We sent targeted invitations to those states and libraries nationally.
**Survey Results**: After final collection, we gathered 308 responses nationally. To analyze these data, all text options were transformed into numeric values. This was done in Stata using, for example, `replace [numeric variable]=1 if [text variable]=="Urban"` (the recoded variable must be numeric, since Stata will not store the number 1 in a text variable), but could have been completed in Excel. This allowed for analyzing differences in means between respondent urban/rural subgroups. Responses, including anonymized verbatim answers, can be viewed here: [https://osf.io/3eyfd][8].
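The recoding and subgroup comparison can be sketched in Python as follows. This is a minimal illustration, not the project's procedure: only the `Urban` to 1 coding comes from the Stata example above, while the other numeric codes, the 1-4 importance scale, the field names, and the four responses are assumptions made up for the example (they are not actual survey data).

```python
from collections import defaultdict
from statistics import mean

# Assumed codings: only "Urban" -> 1 appears in the source; the rest are hypothetical.
COMMUNITY_CODE = {"Urban": 1, "Suburban": 2, "Rural": 3}
IMPORTANCE_CODE = {
    "not important": 1,
    "pretty": 2,
    "very": 3,
    "absolutely essential": 4,
}


def mean_importance_by_community(responses, measure):
    """Average numeric importance of one measure within each community code."""
    grouped = defaultdict(list)
    for response in responses:
        community = COMMUNITY_CODE[response["community"]]
        grouped[community].append(IMPORTANCE_CODE[response[measure]])
    return {community: mean(values) for community, values in grouped.items()}


# Hypothetical responses, invented purely to show the mechanics.
responses = [
    {"community": "Rural", "program_attendance": "very"},
    {"community": "Rural", "program_attendance": "absolutely essential"},
    {"community": "Urban", "program_attendance": "pretty"},
    {"community": "Urban", "program_attendance": "not important"},
]

means = mean_importance_by_community(responses, "program_attendance")
print(means)  # {3: 3.5, 1: 1.5}
```

Once every answer is numeric, the difference in means between subgroups (here 3.5 versus 1.5) is a simple subtraction, and the same recoded columns can feed a formal test of that difference.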
**New Index(es)**: Some of the inputs used in our factor analysis at each of our scale versions aren't significant enough to include. We'll begin the new construction with pairwise correlation exploration of a large number of potential inputs. We will favor those inputs that librarians found to be most indicative of good local service. We are open to the possibility that indexes should be distinct based on either administrative type, multiple-single outlet division, or some other characteristic. This new index or indexes will be distributed to Model Testers first, edited for clarity, and then distributed nationally for a final round of comments before publishing as a final work.
Rönkkö, Mikko, Cameron N. McIntosh, and John Antonakis. 2015. On the adoption of partial least squares in psychological research: Caveat emptor. *Personality and Individual Differences* 87: 76-84.

[1]: https://www.imls.gov/sites/default/files/2021-10/swi-report.pdf
[2]: https://mfr.osf.io/export?url=https://osf.io/download/d9ntp/?direct=&mode=render&format=2400x2400.jpeg "Distribution of Public Libraries by Legal Basis and Structure in Contiguous US"
[3]: https://www.imls.gov/sites/default/files/2021-10/swi-appendix-i.pdf
[4]: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=reOq-kIAAAAJ&cstart=20&pagesize=80&sortby=pubdate&citation_for_view=reOq-kIAAAAJ:mB3voiENLucC
[5]: https://osf.io/m7gqb
[6]: https://osf.io/rfq6m/
[7]: https://osf.io/k5fu7
[8]: https://osf.io/3eyfd