Q32. A client diagnosed with alcohol use disorder asks, "How will Alcoholics Anonymous (AA) help me?" What is the nurse's best response?
Which of the following is an abnormality that would lead to an increase in the Reynolds number?
Question 1 ML acceleration [22 points] [25 minutes]

(1.1) [6 points] Why is the TPU's power lower than the GPU's? How can the TPU perform as well as or better than the GPU without the GPU's overheads?

(1.2) [4 points] Why is a larger systolic array better for MM in ML? Why does a systolic array not become inefficient (i.e., remain efficient) as it grows?

(1.3) [4 points] What is the key advantage of unstructured sparsity over structured sparsity in ML models? What end benefits does this advantage lead to?

(1.4) [8 points] Assuming unstructured, one-sided (weights-only) sparse ML models have 25% density (i.e., 75% of the weights are zeros), how would you modify the TPU shown below to handle specifically (a) MAC underutilization and (b) load imbalance across cells? Assume the input and output activations are dense and that a weight may be displaced down by at most one cell to achieve load balance (similar to Eureka). Recall that the TPU computes inner products with weights held stationary in the MAC systolic array, and the dense input (output) activations streamed in (out) from the left (bottom).

(a) (4 points) MAC underutilization:
(b) (4 points) Load imbalance across cells: Modify the above figure.

Question 2 PIM [24 points] [25 minutes]

(2.1) (4 points) What key workload characteristic would make a memory-bound workload unfit for PIM? Be specific.

(2.2) (4 points) In an MV multiplication on a Newton-like PIM, what operation occurs in parallel with a column read of the matrix? What is the cost of avoiding that operation?

(2.3) (6 points) What does Newton's interleaved layout achieve? What does the layout lose? Why is the loss acceptable?

(2.4) (10 points) Assume (1) o is the DRAM bank activation time (the time to read a bank's row into the bank's row buffer), (2) all n banks can be activated in parallel (without any t_FAW restrictions), and (3) the compute/read time per column is t_COL and there are c columns per DRAM row.
Ignore all other overheads in an MV computation.

(a) (5 points) What is the time to compute one DRAM row across all banks in a Newton-like PIM?
(b) (3 points) What is the time to compute one DRAM row across all banks in a non-PIM system (e.g., GPU + standard DRAM), assuming the only exposed time is the time to read the row out one column at a time in a standard DRAM?
(c) (2 points) If we assume that o = (c/4) * t_COL, what is the speedup of PIM over non-PIM?

Question 3 Network Acceleration [6 points] [5 minutes]

(3.1) (3 points) What is the key performance requirement in network routers?
(3.2) (3 points) What key flexibility does a programmable router bring to networks?

Question 4 Polynomial accelerator [28 points] [30 minutes]

Purdue CompE ML faculty have had a breakthrough! They have invented polynomial-based models which are far more accurate than the standard matrix-based models. In these new models, the key compute primitive is modulo polynomial multiplication. Each model involves computing trillions of this primitive for polynomial filters and features. The polynomials are of degree < 32 (the degree is the exponent of the polynomial's highest power term with a non-zero coefficient). Modulo multiplication here is polynomial multiplication followed by reduction using the fixed, simple, constant polynomial x^32 so that the result is a polynomial of degree < 32. (Yes, I can make up workloads so that the problem is neither too easy nor too hard.) For example, (x^3 + 4x + 2) * (2x^2 + 3) modulo x^4 is 11x^3 + 4x^2 + 14x + 6. We wish to build hardware for this primitive. This question has nothing to do with FHE or FHE's NTT. Assume the coefficients are FP16, arranged in decreasing power-term order for each polynomial, and the powers are non-negative integers < 32.
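A minimal sketch of the modulo polynomial multiplication primitive above, for concreteness. Note that the worked example's 14x term implies that high-degree terms wrap around rather than being truncated (the 2x^5 term of the raw product folds back into the x term: 12x + 2x = 14x), so the reduction below wraps indices; the function name and the increasing-power coefficient order are my choices, not the exam's.

```python
def polymul_mod(a, b, n):
    """Multiply polynomials a, b (coefficient lists, index = power),
    then reduce so the result has degree < n."""
    prod = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            prod[i + j] += ai * bj      # schoolbook O(d^2) multiply
    res = [0] * n
    for k, c in enumerate(prod):
        res[k % n] += c                 # wrap terms of degree >= n
    return res

# Worked example from the question: (x^3 + 4x + 2) * (2x^2 + 3), reduced to degree < 4
# a = 2 + 4x + 0x^2 + 1x^3 ; b = 3 + 0x + 2x^2
print(polymul_mod([2, 4, 0, 1], [3, 0, 2], 4))  # -> [6, 14, 4, 11], i.e., 11x^3 + 4x^2 + 14x + 6
```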
(4.1) (4 points) What is the space and time complexity of polynomial multiplication (without any modulo)? Define the parameter in your complexity measure. Is polynomial multiplication compute- or memory-bound?

(4.2) (4 points) How is one polynomial multiplication without any modulo similar to one MM? How is it different?
Similar:
Different:

(4.3) (4 points) How would you address Q4.2's "different" aspect in hardware? What key property of polynomial multiplication simplifies handling this aspect?

(4.4) (4 points) How does including the modulo x^32 change your solution to Q4.3?

(4.5) (6 points) What is your accelerator organization for one polynomial multiplication with modulo? What is your basic strategy for implementing the multiplication? While sequential designs are unacceptable, an unoptimized parallel design is acceptable. Optimizations are left for Q4.6.
(4 points) Draw your accelerator organization. Label the blocks and inputs with well-known terms, without digital logic/circuit-level details.
(2 points) Describe your strategy in terms of what happens to the coefficients and powers of each polynomial in your accelerator. You may describe your strategy using various components of your accelerator.

(4.6) (2 points) What are two key difficulties faced by your design? These difficulties are common to other accelerators as well.

(4.7) (4 points) How would you solve each difficulty? How would your building block change?

Question 5 And finally, a non-MM model [20 points] [30 minutes]

Gradient-boosted trees (GBT) is a well-known, high-accuracy, non-MM model for classifying table-based data with numerical and non-numeric, categorical fields (e.g., gender, race, education, state of residence). Such table-based data is prevalent in the real world (e.g., relational databases and spreadsheets). Each record in a table has multiple numerical and categorical fields (e.g., 100 fields).
GBT uses an ensemble of many weak models (e.g., 500 shallow, 5-deep binary decision trees) to produce a strong model. GBT training involves binning the training data records into small histograms based on each field (e.g., a small 256-entry histogram for each of 100 fields). Binning simply increments the matching bin's counter. While the numerical fields map to 256 bins, categorical fields may map to fewer bins (e.g., yes/no fields use two bins). For simplicity, we skip the other steps in building the decision trees from the histograms. The histogram counts use 64 bits (8 bytes), so a histogram is 256 * 8 B = 2 KB.

In inference, each input record traverses many (e.g., 500) shallow decision trees whose decisions are combined for the final inference (the combining details are ignored for simplicity). A 5-deep binary decision tree has at most 1 + 2 + 4 + 8 + 16 + 32 = 63 nodes, where each node computes a simple decision (e.g., field4 < 25, field6 == true) which may be encoded in at most 8 bytes (so a tree is at most 64 * 8 B = 512 bytes). The trees need not be full binary trees, but this detail can be ignored.

We wish to build an accelerator for GBT that accelerates both training and inference with mostly the same hardware.

(5.1) (4 points) Where is the parallelism in GBT training and in inference?
Training:
Inference:

(5.2) (4 points) Where are the dependencies in GBT training and in inference?
Training:
Inference:

(5.3) (6 points) Why would GPUs not work for GBT training or inference?
Training:
Inference:

(5.4) (6 points) What key observation lets you use most of the same hardware for training and inference? Describe your accelerator organization (analogous to "a 128x128 MAC array where each MAC computes a weight-activation product in training and inference").

Congratulations, you are almost done with the Final Exam. DO NOT end the Honorlock session until you have submitted your work.
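Question 5's two per-record primitives, histogram binning in training and shallow-tree traversal in inference, can be sketched as follows. This is an illustrative sketch only: the function names, the tuple node encoding, and the assumption that records arrive as precomputed bin indices are my choices, not part of the exam.

```python
def bin_records(records, n_bins=256):
    """Training-side binning: one small histogram per field;
    binning simply increments the matching bin's counter."""
    n_fields = len(records[0])
    hists = [[0] * n_bins for _ in range(n_fields)]
    for rec in records:
        for f, bin_idx in enumerate(rec):  # rec holds one bin index per field
            hists[f][bin_idx] += 1
    return hists

def traverse_tree(nodes, record):
    """Inference-side traversal of one shallow binary decision tree.
    nodes[i] is (field, threshold, left_child, right_child) or ('leaf', value)."""
    i = 0
    while nodes[i][0] != 'leaf':
        field, threshold, left, right = nodes[i]
        i = left if record[field] < threshold else right
    return nodes[i][1]

# Example: a 1-deep tree encoding "field0 < 25"
tree = [(0, 25, 1, 2), ('leaf', -1.0), ('leaf', 1.0)]
print(traverse_tree(tree, [30]))  # -> 1.0 (field0 >= 25, take the right child)
```

Both kernels are branchy, fine-grained, and integer/comparison dominated, which is the contrast with MAC-array workloads that Q5.3 asks about.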
When you have answered all questions:
Use your smartphone to scan your answer sheet and save the scan as a PDF. Make sure your scan is clear and legible.
Submit your PDF as follows: email your PDF to yourself or save it to the cloud (Google Drive, etc.), then click this link to submit your work: Final Exam.
Return to this window and click the button below to agree to the honor statement.
Click Submit Quiz to end the exam.
End the Honorlock session.
A _____________ is an audible sign of turbulence in a vessel. A _____________ is an audible sign of turbulence in the heart.